FASTA parser

I find that one of the common tasks in bioinformatics is reading sequence data, often from a FASTA file, and getting it into a usable format.

Below is a function that tries to open a given filename and read it as a FASTA file. It assumes that each sequence name is marked by a '>' at the start of a line and that the sequence starts on the following line. In my function any underscores ('_') in the names are replaced by spaces, which isn't necessary, but is useful for the data I tend to use. A trailing '*' on a sequence line is also stripped.

The function returns a list of the names and a dictionary of the sequences keyed by name, so you can read the sequences back in their original order if you so wish.

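def FASTA(filename):
  # Try to open the file; report the problem and return None on failure
  try:
    f = open(filename)
  except IOError:
    print("The file, %s, could not be opened" % filename)
    return

  order = []       # sequence names, in the order they appear in the file
  sequences = {}   # name -> sequence string
  name = None      # no header line seen yet

  with f:          # make sure the file is closed when we are done
    for line in f:
      line = line.rstrip('\r\n')
      if line.startswith('>'):
        # Header line: the name is everything after the '>'
        name = line[1:].replace('_', ' ')
        order.append(name)
        sequences[name] = ''
      elif name is not None:
        # Sequence line: append it, dropping any trailing '*'
        sequences[name] += line.rstrip('*')

  print("%d sequences found" % len(order))
  return order, sequences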
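
To make the return format concrete, here is a small self-contained sketch in Python 3 syntax. It is not the function above: parse_fasta_lines is a hypothetical helper that applies the same rules to any iterable of lines (so no file is needed), and the two records are made-up example data.

```python
def parse_fasta_lines(lines):
    """Hypothetical helper: same rules as FASTA(), but reads from any iterable of lines."""
    order = []       # names in the order they appear
    sequences = {}   # name -> sequence
    name = None      # no header line seen yet
    for line in lines:
        line = line.rstrip('\r\n')
        if line.startswith('>'):
            # Header line: strip the '>' and swap underscores for spaces
            name = line[1:].replace('_', ' ')
            order.append(name)
            sequences[name] = ''
        elif name is not None:
            # Sequence line: join it onto the current record, minus any trailing '*'
            sequences[name] += line.rstrip('*')
    return order, sequences


# Made-up example records, for illustration only
example = [
    ">seq_one\n",
    "ACGT\n",
    "TTGA\n",
    ">seq_two\n",
    "MKV*\n",
]

order, sequences = parse_fasta_lines(example)

# Walk the sequences in their original order
for name in order:
    print("%s: %s" % (name, sequences[name]))
```

Here order is the list ['seq one', 'seq two'] and sequences maps 'seq one' to 'ACGTTTGA' and 'seq two' to 'MKV'. Looping over order rather than over the dictionary is what keeps the records in file order.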

Comments

Hi Sir,

Can you explain this code in detail?

Could you please add comments to the lines of code (alternative commands, why those commands are needed, and the examples you think work best)?
