[bip] Loading sequences from FASTA files

Thu Nov 19 07:52:18 PST 2009

Hi all,

Alex Nolley in my lab has been working on an approach to speedily index FASTA
and FASTQ files and retrieve arbitrary sequences from them by ID, and we're
looking for information on what other people use.  So far we've only
compared it to pygr's SequenceFileDB class.

The goal is to be able to quickly retrieve sequences from a file by ID, e.g.

   sequence_db = SequenceFileDB('large_file.fasta')
   seq = sequence_db['some_sequence']

without iterating over the whole file on each lookup, and without indexing
the file more than once.
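
To make that goal concrete, here is a minimal sketch of one way to do it
using only the Python standard library. It is not how screed or pygr
actually work; the class name FastaIndex, the '.idx' file suffix, and the
'built' attribute are all made up for illustration. It assumes plain ASCII
FASTA (no FASTQ handling), takes the ID to be the first whitespace-delimited
token of the header line, and assumes the file does not change after it has
been indexed.

```python
import os
import sqlite3


class FastaIndex:
    """Hypothetical write-once-read-many index of FASTA record offsets."""

    def __init__(self, fasta_path, index_path=None):
        self.index_path = index_path or fasta_path + ".idx"
        # Only scan the FASTA file if no index file exists yet.
        self.built = not os.path.exists(self.index_path)
        self.conn = sqlite3.connect(self.index_path)
        self.handle = open(fasta_path, "rb")
        if self.built:
            self._build()

    def _build(self):
        """Scan the file once, recording the byte offset of each header."""
        self.conn.execute(
            "CREATE TABLE offsets (id TEXT PRIMARY KEY, pos INTEGER)"
        )
        self.handle.seek(0)
        while True:
            pos = self.handle.tell()
            line = self.handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    # If an ID repeats, the last occurrence wins.
                    self.conn.execute(
                        "INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                        (fields[0].decode("ascii"), pos),
                    )
        self.conn.commit()

    def __getitem__(self, seq_id):
        row = self.conn.execute(
            "SELECT pos FROM offsets WHERE id = ?", (seq_id,)
        ).fetchone()
        if row is None:
            raise KeyError(seq_id)
        # Jump straight to the record instead of scanning from the top.
        self.handle.seek(row[0])
        self.handle.readline()  # skip the header line
        chunks = []
        for line in self.handle:
            if line.startswith(b">"):
                break
            chunks.append(line.strip())
        return b"".join(chunks).decode("ascii")

    def close(self):
        self.handle.close()
        self.conn.close()
```

The first FastaIndex('large_file.fasta') call pays for one pass over the
file; later calls find the existing index file and go straight to lookups,
each of which is one SQLite query plus a seek.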

The BioPython wiki has a statement,

   For larger files, it isn't possible to hold everything in memory, so
   Bio.SeqIO.to_dict() is not suitable. Biopython 1.52 will include an indexing
   function for this situation, but you might also consider BioSQL. 

which seems to imply that 1.52 includes something for this, but that it's
not yet documented.  Is that right?

Are there any other Python (or C-based) APIs that people are aware of?

Ours is called 'screed'; it uses a write-once-read-many indexing strategy
that appears to be reasonably fast. It's available here,

   http://github.com/acr/screed

although it has not yet been released, so downloader beware ;).  I'll
announce it here once it has been released.

cheers,
--titus
-- 
C. Titus Brown, ctb at msu.edu