[bip] Loading sequences from FASTA files

Thu Nov 19 08:06:55 PST 2009

On Thu, Nov 19, 2009 at 3:52 PM, C. Titus Brown <ctb at msu.edu> wrote:
>
> Hi all,
>
> Alex Nolley in my lab has been working on an approach to speedily index FASTA
> and FASTQ files and retrieve arbitrary sequences from them by ID, and we're
> looking for information on what other people use.  So far we've only
> compared it to pygr's SequenceFileDB class.
>
> The goal is to be able to quickly retrieve sequences from a file by ID, e.g.
>
>   sequence_db = SequenceFileDB('large_file.fasta')
>   seq = sequence_db['some_sequence']
>
> without iterating over the file or doing indexing of it more than once.
>
> The BioPython wiki has a statement,
>
>   For larger files, it isn't possible to hold everything in memory, so
>   Bio.SeqIO.to_dict() is not suitable. Biopython 1.52 will include an indexing
>   function for this situation, but you might also consider BioSQL.
>
> which seems to imply that there's something in 1.52 but it's not yet
> documented?

It is in Biopython 1.52, documented in the docstrings and tutorial:
http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

We even had a blog post on this functionality:
http://news.open-bio.org/news/2009/09/biopython-seqio-index/

The table of supported file formats at the top of the SeqIO wiki page
includes a column showing which formats can be indexed (not just FASTA
and FASTQ): http://biopython.org/wiki/SeqIO

However, that line on the wiki you quoted was out of date (Biopython
1.52 has been out for almost two months). Thanks for pointing that
out; it is fixed now.

You'll see the usage is quite similar to the pygr example you gave.
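
For illustration, here is a minimal sketch of the equivalent lookup with
Bio.SeqIO.index. The file contents and record IDs are made up so the
example is self-contained, and the close() call assumes a more recent
Biopython release than 1.52:

```python
import os
import tempfile

from Bio import SeqIO

# Write a tiny FASTA file so the example runs on its own
# (the records and IDs here are invented for illustration).
fasta_text = (
    ">seq1 first example record\n"
    "ACGTACGTAC\n"
    ">some_sequence second example record\n"
    "GGGCCCTTTAAA\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as handle:
    handle.write(fasta_text)
    fasta_path = handle.name

# Scan the file once to build a dictionary-like index keyed by record ID.
sequence_db = SeqIO.index(fasta_path, "fasta")

# Each lookup parses only the requested record and returns a SeqRecord.
record = sequence_db["some_sequence"]
seq_text = str(record.seq)
num_records = len(sequence_db)

# Release the file handle (assumes a Biopython version that provides close()).
sequence_db.close()
os.remove(fasta_path)
```

Note that the index built this way lives in memory for the lifetime of
the object, so the file is scanned again each time the script starts.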

Regards,

Peter