[bip] Loading sequences from FASTA files

Mon Nov 23 16:49:09 PST 2009

2009/11/19 C. Titus Brown <ctb at msu.edu>:
> Hi all,
>
> Alex Nolley in my lab has been working on an approach to speedily index FASTA
> and FASTQ files and retrieve arbitrary sequences from them by ID, and we're
> looking for information on what other people use.  So far we've only
> compared it to pygr's SequenceFileDB class.
>
> The goal is to be able to quickly retrieve sequences from a file by ID, e.g.
>
>   sequence_db = SequenceFileDB('large_file.fasta')
>   seq = sequence_db['some_sequence']
>
> without iterating over the file or doing indexing of it more than once.
>
...
> Are there any other Python (or C-based) APIs that people are aware of?

I have been convinced by the 'sqlite3 as the mp3 of data' idea for a
while now: e.g.
http://blog.gobansaor.com/2009/03/14/sqlite-as-the-mp3-of-data/

This 'find a FASTA record quickly' problem is a classic example of the
tradeoffs you make in bioinformatics as it is done: a proper database
would make your life a lot easier, but in reality you have a lot of
huge files that you are unwilling to manage in a proper RDMS just to
answer a few simple queries.  Also, no one can agree on their
favourite database so you have to support a database independent text
format, in case someone else doesn't have yours installed. So instead,
the tools around these files gradually creep towards database
operations (in this case an indexed indentifier).

This is where sqlite comes in - you get all the benefits of a database
but in a *portable* flat file.  You can send it, copy it or whatever
and it still looks the same.  Even better, you can attach two sqlite
databases at once and do joins (think a fasta sqlite file and a gff3
sqlite file to extract sequences for certain features instantly).  Of
course, since it is a database you get the benefit of fast aggregate
operations.

A really simple demo is an sqlite file to solve this FASTA indexing
problem.  Even better, sqlite will allow *subsequence* access of the
fasta file for free. I loaded 4.3GB of DNA from here:
ftp://ftp.ensembl.org/pub/release-56/fasta/homo_sapiens/dna/
into a sqlite database that has a table sequence with two text
columns: accession (which is primary key) and sequence.  You may be
interested to see that these load fine - the default maximum length of
a string is a billion characters: http://www.sqlite.org/limits.html

Running /usr/bin/time  sqlite3 human.db "select accession,
substr(sequence,100000,100) from sequence;" to select 100 bases in 34
sequences takes about 15 seconds on my machine.  Loading a titanium
454 plate with 973448 reads and accessing 1000 random accessions takes
around 0.09 seconds.

Since python (and I believe perl) have great links to sqlite, most
people out there could start using this straight away.

Another thing I'd love to see is virtual table implementations of
bioinformatic indexes (virtual table provide an extension point for
custom indexes).  So you could use virtual tables to hook in the
NCList indexes from pygr.  See http://www.ddj.com/database/202802959
for more info on this.

James
-
http://www.machine-envy.com/blog/