[bip] Loading sequences from FASTA files

James Casbon casbon at gmail.com
Tue Nov 24 03:32:25 PST 2009


2009/11/24 C. Titus Brown <ctb at msu.edu>:
> On Tue, Nov 24, 2009 at 12:49:09AM +0000, James Casbon wrote:
>> 2009/11/19 C. Titus Brown <ctb at msu.edu>:
>> > Alex Nolley in my lab has been working on an approach to speedily index FASTA
>> > and FASTQ files and retrieve arbitrary sequences from them by ID, and we're
>> > looking for information on what other people use.  So far we've only
>> > compared it to pygr's SequenceFileDB class.
>> >
>> > The goal is to be able to quickly retrieve sequences from a file by ID, e.g.
>> >
>> >    sequence_db = SequenceFileDB('large_file.fasta')
>> >    seq = sequence_db['some_sequence']
>> >
>> > without iterating over the file or doing indexing of it more than once.
>> >
>> ...
>> > Are there any other Python (or C-based) APIs that people are aware of?
>>
>> I have been convinced by the 'sqlite3 as the mp3 of data' idea for a
>> while now: e.g.
>> http://blog.gobansaor.com/2009/03/14/sqlite-as-the-mp3-of-data/
>>
>> This 'find a FASTA record quickly' problem is a classic example of the
>> tradeoffs you make in bioinformatics as it is done: a proper database
>> would make your life a lot easier, but in reality you have a lot of
>> huge files that you are unwilling to manage in a proper RDBMS just to
>> answer a few simple queries.  Also, no one can agree on their
>> favourite database so you have to support a database independent text
>> format, in case someone else doesn't have yours installed. So instead,
>> the tools around these files gradually creep towards database
>> operations (in this case an indexed identifier).
>>
>> This is where sqlite comes in - you get all the benefits of a database
>> but in a *portable* flat file.  You can send it, copy it or whatever
>> and it still looks the same.  Even better, you can attach two sqlite
>> databases at once and do joins (think a fasta sqlite file and a gff3
>> sqlite file to extract sequences for certain features instantly).  Of
>> course, since it is a database you get the benefit of fast aggregate
>> operations.
>
> [ ... ]
>
> A couple of thoughts --
>
> I started out trying to use SQL for bioinformatic data storage about ...
> 8 years ago.  It sucked then, and it still sucks :).  Now that I have pygr as
> an alternative, I am thoroughly convinced that a relational database doesn't
> fit my needs (which basically involve doing overlap queries and storing
> relationships between objects -- hence invoking the well-known
> object-relational impedance problem that blocks using a SQL db).
>
> It might suffice for a sequence storage file, however.  There are two
> countervailing concerns.

My point is not to try to convince you that you should use SQL for
your data.  It's more that if we all used sqlite as 'text file plus'
for sending data around, it would make our lives easier.
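The ATTACH trick mentioned above can be sketched in a few lines.  The
schema here (a `seqs` table and a `features` table) is an illustrative
assumption, not a standard layout; in practice the two databases would
be separate files built from a FASTA and a GFF3 respectively:

```python
import sqlite3

# A minimal sketch of attaching two sqlite databases and joining them.
# In-memory databases stand in for 'sequences.db' and 'annotations.db'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seqs (id TEXT PRIMARY KEY, sequence TEXT)")
conn.execute("INSERT INTO seqs VALUES ('chr1', 'ACGTACGTAC')")

# ATTACH makes a second database visible under a schema name ('gff').
conn.execute("ATTACH DATABASE ':memory:' AS gff")
conn.execute("CREATE TABLE gff.features "
             "(seqid TEXT, start INT, stop INT, type TEXT)")
conn.execute("INSERT INTO gff.features VALUES ('chr1', 3, 6, 'gene')")

# One query extracts the subsequence for every feature -- no file
# parsing, no iteration.  substr() is 1-based, like most biology
# coordinates.
rows = conn.execute("""
    SELECT f.seqid, substr(s.sequence, f.start, f.stop - f.start + 1)
    FROM gff.features AS f JOIN seqs AS s ON s.id = f.seqid
    WHERE f.type = 'gene'
""").fetchall()
# rows == [('chr1', 'GTAC')]
```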

>
> First, SQL doesn't provide any standard way to get a substring of a
> sequence record, so if I want to get an arbitrary slice from chr1 of
> hg17, I have to read all of chr1 into memory and then get my slice from
> that in-memory string.  I believe this is what motivated Chris Lee's
> original seqdb implementation for pygr... and it's easy to add to
> screed.

See Peter's comment on substring.
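For what it's worth, sqlite's built-in substr() already does the slicing
inside the engine, so the full chromosome string never has to cross into
your Python process.  A minimal sketch, assuming the same hypothetical
`seqs` table as above:

```python
import sqlite3

# Sketch: slice a stored sequence server-side with substr() instead of
# reading the whole record into Python and slicing there.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seqs (id TEXT PRIMARY KEY, sequence TEXT)")
conn.execute("INSERT INTO seqs VALUES ('chr1', 'ACGTACGTACGT')")

def get_slice(db, seq_id, start, length):
    # substr() uses 1-based coordinates.
    row = db.execute(
        "SELECT substr(sequence, ?, ?) FROM seqs WHERE id = ?",
        (start, length, seq_id)).fetchone()
    return row[0] if row else None

print(get_slice(conn, 'chr1', 5, 4))  # ACGT
```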

> Second, I don't think it's likely that a read-write relational database like
> sqlite will, in the end, be faster than a read-only indexed flat file.  This
> is of little concern for small data sets like 454 :).  However, I asked
> Alex to design screed with a billion-sequence database in mind, from e.g.
> the next generation of Illumina sequencers... and screed retrieval seems to be
> constant with database size, so that's a good sign.  I don't have a crystal
> ball but it seems clear that our sequencing bonanza will only continue to
> expand and I'd like to plan ahead a year or two.

Still doing that 'low input, high throughput, no output' science ;) -
http://www.mc.vanderbilt.edu/reporter/index.html?ID=5027

I would hold that a corollary of Greenspun's 10th rule is that any
indexed data format will contain an ad hoc, informally-specified,
bug-ridden, slow implementation of half of SQL.  You would only be
using the read parts of SQL here, not the write parts.

> I am also completely unconcerned about compatibility with heathens who have not
> Seen the Light of Python, so I don't care too much about cross-language
> compatibility.  Language bigotry can really simplify certain matters :)

OK, but my point is that sqlite is the opposite of pygr: available
to everyone regardless of their enlightenment, and ideal for the lowest
levels of data interchange.

> All of this having been said, Alex is going to benchmark screed against sqlite
> to see if we are outperforming it, and either way pygr is planning to move to
> sqlite for several other bits of functionality.

Making the data import fast takes a bit of tuning.  This will help:
http://erictheturtle.blogspot.com/2009/05/fastest-bulk-import-into-sqlite.html
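The usual tricks from that post, sketched below: relax durability during
the load, batch everything into one transaction with executemany(), and
build the index afterwards.  The `seqs` schema is again an illustrative
assumption.  This is only safe because a failed import can simply be
re-run from the original FASTA file:

```python
import sqlite3

def bulk_load(db_path, records):
    # Sketch of a tuned bulk import into sqlite.
    conn = sqlite3.connect(db_path)
    # Trade crash-safety for speed during the load.
    conn.execute("PRAGMA synchronous = OFF")
    conn.execute("PRAGMA journal_mode = MEMORY")
    conn.execute("CREATE TABLE IF NOT EXISTS seqs (id TEXT, sequence TEXT)")
    with conn:  # one transaction for the whole batch
        conn.executemany("INSERT INTO seqs VALUES (?, ?)", records)
    # Indexing after loading is much cheaper than maintaining the
    # index on every insert.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_seqs_id ON seqs (id)")
    return conn

db = bulk_load(":memory:",
               [("seq%d" % i, "ACGT" * 5) for i in range(1000)])
count = db.execute("SELECT count(*) FROM seqs").fetchone()[0]
```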


James
-
http://www.machine-envy.com/blog/



More information about the biology-in-python mailing list