[bip] Loading sequences from FASTA files

Mon Nov 23 19:04:46 PST 2009

On Tue, Nov 24, 2009 at 12:49:09AM +0000, James Casbon wrote:
> 2009/11/19 C. Titus Brown <ctb at msu.edu>:
> > Alex Nolley in my lab has been working on an approach to speedily index FASTA
> > and FASTQ files and retrieve arbitrary sequences from them by ID, and we're
> > looking for information on what other people use. ?So far we've only
> > compared it to pygr's SequenceFileDB class.
> >
> > The goal is to be able to quickly retrieve sequences from a file by ID, e.g.
> >
> > ? sequence_db = SequenceFileDB('large_file.fasta')
> > ? seq = sequence_db['some_sequence']
> >
> > without iterating over the file or doing indexing of it more than once.
> >
> ...
> > Are there any other Python (or C-based) APIs that people are aware of?
> 
> I have been convinced by the 'sqlite3 as the mp3 of data' idea for a
> while now: e.g.
> http://blog.gobansaor.com/2009/03/14/sqlite-as-the-mp3-of-data/
> 
> This 'find a FASTA record quickly' problem is a classic example of the
> tradeoffs you make in bioinformatics as it is done: a proper database
> would make your life a lot easier, but in reality you have a lot of
> huge files that you are unwilling to manage in a proper RDMS just to
> answer a few simple queries.  Also, no one can agree on their
> favourite database so you have to support a database independent text
> format, in case someone else doesn't have yours installed. So instead,
> the tools around these files gradually creep towards database
> operations (in this case an indexed indentifier).
> 
> This is where sqlite comes in - you get all the benefits of a database
> but in a *portable* flat file.  You can send it, copy it or whatever
> and it still looks the same.  Even better, you can attach two sqlite
> databases at once and do joins (think a fasta sqlite file and a gff3
> sqlite file to extract sequences for certain features instantly).  Of
> course, since it is a database you get the benefit of fast aggregate
> operations.

[ ... ]

A couple of thoughts --

I started out trying to use SQL for bioinformatic data storage about ...
8 years ago.  It sucked then, and it still sucks :).  Now that I have pygr as
an alternative, I am thoroughly convinced that a relational database doesn't
fit my needs (which basically involve doing overlap queries and storing
relationships between objects -- hence invoking the well-known
object-relational impedance problem that blocks using a SQL db).

It might suffice for a sequence storage file, however.  There are two
countervailing concerns.

First, SQL doesn't provide any standard way to get a substring of a
sequence record, so if I want to get an arbitrary slice from chr1 of
hg17, I have to read all of chr1 into memory and then get my slice from
that in-memory string.  I believe this is what motivated Chris Lee's
original seqdb implementation for pygr... and it's easy to add to
screed.

Second, I don't think it's likely that a read-write relational database like
sqlite will, in the end, be faster than a read-only indexed flat file.  This
is of little concern for small data sets like 454 :).  However, I asked
Alex to design screed with a billion-sequence database in mind, from e.g.
the next generation of Illumina sequencers... and screed retrieval seems to be
constant with database size, so that's a good sign.  I don't have a crystal
ball but it seems clear that our sequencing bonanza will only continue to
expand and I'd like to plan ahead a year or two.

I am also completely unconcerned about compatibility with heathens who have not
Seen the Light of Python, so I don't care too much about cross-language
compatibility.  Language bigotry can really simplify certain matters :)

All of this having been said, Alex is going to benchmark screed against sqlite
to see if we are outperforming it, and either way pygr is planning to move to
sqlite for several other bits of functionality.

> Since python (and I believe perl) have great links to sqlite, most
> people out there could start using this straight away.

It's even better -- Python comes with sqlite as of 2.5!  And I think it's
the best option for large-scale data storage that is packaged with Python.

> Another thing I'd love to see is virtual table implementations of
> bioinformatic indexes (virtual table provide an extension point for
> custom indexes).  So you could use virtual tables to hook in the
> NCList indexes from pygr.  See http://www.ddj.com/database/202802959
> for more info on this.

OK, now that's really neat!  Thanks!

--titus
-- 
C. Titus Brown, ctb at msu.edu