[bip] Indexing big sequence databases

Mon Mar 29 05:40:18 PDT 2010

On Mon, Mar 29, 2010 at 09:28:30AM +0100, Peter wrote:
> On Mon, Mar 29, 2010 at 3:46 AM, C. Titus Brown <ctb at msu.edu> wrote:
> >
> > Hi all,
> >
> > with reference to this earlier thread in November about indexing FASTA
> > and FASTQ files,
> >
> > http://lists.idyll.org/pipermail/biology-in-python/2009-November/000499.html
> >
> > I posted an update:
> >
> > http://ivory.idyll.org/blog/mar-10/storing-and-retrieving-sequences.html
> >
> > Basically, taking James Casbon's advice, we've switched to using sqlite as our
> > backend for the dirty work of storing sequences.
> >
> > Comments & random thoughts welcome, as always.
> 
> Hi Titus,
> 
> Thanks for the update - interesting that SQLite was able to
> beat your custom code.

Yep.  And by a *lot*, like a factor of two.  Kind of depressing.

> Regarding your reference to Biopython's indexing which stores
> the file offsets in memory, we are aware that will only scale so
> far (although as always it depends on what you are trying to do
> - this works perfectly well for fairly simple tasks like resorting
> FASTQ files). We have already looked at storing the read name
> mapping to file offsets in SQLite:
> 
> http://lists.open-bio.org/pipermail/biopython/2009-December/005997.html
> 
> I'd be interested to see how you guys have handled flushing
> the index to disk (how often you do a commit) while building
> the index, and other issues to speed up writing the index.

We haven't put much effort into optimizing that yet: *anything*
involving these 10 GB files is annoying, so 5 min vs. 10 min for
indexing isn't a big deal.
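
For anyone following along, here is a minimal sketch of the offset-index
idea Peter describes, with commits done in batches rather than per record.
It is not the Biopython or screed code. The function names, the
`read_index` table, its columns, and the default batch size are all
made up for illustration, and it assumes plain four-line FASTQ records.

```python
import sqlite3


def build_offset_index(fastq_path, db_path, batch_size=100000):
    """Map each read name to the byte offset of its record.

    Hypothetical helper: assumes four-line FASTQ records and unique names.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS read_index "
        "(name TEXT PRIMARY KEY, pos INTEGER)"
    )
    batch = []
    # Binary mode so tell() reports true byte offsets.
    with open(fastq_path, "rb") as handle:
        while True:
            pos = handle.tell()
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate blank lines between records
            # Skip the sequence, '+' separator, and quality lines.
            for _ in range(3):
                handle.readline()
            # Read name = header text after '@', up to the first whitespace.
            name = header[1:].split(None, 1)[0].decode("ascii")
            batch.append((name, pos))
            if len(batch) >= batch_size:
                # One commit per batch, not per record.
                conn.executemany("INSERT INTO read_index VALUES (?, ?)", batch)
                conn.commit()
                batch = []
    if batch:
        conn.executemany("INSERT INTO read_index VALUES (?, ?)", batch)
        conn.commit()
    return conn


def fetch_record(fastq_path, conn, name):
    """Seek to the stored offset and return the record's four lines."""
    row = conn.execute(
        "SELECT pos FROM read_index WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    with open(fastq_path, "rb") as handle:
        handle.seek(row[0])
        return [handle.readline().decode("ascii").rstrip("\r\n") for _ in range(4)]
```

The batch size is the knob that trades memory for fewer commits; the value
above is a guess, not something we have measured.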

> P.S.: Note that the Biopython code works with many different
> file formats, not just FASTA and FASTQ but also things like
> GenBank, EMBL, SwissProt, and SFF. This means that trying
> to encode the entire file into an SQLite database is probably
> not a good idea for us in general. Hence why we are focusing
> on storing file offsets.

Sure, and I recognize the different philosophies of Biopython
and pygr, too.  The problem is that we're faced with ever-increasing
FASTQ file sizes, and not so much the rest... seems like a good idea
to optimize specifically for that!  Actually, I would be surprised if the
additional parsing involved in handling each record dynamically didn't have a
pretty large impact on scripts dealing with more than a few hundred
thousand records.
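
To make the contrast concrete, here is a rough sketch of the other
approach: parse each FASTQ record once and store the parsed fields in
SQLite, so later lookups need no re-parsing. This is only an illustration
and not the actual screed schema; the `reads` table, its columns, and the
function names are hypothetical, and it assumes well-formed four-line
FASTQ records with unique names.

```python
import sqlite3


def store_fastq(fastq_path, db_path):
    """Parse a FASTQ file once and store every record in SQLite.

    Hypothetical helper: assumes well-formed four-line records.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reads "
        "(name TEXT PRIMARY KEY, sequence TEXT, quality TEXT)"
    )

    def records():
        with open(fastq_path) as handle:
            while True:
                header = handle.readline()
                if not header:
                    break  # end of file
                if not header.strip():
                    continue  # tolerate blank lines between records
                sequence = handle.readline().strip()
                handle.readline()  # '+' separator line, not stored
                quality = handle.readline().strip()
                # Read name = header text after '@', up to the first whitespace.
                yield header[1:].split(None, 1)[0], sequence, quality

    # 'with conn' wraps all inserts in one transaction and commits at the end.
    with conn:
        conn.executemany("INSERT INTO reads VALUES (?, ?, ?)", records())
    return conn


def get_read(conn, name):
    """Return (name, sequence, quality) straight from the database."""
    row = conn.execute(
        "SELECT name, sequence, quality FROM reads WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    return row
```

Retrieval here is a single keyed SELECT with no file parsing at lookup
time, which is the cost the paragraph above is worried about.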

The screed format supports all sorts of data (as long as it's sequence-linked).
Maybe we'll try SFF or SwissProt.

cheers,
--titus
-- 
C. Titus Brown, ctb at msu.edu