[bip] Indexing big sequence databases

C. Titus Brown ctb at msu.edu
Mon Mar 29 08:48:31 PDT 2010


On Mon, Mar 29, 2010 at 11:45:14AM -0400, Paul Davis wrote:
> On Sun, Mar 28, 2010 at 10:46 PM, C. Titus Brown <ctb at msu.edu> wrote:
> > Hi all,
> >
> > with reference to this earlier thread in November about indexing FASTA
> > and FASTQ files,
> >
> > http://lists.idyll.org/pipermail/biology-in-python/2009-November/000499.html
> >
> > I posted an update:
> >
> > http://ivory.idyll.org/blog/mar-10/storing-and-retrieving-sequences.html
> >
> > Basically, taking James Casbon's advice, we've switched to using sqlite as our
> > backend for the dirty work of storing sequences.
> >
> > Comments & random thoughts welcome, as always.
> 
> You might also be interested in testing Tokyo Cabinet if your queries
> are limited to "fetch by name" and "iterate over everything." It's
> treated me pretty well but I've never gone out of my way to benchmark
> it against other solutions as it was always fast enough.

Yep, there are LOTS of choices now -- see my comments at the bottom of my post.

CDB is one I'm particularly interested in looking at.  The challenge is finding
something that's fast, supported, really easy to install, and mature.  sqlite
seems to be a good compromise so far, esp since it's (surprisingly!) faster
than the dumb-as-bricks approach we tried out.
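
To make the two access patterns mentioned above concrete ("fetch by name" and
"iterate over everything"), here is a minimal sketch on top of sqlite using
Python's standard sqlite3 module.  The table name, column names, and helper
functions are all made up for illustration; this is not the code from the
blog post, just one plausible shape for it.

```python
import sqlite3


def build_index(conn, records):
    """Store (name, sequence) pairs; 'sequences' is a hypothetical table name."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences ("
        "name TEXT PRIMARY KEY, seq TEXT NOT NULL)"
    )
    # 'with conn' wraps the inserts in one transaction and commits on success.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO sequences (name, seq) VALUES (?, ?)",
            records,
        )


def fetch_by_name(conn, name):
    """Return the sequence stored under 'name', or None if it is absent."""
    row = conn.execute(
        "SELECT seq FROM sequences WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row is not None else None


def iterate_all(conn):
    """Yield every (name, seq) pair, ordered by name."""
    for name, seq in conn.execute(
        "SELECT name, seq FROM sequences ORDER BY name"
    ):
        yield name, seq


# Toy usage with an in-memory database and two invented records.
conn = sqlite3.connect(":memory:")
build_index(conn, [("read1", "ACGT"), ("read2", "GGCCTA")])
```

The PRIMARY KEY on name is what gives sqlite an index to use for the
by-name lookup; a real FASTA/FASTQ loader would also need a parser and
probably a column for quality scores, both omitted here.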

Now, if someone shows that we can get a 10x speedup for random access over
sqlite, I will figure out how to solve the installation problems :)
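
A rough sketch of how such a random-access comparison could be timed is
below.  It only exercises sqlite on invented in-memory data, so the numbers
it prints say nothing about real on-disk performance; the point is that any
other backend (Tokyo Cabinet, CDB, ...) could be dropped in by passing a
different fetch function.  The function and table names are hypothetical.

```python
import random
import sqlite3
import time


def time_random_access(fetch, names, n_lookups=1000, seed=0):
    """Time n_lookups calls to fetch(name) on randomly chosen names."""
    rng = random.Random(seed)  # fixed seed: same lookups for every backend
    picks = [rng.choice(names) for _ in range(n_lookups)]
    start = time.perf_counter()
    for name in picks:
        fetch(name)
    return time.perf_counter() - start


# Invented toy data: 10,000 records of 100 bases each.
names = ["seq%d" % i for i in range(10000)]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequences (name TEXT PRIMARY KEY, seq TEXT)")
with conn:
    conn.executemany(
        "INSERT INTO sequences VALUES (?, ?)",
        ((n, "ACGT" * 25) for n in names),
    )


def sqlite_fetch(name):
    """Look up one sequence by name in the toy sqlite table."""
    return conn.execute(
        "SELECT seq FROM sequences WHERE name = ?", (name,)
    ).fetchone()[0]


elapsed = time_random_access(sqlite_fetch, names)
print("sqlite: 1000 lookups in %.4f s" % elapsed)
```

A fair comparison would use on-disk files much larger than RAM and the same
list of lookups for each backend, which is why the helper takes a seed.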

cheers,
--titus
-- 
C. Titus Brown, ctb at msu.edu


