[bip] Indexing big sequence databases

Mon Mar 29 09:16:18 PDT 2010

On Mon, Mar 29, 2010 at 8:48 AM, C. Titus Brown <ctb at msu.edu> wrote:
> On Mon, Mar 29, 2010 at 11:45:14AM -0400, Paul Davis wrote:
>> On Sun, Mar 28, 2010 at 10:46 PM, C. Titus Brown <ctb at msu.edu> wrote:
>> > Hi all,
>> >
>> > with reference to this earlier thread in November about indexing FASTA
>> > and FASTQ files,
>> >
>> > http://lists.idyll.org/pipermail/biology-in-python/2009-November/000499.html
>> >
>> > I posted an update:
>> >
>> > http://ivory.idyll.org/blog/mar-10/storing-and-retrieving-sequences.html
>> >
>> > Basically, taking James Casbon's advice, we've switched to using sqlite as our
>> > backend for the dirty work of storing sequences.
>> >
>> > Comments & random thoughts welcome, as always.
>>
>> You might also be interested in testing Tokyo Cabinet if your queries
>> are limited to "fetch by name" and "iterate over everything." It's
>> treated me pretty well but I've never gone out of my way to benchmark
>> it against other solutions as it was always fast enough.
>
> Yep, there are LOTS of choices now -- see my comments at the bottom of my post.
>
> CDB is one I'm particularly interested in looking at.  The challenge is finding
> something that's fast, supported, really easy to install, and mature.  sqlite
> seems to be a good compromise so far, esp since it's (surprisingly!) faster
> than the dumb-as-bricks approach we tried out.
>
> Now, if someone shows that we can get a 10x speedup for random access over
> sqlite, I will figure out how to solve the installation problems :)
>
> cheers,
> --titus
> --
> C. Titus Brown, ctb at msu.edu

hi, this looks very interesting. thanks for making it available
separately from pygr.
(re peter) regarding random access to fastq: i need random access
for mapping converted reads (artificially converting C to T or G to A)
following an alignment of bisulfite-treated reads [1]. that requires
recovering the original, non-converted sequence in random order. up to
now, i've done that by fseek'ing to a file position in the raw sequence
file (sequence only, no fastq headers or quality info), which is
possible because all the reads are the same length. but that meant
discarding the quality info, which bowtie would otherwise use.
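
to make the fseek trick concrete, here is a minimal sketch in python. it
is not the methylcoder code: the function names are made up for
illustration, and it assumes the raw file holds one read per line, each
followed by a single "\n", so read i starts at byte i * (read_len + 1).

```python
def write_raw(path, seqs):
    """write equal-length reads, one per line, and return the read length.

    hypothetical helper: the one-read-per-line layout is an assumption.
    """
    read_len = len(seqs[0])
    if any(len(s) != read_len for s in seqs):
        raise ValueError("all reads must be the same length")
    with open(path, "wb") as fh:
        for s in seqs:
            fh.write(s.encode("ascii") + b"\n")
    return read_len


def fetch_read(fh, i, read_len):
    """return the i-th (0-based) read by seeking straight to its offset."""
    fh.seek(i * (read_len + 1))  # +1 skips the newline after each read
    return fh.read(read_len).decode("ascii")


def c_to_t(seq):
    """in-silico conversion applied to a read before alignment."""
    return seq.replace("C", "T")
```

the converted read goes to the aligner, and fetch_read() recovers the
original afterwards from the read's position in the file. as written
this carries no quality info, which is the limitation described above.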

i had started an indexing scheme using a tokyo cabinet hash to map the
fastq record header key to the fseek position value. i suspect that
would be faster than sqlite, but that's certainly not the bottleneck
in my pipeline, so i'll try screed for now.
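
a rough sketch of that header -> offset idea, with a plain python dict
standing in for the tokyo cabinet hash (so this says nothing about
speed). the function names are hypothetical, and it assumes strict
4-line fastq records and uses the whole header line, minus the leading
"@", as the key.

```python
def build_index(fastq_path):
    """map each record header (without '@') to its byte offset in the file."""
    index = {}
    with open(fastq_path, "rb") as fh:
        while True:
            offset = fh.tell()
            header = fh.readline()
            if not header:
                break  # end of file
            # assumption: exactly 4 lines per record, so skip the
            # sequence, '+' and quality lines to reach the next header
            for _ in range(3):
                fh.readline()
            key = header[1:].rstrip(b"\r\n").decode("ascii")
            index[key] = offset
    return index


def fetch_record(fastq_path, index, key):
    """seek to the stored offset and return (header, sequence, quality)."""
    with open(fastq_path, "rb") as fh:
        fh.seek(index[key])
        lines = [fh.readline().rstrip(b"\r\n").decode("ascii")
                 for _ in range(4)]
    return lines[0][1:], lines[1], lines[3]
```

unlike the fixed-length raw file, this keeps the quality string and does
not need all reads to be the same length. a persistent key/value store
would replace the dict for files too big to index in memory.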

btw, cdb has a 4G limit, no?

-brent

[1] http://github.com/brentp/methylcode/blob/master/code/methylcoder.py#L137