[bip] Indexing big sequence databases

Brent Pedersen bpederse at gmail.com
Sat Apr 10 17:52:42 PDT 2010


On Sun, Mar 28, 2010 at 7:46 PM, C. Titus Brown <ctb at msu.edu> wrote:
> Hi all,
>
> with reference to this earlier thread in November about indexing FASTA
> and FASTQ files,
>
>  http://lists.idyll.org/pipermail/biology-in-python/2009-November/000499.html
>
> I posted an update:
>
>  http://ivory.idyll.org/blog/mar-10/storing-and-retrieving-sequences.html
>
> Basically, taking James Casbon's advice, we've switched to using sqlite as our
> backend for the dirty work of storing sequences.
>
> Comments & random thoughts welcome, as always.
>
> cheers,
> --titus
> --
> C. Titus Brown, ctb at msu.edu
>

i had a look at screed today, and at some of the biopython code. i also
wanted to index some sam files by read-id, so i cobbled together a
generic indexing class that is format agnostic. i figured that for a lot
of bioinformatics work we all have our own parsers; we just need an
index to get us from an id to the place in the file where we can tell
the parser to do its thing.
so take a parser interface like:
FastQParser(filehandle)
where the class takes a filehandle and reads a single fastq record.
that same class can be used for both indexing and access, because just
by parsing a record the FastQParser will (read 4 lines and) advance the
filehandle to the next record.
the same will work for sam files. so the indexer just keeps feeding
the filehandle back to the class (or function), saving each id together
with the file position to seek back to. i implemented this using a
tokyo cabinet btree.
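
to make the idea concrete, here is a minimal sketch of that loop. it is
not the code from the blog post: a plain dict stands in for the tokyo
cabinet btree, the FastQParser body and the build_index and fetch helper
names are assumptions made up for illustration, and it records the
offset where each record starts by calling tell() before handing the
filehandle to the parser.

```python
import io


class FastQParser:
    """hypothetical parser: consumes exactly one 4-line fastq record."""

    def __init__(self, fh):
        header = fh.readline()
        if not header.strip():
            # empty read (or a trailing blank line): treat as end of input
            raise EOFError("no more records")
        self.id = header[1:].split()[0].decode("ascii")  # drop the leading '@'
        self.seq = fh.readline().strip().decode("ascii")
        fh.readline()  # '+' separator line, ignored
        self.qual = fh.readline().strip().decode("ascii")


def build_index(fh, parser):
    """map record id -> byte offset of the record (dict in place of a btree)."""
    index = {}
    while True:
        pos = fh.tell()  # offset where the next record starts
        try:
            record = parser(fh)  # parsing advances fh to the next record
        except EOFError:
            break
        index[record.id] = pos
    return index


def fetch(fh, index, parser, key):
    """seek to the stored offset and let the parser do its thing."""
    fh.seek(index[key])
    return parser(fh)


data = b"@r1 first read\nACGT\n+\nIIII\n@r2\nGGCC\n+\n!!!!\n"
fh = io.BytesIO(data)  # stands in for open("reads.fastq", "rb")
index = build_index(fh, FastQParser)
print(fetch(fh, index, FastQParser, "r2").seq)
```

the file is read in binary mode so that tell() and seek() work on plain
byte offsets. any other parser that reads one record and leaves the
filehandle at the start of the next could be passed in the same way.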
anyway, i wrote this up in some detail (and hopefully with some
clarity). i'm also interested in feedback:
http://hackmap.blogspot.com/2010/04/fileindex.html

i did open a ticket about screed's benchmarking code. fwiw, i expected
that using tokyo cabinet (TC) would be much faster, but screed compares
pretty well to the TC implementation.
-b


