[bip] Indexing big sequence databases

Sun Apr 11 03:23:22 PDT 2010

Hi Brent,

Interesting post.

On Sun, Apr 11, 2010 at 1:52 AM, Brent Pedersen <bpederse at gmail.com> wrote:
>
> i had a look at screed today, and a bit of the biopython stuff. i also
> wanted to index some sam files by read-id and cobbled together a
> generic indexing class that is format agnostic. i figured for a lot of
> bioinformatics stuff, we all have our own parsers, we just need an
> index to get us from an id to the place in the file where we can tell
> the parser to do its thing.
> so with a parser interface like:
> FastQParser(filehandle)
> where the class takes a filehandle and reads a single fastq record.
> that can be used for indexing and accessing because just by parsing a
> record, the FastQParser will (read 4 lines and) advance the filehandle
> to the next record.

That is kind of what the Biopython index stuff does - but we cope
with pathological FASTQ files with line wrapping (they don't have
to be 4 lines per record). Insert grumble about poorly defined file
formats here.
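
To make the idea concrete, here is a rough sketch of the scheme described
above: a reader that consumes one record and leaves the filehandle at the
start of the next, plus a loop that saves each id with the offset where its
record began. This is not the Biopython or fileindex code. The names
read_fastq_record, build_index and fetch are made up for illustration, the
index is a plain in-memory dict, and the file is assumed to be opened in
binary mode so that tell() and seek() deal in byte offsets. The reader copes
with wrapped records by collecting quality lines until they match the
sequence length, rather than assuming 4 lines per record.

```python
def read_fastq_record(handle):
    """Read one FASTQ record from a binary handle; return None at end of file.

    Sequence and quality may each be wrapped over several lines.
    Returns a tuple (read_id, sequence, quality).
    """
    header = handle.readline()
    while header and not header.strip():
        header = handle.readline()  # skip blank lines between records
    if not header:
        return None
    if not header.startswith(b"@"):
        raise ValueError("expected '@' at the start of a FASTQ record")
    read_id = header[1:].split(None, 1)[0].decode()

    # Sequence lines run until the '+' separator line.
    seq_parts = []
    while True:
        line = handle.readline()
        if not line:
            raise ValueError("file ended inside record %r" % read_id)
        if line.startswith(b"+"):
            break
        seq_parts.append(line.strip())
    seq = b"".join(seq_parts)

    # Quality lines may start with '@', so count characters instead of
    # looking for the next header.
    qual = b""
    while len(qual) < len(seq):
        line = handle.readline()
        if not line:
            raise ValueError("file ended inside record %r" % read_id)
        qual += line.strip()
    if len(qual) != len(seq):
        raise ValueError("quality length mismatch in record %r" % read_id)
    return read_id, seq.decode(), qual.decode()


def build_index(handle, read_record):
    """Map each record id to the byte offset where its record starts."""
    index = {}
    while True:
        offset = handle.tell()
        record = read_record(handle)
        if record is None:
            break
        index[record[0]] = offset
    return index


def fetch(handle, index, read_id, read_record):
    """Seek to the saved offset and let the parser do its thing."""
    handle.seek(index[read_id])
    return read_record(handle)
```

Because build_index only needs "read one record, return something whose
first item is the id", the same loop should work with a SAM line reader or
any other parser that advances the filehandle by exactly one record.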

> the same will work for sam files. so then the index just keeps feeding
> the filehandle back to the class (or function) and saving the id and
> the new file fseek position. i implemented this using a tokyo cabinet
> btree.
> anyway, i wrote this up in some detail (and hopefully clarity). i also
> am interested in feedback:
> http://hackmap.blogspot.com/2010/04/fileindex.html
>
> i did open a ticket about screed's benchmarking code. fwiw, i expected
> that using TC would be much faster but screed compares pretty well to
> the TC implementation.

Was it the new SQLite bases screed you benchmarked?

Did you benchmark my SQLite based index?

I haven't tried a Tokyo Cabinet btree yet - I should.
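
For comparison, the general shape of an SQLite-backed id-to-offset store can
be sketched with nothing but the standard library sqlite3 module. This is
only an illustration of the approach, not the actual schema used by
Biopython or screed. The table name, column names and the two helper
functions are invented for the example.

```python
import sqlite3


def save_offsets(db_path, offsets):
    """Store a {read_id: file_offset} mapping in an SQLite database.

    Returns the open connection so the caller can do lookups.
    """
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS offsets ("
            "read_id TEXT PRIMARY KEY, "
            "file_offset INTEGER NOT NULL)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO offsets (read_id, file_offset) "
            "VALUES (?, ?)",
            offsets.items(),
        )
    return conn


def lookup_offset(conn, read_id):
    """Return the saved file offset for read_id, or raise KeyError."""
    row = conn.execute(
        "SELECT file_offset FROM offsets WHERE read_id = ?", (read_id,)
    ).fetchone()
    if row is None:
        raise KeyError(read_id)
    return row[0]
```

The PRIMARY KEY on read_id gives SQLite an index to search on lookup, which
is roughly the job the btree does in the Tokyo Cabinet version.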

Peter