[bip] Indexing big sequence databases

Brent Pedersen bpederse at gmail.com
Sun Apr 11 11:24:48 PDT 2010


On Sun, Apr 11, 2010 at 3:23 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi Brent,
>
> Interesting post.
>
> On Sun, Apr 11, 2010 at 1:52 AM, Brent Pedersen <bpederse at gmail.com> wrote:
>>
>> I had a look at screed today, and a bit of the Biopython stuff. I also
>> wanted to index some SAM files by read ID and cobbled together a
>> generic indexing class that is format-agnostic. I figured that for a lot
>> of bioinformatics work, we all have our own parsers; we just need an
>> index to get us from an ID to the place in the file where we can tell
>> the parser to do its thing.
>> So, given a parser interface like:
>> FastQParser(filehandle)
>> where the class takes a filehandle and reads a single FASTQ record,
>> the same parser works for both indexing and accessing, because just by
>> parsing a record, the FastQParser will (read 4 lines and) advance the
>> filehandle to the next record.
>
> That is kind of what the Biopython index stuff does - but we cope
> with pathological FASTQ files with line wrapping (they don't have
> to be 4 lines per record). Insert grumble about poorly defined file
> formats here.
>

Hi Peter,

Yeah, that's just evil, but that's for the parser to handle; FileIndex
doesn't need to know about that.
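
To make the pattern concrete, here's a minimal sketch (illustrative
names; a plain dict stands in for the Tokyo Cabinet btree that the real
FileIndex uses, and the toy parser assumes the plain 4-line FASTQ case):

class SimpleFastQParser(object):
    """Read one 4-line FASTQ record and leave the filehandle
    positioned at the start of the next record."""
    def __init__(self, fh):
        self.id = fh.readline().rstrip()[1:]   # drop the leading '@'
        self.seq = fh.readline().rstrip()
        fh.readline()                          # the '+' separator line
        self.qual = fh.readline().rstrip()

def build_index(path, parser_class):
    """Keep handing the filehandle to the parser, saving
    id -> tell() position for the start of each record."""
    idx = {}
    fh = open(path)
    pos = fh.tell()
    while fh.read(1):            # peek one byte to detect EOF
        fh.seek(pos)
        rec = parser_class(fh)   # parsing advances fh past one record
        idx[rec.id] = pos
        pos = fh.tell()
    fh.close()
    return idx

def get_record(path, idx, rec_id, parser_class):
    """Seek to the saved position and let the parser do its thing."""
    fh = open(path)
    fh.seek(idx[rec_id])
    rec = parser_class(fh)
    fh.close()
    return rec

The point is that the index never parses anything itself; any class or
function that consumes exactly one record from a filehandle will do.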

>> The same will work for SAM files. So the index just keeps feeding
>> the filehandle back to the class (or function) and saving each ID
>> along with the new file seek position. I implemented this using a
>> Tokyo Cabinet btree.
>> Anyway, I wrote this up in some detail (and, hopefully, clarity). I'm
>> also interested in feedback:
>> http://hackmap.blogspot.com/2010/04/fileindex.html
>>
>> I did open a ticket about screed's benchmarking code. FWIW, I expected
>> that using TC would be much faster, but screed compares pretty well to
>> the TC implementation.
>
> Was it the new SQLite-based screed you benchmarked?
>
> Did you benchmark my SQLite based index?
>
> I haven't tried a Tokyo Cabinet btree yet - I should.
>
> Peter
>

I wrote an informal timing script that covers biopython-sqlite,
screed-sqlite, fileindex, and bsddbfileindex.
bsddbfileindex uses a bsddb btree, as suggested by Istvan Albert.
That simplifies and shortens the implementation by a few lines compared
to the TC version, and it could likely do better if I tuned the btree
parameters, but it doesn't support multiple values per key. All of the
implementations seem to use very little memory.
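
For reference, the bsddb variant looks roughly like this (Python 2's
stdlib bsddb module; the helper name and build loop are illustrative,
reusing the same pattern as the dict sketch above):

import bsddb

def build_bsddb_index(path, parser_class, idx_path):
    db = bsddb.btopen(idx_path, 'c')   # disk-backed btree
    fh = open(path)
    pos = fh.tell()
    while fh.read(1):                  # peek one byte to detect EOF
        fh.seek(pos)
        rec = parser_class(fh)
        db[rec.id] = str(pos)          # one value per key; last write wins
        pos = fh.tell()
    fh.close()
    db.close()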

Peter, from what I can tell, both from reading the code and from
running it, your implementation currently always re-indexes - is that
correct?
I think biopython-sqlite's insert time is longer than screed's because
you check whether the key is already in the database before every
insert, whereas screed just enforces uniqueness via a unique index. And
I think both SQLite implementations could get much faster inserts by
setting the isolation level and doing transactions manually.
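
Something along these lines (Python's sqlite3 module; the table name
and the sample records are placeholders for the example):

import sqlite3

con = sqlite3.connect('offsets.idx')
con.isolation_level = None           # take over transaction handling
con.execute('CREATE TABLE offsets (id TEXT UNIQUE, pos INTEGER)')

# stand-in for real (id, offset) pairs coming from the parser
records = [('read1', 0), ('read2', 132)]

con.execute('BEGIN')
for rec_id, pos in records:
    # the UNIQUE constraint rejects duplicates; no SELECT-before-INSERT
    con.execute('INSERT INTO offsets VALUES (?, ?)', (rec_id, pos))
con.execute('COMMIT')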

One thing I notice is that when running the search, both screed-sqlite
and biopython-sqlite are just cranking on my hard drive. It'd be
interesting to run this on an SSD.

The benchmark script is here:
http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py
I think it's sane, but let me know if you see any problems. Running it
on my machine against 15.5 million FASTQ records (60M+ lines), while I
was doing other stuff, I get the output below, with times in seconds.
(There is much less difference when using only 500K records.)

benchmarking fastq file with 15646356 records
performing 100000 random queries

screed
------
create: 707.855
search: 683.427

biopython-sqlite
----------------
create: 758.231
search: 1443.416

fileindex
---------
create: 377.771
search: 603.685

bsddbfileindex
--------------
create: 445.524
search: 954.455


