[bip] Indexing big sequence databases

Peter biopython at maubp.freeserve.co.uk
Sun Apr 11 11:57:30 PDT 2010


On Sun, Apr 11, 2010 at 7:24 PM, Brent Pedersen <bpederse at gmail.com> wrote:
>
> peter, from what i can tell, both by reading the code, and running,
> your implementation currently always tries to re-index, is that
> correct?

Yeah - support for loading an existing SQLite index would be needed
before merging that branch to our trunk. This would be a
secondary aim (loading + saving indexes is useful in itself, not just
as a way to reduce the memory load of keeping the dict in RAM),
but I would like to co-ordinate this with BioPerl etc to use a common
SQLite schema.
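
For illustration, reusing an index that is already on disk could look
roughly like the sketch below. It uses only the standard library sqlite3
module; the function name open_index and the build callback are
hypothetical, not part of the branch, and no particular schema (shared
with BioPerl or otherwise) is assumed.

```python
import os
import sqlite3


def open_index(index_path, build):
    """Open the SQLite index at index_path, building it only if it is missing.

    `build` is a hypothetical callback that takes an open connection and
    fills it with whatever tables the index needs.
    """
    is_new = not os.path.exists(index_path)
    conn = sqlite3.connect(index_path)
    if is_new:
        # First run: scan the sequence file and populate the index.
        build(conn)
        conn.commit()
    # Later runs skip the scan and reuse the stored index as-is.
    return conn
```

A real version would also want to check that the index still matches the
sequence file (for example by storing the file size or modification time),
which this sketch does not do.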

> i think biopython-sqlite insert time is longer than screed because you
> check if the key is already in the database before every insert.
> whereas screed just enforces this via a unique index. and i think both
> sqlite implementations could have much faster inserts by setting the
> isolation level and doing transactions manually.

I'd have to check the schema/code to remind myself what I did,
but getting SQLite to enforce uniqueness does make perfect
sense - if I wasn't doing that, I should be. I hadn't looked at
optimising the transaction commit strategy at all.
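
A minimal sketch of that idea, using the standard library sqlite3 module:
the PRIMARY KEY makes SQLite reject duplicate identifiers itself, so no
lookup is needed before each insert, and the whole load runs in one
transaction. The table and column names (offsets, id, pos) and the
function name are made up for the example and are not taken from the
branch or from screed.

```python
import sqlite3


def build_index(conn, records):
    """Insert (id, pos) pairs, letting SQLite enforce unique ids."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS offsets (id TEXT PRIMARY KEY, pos INTEGER)"
    )
    try:
        # "with conn" wraps the inserts in a single transaction:
        # commit on success, rollback if any insert fails.
        with conn:
            conn.executemany(
                "INSERT INTO offsets (id, pos) VALUES (?, ?)", records
            )
    except sqlite3.IntegrityError as err:
        # Raised by SQLite when an id is already in the table.
        raise ValueError("duplicate record id: %s" % err)
```

Whether this is actually faster than checking each key first would need
measuring on the benchmark data.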

> one thing i notice is that when running the search, both screed-sqlite
> and biopython-sqlite are just cranking on my hard-drive. it'd be
> interesting to run this on an SSD.

Yeah - I was wondering if trying things like batching the commits
might reduce the disk load.
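
Batching the commits could be sketched as below, again with the standard
library sqlite3 module. The table layout, the function name and the
default batch size are assumptions for the example, and whether it really
reduces the disk load is untested.

```python
import sqlite3
from itertools import islice


def insert_in_batches(conn, records, batch_size=100000):
    """Insert (id, pos) pairs, committing once per batch instead of per row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS offsets (id TEXT PRIMARY KEY, pos INTEGER)"
    )
    records = iter(records)
    total = 0
    while True:
        # Pull the next batch_size records without loading everything into RAM.
        batch = list(islice(records, batch_size))
        if not batch:
            break
        conn.executemany("INSERT INTO offsets (id, pos) VALUES (?, ?)", batch)
        conn.commit()  # one commit (and one sync to disk) per batch
        total += len(batch)
    return total
```

The batch size trades memory for fewer commits, so it is something to tune
against the benchmark rather than a recommended value.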

> the benchmark script is here:
> http://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py
> i think it's sane, but let me know if you see any problems. running it
> on my machine on 15.5 million fastq records, 60M+ lines (while i was
> doing other stuff) i get the output below with times in seconds.
> (there is much less difference when using only 500K records):
>
> benchmarking fastq file with 15646356 records
> performing 100000 random queries
>
> screed
> ------
> create: 707.855
> search: 683.427
>
> biopython-sqlite
> ----------------
> create: 758.231
> search: 1443.416
>
> fileindex
> ---------
> create: 377.771
> search: 603.685
>
> bsddbfileindex
> --------------
> create: 445.524
> search: 954.455
>

Plenty of room for improvement then ;)

Thanks,

Peter


