[bip] Indexing big sequence databases

Mon Mar 29 01:28:30 PDT 2010

On Mon, Mar 29, 2010 at 3:46 AM, C. Titus Brown <ctb at msu.edu> wrote:
>
> Hi all,
>
> with reference to this earlier thread in November about indexing FASTA
> and FASTQ files,
>
>  http://lists.idyll.org/pipermail/biology-in-python/2009-November/000499.html
>
> I posted an update:
>
>  http://ivory.idyll.org/blog/mar-10/storing-and-retrieving-sequences.html
>
> Basically, taking James Casbon's advice, we've switched to using sqlite as our
> backend for the dirty work of storing sequences.
>
> Comments & random thoughts welcome, as always.
>
> cheers,
> --titus

Hi Titus,

Thanks for the update - interesting that SQLite was able to
beat your custom code.

Regarding your reference to Biopython's indexing which stores
the file offsets in memory, we are aware that will only scale so
far (although as always it depends on what you are trying to do
- this works perfectly well for fairly simple tasks like re-sorting
FASTQ files). We have already looked at storing the read name
mapping to file offsets in SQLite:

http://lists.open-bio.org/pipermail/biopython/2009-December/005997.html
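
To make the idea concrete, here is a minimal sketch for FASTA only,
using just the Python standard library. It is not the Biopython code:
the table name (offsets) and the function names (build_offset_index,
fetch_record) are made up for illustration, and it assumes the first
word after ">" is a unique record name.

```python
import sqlite3


def build_offset_index(fasta_path, db_path):
    """Record the byte offset of every FASTA header line in SQLite."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets "
        "(name TEXT PRIMARY KEY, offset INTEGER)"
    )
    # Binary mode so tell() reports true byte offsets.
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split(None, 1)
                if not fields:
                    continue  # header with no name; skipped in this sketch
                con.execute(
                    "INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                    (fields[0].decode(), offset),
                )
    con.commit()
    return con


def fetch_record(fasta_path, con, name):
    """Seek to the stored offset and return the raw text of one record."""
    row = con.execute(
        "SELECT offset FROM offsets WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    with open(fasta_path, "rb") as handle:
        handle.seek(row[0])
        lines = [handle.readline()]  # the header line itself
        for line in handle:
            if line.startswith(b">"):
                break  # start of the next record
            lines.append(line)
    return b"".join(lines).decode()
```

Only the names and offsets go into the database; the sequence data
stays in the original file and is read back with a seek.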

I'd be interested to see how you guys have handled flushing
the index to disk (how often you do a commit) while building
the index, and other issues to speed up writing the index.
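
For reference, the kind of pattern I have in mind looks roughly like
the sketch below: insert in batches, commit once per batch, and only
create the lookup index on the name column after all rows are in.
This is an illustration rather than anyone's actual code; write_index,
the offsets table, and the default batch size of 10000 are arbitrary
choices, and the right commit interval would need measuring.

```python
import sqlite3
from itertools import islice


def write_index(con, pairs, batch_size=10000):
    """Insert (name, offset) pairs, committing once per batch."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets (name TEXT, offset INTEGER)"
    )
    pairs = iter(pairs)
    total = 0
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            break
        con.executemany("INSERT INTO offsets VALUES (?, ?)", batch)
        con.commit()  # one flush to disk per batch, not per row
        total += len(batch)
    # Build the name index once at the end instead of updating it per insert.
    con.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS offsets_name ON offsets (name)"
    )
    con.commit()
    return total
```

The pairs argument can be any iterable, so the file scan can be a
generator and the whole list of offsets never needs to sit in memory.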

Peter

P.S.: Note that the Biopython code works with many different
file formats, not just FASTA and FASTQ but also things like
GenBank, EMBL, SwissProt, and SFF. This means that trying
to encode the entire file into an SQLite database is probably
not a good idea for us in general. That is why we are focusing
on storing file offsets.