[bip] Indexing big sequence databases
C. Titus Brown
ctb at msu.edu
Mon Mar 29 05:40:18 PDT 2010
On Mon, Mar 29, 2010 at 09:28:30AM +0100, Peter wrote:
> On Mon, Mar 29, 2010 at 3:46 AM, C. Titus Brown <ctb at msu.edu> wrote:
> >
> > Hi all,
> >
> > with reference to this earlier thread in November about indexing FASTA
> > and FASTQ files,
> >
> > http://lists.idyll.org/pipermail/biology-in-python/2009-November/000499.html
> >
> > I posted an update:
> >
> > http://ivory.idyll.org/blog/mar-10/storing-and-retrieving-sequences.html
> >
> > Basically, taking James Casbon's advice, we've switched to using sqlite as our
> > backend for the dirty work of storing sequences.
> >
> > Comments & random thoughts welcome, as always.
>
> Hi Titus,
>
> Thanks for the update - interesting that SQLite was able to
> beat your custom code.
Yep. And by a *lot*, like a factor of two. Kind of depressing.
> Regarding your reference to Biopython's indexing which stores
> the file offsets in memory, we are aware that will only scale so
> far (although as always it depends on what you are trying to do
> - this works perfectly well for fairly simple tasks like resorting
> FASTQ files). We have already looked at storing the read name
> mapping to file offsets in SQLite:
>
> http://lists.open-bio.org/pipermail/biopython/2009-December/005997.html
>
> I'd be interested to see how you guys have handled flushing
> the index to disk (how often you do a commit) while building
> the index, and other issues to speed up writing the index.
We haven't put much effort into optimizing that yet, since *anything*
related to these 10 GB files is annoying, and so 5 min vs 10 min for
indexing isn't a big deal.
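
For reference, the commit-frequency question can be illustrated with a minimal sketch of the offset-index approach Peter describes: read names mapped to byte offsets in SQLite, with one commit per batch of rows rather than one per row. This is not screed's or Biopython's actual code. The function names, the table layout, and the `batch_size` default are all assumptions made for illustration, and it assumes strict four-line FASTQ records.

```python
import sqlite3


def build_offset_index(fastq_path, db_path, batch_size=100000):
    """Store read name -> byte offset in SQLite (hypothetical helper).

    Assumes strict 4-line FASTQ records and unique read names.
    Returns the number of records indexed.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS offsets "
        "(name TEXT PRIMARY KEY, byte_offset INTEGER)"
    )
    batch = []
    count = 0
    with open(fastq_path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # skip stray blank lines
            # Skip the sequence, '+' and quality lines.
            handle.readline()
            handle.readline()
            handle.readline()
            name = header[1:].split()[0].decode("ascii")
            batch.append((name, offset))
            count += 1
            if len(batch) >= batch_size:
                # One commit per batch instead of one per row.
                conn.executemany("INSERT INTO offsets VALUES (?, ?)", batch)
                conn.commit()
                batch = []
    if batch:
        conn.executemany("INSERT INTO offsets VALUES (?, ?)", batch)
        conn.commit()
    conn.close()
    return count


def fetch_record(fastq_path, db_path, name):
    """Return the four lines of the named record, read via its stored offset."""
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT byte_offset FROM offsets WHERE name = ?", (name,)
    ).fetchone()
    conn.close()
    if row is None:
        raise KeyError(name)
    with open(fastq_path, "rb") as handle:
        handle.seek(row[0])
        return [handle.readline().decode("ascii").rstrip("\n") for _ in range(4)]
```

A larger `batch_size` means fewer commits at the cost of holding more rows in memory; which value works best for a given file is something to measure rather than assume.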
> P.S.: Note that the Biopython code works with many different
> file formats, not just FASTA and FASTQ but also things like
> GenBank, EMBL, SwissProt, and SFF. This means that trying
> to encode the entire file into an SQLite database is probably
> not a good idea for us in general. Hence why we are focusing
> on storing file offsets.
Sure, and I recognize the different philosophies of Biopython
and pygr, too. The problem is that we're faced with ever-increasing
FASTQ file sizes, and not so much growth in the other formats, so it seems
like a good idea to optimize specifically for FASTQ! Actually, I would be
surprised if the additional parsing involved in handling each record
dynamically didn't have a pretty large impact on the run time of scripts
dealing with more than a few hundred thousand records.
The screed format supports all sorts of data (as long as it's sequence-linked).
Maybe we'll try SFF or SwissProt.
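
To make the contrast with offset indexing concrete, here is a minimal sketch of the other philosophy: parse each FASTQ record once at load time and store its fields in SQLite, so later lookups need no re-parsing. It is an illustration only and not screed's actual schema or API. The table name, column names, and function names are all hypothetical, and it assumes complete four-line records with unique read names.

```python
import sqlite3


def parse_fastq(lines):
    """Yield (name, sequence, quality) tuples from an iterable of text lines.

    Assumes complete 4-line FASTQ records.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip stray blank lines
        sequence = next(it).rstrip("\n")
        next(it)  # '+' separator line, not stored
        quality = next(it).rstrip("\n")
        yield header[1:].split()[0], sequence, quality


def load_records(conn, records):
    """Insert pre-parsed records in a single transaction (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reads "
        "(name TEXT PRIMARY KEY, sequence TEXT, quality TEXT)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO reads VALUES (?, ?, ?)", records)


def get_read(conn, name):
    """Return (name, sequence, quality) for the named read, or None if absent."""
    return conn.execute(
        "SELECT name, sequence, quality FROM reads WHERE name = ?", (name,)
    ).fetchone()
```

With this layout the parsing cost is paid once when the database is built, and every later retrieval is a single keyed lookup.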
cheers,
--titus
--
C. Titus Brown, ctb at msu.edu