[bip] Indexing big sequence databases

Mon Mar 29 08:20:11 PDT 2010

On Mon, Mar 29, 2010 at 1:40 PM, C. Titus Brown <ctb at msu.edu> wrote:
>> P.S.: Note that the Biopython code works with many different
>> file formats, not just FASTA and FASTQ but also things like
>> GenBank, EMBL, SwissProt, and SFF. This means that trying
>> to encode the entire file into an SQLite database is probably
>> not a good idea for us in general. Hence why we are focusing
>> on storing file offsets.
>
> Sure, and I recognize the different philosophies of BioPython
> and pygr, too.  The problem is that we're faced with ever-increasing
> fastq file size, and not so much the rest... seems like a good idea
> to optimize specifically for that!  Actually, I would be surprised if the
> additional parsing involved in handling each record dynamically
> didn't have a pretty large impact during execution of scripts
> dealing with over a few hundred thousand records.

There is undeniably a cost in parsing the record at some step,
either upfront when building the index or when requesting the
specific record. Which cost matters more will depend on your
exact use case. I have tended to find that when I need to process
all (or most) of the records, I don't actually care about the order,
i.e. I can just iterate over the file without indexing.
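
To make that trade-off concrete, here is a minimal Python sketch of the
two access patterns. It is not the Biopython or screed implementation.
The function names (build_offset_index, fetch_record, iterate_records)
are hypothetical, and it assumes strict four-line FASTQ records with no
line wrapping. Building the index only records where each record starts,
and parsing is deferred until a record is actually requested.

```python
def build_offset_index(path):
    """Map each record ID to the byte offset where its record starts.

    Assumption: strict four-line FASTQ records (no wrapped sequences).
    """
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header.strip():
                break  # end of file (or a trailing blank line)
            # Skip the sequence, '+' and quality lines without parsing them.
            for _ in range(3):
                handle.readline()
            record_id = header[1:].split(None, 1)[0].decode("ascii")
            index[record_id] = offset
    return index


def read_four_lines(handle):
    """Read one four-line record as a list of decoded, stripped lines."""
    return [handle.readline().decode("ascii").rstrip("\r\n") for _ in range(4)]


def fetch_record(path, index, record_id):
    """Random access: seek to a stored offset and parse only that record."""
    with open(path, "rb") as handle:
        handle.seek(index[record_id])
        lines = read_four_lines(handle)
    return {"id": record_id, "seq": lines[1], "qual": lines[3]}


def iterate_records(path):
    """Sequential access: no index needed when record order does not matter."""
    with open(path, "rb") as handle:
        while True:
            lines = read_four_lines(handle)
            if not lines[0]:
                break
            record_id = lines[0][1:].split(None, 1)[0]
            yield {"id": record_id, "seq": lines[1], "qual": lines[3]}
```

With this layout the upfront cost is one pass over the file that stores
only an ID and an integer per record, and the parsing cost is paid per
lookup in fetch_record. When every record is needed, iterate_records
avoids the index entirely.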

Have you got any specific usage examples in mind which
need random access to a FASTQ file?

> The screed format supports all sorts of data (as long as it's
> sequence-linked). Maybe we'll try SFF or SwissProt.

Cool - I'll keep an eye out for another blog post from you at
some point down the line.

Peter