[bip] Indexing big sequence databases

C. Titus Brown ctb at msu.edu
Mon Mar 29 08:25:15 PDT 2010


On Mon, Mar 29, 2010 at 04:20:11PM +0100, Peter wrote:
> On Mon, Mar 29, 2010 at 1:40 PM, C. Titus Brown <ctb at msu.edu> wrote:
> >> P.S.: Note that the Biopython code works with many different
> >> file formats, not just FASTA and FASTQ but also things like
> >> GenBank, EMBL, SwissProt, and SFF. This means that trying
> >> to encode the entire file into an SQLite database is probably
> >> not a good idea for us in general. Hence why we are focusing
> >> on storing file offsets.
> >
> > Sure, and I recognize the different philosophies of BioPython
> > and pygr, too. The problem is that we're faced with ever-increasing
> > fastq file size, and not so much the rest... seems like a good idea
> > to optimize specifically for that! Actually, I would be surprised if the
> > additional parsing involved in handling each record dynamically
> > didn't have a pretty large impact during execution of scripts
> > dealing with over a few hundred thousand records.
> 
> There is undeniably a cost in parsing the record at some step,
> either upfront when building the index or when requesting the
> specific record. Which is most important will depend on your
> exact use case - I have tended to find when I need to process
> all (or most) of the records, I don't actually care about the order.
> i.e. I can just iterate over the file without indexing.
> 
> Have you got any specific usage examples in mind which
> need random access to a FASTQ file?

Our specific use case, in the "Right Now" sense, is integration with
SAMtools-style alignment queries (for which we are releasing another tool Real
Soon Now).  Post-processing and doing quality checks on short-read mapping
and assembly often involves picking the original sequence up from the
sequencing output.

More generally, it's how pygr wants to store sequences for later retrieval
from its more general NLMSA/alignment query format.  On-demand parsing is
way too slow when you're talking about cold-start visualization, for example,
which is my current effort.
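
For reference, the file-offset approach mentioned in the quoted text could
look roughly like the sketch below. This is not pygr or Biopython code:
build_offset_index and fetch_record are hypothetical names, the table layout
is an assumption, and it assumes strictly four-line FASTQ records with unique
read names.

```python
import sqlite3


def build_offset_index(fastq_path, db_path):
    """Record the byte offset of every FASTQ record in an SQLite table.

    Hypothetical helper; assumes unwrapped, four-line FASTQ records.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS read_offsets "
        "(name TEXT PRIMARY KEY, byte_offset INTEGER)"
    )
    # Binary mode so that tell() returns true byte offsets.
    with open(fastq_path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate blank lines between or after records
            if not header.startswith(b"@"):
                raise ValueError("expected '@' header at byte %d" % offset)
            # The read name is the first whitespace-delimited token.
            name = header[1:].split()[0].decode("ascii")
            # Skip the sequence, '+' separator, and quality lines.
            for _ in range(3):
                handle.readline()
            conn.execute(
                "INSERT OR REPLACE INTO read_offsets VALUES (?, ?)",
                (name, offset),
            )
    conn.commit()
    return conn


def fetch_record(fastq_path, conn, name):
    """Seek to one record and parse it on demand; returns (sequence, quality)."""
    row = conn.execute(
        "SELECT byte_offset FROM read_offsets WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    with open(fastq_path, "rb") as handle:
        handle.seek(row[0])
        lines = [handle.readline().decode("ascii").rstrip("\r\n") for _ in range(4)]
    return lines[1], lines[3]
```

Note that every fetch_record call still opens the file, seeks, and parses the
record, which is the per-record cost discussed above. Storing the sequence
and quality strings in the table itself would trade a larger database for
skipping that step.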

cheers,
--titus
-- 
C. Titus Brown, ctb at msu.edu


