[bip] Loading sequences from FASTA files

Peter biopython at maubp.freeserve.co.uk
Tue Nov 24 04:00:35 PST 2009


On Tue, Nov 24, 2009 at 11:32 AM, James Casbon <casbon at gmail.com> wrote:
>
>> Second, I don't think it's likely that a read-write relational database like
>> sqlite will, in the end, be faster than a read-only indexed flat file.  This
>> is of little concern for small data sets like 454 :).  However, ...
>
> I would hold that a corollary of Greenspun's 10th rule is that any
> indexed data format will contain an ad hoc informally-specified
> bug-ridden slow implementation of half of SQL.   You would only be
> using the read and not the write parts of SQL here.

I would say it depends on what exactly people mean when they talk
about an indexed sequence file. If you *just* want random access via
a single key lookup (i.e. a single record ID string per sequence), then
a bespoke index file approach may win. If you want more than that,
then yes, the SQL approach has significant advantages. It is also
important to be clear whether you are talking about just storing offsets
into a separate sequence file, or whether you want to embed the sequences
(and any annotation) within the index/database too.
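
To make the "offsets into a separate sequence file" case concrete, here is
a rough sketch in plain Python. It is only an illustration under some
assumptions, not an existing Biopython API: the function names, the
"offsets" table and its column names are all made up for this example, and
the record ID is taken to be the first whitespace-delimited word after the
">". The FASTA file itself is left untouched; only the ID and byte offset
are stored, either in a dict or in an SQLite table.

```python
import sqlite3


def build_offset_index(fasta_path):
    """Map each record ID to the byte offset of its '>' header line.

    Assumption: the ID is the first word after '>' on the header line.
    """
    index = {}
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode()] = offset
    return index


def fetch_record(fasta_path, offset):
    """Seek to offset and return (header, sequence) for that one record."""
    with open(fasta_path, "rb") as handle:
        handle.seek(offset)
        header = handle.readline()[1:].strip().decode()
        parts = []
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            parts.append(line.strip().decode())
    return header, "".join(parts)


def save_index_sqlite(index, db_path):
    """Store the ID -> offset mapping in a (hypothetical) 'offsets' table."""
    con = sqlite3.connect(db_path)
    with con:
        con.execute(
            "CREATE TABLE IF NOT EXISTS offsets "
            "(record_id TEXT PRIMARY KEY, byte_offset INTEGER)"
        )
        con.executemany(
            "INSERT OR REPLACE INTO offsets VALUES (?, ?)", index.items()
        )
    con.close()


def lookup_offset(db_path, record_id):
    """Single key lookup; returns None if the ID is not in the table."""
    con = sqlite3.connect(db_path)
    row = con.execute(
        "SELECT byte_offset FROM offsets WHERE record_id = ?", (record_id,)
    ).fetchone()
    con.close()
    return None if row is None else row[0]
```

With only the single key lookup, the dict (or a bespoke on-disk version of
it) and the SQLite table do the same job. The database starts to pay off
once you want extra columns or queries beyond "give me the record with
this ID".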

Peter


