[bip] Loading sequences from FASTA files

Peter biopython at maubp.freeserve.co.uk
Tue Nov 24 02:32:43 PST 2009


On Tue, Nov 24, 2009 at 3:04 AM, C. Titus Brown <ctb at msu.edu> wrote:
>
> A couple of thoughts --
>
> I started out trying to use SQL for bioinformatic data storage about ...
> 8 years ago.  It sucked then, and it still sucks :).  Now that I have pygr as
> an alternative, I am thoroughly convinced that a relational database doesn't
> fit my needs (which basically involve doing overlap queries and storing
> relationships between objects -- hence invoking the well-known
> object-relational impedance problem that blocks using a SQL db).
>
> It might suffice for a sequence storage file, however.  There are two
> countervailing concerns.

Indeed. For example, BioSQL (we use it on top of MySQL, but other
databases are supported) works very nicely as a "local GenBank
database", giving simple access to records and their annotation.
(However, it is not suitable for second-generation sequencing read
data.)

> First, SQL doesn't provide any standard way to get a substring of a
> sequence record, so if I want to get an arbitrary slice from chr1 of
> hg17, I have to read all of chr1 into memory and then get my slice from
> that in-memory string.  I believe this is what motivated Chris Lee's
> original seqdb implementation for pygr... and it's easy to add to
> screed.

The SQL SUBSTRING function is quite widely supported, although
there are some minor database-specific variations (for example,
SQLite and Oracle call it SUBSTR).
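
As a rough sketch of what I mean, here is the idea using Python's
built-in sqlite3 module. The table layout and the fetch_slice helper
are made up for illustration (this is not the BioSQL schema), and the
toy sequence stands in for a real chromosome:

```python
import sqlite3

# Hypothetical one-table schema, for illustration only (not BioSQL's layout).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seqs (name TEXT PRIMARY KEY, seq TEXT)")
conn.execute("INSERT INTO seqs VALUES (?, ?)", ("chr1", "ACGTACGTTTGACCA"))


def fetch_slice(conn, name, start, end):
    """Return seq[start:end] (0-based, end-exclusive, like a Python slice)."""
    # SQL substr() counts from 1 and takes a length, so convert the bounds.
    row = conn.execute(
        "SELECT substr(seq, ?, ?) FROM seqs WHERE name = ?",
        (start + 1, end - start, name),
    ).fetchone()
    return None if row is None else row[0]


print(fetch_slice(conn, "chr1", 4, 10))
```

Only the requested slice comes back to Python. Whether the database
engine itself avoids reading the whole value from disk depends on the
backend, so this only shows that the client need not hold all of chr1
in memory.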

> Second, I don't think it's likely that a read-write relational database like
> sqlite will, in the end, be faster than a read-only indexed flat file.  This
> is of little concern for small data sets like 454 :).  However, I asked
> Alex to design screed with a billion-sequence database in mind, from e.g.
> the next generation of Illumina sequencers... and screed retrieval seems to be
> constant with database size, so that's a good sign.  I don't have a crystal
> ball but it seems clear that our sequencing bonanza will only continue to
> expand and I'd like to plan ahead a year or two.

I agree that a read-write relational database won't scale as well as a
read-only indexed file.
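
For anyone following along, a minimal sketch of the read-only indexed
flat file idea: scan the FASTA file once, record the byte offset of
each record, then seek straight to a record on lookup. This is not how
screed or pygr actually do it; the function names are my own, and a
real tool would keep the index on disk rather than in a dict:

```python
import os
import tempfile


def build_index(path):
    """Map each record name to the byte offset of its '>' header line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:  # ignore headers with no name
                    index[fields[0].decode()] = offset
    return index


def fetch_record(path, index, name):
    """Seek to one record and read only its sequence lines."""
    chunks = []
    with open(path, "rb") as handle:
        handle.seek(index[name])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            chunks.append(line.strip())
    return b"".join(chunks).decode()


# Tiny demo file standing in for a real read set.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">read1 first\nACGT\nTTGA\n>read2\nGGCC\n")
    demo_path = tmp.name

demo_index = build_index(demo_path)
read1 = fetch_record(demo_path, demo_index, "read1")
read2 = fetch_record(demo_path, demo_index, "read2")
os.remove(demo_path)
print(read1, read2)
```

The lookup cost is one dictionary access plus one seek, regardless of
how many records the file holds, which is the property that matters
for very large read sets.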

> I am also completely unconcerned about compatibility with heathens
> who have not Seen the Light of Python, so I don't care too much about
> cross-language compatibility.  Language bigotry can really simplify
> certain matters :)

;)

Peter


