[bip] Loading sequences from FASTA files
Peter
biopython at maubp.freeserve.co.uk
Tue Nov 24 02:32:43 PST 2009
On Tue, Nov 24, 2009 at 3:04 AM, C. Titus Brown <ctb at msu.edu> wrote:
>
> A couple of thoughts --
>
> I started out trying to use SQL for bioinformatic data storage about ...
> 8 years ago. It sucked then, and it still sucks :). Now that I have pygr as
> an alternative, I am thoroughly convinced that a relational database doesn't
> fit my needs (which basically involve doing overlap queries and storing
> relationships between objects -- hence invoking the well-known
> object-relational impedance problem that blocks using a SQL db).
>
> It might suffice for a sequence storage file, however. There are two
> countervailing concerns.
Indeed. For example, BioSQL (we use it on top of MySQL, but other
databases are supported) works very nicely as a "local GenBank
database", giving simple access to records and their annotation.
(However, it is not suitable for second-generation sequencing read
data.)
> First, SQL doesn't provide any standard way to get a substring of a
> sequence record, so if I want to get an arbitrary slice from chr1 of
> hg17, I have to read all of chr1 into memory and then get my slice from
> that in-memory string. I believe this is what motivated Chris Lee's
> original seqdb implementation for pygr... and it's easy to add to
> screed.
The SQL command SUBSTRING is quite widely supported,
although there are some minor database-specific variations.
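
As a rough illustration, here is a minimal sketch using Python's
built-in sqlite3 module, where the function is spelled substr() and
positions are 1-based. The table name "sequences", its columns, and the
helper fetch_slice() are hypothetical names made up for this example;
they are not the BioSQL schema. Whether the database engine avoids
reading the full value internally will depend on the backend; the
sketch only shows that the client receives just the slice.

```python
import sqlite3


def fetch_slice(conn, name, start, end):
    """Return seq[start:end] (0-based, end-exclusive) via SQL substr().

    Hypothetical helper: SQLite's substr(X, Y, Z) takes a 1-based
    start position Y and a length Z, so the Python-style slice is
    converted before the query is run.
    """
    row = conn.execute(
        "SELECT substr(seq, ?, ?) FROM sequences WHERE name = ?",
        (start + 1, end - start, name),
    ).fetchone()
    return row[0] if row else None


# Toy in-memory database standing in for a real sequence store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequences (name TEXT PRIMARY KEY, seq TEXT)")
conn.execute(
    "INSERT INTO sequences VALUES (?, ?)", ("chr_demo", "ACGTACGTTTGACC")
)

# Same result as "ACGTACGTTTGACC"[4:10] in Python.
piece = fetch_slice(conn, "chr_demo", 4, 10)
```

On MySQL the equivalent call would be SUBSTRING(seq, pos, len), which
is the kind of minor variation I mean.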
> Second, I don't think it's likely that a read-write relational database like
> sqlite will, in the end, be faster than a read-only indexed flat file. This
> is of little concern for small data sets like 454 :). However, I asked
> Alex to design screed with a billion-sequence database in mind, from e.g.
> the next generation of Illumina sequencers... and screed retrieval seems to be
> constant with database size, so that's a good sign. I don't have a crystal
> ball but it seems clear that our sequencing bonanza will only continue to
> expand and I'd like to plan ahead a year or two.
I agree that a read-write relational database won't scale as well as a
read-only indexed file.
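
To make the idea of a read-only indexed flat file concrete, here is a
minimal sketch in plain Python. It is not how screed or pygr actually
work; build_index() and fetch_record() are hypothetical helpers, and
the index is just an in-memory dict of byte offsets, whereas a real
tool would presumably keep the index on disk as well.

```python
import os
import tempfile


def build_index(path):
    """Map each FASTA record name to the byte offset of its header."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:  # skip headers with no identifier
                    index[fields[0].decode("ascii")] = offset
    return index


def fetch_record(path, index, name):
    """Seek straight to one record and return its sequence as a string."""
    parts = []
    with open(path, "rb") as handle:
        handle.seek(index[name])
        handle.readline()  # skip the header line itself
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            parts.append(line.strip())
    return b"".join(parts).decode("ascii")


# Toy FASTA file written to a temporary location for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">read1 demo\nACGT\nTTGA\n>read2\nGGCC\n")
    demo_path = tmp.name

demo_index = build_index(demo_path)
demo_first = fetch_record(demo_path, demo_index, "read1")
demo_second = fetch_record(demo_path, demo_index, "read2")
os.remove(demo_path)
```

Once the index exists, a lookup is one dict access plus one seek, so it
does not need to scan the file no matter how many records it holds.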
> I am also completely unconcerned about compatibility with heathens
> who have not Seen the Light of Python, so I don't care too much about
> cross-language compatibility. Language bigotry can really simplify
> certain matters :)
;)
Peter