[bip] formats in biology

Andrew Dalke dalke at dalkescientific.com
Thu Aug 2 10:52:37 PDT 2007


> But seriously, I think we need to follow Postel's Law:
>
> 	http://en.wikipedia.org/wiki/Robustness_Principle
>
> """
> Be liberal in what you accept, and conservative in what you send
> """

That mostly works only for formats which were designed
with Postel's Law in mind.

Consider one of my counterexamples.

>> The flip side of "be liberal in what you accept" is
>> "conservative in what you send."  But many of the database
>> oriented formats are too hard to get correct, and a lot
>> easier to get "correct enough, so that almost everyone
>> understands it".
>>
>> For example, when converting a sequence in FASTA format
>> into a GenBank record, which should be used for the
>> required 'division' field?  Your possible answers are:
>>
>>
>> valid_divisions = ["PRI", "ROD", "MAM", "VRT", "INV", "PLN", "BCT",
>>                    "RNA", "VRL", "PHG", "SYN", "UNA", "EST", "PAT",
>>                    "STS", "GSS", "HTG", "HTC", "CON"]
>>
>> And for the required "AUTHORS" block?
>>
>> There's no good answer, because the format definition
>> was designed for a database provider that could put
>> the correct answer in for each field, or change the spec
>> as need be.
>>
>> But if you're working with an analysis program that
>> expects GenBank files as input and only extracts the
>> sequence, then you don't care that bioperl generates
>> an "incorrect but close enough GenBank file" - so
>> long as it is close enough.
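
To make "close enough" concrete, here is a minimal sketch of the kind
of reader such an analysis program might use.  It is an illustration
only, not code from bioperl or any real package; the function name
extract_sequence and the sample record are made up for this example.
It assumes only that the sequence sits between an ORIGIN line and the
terminating "//" line, and it never looks at the division or AUTHORS
fields.

```python
def extract_sequence(text):
    """Return the sequence from a GenBank-style record, ignoring all other fields."""
    parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            # Sequence lines start after ORIGIN.
            in_origin = True
            continue
        if line.startswith("//"):
            # End-of-record marker.
            break
        if in_origin:
            # Drop the position numbers and spaces, keep only letters.
            parts.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(parts).upper()


# Hypothetical record: bogus division "XXX" and no AUTHORS block at all.
record = """\
LOCUS       TEST0001   10 bp    DNA     linear   XXX 02-AUG-2007
DEFINITION  made-up record for illustration.
ORIGIN
        1 acgtacgtaa
//
"""

print(extract_sequence(record))
```

A reader like this accepts the record even though it is not valid
according to the spec, which is exactly why nobody downstream notices
the bad division field.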

I gave the PDB format as another example: just about no
software generates a PDB file that is valid according to the
spec.  Instead, programs generate files which are close enough
that most other programs accept them without a problem.

How conservative is "conservative"?  How liberal is "liberal"?

I got rather peeved at OpenEye for accepting almost any
garbage as a SMILES string (a representation for small molecules).
Parsing too liberally means the crap never gets cleaned up.
Thankfully they now have a flag to specify whether you want liberal
or conservative parsing, and the default is conservative.
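
As a rough sketch of what such a flag can look like (this is not
OpenEye's API; parse_division and its strict argument are hypothetical
names, and I'm reusing the GenBank division list from above rather
than SMILES to keep it short):

```python
VALID_DIVISIONS = {"PRI", "ROD", "MAM", "VRT", "INV", "PLN", "BCT",
                   "RNA", "VRL", "PHG", "SYN", "UNA", "EST", "PAT",
                   "STS", "GSS", "HTG", "HTC", "CON"}


def parse_division(field, strict=True):
    """Parse a division code; conservative by default, liberal on request."""
    if strict:
        # Conservative: the field must match the spec exactly.
        if field not in VALID_DIVISIONS:
            raise ValueError("invalid GenBank division: %r" % (field,))
        return field
    # Liberal: tolerate case and whitespace differences, and return
    # None for an unknown value instead of rejecting the record.
    cleaned = field.strip().upper()
    return cleaned if cleaned in VALID_DIVISIONS else None
```

With the conservative default, parse_division(" pri ") raises
ValueError, so the sloppy input gets noticed.  A caller has to ask
for strict=False explicitly to have it quietly accepted as "PRI".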

				Andrew
				dalke at dalkescientific.com




