[bip] formats in biology

Andrew Dalke dalke at dalkescientific.com
Thu Aug 2 08:54:09 PDT 2007


On Aug 2, 2007, at 5:14 PM, Noel O'Boyle wrote:
> Perhaps the solution is to provide a validator like the w3c did/do for
> (x)html. And name and shame all versions of programs that fail it.

For those interested, the Martel parser included with
Biopython was designed for just that purpose.  Here's one
of the errors it caught in SWISS-PROT 39, which then had to
be worked around.

# HAS2_CHICK has a DT line like this
# DT   30-MAY-2000 (REL. 39, Created)
#                   ^^^ Note the upper-case "REL" instead of "Rel" !

This extreme sensitivity in Martel is both a curse
(it breaks a lot when there's a new format) and a blessing
(at least you know it's not silently accepting errors).
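
As a rough illustration of that trade-off (this is not Martel
code; both patterns are guesses based only on the single DT line
quoted above, and the names are made up for the example), a strict
check and its work-around might look like:

```python
import re

# Hypothetical strict pattern, inferred from the one DT line quoted above.
# It is not taken from Martel or from the SWISS-PROT documentation.
STRICT_DT = re.compile(r"DT   \d{2}-[A-Z]{3}-\d{4} \(Rel\. \d+, [A-Za-z ]+\)$")

# Work-around: identical, except that upper-case "REL." is also accepted.
LENIENT_DT = re.compile(
    r"DT   \d{2}-[A-Z]{3}-\d{4} \((?:Rel|REL)\. \d+, [A-Za-z ]+\)$"
)


def is_valid_dt(line, pattern=STRICT_DT):
    """Return True if the whole line matches the given DT pattern."""
    return pattern.match(line) is not None
```

The strict pattern rejects the HAS2_CHICK line outright, which is
how the problem gets noticed; the lenient one is the kind of
explicit exception a parser has to carry afterwards.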

Martel doesn't have a recovery system, though, so unlike the
w3c validator it only reports the first syntax error.
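
To make the difference concrete, here is a minimal sketch of the
two reporting styles.  It is not how Martel works internally; the
line rule (a two-letter code, three spaces, then data, or a "//"
terminator) and the function names are assumptions for the example.

```python
import re

# Hypothetical line rule: two-letter code, three spaces, then data,
# or the "//" record terminator.
LINE_RULE = re.compile(r"(?:[A-Z]{2}   \S.*|//)$")


def first_error(lines):
    """Stop at the first bad line, as a parser without recovery does."""
    for number, line in enumerate(lines, start=1):
        if not LINE_RULE.match(line):
            return (number, line)
    return None


def all_errors(lines):
    """Skip past each bad line and keep going, so every one is reported."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if not LINE_RULE.match(line)
    ]
```

Recovery is easy here because every line can be checked on its
own.  In a grammar where one field's meaning depends on the ones
before it, deciding where to resume after an error is the hard part.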

It's a first-pass validator, checking the syntax of
each record and field.  It doesn't validate the data inside
a given field.  That takes separate, field-specific parsers.
For example, I wrote the Biopython PROSITE pattern parser and
found undocumented pattern terms there (since documented).
I also wrote the Biopython feature location parser and found
problems in the BNF syntax description.
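
A small sketch of the gap between the two levels, using a date
field as a stand-in (the DD-MMM-YYYY layout is taken from the DT
line above; the function names are hypothetical and none of this
comes from Biopython):

```python
import re
from datetime import date

# Upper-case month abbreviations, mapped to month numbers.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}

DATE_SYNTAX = re.compile(r"(\d{2})-([A-Z]{3})-(\d{4})$")


def syntax_ok(text):
    """First pass: does the field have the right shape?"""
    return DATE_SYNTAX.match(text) is not None


def content_ok(text):
    """Second pass: is the field also a real calendar date?"""
    match = DATE_SYNTAX.match(text)
    if match is None:
        return False
    day, month, year = match.groups()
    if month not in MONTHS:
        return False
    try:
        date(int(year), MONTHS[month], int(day))
    except ValueError:
        return False
    return True
```

A value like "30-FEB-2000" passes the first check and fails the
second, which is the kind of error a syntax-only validator will
never see.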

> In this case the spec would be defined by the validator. This
> is the only proactive way I can think of to attempt to
> standardise a de facto file format.

I sure didn't have the right influence to change how
things were done by the data providers.  Maybe I should
have written a polemic paper on the topic some years back.

Perhaps someone can convince me that writing such a paper
would have made a difference?

I did send in various patches, questions, etc. as I found
the problems.  Some got fixed.  But it's otherwise a
thankless and income-less task.

				Andrew
				dalke at dalkescientific.com




