[bip] formats in biology

Thu Aug 2 08:14:43 PDT 2007

Perhaps the solution is to provide a validator, as the W3C does for
(X)HTML, and to name and shame every version of a program that fails
it. In this case the spec would be defined by the validator. This is
the only proactive way I can think of to attempt to standardise a de
facto file format.

Noel

On 02/08/07, Andrew Dalke <dalke at dalkescientific.com> wrote:
> On Aug 2, 2007, at 3:49 PM, Bruce Southey wrote:
> > Thanks for the email as I do agree. What I don't agree with is
> > sweeping the problem away just because some people are doing it. A
> > user should not have to find out the hard way what is supported and
> > what is not.
>
> Most of the people developing software are *not* sweeping
> the problem under the table.  To suggest otherwise means
> that you do not understand why things got to be the way
> they are, nor the attempts (like readseq and bioperl) to
> mitigate those problems.
>
>
> Everyone agrees that the user shouldn't have to care about
> understanding the nuances of each format and program support
> for a given format.  Well, almost.  I've met some who say
> "just suck it up and deal with it".
>
> The question is, how to get to the point where the users
> don't need to care?  The current solutions are "guess based
> on the file extension, and be liberal on what you read"
> (bioperl, OEChem), or use some sort of format sniffing / try
> every parser until one works (emboss).  Neither is
> perfect, so the default choice can be overridden.
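
A minimal sketch of the two strategies described above, combined into one
function: an explicit override wins, then the file extension, then a crude
sniff of the first non-blank line. The extension table and the names
guess_format and sniff_format are made up for illustration; this is not how
bioperl, OEChem or emboss actually implement it.

```python
import os

# Hypothetical extension table; real toolkits keep far longer lists.
EXTENSION_FORMATS = {
    ".fasta": "fasta", ".fa": "fasta",
    ".pdb": "pdb", ".ent": "pdb",
    ".gb": "genbank", ".gbk": "genbank",
}


def sniff_format(text):
    """Guess a format from the first non-blank line; None if unsure."""
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(">"):
            return "fasta"
        if line.startswith("LOCUS"):
            return "genbank"
        if line[:6].strip() in ("HEADER", "ATOM", "HETATM", "REMARK"):
            return "pdb"
        return None  # first real line matched nothing we know
    return None


def guess_format(filename, text="", override=None):
    """An explicit override wins, then the extension, then sniffing."""
    if override is not None:
        return override
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTENSION_FORMATS:
        return EXTENSION_FORMATS[ext]
    return sniff_format(text)
```

The override argument is the point: whichever guess is the default, the
caller can still say what the file really is.
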
>
>
> I mentioned I wrote validators for some of the bioinformatics
> data sets.  In pretty much every data set I found examples
> where the spec and the data provided by that data service
> were in disagreement.  PIR? Check. PDB? Check. SWISS-PROT?
> Check. GenBank? Check. Prosite? Check.  All had problems.
>
> Who was wrong?  The spec or the data?
>
> Answer: It depends.  Sometimes the spec was wrong, or incomplete.
> And sometimes the data was wrong.  And sometimes the data was
> "right" according to the spec, but in a way that a person
> looking at the data wouldn't expect.
>
> Sometimes the spec contains fields that no one uses, or
> cares about.  Like comments in FASTA records.
>
> (BTW, in looking at the FASTA code it appears that those
> "comments" were really used for data fields for some other
> FASTA-like format.  By having FASTA ignore those fields,
> it was possible to have FASTA also parse this alternate
> format to only extract the sequence data.)
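
To make the ";comment" point concrete, here is a rough sketch of a FASTA
reader that simply skips any line starting with ";". Treating every such
line as an ignorable comment is an assumption made for this example, and
read_fasta is a made-up name, not a function from any existing library.

```python
def read_fasta(lines):
    """Yield (title, sequence) pairs, skipping ';' comment lines."""
    title = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(";"):
            continue  # assumption: ';' lines carry nothing we need
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:].strip()
            chunks = []
        elif title is not None:
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)
```

Supporting the comment field costs two lines here; the open question in the
thread is whether anyone benefits from those two lines.
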
>
> Formats get used outside the original context.  The PDB
> is an example.  A correct PDB file has a lot of required
> records.
>
> Most structure visualization programs only need the ATOM
> and HETATM (and a few others, to be nice) records.  It's
> a lot of work to verify that the input is in the correct
> format.  It's very simple to ignore unknown fields.
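
The "ignore unknown fields" approach can be sketched in a few lines. This
example assumes the usual fixed-column layout of ATOM/HETATM records (atom
name in columns 13-16, residue name in 18-20, chain in 22, coordinates in
31-54) and does no validation at all; read_atoms is a hypothetical name.

```python
def read_atoms(lines):
    """Collect ATOM/HETATM records; silently skip every other record."""
    atoms = []
    for line in lines:
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue  # unknown or unneeded record: ignore it
        atoms.append({
            "record": record,
            "name": line[12:16].strip(),
            "resname": line[17:20].strip(),
            "chain": line[21:22].strip(),
            "x": float(line[30:38]),
            "y": float(line[38:46]),
            "z": float(line[46:54]),
        })
    return atoms
```

Note what is missing: nothing checks that required records are present or
that the file is otherwise a correct PDB file, which is exactly why this
style of reader is so easy to write.
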
>
> Many of those visualization programs let you export an
> atom selection as a PDB file.  Because of the difficulties
> in producing a syntactically correct PDB file, nearly
> all don't.  But that's okay because those extra fields,
> for the most part, are ignored by the other tools.
>
> The program cannot correctly say "exports in the PDB 2.3
> format as specified by the PDB".  It can say "generates
> a file in the PDB-like family of formats, that has a
> decent chance of being imported by other tools that
> understand the PDB format".  It can say "saves the
> HEADER, TITLE, ATOM, HETATM, TER and END records".  But
> bear in mind that that subset is *not* "a PDB file".
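
One modest way to keep a "saves these records" claim honest is to compare
the record names a file actually contains with the documented list. The
sketch below does only that; the DOCUMENTED set is copied from the example
in the paragraph above, and record_counts and undocumented_records are
names invented here. It says nothing about whether the file is a valid PDB
file.

```python
from collections import Counter

# Record names the (hypothetical) program claims to write.
DOCUMENTED = {"HEADER", "TITLE", "ATOM", "HETATM", "TER", "END"}


def record_counts(lines):
    """Count PDB-style record names (columns 1-6), skipping blank lines."""
    return Counter(line[:6].strip() for line in lines if line.strip())


def undocumented_records(lines, documented=DOCUMENTED):
    """Return record names found in the file but not in the documented set."""
    return sorted(set(record_counts(lines)) - set(documented))
```
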
>
> Doing this documentation requires the users to understand
> the details of each format.  Most don't care.  They'll just
> say "Chime doesn't parse the PDB file that RasMol generates"
> and be done with it.  That's your "find out the hard way."
> It's actually easier to do that than figure out the
> format details.  It might even be the *easiest* way!
>
>
> What is the solution to this format variation problem?
>
> Not emit PDB files and generate some other format?
> Like ".my_pdb"? That removes easy interoperability.
> For example, most other visualization programs could
> read a ".my_pdb" file, but the filename is not going
> to appear in a file selection window looking for "*.pdb"
> files nor be usable by a plug-in that only handles the
> "x-chem/x-pdb" MIME type.
>
> Should the authors of these programs get together and
> decide upon a new format?  In practice that rarely happens.
> And I've seen people spend years working on file formats
> that in practice never get used.
>
> Should we all switch to another format, perhaps based on
> XML or RDF?  How do you convince everyone to switch to
> a new format?
>
> Or a simpler question you still haven't answered: why
> should any new FASTA reader support the ;comment field?
> Is there a benefit, other than perhaps the feeling of
> righteous mastery of arcane knowledge?
>
>
> This whole topic on format variations and interoperability
> is a problem with no easy solution.
>
> What's your solution?  Maybe it's something the rest
> of us haven't thought of.
>
> I've given some of the history of two formats: FASTA
> and PDB.  How would your solution, if proposed 15
> years ago, have changed things?  How would history have
> played out?  How can we improve this for the future?
>
> Perhaps I'm just cantankerous after having dealt
> with this for so long. <wink>
>
>                                 Andrew
>                                 dalke at dalkescientific.com
>
>
>
> _______________________________________________
> biology-in-python mailing list
> biology-in-python at lists.idyll.org
> http://lists.idyll.org/listinfo/biology-in-python
>