[bip] formats in biology

Bruce Southey bsouthey at gmail.com
Thu Aug 2 11:07:34 PDT 2007


Hi,

> Or a simpler question you still haven't answered: why
> should any new FASTA reader support the ;comment field?

Whether the reader is new is irrelevant here, since this field does
technically exist and was present at the start of the format. Obviously
the field could be used for other meta-information that could be selectively
ignored by programs instead of having long headers or creating yet
more formats. While it would be nice to have a universal format that
allows sequence specific information rather than parsing UniProt or
Genbank records (assuming that the required information exists in
them), that is another can of worms.
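
To make the cost of supporting the field concrete, here is a minimal
sketch of a reader that tolerates it. It assumes that any line starting
with ';' is a comment to be skipped and that records start with '>'.
The name read_fasta is made up for illustration and is not taken from
any existing library.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of text lines.

    Assumption: lines starting with ';' are comments and are skipped,
    as are blank lines. Records start with a '>' header line.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith(";"):
            # Blank line or ';' comment: selectively ignore it.
            continue
        if line.startswith(">"):
            # A new record begins; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence data belonging to the current record.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)
```

It can be fed an open file object or a list of lines. A program that
wanted the meta-information would keep the ';' lines instead of
dropping them.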

> Is there a benefit, other than perhaps the feeling of
> righteous mastery of arcane knowledge?

Like citing Monty Python? :-) (Sorry, could not contain myself.)

The benefit?
Bioinformatics trivia question?

However, I am very much aware that unless a powerhouse does something
about formats nothing is going to change. Yet even these powerhouses
don't seem to remain very stable (UniProt keeps changing, and the BLAST
XML format changes are among those that have affected me).

Bruce



On 8/2/07, Andrew Dalke <dalke at dalkescientific.com> wrote:
> On Aug 2, 2007, at 3:49 PM, Bruce Southey wrote:
> > Thanks for the email as I do agree. What I don't agree with is
> > sweeping the problem away just because some people are doing it. A
> > user should not have to find out the hard way what is supported and
> > what is not.
>
> Most of the people developing software are *not* sweeping
> the problem under the table.  To suggest otherwise means
> that you do not understand why things got to be the way
> they are, nor the attempts (like readseq and bioperl) to
> mitigate those problems.
>
>
> Everyone agrees that the user shouldn't have to care about
> understanding the nuances of each format and program support
> for a given format.  Well, almost.  I've met some who say
> "just suck it up and deal with it".
>
> The question is, how to get to the point where the users
> don't need to care?  The current solutions are "guess based
> on the file extension, and be liberal on what you read"
> (bioperl, OEChem), or use some sort of format sniffing / try
> every parser until one works  (emboss).  Neither is
> perfect, so the default choice can be overridden.
>
>
> I mentioned I wrote validators for some of the bioinformatics
> data sets.  In pretty much every data set I found examples
> where the spec and the data provided by that data service
> were in disagreement.  PIR? Check. PDB? Check. SWISS-PROT?
> Check. GenBank? Check. Prosite? Check.  All had problems.
>
> Who was wrong?  The spec or the data?
>
> Answer: It depends.  Sometimes the spec was wrong, or incomplete.
> And sometimes the data was wrong.  And sometimes the data was
> "right" according to the spec, but in a way that a person
> looking at the data wouldn't expect.
>
> Sometimes the spec contains fields that no one uses, or
> cares about.  Like comments in FASTA records.
>
> (BTW, in looking at the FASTA code it appears that those
> "comments" were really used for data fields for some other
> FASTA-like format.  By having FASTA ignore those fields,
> it was possible to have FASTA also parse this alternate
> format to only extract the sequence data.)
>
> Formats get used outside the original context.  The PDB
> is an example.  A correct PDB file has a lot of required
> records.
>
> Most structure visualization programs only need the ATOM
> and HETATM (and a few others, to be nice) records.  It's
> a lot of work to verify that the input is in the correct
> format.  It's very simple to ignore unknown fields.
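
The "ignore unknown fields" approach described here can be sketched in
a few lines. This is only an illustration, not how any particular
viewer does it: it relies on the PDB convention that the record name
occupies the first six columns of each line, and the function name
select_atom_records is made up for this example.

```python
# Record types a hypothetical viewer cares about; everything else is skipped.
WANTED_RECORDS = ("ATOM", "HETATM")


def select_atom_records(lines, wanted=WANTED_RECORDS):
    """Return only the lines whose record name is in `wanted`.

    The record name is taken from columns 1-6 of each line. No other
    validation is done, so malformed or unknown records are silently
    ignored rather than rejected.
    """
    selected = []
    for line in lines:
        record_name = line[:6].strip()
        if record_name in wanted:
            selected.append(line)
    return selected
```

Nothing here checks that the input is a correct PDB file, which is
exactly the trade-off under discussion.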
>
> Many of those visualization programs let you export an
> atom selection as a PDB file.  Because of the difficulties
> in producing a syntactically correct PDB file, nearly
> all don't.  But that's okay because those extra fields,
> for the most part, are ignored by the other tools.
>
> The program cannot correctly say "exports in the PDB 2.3
> format as specified by the PDB".  It can say "generates
> a file in the PDB-like family of formats, that has a
> decent chance of being imported by other tools that
> understand the PDB format".  It can say "saves the
> HEADER, TITLE, ATOM, HETATM, TER and END records".  But
> bear in mind that that subset is *not* "a PDB file".
>
> Doing this documentation requires the users to understand
> the details of each format.  Most don't care.  They'll just
> say "Chime doesn't parse the PDB file that RasMol generates"
> and be done with it.  That's your "find out the hard way."
> It's actually easier to do that than figure out the
> format details.  It might even be the *easiest* way!
>
>
> What is the solution to this format variation problem?
>
> Not emit PDB files and generate some other format?
> Like ".my_pdb"? That removes easy interoperability.
> For example, most other visualization programs could
> read a ".my_pdb" file, but the filename is not going
> to appear in a file selection window looking for "*.pdb"
> files nor be usable by a plug-in that only handles
> "x-chem/x-pdb" MIME type.
>
> Should the authors of these programs get together and
> decide upon a new format?  In practice that rarely happens.
> And I've seen people spend years working on file formats
> that in practice never get used.
>
> Should we all switch to another format, perhaps based on
> XML or RDF?  How do you convince everyone to switch to
> a new format?
>
> Or a simpler question you still haven't answered: why
> should any new FASTA reader support the ;comment field?
> Is there a benefit, other than perhaps the feeling of
> righteous mastery of arcane knowledge?
>
>
> This whole topic on format variations and interoperability
> is a problem with no easy solution.
>
> What's your solution?  Maybe it's something the rest
> of us haven't thought of.
>
> I've given some of the history of two formats: FASTA
> and PDB.  How would your solution, if proposed 15
> years ago, have changed things?  How would history have
> played out?  How can we improve this for the future?
>
> Perhaps I'm just cantankerous after having dealt
> with this so long, <wink>
>
>                                 Andrew
>                                 dalke at dalkescientific.com
>
>
>
> _______________________________________________
> biology-in-python mailing list
> biology-in-python at lists.idyll.org
> http://lists.idyll.org/listinfo/biology-in-python
>


