[bip] formats in biology

Andrew Dalke dalke at dalkescientific.com
Thu Aug 2 08:00:57 PDT 2007


On Aug 2, 2007, at 3:49 PM, Bruce Southey wrote:
> Thanks for the email as I do agree. What I don't agree with is
> sweeping the problem away just because some people are doing it. A
> user should not have to find out the hard way what is supported and
> what is not.

Most of the people developing software are *not* sweeping
the problem under the rug.  To suggest otherwise means
that you do not understand why things got to be the way
they are, nor the attempts (like readseq and bioperl) to
mitigate those problems.


Everyone agrees that the user shouldn't have to care about
understanding the nuances of each format and program support
for a given format.  Well, almost.  I've met some who say
"just suck it up and deal with it".

The question is, how to get to the point where the users
don't need to care?  The current solutions are "guess based
on the file extension, and be liberal on what you read"
(bioperl, OEChem), or use some sort of format sniffing / try
every parser until one works (emboss).  Neither is
perfect, so the default choice can be overridden.
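
As a rough illustration of those two strategies plus the
override, here is a minimal Python sketch.  The extension
table, the sniffing rules, and the function names are all
made up for this example; they are not how bioperl, OEChem
or emboss actually do it.

```python
import os

# Hypothetical extension table; real toolkits know many more formats.
EXTENSION_FORMATS = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".pdb": "pdb",
    ".ent": "pdb",
    ".gb": "genbank",
}


def sniff_format(text):
    """Guess a format from the first non-blank line, or return None."""
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(">"):
            return "fasta"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith(("HEADER", "ATOM  ", "HETATM")):
            return "pdb"
        return None  # first real line did not look like anything we know
    return None


def guess_format(filename, text, format=None):
    """An explicit format wins, then the file extension, then sniffing."""
    if format is not None:
        return format
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTENSION_FORMATS:
        return EXTENSION_FORMATS[ext]
    return sniff_format(text)
```

Note that the extension is trusted over the content here, so
a FASTA file named "x.pdb" is guessed wrong unless the caller
passes format="fasta".  That is exactly why the override
has to exist.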


I mentioned I wrote validators for some of the bioinformatics
data sets.  In pretty much every data set I found examples
where the spec and the data provided by that data service
were in disagreement.  PIR? Check. PDB? Check. SWISS-PROT?
Check. GenBank? Check. Prosite? Check.  All had problems.

Who was wrong?  The spec or the data?

Answer: It depends.  Sometimes the spec was wrong, or incomplete.
And sometimes the data was wrong.  And sometimes the data was
"right" according to the spec, but in a way that a person
looking at the data wouldn't expect.

Sometimes the spec contains fields that no one uses, or
cares about.  Like comments in FASTA records.

(BTW, in looking at the FASTA code it appears that those
"comments" were really used for data fields for some other
FASTA-like format.  By having FASTA ignore those fields,
it was possible to have FASTA also parse this alternate
format to only extract the sequence data.)
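
To show how little it takes to ignore those fields, here is
a minimal sketch of a reader that skips them.  It assumes
records start with a ">" title line and that any line
starting with ";" is a comment; the function name is mine
and this is not the FASTA program's own code.

```python
def read_fasta(lines):
    """Yield (title, sequence) pairs, ignoring ';' comment lines."""
    title = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(";"):
            continue  # comment field: skip it, whatever it contains
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:].strip()
            chunks = []
        elif title is not None:
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)
```

Because the ";" lines are dropped without being
interpreted, the same loop reads plain FASTA and any
FASTA-like variant that hides extra data in them.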

Formats get used outside the original context.  The PDB
is an example.  A correct PDB file has a lot of required
records.

Most structure visualization programs only need the ATOM
and HETATM (and a few others, to be nice) records.  It's
a lot of work to verify that the input is in the correct
format.  It's very simple to ignore unknown fields.
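
Here is a minimal sketch of that "ignore what you don't
know" style of reader.  It uses the fixed column positions
of the ATOM/HETATM records and does no validation at all;
the function name and the choice of fields are mine, not
taken from any particular visualization program.

```python
def read_atoms(lines):
    """Yield one dict per ATOM or HETATM record; skip everything else."""
    for line in lines:
        if line[:6] not in ("ATOM  ", "HETATM"):
            continue  # unknown or unneeded record: ignore it
        yield {
            "record": line[:6].strip(),
            "name": line[12:16].strip(),      # columns 13-16
            "resname": line[17:20].strip(),   # columns 18-20
            "chain": line[21:22],             # column 22
            "resseq": int(line[22:26]),       # columns 23-26
            "x": float(line[30:38]),          # columns 31-38
            "y": float(line[38:46]),          # columns 39-46
            "z": float(line[46:54]),          # columns 47-54
        }
```

A reader like this accepts a fully annotated PDB entry and
a bare list of ATOM lines equally well, which is why the
stripped-down exports from other tools usually still load.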

Many of those visualization programs let you export an
atom selection as a PDB file.  Because of the difficulties
in producing a syntactically correct PDB file, nearly
all of them don't produce one.  But that's okay because
the missing records, for the most part, are ignored by
the other tools anyway.

The program cannot correctly say "exports in the PDB 2.3
format as specified by the PDB".  It can say "generates
a file in the PDB-like family of formats, that has a
decent chance of being imported by other tools that
understand the PDB format".  It can say "saves the
HEADER, TITLE, ATOM, HETATM, TER and END records".  But
bear in mind that that subset is *not* "a PDB file".

Documentation like this requires the users to understand
the details of each format.  Most don't care.  They'll just
say "Chime doesn't parse the PDB file that RasMol generates"
and be done with it.  That's your "find out the hard way."
It's actually easier to do that than figure out the
format details.  It might even be the *easiest* way!


What is the solution to this format variation problem?

Stop emitting PDB files and generate some other format
instead?  Like ".my_pdb"?  That removes easy
interoperability.  For example, most other visualization
programs could read a ".my_pdb" file, but the filename is
not going to appear in a file selection window looking for
"*.pdb" files, nor be usable by a plug-in that only handles
the "x-chem/x-pdb" MIME type.

Should the authors of these programs get together and
decide upon a new format?  In practice that rarely happens.
And I've seen people spend years working on file formats
that in practice never get used.

Should we all switch to another format, perhaps based on
XML or RDF?  How do you convince everyone to switch to
a new format?

Or a simpler question you still haven't answered: why
should any new FASTA reader support the ;comment field?
Is there a benefit, other than perhaps the feeling of
righteous mastery of arcane knowledge?


This whole topic on format variations and interoperability
is a problem with no easy solution.

What's your solution?  Maybe it's something the rest
of us haven't thought of.

I've given some of the history of two formats: FASTA
and PDB.  How would your solution, if proposed 15
years ago, have changed things?  How would history have
played out?  How can we improve this for the future?

Perhaps I'm just cantankerous after having dealt
with this for so long. <wink>

				Andrew
				dalke at dalkescientific.com




