[bip] formats in biology

Thu Aug 2 08:32:32 PDT 2007

On Aug 2, 2007, at 11:53 AM, Giovanni Marco Dall'Olio wrote:
> This is a great problem: if you want to propose a change in one
> standard format, you have to document it clearly, and also underline
> the changes by declaring a new version, for example 'fasta 1.1'.

Let's take FASTA as the case study, and see if this would
have worked.

readseq, written about 15 years ago, does not support ;comments
in the input.  Suppose that Gilbert noticed that and wrote
in the documentation:

     does not support Pearson FASTA files with comment lines

Would he have called that format "fasta 1.1"?  No.  Because
he didn't come up with the FASTA format, and it would be
considered presumptuous to call this more restrictive format
"fasta 1.1".  Plus, dot revisions should indicate that the
format is a superset.

Note that the output is in "fasta 1.0" format.

Did people care that those comment fields were ignored?
Apparently not.
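
To make the comment-line issue concrete, here is a minimal sketch
of a reader that is liberal in what it accepts.  It is not
readseq's code; the name read_fasta is made up for illustration,
and it assumes that any line starting with ';' is a comment that
can be skipped.

```python
def read_fasta(lines):
    """Yield (title, sequence) pairs from FASTA-like input.

    Assumption for this sketch: lines starting with ';' are
    comments and are skipped; blank lines are skipped too.
    """
    title = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith(";"):
            continue
        if line.startswith(">"):
            # A new header ends the previous record, if any.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:].strip()
            chunks = []
        elif title is not None:
            # Sequence data before the first header is ignored.
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)
```

The same function reads files with and without comment lines, so
nothing in the data has to declare which variant it is.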

But suppose he did call it "community fasta".  After all
these years of everyone using the new format, I can assure
you that we would drop the version number/variant name and
be calling it "the FASTA format".  Even if it wasn't the
original FASTA format.  Leaving us exactly where we are now.

Plus, there would have been confusion when people asked "is
'community FASTA' the same as 'Pearson FASTA'?"  Since
no data actually used the comment fields, this confusion
would have served no practical purpose other than pedantry.

In the FASTA file, where does that format information go?
Some formats (GenBank, recent PDBs etc.) have fields for
the version number.  FASTA doesn't.

Yes, formats should contain version information.  That
doesn't change the fact that FASTA is very widely used,
easy to generate, easy to parse.

Even for formats with a version number, people use
format variations which don't exactly meet the format
specification.  I posted an example with the PDB format,
but similar problems exist with SWISS-PROT and GenBank
formats generated by tools that only generate the fields
they care to parse.

The flip side of "be liberal in what you accept" is
"conservative in what you send."  But many of the database
oriented formats are too hard to get correct, and a lot
easier to get "correct enough, so that almost everyone
understands it".

For example, when converting a sequence in FASTA format
into a GenBank record, which should be used for the
required 'division' field?  Your possible answers are:

valid_divisions = ["PRI", "ROD", "MAM", "VRT", "INV", "PLN", "BCT", "RNA",
                   "VRL", "PHG", "SYN", "UNA", "EST", "PAT", "STS", "GSS",
                   "HTG", "HTC", "CON"]

And for the required "AUTHORS" block?

There's no good answer, because the format definition
was designed for a database provider that could put
the correct answer in for each field, or change the spec
as need be.

But if you're working with an analysis program that
expects GenBank files as input and only extracts the
sequence, then you don't care that bioperl generates
an "incorrect but close enough GenBank file" - so
long as it is close enough.
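
As a rough illustration of such a consumer, the sketch below pulls
only the sequence out of a GenBank-like record.  The function name
genbank_sequence is hypothetical, and the sketch assumes the
sequence sits between an "ORIGIN" line and a "//" terminator.  It
never looks at the division or AUTHORS fields.

```python
def genbank_sequence(lines):
    """Return the sequence from one GenBank-like record.

    Assumption for this sketch: the sequence lines sit between a
    line starting with "ORIGIN" and a line starting with "//".
    Every other field is ignored.
    """
    in_origin = False
    chunks = []
    for line in lines:
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Sequence lines look like "        1 acgtacgtac gt";
            # keep the letters, drop the position number and spaces.
            chunks.extend(c for c in line if c.isalpha())
    return "".join(chunks)
```

A reader like this cannot tell a fully valid record from one with
a missing or wrong division, which is why "close enough" output
works for it.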

> p.s. we are way off topic, we should not mess in this way with the ml.

Are we?  I spent a lot of time working on the Biopython
parsing system, only to find that my validating formats
were too restrictive to handle "wild-type" formats.

Some of the people reading this list want to develop
replacements for Biopython, which would also mean replacing
some of the parsing code.

But you're right, this is a general topic not restricted
to Python.  Or to biology.

Though if there were a good solution, then this is as good
a place as any to talk about it.

				Andrew
				dalke at dalkescientific.com