[bip] FASTA

Andrew Dalke dalke at dalkescientific.com
Thu Aug 2 02:10:34 PDT 2007


On Aug 2, 2007, at 5:10 AM, Bruce Southey wrote:
> This type of reaction is rather surprising given that original source
> of the format allows OPTIONAL comment lines starting with ';'.

Let's take Python as an example.

When Python first started it supported string exceptions.

    raise 'there was a problem'

It still supports string exceptions.  But these are deprecated.
You are strongly advised to not use them in new code, and if
possible to remove them in old code.

Other parts of Python are also deprecated.  grep for 'Deprecated'
in the standard library to find functions and classes and
even deprecated modules (xmllib, regex).  There was also a keyword
removed in Python 1.3 or so ("access") and some modules removed.


The comment lines in FASTA files are deprecated in that sense.
Don't use them in new code, don't use them in new databases.
That consensus formed at least 15 years ago, which is
effectively why no code supports them.

No one cares to support them.

No one advocates supporting them.

An advantage of Python is that people actively document
the changes, so there are varying specs over time, each from what's
regarded as the authoritative source.  Pearson hasn't done that
for FASTA.  And/or he doesn't care if the rest of the world only
uses a subset of his format.

Or perhaps he's said in a few conferences, on BITNET, or some
long forgotten journal paper, that it's okay to use "FASTA format"
to mean "the variant without comments", and the details of
that arrangement live on in the code but not in the memories
of those writing here.

Do we have to follow a 20-year-old format document (updated for
newer releases)?  If so, why?


> The arguments that the 'file format has evolved' or there is some
> standard are incorrect because it appears to have been present
> from the beginning.

That makes no sense.  The original Pearson FASTA format has
those ;comments.  The current FASTA format that everyone
uses (which I'm thinking should be called the "NCBI FASTA format"),
does not.  When people use the term "FASTA" they nearly always
mean the newer, more minimal variant.

If it's an argument of terminology then yes, the original
FASTA file format supported comments.  But that format is
no longer in use.  It's a historical curiosity.  The current
FASTA format - which *everyone* supports - is similar to the
original, but does not contain comments.

There are multiple meanings of the term "FASTA file format".

Just like there are multiple meanings of the term "PDB file format"
and of most other format names.  (For example, many structure
visualization programs can save to the "PDB format", but their output
rarely fully complies with the format specification put out by the PDB.)

A frustration in this field is that while the spec says one
thing, the consensus understanding and use of that spec can
be rather different.

In this case, though, there is a spec for the modern interpretation
of what "FASTA file format" means: the one on the NCBI
page I pointed out in an earlier email.


Calling our arguments "incorrect" only holds up
if you think that "FASTA file format" can have only one meaning,
and that it must be the original meaning.

If that's the case, what should we name the format we currently
call "FASTA"?  And why is it important that we use a new name?

> The optional means you can exclude the comment lines and
> still be valid. Furthermore, the code Andrew provides indicates that
> FASTA can understand the comment lines.

What I showed was that the sequence reader in the FASTA program
also accepts input like this:

    >This is the title line
    ;here is a comment
    SEQUENCEGOESHERE
    ;look! Another comment!
    FASTATHINKSTHISISSEQUENCE
    ; I can intermingle sequence and comment lines.
    ; FASTA ignores all comment lines, so the code supports a
    ; format which is the superset of the format it documents.
    THEEND

Therefore, FASTA's sequence reader does not support the
same file format that FASTA's documentation describes as
the FASTA file format.  The format document only allowed
comments between the title line and the sequence.

That was my point in showing the code.  Which is correct,
the documentation or the code?  How do you resolve the
ambiguity, and why?
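
To make the ambiguity concrete, here is a rough Python sketch of a
reader that can follow any of the three interpretations.  It is not
Pearson's code or anyone's real parser; the name read_fasta, the
"comments" parameter and its three mode names are made up for this
illustration, and raising an error on a disallowed ';' line is my
assumption (a real tool might do something else with such a line).

```python
def read_fasta(lines, comments="anywhere"):
    """Yield (title, sequence) pairs from an iterable of text lines.

    Hypothetical modes, named only for this sketch:
      "anywhere" - ignore every ';' line (what the reader code does)
      "header"   - allow ';' lines only before a record's first
                   sequence line (what the format document describes)
      "none"     - reject every ';' line (the variant without comments)
    """
    if comments not in ("anywhere", "header", "none"):
        raise ValueError("unknown comments mode: %r" % (comments,))
    title = None
    chunks = []
    for lineno, raw in enumerate(lines, 1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skipping blank lines is an assumption of this sketch
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:].strip()
            chunks = []
        elif line.startswith(";"):
            if comments == "none":
                raise ValueError("line %d: comment line not allowed" % lineno)
            if comments == "header" and chunks:
                raise ValueError("line %d: comment after sequence data" % lineno)
            # otherwise the comment line is silently ignored
        else:
            if title is None:
                raise ValueError("line %d: sequence before title line" % lineno)
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


EXAMPLE = """\
>This is the title line
;here is a comment
SEQUENCEGOESHERE
;look! Another comment!
FASTATHINKSTHISISSEQUENCE
THEEND
""".splitlines()

# "anywhere" reads one record; "header" and "none" raise ValueError.
records = list(read_fasta(EXAMPLE, comments="anywhere"))
```

The same input is one valid record, or an error at line 4, or an
error at line 2, depending only on which definition of the format
the reader was written against.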

Even if you decide one way or the other, it does not make a difference.

No one cares to support that feature of the original format.
Everyone uses/supports only the consensus definition of FASTA,
as described on the NCBI page.  And this consensus has been
around for over 15 years.

Why do you find this rather minor relic so important?

Me, I'm a parser freak, and have spent a lot of time
writing validators for different formats, and sending
bug reports in to the different data providers.  So I
can go on and on about this topic for a while.

				Andrew
				dalke at dalkescientific.com




