<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

The vast majority of applications do not support ';' comments, and if
they were a feature of the original format, they have been forgotten
and ignored. Adding support now would cause more problems than it would
solve. If some programs support it and others do not, we would suddenly
have two FASTA dialects to convert between, and frankly that makes me
want to abandon bioinformatics altogether, because "standard" file
formats have been a huge time sink in the field. FASTA in the form most
programs support has been the closest thing to a stable file format in
bioinformatics, which is sad.<br>

<br>

I do understand the desire to make a program as compliant with a

particular file format as possible. That is a good thing. In this case,

the file format has evolved since the original and has effectively

become a de facto standard. I would argue the de facto standard is the

right way to go in this case. If you or others disagree, what benefits
would supporting ';' provide compared with the extra work and problems
it would create?<br>

<br>

-Brandon<br>

<br>

Bruce Southey wrote:

<blockquote

 cite="midbbcd77d00707311908r5bfefe6ewdf6d595ec1461e99@mail.gmail.com"

 type="cite">

  <pre wrap="">Hi,

I know! I was surprised when I first found out about it.

The main source that I can find is the file FASTA.doc which is the

documentation for 'The FASTA program package'. So I am not sure if

you can get much more 'authoritative' than that. Consequently, every

package or reference that ignores the comment line is not implementing

the full FASTA spec and should be treated as such.

I found various versions of FASTA.doc online including:

<a class="moz-txt-link-freetext" href="http://molbio.unmc.edu/other-tools/fasta/fasta-help.html">http://molbio.unmc.edu/other-tools/fasta/fasta-help.html</a> (this one is

Release 2.0 1995)

In section 3.2. Sequence files:

"I have included several sample test files, *.AA.  The first

line may begin with a '&gt;' or ';' followed by a comment. The

text after ';' in other lines will be ignored. Spaces and

tabs (and anything else that is not an amino-acid code) are

ignored."

A similar example is:

<a class="moz-txt-link-freetext" href="http://www.psc.edu/general/software/packages/fasta/manual/fasta.html">http://www.psc.edu/general/software/packages/fasta/manual/fasta.html</a>

"The fasta3 programs use a standard text format sequence file. Lines

beginning with '&gt;' or ';' are considered comments and ignored;

sequences can be upper or lower case, blanks,tabs and unrecognizable

characters are ignored."
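
A minimal Python sketch of a reader that follows the fasta3 wording
quoted above may make the rule concrete. It is not taken from any FASTA
package: the function name and the sample records are made up for
illustration, and it assumes that a '&gt;' line starts a record, that a
line starting with ';' is skipped, and that blanks and tabs inside
sequence lines are dropped. A ';' line used as a title (as the 1995
text allows) is not handled.

```python
def read_fasta_with_comments(lines):
    """Yield (title, sequence) pairs, skipping lines that start with ';'.

    Hypothetical helper for illustration only.
    """
    title = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(";"):
            continue  # assumed rule: ';' lines are comments and are ignored
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:].strip()
            chunks = []
        elif title is not None:
            # drop blanks and tabs inside sequence lines
            chunks.append("".join(line.split()))
    if title is not None:
        yield title, "".join(chunks)


# Made-up input, not real sequence data.
example = [
    ">seq1 hypothetical record\n",
    "; a comment line\n",
    "ACDE FGH\n",
    "IKLM\n",
    ">seq2\n",
    "NPQR\n",
]
records = list(read_fasta_with_comments(example))
print(records)
```

With this input the comment line disappears and the two records come
back as ('seq1 hypothetical record', 'ACDEFGHIKLM') and ('seq2', 'NPQR').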

Regards

Bruce

On 7/31/07, Andrew Dalke <a class="moz-txt-link-rfc2396E" href="mailto:dalke@dalkescientific.com">&lt;dalke@dalkescientific.com&gt;</a> wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">On Jul 31, 2007, at 9:05 PM, Bruce Southey wrote:

    </pre>

    <blockquote type="cite">

      <pre wrap="">I did notice that your FASTA section is incomplete because

you must address the comment part of the FASTA format (eg. see

<a class="moz-txt-link-freetext" href="http://en.wikipedia.org/wiki/Fasta_format">http://en.wikipedia.org/wiki/Fasta_format</a> ). Yeah, most programs and

people miss this but it is part of the format.

      </pre>

    </blockquote>

    <pre wrap="">While I know it's in the Wikipedia page, and recall

mention of ;comments on a few other web pages, I have never

seen sequence libraries with those comments, nor have

I reviewed any source code which handles it.  So I

removed that detail from the Wikipedia.  In the

discussion section I'm pointing to this email, once it

gets in the bip archive.

I see no justification for having new code - and especially

not code meant for beginning programmers - support it when

the code cannot be tested against real-world data and

will never be used, because no new data sets will have

those comments.

Bioperl doesn't handle it.  This is from Bio/SeqIO/fasta.pm

     local $/ = "\n&gt;";

     return unless my $entry = $self-&gt;_readline;

     chomp($entry);

      ...

     $entry =~ s/^&gt;//;

     my ($top,$sequence) = split(/\n/,$entry,2);

     defined $sequence &amp;&amp; $sequence =~ s/&gt;//g;

     ...

     my ($id,$fulldesc);

     if( $top =~ /^\s*(\S+)\s*(.*)/ ) {

         ($id,$fulldesc) = ($1,$2);

     }

     if (defined $id &amp;&amp; $id eq '') {$id=$fulldesc;} # FIX incase no space

                                                    # between &gt; and name \AE

     defined $sequence &amp;&amp; $sequence =~ s/\s//g;  # Remove whitespace

which means it's certainly not found currently in the wild.
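
As a rough illustration, the quoted Perl lines can be rendered in
Python roughly as below. This is only a sketch of the lines shown
(the elided "..." parts are not reproduced), the function name and the
sample text are hypothetical, and it is not a claim about everything
Bio/SeqIO/fasta.pm does.

```python
import re


def bioperl_style_records(text):
    """Split on newline + '>' and return (id, description, sequence) tuples.

    Hypothetical helper that mirrors only the quoted lines.
    """
    records = []
    for entry in text.split("\n>"):
        entry = re.sub(r"^>", "", entry)  # like: $entry =~ s/^>//
        if not entry.strip():
            continue
        # like: split(/\n/, $entry, 2)
        top, _, sequence = entry.partition("\n")
        # remove stray '>' characters and all whitespace from the sequence
        sequence = re.sub(r"\s", "", sequence.replace(">", ""))
        match = re.match(r"\s*(\S+)\s*(.*)", top)
        seq_id, desc = match.groups() if match else ("", "")
        records.append((seq_id, desc, sequence))
    return records


# Made-up input with a ';' line inside the first record.
text = ">seq1 demo\n;a comment\nACGT\n>seq2\nTTGA\n"
records = bioperl_style_records(text)
print(records)
```

Under these assumptions the ';' line is not skipped: its text is
folded into the first sequence, which comes back as ';acommentACGT'.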

Gilbert's old readseq library doesn't handle it either.  (The

code is in readPearson/endPearson in

   <a class="moz-txt-link-freetext" href="http://iubio.bio.indiana.edu/soft/molbio/readseq/classic/src/">http://iubio.bio.indiana.edu/soft/molbio/readseq/classic/src/</a>

ureadseq.c

but not easily quotable because you have to know what

the "readLoop" and "getline" functions do.)

Together that means that for over 15 years two of the

most widely used sequence readers didn't handle this part

of the spec.

This exact topic was something I asked my students

a few years ago: Find three web pages which describe the

FASTA specification.  Are any of them authoritative?

The Wikipedia page does not link to a formal spec.

Download the FASTA source and you'll find no spec.  None

of the FASTA files distributed with the code contain a

leading ';'.

The only place I found to look was the actual code.

This comes from FASTA 35.1.5, getseq.c:

     if (line[0]=='&gt;') {

       seq_format = FASTA_FORMAT;

#ifdef SUPERFAMNUM

   ...

#endif

       if ((bp=(char *)strchr(line,'\n'))!=NULL) *bp='\0';

       strncpy(seq_title,line+1,sizeof(seq_title));

       seq_title[sizeof(seq_title)-1]='\0';

       if ((bp=(char *)strchr(line,' '))!=NULL) *bp='\0';

       strncpy(libstr,line+1,12);

       libstr[12]='\0';

     }

....

   if (seq_format !=GCG_FORMAT)

     while(fgets(line,sizeof(line),fptr)!=NULL) {

#ifdef PIRLIB

  ...

#endif

         if (line[0]!='&gt;'&amp;&amp; line[0]!=';') {

           for (i=0; (n&lt;maxs &amp;&amp; rn &lt; sstop)&amp;&amp;

                  ((ic=qascii[line[i]&amp;AAMASK])&lt;EL); i++)

             if (ic&lt;NA &amp;&amp; ++rn &gt; sstart ) seq[n++]= ic;

           if (ic == ES || rn &gt; sstop) break;

         }

     }

The variable 'line' is 512 characters long.  If this

is authoritative then each sequence line of a FASTA

record may be up to 511 characters, and no more.

I don't know what all that code is doing.  It does

look like if the line is more than 512 characters long,

and the 512th character is a "&gt;", then it will be

misinterpreted.  But I would have to test it to find out.
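
The chunking part of that concern can be sketched in Python without
running FASTA itself. The sketch below only imitates how a loop of
fgets(line, 512, fptr) calls would cut up a long line. The helper
name and the sample record are hypothetical, and it says nothing
about what the rest of getseq.c does with the pieces.

```python
import io

BUF_SIZE = 512  # size of the 'line' buffer mentioned above


def fgets_chunks(text, buf_size=BUF_SIZE):
    """Return the pieces a C-style fgets(line, buf_size, f) loop would see.

    Hypothetical helper: readline(n) stops after n characters or a newline,
    which matches fgets reading at most buf_size - 1 characters.
    """
    stream = io.StringIO(text)
    chunks = []
    while True:
        chunk = stream.readline(buf_size - 1)  # one slot is kept for the NUL
        if not chunk:
            break
        chunks.append(chunk)
    return chunks


# Made-up record: one sequence line whose 512th character is '>'.
long_line = "A" * 511 + ">" + "C" * 10 + "\n"
chunks = fgets_chunks(">demo\n" + long_line)
print([len(c) for c in chunks])
print(chunks[2][0])
```

The long line arrives as two pieces of 511 and 12 characters, and the
second piece starts with '&gt;', so a test on line[0] would see it as
if it were the start of a new line.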

So, ignore the Wikipedia entry.  It was written by a

pedant.  To respond in kind, the world parses "the

NCBI FASTA format" and not "the Pearson FASTA format" ;)

The NCBI FASTA format is at

   <a class="moz-txt-link-freetext" href="http://www.ncbi.nlm.nih.gov/blast/fasta.shtml">http://www.ncbi.nlm.nih.gov/blast/fasta.shtml</a>

                                Andrew

                                <a class="moz-txt-link-abbreviated" href="mailto:dalke@dalkescientific.com">dalke@dalkescientific.com</a>

_______________________________________________

biology-in-python mailing list

<a class="moz-txt-link-abbreviated" href="mailto:biology-in-python@lists.idyll.org">biology-in-python@lists.idyll.org</a>

<a class="moz-txt-link-freetext" href="http://lists.idyll.org/listinfo/biology-in-python">http://lists.idyll.org/listinfo/biology-in-python</a>

    </pre>

  </blockquote>


</blockquote>

</body>

</html>