[bip] welcome!

Andrew Dalke dalke at dalkescientific.com
Tue Jul 31 16:52:14 PDT 2007


On Jul 31, 2007, at 9:05 PM, Bruce Southey wrote:
> I did notice that your FASTA section is incomplete because
> you must address the comment part of the FASTA format (eg. see
> http://en.wikipedia.org/wiki/Fasta_format ). Yeah, most programs and
> people miss this but it is part of the format.

While I know it's in the Wikipedia page, and recall
mention of ;comments on a few other web pages, I have never
seen sequence libraries with those comments, nor have
I reviewed any source code which handles them.  So I
removed that detail from the Wikipedia page.  In its
discussion section I will point to this email, once it
gets into the bip archive.


I see no justification for having new code - and especially
not code meant for beginning programmers - support it when
the code cannot be tested against real-world data and
will never be used, because no new data sets will have
those comments.


Bioperl doesn't handle it.  This is from Bio/SeqIO/fasta.pm

     local $/ = "\n>";
     return unless my $entry = $self->_readline;

     chomp($entry);
      ...
     $entry =~ s/^>//;

     my ($top,$sequence) = split(/\n/,$entry,2);
     defined $sequence && $sequence =~ s/>//g;
     ...

     my ($id,$fulldesc);
     if( $top =~ /^\s*(\S+)\s*(.*)/ ) {
         ($id,$fulldesc) = ($1,$2);
     }

     if (defined $id && $id eq '') {$id=$fulldesc;} # FIX incase no space
                                                    # between > and name \AE
     defined $sequence && $sequence =~ s/\s//g;  # Remove whitespace

which means it's certainly not found currently in the wild.
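
For readers who don't know Perl, here is a rough Python sketch of
what the Bioperl excerpt above appears to do.  It is my own
approximation, not a port of Bio::SeqIO, and the function name
parse_fasta_like_bioperl is made up for illustration.  It covers only
the quoted lines and ignores the parts elided with "...".  The point
is that there is no special case for ';' lines: they simply end up
inside the sequence.

```python
import re


def parse_fasta_like_bioperl(text):
    """Return a list of (id, description, sequence) tuples.

    Approximates the quoted Bioperl logic: records are split on
    "\n>", the first line is the title, everything else is sequence.
    """
    records = []
    for entry in text.split("\n>"):
        if not entry.strip():
            continue  # nothing to read
        if entry.startswith(">"):
            entry = entry[1:]  # the first record still has its '>'
        top, _, sequence = entry.partition("\n")
        # Drop stray '>' characters and all whitespace from the sequence.
        sequence = re.sub(r"\s", "", sequence.replace(">", ""))
        match = re.match(r"\s*(\S+)\s*(.*)", top)
        seq_id, description = match.groups() if match else (None, None)
        records.append((seq_id, description, sequence))
    return records
```

Given a record whose title line is followed by a line such as
";comment", this sketch returns the comment text as part of the
sequence.  Nothing in it treats ';' as special.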

Gilbert's old readseq library doesn't handle it either.  (The
code is in readPearson/endPearson in
   http://iubio.bio.indiana.edu/soft/molbio/readseq/classic/src/ureadseq.c
but not easily quotable because you have to know what
the "readLoop" and "getline" functions do.)


Together that means that for over 15 years two of the
most widely used sequence readers didn't handle this part
of the spec.



This exact topic was something I asked my students
a few years ago: Find three web pages which describe the
FASTA specification.  Are any of them authoritative?


The Wikipedia page does not link to a formal spec.
Download the FASTA source and you'll find no spec.  None
of the FASTA files distributed with the code contain a
leading ';'.


The only place I found it was in the actual code.
This comes from FASTA 35.1.5, getseq.c:

     if (line[0]=='>') {
       seq_format = FASTA_FORMAT;
#ifdef SUPERFAMNUM
   ...
#endif
       if ((bp=(char *)strchr(line,'\n'))!=NULL) *bp='\0';
       strncpy(seq_title,line+1,sizeof(seq_title));
       seq_title[sizeof(seq_title)-1]='\0';
       if ((bp=(char *)strchr(line,' '))!=NULL) *bp='\0';
       strncpy(libstr,line+1,12);
       libstr[12]='\0';
     }

....

   if (seq_format !=GCG_FORMAT)
     while(fgets(line,sizeof(line),fptr)!=NULL) {
#ifdef PIRLIB
  ...
#endif
         if (line[0]!='>'&& line[0]!=';') {
           for (i=0; (n<maxs && rn < sstop)&&
                  ((ic=qascii[line[i]&AAMASK])<EL); i++)
             if (ic<NA && ++rn > sstart ) seq[n++]= ic;
           if (ic == ES || rn > sstop) break;
         }
     }

The variable 'line' is 512 characters long.  If this
is authoritative then this means that each sequence line of
a FASTA record may be up to 511 characters, and no more.

I don't know what all that code is doing.  It does
look like if the line is more than 512 characters long,
and the 512th character is a ">", then it will be
misinterpreted.  But I would have to test it to find out.
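
Short of testing the real program, the buffer behavior can at least
be simulated.  The Python sketch below is only a model built on my
assumptions: it mimics how C's fgets() hands back at most 511
characters per call with a 512-character buffer, then applies the
same first-character test as the loop quoted above.  The function
names are hypothetical, and the qascii lookup and everything hidden
behind "..." are not modeled.

```python
import io

BUFFER_SIZE = 512  # size of 'line' in the quoted C code


def fgets_chunks(stream, size=BUFFER_SIZE):
    """Yield chunks the way fgets(line, size, fp) would return them:
    at most size-1 characters, stopping after a newline."""
    while True:
        chunk = stream.readline(size - 1)
        if not chunk:
            return
        yield chunk


def skipped_chunks(text, size=BUFFER_SIZE):
    """Return the chunks that the test
    line[0] != '>' && line[0] != ';' would reject as sequence data."""
    return [c for c in fgets_chunks(io.StringIO(text), size) if c[0] in ">;"]


# One long sequence line whose 512th character is '>'.
long_line = "A" * 511 + ">" + "C" * 10 + "\n"
print(skipped_chunks(long_line))
```

In this model the second chunk starts with the '>' that was the
512th character of the line, so it is treated like the start of a
title line and its residues are not read as sequence.  Whether the
real FASTA code behaves the same way still has to be checked against
the program itself.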


So, ignore the Wikipedia entry.  It was written by a
pedant.  To respond in kind, the world parses "the
NCBI FASTA format" and not "the Pearson FASTA format" ;)

The NCBI FASTA format is at
   http://www.ncbi.nlm.nih.gov/blast/fasta.shtml

				Andrew
				dalke at dalkescientific.com




