[bip] welcome!

Brandon King kingb at caltech.edu
Wed Aug 1 14:36:06 PDT 2007


The vast majority of applications do not support ';', and if it was a 
feature of the original format, it has been forgotten and ignored. 
Making a program support it now would cause more problems than it 
would solve. If some programs support it and others do not, then we 
suddenly have extra work converting between the two FASTA variants, 
and frankly that makes me want to abandon bioinformatics altogether, 
as "standard" file formats have been a huge time sink in the field. 
FASTA in the form most programs support has been the closest thing to 
a non-changing file format in bioinformatics, which is sad.
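
To make the conversion work concrete, here is a minimal Python sketch of the one-way step from the ';'-comment variant to the form most programs read. It assumes the rule quoted below from FASTA.doc: a line starting with ';' is a comment, and text after ';' on other lines is ignored. The function name is made up for illustration, and '>' header lines are deliberately left alone because descriptions in real files can contain ';'.

```python
def strip_semicolon_comments(lines):
    """Yield FASTA lines with Pearson-style ';' comments removed.

    Hypothetical helper, not taken from any existing package.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Header line: pass through untouched.
            yield line
        elif line.startswith(";"):
            # Whole-line comment: drop it.
            continue
        else:
            # Sequence line: discard everything from the first ';' onward.
            residues = line.split(";", 1)[0].rstrip()
            if residues:
                yield residues


example = [
    ">seq1 test protein\n",
    ";this comment line is dropped\n",
    "MKTAYIAK ;trailing comment\n",
    "QRQISFVK\n",
]
print(list(strip_semicolon_comments(example)))
# -> ['>seq1 test protein', 'MKTAYIAK', 'QRQISFVK']
```

Going the other way is not possible in general, since the comments are gone once stripped.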

I do understand the desire to make a program as compliant with a 
particular file format as possible. That is a good thing. In this case, 
the file format has evolved since the original and has effectively 
become a de facto standard. I would argue the de facto standard is the 
right way to go in this case. If you or others disagree, what benefits 
would it provide vs the amount of extra work and problems it would create?
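
For reference, the de facto format can be sketched in a few lines of Python. This is only my reading of what readers such as the Bioperl one quoted below do (a '>' header, first word as the identifier, whitespace stripped from the sequence), not a port of any of them. The names read_fasta and _make_record are illustrative.

```python
def read_fasta(lines):
    """Yield (identifier, description, sequence) for each '>' record.

    Sketch of the de facto format only: ';' gets no special treatment.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Drop all whitespace inside sequence lines.
            chunks.append("".join(line.split()))
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    # First whitespace-delimited word is the identifier, the rest is
    # the description.
    parts = header.split(None, 1)
    ident = parts[0] if parts else ""
    desc = parts[1] if len(parts) > 1 else ""
    return ident, desc, "".join(chunks)


records = list(read_fasta([
    ">seq1 test protein\n",
    "MKTA YIAK\n",
    "QRQ\n",
    ">seq2\n",
    "ACGT\n",
]))
print(records)
# -> [('seq1', 'test protein', 'MKTAYIAKQRQ'), ('seq2', '', 'ACGT')]
```

Feed a reader like this a file with ';' lines and the comment text ends up inside the sequence, which is exactly the kind of mismatch between the two variants I would rather not create.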

-Brandon

Bruce Southey wrote:
> Hi,
> I know! I was surprised when I first found out about it.
>
> The main source that I can find is the file FASTA.doc which is the
> documentation for 'The FASTA program package' . So I am not sure if
> you can get much more 'authoritative' than that. Consequently, every
> package or reference that ignores the comment line is not implementing
> the full FASTA spec and should be treated as such.
>
> I found various versions of FASTA.doc online including:
> http://molbio.unmc.edu/other-tools/fasta/fasta-help.html (this one is
> Release 2.0 1995)
>
> In section 3.2. Sequence files:
> "I have included several sample test files, *.AA. The first
> line may begin with a '>' or ';' followed by a comment. The
> text after ';' in other lines will be ignored. Spaces and
> tabs (and anything else that is not an amino-acid code) are
> ignored."
>
> A similar  example is:
> http://www.psc.edu/general/software/packages/fasta/manual/fasta.html
>
> "The fasta3 programs use a standard text format sequence file. Lines
> beginning with '>' or ';' are considered comments and ignored;
> sequences can be upper or lower case, blanks,tabs and unrecognizable
> characters are ignored."
>
>
> Regards
> Bruce
>
> On 7/31/07, Andrew Dalke <dalke at dalkescientific.com> wrote:
>   
>> On Jul 31, 2007, at 9:05 PM, Bruce Southey wrote:
>>     
>>> I did notice that your FASTA section is incomplete because
>>> you must address the comment part of the FASTA format (eg. see
>>> http://en.wikipedia.org/wiki/Fasta_format ). Yeah, most programs and
>>> people miss this but it is part of the format.
>>>       
>> While I know it's in the Wikipedia page, and recall
>> mention of ;comments on a few other web pages, I have never
>> seen sequence libraries with those comments, nor have
>> I reviewed any source code which handles it.  So I
>> removed that detail from the Wikipedia.  In the
>> discussion section I'm pointing to this email, once it
>> gets in the bip archive.
>>
>>
>> I see no justification for having new code - and especially
>> not code meant for beginning programmers - support it when
>> the code cannot be tested against real-world data and
>> will never be used, because no new data sets will have
>> those comments.
>>
>>
>> Bioperl doesn't handle it.  This is from Bio/SeqIO/fasta.pm
>>
>>      local $/ = "\n>";
>>      return unless my $entry = $self->_readline;
>>
>>      chomp($entry);
>>       ...
>>      $entry =~ s/^>//;
>>
>>      my ($top,$sequence) = split(/\n/,$entry,2);
>>      defined $sequence && $sequence =~ s/>//g;
>>      ...
>>
>>      my ($id,$fulldesc);
>>      if( $top =~ /^\s*(\S+)\s*(.*)/ ) {
>>          ($id,$fulldesc) = ($1,$2);
>>      }
>>
>>      if (defined $id && $id eq '') {$id=$fulldesc;} # FIX incase no space
>>                                                     # between > and name \AE
>>      defined $sequence && $sequence =~ s/\s//g;  # Remove whitespace
>>
>> which means it's certainly not found currently in the wild.
>>
>> Gilbert's old readseq library doesn't handle it either.  (The
>> code is in readPearson/endPearson in
>>    http://iubio.bio.indiana.edu/soft/molbio/readseq/classic/src/ureadseq.c
>> but not easily quotable because you have to know what
>> the "readLoop" and "getline" functions do.)
>>
>>
>> Together that means that for over 15 years two of the
>> most widely used sequence readers didn't handle this part
>> of the spec.
>>
>>
>>
>> This exact topic was something I asked my students
>> a few years ago: Find three web pages which describe the
>> FASTA specification.  Are any of them authoritative?
>>
>>
>> The Wikipedia page does not link to a formal spec.
>> Download the FASTA source and you'll find no spec.  None
>> of the FASTA files distributed with the code contain a
>> leading ';'.
>>
>>
>> The only place I found was to look in the actual code.
>> This comes from FASTA 35.1.5, getseq.c:
>>
>>      if (line[0]=='>') {
>>        seq_format = FASTA_FORMAT;
>> #ifdef SUPERFAMNUM
>>    ...
>> #endif
>>        if ((bp=(char *)strchr(line,'\n'))!=NULL) *bp='\0';
>>        strncpy(seq_title,line+1,sizeof(seq_title));
>>        seq_title[sizeof(seq_title)-1]='\0';
>>        if ((bp=(char *)strchr(line,' '))!=NULL) *bp='\0';
>>        strncpy(libstr,line+1,12);
>>        libstr[12]='\0';
>>      }
>>
>> ....
>>
>>    if (seq_format !=GCG_FORMAT)
>>      while(fgets(line,sizeof(line),fptr)!=NULL) {
>> #ifdef PIRLIB
>>   ...
>> #endif
>>          if (line[0]!='>'&& line[0]!=';') {
>>            for (i=0; (n<maxs && rn < sstop)&&
>>                   ((ic=qascii[line[i]&AAMASK])<EL); i++)
>>              if (ic<NA && ++rn > sstart ) seq[n++]= ic;
>>            if (ic == ES || rn > sstop) break;
>>          }
>>      }
>>
>> The variable 'line' is 512 characters long.  If this
>> is authoritative then this means that each sequence line of
>> a FASTA record may be up to 511 characters, and no more.
>>
>> I don't know what all that code is doing.  It does
>> look like if the line is more than 512 characters long,
>> and the 512th character is a ">", then it will be
>> misinterpreted.  But I would have to test it to find out.
>>
>>
>> So, ignore the Wikipedia entry.  It was written by a
>> pedant.  To respond in kind, the world parses "the
>> NCBI FASTA format" and not "the Pearson FASTA format" ;)
>>
>> The NCBI FASTA format is at
>>    http://www.ncbi.nlm.nih.gov/blast/fasta.shtml
>>
>>                                 Andrew
>>                                 dalke at dalkescientific.com
>>
>>
>>
>> _______________________________________________
>> biology-in-python mailing list
>> biology-in-python at lists.idyll.org
>> http://lists.idyll.org/listinfo/biology-in-python
>>
>>     
>
>
>   
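
On Andrew's point about the 512-character 'line' buffer: below is a rough Python simulation of the chunking, not a test of the real getseq.c. It assumes standard C fgets behaviour, where each call returns at most sizeof(line)-1 = 511 characters and stops after a newline. The helper name fgets_chunks and its bufsize parameter are made up for illustration.

```python
def fgets_chunks(text, bufsize=512):
    """Split text the way repeated C fgets(line, bufsize, fp) calls would."""
    limit = bufsize - 1  # fgets keeps one byte for the terminating NUL
    chunks = []
    pos = 0
    while pos < len(text):
        # A newline inside the window ends the chunk (newline included).
        newline = text.find("\n", pos, pos + limit)
        end = newline + 1 if newline != -1 else min(pos + limit, len(text))
        chunks.append(text[pos:end])
        pos = end
    return chunks


# One sequence line with a stray '>' as its 512th character.
long_line = "A" * 511 + ">" + "C" * 10 + "\n"
chunks = fgets_chunks(long_line)
print([len(c) for c in chunks])   # -> [511, 12]
print(chunks[1][0])               # -> >
```

Under those assumptions the second read starts with '>', so a check such as line[0]!='>' would treat the tail of a sequence line as a header and skip it. Whether the real program behaves that way would still need testing against the actual source.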