[bip] welcome!

Wed Aug 1 20:10:29 PDT 2007

Hi,
This type of reaction is rather surprising given that the original source
of the format allows OPTIONAL comment lines starting with ';'.  The
arguments that the 'file format has evolved' or that there is some standard
are incorrect because the comment lines appear to have been present from the
beginning.  Optional means you can exclude the comment lines and
still be valid. Furthermore, the code Andrew provides indicates that
FASTA can understand the comment lines.
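
To make the point concrete, here is a minimal Python sketch of a reader
that tolerates the optional ';' lines. It is only an illustration, not
code from any of the packages discussed: the name read_fasta is made up,
and it assumes the simplest reading of the docs, namely that any line
starting with ';' is skipped and that only '>' starts a record.

```python
def read_fasta(lines):
    """Yield (title, sequence) pairs from an iterable of text lines.

    Assumption (not taken from any real package): a line starting with
    ';' is a comment and is skipped; only '>' begins a new record.
    """
    title = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:].strip()
            chunks = []
        elif line.startswith(";"):
            # Optional comment line: ignore it.
            continue
        elif title is not None:
            # Sequence line: drop blanks and tabs inside the line.
            chunks.append("".join(line.split()))
    if title is not None:
        yield title, "".join(chunks)


if __name__ == "__main__":
    demo = [">seq1 demo\n", ";a comment\n", "ACGT\n", "AC GT\n"]
    for title, seq in read_fasta(demo):
        print(title, seq)
```

A file with no ';' lines goes through the same code path unchanged,
which is all that 'optional' requires.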

Bruce

On 8/1/07, Brandon King <kingb at caltech.edu> wrote:
>
>  The vast majority of applications do not support ';' and if it was a
> feature of the original format, it has been forgotten and ignored. Making a
> program support it now would cause more problems than any benefit would
> provide. If some programs support it and others do not, then we suddenly
> would have more work to convert the two FASTA formats between each other,
> and frankly that makes me want to abandon bioinformatics altogether as
> "standard" file formats have been a huge time sink in the field. FASTA in the
> form most programs support has been the closest thing to a non-changing file
> format in bioinformatics, which is sad.
>
>  I do understand the desire to make a program as compliant with a particular
> file format as possible. That is a good thing. In this case, the file format
> has evolved since the original and has effectively become a de facto
> standard. I would argue the de facto standard is the right way to go in this
> case. If you or others disagree, what benefits would it provide vs the
> amount of extra work and problems it would create?
>
>  -Brandon
>
>
>  Bruce Southey wrote:
>  Hi,
> I know! I was surprised when I first found out about it.
>
> The main source that I can find is the file FASTA.doc which is the
> documentation for 'The FASTA program package' . So I am not sure if
> you can get much more 'authoritative' than that. Consequently, every
> package or reference that ignores the comment line is not implementing
> the full FASTA spec and should be treated as such.
>
> I found various versions of FASTA.doc online including:
> http://molbio.unmc.edu/other-tools/fasta/fasta-help.html
> (this one is Release 2.0, 1995)
>
> In section 3.2. Sequence files:
> "I have included several sample test files, *.AA. The first
> line may begin with a '>' or ';' followed by a comment. The
> text after ';' in other lines will be ignored. Spaces and
> tabs (and anything else that is not an amino-acid code) are
> ignored."
>
> A similar example is:
> http://www.psc.edu/general/software/packages/fasta/manual/fasta.html
>
> "The fasta3 programs use a standard text format sequence file. Lines
> beginning with '>' or ';' are considered comments and ignored;
> sequences can be upper or lower case, blanks,tabs and unrecognizable
> characters are ignored."
>
>
> Regards
> Bruce
>
> On 7/31/07, Andrew Dalke <dalke at dalkescientific.com> wrote:
>
>
>  On Jul 31, 2007, at 9:05 PM, Bruce Southey wrote:
>
>
>  I did notice that your FASTA section is incomplete because
> you must address the comment part of the FASTA format (e.g. see
> http://en.wikipedia.org/wiki/Fasta_format ). Yeah, most programs and
> people miss this but it is part of the format.
>
>  While I know it's in the Wikipedia page, and recall
> mention of ;comments on a few other web pages, I have never
> seen sequence libraries with those comments, nor have
> I reviewed any source code which handles it. So I
> removed that detail from the Wikipedia. In the
> discussion section I'm pointing to this email, once it
> gets in the bip archive.
>
>
> I see no justification for having new code - and especially
> not code meant for beginning programmers - support it when
> the code cannot be tested against real-world data and
> will never be used, because no new data sets will have
> those comments.
>
>
> Bioperl doesn't handle it. This is from Bio/SeqIO/fasta.pm
>
>  local $/ = "\n>";
>  return unless my $entry = $self->_readline;
>
>  chomp($entry);
>  ...
>  $entry =~ s/^>//;
>
>  my ($top,$sequence) = split(/\n/,$entry,2);
>  defined $sequence && $sequence =~ s/>//g;
>  ...
>
>  my ($id,$fulldesc);
>  if( $top =~ /^\s*(\S+)\s*(.*)/ ) {
>  ($id,$fulldesc) = ($1,$2);
>  }
>
>  if (defined $id && $id eq '') {$id=$fulldesc;} # FIX incase no space
>  # between > and name \AE
>  defined $sequence && $sequence =~ s/\s//g; # Remove whitespace
>
> which means it's certainly not found currently in the wild.
>
> Gilbert's old readseq library doesn't handle it either. (The
> code is in readPearson/endPearson in
> http://iubio.bio.indiana.edu/soft/molbio/readseq/classic/src/ureadseq.c
> but not easily quotable because you have to know what
> the "readLoop" and "getline" functions do.)
>
>
> Together that means that for over 15 years two of the
> most widely used sequence readers didn't handle this part
> of the spec.
>
>
>
> This exact topic was something I asked my students
> a few years ago: Find three web pages which describe the
> FASTA specification. Are any of them authoritative?
>
>
> The Wikipedia page does not link to a formal spec.
> Download the FASTA source and you'll find no spec. None
> of the FASTA files distributed with the code contain a
> leading ';'.
>
>
> The only place I found to look was the actual code.
> This comes from FASTA 35.1.5, getseq.c:
>
>  if (line[0]=='>') {
>  seq_format = FASTA_FORMAT;
> #ifdef SUPERFAMNUM
>  ...
> #endif
>  if ((bp=(char *)strchr(line,'\n'))!=NULL) *bp='\0';
>  strncpy(seq_title,line+1,sizeof(seq_title));
>  seq_title[sizeof(seq_title)-1]='\0';
>  if ((bp=(char *)strchr(line,' '))!=NULL) *bp='\0';
>  strncpy(libstr,line+1,12);
>  libstr[12]='\0';
>  }
>
> ....
>
>  if (seq_format !=GCG_FORMAT)
>  while(fgets(line,sizeof(line),fptr)!=NULL) {
> #ifdef PIRLIB
>  ...
> #endif
>  if (line[0]!='>'&& line[0]!=';') {
>  for (i=0; (n<maxs && rn < sstop)&&
>  ((ic=qascii[line[i]&AAMASK])<EL); i++)
>  if (ic<NA && ++rn > sstart ) seq[n++]= ic;
>  if (ic == ES || rn > sstop) break;
>  }
>  }
>
> The variable 'line' is 512 characters long. If this
> is authoritative then this means that each sequence line of
> a FASTA record may be up to 511 characters, and no more.
>
> I don't know what all that code is doing. It does
> look like if the line is more than 512 characters long,
> and the 512th character is a ">", then it will be
> misinterpreted. But I would have to test it to find out.
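
One rough way to probe the buffer question above without building FASTA
is to imitate it in Python. The sketch below splits input the way
fgets() with a 512-character buffer would (at most 511 characters per
call, stopping after a newline) and then applies only the line[0] test
from the quoted loop. The names fgets_chunks and kept_sequence are made
up for illustration, and the sketch leaves out the qascii lookup and the
title-reading branch, so it shows the concern and not the real
program's behavior.

```python
import io

BUF_SIZE = 512  # size of 'line' as stated in the text above


def fgets_chunks(text, size=BUF_SIZE):
    """Yield pieces of text the way C fgets(buf, size, fp) would:
    at most size - 1 characters each, stopping after a newline."""
    stream = io.StringIO(text)
    while True:
        chunk = stream.readline(size - 1)
        if not chunk:
            break
        yield chunk


def kept_sequence(text):
    """Keep only chunks whose first character is not '>' or ';'.

    This models just the line[0] test of the quoted loop, nothing else.
    """
    kept = []
    for chunk in fgets_chunks(text):
        if chunk[0] not in ">;":
            kept.append(chunk.rstrip("\n"))
    return "".join(kept)


if __name__ == "__main__":
    # One physical line whose 512th character is '>'.
    long_line = "A" * 511 + ">" + "C" * 10 + "\n"
    print([len(c) for c in fgets_chunks(long_line)])
    print(len(kept_sequence(long_line)))
```

In this model the second chunk starts with '>' and is dropped whole, so
the characters after position 511 never reach the sequence. Whether the
real getseq.c does the same would still need an actual test.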
>
>
> So, ignore the Wikipedia entry. It was written by a
> pedant. To respond in kind, the world parses "the
> NCBI FASTA format" and not "the Pearson FASTA format" ;)
>
> The NCBI FASTA format is at
>  http://www.ncbi.nlm.nih.gov/blast/fasta.shtml
>
>  Andrew
>  dalke at dalkescientific.com
>
>
>
> _______________________________________________
> biology-in-python mailing list
> biology-in-python at lists.idyll.org
> http://lists.idyll.org/listinfo/biology-in-python
>