[bip] formats in biology

Thu Aug 2 11:12:59 PDT 2007

-> How conservative is "conservative"?  How liberal is "liberal"?

Despite our best attempts to make it so, the world is not particularly
digital.  Most of the time, simple parsers work for my purposes.

The agile approach is to solve problems as they come up and not to plan
ahead for all possible contingencies.  So, I would suggest the
following:

 - write parsers that work for the purpose of the parser users.  (That
   could be you, or your research group, or all Python programmers
   everywhere.)

 - write automated tests that verify that the parsers don't "bit rot" or
   regress, that is, that they continue working predictably on "live"
   files of the type that you are interested in.

 - as you discover exceptions or corner cases to the format you're
   parsing, fix your parser and then add the corner cases or exceptions
   to the automated test suite.
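
To make that loop concrete, here is a minimal sketch in Python.  The
parse_fasta function and the test names are made up for illustration;
this is not the corebio parser or any other library's API.  It assumes
the only corner cases found so far are blank lines and old-style ';'
comment lines, each of which got a test when it turned up.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical, deliberately simple parser: it handles only the
    cases its users have actually run into so far.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # corner case found in a "live" file: blank lines
        if line.startswith(";"):
            continue  # corner case: old-style ';' comment lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data before first '>' header")
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Regression tests: one per behavior we rely on, plus one per corner
# case discovered later.  Runnable with pytest or by calling directly.
def test_two_records():
    data = [">seq1 demo", "ACGT", "TTGA", ">seq2", "GG"]
    assert list(parse_fasta(data)) == [("seq1 demo", "ACGTTTGA"),
                                       ("seq2", "GG")]


def test_semicolon_comment_is_skipped():
    data = [">seq1", "; a comment line", "ACGT"]
    assert list(parse_fasta(data)) == [("seq1", "ACGT")]


def test_blank_lines_are_skipped():
    data = [">seq1", "AC", "", "GT"]
    assert list(parse_fasta(data)) == [("seq1", "ACGT")]
```

When the next odd file shows up, the fix goes into parse_fasta and a
new test_ function goes next to these, so the old cases stay covered.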

The parser problem is annoying, yes.  But the purpose of my work is to
get sh*t done, not to spend a lot of time worrying about whether or not
a ; is going to break my parser!

I mean, while this "formats" discussion has been going on, I've been
working on an annoyingly large metagenomics analysis.  I've managed to
parse a number of FASTA files using the no doubt hideously inadequate
corebio FASTA parser.  I'm sure that at some point in the future I will
run into a file that cannot be nicely parsed with that FASTA parser.
When that happens, I will address it.  Why go begging for trouble??

Now, if anyone is proposing to actually *write* a FASTA parser that
handles the entire format spec from 800 A.D., that's great.  Let me know
when it's done and I'll use it ;)

cheers,
--titus