[bip] Blog post on bioinformatics and Python

Andrew Dalke dalke at dalkescientific.com
Thu Sep 18 10:17:39 PDT 2008


(Resend since the first time I only sent to Peter instead of the list.)

On Sep 18, 2008, at 6:42 PM, Peter wrote:

> In order to maintain backwards compatibility, we are unfortunately
> stuck with the compressed variable names now (although we could add
> aliases).
>

One possible migration path is to have the deprecated property
lookups issue a warning.  It depends on how you want code to
change in the future - keep using the existing parser calls, or
create a "blast parser, v2" API?

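As a minimal sketch of that migration path (the class and the
attribute names qry_len / query_length are hypothetical, made up
for illustration and not taken from Biopython), the old compressed
name can stay as a property which warns and forwards to the new one:

```python
import warnings


class Record(object):
    """Hypothetical record; not a real Biopython class."""

    def __init__(self, query_length):
        # New, spelled-out attribute name.
        self.query_length = query_length

    @property
    def qry_len(self):
        # Old compressed name: still works, but tells the caller to move on.
        warnings.warn("qry_len is deprecated; use query_length",
                      DeprecationWarning, stacklevel=2)
        return self.query_length
```

Existing code which reads record.qry_len keeps working, and the
warning points at the caller's line because of stacklevel=2.
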
Titus' parser uses pyparsing for the parsing, which means building
expressions like:

         alignment_triple = Literal("Query:").suppress() + \
                            Word(nums).setParseAction(make_int) + \
                            Word(gapped_sequence) + \
                            Word(nums).setParseAction(make_int) + \
                            LineEnd().suppress() + \
                            SkipTo(LineEnd().suppress()) + \
                            Literal("Sbjct:").suppress() + \
                            Word(nums).setParseAction(make_int) + \
                            Word(gapped_sequence) + \
                            Word(nums).setParseAction(make_int)

Before going further - for things like this, pyparsing is a
much more appropriate choice than ply.  Though I really
like ply.

One of the things I learned from Martel was that expressions
written like this are hard to read.  There's a lot
in the way.  But regular expressions are also hard to read.


Here's part of the corresponding code in Biopython:

     # Match a space, if one is available.  Masahir Ishikawa found a
     # case where there's no space between the start and the sequence:
     # Query: 100tt 101
     # line below modified by Yair Benita, Sep 2004
     # Note that the colon is not always present. 2006
     _query_re = re.compile(r"Query(:?) \s*(\d+)\s*(.+) (\d+)")

The definition looks easy (and note the support for wild-type BLAST).
The handler code ... not so simple.

     def query(self, line):
         m = self._query_re.search(line)
         if m is None:
             raise ValueError, "I could not find the query in line\n%s" % line

         # line below modified by Yair Benita, Sep 2004.
         # added the end attribute for the query
         colon, start, seq, end = m.groups()
         self._hsp.query = self._hsp.query + seq
         if self._hsp.query_start is None:
             self._hsp.query_start = _safe_int(start)

         # line below added by Yair Benita, Sep 2004.
         # added the end attribute for the query
         self._hsp.query_end = _safe_int(end)
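
To see what the handler gets back from m.groups(), here is a small
sketch which applies the same pattern to three sample lines.  The
first two lines are made up for illustration; the third is the case
mentioned in the comment above the pattern.

```python
import re

# Same pattern as _query_re above.
query_re = re.compile(r"Query(:?) \s*(\d+)\s*(.+) (\d+)")

samples = [
    "Query: 1   ACGT-ACGT 8",   # usual form (hypothetical line)
    "Query 1   ACGT-ACGT 8",    # colon missing (hypothetical line)
    "Query: 100tt 101",         # no space between start and sequence
]

# Each result is a (colon, start, seq, end) tuple of strings.
parsed = [query_re.search(line).groups() for line in samples]
for groups in parsed:
    print(groups)
```

The start and end values come back as strings, which is why the
handler still has to convert them.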


The _safe_int code exists because of 1) commas in some numbers (which
isn't relevant here) and 2) the Python of 9 years ago predated the
int/long unification, so there was support for converting to a long.
I think it is used so widely out of habit - "others used it so there
must be a reason; I'll use it because it's safer that way."
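
For reference, a rough sketch of what such a helper needs to do on
a Python with unified integers.  This is an illustration written for
this post, not Biopython's actual _safe_int, and the name safe_int
is made up:

```python
def safe_int(text):
    """Convert text such as '1,234' to an int (illustrative only)."""
    # Drop thousands separators; int() already handles large values,
    # so no separate fallback to long is needed.
    return int(text.replace(",", ""))
```

For the start and end fields matched by (\d+) there are never any
commas, so a plain int() would do.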


There's also a _safe_float because BLAST could report some floats
in the form "e-172".  That is, without a leading number.

     # Thomas Rosleff Soerensen (rosleff at mpiz-koeln.mpg.de) noted that
     # float('e-172') does not produce an error on his platform.  Thus,
     # we need to check the string for this condition.

     # Sometimes BLAST leaves off the '1' in front of an exponent.
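
A minimal sketch of that check, assuming the only problem is a
missing leading "1".  Again, this is an illustration rather than
Biopython's actual _safe_float, and the name safe_float is made up:

```python
def safe_float(text):
    """Convert text such as 'e-172' to a float (illustrative only)."""
    text = text.strip()
    # A value which starts with the exponent marker has lost its
    # leading '1', so put it back before converting.
    if text[:1] in ("e", "E"):
        text = "1" + text
    return float(text)
```

Checking the string explicitly, instead of relying on float() to
raise an error, avoids the platform difference noted in the comment.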


How much diversity in wild-type BLAST does a parser need to
support these days?


"All we are saying,
is give BLAST-XML a chance."


				Andrew
				dalke at dalkescientific.com




