[bip] Blog post on bioinformatics and Python
Andrew Dalke
dalke at dalkescientific.com
Thu Sep 18 10:17:39 PDT 2008
(Resend since the first time I only sent to Peter instead of the list.)
On Sep 18, 2008, at 6:42 PM, Peter wrote:
> In order to maintain backwards compatibility, we are unfortunately
> stuck with the compressed variable names now (although we could add
> aliases).
>
One possible migration path is to have the deprecated property
lookups issue a warning. It depends on how you want code to
change in the future: keep using the existing parser calls, or
create a "blast parser, v2" API?
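A minimal sketch of that first option, to make it concrete. The class
and both attribute names (HSP, query_start, qstart) are made up for
illustration and are not Biopython's actual names; the idea is only
that the old compressed name stays readable but warns on every lookup.

```python
import warnings


class HSP(object):
    """Hypothetical record with a readable name and a deprecated alias."""

    def __init__(self, query_start=None):
        self.query_start = query_start

    @property
    def qstart(self):
        # Hypothetical compressed legacy name: warn, then forward
        # the lookup to the new attribute.
        warnings.warn(
            "qstart is deprecated; use query_start",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.query_start


# Record the warnings so the effect of one legacy lookup is visible.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    hsp = HSP(query_start=100)
    value = hsp.qstart
```

Old code keeps working, and anyone running with warnings enabled sees
which lookups need to change before the alias is removed.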
Titus' parser uses pyparsing for the parsing, which means building
expressions like:
  alignment_triple = Literal("Query:").suppress() + \
      Word(nums).setParseAction(make_int) + \
      Word(gapped_sequence) + \
      Word(nums).setParseAction(make_int) + \
      LineEnd().suppress() + \
      SkipTo(LineEnd().suppress()) + \
      Literal("Sbjct:").suppress() + \
      Word(nums).setParseAction(make_int) + \
      Word(gapped_sequence) + \
      Word(nums).setParseAction(make_int)
Before going further - for things like this, pyparsing is a
much more appropriate choice than ply. Though I really
like ply.
One of the things I learned from Martel was that expressions
written like this are hard to read. There's a lot of syntax
in the way. But regular expressions are also hard to read.
Here's part of the corresponding code in Biopython:
  # Match a space, if one is available. Masahir Ishikawa found a
  # case where there's no space between the start and the sequence:
  # Query: 100tt 101
  # line below modified by Yair Benita, Sep 2004
  # Note that the colon is not always present. 2006
  _query_re = re.compile(r"Query(:?) \s*(\d+)\s*(.+) (\d+)")
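To see what that pattern captures, here is a small standalone check.
The regular expression is copied from the snippet above; the sample
lines are made up for illustration, except for the "100tt 101" case
taken from the comment.

```python
import re

# Pattern copied from the Biopython snippet above.
query_re = re.compile(r"Query(:?) \s*(\d+)\s*(.+) (\d+)")

# Illustrative input lines, not real BLAST output.
samples = [
    "Query: 1 acgtacgt 8",   # usual layout
    "Query: 100tt 101",      # no space between start and sequence
    "Query 1 acgtacgt 8",    # colon missing
]

# Each match yields (colon, start, sequence, end) as strings.
results = [query_re.search(line).groups() for line in samples]
for line, groups in zip(samples, results):
    print("%-22s -> %r" % (line, groups))
```

All four groups come back as strings, so the handler still has to
convert the start and end positions itself.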
The definition looks easy (and note the support for wild-type BLAST).
The handler code ... not so simple.
  def query(self, line):
      m = self._query_re.search(line)
      if m is None:
          raise ValueError("I could not find the query in line\n%s" % line)

      # line below modified by Yair Benita, Sep 2004.
      # added the end attribute for the query
      colon, start, seq, end = m.groups()
      self._hsp.query = self._hsp.query + seq
      if self._hsp.query_start is None:
          self._hsp.query_start = _safe_int(start)

      # line below added by Yair Benita, Sep 2004.
      # added the end attribute for the query
      self._hsp.query_end = _safe_int(end)
The _safe_int code exists because 1) some numbers contain commas
(which isn't relevant here) and 2) the Python of 9 years ago predated
the int/long unification, so it needed a fallback that converted to a
long. I think it's used so widely out of habit - "others used it so
there must be a reason; I'll use it because it's safer that way."
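As a rough sketch of what such a helper amounts to on a Python with
unified integers (this is my own illustration under the name safe_int,
not the Biopython implementation): only the comma handling is left,
because int() already copes with arbitrarily large values.

```python
def safe_int(text):
    """Convert strings such as "1,234" to an int by dropping commas."""
    try:
        return int(text)
    except ValueError:
        # Retry without thousands separators; a second failure
        # propagates as a normal ValueError.
        return int(text.replace(",", ""))
```

For the start and end positions matched by (\d+) in the query pattern,
a plain int() call would do the same job.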
There's also a _safe_float because BLAST could report some floats
in the form "e-172". That is, without a leading number.
  # Thomas Rosleff Soerensen (rosleff at mpiz-koeln.mpg.de) noted that
  # float('e-172') does not produce an error on his platform. Thus,
  # we need to check the string for this condition.
  # Sometimes BLAST leaves off the '1' in front of an exponent.
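A sketch of the check those comments describe, again as my own
illustration (safe_float is a stand-in name, and the comma handling is
an assumption carried over from the integer case): put the missing '1'
back before handing the string to float().

```python
def safe_float(text):
    """Convert BLAST-style floats, including forms such as "e-172"."""
    # Restore the leading '1' when the string starts at the exponent.
    if text[:1] in ("e", "E"):
        text = "1" + text
    # Assumption: commas may also appear as thousands separators.
    return float(text.replace(",", ""))
```

Checking the string explicitly avoids depending on how a given
platform's float() treats a bare exponent.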
How much diversity in wild-type BLAST does a parser need to
support these days?
"All we are saying,
is give BLAST-XML a chance."
Andrew
dalke at dalkescientific.com