[bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd
Andrew Dalke
dalke at dalkescientific.com
Thu Feb 7 20:02:33 PST 2008
Yes, this isn't "Biology in C++". Sue me ;)
Here's a C++ BLASTP parser comparable to what was in that paper. The
output is byte-identical, and it takes about 40 LOC instead of 80. They
reported needing 41 LOC for Python.
Unlike theirs, this code also does a bit of extra error checking,
and there's no way that evil data can force buffer overflows here.
I'm not a C++ programmer any more - haven't been for 10 years - so
there are probably cleaner ways to do some of this.
// Implementation of the BLASTP parser based on the requirements from
//   http://www.bioinformatics.org/benchmark/
// Written by Andrew Dalke <dalke at dalkescientific.com>
// 8 Feb 2008, Gothenburg, Sweden
// Share and enjoy.
#include <cstdlib>   // exit()
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

// Find the first word (between whitespace) after the next '=' character.
void word_after_equals(const string &s, size_t pos,
                       size_t &start, size_t &end) {
  start = s.find_first_not_of(' ', s.find_first_of('=', pos) + 1);
  end = s.find_first_of(' ', start);
}

int main() {
  ifstream in("/Users/dalke/nbn_courses/nbn_winter_05/ncbi_blastp_2.2.3.txt");
  size_t start, end;
  if (in.fail()) {
    cerr << "Cannot open file!" << endl; exit(1);
  }
  string line, bits, e, identities, positives;
  // Loop on getline() itself so a failed read at EOF is never processed
  while (getline(in, line)) {
    if (!line.empty() && line[0] == '>') {
      string name(line, 1, line.length()-1);
      while (getline(in, line)) {
        // 10 spaces + "Length" = the 16 characters compared here
        if (line.compare(0, 16, "          Length") == 0) {
          break;
        }
        // the '-1' is so I get the space between each line
        // (assumes continuation lines start with at least one space)
        name.append(line, line.find_first_not_of(' ')-1, line.length());
      }
      // replace "," for "." so this can be used as a spreadsheet
      for (string::iterator it=name.begin(); it<name.end(); it++) {
        if (*it == ',') *it = '.';
      }
      // blank line
      getline(in, line);
      // " Score = 177 bits (448), Expect = 2e-43"
      getline(in, line);
      if (line.compare(0, 8, " Score =")) {
        cerr << "bad score: '" << line << "'" << endl; exit(1);
      }
      word_after_equals(line, 0, start, end);
      bits = line.substr(start, end-start);
      word_after_equals(line, end, start, end);
      e = line.substr(start, end-start);
      // " Identities = 85/111 (76%), Positives = 97/111 (86%),
      //   Gaps = 1/111 (0%)"
      getline(in, line);
      if (line.compare(0, 13, " Identities =")) {
        cerr << "bad identities: '" << line << "'" << endl; exit(1);
      }
      word_after_equals(line, 0, start, end);
      identities = line.substr(start, end-start);
      word_after_equals(line, end, start, end);
      positives = line.substr(start, end-start);
      cout << name << ',' << bits << ',' << e << ',' <<
        identities << ',' << positives << endl;
    }
  }
}
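
To see what word_after_equals() does in isolation, here is a small
sketch that applies it to the two sample lines quoted in the comments
above. The helper is copied from the parser; first_two_fields() is a
hypothetical wrapper added only for this illustration and is not part
of the benchmark code.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Same helper as in the parser: locate the first space-delimited
// word after the next '=' found at or after pos.
void word_after_equals(const std::string &s, std::size_t pos,
                       std::size_t &start, std::size_t &end) {
  start = s.find_first_not_of(' ', s.find_first_of('=', pos) + 1);
  end = s.find_first_of(' ', start);
}

// Hypothetical wrapper (an assumption for this example): pull out
// the words following the first two '=' characters on a line.
std::pair<std::string, std::string>
first_two_fields(const std::string &line) {
  std::size_t start, end;
  word_after_equals(line, 0, start, end);
  std::string first = line.substr(start, end - start);
  // Resume the search where the previous word ended. If the word
  // runs to the end of the line, end is npos and substr() simply
  // returns the rest of the string.
  word_after_equals(line, end, start, end);
  std::string second = line.substr(start, end - start);
  return std::make_pair(first, second);
}
```

Calling first_two_fields(" Score = 177 bits (448), Expect = 2e-43")
gives the pair ("177", "2e-43"), which are the bits and e fields the
parser writes out. Because every position comes from std::string
searches and substr() throws on an out-of-range start, a malformed
line ends in an exception rather than a read past the buffer.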
Andrew
dalke at dalkescientific.com