[bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd
Andrew Dalke
dalke at dalkescientific.com
Fri Feb 8 15:07:44 PST 2008
On Feb 8, 2008, at 5:02 AM, Andrew Dalke wrote:
> Yes, this isn't "Biology in C++". Sue me ;)
No lawsuits yet ...but this should be t
> I'm not a C++ programmer any more - haven't been for 10 years - so
> there are probably cleaner ways to do some of this.
Duh! I forgot about sscanf. That's been the way to do this sort of
lightweight parsing in C for 30+ years, and it works just fine with
C++ strings. It does about as much validation as the regexp patterns do.
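To make the sscanf idea concrete, here is a minimal sketch for the score line alone. The helper name parse_score_line is made up for this illustration, and requiring a return value of exactly 2 is my choice for "both fields converted"; treat it as one way to do it, not the only one.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (name invented for this sketch): pulls the bit score
// and the E-value out of a BLASTP score line such as
//   " Score = 177 bits (448), Expect = 2e-43"
// Returns true only if both fields were converted.
bool parse_score_line(const std::string& line,
                      std::string& bits, std::string& expect) {
    char b[100], e[100];
    // %99s bounds each field to its buffer; %*s reads and discards "(448),".
    // Whitespace in the format matches any run of whitespace in the input.
    if (std::sscanf(line.c_str(), " Score = %99s bits %*s Expect = %99s",
                    b, e) != 2) {
        return false;  // covers 0, 1 and EOF (-1)
    }
    bits = b;
    expect = e;
    return true;
}
```

Comparing against 2 rather than testing for a zero return matters for empty input: sscanf then returns EOF, which is non-zero and would otherwise pass as a match.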
The implementation is now at 33 lines, down from the 40 LOC I reported
earlier. The implementations in the paper were:
Perl - 26 lines
Java - 33
C# - 36
Python - 41
C++ - 81
C - 82
I can remove 4 lines if I omit error checking - I do stronger
format validation than the paper does - and another 2 if I don't fold
a couple of expressions over two lines.
The C version should be only 10 or so lines longer, mostly to handle
reallocing the 'name' and checking for fgets failure. BLASTP lines
don't get over 100 characters, so it's easy to call anything longer
a format error and exit.
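A rough sketch of that fgets handling, written C-style but compilable as C++. The names read_line and ReadStatus are invented for this example, and treating a completely filled buffer as "too long" is a deliberately conservative assumption.

```cpp
#include <cstdio>
#include <cstring>

// Outcome of one read attempt.
enum ReadStatus { READ_OK, READ_EOF, READ_TOO_LONG };

// Hypothetical helper: reads one line into buf (capacity cap) with fgets and
// strips the trailing newline. A line that does not fit in the buffer is
// reported as READ_TOO_LONG so the caller can call it a format error.
ReadStatus read_line(std::FILE* fp, char* buf, std::size_t cap) {
    if (std::fgets(buf, static_cast<int>(cap), fp) == NULL) {
        return READ_EOF;  // end of file or read error
    }
    std::size_t n = std::strlen(buf);
    if (n > 0 && buf[n - 1] == '\n') {
        buf[n - 1] = '\0';
        return READ_OK;
    }
    // No newline seen. If the buffer is not full, this was the last line of
    // a file without a trailing newline; if it is full, treat it as too long.
    return (n + 1 < cap) ? READ_OK : READ_TOO_LONG;
}
```

With a buffer of 128 characters or so, every legitimate BLASTP line fits and anything that triggers READ_TOO_LONG can simply be reported and the program exited.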
// Implementation of the BLASTP parser based on the requirements from
//   http://www.bioinformatics.org/benchmark/
// Written by Andrew Dalke <dalke at dalkescientific.com>
// 8 Feb 2008, Gothenburg, Sweden
// Share and enjoy.
#include <iostream>
#include <fstream>
#include <string>
#include <cstdio>   // sscanf
#include <cstdlib>  // exit
using namespace std;

int main() {
  ifstream in("/Users/dalke/nbn_courses/nbn_winter_05/ncbi_blastp_2.2.3.txt");
  if (in.fail()) {
    cerr << "Cannot open file!" << endl; exit(1);
  }
  string line;
  char bits[100], e[100], identities[100], positives[100];
  // Loop on getline itself so that a failed read ends the loop
  while (getline(in, line)) {
    if (!line.empty() && line[0] == '>') {
      string name(line, 1, line.length()-1);
      while (getline(in, line)) {
        // 10 spaces + "Length" are the 16 characters compared
        if (line.compare(0, 16, "          Length") == 0) {
          break;
        }
        // -1 so I get the space between each line
        // (assumes the continuation lines are indented)
        name.append(line, line.find_first_not_of(' ')-1, line.length());
      }
      // replace "," with "." so this can be used as a spreadsheet
      for (string::iterator it=name.begin(); it<name.end(); it++) {
        if (*it == ',') *it = '.';
      }
      // blank line
      getline(in, line);
      // " Score = 177 bits (448), Expect = 2e-43"
      getline(in, line);
      // sscanf returns the number of fields converted; both are required
      if (sscanf(line.c_str(), " Score = %99s bits %*s Expect = %99s",
                 bits, e) != 2) {
        cerr << "bad score: '" << line << "'" << endl; exit(1);
      }
      // " Identities = 85/111 (76%), Positives = 97/111 (86%),
      //   Gaps = 1/111 (0%)"
      getline(in, line);
      if (sscanf(line.c_str(), " Identities = %99s %*s Positives = %99s",
                 identities, positives) != 2) {
        cerr << "bad identities: '" << line << "'" << endl; exit(1);
      }
      cout << name << ',' << bits << ',' << e << ',' <<
        identities << ',' << positives << endl;
    }
  }
}
Andrew
dalke at dalkescientific.com