[bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd

Thu Feb 7 20:02:33 PST 2008

Yes, this isn't "Biology in C++".  Sue me ;)

Here's a C++ BLASTP parser comparable to what was in that paper.  The
output is byte-identical, and it takes about 40 LOC instead of 80.  They
reported needing 41 LOC for Python.

This code, unlike theirs, does a bit of extra error checking,
and there's no way that evil data can force buffer overflows here.
I'm not a C++ programmer any more - haven't been for 10 years - so
there are probably cleaner ways to do some of this.

// Implementation of the BLASTP parser based on the requirements from
//   http://www.bioinformatics.org/benchmark/
// Written by Andrew Dalke <dalke at dalkescientific.com>
// 8 Feb 2008, Gothenburg, Sweden
// Share and enjoy.

#include <cstdlib>   // exit
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

// Find the first word (between whitespace) after the next '=' character.
void word_after_equals(const string &s, size_t pos,
                        size_t &start, size_t &end) {
   start = s.find_first_not_of(' ', s.find_first_of('=', pos) + 1);
   end = s.find_first_of(' ', start);
}

int main() {
   ifstream in("/Users/dalke/nbn_courses/nbn_winter_05/ncbi_blastp_2.2.3.txt");
   size_t start, end;
   if (in.fail()) {
     cerr << "Cannot open file!" << endl; exit(1);
   }
   string line, bits, e, identities, positives;
   // loop on getline itself so a failed read is never processed
   while (getline(in, line)) {
     if (!line.empty() && line[0] == '>') {
       string name(line, 1, line.length()-1);
       while (getline(in, line)) {
         if (line.compare(0, 16, "          Length") == 0) {
           break;
         }
         // the '-1' is so I get the space between each line
         name.append(line, line.find_first_not_of(' ')-1, line.length());
       }
       // replace "," for "." so this can be used as a spreadsheet
       for (string::iterator it=name.begin(); it<name.end(); it++) {
         if (*it == ',') *it = '.';
       }

       // blank line
       getline(in, line);

       // " Score =  177 bits (448), Expect = 2e-43"
       getline(in, line);
       if (line.compare(0, 8, " Score =")) {
         cerr << "bad score: '" << line << "'" << endl; exit(1);
       }
       word_after_equals(line, 0, start, end);
       bits = line.substr(start, end-start);
       word_after_equals(line, end, start, end);
       e = line.substr(start, end-start);

       // " Identities = 85/111 (76%), Positives = 97/111 (86%), Gaps = 1/111 (0%)"
       getline(in, line);
       if (line.compare(0, 13, " Identities =")) {
         cerr << "bad identities: '" << line << "'" << endl; exit(1);
       }
       word_after_equals(line, 0, start, end);
       identities = line.substr(start, end-start);
       word_after_equals(line, end, start, end);
       positives = line.substr(start, end-start);

       cout << name << ',' << bits << ',' << e << ','
            << identities << ',' << positives << endl;
     }
   }
}
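
The field extraction can be checked without the BLAST file. Below is a minimal sketch that runs word_after_equals on the two sample lines quoted in the comments above. The names field_after_equals and check_samples are hypothetical helpers added for illustration; they are not part of the program above.

```cpp
// Standalone sketch: field_after_equals and check_samples are
// hypothetical names, not part of the parser above.
#include <cassert>
#include <string>
using namespace std;

// Same helper as in the parser above.
void word_after_equals(const string &s, size_t pos,
                       size_t &start, size_t &end) {
  start = s.find_first_not_of(' ', s.find_first_of('=', pos) + 1);
  end = s.find_first_of(' ', start);
}

// Return the word after the next '=' at or after pos, and move pos to
// the end of that word so the next call picks up the following field.
string field_after_equals(const string &s, size_t &pos) {
  size_t start, end;
  word_after_equals(s, pos, start, end);
  pos = end;
  // substr clamps the length, so end == string::npos (the word runs to
  // the end of the line) still yields the whole word.
  return s.substr(start, end - start);
}

// Check the sample lines from the comments in the parser.
void check_samples() {
  string score = " Score =  177 bits (448), Expect = 2e-43";
  size_t pos = 0;
  assert(field_after_equals(score, pos) == "177");
  assert(field_after_equals(score, pos) == "2e-43");

  string ident = " Identities = 85/111 (76%), Positives = 97/111 (86%),"
                 " Gaps = 1/111 (0%)";
  pos = 0;
  assert(field_after_equals(ident, pos) == "85/111");
  assert(field_after_equals(ident, pos) == "97/111");
}
```

Calling check_samples() from a main() runs the checks. Note that the sketch only covers well-formed lines; a line with nothing after the '=' would make substr throw std::out_of_range rather than read out of bounds.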

				Andrew
				dalke at dalkescientific.com