[bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd

Fri Feb 8 15:07:44 PST 2008

On Feb 8, 2008, at 5:02 AM, Andrew Dalke wrote:
> Yes, this isn't "Biology in C++".  Sue me ;)

No lawsuits yet ...but this should be t

> I'm not a C++ programer any more - haven't been for 10 years - so
> there's probably cleaner ways to do some of this.

Duh!  I forgot about sscanf.  That's been the way to do this sort of
lightweight parsing in C for 30+ years, and it works just fine with
C++ strings.  It does about as much validation as the regexp patterns do.
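
A minimal sketch of the sscanf approach on a C++ string, assuming the score line looks like the sample quoted in the code comment. The names `ScoreFields` and `parse_score_line` are made up for illustration and are not part of the posted program.

```cpp
#include <cstdio>
#include <string>

// Hypothetical holder for the two fields pulled from a score line.
struct ScoreFields {
    std::string bits;
    std::string expect;
};

// Hypothetical helper: parses a line such as
//   " Score =  177 bits (448), Expect = 2e-43"
// Returns true only if both fields were read.
bool parse_score_line(const std::string& line, ScoreFields& out) {
    char bits[100], expect[100];
    // %99s keeps each write inside its 100-byte buffer; %*s skips "(448),".
    // sscanf returns the number of fields stored (or EOF), so a full match
    // is exactly 2.
    int n = std::sscanf(line.c_str(),
                        " Score = %99s bits %*s Expect = %99s",
                        bits, expect);
    if (n != 2) return false;
    out.bits = bits;
    out.expect = expect;
    return true;
}
```

Comparing the return value against the number of expected fields matters here: a line that matches only the first field makes sscanf return 1, which a plain "nonzero means success" test would accept.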

Implementation is now at 33 lines.  Earlier I reported 40 LOC.  The  
implementations in the paper were:
  Perl   - 26 lines
  Java   - 33
  C#     - 36
  Python - 41
  C++    - 81
  C      - 82

I can remove 4 lines if I omit error checking - I do a stronger
format validation than the paper does - and another 2 if I don't fold
a couple of expressions over two lines.

The C version should be only 10 or so lines longer, and that is mostly
to handle reallocing the 'name' and checking for fgets failure.
BLASTP lines don't get over 100 characters so it's easy to call that
a format error and exit.
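
A rough sketch of the fgets handling such a C version would need, written C-style but compilable as C++. The names `ReadStatus` and `read_line` and the choice of buffer size are assumptions for illustration, not code from the benchmark or from this post.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical result codes for one line read.
enum ReadStatus { READ_OK, READ_EOF, READ_TOO_LONG };

// Reads one line into buf (capacity cap) and strips the trailing newline.
// A line that does not fit in the buffer is reported as READ_TOO_LONG so
// the caller can treat it as a format error.
ReadStatus read_line(std::FILE* fp, char* buf, std::size_t cap) {
    if (std::fgets(buf, static_cast<int>(cap), fp) == NULL) {
        return READ_EOF;               // end of file or read error
    }
    std::size_t len = std::strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';           // normal case: complete line
        return READ_OK;
    }
    // No newline was stored. If the buffer is not full, this was the last
    // line of a file without a final newline; otherwise the line was cut.
    if (len + 1 < cap) return READ_OK;
    return READ_TOO_LONG;
}
```

With a buffer a little over 100 bytes, a caller could print an error and exit on READ_TOO_LONG, which is the "call it a format error" shortcut described above.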

// Implementation of the BLASTP parser based on the requirements from
//   http://www.bioinformatics.org/benchmark/
// Written by Andrew Dalke <dalke at dalkescientific.com>
// 8 Feb 2008, Gothenburg, Sweden
// Share and enjoy.

#include <iostream>
#include <fstream>
#include <string>
#include <stdio.h>   // sscanf
#include <stdlib.h>  // exit
using namespace std;

int main() {
   ifstream in("/Users/dalke/nbn_courses/nbn_winter_05/ncbi_blastp_2.2.3.txt");
   if (in.fail()) {
     cerr << "Cannot open file!" << endl; exit(1);
   }

   size_t start, end;
   string line;
   char bits[100], e[100], identities[100], positives[100];

   // Loop on getline itself so a failed read is never processed
   while (getline(in, line)) {
     if (!line.empty() && line[0] == '>') {
       string name(line, 1, line.length()-1);
       while (getline(in, line)) {
         if (line.compare(0, 16, "          Length") == 0) {
           break;
         }
         // -1 so I get the space between each line
         name.append(line, line.find_first_not_of(' ')-1, line.length());
       }
       // replace "," with "." so this can be used as a spreadsheet
       for (string::iterator it=name.begin(); it<name.end(); it++) {
         if (*it == ',') *it = '.';
       }

       // blank line
       getline(in, line);

       // " Score =  177 bits (448), Expect = 2e-43"
       getline(in, line);
       // sscanf returns the number of fields stored, so require both
       if (sscanf(line.c_str(), " Score = %99s bits %*s Expect = %99s",
                   bits, e) != 2) {
         cerr << "bad score: '" << line << "'" << endl; exit(1);
       }

       // " Identities = 85/111 (76%), Positives = 97/111 (86%),
       //   Gaps = 1/111 (0%)"
       getline(in, line);
       if (sscanf(line.c_str(), " Identities = %99s %*s Positives = %99s",
                   identities, positives) != 2) {
         cerr << "bad identities: '" << line << "'" << endl; exit(1);
       }

       std::cout << name << ',' << bits << ',' << e << ','
                 << identities << ',' << positives << endl;
     }
   }
}
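
The name-joining step above relies on every continuation line being indented, since `find_first_not_of(' ')-1` wraps around when a line has no leading space or is blank. A guarded variant might look like the sketch below; `append_continuation` and `commas_to_periods` are hypothetical names, and the sketch always inserts a single space rather than reusing one from the line's indent.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: appends one continuation line of a BLASTP hit name,
// replacing its leading indent with a single separating space.
void append_continuation(std::string& name, const std::string& line) {
    std::string::size_type first = line.find_first_not_of(' ');
    if (first == std::string::npos) return;   // blank line: nothing to add
    name += ' ';
    name.append(line, first, std::string::npos);
}

// Hypothetical helper: swaps "," for "." so the name cannot break the
// comma-separated output.
void commas_to_periods(std::string& name) {
    std::replace(name.begin(), name.end(), ',', '.');
}
```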

				Andrew
				dalke at dalkescientific.com