[bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd

Thu Feb 7 17:49:02 PST 2008

On Feb 7, 2008, at 10:30 PM, Paulo Nuin wrote:
> I am trying to do a blastp with the sequence ID from the paper and  
> I am not getting a 9.8 Gb file. Not even 9.8 Mb. I have tested a  
> couple of methods, even using Geneious to do the blast. Anyone else  
> tried to obtain this file?
> I wanted to redo the "benchmarks" with some code modifications.

Same here.  I'm trying to get a test set for the RT code.  All I have  
to go on is

     In the test example 76 Hantavirus segment L sequences were used
     with an overall alignment length of 6580 nucleotides.

Any idea of which sequences?

Has anyone figured out how the LOC counts were generated?  This gives  
the right answer of 119 for alignment.c:

egrep -v '^ *\*' alignment.c | perl -pe 's/\t/ /g' | egrep -v '^ *$'  
| egrep -v '^ *//' | egrep -v '^ */\*' | egrep -v '^ *\}' | wc -l

Broken down that's

egrep -v '^ *\*' alignment.c |   # remove comment continuations
     perl -pe 's/\t/ /g' |        # replace tabs with spaces
     egrep -v '^ *$' |            # remove blank lines
     egrep -v '^ *//' |           # remove C++-style comments (not legal in C90)
     egrep -v '^ */\*' |          # remove lines which start a comment
     egrep -v '^ *\}' |           # remove lines which start with a "}"
     wc -l
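
To run the same filter over more than one file, the chain can be wrapped in a shell function.  This is only a sketch of my guess at the counting rule, not the paper's actual method.  The name count_loc is made up for illustration, and I've swapped in "grep -E" for egrep and "tr" for the perl step, which should be equivalent here.

```shell
# count_loc FILE
# Hypothetical helper: applies the same filters, in the same order,
# to one C source file and prints the number of lines that survive.
count_loc() {
    grep -E -v '^ *\*' "$1" |     # remove comment continuations
        tr '\t' ' ' |             # replace tabs with spaces
        grep -E -v '^ *$' |       # remove blank lines
        grep -E -v '^ *//' |      # remove C++-style comments
        grep -E -v '^ */\*' |     # remove lines which start a comment
        grep -E -v '^ *\}' |      # remove lines which start with a "}"
        wc -l |
        tr -d ' '                 # some wc implementations pad with spaces
}
```

Calling "count_loc alignment.c" then prints a single number, and the per-file numbers can be added up by hand or with expr.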

However, there's dead code in that module.

char* insert(char* str, char car){
   int l=strlen(str);
   str=(char*)malloc(sizeof(char) * l+1);
   str[l]=car;
   return str;
}

is never referenced.

If I use the same filter to get the line count for NJ.c and reader.c  
I get 175+77 = 252 while they quote 240.  For "parser.c" and  
"parseRE.c" I get 81, instead of 82.

And it should be shorter.  Given

       size_line=strlen(line);
       line[size_line-1] = '\0';
       size_line--;

     if(line[0] == '>'){
       // don't copy >
       memcpy(name, line+1, size_line-1);

there are two places where the right-side character is chopped off.  
There should be only one.  That's a bug.  Anyone surprised?  More  
seriously, though, since there's no way to tell what it means to be  
right, there's no reason to have some of this code at all.

				Andrew
				dalke at dalkescientific.com