[bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd

Paulo Nuin nuin at genedrift.org
Thu Feb 7 18:48:15 PST 2008


Andrew Dalke wrote:
> On Feb 7, 2008, at 10:30 PM, Paulo Nuin wrote:
>   
>> I am trying to do a blastp with the sequence ID from the paper and  
>> I am not getting a 9.8 Gb file. Not even 9.8 Mb. I have tested a  
>> couple of methods, even using Geneious to do the BLAST. Anyone else  
>> tried to obtain this file?
>> I wanted to redo the "benchmarks" with some code modifications.
>>     
>
> Same here.  I'm trying to get a test set for the RT code.  All I have  
> to go on is
>
>      In the test example 76 Hantavirus segment L sequences were used
>      with an overall alignment length of 6580 nucleotides.
>
> Any idea of which sequences?
>   
I will check that, but NCBI lists 163 nucleotide sequences for 
Hantavirus segment L, and they are similar in size to the one reported. 
The only open question is which ones were used.

http://www.ncbi.nlm.nih.gov/sites/entrez?db=nuccore&cmd=search&term=%20Hantavirus%20segment%20L
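
For anyone who wants the candidate list in a script instead of through 
the web page, here is a rough sketch that only builds the query URL. It 
assumes the NCBI E-utilities esearch endpoint (not the Entrez web URL 
above); build_esearch_url and the retmax default are names and values I 
made up for illustration, and it does not tell us which 76 sequences 
the authors picked.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities esearch endpoint, which is a different
# interface from the Entrez web page linked in this message.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nuccore", retmax=200):
    """Return an esearch URL asking for up to `retmax` record IDs.

    `build_esearch_url` and the default `retmax` are illustrative only.
    """
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return ESEARCH + "?" + query


if __name__ == "__main__":
    # Prints the URL only; fetching and parsing the result is left out.
    print(build_esearch_url("Hantavirus segment L"))
```

The URL can then be fetched with urllib.request.urlopen, keeping NCBI's 
usage guidelines in mind.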

A first rule of any scientific publication is reproducibility. That is 
not the case here!

> Has anyone figured out how the LOC counts were generated?  This gives  
> the right answer of 119 for alignment.c
>
> egrep -v '^ *\*' alignment.c | perl -pe 's/\t/ /g' | egrep -v '^ *$'  
> | egrep -v '^ *//' | egrep -v '^ */\*' | egrep -v '^ *\}' | wc -l
>
> Broken down that's
>
> egrep -v '^ *\*' alignment.c |   # remove comment continuations
>      perl -pe 's/\t/ /g' |        # replace tabs with spaces
>      egrep -v '^ *$' |            # remove blank lines
>      egrep -v '^ *//' |           # remove C++-style comments (not legal in C90)
>      egrep -v '^ */\*' |          # remove lines which start a comment
>      egrep -v '^ *\}' |           # remove lines which start with a "}"
>      wc -l
>
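For those who would rather redo the count from Python, here is a 
minimal sketch of the same filter. It follows the one-line form of the 
pipeline quoted above, stage by stage; count_loc and the pattern names 
are my own, and I have not checked that it reproduces the 119 figure on 
the actual alignment.c.

```python
import re

# First egrep stage; in the pipeline it runs BEFORE tabs are replaced.
CONTINUATION = re.compile(r"^ *\*")

# Remaining egrep -v stages, applied after tabs become spaces.
SKIP_AFTER_TABS = [
    re.compile(r"^ *$"),    # blank lines
    re.compile(r"^ *//"),   # C++-style comment lines
    re.compile(r"^ */\*"),  # lines that start a /* comment
    re.compile(r"^ *\}"),   # lines that begin with a closing brace
]


def count_loc(lines):
    """Count the lines that survive every filter stage."""
    count = 0
    for raw in lines:
        raw = raw.rstrip("\n")
        if CONTINUATION.search(raw):
            continue
        line = raw.replace("\t", " ")
        if any(p.search(line) for p in SKIP_AFTER_TABS):
            continue
        count += 1
    return count
```

Usage would be something like count_loc(open("alignment.c")). Note that 
a tab-indented "* ..." continuation line is still counted, because the 
first stage sees the tab, exactly as in the shell version.
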
> However, there's dead code in that module.
>
> char* insert(char* str, char car){
>    int l=strlen(str);
>    str=(char*)malloc(sizeof(char) * l+1);
>    str[l]=car;
>    return str;
> }
>
> is never referenced.
>
>
> If I use the same filter to get the line count for NJ.c and reader.c  
> I get 175+77 = 252 while they quote 240.  For "parser.c" and  
> "parseRE.c" I get 81, instead of 82.
>
> And it should be shorter.  Given
>
>        size_line=strlen(line);
>        line[size_line-1] = '\0';
>        size_line--;
>
>      if(line[0] == '>'){
>        // don't copy >
>        memcpy(name, line+1, size_line-1);
>
> there are two places where the right side character is chopped off.   
> There should be only one.  That's a bug.  Anyone surprised?  More  
> seriously, since there's no way to tell what it means to be right,  
> there's no reason to have some of this code at all.
>
>   
I am not surprised. I only checked the Python code, but I became very 
suspicious when I saw that the FTP dirs contain backup (~) files that 
were not included in the "analysis".

Paulo



