[bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd
Paulo Nuin
nuin at genedrift.org
Thu Feb 7 18:48:15 PST 2008
Andrew Dalke wrote:
> On Feb 7, 2008, at 10:30 PM, Paulo Nuin wrote:
>
>> I am trying to do a blastp with the sequence ID from the paper and
>> I am not getting a 9.8 Gb file. Not even 9.8 Mb. I have tested a
>> couple of methods, even using Geneious to do the blast. Anyone else
>> tried to obtain this file?
>> I wanted to redo the "benchmarks" with some code modifications.
>>
>
> Same here. I'm trying to get a test set for the RT code. All I have
> to go on is
>
> In the test example 76 Hantavirus segment L sequences were used
> with an
> overall alignment length of 6580 nucleotides.
>
> Any idea of which sequences?
>
I will check that, but NCBI lists 163 nucleotide sequences for
Hantavirus segment L, and they are similar in size to the one
reported. The only question now is which ones were used.
http://www.ncbi.nlm.nih.gov/sites/entrez?db=nuccore&cmd=search&term=%20Hantavirus%20segment%20L
The first rule of any scientific publication is reproducibility. Not in
this case!
> Has anyone figured out how the LOC counts were generated? This gives
> the right answer of 119 for alignment.c
>
> egrep -v '^ *\*' alignment.c | perl -pe 's/\t/ /g' | egrep -v '^ *$'
> | egrep -v '^ *//' | egrep -v '^ */\*' | egrep -v '^ *\}' | wc -l
>
> Broken down that's
>
> egrep -v '^ *\*' alignment.c | # remove comment continuations
> perl -pe 's/\t/ /g' | # replace tabs with spaces
> egrep -v '^ *$' | # remove blank lines
> egrep -v '^ *//' | # remove C++-style comments (not legal in C90)
> egrep -v '^ */\*' | # remove lines which start a comment
> egrep -v '^ *\}' | # remove lines containing only a "}"
> wc -l
>
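For anyone who wants to rerun the count without a shell, the quoted filter can be sketched in Python. This is only a rough equivalent of the pipeline above, not how the paper's authors counted; the function name `count_loc` is made up for illustration, and the steps are applied in the same order as the egrep/perl chain.

```python
import re


def count_loc(source: str) -> int:
    """Count lines the way the quoted egrep/perl pipeline does (hypothetical helper)."""
    lines = source.split("\n")
    # egrep -v '^ *\*' : drop comment continuation lines (runs before tab replacement)
    lines = [ln for ln in lines if not re.match(r" *\*", ln)]
    # perl -pe 's/\t/ /g' : replace each tab with a space
    lines = [ln.replace("\t", " ") for ln in lines]
    # Remaining egrep -v filters, in pipeline order
    drop_patterns = [
        r" *$",    # blank lines
        r" *//",   # C++-style comment lines
        r" */\*",  # lines which start a block comment
        r" *\}",   # lines beginning with "}"
    ]
    for pattern in drop_patterns:
        lines = [ln for ln in lines if not re.match(pattern, ln)]
    return len(lines)
```

Note that the last pattern matches any line that begins with `}`, so a line such as `} else {` is dropped as well, not only lines holding a lone brace.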
> However, there's dead code in that module.
>
> char* insert(char* str, char car){
> int l=strlen(str);
> str=(char*)malloc(sizeof(char) * l+1);
> str[l]=car;
> return str;
> }
>
> is never referenced.
>
>
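A quick way to double-check a "never referenced" claim like this one is to count how often the name is followed by an opening parenthesis. The sketch below is a crude text heuristic, not a C parser: `call_sites` is a hypothetical helper, it assumes the function is defined exactly once in the text it is given, and it will miscount prototypes, function pointers, and names that appear inside comments or strings.

```python
import re


def call_sites(source: str, name: str) -> int:
    """Rough count of uses of `name(` beyond its single definition (heuristic only)."""
    # \b keeps e.g. "reinsert(" from matching when looking for "insert("
    hits = re.findall(r"\b" + re.escape(name) + r"\s*\(", source)
    # Assumption: one of the hits is the definition itself
    return max(len(hits) - 1, 0)
```

Run over the concatenated sources of a module, a result of 0 for `insert` would be consistent with the function being dead code.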
> If I use the same filter to get the line count for NJ.c and reader.c
> I get 175+77 = 252 while they quote 240. For "parser.c" and
> "parseRE.c" I get 81, instead of 82.
>
> And it should be shorter. Given
>
> size_line=strlen(line);
> line[size_line-1] = '\0';
> size_line--;
>
> if(line[0] == '>'){
> // don't copy >
> memcpy(name, line+1, size_line-1);
>
> there are two places where the right side character is chopped off.
> There should be only one. That's a bug. Anyone surprised? Though,
> more seriously, since there's no way to tell what it means to be right,
> there's no reason to have some of this code at all.
>
>
I am not surprised. I only checked the Python code, but I was very
suspicious when I saw that the FTP dirs contained backup (~) files that
were not included in the "analysis".
Paulo