[bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd

Paulo Nuin nuin at genedrift.org
Fri Feb 8 10:43:46 PST 2008


Andrew Dalke wrote:
> On Feb 8, 2008, at 3:48 AM, Paulo Nuin wrote:
>   
>> http://www.ncbi.nlm.nih.gov/sites/entrez?db=nuccore&cmd=search&term=%20Hantavirus%20segment%20L
>>     
>
> Can anyone make the NJ codes run?
>
> I pulled out 12 sequences from that link.
>
>   GI: 33860560, 157058361, 156567221, 126695337, 124024712, 123967536,
>       123965234, 78778385, 111434066, 23464594, 148361453, 55733703
>
> The Python program gave me
>
>    File "NJ.py", line 204, in <module>
>      compute_disimilarity(N)
>    File "NJ.py", line 157, in compute_disimilarity
>      if(list_seq[i][k] == list_seq[j][k]):
> IndexError: string index out of range
>
> The Perl program NJ.pl took almost a minute to produce this output:
>
> (gi|124024712|ref|NC_:0,(((((((gi|55733703|ref|NC_0:0,((gi|148361453| 
> gb|EF58:0.000498480301364114,(gi|111434066|gb| 
> DQ82:0.000493233730785394,gi|156567221|gb|EU00:0.000498380445055715): 
> 0.00194332141965669):0.000732614428260341,gi|157058361|gb|EF64:0): 
> 0.00177930420757772):0.00143296717474478,gi|23464594|ref| 
> NC_0:0):-1.00143648699336,gi|123965234|ref|NC_:0):1.12299044463885,gi| 
> 33860560|ref|NC_0:0):1.13312404678835,gi|78778385|ref|NC_0:0): 
> 1.14160938351598,gi|123967536|ref|NC_:0):0.900240655295447,gi| 
> 126695337|ref|NC_:0.293136656759629):1.30376626963158):0
>
>
> The C program core dumped.
>
>
> I then chopped all of the sequences to be the same length; the first  
> 6530 bases.
>
> The Python program gave me
>
> ( ( ( ( ( ( gi|23464594|ref|NC_0: 1.17366651661, ( gi|111434066|gb| 
> DQ82: 0.148904182499, gi|156567221|gb|EU00: 0.151185475232):  
> 1.10169405164): 0.162975482734, ( gi|123965234|ref|NC_:  
> 0.783397347124, gi|126695337|ref|NC_: 0.908025243214):  
> 0.179207247629): 0.0571801650779, gi|33860560|ref|NC_0:  
> 0.970803605456): 0.0659893145874, gi|123967536|ref|NC_:  
> 1.00734624052): 0.0288793245842, ( gi|78778385|ref|NC_0:  
> 1.15020230813, gi|124024712|ref|NC_: 0.986748711856):  
> 0.0994181473971): 0, ( gi|55733703|ref|NC_0: 1.18567966452, ( gi| 
> 148361453|gb|EF58: 0.163944100589, gi|157058361|gb|EF64:  
> 0.346254437969): 0.984616394381): 0.154232726999): 0
>
> The Perl program gave me
>
> (((((((gi|148361453|gb|EF58:0,(((gi|123965234|ref|NC_: 
> 0.988306732076271,gi|33860560|ref|NC_0:0.865560244439941): 
> 1.01694227538299,gi|123967536|ref|NC_:0):1.23242695730548,gi| 
> 124024712|ref|NC_:0):1.26867637791678):0.320615925328267,gi|157058361| 
> gb|EF64:0):1.53542872692372,gi|126695337|ref|NC_:0.487018431249897): 
> 1.06214589526658,gi|23464594|ref|NC_0:0):1.14663660251452,(gi| 
> 111434066|gb|DQ82:0.0762888889451898,gi|156567221|gb| 
> EU00:0.223800768786119):0):0.984861244847282,gi|78778385|ref| 
> NC_0:0.515925217159195):0,gi|55733703|ref|NC_0:1.24796822390579):0
>
> I assumed they should give identical results.  As you can see, they  
> not only aren't byte identical, they don't even give the same numbers.
>
> The C program still seg faults.
>   
I will test the C and C++ versions later (and I can try the C# one at
home). Can you send me the input file you used? The trees you obtained
with Python and Perl have completely different topologies. If anyone is
interested, I can share images of the trees. If the trees are
different, how can you be sure that the algorithms are identical and
are doing the same thing in the different languages?
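
One way to put a number on "different topology" is to compare the sets
of splits (bipartitions of the taxa) that each tree implies, which gives
the Robinson-Foulds distance. The following is only a minimal sketch and
is not taken from NJ.py or NJ.pl: the function names parse_newick,
splits and rf_distance are made up for illustration, and it assumes
plain Newick strings with unquoted leaf names, no internal node labels,
and the same set of taxa in both trees. Branch lengths are ignored, and
whitespace (including line wraps added by a mail client) is stripped
before parsing.

```python
import re


def parse_newick(text):
    """Parse a Newick string into nested tuples of leaf names.

    Branch lengths are dropped. Assumes unquoted names and no internal
    node labels (a simplification for this sketch).
    """
    compact = "".join(text.split())  # remove spaces and line wraps
    tokens = re.findall(r"[(),;]|:[^(),;]*|[^(),:;]+", compact)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1  # consume "("
            children = [node()]
            while tokens[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # consume ")"
            result = tuple(children)
        else:
            result = tokens[pos]  # leaf name
            pos += 1
        if pos < len(tokens) and tokens[pos].startswith(":"):
            pos += 1  # skip the branch length
        return result

    return node()


def splits(tree):
    """Return the non-trivial splits of a tree, treated as unrooted."""
    clades = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        clades.add(below)
        return below

    everything = leaves(tree)
    anchor = min(everything)  # fixed reference taxon
    result = set()
    for clade in clades:
        # Always record the side of the split that lacks the anchor.
        side = everything - clade if anchor in clade else clade
        if 2 <= len(side) <= len(everything) - 2:
            result.add(side)
    return everything, result


def rf_distance(newick_a, newick_b):
    """Number of splits found in one tree but not in the other."""
    taxa_a, splits_a = splits(parse_newick(newick_a))
    taxa_b, splits_b = splits(parse_newick(newick_b))
    if taxa_a != taxa_b:
        raise ValueError("trees do not contain the same taxa")
    return len(splits_a ^ splits_b)
```

A distance of 0 means the two trees have the same unrooted topology,
even if the Newick strings and branch lengths differ. Any larger value
means the programs disagree on at least one grouping, which would be
worth sorting out before comparing their running times.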

Paulo


