[bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd
Paulo Nuin
nuin at genedrift.org
Fri Feb 8 10:43:46 PST 2008
Andrew Dalke wrote:
> On Feb 8, 2008, at 3:48 AM, Paulo Nuin wrote:
>
>> http://www.ncbi.nlm.nih.gov/sites/entrez?db=nuccore&cmd=search&term=
>> %20Hantavirus%20segment%20L
>>
>
> Can anyone make the NJ codes run?
>
> I pulled out 12 sequences from that link.
>
> GI: 33860560, 157058361, 156567221, 126695337, 124024712, 123967536,
> 123965234, 78778385, 111434066, 23464594, 148361453, 55733703
>
> The Python program gave me
>
> File "NJ.py", line 204, in <module>
> compute_disimilarity(N)
> File "NJ.py", line 157, in compute_disimilarity
> if(list_seq[i][k] == list_seq[j][k]):
> IndexError: string index out of range
>
> The Perl program NJ.pl took almost a minute to produce this tree:
>
> (gi|124024712|ref|NC_:0,(((((((gi|55733703|ref|NC_0:0,((gi|148361453|
> gb|EF58:0.000498480301364114,(gi|111434066|gb|
> DQ82:0.000493233730785394,gi|156567221|gb|EU00:0.000498380445055715):
> 0.00194332141965669):0.000732614428260341,gi|157058361|gb|EF64:0):
> 0.00177930420757772):0.00143296717474478,gi|23464594|ref|
> NC_0:0):-1.00143648699336,gi|123965234|ref|NC_:0):1.12299044463885,gi|
> 33860560|ref|NC_0:0):1.13312404678835,gi|78778385|ref|NC_0:0):
> 1.14160938351598,gi|123967536|ref|NC_:0):0.900240655295447,gi|
> 126695337|ref|NC_:0.293136656759629):1.30376626963158):0
>
>
> The C program core dumped.
>
>
> I then chopped all of the sequences to be the same length; the first
> 6530 bases.
>
> The Python program gave me
>
> ( ( ( ( ( ( gi|23464594|ref|NC_0: 1.17366651661, ( gi|111434066|gb|
> DQ82: 0.148904182499, gi|156567221|gb|EU00: 0.151185475232):
> 1.10169405164): 0.162975482734, ( gi|123965234|ref|NC_:
> 0.783397347124, gi|126695337|ref|NC_: 0.908025243214):
> 0.179207247629): 0.0571801650779, gi|33860560|ref|NC_0:
> 0.970803605456): 0.0659893145874, gi|123967536|ref|NC_:
> 1.00734624052): 0.0288793245842, ( gi|78778385|ref|NC_0:
> 1.15020230813, gi|124024712|ref|NC_: 0.986748711856):
> 0.0994181473971): 0, ( gi|55733703|ref|NC_0: 1.18567966452, ( gi|
> 148361453|gb|EF58: 0.163944100589, gi|157058361|gb|EF64:
> 0.346254437969): 0.984616394381): 0.154232726999): 0
>
> The Perl program gave me
>
> (((((((gi|148361453|gb|EF58:0,(((gi|123965234|ref|NC_:
> 0.988306732076271,gi|33860560|ref|NC_0:0.865560244439941):
> 1.01694227538299,gi|123967536|ref|NC_:0):1.23242695730548,gi|
> 124024712|ref|NC_:0):1.26867637791678):0.320615925328267,gi|157058361|
> gb|EF64:0):1.53542872692372,gi|126695337|ref|NC_:0.487018431249897):
> 1.06214589526658,gi|23464594|ref|NC_0:0):1.14663660251452,(gi|
> 111434066|gb|DQ82:0.0762888889451898,gi|156567221|gb|
> EU00:0.223800768786119):0):0.984861244847282,gi|78778385|ref|
> NC_0:0.515925217159195):0,gi|55733703|ref|NC_0:1.24796822390579):0
>
> I assumed they should give identical results. As you can see, they
> not only aren't byte identical, they don't even give the same numbers.
>
> The C program still seg faults.
>
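The IndexError in the quoted traceback is consistent with comparing sequences of unequal length position by position, and it went away once the sequences were chopped to the same length. A minimal sketch of a length-safe comparison follows. The names sequence_lengths and p_distance are hypothetical, and the mismatch proportion is only an assumed stand-in for whatever dissimilarity NJ.py actually computes.

```python
def sequence_lengths(seqs):
    """Return the set of distinct lengths; more than one value means unaligned input."""
    return {len(s) for s in seqs}


def p_distance(seq_a, seq_b):
    """Proportion of mismatched positions over the shared prefix of two sequences."""
    # Assumption: only the common prefix is compared, which mirrors
    # chopping every sequence to the same length before running NJ.
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        raise ValueError("cannot compare an empty sequence")
    mismatches = sum(1 for a, b in zip(seq_a[:n], seq_b[:n]) if a != b)
    return mismatches / n
```

Checking sequence_lengths() on the input first makes the problem visible before the distance matrix is built, instead of failing partway through the loop.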
I will test the C and C++ versions later (and I can try the C# one at
home). Can you send me the input file you used? The trees you obtained
with Python and Perl are completely different in their topology; if
anyone is interested, I can share images of the trees. So if the trees
are different, how can you be sure that the algorithms are identical and
are doing the same thing in different languages?
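One way to check topology without looking at images is to compare the leaf splits of the two Newick strings. The sketch below is an assumption-laden illustration, not part of any of the shootout programs: parse_newick, leaf_sets, splits and rf_distance are hypothetical names, branch lengths and internal labels are ignored, and the trees are treated as unrooted. A result of 0 means the two trees have the same topology.

```python
def parse_newick(text):
    """Parse a Newick string into nested tuples of leaf names (lengths dropped)."""
    s = "".join(text.split()).rstrip(";")  # whitespace and line wraps are removed
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else ""

    def skip_until(stops):
        nonlocal pos
        while pos < len(s) and s[pos] not in stops:
            pos += 1

    def node():
        nonlocal pos
        if peek() == "(":
            pos += 1  # consume "("
            children = [node()]
            while peek() == ",":
                pos += 1
                children.append(node())
            if peek() != ")":
                raise ValueError("unbalanced parentheses at position %d" % pos)
            pos += 1  # consume ")"
            skip_until(",()")  # drop optional internal label and branch length
            return tuple(children)
        start = pos
        skip_until(",():")
        name = s[start:pos]
        skip_until(",()")  # drop the branch length, if any
        return name

    return node()


def leaf_sets(tree, out):
    """Append the leaf set under every internal node to out; return this node's leaves."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def splits(newick):
    """Return the non-trivial leaf splits of a tree, treated as unrooted."""
    clades = []
    all_leaves = leaf_sets(parse_newick(newick), clades)
    anchor = min(all_leaves)  # each split is stored as the side without this leaf
    result = set()
    for clade in clades:
        side = all_leaves - clade if anchor in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            result.add(side)
    return result


def rf_distance(newick_a, newick_b):
    """Count the splits found in exactly one of the two trees."""
    return len(splits(newick_a) ^ splits(newick_b))
```

Because only the leaf names are compared, both programs must label the sequences identically for the count to be meaningful.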
Paulo