[bip] Blog post on bioinformatics and Python

Steve Lianoglou mailinglist.honeypot at gmail.com
Wed Sep 17 13:30:32 PDT 2008


Hi,

It's hard to jump into the discussion since it's pretty deep by now,
but I still want to, because I'd like to be doing more bioinformatics
work in Python rather than switching in/out of R, so I'll start
piping up.

Since this post covers most of the issues and is quite recent, I'll
just reply inline here.

On Sep 17, 2008, at 4:04 PM, Ryan Raaum wrote:

>> Is the solution like what Zope's been doing - split itself
>> into many smaller packages, and distribute them as eggs?
>
> I think so. I think this approach solves a lot more problems than it
> creates. Sure, you have to communicate clearly and coordinate among
> groups working on different packages, but that's no different than the
> communication and coordination that is necessary now in the all-in-one
> BioPython. And the more split up approach really clarifies what
> packages/modules are core functionality that needs to be rock-solid,
> well documented, and have a stable API, and what parts are more "tip"
> packages that can be developed faster, be released more often, and be
> more experimental. You can look at the download statistics to see
> which packages have only been installed by 3 people in the last year
> and which get a lot of use.

I mostly agree with the spirit of Ryan's answer here. I see some value
in splitting biopython into a number of cohesive chunks, letting
people who are actively developing certain modules release at a more
aggressive pace rather than having to wait for another "monolithic"
biopython release.

I'm not sure how this can be achieved while still keeping all the
modules under the biopython.* module/namespace. I have a vague
recollection of reading about a way people have done this, or a
proposal that would allow for it ... I just can't find it, though.
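For what it's worth, one way I've seen the shared-namespace problem handled is pkgutil-style namespace packages: each separately distributed chunk ships the same boilerplate __init__.py, and Python merges their contents at import time. A minimal, self-contained sketch (the "bio" namespace and the seq/align module names are made up for illustration):

```python
import os
import sys
import tempfile

# Simulate two independently installed distributions, each shipping
# its own copy of a hypothetical "bio" top-level package.
root = tempfile.mkdtemp()
for dist, mod in [("dist_a", "seq"), ("dist_b", "align")]:
    pkg = os.path.join(root, dist, "bio")
    os.makedirs(pkg)
    # Every distribution includes the same boilerplate __init__.py,
    # which extends bio.__path__ across all copies on sys.path.
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write("from pkgutil import extend_path\n"
                "__path__ = extend_path(__path__, __name__)\n")
    with open(os.path.join(pkg, mod + ".py"), "w") as f:
        f.write("NAME = %r\n" % mod)
    sys.path.append(os.path.join(root, dist))

# Both submodules resolve under the one shared namespace, even though
# they live in different directories (i.e. different distributions).
import bio.seq
import bio.align
print(bio.seq.NAME, bio.align.NAME)
```

The catch, of course, is that every chunk has to ship that identical __init__.py and nothing else can live in it, which is exactly the coordination overhead Ryan mentions.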

>>> 2. It is not pure python. I recognize the need for Numeric and C for
>>> speed in many circumstances, but having those in the core framework
>>> limits where and how it can be used.
>>
>> Just a history note here.  My memory is hazy, but Jeff Chang
>> wrote some code which used a C extension if it was available
>> or pure Python if it wasn't.  Some people complained about
>> how slow it was, and it turns out it was a misconfiguration
>> that caused only the Python code to be installed.
>>
>> We decided it was better to get complaints about "it doesn't
>> work" than deal with unvoiced "Python is so slow" complaints.
>
> Right. This is a choice you made, and it is the right choice for you.
> This is a tough problem that has no ideal solution. My preferred
> solution has different problems of it own. Nonetheless, I would prefer
> that the documentation for that functionality tells me right up front
> in big letters that "This will run slowly if the C extension has not
> been compiled. To test if you are using the C extension or the pure
> python version, import module and run module.is_c()" For that matter,
> you could have the default python version loudly note to stderr that
> it is going to be slow.
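For reference, the pattern Ryan is describing -- try the C extension, fall back to pure Python, warn loudly on stderr, and expose an is_c() check -- might look something like this sketch (the _cfasta extension and the parse function are hypothetical names, not anything actually in biopython):

```python
import sys

try:
    # Hypothetical compiled implementation; use it when available.
    from _cfasta import parse
    _HAVE_C = True
except ImportError:
    _HAVE_C = False
    # Loudly note the slow path, per Ryan's suggestion.
    sys.stderr.write("warning: C extension not built; "
                     "using the (slow) pure-Python parser\n")

    def parse(handle):
        """Pure-Python fallback (placeholder logic for the sketch)."""
        for line in handle:
            yield line.rstrip("\n")


def is_c():
    """True if the compiled implementation is in use."""
    return _HAVE_C
```

Callers (or the docs) can then say: run module.is_c() to find out which implementation you actually got.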

Are you arguing for requiring everything that has a C implementation
to also have a (slower) pure-Python implementation? I don't think I'd
agree that that should be a mandate.

For instance, I think it would be very handy if Affy's Fusion SDK
were wrapped and completely accessible from biopython (as I think it
is in R) [1].

Other things mentioned previously:

About NumPy: I'm also happy to see this conversion finally happening.

... I'm drawing a blank on anything else, so ... I'll sign off with
that for now.

-steve

[1] I see Affy's Fusion SDK is LGPL'd, though, so let's use it more
as an example to illustrate a point (even though I'd still like
Python bindings to it), since the licensing argument is orthogonal to
the point I'm trying to make.
