[bip] Announcing PEBL (and asking for assistance)

Tue Jun 17 17:59:03 PDT 2008

On Tue, Jun 17, 2008 at 4:34 PM, Abhik Shah <abhikshah at gmail.com> wrote:
> Pebl includes sphinx-generated documentation and 200+ unittests.  I
> think it's ready for prime time but would appreciate an informal code
> review (or just comments) from this group.  Installation is currently
> a bit painful but I'm working on improving that. Same goes for the
> quality of the documentation.

I agree with Titus about the documentation, although I'm very
curious about this.  Could you describe the data format a little?  I
read the docstring for fromfile() but I'm still unclear on these:

- "data lines specify the data values separated by tab characters" --
what does that mean?  Just that any row can be a tab-separated list of
arbitrary length?
- what is a "sample name" and how might a line include one?
- what is an "intervention" ?

A real example would be best as opposed to a contrived one.  I noticed
the values "normal" vs. "cancer" -- are you working with DNA?  If so,
some short form of a data file for this would make a great example.

As for the code, I took only a quick peek.  You must have infinite RAM
;)  I would suggest not doing this in fromfile(): return
fromstring(f.read()).  Instead, you can stream it line by line but
this will require a bit of modification to your current code.  All the
line pre-processing you are doing can be moved to a single for loop I
think but then the Numpy array seems to want an actual list for the
data cells.  Does that function not accept a generator?  Anyway, maybe
this isn't a problem.  There are benchmarks then there are lies, alas.
 Have you tried loading very large data files?  I know Numpy is
optimized for storing a lot of data in memory (probably similar to
mmap) but all the way up to N.array() you are asking a *lot* of the
poor machine running this code if it were doing so on a large file.
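
A minimal sketch of the streaming idea, under loud assumptions: the file
holds only tab-separated numeric cells, blank lines and lines starting
with "#" are skipped, and every row has the same number of cells.  That
is not Pebl's actual format, and iter_rows() and load_streaming() are
hypothetical names, not Pebl functions.  On the generator question:
numpy.array() does not consume a generator element by element, but
numpy.fromiter() does accept any iterable and builds a 1-D array from
it, which can then be reshaped.

```python
import io
import itertools

import numpy as np


def iter_rows(lines):
    """Yield one list of floats per data line, reading lazily."""
    for line in lines:
        line = line.strip()
        # Assumed convention: skip blank lines and "#" comment lines.
        if not line or line.startswith("#"):
            continue
        yield [float(cell) for cell in line.split("\t")]


def load_streaming(fileobj):
    """Build a 2-D array without reading the whole file into one string."""
    rows = iter_rows(fileobj)
    first = next(rows, None)
    if first is None:
        return np.empty((0, 0))
    ncols = len(first)
    # Flatten all cells into one lazy stream; fromiter consumes it
    # one value at a time and returns a 1-D array.
    flat = itertools.chain(first, itertools.chain.from_iterable(rows))
    data = np.fromiter(flat, dtype=float)
    # Assumes every row has ncols cells; ragged input would fail or misalign here.
    return data.reshape(-1, ncols)


# A file object opened with open(path) iterates line by line in the same
# way; StringIO keeps the example self-contained.
sample = io.StringIO("# comment\n1\t2\t3\n\n4\t5\t6\n")
arr = load_streaming(sample)
print(arr.shape)  # (2, 3)
```

This avoids holding both the raw text and an intermediate list of lists
in memory; the final array still has to fit in RAM.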

Kumar