[bip] Announcing PEBL (and asking for assistance)
Kumar McMillan
kumar.mcmillan at gmail.com
Tue Jun 17 17:59:03 PDT 2008
On Tue, Jun 17, 2008 at 4:34 PM, Abhik Shah <abhikshah at gmail.com> wrote:
> Pebl includes sphinx-generated documentation and 200+ unittests. I
> think it's ready for prime time but would appreciate an informal code
> review (or just comments) from this group. Installation is currently
> a bit painful but I'm working on improving that. Same goes for the
> quality of the documentation.
I agree with Titus about the documentation, although I'm very
curious about this. Could you describe the data format a little? I
read the docstring for fromfile() but I'm still unclear on these:
- "data lines specify the data values separated by tab characters" --
what does that mean? Just that any row can be a tab-separated list of
arbitrary length?
- what is a "sample name" and how might a line include one?
- what is an "intervention" ?
A real example would be best as opposed to a contrived one. I noticed
the values "normal" vs. "cancer" -- are you working with DNA? If so,
some short form of a data file for this would make a great example.
As for the code, I took only a quick peek. You must have infinite RAM
;) I would suggest not doing this in fromfile(): return
fromstring(f.read()), since that reads the whole file into memory
before any parsing starts. Instead, you can stream it line by line,
but this will require a bit of modification to your current code. All
the line pre-processing you are doing can be moved to a single for
loop, I think, but then the Numpy array seems to want an actual list
for the data cells. Does that function not accept a generator? Anyway,
maybe this isn't a problem. There are benchmarks and then there are
lies, alas.
Have you tried loading very large data files? I know Numpy is
optimized for storing a lot of data in memory (probably similar to
mmap) but all the way up to N.array() you are asking a *lot* of the
poor machine running this code if it were doing so on a large file.
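
To make the streaming suggestion concrete, here is a rough sketch of
what I have in mind. It is not Pebl's actual parser: iter_rows() and
load_rows() are names I made up, and skipping blank lines and
"#" comment lines is my assumption about the format, not something I
took from the docstring. The only point is that the file object is
consumed one line at a time, so the raw text never sits in memory as
one big string.

```python
import io


def iter_rows(lines, sep="\t"):
    """Yield one list of cell strings per data line.

    `lines` is any iterable of text lines, e.g. an open file object.
    Blank lines and lines starting with '#' are skipped; that rule is
    an assumption for this sketch, not Pebl's documented behaviour.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        yield line.split(sep)


def load_rows(path):
    """Hypothetical helper: stream a file from disk into a list of rows."""
    with open(path) as f:
        # Iterating over f reads one line at a time; f.read() is never called.
        return list(iter_rows(f))


# Small in-memory stand-in for a data file.
sample = io.StringIO("a\tb\tc\n\n# a comment\n1\t2\t3\n")
rows = list(iter_rows(sample))
print(rows)
```

The final list of rows still has to fit in memory if the array
constructor wants a real list, but at least the unparsed file contents
and the parsed rows are not both held at once.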
Kumar