[bip] Announcing PEBL (and asking for assistance)

Abhik Shah abhikshah at gmail.com
Wed Jun 18 13:13:02 PDT 2008


Hi Kumar,
  I've updated the tutorial; it should answer your questions about
the data format, and it now includes a realistic example.
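
For the archives, the short version: the first line names the
variables, each subsequent line is one sample, an optional first field
gives the sample's name, and a value marked with '!' records that the
variable was set by experimental intervention rather than simply
observed.  A toy file might look something like the following (fields
are tab-separated; this is illustrative only -- see the tutorial for
the exact annotation syntax and a realistic example):

            shh     gli1    tissue
    sample1 0       1       normal
    sample2 1       1       normal
    sample3 !1      0       cancer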

About reading data: I've created a bug report
(http://code.google.com/p/pebl-project/issues/detail?id=17) for it,
but it's not high on my list.  A 500-variable by 500-sample dataset is
pretty big for structure learning of Bayesian networks but rather
small in terms of memory consumption, and I haven't run into memory
problems with any dataset I've used yet.  Also, numpy accepts 1D
iterables but not 2D ones, so I'm not sure a streaming solution would
be much more efficient, if at all.
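
If anyone wants to experiment, here's a minimal sketch of the kind of
streaming reader Kumar suggests below.  It's a proof of the memory
pattern only: it deliberately ignores pebl's header, sample-name and
intervention handling, and the function name is made up, so don't
treat it as a drop-in replacement for fromfile():

import numpy as N

def fromfile_streaming(filename, numvars, dtype=float):
    # Yield values one at a time so the file is never held in
    # memory as one big string or a list of lists.
    def values(f):
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue  # skip blank and commented lines
            for field in line.split('\t'):
                yield dtype(field)

    f = open(filename)
    try:
        # N.fromiter only takes 1D iterables, hence flatten + reshape.
        flat = N.fromiter(values(f), dtype=dtype)
    finally:
        f.close()
    return flat.reshape((-1, numvars))

On a 500x500 dataset this shouldn't matter much, as noted above, but
on genuinely large files it avoids both the f.read() string and the
intermediate list of lists.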

Thanks,
Abhik.

On Tue, Jun 17, 2008 at 8:59 PM, Kumar McMillan
<kumar.mcmillan at gmail.com> wrote:
> On Tue, Jun 17, 2008 at 4:34 PM, Abhik Shah <abhikshah at gmail.com> wrote:
>> Pebl includes sphinx-generated documentation and 200+ unittests.  I
>> think it's ready for prime time but would appreciate an informal code
>> review (or just comments) from this group.  Installation is currently
>> a bit painful but I'm working on improving that. Same goes for the
>> quality of the documentation.
>
> I agree with Titus about the documentation.  I'm very curious
> about this, though.  Could you describe the data format a little?  I
> read the docstring for fromfile() but I'm still unclear on these:
>
> - "data lines specify the data values separated by tab characters" --
> what does that mean?  Just that any row can be a tab-separated list of
> arbitrary length?
> - what is a "sample name" and how might a line include one?
> - what is an "intervention"?
>
> A real example would be better than a contrived one.  I noticed
> the values "normal" vs. "cancer" -- are you working with DNA?  If so,
> a short excerpt of such a data file would make a great example.
>
> As for the code, I took only a quick peek.  You must have infinite RAM
> ;)  I would suggest not doing this in fromfile(): return
> fromstring(f.read()).  Instead, you can stream the file line by line,
> though this will require a bit of modification to your current code.
> All the line pre-processing you are doing could be moved into a single
> for loop, I think, but then the Numpy array seems to want an actual
> list for the data cells.  Does that function not accept a generator?
> Anyway, maybe this isn't a problem.  There are benchmarks, and then
> there are lies, alas.  Have you tried loading very large data files?
> I know Numpy is optimized for storing a lot of data in memory
> (probably similar to mmap), but holding everything all the way up to
> the N.array() call asks a *lot* of the poor machine running this code
> on a large file.
>
> Kumar
>



-- 
Abhik Shah - http://umich.edu/~shahad
Systems Biology Lab, University of Michigan


