[bip] Announcing pypipegraph

Florian Finkernagel f.finkernagel at coonabibba.de
Mon Jan 30 09:01:30 PST 2012


Dear list,

my apologies for spamming, but I figure the intersection between the people on
this planet interested in my pet project and the members of this list is
relatively large.

Pypipegraph is an engine for computational pipelines, somewhat similar to Ruffus*,
or maybe make after 'import antigravity'.

We use it in a research setting where the (in-house) libraries used often
change, some computation steps take a long(ish) time and it's not reasonable
for the user to decide which of the hundreds of output files need to be rebuilt
and which could safely be kept. **

It handles:
    - modeling 'pipeline' jobs (tasks) and their dependencies
    - separating definition and runtime (allowing sanity checking before spending hours of computation)
    - tracking parameter and code changes - and isolating independent tasks from them
    - isolating failing tasks from each other
    - flexible multi-core usage (***) (= trivial parallelization)
    - interruption and resuming (at least between jobs)

In pypipegraph each job is an explicit python object.
Most jobs encapsulate a name and a callback function.
You build a directed acyclic graph (hence the name) of job objects,
and pypipegraph takes care of running those that need running, in a sensible order.
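To make that model concrete, here is a minimal sketch in plain Python (deliberately NOT pypipegraph's actual API - the class and function names are made up for illustration) of jobs as explicit objects wired into a DAG and executed dependency-first:

```python
# Minimal sketch (plain Python, not pypipegraph's API): each job is an
# object holding a name, a callback and its explicit dependencies; the
# runner walks the DAG and executes every job after its dependencies,
# exactly once.

class Job:
    def __init__(self, name, callback, dependencies=()):
        self.name = name
        self.callback = callback
        self.dependencies = list(dependencies)

def run_graph(jobs):
    """Run each job after all of its dependencies, exactly once."""
    done = set()
    order = []

    def visit(job):
        if job.name in done:
            return
        for dep in job.dependencies:  # dependencies first
            visit(dep)
        job.callback()
        done.add(job.name)
        order.append(job.name)

    for job in jobs:
        visit(job)
    return order

results = []
a = Job("load_data", lambda: results.append("loaded"))
b = Job("analyze", lambda: results.append("analyzed"), [a])
c = Job("plot", lambda: results.append("plotted"), [b])
print(run_graph([c, a, b]))  # -> ['load_data', 'analyze', 'plot']
```

The interesting parts of pypipegraph (invariant tracking, parallel execution, error isolation) sit on top of a traversal like this.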

If you introduce changes (to parameters, input files, python code**** ) only the 
jobs downstream (i.e. those that could be affected) will be re-run.
If a job dies for whatever reason, only the downstream jobs will be affected.
If more than one job has its requirements fulfilled, multiple jobs may run at the same time.*****
At any given time between runs, either a job was executed correctly (******) or its output does not exist.
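The re-run logic can be pictured as follows. This is a toy sketch of the general idea (fingerprints plus downstream propagation), not pypipegraph's implementation, and all names in it are invented:

```python
# Sketch of the invalidation idea (plain Python, NOT pypipegraph's
# implementation): each job records a fingerprint of its callback's
# bytecode and its parameters; jobs whose fingerprint changed, plus
# everything downstream of them, are re-run, while unaffected jobs
# are skipped.
import hashlib

def fingerprint(callback, params):
    """Hash the callback's compiled bytecode together with its parameters."""
    h = hashlib.sha256()
    h.update(callback.__code__.co_code)
    h.update(repr(sorted(params.items())).encode())
    return h.hexdigest()

def dirty_jobs(graph, stored, current):
    """Jobs whose fingerprint changed, plus all jobs downstream of them.

    graph maps each job name to the jobs depending on it;
    stored/current map job names to fingerprints.
    """
    dirty = {name for name in current if stored.get(name) != current[name]}
    frontier = list(dirty)
    while frontier:
        for downstream in graph.get(frontier.pop(), ()):
            if downstream not in dirty:
                dirty.add(downstream)
                frontier.append(downstream)
    return dirty

def analyze(x):
    return x * 2

old_fp = fingerprint(analyze, {"threshold": 5})
new_fp = fingerprint(analyze, {"threshold": 6})  # a parameter changed
graph = {"load": ["analyze"], "analyze": ["plot"], "plot": []}
stored = {"load": "L", "analyze": old_fp, "plot": "P"}
current = {"load": "L", "analyze": new_fp, "plot": "P"}
print(sorted(dirty_jobs(graph, stored, current)))  # -> ['analyze', 'plot']
```

Note how 'load' is left alone: only the changed job and its downstream jobs are scheduled for re-running.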

The current set of jobs handles file generation (including temporary file
support), data loading (and unloading), invariant modeling (file time, file
checksum, parameters, python functions) and generating further jobs on the fly.

Now that I've probably lost all of you with my ramblings, straight to the obligatory
link section:

There's a project page at http://code.google.com/p/pypipegraph/
the source can be 'hg clone'd from the same URL,
it can be installed via pip,
documentation is at http://pypipegraph.readthedocs.org
and a basic tutorial at http://pypipegraph.readthedocs.org/en/latest/tutorial.html

I appreciate all feedback,
and would love it if you found pypipegraph to be useful.

Best regards,
    Florian Finkernagel


* Ruffus looks like a fine system, alas I could never wrap my head around the model they're using for tasks.
** For the interested, our day-to-day is mostly ChIPseq, RNAseq and microarray analysis.
*** There is some alpha-quality code for distributing onto multiple computers using twisted and some pretty cool python code pickling.
**** By default, all callback functions are checked - you can add jobs that check further functions for changes.
***** Provided they didn't declare a need for all the cores or all the memory. There is flexibility in the resource requests.
****** Provided the runtime gets to handle the job's return. On 'hard' crashes (power gone, etc.) this only holds true if the (file generating) jobs do the old 'create & rename file' trick.
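For the curious, the 'create & rename file' trick mentioned in footnote ****** looks roughly like this (a generic sketch, not pypipegraph code):

```python
# The old 'create & rename' trick: write the output to a temporary file
# first, then atomically rename it over the real target. A crash
# mid-write then leaves at worst a stray temp file, never a half-written
# output that looks complete.
import os
import tempfile

def atomic_write(path, data):
    """Write data to path so that path is either complete or absent."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)  # temp file on the same filesystem
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.replace(tmp, path)  # atomic rename on POSIX filesystems
    finally:
        if os.path.exists(tmp):  # clean up if anything above failed
            os.unlink(tmp)
```

The temp file must live on the same filesystem as the target, since a rename across filesystems is not atomic.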



More information about the biology-in-python mailing list