[khmer] Partitioning: are resulting lumps that different from each other?

C. Titus Brown ctb at msu.edu
Sun Jan 12 09:51:27 PST 2014


On Thu, Jan 09, 2014 at 02:23:29PM +0000, YiJin Liew wrote:
> Dear Dr Brown,
> 
> Before I delve into my sob story, I'd like to thank you (and your lab)
> for writing khmer. I must say that the digital normalisation pipeline
> proved to be an elegant method of reducing the number of errors in
> sequencing data, and our resulting assemblies have improved (and sped up
> considerably) because of your programs. Thanks.
> 
> After the digital normalisation pipeline, I tried out the partitioning
> pipeline as described in
> http://khmer.readthedocs.org/en/latest/partitioning-big-data.html. I'm
> having some trouble wrapping my head around the results produced by
> extract-partitions.py - the resulting lumps (in group000x files) seem to
> be strongly influenced by the -X (--max-size) parameter that one uses.
> 
> Take for example the 1.1G Iowa corn dataset you made available online,
> specifically the

Hi YiJin,

Apologies for taking so long to reply.  Each 'group' file output by
extract-partitions.py contains multiple partitions; the -X parameter controls
roughly how many sequences go into each group.  So the dependence on -X that
you're seeing is entirely expected.

Partitions are connected sequences; groups are merely collections of similarly
sized partitions.
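
To make the distinction concrete, here is a minimal sketch of how partitions
might be packed into groups.  It is an illustration only, not khmer's actual
code: the function name, the greedy smallest-first packing, and the toy
partition sizes are all assumptions.

```python
def pack_into_groups(partition_sizes, max_size):
    """Greedily pack partitions into groups of roughly max_size sequences.

    partition_sizes maps a partition id to its number of sequences.
    Hypothetical helper for illustration; not part of khmer.
    """
    groups = [[]]
    total = 0
    # Visit partitions from smallest to largest so that partitions of
    # similar size end up in the same group.
    for pid, size in sorted(partition_sizes.items(), key=lambda kv: kv[1]):
        # Start a new group once the current one would exceed max_size.
        if groups[-1] and total + size > max_size:
            groups.append([])
            total = 0
        groups[-1].append(pid)
        total += size
    return groups


# Toy data (made up): four partitions with 2, 3, 5 and 40 sequences.
sizes = {"p1": 2, "p2": 3, "p3": 5, "p4": 40}

small = pack_into_groups(sizes, max_size=10)    # two groups
large = pack_into_groups(sizes, max_size=100)   # one group
```

The partitions are identical in both calls; only the way they are bundled
into groups changes with max_size.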

To see how the partitions themselves vary, take a look at the '.dist' file;
that's the distribution of partition sizes.
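
As a rough sketch of what "distribution of partition sizes" means, the
following computes one from a read-to-partition mapping.  It does not read
the '.dist' file itself (no column layout is assumed here); the function
name and the toy data are hypothetical.

```python
from collections import Counter


def partition_size_distribution(read_to_partition):
    """Return a Counter mapping partition size -> number of partitions.

    read_to_partition maps each read name to its partition id.
    Hypothetical helper for illustration; not part of khmer.
    """
    # First count how many reads fall into each partition ...
    per_partition = Counter(read_to_partition.values())
    # ... then count how many partitions have each size.
    return Counter(per_partition.values())


# Toy data (made up): partition 1 has three reads, partitions 2 and 3 one each.
reads = {"r1": 1, "r2": 1, "r3": 1, "r4": 2, "r5": 3}
dist = partition_size_distribution(reads)

for size in sorted(dist):
    print(size, dist[size])
```

This distribution depends only on which reads are connected, so it stays the
same no matter how the partitions are later bundled into groups.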

best,
--titus
-- 
C. Titus Brown, ctb at msu.edu



