[khmer] [YiJin.Liew at KAUST.EDU.SA: Partitioning: are resulting lumps that different from each other?]

C. Titus Brown ctb at msu.edu
Thu Jan 9 06:33:19 PST 2014


----- Forwarded message from YiJin Liew <YiJin.Liew at KAUST.EDU.SA> -----

From: YiJin Liew <YiJin.Liew at KAUST.EDU.SA>
To: "ctb at msu.edu" <ctb at msu.edu>
Subject: Partitioning: are resulting lumps that different from each other?
Date: Thu, 9 Jan 2014 14:23:29 +0000

Dear Dr Brown,

Before I delve into my sob story, I'd like to thank you (and your lab)
for writing khmer. I must say that the digital normalisation pipeline
proved to be an elegant method of reducing the number of errors in
sequencing data, and our resulting assemblies have improved (and sped up
considerably) because of your programs. Thanks.

After the digital normalisation pipeline, I tried out the partitioning
pipeline as described in
http://khmer.readthedocs.org/en/latest/partitioning-big-data.html. I'm
having some trouble wrapping my head around the results produced by
extract-partitions.py - the resulting lumps (in group000x files) seem to
be strongly influenced by the -X (--max-size) parameter that one uses.

Take, for example, the 1.1G Iowa corn dataset you made available online,
specifically this part of the pipeline:

# now, extract the partitions in groups into 'iowa-corn-50m.groupNNNN.fa'
extract-partitions.py iowa-corn-50m iowa-corn-50m.fa.gz.part

# at this point, you can assemble the group files individually.  Note,
# however, that the last one of them is quite big?  this is because it's
# the lump! yay!

(err, by the way, when I ran the pipeline, group0005 wasn't the biggest
lump - it turned out that group0007 was. See below.)

What I expected was that partitions which were (sort of) connected
together would end up in the same file, but it seems that the -X setting
has the biggest influence on the resulting groups.

I ran these commands (basically varying -X) to illustrate my point:
extract-partitions.py iowa-corn-50m iowa-corn-50m.fa.gz.part
extract-partitions.py -X 2000000 iowa-corn-50m-X_2mil iowa-corn-50m.fa.gz.part
extract-partitions.py -X 4000000 iowa-corn-50m-X_4mil iowa-corn-50m.fa.gz.part
extract-partitions.py -X 10000000 iowa-corn-50m-X_10mil iowa-corn-50m.fa.gz.part

and here is the wc -l output for the resulting files (as it's FASTA,
divide the line counts by 2 to get the number of sequences):
    2000014 iowa-corn-50m.group0000.fa
    2000012 iowa-corn-50m.group0001.fa
    2000028 iowa-corn-50m.group0002.fa
    2000018 iowa-corn-50m.group0003.fa
    2000010 iowa-corn-50m.group0004.fa
    2000066 iowa-corn-50m.group0005.fa
    2000346 iowa-corn-50m.group0006.fa
    7282860 iowa-corn-50m.group0007.fa
    4000006 iowa-corn-50m-X_2mil.group0000.fa
    4000016 iowa-corn-50m-X_2mil.group0001.fa
    4000126 iowa-corn-50m-X_2mil.group0002.fa
    9283206 iowa-corn-50m-X_2mil.group0003.fa
    8000022 iowa-corn-50m-X_4mil.group0000.fa
   13283332 iowa-corn-50m-X_4mil.group0001.fa
   21283354 iowa-corn-50m-X_10mil.group0000.fa
   81146016 iowa-corn-50m.fa.gz.part
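
For what it's worth, dividing the line counts by 2 makes the pattern
clearer. Here's a throwaway snippet I used to eyeball the default run
(the numbers are just copied from the wc -l output above; the
interpretation in the comments is mine):

# Convert wc -l line counts into sequence counts (FASTA: 2 lines per record).
# Counts copied from the default run above; group0007 is the lump.
default_run = [2000014, 2000012, 2000028, 2000018,
               2000010, 2000066, 2000346, 7282860]
print([lines // 2 for lines in default_run])
# -> [1000007, 1000006, 1000014, 1000009, 1000005, 1000033, 1000173, 3641430]
# i.e. every non-lump group holds just over 1,000,000 sequences, which is
# suspiciously close to what I'd expect the default --max-size to be.

The -X 2000000 groups work out to just over 2,000,000 sequences each, and
the -X 4000000 groups to just over 4,000,000, so the cutoff seems to track
-X exactly.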

So... am I misunderstanding something, or are the groups binned simply
because they hit an arbitrary --max-size? Pardon my layman's
interpretation, but I thought the program tried to bin sequences that
cluster together, which would mean that file sizes / line counts
should vary quite a bit... no?
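
To make my mental model concrete, here is the kind of greedy packing that
would explain the numbers above. This is purely my guess, reconstructed
from the output sizes, not a description of khmer's actual code; the
function name pack_partitions and the 1,000,000-sequence default are my
own assumptions:

# Hypothetical sketch: partitions are taken in the order encountered and
# appended to the current group until the group exceeds max_size sequences,
# at which point the group is closed and a new one is started.
def pack_partitions(partition_sizes, max_size=1000000):
    """partition_sizes: sequences per partition, in encounter order.
    Returns the number of sequences in each output group."""
    groups = []
    current = 0
    for size in partition_sizes:
        current += size
        if current > max_size:   # group is "full" - close it out
            groups.append(current)
            current = 0
    if current:                  # leftovers become the final group
        groups.append(current)
    return groups

# With many small partitions, every group ends up just over max_size,
# so group sizes track -X rather than any biological clustering.
sizes = [50] * 200000            # 200,000 partitions of 50 sequences each
print(pack_partitions(sizes, max_size=1000000))
# -> nine groups of 1,000,050 sequences plus a final group of 999,550;
#    doubling max_size roughly halves the number of groups, which is
#    exactly the pattern I see in my runs.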

Grateful for any advice, thanks for your time!

Yours
Yi Jin LIEW
Postdoc
King Abdullah University of Science and Technology

----- End forwarded message -----

-- 
C. Titus Brown, ctb at msu.edu