[khmer] Partitioning: are resulting lumps that different from each other?

YiJin Liew YiJin.Liew at KAUST.EDU.SA
Sun Jan 12 15:28:35 PST 2014


Thanks for the reply! Could I ask a few follow-up questions regarding
the format of the .dist file then, as I can't seem to find a full
description of how the file is structured?

Take for example

--- iowa-corn-50m.dist ---
1 19750012 19750012 19750012
2 2905935 22655947 25561882
3 745747 23401694 27799123
4 324017 23725711 29095191
5 167228 23892939 29931331
<snip>
2312 1 24356713 37268397
2359 1 24356714 37270756
2714 1 24356715 37273470
3008 1 24356716 37276478
3296530 1 24356717 40573008   <-- is this the most interesting group?

I can sort of guess what the numbers mean, but let me double-check: does
this indicate that there are 19.8 million clusters that are "singlets",
followed by 2.9 million "doublets", etc.? Also, are columns 3 and 4
cumulative figures for clusters and sequences, respectively?
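
To make that reading concrete, here's a quick sketch of how I'd parse the
file under my guessed interpretation (the script and the column names are
my own, purely illustrative - nothing from the khmer docs):

--- check_dist.py (my sketch) ---
# Guessed column layout of a .dist file:
#   col 1 = partition size (sequences per partition)
#   col 2 = number of partitions of that size
#   col 3 = cumulative number of partitions
#   col 4 = cumulative number of sequences
# The asserts check that cols 3/4 really are running totals of cols 1/2.

import sys

def check_dist(path):
    total_partitions = 0
    total_sequences = 0
    for line in open(path):
        size, count, cum_parts, cum_seqs = map(int, line.split())
        total_partitions += count
        total_sequences += size * count
        assert cum_parts == total_partitions, line
        assert cum_seqs == total_sequences, line
    print('partitions: %d, sequences: %d' % (total_partitions, total_sequences))

if __name__ == '__main__':
    check_dist(sys.argv[1])

On the iowa-corn-50m.dist lines above, the running totals do come out
right (e.g. 19750012 + 2905935 = 22655947, and 19750012 + 2*2905935 =
25561882), which is what makes me think columns 3 and 4 are cumulative.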

If you don't mind, could you elaborate briefly on how groups are created
based on the dist file? Judging from the line counts, I suspect that the
script fills the first group with singlets until --max-size is hit, then
continues filling with doublets and so on, moving on to the next group
each time --max-size is crossed - is that right?
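
In pseudo-Python, this is the logic I imagine - just my guess, not the
actual extract-partitions.py code; max_size here stands for the
-X/--max-size value:

--- my guess at the grouping logic (illustrative only) ---
def assign_groups(dist_rows, max_size):
    """dist_rows: (partition_size, n_partitions) pairs sorted by size,
    i.e. columns 1 and 2 of the .dist file; max_size: -X / --max-size,
    in sequences."""
    groups = []
    current = []        # partition sizes assigned to the current group
    current_seqs = 0    # sequences accumulated in the current group
    for size, count in dist_rows:
        for _ in range(count):
            current.append(size)
            current_seqs += size
            if current_seqs >= max_size:   # group full - start the next one
                groups.append(current)
                current, current_seqs = [], 0
    if current:
        groups.append(current)
    return groups

With the iowa-corn numbers above and a modest -X, group 0 would be filled
almost entirely by singlets before any doublets are reached.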

On my data, I've tried blastn-ing the group0000 and group0001 files
produced by the partitioning process, but from the results I'd wager that
they're roughly the same - which is what prompted me to seek advice on
how the script functions.

Apologies for the wall-of-text, thanks again for your help!

Yours
Yi Jin



On 12/01/2014 20:51, C. Titus Brown wrote:
> On Thu, Jan 09, 2014 at 02:23:29PM +0000, YiJin Liew wrote:
>> Dear Dr Brown,
>>
>> Before I delve into my sob story, I'd like to thank you (and your lab)
>> for writing khmer. I must say that the digital normalisation pipeline
>> proved to be an elegant method of reducing the number of errors in
>> sequencing data, and our resulting assembly has improved (and sped up
>> considerably) because of your programs. Thanks.
>>
>> After the digital normalisation pipeline, I tried out the partitioning
>> pipeline as described in
>> http://khmer.readthedocs.org/en/latest/partitioning-big-data.html. I'm
>> having some trouble wrapping my head around the results produced by
>> extract-partitions.py - the resulting lumps (in group000x files) seem to
>> be strongly influenced by the -X (--max-size) parameter that one uses.
>>
>> Take for example the 1.1G Iowa corn dataset you made available online,
>> specifically the
>
> Hi YiJin,
>
> apologies for taking so long to reply.  The 'group' files output by
> extract-partitions contain multiple partitions; the -X parameter controls how
> many sequences, roughly, go into each group.  So this is entirely expected.
>
> Partitions are connected sequences; groups are merely collections of similarly
> sized partitions.
>
> The file to take a look at is the '.dist' file; that's the distribution
> of partition sizes.
>
> best,
> --titus
>


