[khmer] Partitioning: are resulting lumps that different from each other?

C. Titus Brown ctb at msu.edu
Sun Jan 12 17:46:50 PST 2014


On Sun, Jan 12, 2014 at 11:26:50PM +0000, YiJin Liew wrote:
> Thanks for the reply! Could I ask a few follow-up questions regarding
> the format of the .dist file then, as I can't seem to find a full
> description of how the file is structured?
> 
> take for example
> 
> --- iowa-corn-50m.dist ---
> 1 19750012 19750012 19750012
> 2 2905935 22655947 25561882
> 3 745747 23401694 27799123
> 4 324017 23725711 29095191
> 5 167228 23892939 29931331
> <snip>
> 2312 1 24356713 37268397
> 2359 1 24356714 37270756
> 2714 1 24356715 37273470
> 3008 1 24356716 37276478
> 3296530 1 24356717 40573008   <-- is this the most interesting group?
> 
> I can sort of guess what the numbers mean, but let me double-check: does
> this indicate that there are 19.8 million clusters that are "singlets",
> followed by 2.9 million "doublets", etc.? Also, are columns 3 and 4
> cumulative figures for clusters and sequences respectively?

Exactly!
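
For anyone who wants to check this reading against their own .dist file,
here is a minimal Python sketch. It assumes the four whitespace-separated
columns are: partition size, number of partitions of that size, cumulative
partition count, and cumulative sequence count. The function names
(parse_dist, check_cumulative) are made up for illustration and are not part
of khmer. The sample rows are the first five lines quoted above.

```python
SAMPLE = """\
1 19750012 19750012 19750012
2 2905935 22655947 25561882
3 745747 23401694 27799123
4 324017 23725711 29095191
5 167228 23892939 29931331
"""


def parse_dist(text):
    """Parse .dist text into (size, count, cum_partitions, cum_sequences) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        rows.append(tuple(int(f) for f in fields))
    return rows


def check_cumulative(rows):
    """Return True if columns 3 and 4 are running totals of columns 2 and 1*2."""
    total_partitions = 0
    total_sequences = 0
    for size, count, cum_partitions, cum_sequences in rows:
        total_partitions += count
        total_sequences += size * count
        if (total_partitions, total_sequences) != (cum_partitions, cum_sequences):
            return False
    return True


rows = parse_dist(SAMPLE)
print(check_cumulative(rows))
```

The check only makes sense on an unbroken run of rows starting from the
first line, since the running totals depend on every earlier row.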

> If you don't mind, could you elaborate briefly on how groups are created
> based on the dist file? Judging from the line counts, I suspect that the
> script fills the first group with singlets until --max-size is hit; if
> it isn't, it continues filling with doublets, and then moves on to the
> next group once --max-size is crossed?

Yep.
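
A rough Python sketch of that filling scheme, for illustration only. This is
not khmer's actual code: the function name assign_groups is made up, and the
exact cutoff rule (a group is closed as soon as its sequence total reaches or
exceeds max_size) is an assumption.

```python
def assign_groups(size_counts, max_size):
    """Pack partitions into groups, smallest partitions first.

    size_counts: iterable of (partition_size, number_of_partitions) pairs,
                 i.e. columns 1 and 2 of the .dist file.
    max_size:    sequence total at which a group is closed (assumed rule).

    Returns a list of (n_partitions, n_sequences) tuples, one per group.
    """
    groups = []
    n_partitions = 0
    n_sequences = 0
    for size, count in sorted(size_counts):
        for _ in range(count):
            n_partitions += 1
            n_sequences += size
            if n_sequences >= max_size:
                # The group is full: record it and start a new one.
                groups.append((n_partitions, n_sequences))
                n_partitions = 0
                n_sequences = 0
    if n_partitions:
        # Whatever is left over forms the final, partly filled group.
        groups.append((n_partitions, n_sequences))
    return groups


# Toy input: five singlets, three doublets, one triplet.
toy_groups = assign_groups([(1, 5), (2, 3), (3, 1)], max_size=4)
print(toy_groups)
```

With the toy input the first group holds only singlets, and later groups hold
progressively larger partitions, which matches the line counts described in
the question.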

> On my data, I've tried blastn-ing the groups0000 and 0001 produced from
> the partitioning process, but from the results I'd wager that they're
> roughly the same - which was what prompted me to seek advice on how the
> script functioned.

Roughly the same... no, they shouldn't be.  Those are probably spurious
BLAST matches of some sort.  If partitioning worked (and judging from
the example above, you got a lot of partitions), then the reads in
different groups come from different components of the overall de
Bruijn graph.

cheers,
--titus



