[khmer] less reads but more kmers?

Sat Jan 18 05:00:03 PST 2014

On Fri, Jan 17, 2014 at 09:31:32PM -0200, Nacho Caballero wrote:
> I used khmer to digitally normalize two assemblies:
> 
>    - After normalization, Assembly A has *1.5 million reads*, and during
>    assembly SPAdes uses *116 million* kmers (k=37)
>    - After normalization, Assembly B has *1.5 million reads*, during
>    assembly SPAdes uses *612 million* kmers (k=37)
> 
> I followed the same protocol on both assemblies (quality filtering with
> Trimmomatic, 3-pass normalization, etc.), so I don't understand why
> assembly B, with 16x fewer reads, has 8x more kmers than assembly A.
> 
> What are some possible explanations?

Barring some extraordinarily bizarre bug, the answer *must* be that SPAdes
is *choosing to use* more k-mers... I'll ask the SPAdes authors ;)

If you want to check the total number of k-mers, we have some scripts
in khmer to do that.  See 'abundance-dist-single.py' here,

	http://khmer.readthedocs.org/en/latest/scripts.html#scripts-counting
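
For a quick sanity check that is independent of both khmer and SPAdes,
something like the sketch below counts k-mers exactly on a small
subsample of reads. It is only an illustration, not how khmer's scripts
work internally: the function names are made up for this example, it
assumes plain 4-line FASTQ records, it skips k-mers containing N, and it
keeps every distinct k-mer in memory, so it will not scale to a full
data set.

```python
from collections import Counter


def read_fastq_sequences(path):
    """Yield the sequence line of each record in a plain 4-line FASTQ file."""
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            # In 4-line FASTQ, the sequence is the second line of each record.
            if line_number % 4 == 1:
                yield line.strip()


def kmer_counts(reads, k=37):
    """Count every k-mer of length k across an iterable of read strings."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        # A read shorter than k contributes no k-mers (empty range).
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def summarize(counts):
    """Return (total k-mer occurrences, number of distinct k-mers)."""
    return sum(counts.values()), len(counts)
```

Running summarize(kmer_counts(read_fastq_sequences(path), k=37)) on the
same subsample size from each normalized read set gives a total and a
distinct k-mer count that can be compared directly between the two.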

cheers,
--titus
-- 
C. Titus Brown, ctb at msu.edu