[khmer] less reads but more kmers?
C. Titus Brown
ctb at msu.edu
Sat Jan 18 05:50:58 PST 2014
On Sat, Jan 18, 2014 at 05:00:03AM -0800, C. Titus Brown wrote:
> On Fri, Jan 17, 2014 at 09:31:32PM -0200, Nacho Caballero wrote:
> > I used khmer to digitally normalize two assemblies:
> >
> > - After normalization, Assembly A has *1.5 million reads*, and during
> > assembly SPAdes uses *116 million* kmers (k=37)
> > - After normalization, Assembly B has *1.5 million reads*, during
> > assembly SPAdes uses *612 million* kmers (k=37)
> >
> > I followed the same protocol on both assemblies (quality filtering with
> > Trimmomatic, 3-pass normalization, etc.), so I don't understand why
> > assembly B, with 16x fewer reads, has 8x more kmers than assembly A.
> >
> > What are some possible explanations?
>
> Barring some extraordinarily bizarre bug, the answer *must* be SPAdes
> is *choosing to use* more k-mers... I'll ask the SPAdes authors ;)
Anton (one of the SPAdes authors) pointed out that I'd misread the e-mail.

If dataset A and dataset B are from different samples, then they could easily
have different levels of diversity, which would lead to different numbers of
k-mers for the same coverage level.

The simplest explanation, I think, is that dataset B is both more diverse
and has lower coverage than dataset A. I would guess that if you generated
6 times as much data for sample B, diginorm would leave you with many more
reads, although this depends somewhat on the diversity of sample B.
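
To make the diversity point concrete, here is a rough toy sketch in plain
Python. It is not khmer or SPAdes code, and all function names and parameters
are made up for illustration. It draws the same number of reads from a small
random "genome" and from a ten times larger one, then counts distinct
canonical k-mers with an exact set. That is only practical for tiny inputs;
real data sets need the probabilistic counting that khmer provides.

```python
import random

# Hypothetical helper names; this is an illustration, not khmer's API.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def distinct_kmers(reads, k):
    """Count distinct canonical k-mers (a k-mer and its reverse
    complement are treated as the same k-mer) using an exact set."""
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            seen.add(min(kmer, revcomp(kmer)))
    return len(seen)


def sample_reads(genome, n_reads, read_len, rng):
    """Draw error-free reads uniformly at random from one strand."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads


rng = random.Random(0)  # fixed seed so the toy example is repeatable

# Toy stand-ins for a low-diversity and a high-diversity sample.
low_diversity = "".join(rng.choice("ACGT") for _ in range(2000))
high_diversity = "".join(rng.choice("ACGT") for _ in range(20000))

# Same number of reads from each sample, so coverage is 10x lower
# for the high-diversity sample.
reads_a = sample_reads(low_diversity, 500, 100, rng)
reads_b = sample_reads(high_diversity, 500, 100, rng)

print("sample A:", len(reads_a), "reads,", distinct_kmers(reads_a, 21), "distinct 21-mers")
print("sample B:", len(reads_b), "reads,", distinct_kmers(reads_b, 21), "distinct 21-mers")
```

Both samples contribute 500 reads, but the small genome cannot yield more
than 1980 distinct 21-mers however deeply it is sampled, while the larger
one keeps producing new k-mers because its coverage is still low.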
cheers,
--titus
>
> If you want to check the total number of k-mers, we have some scripts
> in khmer to do that. See 'abundance-dist-single.py' here,
>
> http://khmer.readthedocs.org/en/latest/scripts.html#scripts-counting
>
> cheers,
> --titus
> --
> C. Titus Brown, ctb at msu.edu
>
> _______________________________________________
> khmer mailing list
> khmer at lists.idyll.org
> http://lists.idyll.org/listinfo/khmer
--
C. Titus Brown, ctb at msu.edu