[khmer] split-and-strip script ... support for current Illumina headers?

Tue Jul 23 17:51:05 PDT 2013

(redirecting => khmer2lists.idyll.org)

On Tue, Jul 23, 2013 at 01:41:56PM -0700, Joseph Fass wrote:
> Hi Titus,
> 
> Thanks for your response. I've been out of commission for a few days, but
> diving back into the project I was using diginorm with ...
> 
> On the coverage question - I think I'm still missing something. I couldn't
> get what you were referring to as "random" vs "systematic" coverage from
> that blog post. Is one graph made from all k-mers, and their "coverages"
> (or, number of times observed), and the other is from the median k-mer
> coverages of reads? But, still, that should mean that most k-mers are only
> seen ~5 times, which is too few for Velvet. So I'm missing something
> about the definition.

Yeah, this is tricky.  Let me give it a try:

Velvet expects a Poisson distribution of k-mer abundances around the average
coverage, because that's what you'd get if you were sampling reads from the
genome at random.  If the average coverage is 5, then many positions in the
genome will never have been sampled -- that's why they recommend a minimum
coverage of ~30 or more.
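
As a rough illustration of the Poisson argument, here is a minimal sketch
(not Velvet code) that computes what fraction of positions you would expect
to see zero times, or fewer than three times, at a given average coverage.
The genome size and the "fewer than three" threshold are made-up values for
illustration only.  At an average coverage of 5, about 0.7% of positions are
never sampled and roughly 12% are seen fewer than three times; at 30 both
fractions are negligible.

```python
import math

def poisson_pmf(n, mean):
    """Probability of exactly n observations when the expected count is `mean`."""
    return math.exp(-mean) * mean ** n / math.factorial(n)

def fraction_unsampled(mean_coverage):
    """Expected fraction of positions that are never sampled."""
    return poisson_pmf(0, mean_coverage)

def fraction_below(threshold, mean_coverage):
    """Expected fraction of positions seen fewer than `threshold` times."""
    return sum(poisson_pmf(n, mean_coverage) for n in range(threshold))

if __name__ == "__main__":
    genome_size = 5_000_000  # hypothetical genome size, for illustration only
    for coverage in (5, 30):
        missed = fraction_unsampled(coverage) * genome_size
        thin = fraction_below(3, coverage)
        print(f"coverage {coverage}: ~{missed:.0f} positions never sampled, "
              f"{thin:.2%} of positions seen fewer than 3 times")
```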

Diginorm, by contrast, collects reads for each part of the de Bruijn graph
until there is a k-mer coverage of 5 for that *part* of the graph.  This
means that, for any high-coverage data set, the entire graph will have a
minimum coverage of 5 as well as a maximum coverage of 5.  (This is only
approximately true: we estimate each read's coverage from the median abundance
of its k-mers, so the actual coverage varies slightly around 5.)
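
To make the "collect reads until that part of the graph hits 5" idea
concrete, here is a toy sketch of the normalization loop.  It is not the
khmer implementation: it uses an exact in-memory Counter rather than khmer's
counting structure, ignores reverse complements, and the function names and
the default k are assumptions made for illustration.

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def normalize(reads, k=20, cutoff=5):
    """Keep a read only if its median k-mer count so far is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, nothing to count
        # Estimated coverage of this read: median count of its k-mers
        # among the reads that have already been kept.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept

if __name__ == "__main__":
    # 50 copies of one region plus a single read from another region.
    reads = ["ACGTTGCA"] * 50 + ["GGGGCCCCAAAA"]
    kept = normalize(reads, k=4, cutoff=5)
    print(len(kept), "reads kept out of", len(reads))
```

In this toy input the over-sampled region is cut down to the cutoff while the
singly-sampled read is kept, which is why the resulting coverage has both a
floor and a ceiling instead of a Poisson spread.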

If you take each read remaining after normalization and map it to the
assembled genome, and then calculate per-base coverage, you'll see that
each base in the genome is covered almost exactly 7 times.  The difference
between 5 and 7 is just the conversion from k-mer coverage to per-base
coverage: a read of length L contains only L - k + 1 k-mers.
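
For reference, the Velvet manual gives the relation between k-mer coverage
Ck and per-base coverage C as Ck = C * (L - k + 1) / L, where L is the read
length.  A small sketch of the conversion follows; the read length and k
values here are assumptions for illustration, since neither is stated above.
With L = 100 and k = 31, a k-mer coverage of 5 works out to a per-base
coverage of about 7.1; with k = 20 it would be about 6.2.

```python
def kmer_to_base_coverage(kmer_cov, read_len, k):
    """Invert Ck = C * (L - k + 1) / L to get per-base coverage C."""
    return kmer_cov * read_len / (read_len - k + 1)

def base_to_kmer_coverage(base_cov, read_len, k):
    """Velvet manual relation: Ck = C * (L - k + 1) / L."""
    return base_cov * (read_len - k + 1) / read_len

if __name__ == "__main__":
    read_len = 100       # assumed read length
    for k in (20, 31):   # assumed k values
        c = kmer_to_base_coverage(5, read_len, k)
        print(f"k={k}: k-mer coverage 5 ~ per-base coverage {c:.2f}")
```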

To put it another way, diginorm changes the coverage distribution away
from Poisson to something else (which we're still working out).  Assemblers
don't always like this, but Velvet seems to do OK.

> I'll comb through the paper again, and try diginorm with your settings,
> then Velvet (to see what k-mer coverages Velvet sees after diginorm
> three-pass'ing down to C=5). If you can clarify briefly, though, I'd
> appreciate it.

The numbers output by Velvet are correct but the distributions are wrong :)

HTH... a bit?

cheers,
--titus