[khmer] Diginorm and error correction

C. Titus Brown ctb at msu.edu
Sat Dec 6 04:20:05 PST 2014


On Fri, Dec 05, 2014 at 08:53:31AM -0800, C. Titus Brown wrote:
> On Fri, Dec 05, 2014 at 04:49:27PM +0000, Daniel Standage wrote:
> > Greetings!
> > 
> > I have a quick question. I understand the primary motivation behind digital
> > normalization, the idea of discarding data without losing any information.
> > My question is about the claim that diginorm retains all real kmers while
> > discarding erroneous ones. After reading over the arXiv preprint again, it
> > seems this claim is independent of the three-pass protocol which does
> > additional error correction.
> > 
> > If we assume that errors are present in low abundance, why would diginorm
> > ever discard a read containing an error? Wouldn't the same error have to be
> > present a certain number of times before the associated kmers had
> > sufficient coverage to discard those reads? In that case, we're much less
> > confident that it's not real variation. Or are there probabilistic data
> > structures involved that discard likely errors?
> > 
> > Thanks!
> > Daniel
> 
> Hey Daniel,
> 
> More/better answer later, but look at the part of the paper where we talk
> about losing tips of contigs in the mRNAseq simulation.  The median k-mer count
> cannot tell the difference between undersampled contig edges and errors (which
> may occur in real data sets).
> 
> But good question :)

Hah, I think I misunderstood your question the first time 'round.

Erroneous k-mers are present in every read with an error, so if any reads are
discarded that have errors in them, erroneous k-mers are discarded along with
that read.  So if you have a coverage of 100 and 80% of those reads are
discarded, then roughly 80% of the errors in your original data set also go
away.
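
To put rough numbers on that, here is a toy simulation. It is only a sketch: it assumes a read is discarded independently of whether it carries an error, and the function name, the 30% per-read error probability and the read count are all made up for illustration.

```python
import random


def fraction_of_error_reads_removed(n_reads=10000, p_error=0.3,
                                    p_discard=0.8, seed=1):
    """Toy model: fraction of error-containing reads that get discarded.

    Assumption (not from khmer): discarding a read is independent of
    whether the read contains an error. All parameters are hypothetical.
    """
    rng = random.Random(seed)
    with_error = 0
    removed = 0
    for _ in range(n_reads):
        has_error = rng.random() < p_error
        discarded = rng.random() < p_discard
        if has_error:
            with_error += 1
            if discarded:
                removed += 1
    return removed / with_error


if __name__ == "__main__":
    print(fraction_of_error_reads_removed())
```

Under that independence assumption the result should come out close to p_discard, i.e. close to 0.8 with the defaults.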

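The trick is really that using the median k-mer estimator allows us to ask if
*most* of a read is new.  A single error only changes the k-mers that overlap
it, which is usually a minority of the k-mers in the read, so the median is
unaffected.  That means the same error does not have to be seen many times:
if two otherwise identical (or mostly overlapping) reads have different
errors, diginorm will regard them as the same anyway.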
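
A minimal sketch of that median test, in case it helps. This is not khmer's code: it uses an exact Python dict where khmer uses a probabilistic counting structure, and the names (kmers, diginorm, REF) and the values k=5 and cutoff=3 are made up for illustration.

```python
from statistics import median


def kmers(read, k):
    """Return all overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def diginorm(reads, k=5, cutoff=3):
    """Keep a read only if the median count of its k-mers is below cutoff.

    Sketch only: exact dict counting, hypothetical k and cutoff.
    """
    counts = {}
    kept = []
    for read in reads:
        kms = kmers(read, k)
        if not kms:
            continue  # read shorter than k
        if median(counts.get(km, 0) for km in kms) < cutoff:
            kept.append(read)
            # only k-mers from kept reads are counted
            for km in kms:
                counts[km] = counts.get(km, 0) + 1
    return kept


# Hypothetical 23-base reference and two copies with different single errors.
REF = "ACGTTGCAAGGCTTACCGATAGC"
ERR1 = REF[:11] + "A" + REF[12:]  # C -> A at position 11
ERR2 = REF[:5] + "T" + REF[6:]    # G -> T at position 5

if __name__ == "__main__":
    kept = diginorm([REF] * 3 + [ERR1, ERR2] + [REF] * 5)
    print(len(kept), "reads kept")
```

Each error here touches 5 of the read's 19 k-mers, so once the first three clean copies are in, the median stays at the cutoff and both erroneous reads are discarded along with the remaining clean ones. Note that order matters in this sketch: an erroneous read that arrives before coverage reaches the cutoff is kept.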

HTH!

cheers,
--titus


