[khmer] Diginorm and error correction

Daniel Standage daniel.standage at gmail.com
Sat Dec 6 06:04:35 PST 2014


Ah, I see. It seems that if every k-mer of every read were checked, no
error-containing read would ever be discarded. But if only a
representative set of k-mers from each read is checked, then that makes much
more sense. I guess I need to read up on the median k-mer estimator.

Thanks,
Daniel
On Sat, Dec 6, 2014 at 7:20 AM C. Titus Brown <ctb at msu.edu> wrote:

> On Fri, Dec 05, 2014 at 08:53:31AM -0800, C. Titus Brown wrote:
> > On Fri, Dec 05, 2014 at 04:49:27PM +0000, Daniel Standage wrote:
> > > Greetings!
> > >
> > > I have a quick question. I understand the primary motivation behind
> > > digital normalization, the idea of discarding data without losing any
> > > information. My question is about the claim that diginorm retains all
> > > real kmers while discarding erroneous ones. After reading over the
> > > arXiv preprint again, it seems this claim is independent of the
> > > three-pass protocol which does additional error correction.
> > >
> > > If we assume that errors are present in low abundance, why would
> > > diginorm ever discard a read containing an error? Wouldn't the same
> > > error have to be present a certain number of times before the
> > > associated kmers had sufficient coverage to discard those reads? In
> > > that case, we're much less confident that it's not real variation. Or
> > > are there probabilistic data structures involved that discard likely
> > > errors?
> > >
> > > Thanks!
> > > Daniel
> >
> > Hey Daniel,
> >
> > More/better answer later, but look at the part of the paper where we
> > talk about losing tips of contigs in the mRNAseq simulation.  The median
> > k-mer count cannot tell the difference between undersampled contig edges
> > and errors (which may occur in real data sets).
> >
> > But good question :)
>
> Hah, I think I misunderstood your question the first time 'round.
>
> Erroneous k-mers are present in every read with an error, so if any reads
> are discarded that have errors in them, erroneous k-mers are discarded
> along with that read.  So if you have a coverage of 100 and 80% of those
> reads are discarded, then roughly 80% of the errors in your original data
> set also go away.
>
> The trick is really that using the median k-mer estimator allows us to ask
> if *most* of a read is new, and so if two otherwise identical (or mostly
> overlapping) reads have different errors, diginorm will regard them as
> the same anyway.
>
> HTH!
>
> cheers,
> --titus
>
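
To make the median k-mer argument above concrete, here is a minimal sketch
of the idea in Python. It is an illustration only, not khmer's code: the
names `kmers` and `diginorm`, the values k=5 and cutoff=3, and the toy reads
are all assumptions made for this example, and it uses an exact in-memory
counter rather than the memory-efficient probabilistic counting that khmer
uses.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return every overlapping k-mer of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def diginorm(reads, k=5, cutoff=3):
    """Keep a read only if the median count of its k-mers is below cutoff.

    Hypothetical helper for illustration. Counts are updated only from
    reads that are kept, so a discarded read contributes nothing.
    """
    counts = Counter()  # exact counts; an assumption made for simplicity
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read shorter than k, nothing to estimate
        # Looking up a missing key in a Counter returns 0 without adding it.
        if median(counts[km] for km in ks) < cutoff:
            kept.append(read)
            counts.update(ks)
    return kept


# Toy data: a 20-base read whose 16 5-mers are all distinct, and a copy
# with a single substitution at position 10, which changes 5 of the 16.
reference = "ACGTTGCAAGGCTTAGCCAT"
with_error = "ACGTTGCAAGTCTTAGCCAT"

# Error read arrives after coverage has saturated.
late_error = diginorm([reference] * 10 + [with_error])

# Error read arrives first, before any coverage exists.
early_error = diginorm([with_error] + [reference] * 10)
```

In the first run only three copies of the reference read are kept and the
error-containing read is dropped: 11 of its 16 k-mers already have count 3,
so its median is 3 even though the 5 erroneous k-mers have never been seen.
In the second run the error-containing read is kept, because it arrives
while most of it is still new. This matches the point in the thread that
errors are removed in proportion to the reads discarded, not detected
individually.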