[khmer] Diginorm and error correction

Sat Dec 6 06:08:45 PST 2014

Yep, exactly.  We had that behavior in there as another script and it turned
out to be useless so I removed it :).

cheers,
--titus

On Sat, Dec 06, 2014 at 02:04:35PM +0000, Daniel Standage wrote:
> Ah, I see. It seems if every kmer of every read is checked, no
> error-containing reads would ever be discarded. But if only a
> representative set of kmers from each read is checked, then that makes much
> more sense. I guess I need to read up on the median kmer estimator.
> 
> Thanks,
> Daniel
> On Sat, Dec 6, 2014 at 7:20 AM C. Titus Brown <ctb at msu.edu> wrote:
> 
> > On Fri, Dec 05, 2014 at 08:53:31AM -0800, C. Titus Brown wrote:
> > > On Fri, Dec 05, 2014 at 04:49:27PM +0000, Daniel Standage wrote:
> > > > Greetings!
> > > >
> > > > I have a quick question. I understand the primary motivation behind
> > > > digital normalization, the idea of discarding data without losing any
> > > > information. My question is about the claim that diginorm retains all
> > > > real kmers while discarding erroneous ones. After reading over the
> > > > arXiv preprint again, it seems this claim is independent of the
> > > > three-pass protocol which does additional error correction.
> > > >
> > > > If we assume that errors are present in low abundance, why would
> > > > diginorm ever discard a read containing an error? Wouldn't the same
> > > > error have to be present a certain number of times before the
> > > > associated kmers had sufficient coverage to discard those reads? In
> > > > that case, we're much less confident that it's not real variation. Or
> > > > are there probabilistic data structures involved that discard likely
> > > > errors?
> > > >
> > > > Thanks!
> > > > Daniel
> > >
> > > Hey Daniel,
> > >
> > > More/better answer later, but look at the part of the paper where we
> > > talk about losing tips of contigs in the mRNAseq simulation.  The
> > > median k-mer count cannot tell the difference between undersampled
> > > contig edges and errors (which may occur in real data sets).
> > >
> > > But good question :)
> >
> > Hah, I think I misunderstood your question the first time 'round.
> >
> > Erroneous k-mers are present in every read with an error, so if any
> > reads are discarded that have errors in them, erroneous k-mers are
> > discarded along with that read.  So if you have a coverage of 100 and
> > 80% of those reads are discarded, then roughly 80% of the errors in
> > your original data set also go away.
> >
> > The trick is really that using the median k-mer estimator allows us to
> > ask if *most* of a read is new, and so if two otherwise identical (or
> > mostly overlapping) reads have different errors, diginorm will regard
> > them as the same anyway.
> >
> > HTH!
> >
> > cheers,
> > --titus
> >

-- 
C. Titus Brown, ctb at msu.edu
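
For anyone following the thread, here is a minimal sketch of the median
k-mer idea discussed above. It is an illustration only, not khmer's code or
API: the function names (kmers, diginorm), the parameters (k, cutoff), and
the toy reads are all made up for this example, and an exact Counter stands
in for whatever counting structure a real implementation would use. A read
is kept only if the median count of its k-mers is still below the cutoff,
so a read carrying a single error gets discarded once the region it covers
has been seen often enough.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diginorm(reads, k=5, cutoff=3):
    """Keep a read only if the median count of its k-mers is below cutoff.

    Hypothetical sketch: exact counts, single pass, no error trimming.
    """
    counts = Counter()
    kept = []
    for read in reads:
        km = kmers(read, k)
        if not km:
            continue  # read shorter than k, nothing to estimate from
        # Median abundance of this read's k-mers among reads kept so far.
        if median(counts[x] for x in km) < cutoff:
            kept.append(read)
            counts.update(km)  # only kept reads contribute to the counts
    return kept


# Toy data (assumed): five identical reads, then one with a single error.
read = "ACGTTGCAAGGCTTACCGAT"
bad = "ACGTTGCAAGTCTTACCGAT"  # G -> T at position 10
kept = diginorm([read] * 5 + [bad], k=5, cutoff=3)
print(len(kept), bad in kept)  # 3 False
```

The single error changes only 5 of the read's 16 k-mers, so the median
still reflects the 11 k-mers that are already at the cutoff, and the read is
dropped along with its erroneous k-mers. If the same erroneous read arrived
first, before coverage reached the cutoff, it would be kept.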