[khmer] filter-below-abundance typical discard rate

C. Titus Brown ctb at msu.edu
Tue Jun 10 11:52:21 PDT 2014


On Tue, Jun 10, 2014 at 02:19:42PM -0400, Chuck wrote:
> Hmmm, the false positive rate was 0.015. Here are the load-into-counting
> parameters:
> 
> PARAMETERS:
>  - kmer size =    20            (-k)
>  - n tables =     4             (-N)
>  - min tablesize = 3.7e+10      (-x)
> 
> Any ideas for diagnosing if normalize-by-median is keeping many highly
> erroneous reads? Would that be apparent from the kmer histogram?
> 
> The final discard rate for filter-below-abundance with a cutoff of 225 was
> 16% (reads normalized to C=20). Does this seem high given your experience?

FP rate of 1.5% is not high enough to cause problems, so good news there!
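
As a sanity check, that number can be re-derived from the table
parameters with the standard Bloom-filter collision approximation.
Here's a back-of-the-envelope sketch (not khmer's exact code; the
distinct-k-mer count below is a hypothetical value chosen to land
near 1.5% -- substitute whatever load-into-counting actually reported):

    import math

    # Table parameters from load-into-counting above (-N, -x).
    n_tables = 4
    tablesize = 3.7e10

    # Hypothetical: number of distinct k-mers loaded. Roughly 1.6e10
    # would reproduce the reported 1.5%.
    n_unique_kmers = 1.6e10

    # Expected occupancy of one table under uniform hashing:
    # P(a slot is hit at least once) = 1 - exp(-n/m).
    occupancy = 1.0 - math.exp(-n_unique_kmers / tablesize)

    # A novel k-mer reads as a false positive only if it collides
    # in all N tables.
    fp_rate = occupancy ** n_tables
    print("approx. false positive rate: %.4f" % fp_rate)  # ~0.0152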

Instead of a k-mer coverage plot, could you generate a coverage
spectrum?

   http://davis-assembly-masterclass-2013.readthedocs.org/en/latest/titus-notes.html#generating-a-coverage-plot-coverage-spectrum-without-a-reference
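
The recipe there boils down to: count all k-mers, annotate each read
with the median abundance of its k-mers, and histogram those medians.
If it helps, here's a minimal pure-Python sketch of the same idea --
khmer's counting is far faster and also folds in reverse complements,
which this toy version skips; 'reads.fa' is a placeholder filename:

    from collections import Counter

    K = 20

    def read_seqs(path):
        """Yield sequences from a FASTA file."""
        seq = []
        for line in open(path):
            line = line.strip()
            if line.startswith('>'):
                if seq:
                    yield ''.join(seq)
                    seq = []
            else:
                seq.append(line)
        if seq:
            yield ''.join(seq)

    def kmers(seq):
        for i in range(len(seq) - K + 1):
            yield seq[i:i + K]

    # Pass 1: count every k-mer in the data set.
    counts = Counter()
    for seq in read_seqs('reads.fa'):
        counts.update(kmers(seq))

    # Pass 2: median k-mer abundance per read, histogrammed.
    spectrum = Counter()
    for seq in read_seqs('reads.fa'):
        abunds = sorted(counts[km] for km in kmers(seq))
        if abunds:
            spectrum[abunds[len(abunds) // 2]] += 1

    for coverage in sorted(spectrum):
        print("%d %d" % (coverage, spectrum[coverage]))

A peak at the expected coverage plus a big spike down at 1-2x is the
usual signature of lots of highly erroneous reads.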

cheers,
--titus

> On Tue, Jun 10, 2014 at 11:26 AM, C. Titus Brown <ctb at msu.edu> wrote:
> 
> > On Mon, Jun 09, 2014 at 08:33:45PM -0400, Chuck wrote:
> > > I'm curious about typical values that people are seeing with
> > > filter-below-abundance. With the default cutoff (50) I was discarding
> > > ~50% of bp (after normalizing with C=20). If I increase the cutoff to
> > > 225 the discard rate drops to 25%. I thought I was rigorously adapter
> > > trimming my reads (I generally use scythe with default parameters and
> > > I monitor the output fairly closely). Is this way outside the
> > > developers' experience?
> > >
> > > Also, at a cutoff of 235, I discard 0%. Not sure how to interpret this.
> > > I realize that you don't count kmers above 255 by default with
> > > load-into-counting. It seems that I don't have any kmers at the ends of
> > > reads at a depth >= 235, yet I trim much more data with what seems like
> > > a small change in the cutoff value from 235 to 225. Also, 235 < 255 :).
> >
> > That's tremendously weird.
> >
> > I have no other useful comment :)
> >
> > I can come up with some wild hypotheses about what might be going on,
> > but have never seen this before.
> >
> > If, for example, your data were high coverage but each read had a lot
> > of errors, then normalize-by-median might be keeping a lot of the
> > highly erroneous reads while filter-below-abund trimmed off the
> > legitimate sequence.
> >
> > I have no idea how to interpret the 225-to-235 numbers!  Fascinating.
> >
> > Hmm, what table size are you using and what false positive rate is being
> > reported?
> >
> > cheers,
> > --titus
> > --
> > C. Titus Brown, ctb at msu.edu
> >


-- 
C. Titus Brown, ctb at msu.edu


