<div dir="ltr">Hmmm, the false positive rate was 0.015. Here are the load-into-counting parameters:<div><br></div><div><div>PARAMETERS:</div><div> - kmer size =    20            (-k)</div><div> - n tables =     4             (-N)</div>

<div> - min tablesize = 3.7e+10      (-x)</div></div><div><br></div><div>Any ideas for diagnosing whether normalize-by-median is keeping many highly erroneous reads? Would that be apparent from the kmer histogram?</div><div><br>

</div><div>The final discard rate for filter-below-abundance with a cutoff of 225 was 16% (reads normalized to C=20). Does this seem high given your experience?</div><div><br></div><div>-Chuck</div></div><div class="gmail_extra">

<br><br><div class="gmail_quote">On Tue, Jun 10, 2014 at 11:26 AM, C. Titus Brown <span dir="ltr">&lt;<a href="mailto:ctb@msu.edu" target="_blank">ctb@msu.edu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="HOEnZb"><div class="h5">On Mon, Jun 09, 2014 at 08:33:45PM -0400, Chuck wrote:<br>

&gt; I&#39;m curious about typical values that people are seeing with<br>

&gt; filter-below-abundance. With the default cutoff (50) I was discarding ~50%<br>

&gt; of bp (after normalizing with C=20). If I increase the cutoff to 225 the<br>

&gt; discard rate drops to 25%. I thought I was rigorously adapter trimming my<br>

&gt; reads (I generally use scythe with default parameters and I monitor the<br>

&gt; output fairly closely). Is this way outside the developers&#39; experience?<br>

&gt;<br>

&gt; Also, at a cutoff of 235, I discard 0%. Not sure how to interpret this. I<br>

&gt; realize that you don&#39;t count kmers above 255 by default with<br>

&gt; load-into-counting. It seems that I don&#39;t have any kmers at the ends of<br>

&gt; reads at a depth &gt;=235 but I trim much more data with what seems like a<br>

&gt; small change in the cutoff value from 235 to 225. Also, 235 &lt; 255 :) .<br>

<br>

</div></div>That&#39;s tremendously weird.<br>

<br>

I have no other useful comment :)<br>

<br>

I can come up with some wild hypotheses about what might be going on,<br>

but have never seen this before.<br>

<br>

If, for example, your data was high coverage but each read had a lot of errors,<br>

then normalize-by-median might be keeping a lot of the highly erroneous<br>

reads while filter-below-abund trimmed off the legitimate sequence.<br>

<br>

I have no idea how to interpret the 225-to-235 numbers!  Fascinating.<br>

<br>

Hmm, what table size are you using and what false positive rate is being<br>

reported?<br>

<br>

cheers,<br>

--titus<br>

<span class="HOEnZb"><font color="#888888">--<br>

C. Titus Brown, <a href="mailto:ctb@msu.edu">ctb@msu.edu</a><br>

</font></span></blockquote></div><br></div>
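<div dir="ltr">On the kmer histogram question above: here is a minimal sketch of an abundance histogram, not khmer's implementation. It counts kmers exactly in memory, so it has no false positives and no 255 cap. It also does not merge reverse complements, and both function names are made up for illustration. It is only practical on a small subsample of the normalized reads.</div>

```python
from collections import Counter

def kmer_abundance_histogram(reads, k=20):
    """Return {abundance: number of distinct k-mers seen that many times}.

    Hypothetical helper for illustration; exact counting, forward strand only.
    """
    counts = Counter()
    for read in reads:
        # Slide a window of width k across the read; reads shorter than k add nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    # Collapse the per-k-mer counts into an abundance spectrum.
    return Counter(counts.values())

def singleton_fraction(histogram):
    """Fraction of distinct k-mers that occur exactly once."""
    total = sum(histogram.values())
    return histogram.get(1, 0) / total if total else 0.0

if __name__ == "__main__":
    # Toy reads with k=4 so the output is easy to check by hand.
    toy_reads = ["ACGTACGT", "ACGTACGA"]
    hist = kmer_abundance_histogram(toy_reads, k=4)
    for abundance in sorted(hist):
        print(abundance, hist[abundance])
    print("singleton fraction:", singleton_fraction(hist))
```

<div dir="ltr">Sequencing errors tend to create kmers that occur only once or twice. If a very large share of the distinct kmers in the normalized reads sits at abundance 1, that would be consistent with many erroneous reads being kept, though it is not proof on its own.</div>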