<div dir="ltr">I think this could be the bug of the normalize-by-median script. There script with (-p) keep the both paired ends but only count the kmers of one of them. This is why abound-dist.py after normalize-by-median show many kmers with count 0.<div>I think @camillescott was working on fixing that.<br><div><br></div><div>Tamer</div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, May 11, 2015 at 6:13 AM,  <span dir="ltr">&lt;<a href="mailto:khmer-request@lists.idyll.org" target="_blank">khmer-request@lists.idyll.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Send khmer mailing list submissions to<br>

        <a href="mailto:khmer@lists.idyll.org">khmer@lists.idyll.org</a><br>

<br>

To subscribe or unsubscribe via the World Wide Web, visit<br>

        <a href="http://lists.idyll.org/listinfo/khmer" target="_blank">http://lists.idyll.org/listinfo/khmer</a><br>

or, via email, send a message with subject or body &#39;help&#39; to<br>

        <a href="mailto:khmer-request@lists.idyll.org">khmer-request@lists.idyll.org</a><br>

<br>

You can reach the person managing the list at<br>

        <a href="mailto:khmer-owner@lists.idyll.org">khmer-owner@lists.idyll.org</a><br>

<br>

When replying, please edit your Subject line so it is more specific<br>

than &quot;Re: Contents of khmer digest...&quot;<br>

<br>

<br>

Today&#39;s Topics:<br>

<br>

   1. inconsistent unique k-mer counting (Joran Martijn)<br>

   2. Re: inconsistent unique k-mer counting (C. Titus Brown)<br>

   3. Re: inconsistent unique k-mer counting (Joran Martijn)<br>

   4. Re: inconsistent unique k-mer counting (C. Titus Brown)<br>

<br>

<br>

----------------------------------------------------------------------<br>

<br>

Message: 1<br>

Date: Mon, 11 May 2015 11:29:31 +0200<br>

From: Joran Martijn &lt;<a href="mailto:joran.martijn@icm.uu.se">joran.martijn@icm.uu.se</a>&gt;<br>

Subject: [khmer] inconsistent unique k-mer counting<br>

To: &lt;<a href="mailto:khmer@lists.idyll.org">khmer@lists.idyll.org</a>&gt;<br>

Message-ID: &lt;<a href="mailto:5550767B.4070501@icm.uu.se">5550767B.4070501@icm.uu.se</a>&gt;<br>

Content-Type: text/plain; charset=&quot;utf-8&quot;; Format=&quot;flowed&quot;<br>

<br>

Dear Khmer mailing list,<br>

<br>

I&#39;m trying to compare the number of unique k-mers (lets say 20-mers) in<br>

the raw dataset and diginormed dataset, similar as was done in the<br>

original diginorm paper.<br>

<br>

I have done this as follows. First I create the counting table:<br>

<br>

load-into-counting.py \<br>

     -t \<br>

     -k 20 \<br>

     -N 4 -x 16e9 \<br>

     --threads 30 \<br>

     test.ct \<br>

     test.fastq.gz \<br>

     &amp;&gt; test.ct.report<br>

<br>

The script then reports ~3,1 billion unique 20-mers.<br>

<br>

Then I perform the diginorm:<br>

<br>

normalize-by-median.py \<br>

     -p -t \<br>

     -C 20 \<br>

     --loadtable test.ct \<br>

     -o test_k20_C20.fastq.gz.keep \<br>

     test.fastq.gz \<br>

     &amp;&gt; test_k20_C20.report<br>

<br>

The script then reports approximately 1 million unique 20-mers (if true,<br>

this would mean a 3000-fold reduction in unique kmers, that sounds like<br>

too much to me).<br>

<br>

Then, I count the number of unique 20-mers again in the normalized<br>

dataset using load-into-counting.py:<br>

<br>

load-into-counting.py \<br>

     -t \<br>

     -k 20 \<br>

     -N 4 -x 16e9 \<br>

     --threads 30 \<br>

     test2.ct \<br>

     test_k20_C20.fastq.gz.keep \<br>

     &amp;&gt; test2.ct.report<br>

<br>

This time however, the script reports approximately 2,8 billion unique<br>

20-mers. I am confused, since I was expecting around 1 million, like<br>

normalize-by-median reported.<br>

Are the two scripts reporting a different kind of unique k-mers? Which<br>

counts should I compare to get an estimate of the effect of diginorm on<br>

the dataset? Or is there a more easy and straightforward way to go about<br>

this?<br>

<br>

Kind regards,<br>

<br>

Joran Martijn<br>

-------------- next part --------------<br>

An HTML attachment was scrubbed...<br>

URL: &lt;<a href="http://lists.idyll.org/pipermail/khmer/attachments/20150511/9363769d/attachment-0001.html" target="_blank">http://lists.idyll.org/pipermail/khmer/attachments/20150511/9363769d/attachment-0001.html</a>&gt;<br>

<br>

------------------------------<br>

<br>

Message: 2<br>

Date: Mon, 11 May 2015 03:12:52 -0700<br>

From: &quot;C. Titus Brown&quot; &lt;<a href="mailto:ctbrown@ucdavis.edu">ctbrown@ucdavis.edu</a>&gt;<br>

Subject: Re: [khmer] inconsistent unique k-mer counting<br>

To: Joran Martijn &lt;<a href="mailto:joran.martijn@icm.uu.se">joran.martijn@icm.uu.se</a>&gt;<br>

Cc: <a href="mailto:khmer@lists.idyll.org">khmer@lists.idyll.org</a><br>

Message-ID: &lt;<a href="mailto:20150511101252.GA3199@idyll.org">20150511101252.GA3199@idyll.org</a>&gt;<br>

Content-Type: text/plain; charset=us-ascii<br>

<br>

On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:<br>

&gt; Dear Khmer mailing list,<br>

&gt;<br>

&gt; I&#39;m trying to compare the number of unique k-mers (lets say 20-mers) in<br>

&gt; the raw dataset and diginormed dataset, similar as was done in the<br>

&gt; original diginorm paper.<br>

<br>

[ elided ]<br>

<br>

Hi Joran,<br>

<br>

that certainly doesn&#39;t sound good :). Would it be possible to convey the<br>

various report files to us, publicly or privately?<br>

<br>

thanks,<br>

--titus<br>

<br>

p.s. Thank you for the very detailed question!<br>

<br>

<br>

<br>

------------------------------<br>

<br>

Message: 3<br>

Date: Mon, 11 May 2015 13:06:24 +0200<br>

From: Joran Martijn &lt;<a href="mailto:joran.martijn@icm.uu.se">joran.martijn@icm.uu.se</a>&gt;<br>

Subject: Re: [khmer] inconsistent unique k-mer counting<br>

To: &lt;<a href="mailto:titus@idyll.org">titus@idyll.org</a>&gt;<br>

Cc: <a href="mailto:khmer@lists.idyll.org">khmer@lists.idyll.org</a><br>

Message-ID: &lt;<a href="mailto:55508D30.1060604@icm.uu.se">55508D30.1060604@icm.uu.se</a>&gt;<br>

Content-Type: text/plain; charset=&quot;windows-1252&quot;; Format=&quot;flowed&quot;<br>

<br>

Hej Titus,<br>

<br>

Thanks for the quick reply! Here are the report files, which are<br>

basically the STDERR and STDOUT output of the scripts.<br>

Quick note before the reports, I made a small mistake in my<br>

openingspost. The Coverage threshold I tried for these reports was 5,<br>

not 20.<br>

<br>

Here the report file of the first load-into-counting.py execution (on<br>

the raw sequence data), test.ct.report:<br>

<br>

|| This is the script &#39;load-into-counting.py&#39; in khmer.<br>

|| You are running khmer version 1.3<br>

|| You are also using screed version 0.8<br>

||<br>

|| If you use this script in a publication, please cite EACH of the<br>

following:<br>

||<br>

||   * MR Crusoe et al., 2014. <a href="http://dx.doi.org/10.6084/m9.figshare.979190" target="_blank">http://dx.doi.org/10.6084/m9.figshare.979190</a><br>

||   * Q Zhang et al., <a href="http://dx.doi.org/10.1371/journal.pone.0101271" target="_blank">http://dx.doi.org/10.1371/journal.pone.0101271</a><br>

||   * A. D303266ring et al. <a href="http://dx.doi.org:80/10.1186/1471-2105-9-11" target="_blank">http://dx.doi.org:80/10.1186/1471-2105-9-11</a><br>

||<br>

|| Please see <a href="http://khmer.readthedocs.org/en/latest/citations.html" target="_blank">http://khmer.readthedocs.org/en/latest/citations.html</a> for<br>

details.<br>

<br>

<br>

PARAMETERS:<br>

  - kmer size =    20            (-k)<br>

  - n tables =     4             (-N)<br>

  - min tablesize = 1.6e+10      (-x)<br>

<br>

Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)<br>

--------<br>

Saving k-mer counting table to test.ct<br>

Loading kmers from sequences in [&#39;test.fastq.gz&#39;]<br>

making k-mer counting table<br>

consuming input test.fastq.gz<br>

Total number of unique k-mers: 3102943887<br>

saving test.ct<br>

fp rate estimated to be 0.008<br>

DONE.<br>

wrote to: <a href="http://test.ct.info" target="_blank">test.ct.info</a><br>

<br>

Here the report file of the normalize-by-median.py, test_k20_C5.report<br>

<br>

|| This is the script &#39;normalize-by-median.py&#39; in khmer.<br>

|| You are running khmer version 1.3<br>

|| You are also using screed version 0.8<br>

||<br>

|| If you use this script in a publication, please cite EACH of the<br>

following:<br>

||<br>

||   * MR Crusoe et al., 2014. <a href="http://dx.doi.org/10.6084/m9.figshare.979190" target="_blank">http://dx.doi.org/10.6084/m9.figshare.979190</a><br>

||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]<br>

||<br>

|| Please see <a href="http://khmer.readthedocs.org/en/latest/citations.html" target="_blank">http://khmer.readthedocs.org/en/latest/citations.html</a> for<br>

details.<br>

<br>

<br>

PARAMETERS:<br>

  - kmer size =    20            (-k)<br>

  - n tables =     4             (-N)<br>

  - min tablesize = 1.6e+10      (-x)<br>

<br>

Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)<br>

--------<br>

... kept 58012 of 200000 or 29%<br>

... in file test.fastq.gz<br>

... kept 116210 of 400000 or 29%<br>

... in file test.fastq.gz<br>

<br>

..... etc etc etc .....<br>

<br>

... kept 90482098 of 346200000 or 26%<br>

... in file test.fastq.gz<br>

... kept 90529526 of 346400000 or 26%<br>

... in file test.fastq.gz<br>

Total number of unique k-mers: 850221<br>

loading k-mer counting table from test.ct<br>

DONE with test.fastq.gz; kept 90547512 of 346477608 or 26%<br>

output in test_k20_C5.fastq.gz.keep<br>

fp rate estimated to be 0.008<br>

<br>

And here the second load-into-counting.py report, test2.ct.report<br>

<br>

|| This is the script &#39;load-into-counting.py&#39; in khmer.<br>

|| You are running khmer version 1.3<br>

|| You are also using screed version 0.8<br>

||<br>

|| If you use this script in a publication, please cite EACH of the<br>

following:<br>

||<br>

||   * MR Crusoe et al., 2014. <a href="http://dx.doi.org/10.6084/m9.figshare.979190" target="_blank">http://dx.doi.org/10.6084/m9.figshare.979190</a><br>

||   * Q Zhang et al., <a href="http://dx.doi.org/10.1371/journal.pone.0101271" target="_blank">http://dx.doi.org/10.1371/journal.pone.0101271</a><br>

||   * A. D303266ring et al. <a href="http://dx.doi.org:80/10.1186/1471-2105-9-11" target="_blank">http://dx.doi.org:80/10.1186/1471-2105-9-11</a><br>

||<br>

|| Please see <a href="http://khmer.readthedocs.org/en/latest/citations.html" target="_blank">http://khmer.readthedocs.org/en/latest/citations.html</a> for<br>

details.<br>

<br>

<br>

PARAMETERS:<br>

  - kmer size =    20            (-k)<br>

  - n tables =     4             (-N)<br>

  - min tablesize = 1.6e+10      (-x)<br>

<br>

Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)<br>

--------<br>

Saving k-mer counting table to test2.ct<br>

Loading kmers from sequences in [&#39;test_k20_C5.fastq.gz.keep&#39;]<br>

making k-mer counting table<br>

consuming input test_k20_C5.fastq.gz.keep<br>

Total number of unique k-mers: 2822473008<br>

saving test2.ct<br>

<br>

Hope this helps!<br>

<br>

Joran<br>

<br>

On 11/05/15 12:12, C. Titus Brown wrote:<br>

&gt; On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:<br>

&gt;&gt; Dear Khmer mailing list,<br>

&gt;&gt;<br>

&gt;&gt; I&#39;m trying to compare the number of unique k-mers (lets say 20-mers) in<br>

&gt;&gt; the raw dataset and diginormed dataset, similar as was done in the<br>

&gt;&gt; original diginorm paper.<br>

&gt; [ elided ]<br>

&gt;<br>

&gt; Hi Joran,<br>

&gt;<br>

&gt; that certainly doesn&#39;t sound good :). Would it be possible to convey the<br>

&gt; various report files to us, publicly or privately?<br>

&gt;<br>

&gt; thanks,<br>

&gt; --titus<br>

&gt;<br>

&gt; p.s. Thank you for the very detailed question!<br>

<br>

-------------- next part --------------<br>

An HTML attachment was scrubbed...<br>

URL: &lt;<a href="http://lists.idyll.org/pipermail/khmer/attachments/20150511/96dd402e/attachment-0001.htm" target="_blank">http://lists.idyll.org/pipermail/khmer/attachments/20150511/96dd402e/attachment-0001.htm</a>&gt;<br>

<br>

------------------------------<br>

<br>

Message: 4<br>

Date: Mon, 11 May 2015 04:13:26 -0700<br>

From: &quot;C. Titus Brown&quot; &lt;<a href="mailto:ctbrown@ucdavis.edu">ctbrown@ucdavis.edu</a>&gt;<br>

Subject: Re: [khmer] inconsistent unique k-mer counting<br>

To: Joran Martijn &lt;<a href="mailto:joran.martijn@icm.uu.se">joran.martijn@icm.uu.se</a>&gt;<br>

Cc: <a href="mailto:khmer@lists.idyll.org">khmer@lists.idyll.org</a><br>

Message-ID: &lt;<a href="mailto:20150511111326.GG3199@idyll.org">20150511111326.GG3199@idyll.org</a>&gt;<br>

Content-Type: text/plain; charset=us-ascii<br>

<br>

OK, that&#39;s very weird - this must be a bug, but I&#39;ll be darned if I can<br>

figure out what might be causing it.  The numbers in load-into-counting<br>

should be correct but I&#39;ll have to independently confirm that.<br>

<br>

BTW, for the first round of diginorm I&#39;d use C=20; see 3-pass diginorm<br>

in the dn paper.<br>

<br>

best,<br>

--titus<br>

<br>

On Mon, May 11, 2015 at 01:06:24PM +0200, Joran Martijn wrote:<br>

&gt; Hej Titus,<br>

&gt;<br>

&gt; Thanks for the quick reply! Here are the report files, which are<br>

&gt; basically the STDERR and STDOUT output of the scripts.<br>

&gt; Quick note before the reports, I made a small mistake in my<br>

&gt; openingspost. The Coverage threshold I tried for these reports was 5,<br>

&gt; not 20.<br>

&gt;<br>

&gt; Here the report file of the first load-into-counting.py execution (on<br>

&gt; the raw sequence data), test.ct.report:<br>

&gt;<br>

&gt; || This is the script &#39;load-into-counting.py&#39; in khmer.<br>

&gt; || You are running khmer version 1.3<br>

&gt; || You are also using screed version 0.8<br>

&gt; ||<br>

&gt; || If you use this script in a publication, please cite EACH of the<br>

&gt; following:<br>

&gt; ||<br>

&gt; ||   * MR Crusoe et al., 2014. <a href="http://dx.doi.org/10.6084/m9.figshare.979190" target="_blank">http://dx.doi.org/10.6084/m9.figshare.979190</a><br>

&gt; ||   * Q Zhang et al., <a href="http://dx.doi.org/10.1371/journal.pone.0101271" target="_blank">http://dx.doi.org/10.1371/journal.pone.0101271</a><br>

&gt; ||   * A. D303266ring et al. <a href="http://dx.doi.org:80/10.1186/1471-2105-9-11" target="_blank">http://dx.doi.org:80/10.1186/1471-2105-9-11</a><br>

&gt; ||<br>

&gt; || Please see <a href="http://khmer.readthedocs.org/en/latest/citations.html" target="_blank">http://khmer.readthedocs.org/en/latest/citations.html</a> for<br>

&gt; details.<br>

&gt;<br>

&gt;<br>

&gt; PARAMETERS:<br>

&gt;  - kmer size =    20            (-k)<br>

&gt;  - n tables =     4             (-N)<br>

&gt;  - min tablesize = 1.6e+10      (-x)<br>

&gt;<br>

&gt; Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)<br>

&gt; --------<br>

&gt; Saving k-mer counting table to test.ct<br>

&gt; Loading kmers from sequences in [&#39;test.fastq.gz&#39;]<br>

&gt; making k-mer counting table<br>

&gt; consuming input test.fastq.gz<br>

&gt; Total number of unique k-mers: 3102943887<br>

&gt; saving test.ct<br>

&gt; fp rate estimated to be 0.008<br>

&gt; DONE.<br>

&gt; wrote to: <a href="http://test.ct.info" target="_blank">test.ct.info</a><br>

&gt;<br>

&gt; Here the report file of the normalize-by-median.py, test_k20_C5.report<br>

&gt;<br>

&gt; || This is the script &#39;normalize-by-median.py&#39; in khmer.<br>

&gt; || You are running khmer version 1.3<br>

&gt; || You are also using screed version 0.8<br>

&gt; ||<br>

&gt; || If you use this script in a publication, please cite EACH of the<br>

&gt; following:<br>

&gt; ||<br>

&gt; ||   * MR Crusoe et al., 2014. <a href="http://dx.doi.org/10.6084/m9.figshare.979190" target="_blank">http://dx.doi.org/10.6084/m9.figshare.979190</a><br>

&gt; ||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]<br>

&gt; ||<br>

&gt; || Please see <a href="http://khmer.readthedocs.org/en/latest/citations.html" target="_blank">http://khmer.readthedocs.org/en/latest/citations.html</a> for<br>

&gt; details.<br>

&gt;<br>

&gt;<br>

&gt; PARAMETERS:<br>

&gt;  - kmer size =    20            (-k)<br>

&gt;  - n tables =     4             (-N)<br>

&gt;  - min tablesize = 1.6e+10      (-x)<br>

&gt;<br>

&gt; Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)<br>

&gt; --------<br>

&gt; ... kept 58012 of 200000 or 29%<br>

&gt; ... in file test.fastq.gz<br>

&gt; ... kept 116210 of 400000 or 29%<br>

&gt; ... in file test.fastq.gz<br>

&gt;<br>

&gt; ..... etc etc etc .....<br>

&gt;<br>

&gt; ... kept 90482098 of 346200000 or 26%<br>

&gt; ... in file test.fastq.gz<br>

&gt; ... kept 90529526 of 346400000 or 26%<br>

&gt; ... in file test.fastq.gz<br>

&gt; Total number of unique k-mers: 850221<br>

&gt; loading k-mer counting table from test.ct<br>

&gt; DONE with test.fastq.gz; kept 90547512 of 346477608 or 26%<br>

&gt; output in test_k20_C5.fastq.gz.keep<br>

&gt; fp rate estimated to be 0.008<br>

&gt;<br>

&gt; And here the second load-into-counting.py report, test2.ct.report<br>

&gt;<br>

&gt; || This is the script &#39;load-into-counting.py&#39; in khmer.<br>

&gt; || You are running khmer version 1.3<br>

&gt; || You are also using screed version 0.8<br>

&gt; ||<br>

&gt; || If you use this script in a publication, please cite EACH of the<br>

&gt; following:<br>

&gt; ||<br>

&gt; ||   * MR Crusoe et al., 2014. <a href="http://dx.doi.org/10.6084/m9.figshare.979190" target="_blank">http://dx.doi.org/10.6084/m9.figshare.979190</a><br>

&gt; ||   * Q Zhang et al., <a href="http://dx.doi.org/10.1371/journal.pone.0101271" target="_blank">http://dx.doi.org/10.1371/journal.pone.0101271</a><br>

&gt; ||   * A. D303266ring et al. <a href="http://dx.doi.org:80/10.1186/1471-2105-9-11" target="_blank">http://dx.doi.org:80/10.1186/1471-2105-9-11</a><br>

&gt; ||<br>

&gt; || Please see <a href="http://khmer.readthedocs.org/en/latest/citations.html" target="_blank">http://khmer.readthedocs.org/en/latest/citations.html</a> for<br>

&gt; details.<br>

&gt;<br>

&gt;<br>

&gt; PARAMETERS:<br>

&gt;  - kmer size =    20            (-k)<br>

&gt;  - n tables =     4             (-N)<br>

&gt;  - min tablesize = 1.6e+10      (-x)<br>

&gt;<br>

&gt; Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)<br>

&gt; --------<br>

&gt; Saving k-mer counting table to test2.ct<br>

&gt; Loading kmers from sequences in [&#39;test_k20_C5.fastq.gz.keep&#39;]<br>

&gt; making k-mer counting table<br>

&gt; consuming input test_k20_C5.fastq.gz.keep<br>

&gt; Total number of unique k-mers: 2822473008<br>

&gt; saving test2.ct<br>

&gt;<br>

&gt; Hope this helps!<br>

&gt;<br>

&gt; Joran<br>

&gt;<br>

&gt; On 11/05/15 12:12, C. Titus Brown wrote:<br>

&gt;&gt; On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:<br>

&gt;&gt;&gt; Dear Khmer mailing list,<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; I&#39;m trying to compare the number of unique k-mers (lets say 20-mers) in<br>

&gt;&gt;&gt; the raw dataset and diginormed dataset, similar as was done in the<br>

&gt;&gt;&gt; original diginorm paper.<br>

&gt;&gt; [ elided ]<br>

&gt;&gt;<br>

&gt;&gt; Hi Joran,<br>

&gt;&gt;<br>

&gt;&gt; that certainly doesn&#39;t sound good :). Would it be possible to convey the<br>

&gt;&gt; various report files to us, publicly or privately?<br>

&gt;&gt;<br>

&gt;&gt; thanks,<br>

&gt;&gt; --titus<br>

&gt;&gt;<br>

&gt;&gt; p.s. Thank you for the very detailed question!<br>

&gt;<br>

<br>

--<br>

C. Titus Brown, <a href="mailto:ctbrown@ucdavis.edu">ctbrown@ucdavis.edu</a><br>

<br>

<br>

<br>

------------------------------<br>

<br>

_______________________________________________<br>

khmer mailing list<br>

<a href="mailto:khmer@lists.idyll.org">khmer@lists.idyll.org</a><br>

<a href="http://lists.idyll.org/listinfo/khmer" target="_blank">http://lists.idyll.org/listinfo/khmer</a><br>

<br>

<br>

End of khmer Digest, Vol 25, Issue 2<br>

************************************<br>

</blockquote></div><br></div>