<div dir="ltr">I think this could be the bug of the normalize-by-median script. There script with (-p) keep the both paired ends but only count the kmers of one of them. This is why abound-dist.py after normalize-by-median show many kmers with count 0.<div>I think @camillescott was working on fixing that.<br><div><br></div><div>Tamer</div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, May 11, 2015 at 6:13 AM, <span dir="ltr"><<a href="mailto:khmer-request@lists.idyll.org" target="_blank">khmer-request@lists.idyll.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Send khmer mailing list submissions to<br>
<a href="mailto:khmer@lists.idyll.org">khmer@lists.idyll.org</a><br>
<br>
To subscribe or unsubscribe via the World Wide Web, visit<br>
<a href="http://lists.idyll.org/listinfo/khmer" target="_blank">http://lists.idyll.org/listinfo/khmer</a><br>
or, via email, send a message with subject or body 'help' to<br>
<a href="mailto:khmer-request@lists.idyll.org">khmer-request@lists.idyll.org</a><br>
<br>
You can reach the person managing the list at<br>
<a href="mailto:khmer-owner@lists.idyll.org">khmer-owner@lists.idyll.org</a><br>
<br>
When replying, please edit your Subject line so it is more specific<br>
than "Re: Contents of khmer digest..."<br>
<br>
<br>
Today's Topics:<br>
<br>
1. inconsistent unique k-mer counting (Joran Martijn)<br>
2. Re: inconsistent unique k-mer counting (C. Titus Brown)<br>
3. Re: inconsistent unique k-mer counting (Joran Martijn)<br>
4. Re: inconsistent unique k-mer counting (C. Titus Brown)<br>
<br>
<br>
----------------------------------------------------------------------<br>
<br>
Message: 1<br>
Date: Mon, 11 May 2015 11:29:31 +0200<br>
From: Joran Martijn <<a href="mailto:joran.martijn@icm.uu.se">joran.martijn@icm.uu.se</a>><br>
Subject: [khmer] inconsistent unique k-mer counting<br>
To: <<a href="mailto:khmer@lists.idyll.org">khmer@lists.idyll.org</a>><br>
Message-ID: <<a href="mailto:5550767B.4070501@icm.uu.se">5550767B.4070501@icm.uu.se</a>><br>
Content-Type: text/plain; charset="utf-8"; Format="flowed"<br>
<br>
Dear Khmer mailing list,<br>
<br>
I'm trying to compare the number of unique k-mers (lets say 20-mers) in<br>
the raw dataset and diginormed dataset, similar as was done in the<br>
original diginorm paper.<br>
<br>
I have done this as follows. First I create the counting table:<br>
<br>
load-into-counting.py \<br>
-t \<br>
-k 20 \<br>
-N 4 -x 16e9 \<br>
--threads 30 \<br>
test.ct \<br>
test.fastq.gz \<br>
&> test.ct.report<br>
<br>
The script then reports ~3,1 billion unique 20-mers.<br>
<br>
Then I perform the diginorm:<br>
<br>
normalize-by-median.py \<br>
-p -t \<br>
-C 20 \<br>
--loadtable test.ct \<br>
-o test_k20_C20.fastq.gz.keep \<br>
test.fastq.gz \<br>
&> test_k20_C20.report<br>
<br>
The script then reports approximately 1 million unique 20-mers (if true,<br>
this would mean a 3000-fold reduction in unique kmers, that sounds like<br>
too much to me).<br>
<br>
Then, I count the number of unique 20-mers again in the normalized<br>
dataset using load-into-counting.py:<br>
<br>
load-into-counting.py \<br>
-t \<br>
-k 20 \<br>
-N 4 -x 16e9 \<br>
--threads 30 \<br>
test2.ct \<br>
test_k20_C20.fastq.gz.keep \<br>
&> test2.ct.report<br>
<br>
This time however, the script reports approximately 2,8 billion unique<br>
20-mers. I am confused, since I was expecting around 1 million, like<br>
normalize-by-median reported.<br>
Are the two scripts reporting a different kind of unique k-mers? Which<br>
counts should I compare to get an estimate of the effect of diginorm on<br>
the dataset? Or is there a more easy and straightforward way to go about<br>
this?<br>
<br>
Kind regards,<br>
<br>
Joran Martijn<br>
-------------- next part --------------<br>
An HTML attachment was scrubbed...<br>
URL: <<a href="http://lists.idyll.org/pipermail/khmer/attachments/20150511/9363769d/attachment-0001.html" target="_blank">http://lists.idyll.org/pipermail/khmer/attachments/20150511/9363769d/attachment-0001.html</a>><br>
<br>
------------------------------<br>
<br>
Message: 2<br>
Date: Mon, 11 May 2015 03:12:52 -0700<br>
From: "C. Titus Brown" <<a href="mailto:ctbrown@ucdavis.edu">ctbrown@ucdavis.edu</a>><br>
Subject: Re: [khmer] inconsistent unique k-mer counting<br>
To: Joran Martijn <<a href="mailto:joran.martijn@icm.uu.se">joran.martijn@icm.uu.se</a>><br>
Cc: <a href="mailto:khmer@lists.idyll.org">khmer@lists.idyll.org</a><br>
Message-ID: <<a href="mailto:20150511101252.GA3199@idyll.org">20150511101252.GA3199@idyll.org</a>><br>
Content-Type: text/plain; charset=us-ascii<br>
<br>
On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:<br>
> Dear Khmer mailing list,<br>
><br>
> I'm trying to compare the number of unique k-mers (lets say 20-mers) in<br>
> the raw dataset and diginormed dataset, similar as was done in the<br>
> original diginorm paper.<br>
<br>
[ elided ]<br>
<br>
Hi Joran,<br>
<br>
that certainly doesn't sound good :). Would it be possible to convey the<br>
various report files to us, publicly or privately?<br>
<br>
thanks,<br>
--titus<br>
<br>
p.s. Thank you for the very detailed question!<br>
<br>
<br>
<br>
------------------------------<br>
<br>
Message: 3<br>
Date: Mon, 11 May 2015 13:06:24 +0200<br>
From: Joran Martijn <<a href="mailto:joran.martijn@icm.uu.se">joran.martijn@icm.uu.se</a>><br>
Subject: Re: [khmer] inconsistent unique k-mer counting<br>
To: <<a href="mailto:titus@idyll.org">titus@idyll.org</a>><br>
Cc: <a href="mailto:khmer@lists.idyll.org">khmer@lists.idyll.org</a><br>
Message-ID: <<a href="mailto:55508D30.1060604@icm.uu.se">55508D30.1060604@icm.uu.se</a>><br>
Content-Type: text/plain; charset="windows-1252"; Format="flowed"<br>
<br>
Hej Titus,<br>
<br>
Thanks for the quick reply! Here are the report files, which are<br>
basically the STDERR and STDOUT output of the scripts.<br>
Quick note before the reports, I made a small mistake in my<br>
openingspost. The Coverage threshold I tried for these reports was 5,<br>
not 20.<br>
<br>
Here the report file of the first load-into-counting.py execution (on<br>
the raw sequence data), test.ct.report:<br>
<br>
|| This is the script 'load-into-counting.py' in khmer.<br>
|| You are running khmer version 1.3<br>
|| You are also using screed version 0.8<br>
||<br>
|| If you use this script in a publication, please cite EACH of the<br>
following:<br>
||<br>
|| * MR Crusoe et al., 2014. <a href="http://dx.doi.org/10.6084/m9.figshare.979190" target="_blank">http://dx.doi.org/10.6084/m9.figshare.979190</a><br>
|| * Q Zhang et al., <a href="http://dx.doi.org/10.1371/journal.pone.0101271" target="_blank">http://dx.doi.org/10.1371/journal.pone.0101271</a><br>
|| * A. D303266ring et al. <a href="http://dx.doi.org:80/10.1186/1471-2105-9-11" target="_blank">http://dx.doi.org:80/10.1186/1471-2105-9-11</a><br>
||<br>
|| Please see <a href="http://khmer.readthedocs.org/en/latest/citations.html" target="_blank">http://khmer.readthedocs.org/en/latest/citations.html</a> for<br>
details.<br>
<br>
<br>
PARAMETERS:<br>
- kmer size = 20 (-k)<br>
- n tables = 4 (-N)<br>
- min tablesize = 1.6e+10 (-x)<br>
<br>
Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)<br>
--------<br>
Saving k-mer counting table to test.ct<br>
Loading kmers from sequences in ['test.fastq.gz']<br>
making k-mer counting table<br>
consuming input test.fastq.gz<br>
Total number of unique k-mers: 3102943887<br>
saving test.ct<br>
fp rate estimated to be 0.008<br>
DONE.<br>
wrote to: <a href="http://test.ct.info" target="_blank">test.ct.info</a><br>
<br>
Here the report file of the normalize-by-median.py, test_k20_C5.report<br>
<br>
|| This is the script 'normalize-by-median.py' in khmer.<br>
|| You are running khmer version 1.3<br>
|| You are also using screed version 0.8<br>
||<br>
|| If you use this script in a publication, please cite EACH of the<br>
following:<br>
||<br>
|| * MR Crusoe et al., 2014. <a href="http://dx.doi.org/10.6084/m9.figshare.979190" target="_blank">http://dx.doi.org/10.6084/m9.figshare.979190</a><br>
|| * CT Brown et al., arXiv:1203.4802 [q-bio.GN]<br>
||<br>
|| Please see <a href="http://khmer.readthedocs.org/en/latest/citations.html" target="_blank">http://khmer.readthedocs.org/en/latest/citations.html</a> for<br>
details.<br>
<br>
<br>
PARAMETERS:<br>
- kmer size = 20 (-k)<br>
- n tables = 4 (-N)<br>
- min tablesize = 1.6e+10 (-x)<br>
<br>
Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)<br>
--------<br>
... kept 58012 of 200000 or 29%<br>
... in file test.fastq.gz<br>
... kept 116210 of 400000 or 29%<br>
... in file test.fastq.gz<br>
<br>
..... etc etc etc .....<br>
<br>
... kept 90482098 of 346200000 or 26%<br>
... in file test.fastq.gz<br>
... kept 90529526 of 346400000 or 26%<br>
... in file test.fastq.gz<br>
Total number of unique k-mers: 850221<br>
loading k-mer counting table from test.ct<br>
DONE with test.fastq.gz; kept 90547512 of 346477608 or 26%<br>
output in test_k20_C5.fastq.gz.keep<br>
fp rate estimated to be 0.008<br>
<br>
And here the second load-into-counting.py report, test2.ct.report<br>
<br>
|| This is the script 'load-into-counting.py' in khmer.<br>
|| You are running khmer version 1.3<br>
|| You are also using screed version 0.8<br>
||<br>
|| If you use this script in a publication, please cite EACH of the<br>
following:<br>
||<br>
|| * MR Crusoe et al., 2014. <a href="http://dx.doi.org/10.6084/m9.figshare.979190" target="_blank">http://dx.doi.org/10.6084/m9.figshare.979190</a><br>
|| * Q Zhang et al., <a href="http://dx.doi.org/10.1371/journal.pone.0101271" target="_blank">http://dx.doi.org/10.1371/journal.pone.0101271</a><br>
|| * A. D303266ring et al. <a href="http://dx.doi.org:80/10.1186/1471-2105-9-11" target="_blank">http://dx.doi.org:80/10.1186/1471-2105-9-11</a><br>
||<br>
|| Please see <a href="http://khmer.readthedocs.org/en/latest/citations.html" target="_blank">http://khmer.readthedocs.org/en/latest/citations.html</a> for<br>
details.<br>
<br>
<br>
PARAMETERS:<br>
- kmer size = 20 (-k)<br>
- n tables = 4 (-N)<br>
- min tablesize = 1.6e+10 (-x)<br>
<br>
Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)<br>
--------<br>
Saving k-mer counting table to test2.ct<br>
Loading kmers from sequences in ['test_k20_C5.fastq.gz.keep']<br>
making k-mer counting table<br>
consuming input test_k20_C5.fastq.gz.keep<br>
Total number of unique k-mers: 2822473008<br>
saving test2.ct<br>
<br>
Hope this helps!<br>
<br>
Joran<br>
<br>
On 11/05/15 12:12, C. Titus Brown wrote:<br>
> On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:<br>
>> Dear Khmer mailing list,<br>
>><br>
>> I'm trying to compare the number of unique k-mers (lets say 20-mers) in<br>
>> the raw dataset and diginormed dataset, similar as was done in the<br>
>> original diginorm paper.<br>
> [ elided ]<br>
><br>
> Hi Joran,<br>
><br>
> that certainly doesn't sound good :). Would it be possible to convey the<br>
> various report files to us, publicly or privately?<br>
><br>
> thanks,<br>
> --titus<br>
><br>
> p.s. Thank you for the very detailed question!<br>
<br>
-------------- next part --------------<br>
An HTML attachment was scrubbed...<br>
URL: <<a href="http://lists.idyll.org/pipermail/khmer/attachments/20150511/96dd402e/attachment-0001.htm" target="_blank">http://lists.idyll.org/pipermail/khmer/attachments/20150511/96dd402e/attachment-0001.htm</a>><br>
<br>
------------------------------<br>
<br>
Message: 4<br>
Date: Mon, 11 May 2015 04:13:26 -0700<br>
From: "C. Titus Brown" <<a href="mailto:ctbrown@ucdavis.edu">ctbrown@ucdavis.edu</a>><br>
Subject: Re: [khmer] inconsistent unique k-mer counting<br>
To: Joran Martijn <<a href="mailto:joran.martijn@icm.uu.se">joran.martijn@icm.uu.se</a>><br>
Cc: <a href="mailto:khmer@lists.idyll.org">khmer@lists.idyll.org</a><br>
Message-ID: <<a href="mailto:20150511111326.GG3199@idyll.org">20150511111326.GG3199@idyll.org</a>><br>
Content-Type: text/plain; charset=us-ascii<br>
<br>
OK, that's very weird - this must be a bug, but I'll be darned if I can<br>
figure out what might be causing it. The numbers in load-into-counting<br>
should be correct but I'll have to independently confirm that.<br>
<br>
BTW, for the first round of diginorm I'd use C=20; see 3-pass diginorm<br>
in the dn paper.<br>
<br>
best,<br>
--titus<br>
<br>
On Mon, May 11, 2015 at 01:06:24PM +0200, Joran Martijn wrote:<br>
> Hej Titus,<br>
><br>
> Thanks for the quick reply! Here are the report files, which are<br>
> basically the STDERR and STDOUT output of the scripts.<br>
> Quick note before the reports, I made a small mistake in my<br>
> openingspost. The Coverage threshold I tried for these reports was 5,<br>
> not 20.<br>
><br>
> Here the report file of the first load-into-counting.py execution (on<br>
> the raw sequence data), test.ct.report:<br>
><br>
> || This is the script 'load-into-counting.py' in khmer.<br>
> || You are running khmer version 1.3<br>
> || You are also using screed version 0.8<br>
> ||<br>
> || If you use this script in a publication, please cite EACH of the<br>
> following:<br>
> ||<br>
> || * MR Crusoe et al., 2014. <a href="http://dx.doi.org/10.6084/m9.figshare.979190" target="_blank">http://dx.doi.org/10.6084/m9.figshare.979190</a><br>
> || * Q Zhang et al., <a href="http://dx.doi.org/10.1371/journal.pone.0101271" target="_blank">http://dx.doi.org/10.1371/journal.pone.0101271</a><br>
> || * A. D303266ring et al. <a href="http://dx.doi.org:80/10.1186/1471-2105-9-11" target="_blank">http://dx.doi.org:80/10.1186/1471-2105-9-11</a><br>
> ||<br>
> || Please see <a href="http://khmer.readthedocs.org/en/latest/citations.html" target="_blank">http://khmer.readthedocs.org/en/latest/citations.html</a> for<br>
> details.<br>
><br>
><br>
> PARAMETERS:<br>
> - kmer size = 20 (-k)<br>
> - n tables = 4 (-N)<br>
> - min tablesize = 1.6e+10 (-x)<br>
><br>
> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)<br>
> --------<br>
> Saving k-mer counting table to test.ct<br>
> Loading kmers from sequences in ['test.fastq.gz']<br>
> making k-mer counting table<br>
> consuming input test.fastq.gz<br>
> Total number of unique k-mers: 3102943887<br>
> saving test.ct<br>
> fp rate estimated to be 0.008<br>
> DONE.<br>
> wrote to: <a href="http://test.ct.info" target="_blank">test.ct.info</a><br>
><br>
> Here the report file of the normalize-by-median.py, test_k20_C5.report<br>
><br>
> || This is the script 'normalize-by-median.py' in khmer.<br>
> || You are running khmer version 1.3<br>
> || You are also using screed version 0.8<br>
> ||<br>
> || If you use this script in a publication, please cite EACH of the<br>
> following:<br>
> ||<br>
> || * MR Crusoe et al., 2014. <a href="http://dx.doi.org/10.6084/m9.figshare.979190" target="_blank">http://dx.doi.org/10.6084/m9.figshare.979190</a><br>
> || * CT Brown et al., arXiv:1203.4802 [q-bio.GN]<br>
> ||<br>
> || Please see <a href="http://khmer.readthedocs.org/en/latest/citations.html" target="_blank">http://khmer.readthedocs.org/en/latest/citations.html</a> for<br>
> details.<br>
><br>
><br>
> PARAMETERS:<br>
> - kmer size = 20 (-k)<br>
> - n tables = 4 (-N)<br>
> - min tablesize = 1.6e+10 (-x)<br>
><br>
> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)<br>
> --------<br>
> ... kept 58012 of 200000 or 29%<br>
> ... in file test.fastq.gz<br>
> ... kept 116210 of 400000 or 29%<br>
> ... in file test.fastq.gz<br>
><br>
> ..... etc etc etc .....<br>
><br>
> ... kept 90482098 of 346200000 or 26%<br>
> ... in file test.fastq.gz<br>
> ... kept 90529526 of 346400000 or 26%<br>
> ... in file test.fastq.gz<br>
> Total number of unique k-mers: 850221<br>
> loading k-mer counting table from test.ct<br>
> DONE with test.fastq.gz; kept 90547512 of 346477608 or 26%<br>
> output in test_k20_C5.fastq.gz.keep<br>
> fp rate estimated to be 0.008<br>
><br>
> And here the second load-into-counting.py report, test2.ct.report<br>
><br>
> || This is the script 'load-into-counting.py' in khmer.<br>
> || You are running khmer version 1.3<br>
> || You are also using screed version 0.8<br>
> ||<br>
> || If you use this script in a publication, please cite EACH of the<br>
> following:<br>
> ||<br>
> || * MR Crusoe et al., 2014. <a href="http://dx.doi.org/10.6084/m9.figshare.979190" target="_blank">http://dx.doi.org/10.6084/m9.figshare.979190</a><br>
> || * Q Zhang et al., <a href="http://dx.doi.org/10.1371/journal.pone.0101271" target="_blank">http://dx.doi.org/10.1371/journal.pone.0101271</a><br>
> || * A. D303266ring et al. <a href="http://dx.doi.org:80/10.1186/1471-2105-9-11" target="_blank">http://dx.doi.org:80/10.1186/1471-2105-9-11</a><br>
> ||<br>
> || Please see <a href="http://khmer.readthedocs.org/en/latest/citations.html" target="_blank">http://khmer.readthedocs.org/en/latest/citations.html</a> for<br>
> details.<br>
><br>
><br>
> PARAMETERS:<br>
> - kmer size = 20 (-k)<br>
> - n tables = 4 (-N)<br>
> - min tablesize = 1.6e+10 (-x)<br>
><br>
> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)<br>
> --------<br>
> Saving k-mer counting table to test2.ct<br>
> Loading kmers from sequences in ['test_k20_C5.fastq.gz.keep']<br>
> making k-mer counting table<br>
> consuming input test_k20_C5.fastq.gz.keep<br>
> Total number of unique k-mers: 2822473008<br>
> saving test2.ct<br>
><br>
> Hope this helps!<br>
><br>
> Joran<br>
><br>
> On 11/05/15 12:12, C. Titus Brown wrote:<br>
>> On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:<br>
>>> Dear Khmer mailing list,<br>
>>><br>
>>> I'm trying to compare the number of unique k-mers (lets say 20-mers) in<br>
>>> the raw dataset and diginormed dataset, similar as was done in the<br>
>>> original diginorm paper.<br>
>> [ elided ]<br>
>><br>
>> Hi Joran,<br>
>><br>
>> that certainly doesn't sound good :). Would it be possible to convey the<br>
>> various report files to us, publicly or privately?<br>
>><br>
>> thanks,<br>
>> --titus<br>
>><br>
>> p.s. Thank you for the very detailed question!<br>
><br>
<br>
--<br>
C. Titus Brown, <a href="mailto:ctbrown@ucdavis.edu">ctbrown@ucdavis.edu</a><br>
<br>
<br>
<br>
------------------------------<br>
<br>
_______________________________________________<br>
khmer mailing list<br>
<a href="mailto:khmer@lists.idyll.org">khmer@lists.idyll.org</a><br>
<a href="http://lists.idyll.org/listinfo/khmer" target="_blank">http://lists.idyll.org/listinfo/khmer</a><br>
<br>
<br>
End of khmer Digest, Vol 25, Issue 2<br>
************************************<br>
</blockquote></div><br></div>