<html>

  <head>

    <meta content="text/html; charset=windows-1252"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    So, I did the three-pass protocol (same parameters as kalamazoo,

    with -N 4 -x 32e9) on my metagenome (trimmomatic qc processed) and

    counted the k-mers at each step.<br>

    I've included the bash script for running the protocol in the

    attachment.<br>
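    <br>
    For anyone reading along, this is roughly what I mean by counting unique k-mers. It is only a minimal sketch in plain Python, not what khmer does internally (khmer works with probabilistic counting tables). The function names are made up for illustration. Treating a k-mer and its reverse complement as the same k-mer, and skipping k-mers with non-ACGT bases, are my own assumptions. It keeps every k-mer in memory, so it is only practical for a small subsample of reads.<br>
    <pre wrap="">
```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case ACGT string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def count_unique_kmers(reads, k=20):
    """Exact count of distinct canonical k-mers in an iterable of reads.

    Assumption: a k-mer and its reverse complement count as one k-mer.
    Assumption: k-mers containing N or other non-ACGT bases are skipped.
    """
    seen = set()
    for read in reads:
        read = read.upper()
        # range() is empty when the read is shorter than k
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) - set("ACGT"):
                continue
            # store the lexicographically smaller of the two strands
            seen.add(min(kmer, reverse_complement(kmer)))
    return len(seen)


if __name__ == "__main__":
    # toy input, not real data
    print(count_unique_kmers(["ACGTACGT", "ACGTN"], k=4))
```
</pre>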

    <br>

    Quick summary: <br>

    I started out with approximately <font color="#990000">5.7 billion

      20-mers in 346 million reads</font> <font color="#990000">in the

      paired dataset </font>and <font color="#006600">869 million

      20-mers in 23 million reads for the unpaired dataset</font>.<br>

    After the 3rd pass I have <font color="#990000">5.3

      billion 20-mers left in 163 million reads for paired data</font>  

    and <font color="#006600">549 million 20-mers left in 9.6 million

      reads for unpaired data</font>. <br>
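    <br>
    A quick back-of-the-envelope calculation of what fraction survived, using only the rounded numbers from this summary (the helper name is just for illustration):<br>
    <pre wrap="">
```python
def percent_retained(before, after):
    """Percentage of the original count that is still present."""
    return 100.0 * after / before


# rounded figures from the summary in this mail
paired_reads = percent_retained(346e6, 163e6)
paired_kmers = percent_retained(5.7e9, 5.3e9)
unpaired_reads = percent_retained(23e6, 9.6e6)
unpaired_kmers = percent_retained(869e6, 549e6)

if __name__ == "__main__":
    print("paired:   %.1f%% of reads, %.1f%% of 20-mers" % (paired_reads, paired_kmers))
    print("unpaired: %.1f%% of reads, %.1f%% of 20-mers" % (unpaired_reads, unpaired_kmers))
```
</pre>
    With these rounded inputs, about 47% of the paired reads but about 93% of the paired 20-mers remain; for the unpaired data it is about 42% of the reads and about 63% of the 20-mers.<br>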

    <br>

    To me, it seems that, although many reads were discarded, the

    normalization did not have such an amazing effect on the 20-mer

    count as I expected it would have, based on Table 1 of the diginorm

    paper.<br>

    To be fair, no metagenomes were tested there, so I'm not sure

    whether this behaviour is "normal" or not.<br>

    <br>

    What are your thoughts on this? <br>

    <br>

    Cheers,<br>

    <br>

    Joran<br>

    <br>

    <div class="moz-cite-prefix">On 14/05/15 13:24, C. Titus Brown

      wrote:<br>

    </div>

    <blockquote cite="mid:20150514112419.GA15923@idyll.org" type="cite">

      <pre wrap="">Hi Joran,

absolutely.  Let us know how it goes for you!

cheers,

--titus

On Thu, May 14, 2015 at 01:14:34PM +0200, Joran Martijn wrote:

</pre>

      <blockquote type="cite">

        <pre wrap="">I just saw the release of Khmer 1.4, and it includes in the sandbox the  

"unique-kmer.py" script.

Do you think I can use this script for my purpose (comparing the number  

of unique k-mers for a certain k before and after the different steps of  

the 3-pass normalization)?

Cheers,

Joran

On 12/05/15 12:46, C. Titus Brown wrote:

</pre>

        <blockquote type="cite">

          <pre wrap="">Kalamazoo uses the three-pass :).  We have pretty good evidence that

it works ok for metagenomes - it's not what we used in Howe et al.,

for two reasons (we didn't have the variable-coverage error trimming yet,

and the data set was very low coverage) but we've been using it since.

best,

--titus

On Mon, May 11, 2015 at 01:52:06PM +0200, Joran Martijn wrote:

</pre>

          <blockquote type="cite">

            <pre wrap="">Thanks Titus, let me know when you figure something out!

I was playing around with several different Coverage thresholds.

I won't use the 3-pass as I understood this does not work well for

metagenomes.

I was thinking of following the kalamazoo pipeline.

Joran

On 11/05/15 13:13, C. Titus Brown wrote:

</pre>

            <blockquote type="cite">

              <pre wrap="">OK, that's very weird - this must be a bug, but I'll be darned if I can

figure out what might be causing it.  The numbers in load-into-counting

should be correct but I'll have to independently confirm that.

BTW, for the first round of diginorm I'd use C=20; see 3-pass diginorm

in the dn paper.

best,

--titus

On Mon, May 11, 2015 at 01:06:24PM +0200, Joran Martijn wrote:

</pre>

              <blockquote type="cite">

                <pre wrap="">Hej Titus,

Thanks for the quick reply! Here are the report files, which are

basically the STDERR and STDOUT output of the scripts.

Quick note before the reports, I made a small mistake in my

opening post. The coverage threshold I tried for these reports was 5,

not 20.

Here is the report file of the first load-into-counting.py execution (on

the raw sequence data), test.ct.report:

|| This is the script 'load-into-counting.py' in khmer.

|| You are running khmer version 1.3

|| You are also using screed version 0.8

||

|| If you use this script in a publication, please cite EACH of the

following:

||

||   * MR Crusoe et al., 2014. <a class="moz-txt-link-freetext" href="http://dx.doi.org/10.6084/m9.figshare.979190">http://dx.doi.org/10.6084/m9.figshare.979190</a>

||   * Q Zhang et al., <a class="moz-txt-link-freetext" href="http://dx.doi.org/10.1371/journal.pone.0101271">http://dx.doi.org/10.1371/journal.pone.0101271</a>

||   * A. D&ouml;ring et al. <a class="moz-txt-link-freetext" href="http://dx.doi.org:80/10.1186/1471-2105-9-11">http://dx.doi.org:80/10.1186/1471-2105-9-11</a>

||

|| Please see <a class="moz-txt-link-freetext" href="http://khmer.readthedocs.org/en/latest/citations.html">http://khmer.readthedocs.org/en/latest/citations.html</a> for

details.

PARAMETERS:

   - kmer size =    20            (-k)

   - n tables =     4             (-N)

   - min tablesize = 1.6e+10      (-x)

Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)

--------

Saving k-mer counting table to test.ct

Loading kmers from sequences in ['test.fastq.gz']

making k-mer counting table

consuming input test.fastq.gz

Total number of unique k-mers: 3102943887

saving test.ct

fp rate estimated to be 0.008

DONE.

wrote to: test.ct.info

Here is the report file of normalize-by-median.py, test_k20_C5.report:

|| This is the script 'normalize-by-median.py' in khmer.

|| You are running khmer version 1.3

|| You are also using screed version 0.8

||

|| If you use this script in a publication, please cite EACH of the

following:

||

||   * MR Crusoe et al., 2014. <a class="moz-txt-link-freetext" href="http://dx.doi.org/10.6084/m9.figshare.979190">http://dx.doi.org/10.6084/m9.figshare.979190</a>

||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]

||

|| Please see <a class="moz-txt-link-freetext" href="http://khmer.readthedocs.org/en/latest/citations.html">http://khmer.readthedocs.org/en/latest/citations.html</a> for

details.

PARAMETERS:

   - kmer size =    20            (-k)

   - n tables =     4             (-N)

   - min tablesize = 1.6e+10      (-x)

Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)

--------

... kept 58012 of 200000 or 29%

... in file test.fastq.gz

... kept 116210 of 400000 or 29%

... in file test.fastq.gz

..... etc etc etc .....

... kept 90482098 of 346200000 or 26%

... in file test.fastq.gz

... kept 90529526 of 346400000 or 26%

... in file test.fastq.gz

Total number of unique k-mers: 850221

loading k-mer counting table from test.ct

DONE with test.fastq.gz; kept 90547512 of 346477608 or 26%

output in test_k20_C5.fastq.gz.keep

fp rate estimated to be 0.008

And here is the second load-into-counting.py report, test2.ct.report:

|| This is the script 'load-into-counting.py' in khmer.

|| You are running khmer version 1.3

|| You are also using screed version 0.8

||

|| If you use this script in a publication, please cite EACH of the

following:

||

||   * MR Crusoe et al., 2014. <a class="moz-txt-link-freetext" href="http://dx.doi.org/10.6084/m9.figshare.979190">http://dx.doi.org/10.6084/m9.figshare.979190</a>

||   * Q Zhang et al., <a class="moz-txt-link-freetext" href="http://dx.doi.org/10.1371/journal.pone.0101271">http://dx.doi.org/10.1371/journal.pone.0101271</a>

||   * A. D&ouml;ring et al. <a class="moz-txt-link-freetext" href="http://dx.doi.org:80/10.1186/1471-2105-9-11">http://dx.doi.org:80/10.1186/1471-2105-9-11</a>

||

|| Please see <a class="moz-txt-link-freetext" href="http://khmer.readthedocs.org/en/latest/citations.html">http://khmer.readthedocs.org/en/latest/citations.html</a> for

details.

PARAMETERS:

   - kmer size =    20            (-k)

   - n tables =     4             (-N)

   - min tablesize = 1.6e+10      (-x)

Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)

--------

Saving k-mer counting table to test2.ct

Loading kmers from sequences in ['test_k20_C5.fastq.gz.keep']

making k-mer counting table

consuming input test_k20_C5.fastq.gz.keep

Total number of unique k-mers: 2822473008

saving test2.ct

Hope this helps!

Joran

On 11/05/15 12:12, C. Titus Brown wrote:

</pre>

                <blockquote type="cite">

                  <pre wrap="">On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:

</pre>

                  <blockquote type="cite">

                    <pre wrap="">Dear Khmer mailing list,

I'm trying to compare the number of unique k-mers (let's say 20-mers) in

the raw dataset and diginormed dataset, similar to what was done in the

original diginorm paper.

</pre>

                  </blockquote>

                  <pre wrap="">[ elided ]

Hi Joran,

that certainly doesn't sound good :). Would it be possible to convey the

various report files to us, publicly or privately?

thanks,

--titus

p.s. Thank you for the very detailed question!

</pre>

                </blockquote>

              </blockquote>

            </blockquote>

          </blockquote>

        </blockquote>


      </blockquote>


    </blockquote>

    <br>

  </body>

</html>