<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Dear Khmer mailing list,<br>
<br>
I'm trying to compare the number of unique k-mers (let's say 20-mers)
in the raw dataset and the diginormed dataset, similar to what was done
in the original diginorm paper.<br>
<br>
I have done this as follows. First I create the counting table:<br>
<font color="#990000"><br>
load-into-counting.py \<br>
-t \<br>
-k 20 \<br>
-N 4 -x 16e9 \<br>
--threads 30 \<br>
test.ct \<br>
test.fastq.gz \<br>
&> test.ct.report</font><br>
<br>
The script then reports ~3.1 billion unique 20-mers.<br>
<br>
Then I perform the diginorm:<br>
<br>
<font color="#990000">normalize-by-median.py \<br>
-p -t \<br>
-C 20 \<br>
--loadtable test.ct \<br>
-o test_k20_C20.fastq.gz.keep \<br>
test.fastq.gz \<br>
&> test_k20_C20.report</font><br>
<br>
The script then reports approximately 1 million unique 20-mers (if
true, this would mean roughly a 3,000-fold reduction in unique k-mers,
which sounds like too much to me).<br>
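Just to spell out the arithmetic behind that fold-reduction figure (using the rounded counts from the two reports above):

```python
# Fold reduction implied by the two reported unique-20-mer counts
# (numbers rounded from the script reports above).
raw_unique = 3.1e9    # reported by load-into-counting.py on the raw reads
norm_unique = 1e6     # reported by normalize-by-median.py (approximate)

fold = raw_unique / norm_unique
print(f"~{fold:.0f}-fold reduction")  # ~3100-fold reduction
```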
<br>
Then, I count the number of unique 20-mers again in the normalized
dataset using load-into-counting.py:<br>
<font color="#990000"><br>
load-into-counting.py \<br>
-t \<br>
-k 20 \<br>
-N 4 -x 16e9 \<br>
--threads 30 \<br>
test2.ct \<br>
test_k20_C20.fastq.gz.keep \<br>
&> test2.ct.report</font><br>
<br>
This time, however, the script reports approximately 2.8 billion
unique 20-mers. I am confused, since I was expecting around 1
million, as normalize-by-median reported.<br>
Are the two scripts reporting different kinds of unique k-mers?
Which counts should I compare to get an estimate of the effect of
diginorm on the dataset? Or is there an easier, more straightforward
way to go about this?<br>
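For what it's worth, on a small subsample the unique-k-mer count can be cross-checked exactly with a throwaway script. The sketch below counts distinct canonical 20-mers (the lexicographically smaller of each k-mer and its reverse complement, which is how I understand khmer counts them); a plain Python set obviously won't scale to billions of k-mers, which is exactly why khmer uses probabilistic counting tables:

```python
# Exact unique-k-mer count for a small read set (sketch only; a plain
# Python set will not scale to billions of k-mers).
K = 20
_COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMP)[::-1]
    return min(kmer, rc)

def unique_kmers(reads, k=K):
    """Count distinct canonical k-mers, skipping windows with non-ACGT characters."""
    seen = set()
    for seq in reads:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            if set(window) <= set("ACGT"):
                seen.add(canonical(window))
    return len(seen)

# Toy example: a 24 bp read of repeated "ACGT" yields 3 distinct canonical 20-mers.
print(unique_kmers(["ACGT" * 6]))  # 3
```

Running this on the raw subsample and on its diginormed output gives two counts that are guaranteed to be computed the same way, which sidesteps the question of what exactly each khmer script is reporting.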
<br>
Kind regards,<br>
<br>
Joran Martijn<br>
</body>
</html>