[khmer] inconsistent unique k-mer counting

Joran Martijn joran.martijn at icm.uu.se
Mon May 18 07:03:41 PDT 2015


Hej Titus,

Ok that's a relief :) I was afraid some kind of mistake or bug had happened.
I'm not sure whether the error rate was high or low; after Trimmomatic 
the data looked relatively clean in the FastQC reports.
The data comes from a 2x100 bp PE HiSeq run, so it's not that new.

I don't know how much detail I can go into about the samples, but let's 
say they are aquatic :)

By the way, as you can see in the script, I tried to use the new "-R" feature 
of the normalize-by-median script, but nothing was written to the .report files.
Perhaps a bug?
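
For reference, this is essentially the call (the same as step 1 of the
attached script):

    normalize-by-median.py -p -t -q -k 20 -C 20 -N 4 -x 32e9 \
        -s step1.ct -R step1.report -u unpaired.fq paired.fq

The normalization itself runs fine, but nothing ends up in step1.report
(and the same happens with step3.report in step 3).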

Cheers,

Joran

On 18/05/15 12:20, C. Titus Brown wrote:
> Hi Joran,
>
> The script looks good to me and the results seem reasonable.  Sounds to me like
> you have a very low error rate, and a very diverse metagenome with a few
> high-abundance genomes.
>
> In the Howe et al. (2014) paper
> (http://www.pnas.org/content/111/13/4904.abstract) we see similar retention
> of k-mers.  I don't offhand remember the number of k-mers we saw getting
> removed, but we had much older and higher-error data.
>
> And thank you for some great error reports :)
>
> If I may ask - what did you sequence?
>
> cheers,
> --titus
>
> On Mon, May 18, 2015 at 09:56:06AM +0200, Joran Martijn wrote:
>> So, I did the three-pass protocol (same parameters as kalamazoo, with -N
>> 4 -x 32e9) on my metagenome (QC-processed with Trimmomatic) and counted
>> the k-mers at each step.
>> I've included the bash script for running the protocol as an attachment.
>>
>> Quick summary:
>> I started out with approximately 5.7 billion 20-mers in 346 million
>> reads in the paired dataset and 869 million 20-mers in 23 million reads
>> in the unpaired dataset.
>> After the 3rd pass I have 5.3 billion 20-mers left in 163 million reads
>> for the paired data and 549 million 20-mers left in 9.6 million reads for
>> the unpaired data.
>>
>> To me it seems that, although many reads were discarded, the
>> normalization did not reduce the 20-mer count nearly as much as I expected
>> it would, based on Table 1 of the diginorm paper.
>> To be fair, no metagenomes were tested there, so I'm not sure whether
>> this behaviour is "normal" or not.
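>>
>> Back-of-the-envelope, using the numbers above:
>>
>>     paired:   5.3e9 / 5.7e9 20-mers kept (~93%)  vs  163M / 346M reads kept (~47%)
>>     unpaired: 549M  / 869M  20-mers kept (~63%)  vs  9.6M / 23M  reads kept (~42%)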
>>
>> What are your thoughts on this?
>>
>> Cheers,
>>
>> Joran
>>
>> On 14/05/15 13:24, C. Titus Brown wrote:
>>> Hi Joran,
>>>
>>> absolutely.  Let us know how it goes for you!
>>>
>>> cheers,
>>> --titus
>>>
>>> On Thu, May 14, 2015 at 01:14:34PM +0200, Joran Martijn wrote:
>>>> I just saw the release of khmer 1.4, and it includes the
>>>> "unique-kmers.py" script in the sandbox.
>>>> Do you think I can use this script for my purpose (comparing the number
>>>> of unique k-mers for a certain k before and after the different steps of
>>>> the 3-pass normalization)?
>>>>
>>>> Cheers,
>>>>
>>>> Joran
>>>>
>>>> On 12/05/15 12:46, C. Titus Brown wrote:
>>>>> Kalamazoo uses the three-pass :).  We have pretty good evidence that
>>>>> it works ok for metagenomes - it's not what we used in Howe et al.,
>>>>> for two reasons (we didn't have the variable-coverage error trimming yet,
>>>>> and the data set was very low coverage) but we've been using it since.
>>>>>
>>>>> best,
>>>>> --titus
>>>>>
>>>>> On Mon, May 11, 2015 at 01:52:06PM +0200, Joran Martijn wrote:
>>>>>> Thanks Titus, let me know when you figure something out!
>>>>>>
>>>>>> I was playing around with several different coverage thresholds.
>>>>>> I won't use the 3-pass as I understood it does not work well for
>>>>>> metagenomes.
>>>>>> I was thinking of following the kalamazoo pipeline.
>>>>>>
>>>>>> Joran
>>>>>>
>>>>>> On 11/05/15 13:13, C. Titus Brown wrote:
>>>>>>> OK, that's very weird - this must be a bug, but I'll be darned if I can
>>>>>>> figure out what might be causing it.  The numbers in load-into-counting
>>>>>>> should be correct but I'll have to independently confirm that.
>>>>>>>
>>>>>>> BTW, for the first round of diginorm I'd use C=20; see 3-pass diginorm
>>>>>>> in the dn paper.
>>>>>>>
>>>>>>> best,
>>>>>>> --titus
>>>>>>>
>>>>>>> On Mon, May 11, 2015 at 01:06:24PM +0200, Joran Martijn wrote:
>>>>>>>> Hej Titus,
>>>>>>>>
>>>>>>>> Thanks for the quick reply! Here are the report files, which are
>>>>>>>> basically the STDERR and STDOUT output of the scripts.
>>>>>>>> Quick note before the reports: I made a small mistake in my
>>>>>>>> opening post. The coverage threshold I tried for these reports was 5,
>>>>>>>> not 20.
>>>>>>>>
>>>>>>>> Here the report file of the first load-into-counting.py execution (on
>>>>>>>> the raw sequence data), test.ct.report:
>>>>>>>>
>>>>>>>> || This is the script 'load-into-counting.py' in khmer.
>>>>>>>> || You are running khmer version 1.3
>>>>>>>> || You are also using screed version 0.8
>>>>>>>> ||
>>>>>>>> || If you use this script in a publication, please cite EACH of the
>>>>>>>> || following:
>>>>>>>> ||
>>>>>>>> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
>>>>>>>> ||   * Q Zhang et al., http://dx.doi.org/10.1371/journal.pone.0101271
>>>>>>>> ||   * A. Döring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
>>>>>>>> ||
>>>>>>>> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
>>>>>>>> || details.
>>>>>>>>
>>>>>>>>
>>>>>>>> PARAMETERS:
>>>>>>>>      - kmer size =    20            (-k)
>>>>>>>>      - n tables =     4             (-N)
>>>>>>>>      - min tablesize = 1.6e+10      (-x)
>>>>>>>>
>>>>>>>> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
>>>>>>>> --------
>>>>>>>> Saving k-mer counting table to test.ct
>>>>>>>> Loading kmers from sequences in ['test.fastq.gz']
>>>>>>>> making k-mer counting table
>>>>>>>> consuming input test.fastq.gz
>>>>>>>> Total number of unique k-mers: 3102943887
>>>>>>>> saving test.ct
>>>>>>>> fp rate estimated to be 0.008
>>>>>>>> DONE.
>>>>>>>> wrote to: test.ct.info
>>>>>>>>
>>>>>>>> Here the report file of the normalize-by-median.py, test_k20_C5.report
>>>>>>>>
>>>>>>>> || This is the script 'normalize-by-median.py' in khmer.
>>>>>>>> || You are running khmer version 1.3
>>>>>>>> || You are also using screed version 0.8
>>>>>>>> ||
>>>>>>>> || If you use this script in a publication, please cite EACH of the
>>>>>>>> || following:
>>>>>>>> ||
>>>>>>>> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
>>>>>>>> ||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]
>>>>>>>> ||
>>>>>>>> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
>>>>>>>> || details.
>>>>>>>>
>>>>>>>>
>>>>>>>> PARAMETERS:
>>>>>>>>      - kmer size =    20            (-k)
>>>>>>>>      - n tables =     4             (-N)
>>>>>>>>      - min tablesize = 1.6e+10      (-x)
>>>>>>>>
>>>>>>>> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
>>>>>>>> --------
>>>>>>>> ... kept 58012 of 200000 or 29%
>>>>>>>> ... in file test.fastq.gz
>>>>>>>> ... kept 116210 of 400000 or 29%
>>>>>>>> ... in file test.fastq.gz
>>>>>>>>
>>>>>>>> ..... etc etc etc .....
>>>>>>>>
>>>>>>>> ... kept 90482098 of 346200000 or 26%
>>>>>>>> ... in file test.fastq.gz
>>>>>>>> ... kept 90529526 of 346400000 or 26%
>>>>>>>> ... in file test.fastq.gz
>>>>>>>> Total number of unique k-mers: 850221
>>>>>>>> loading k-mer counting table from test.ct
>>>>>>>> DONE with test.fastq.gz; kept 90547512 of 346477608 or 26%
>>>>>>>> output in test_k20_C5.fastq.gz.keep
>>>>>>>> fp rate estimated to be 0.008
>>>>>>>>
>>>>>>>> And here the second load-into-counting.py report, test2.ct.report
>>>>>>>>
>>>>>>>> || This is the script 'load-into-counting.py' in khmer.
>>>>>>>> || You are running khmer version 1.3
>>>>>>>> || You are also using screed version 0.8
>>>>>>>> ||
>>>>>>>> || If you use this script in a publication, please cite EACH of the
>>>>>>>> || following:
>>>>>>>> ||
>>>>>>>> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
>>>>>>>> ||   * Q Zhang et al., http://dx.doi.org/10.1371/journal.pone.0101271
>>>>>>>> ||   * A. Döring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
>>>>>>>> ||
>>>>>>>> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
>>>>>>>> || details.
>>>>>>>>
>>>>>>>>
>>>>>>>> PARAMETERS:
>>>>>>>>      - kmer size =    20            (-k)
>>>>>>>>      - n tables =     4             (-N)
>>>>>>>>      - min tablesize = 1.6e+10      (-x)
>>>>>>>>
>>>>>>>> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
>>>>>>>> --------
>>>>>>>> Saving k-mer counting table to test2.ct
>>>>>>>> Loading kmers from sequences in ['test_k20_C5.fastq.gz.keep']
>>>>>>>> making k-mer counting table
>>>>>>>> consuming input test_k20_C5.fastq.gz.keep
>>>>>>>> Total number of unique k-mers: 2822473008
>>>>>>>> saving test2.ct
>>>>>>>>
>>>>>>>> Hope this helps!
>>>>>>>>
>>>>>>>> Joran
>>>>>>>>
>>>>>>>> On 11/05/15 12:12, C. Titus Brown wrote:
>>>>>>>>> On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:
>>>>>>>>>> Dear Khmer mailing list,
>>>>>>>>>>
>>>>>>>>>> I'm trying to compare the number of unique k-mers (let's say 20-mers) in
>>>>>>>>>> the raw dataset and the diginormed dataset, similar to what was done in
>>>>>>>>>> the original diginorm paper.
>>>>>>>>> [ elided ]
>>>>>>>>>
>>>>>>>>> Hi Joran,
>>>>>>>>>
>>>>>>>>> that certainly doesn't sound good :). Would it be possible to send the
>>>>>>>>> various report files to us, publicly or privately?
>>>>>>>>>
>>>>>>>>> thanks,
>>>>>>>>> --titus
>>>>>>>>>
>>>>>>>>> p.s. Thank you for the very detailed question!
>> #!/bin/bash
>>
>> # rerun kalamazoo pipeline with khmer 1.4.
>>
>> # step 1. Digital normalization (k=20, C=20)
>> echo "step 1 normalizing ..."
>> normalize-by-median.py \
>>      -p -t -q \
>>      -k 20 -C 20 \
>>      -N 4 -x 32e9 \
>>      -s step1.ct \
>>      -R step1.report \
>>      -u unpaired.fq \
>>      paired.fq
>> echo "step 1 normalizing done"
>>
>> # count k-mers
>> echo "counting k-mers after step 1..."
>> unique-kmers.py -q -k 20 -e 0.005 -R paired.fq.kcount      paired.fq        # prior step 1: 5,684,893,399
>> unique-kmers.py -q -k 20 -e 0.005 -R paired.fq.keep.kcount paired.fq.keep   # post  step 1: 5,539,021,593
>>
>> unique-kmers.py -q -k 20 -e 0.005 -R unpaired.fq.kcount      unpaired.fq       # prior step 1: 869,664,270
>> unique-kmers.py -q -k 20 -e 0.005 -R unpaired.fq.keep.kcount unpaired.fq.keep  # post  step 1: 721,158,291
>> echo "counting k-mers after step 1 done"
>>
>> # step 2. Trim abundance-1 k-mers from reads in high-coverage regions
>> echo "doing filter-abund.py ..."
>> filter-abund.py -T 30 -V raw.ct *.keep
>> echo "filter-abund.py done"
>> echo "doing extract-paired-reads.py..."
>> extract-paired-reads.py paired.fq.keep.abundfilt
>> echo "extract-paired-reads.py done"
>>
>> # count k-mers
>> echo "counting k-mers after step 2..."
>> unique-kmers.py -q -k 20 -e 0.005 -R   paired.fq.keep.abundfilt.pe.kcount   paired.fq.keep.abundfilt.pe   # post step 2: 5,392,852,944
>> unique-kmers.py -q -k 20 -e 0.005 -R   paired.fq.keep.abundfilt.se.kcount   paired.fq.keep.abundfilt.se   # post step 2:   123,145,801
>> unique-kmers.py -q -k 20 -e 0.005 -R unpaired.fq.keep.abundfilt.kcount    unpaired.fq.keep.abundfilt      # post step 2:   715,909,495
>> echo "counting k-mers after step 2 done"
>>
>> # step 3. Digital normalization (k=20, C=5)
>> cat paired.fq.keep.abundfilt.se unpaired.fq.keep.abundfilt > unpairedAll.fq.keep.abundfilt.se
>> unique-kmers.py -q -k 20 -e 0.005 -R unpairedAll.fq.keep.abundfilt.se.kcount unpairedAll.fq.keep.abundfilt.se    # 807,008,522
>>
>> echo "step 3 normalizing ..."
>> normalize-by-median.py \
>>      -p -t -q \
>>      -k 20 -C 5 \
>>      -N 4 -x 32e9 \
>>      -s step3.ct \
>>      -R step3.report \
>>      -u unpairedAll.fq.keep.abundfilt.se \
>>      paired.fq.keep.abundfilt.pe
>> echo "step 3 normalizing done"
>>
>> # count k-mers
>> echo "counting k-mers after step 3..."
>> unique-kmers.py -q -k 20 -e 0.005 -R      paired.fq.keep.abundfilt.pe.keep.kcount      paired.fq.keep.abundfilt.pe.keep # post step 3: 5,334,399,264
>> unique-kmers.py -q -k 20 -e 0.005 -R unpairedAll.fq.keep.abundfilt.se.keep.kcount unpairedAll.fq.keep.abundfilt.se.keep # post step 3:   549,553,382
>> echo "counting k-mers after step 3 done"
>