[khmer] inconsistent unique k-mer counting

C. Titus Brown ctbrown at ucdavis.edu
Mon May 18 03:20:22 PDT 2015


Hi Joran,

The script looks good to me and the results seem reasonable.  Sounds to me like
you have a very low error rate, and a very diverse metagenome with a few
high-abundance genomes.

In the Howe et al. (2014) paper
(http://www.pnas.org/content/111/13/4904.abstract) we see similar retention
of k-mers.  I don't remember offhand the number of k-mers we saw getting
removed, but we had much older and higher-error data.

And thank you for some great error reports :)

If I may ask - what did you sequence?

cheers,
--titus

On Mon, May 18, 2015 at 09:56:06AM +0200, Joran Martijn wrote:
> So, I did the three-pass protocol (same parameters as kalamazoo, with -N
> 4 -x 32e9) on my metagenome (Trimmomatic QC-processed) and counted the
> k-mers at each step.
> I've included the bash script for running the protocol in the attachment.
>
> Quick summary:
> I started out with approximately 5.7 billion 20-mers in 346 million
> reads in the paired dataset and 869 million 20-mers in 23 million reads
> for the unpaired dataset.
> After the 3rd pass I have 5.3 billion 20-mers left in 163 million reads
> for paired data and 549 million 20-mers left in 9.6 million reads for
> unpaired data.
>
> To me, it seems that, although many reads were discarded, the
> normalization did not reduce the 20-mer count as much as I expected it
> would, based on Table 1 of the diginorm paper.
> To be fair, no metagenomes were tested there, so I'm not sure whether
> this behaviour is "normal" or not.
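>
> As a rough sanity check (a quick back-of-the-envelope calculation using
> the rounded numbers above; the exact values are in the comments of the
> attached script), the retained fractions work out to roughly:
>
>     awk 'BEGIN { printf "paired 20-mers kept:   %.0f%%\n", 100*5.3/5.7 }'   # ~93%
>     awk 'BEGIN { printf "paired reads kept:     %.0f%%\n", 100*163/346 }'   # ~47%
>     awk 'BEGIN { printf "unpaired 20-mers kept: %.0f%%\n", 100*549/869 }'   # ~63%
>
> So roughly half the paired reads are discarded while over 90% of the
> distinct 20-mers survive.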
>
> What are your thoughts on this?
>
> Cheers,
>
> Joran
>
> On 14/05/15 13:24, C. Titus Brown wrote:
>> Hi Joran,
>>
>> absolutely.  Let us know how it goes for you!
>>
>> cheers,
>> --titus
>>
>> On Thu, May 14, 2015 at 01:14:34PM +0200, Joran Martijn wrote:
>>> I just saw the release of khmer 1.4, and it includes the
>>> "unique-kmers.py" script in the sandbox.
>>> Do you think I can use this script for my purpose (comparing the number
>>> of unique k-mers for a given k before and after the different steps of
>>> the 3-pass normalization)?
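>>>
>>> For example (file names here are just placeholders), the before/after
>>> comparison could be as simple as:
>>>
>>>     unique-kmers.py -q -k 20 -e 0.005 -R before.kcount before.fq   # raw reads
>>>     unique-kmers.py -q -k 20 -e 0.005 -R after.kcount  after.fq    # diginormed reads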
>>>
>>> Cheers,
>>>
>>> Joran
>>>
>>> On 12/05/15 12:46, C. Titus Brown wrote:
>>>> Kalamazoo uses the three-pass :).  We have pretty good evidence that
>>>> it works ok for metagenomes - it's not what we used in Howe et al.,
>>>> for two reasons (we didn't have the variable-coverage error trimming yet,
>>>> and the data set was very low coverage) but we've been using it since.
>>>>
>>>> best,
>>>> --titus
>>>>
>>>> On Mon, May 11, 2015 at 01:52:06PM +0200, Joran Martijn wrote:
>>>>> Thanks Titus, let me know when you figure something out!
>>>>>
>>>>> I was playing around with several different coverage thresholds.
>>>>> I won't use the 3-pass as I understood it does not work well for
>>>>> metagenomes.
>>>>> I was thinking of following the kalamazoo pipeline.
>>>>>
>>>>> Joran
>>>>>
>>>>> On 11/05/15 13:13, C. Titus Brown wrote:
>>>>>> OK, that's very weird - this must be a bug, but I'll be darned if I can
>>>>>> figure out what might be causing it.  The numbers in load-into-counting
>>>>>> should be correct but I'll have to independently confirm that.
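>>>>>>
>>>>>> One slow but independent way to cross-check a unique k-mer count on a
>>>>>> small subsample with standard Unix tools (note that, unlike khmer, this
>>>>>> treats a k-mer and its reverse complement as distinct, so the numbers
>>>>>> won't match exactly; the file name is a placeholder):
>>>>>>
>>>>>>     zcat subsample.fastq.gz \
>>>>>>       | awk 'NR % 4 == 2 { for (i = 1; i <= length($0) - 19; i++) print substr($0, i, 20) }' \
>>>>>>       | sort -u | wc -l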
>>>>>>
>>>>>> BTW, for the first round of diginorm I'd use C=20; see 3-pass diginorm
>>>>>> in the dn paper.
>>>>>>
>>>>>> best,
>>>>>> --titus
>>>>>>
>>>>>> On Mon, May 11, 2015 at 01:06:24PM +0200, Joran Martijn wrote:
>>>>>>> Hej Titus,
>>>>>>>
>>>>>>> Thanks for the quick reply! Here are the report files, which are
>>>>>>> basically the STDERR and STDOUT output of the scripts.
>>>>>>> Quick note before the reports: I made a small mistake in my
>>>>>>> opening post. The coverage threshold I tried for these reports was 5,
>>>>>>> not 20.
>>>>>>>
>>>>>>> Here the report file of the first load-into-counting.py execution (on
>>>>>>> the raw sequence data), test.ct.report:
>>>>>>>
>>>>>>> || This is the script 'load-into-counting.py' in khmer.
>>>>>>> || You are running khmer version 1.3
>>>>>>> || You are also using screed version 0.8
>>>>>>> ||
>>>>>>> || If you use this script in a publication, please cite EACH of the
>>>>>>> following:
>>>>>>> ||
>>>>>>> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
>>>>>>> ||   * Q Zhang et al., http://dx.doi.org/10.1371/journal.pone.0101271
>>>>>>> ||   * A. Döring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
>>>>>>> ||
>>>>>>> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
>>>>>>> details.
>>>>>>>
>>>>>>>
>>>>>>> PARAMETERS:
>>>>>>>     - kmer size =    20            (-k)
>>>>>>>     - n tables =     4             (-N)
>>>>>>>     - min tablesize = 1.6e+10      (-x)
>>>>>>>
>>>>>>> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
>>>>>>> --------
>>>>>>> Saving k-mer counting table to test.ct
>>>>>>> Loading kmers from sequences in ['test.fastq.gz']
>>>>>>> making k-mer counting table
>>>>>>> consuming input test.fastq.gz
>>>>>>> Total number of unique k-mers: 3102943887
>>>>>>> saving test.ct
>>>>>>> fp rate estimated to be 0.008
>>>>>>> DONE.
>>>>>>> wrote to: test.ct.info
>>>>>>>
>>>>>>> Here the report file of the normalize-by-median.py, test_k20_C5.report
>>>>>>>
>>>>>>> || This is the script 'normalize-by-median.py' in khmer.
>>>>>>> || You are running khmer version 1.3
>>>>>>> || You are also using screed version 0.8
>>>>>>> ||
>>>>>>> || If you use this script in a publication, please cite EACH of the
>>>>>>> following:
>>>>>>> ||
>>>>>>> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
>>>>>>> ||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]
>>>>>>> ||
>>>>>>> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
>>>>>>> details.
>>>>>>>
>>>>>>>
>>>>>>> PARAMETERS:
>>>>>>>     - kmer size =    20            (-k)
>>>>>>>     - n tables =     4             (-N)
>>>>>>>     - min tablesize = 1.6e+10      (-x)
>>>>>>>
>>>>>>> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
>>>>>>> --------
>>>>>>> ... kept 58012 of 200000 or 29%
>>>>>>> ... in file test.fastq.gz
>>>>>>> ... kept 116210 of 400000 or 29%
>>>>>>> ... in file test.fastq.gz
>>>>>>>
>>>>>>> ..... etc etc etc .....
>>>>>>>
>>>>>>> ... kept 90482098 of 346200000 or 26%
>>>>>>> ... in file test.fastq.gz
>>>>>>> ... kept 90529526 of 346400000 or 26%
>>>>>>> ... in file test.fastq.gz
>>>>>>> Total number of unique k-mers: 850221
>>>>>>> loading k-mer counting table from test.ct
>>>>>>> DONE with test.fastq.gz; kept 90547512 of 346477608 or 26%
>>>>>>> output in test_k20_C5.fastq.gz.keep
>>>>>>> fp rate estimated to be 0.008
>>>>>>>
>>>>>>> And here the second load-into-counting.py report, test2.ct.report
>>>>>>>
>>>>>>> || This is the script 'load-into-counting.py' in khmer.
>>>>>>> || You are running khmer version 1.3
>>>>>>> || You are also using screed version 0.8
>>>>>>> ||
>>>>>>> || If you use this script in a publication, please cite EACH of the
>>>>>>> following:
>>>>>>> ||
>>>>>>> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
>>>>>>> ||   * Q Zhang et al., http://dx.doi.org/10.1371/journal.pone.0101271
>>>>>>> ||   * A. Döring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
>>>>>>> ||
>>>>>>> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
>>>>>>> details.
>>>>>>>
>>>>>>>
>>>>>>> PARAMETERS:
>>>>>>>     - kmer size =    20            (-k)
>>>>>>>     - n tables =     4             (-N)
>>>>>>>     - min tablesize = 1.6e+10      (-x)
>>>>>>>
>>>>>>> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
>>>>>>> --------
>>>>>>> Saving k-mer counting table to test2.ct
>>>>>>> Loading kmers from sequences in ['test_k20_C5.fastq.gz.keep']
>>>>>>> making k-mer counting table
>>>>>>> consuming input test_k20_C5.fastq.gz.keep
>>>>>>> Total number of unique k-mers: 2822473008
>>>>>>> saving test2.ct
>>>>>>>
>>>>>>> Hope this helps!
>>>>>>>
>>>>>>> Joran
>>>>>>>
>>>>>>> On 11/05/15 12:12, C. Titus Brown wrote:
>>>>>>>> On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:
>>>>>>>>> Dear Khmer mailing list,
>>>>>>>>>
>>>>>>>>> I'm trying to compare the number of unique k-mers (let's say 20-mers) in
>>>>>>>>> the raw dataset and the diginormed dataset, similar to what was done in
>>>>>>>>> the original diginorm paper.
>>>>>>>> [ elided ]
>>>>>>>>
>>>>>>>> Hi Joran,
>>>>>>>>
>>>>>>>> that certainly doesn't sound good :). Would it be possible to send the
>>>>>>>> various report files to us, publicly or privately?
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>> --titus
>>>>>>>>
>>>>>>>> p.s. Thank you for the very detailed question!
>

> #!/bin/bash
> 
> # rerun kalamazoo pipeline with khmer 1.4.
> 
> # step 1. Digital normalization (k=20, C=20)
> echo "step 1 normalizing ..."
> normalize-by-median.py \
>     -p -t -q \
>     -k 20 -C 20 \
>     -N 4 -x 32e9 \
>     -s step1.ct \
>     -R step1.report \
>     -u unpaired.fq \
>     paired.fq 
> echo "step 1 normalizing done"
> 
> # count k-mers
> echo "counting k-mers after step 1..."
> unique-kmers.py -q -k 20 -e 0.005 -R paired.fq.kcount      paired.fq        # prior step 1: 5,684,893,399
> unique-kmers.py -q -k 20 -e 0.005 -R paired.fq.keep.kcount paired.fq.keep   # post  step 1: 5,539,021,593
> 
> unique-kmers.py -q -k 20 -e 0.005 -R unpaired.fq.kcount      unpaired.fq       # prior step 1: 869,664,270
> unique-kmers.py -q -k 20 -e 0.005 -R unpaired.fq.keep.kcount unpaired.fq.keep  # post  step 1: 721,158,291
> echo "counting k-mers after step 1 done"
> 
> # step 2. Trim abundance-1 k-mers from reads in high-coverage regions (variable-coverage mode)
> echo "doing filter-abund.py ..."
> filter-abund.py -T 30 -V step1.ct *.keep   # counting table saved by normalize-by-median in step 1
> echo "filter-abund.py done"
> echo "doing extract-paired-reads.py..."
> extract-paired-reads.py paired.fq.keep.abundfilt
> echo "extract-paired-reads.py done"
> 
> # count k-mers
> echo "counting k-mers after step 2..."
> unique-kmers.py -q -k 20 -e 0.005 -R   paired.fq.keep.abundfilt.pe.kcount   paired.fq.keep.abundfilt.pe   # post step 2: 5,392,852,944
> unique-kmers.py -q -k 20 -e 0.005 -R   paired.fq.keep.abundfilt.se.kcount   paired.fq.keep.abundfilt.se   # post step 2:   123,145,801
> unique-kmers.py -q -k 20 -e 0.005 -R unpaired.fq.keep.abundfilt.kcount    unpaired.fq.keep.abundfilt      # post step 2:   715,909,495
> echo "counting k-mers after step 2 done"
> 
> # step 3. Digital normalization (k=20, C=5)
> cat paired.fq.keep.abundfilt.se unpaired.fq.keep.abundfilt > unpairedAll.fq.keep.abundfilt.se
> unique-kmers.py -q -k 20 -e 0.005 -R unpairedAll.fq.keep.abundfilt.se.kcount unpairedAll.fq.keep.abundfilt.se    # 807,008,522
> 
> echo "step 3 normalizing ..."
> normalize-by-median.py \
>     -p -t -q \
>     -k 20 -C 5 \
>     -N 4 -x 32e9 \
>     -s step3.ct \
>     -R step3.report \
>     -u unpairedAll.fq.keep.abundfilt.se \
>     paired.fq.keep.abundfilt.pe
> echo "step 3 normalizing done"
> 
> # count k-mers
> echo "counting k-mers after step 3..."
> unique-kmers.py -q -k 20 -e 0.005 -R      paired.fq.keep.abundfilt.pe.keep.kcount      paired.fq.keep.abundfilt.pe.keep # post step 3: 5,334,399,264
> unique-kmers.py -q -k 20 -e 0.005 -R unpairedAll.fq.keep.abundfilt.se.keep.kcount unpairedAll.fq.keep.abundfilt.se.keep # post step 3:   549,553,382 
> echo "counting k-mers after step 3 done"


-- 
C. Titus Brown, ctbrown at ucdavis.edu


