[khmer] inconsistent unique k-mer counting

Joran Martijn joran.martijn at icm.uu.se
Mon May 18 00:56:06 PDT 2015


So, I did the three-pass protocol (same parameters as kalamazoo, with -N 
4 -x 32e9) on my metagenome (QC-processed with Trimmomatic) and counted 
the k-mers at each step.
I've included the bash script for running the protocol in the attachment.

Quick summary:
I started out with approximately 5.7 billion 20-mers in 346 million 
reads in the paired dataset and 869 million 20-mers in 23 million reads 
for the unpaired dataset.
After the 3rd pass I have 5.3 billion 20-mers left in 163 million reads 
for paired data and 549 million 20-mers left in 9.6 million reads for 
unpaired data.
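
Broken down per step (distinct 20-mer counts for the paired data, copied 
from the comments in the attached script), each pass only shaves off a 
few percent:

```shell
# Per-step retention of distinct paired 20-mers; the counts are the
# unique-kmers.py estimates quoted in the attached script's comments.
awk 'BEGIN {
    raw = 5684893399;  # paired.fq, before step 1
    s1  = 5539021593;  # after step 1 (normalize-by-median, C=20)
    s2  = 5392852944;  # after step 2 (filter-abund -V, .pe reads)
    s3  = 5334399264;  # after step 3 (normalize-by-median, C=5)
    printf "step 1 retained: %.1f%%\n", 100 * s1 / raw;
    printf "step 2 retained: %.1f%%\n", 100 * s2 / s1;
    printf "step 3 retained: %.1f%%\n", 100 * s3 / s2;
}'
```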

To me it seems that, although many reads were discarded, the 
normalization did not reduce the 20-mer count nearly as much as I 
expected it would, based on Table 1 of the diginorm paper.
To be fair, no metagenomes were tested there, so I'm not sure whether 
this behaviour is "normal" or not.
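
To put numbers on that contrast (using the paired totals from the 
attached script's comments): roughly half the reads were discarded, but 
only about 6% of the distinct 20-mers went away.

```shell
# Reads vs. distinct 20-mers discarded over the whole three-pass run
# (paired data: ~346M -> ~163M reads; 5,684,893,399 -> 5,334,399,264 20-mers).
awk 'BEGIN {
    printf "reads discarded:   %.1f%%\n", 100 * (1 - 163e6 / 346e6);
    printf "20-mers discarded: %.1f%%\n", 100 * (1 - 5334399264 / 5684893399);
}'
```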

What are your thoughts on this?

Cheers,

Joran

On 14/05/15 13:24, C. Titus Brown wrote:
> Hi Joran,
>
> absolutely.  Let us know how it goes for you!
>
> cheers,
> --titus
>
> On Thu, May 14, 2015 at 01:14:34PM +0200, Joran Martijn wrote:
>> I just saw the release of Khmer 1.4, and it includes in the sandbox the
>> "unique-kmers.py" script.
>> Do you think I can use this script for my purpose (comparing the unique
>> number of k-mers for a given k before and after the different steps of the
>> 3-pass normalization)?
>>
>> Cheers,
>>
>> Joran
>>
>> On 12/05/15 12:46, C. Titus Brown wrote:
>>> Kalamazoo uses the three-pass :).  We have pretty good evidence that
>>> it works ok for metagenomes - it's not what we used in Howe et al.,
>>> for two reasons (we didn't have the variable-coverage error trimming yet,
>>> and the data set was very low coverage) but we've been using it since.
>>>
>>> best,
>>> --titus
>>>
>>> On Mon, May 11, 2015 at 01:52:06PM +0200, Joran Martijn wrote:
>>>> Thanks Titus, let me know when you figure something out!
>>>>
>>>> I was playing around with several different Coverage thresholds.
>>>> I won't use the 3-pass as I understood this does not work well for
>>>> metagenomes.
>>>> I was thinking of following the kalamazoo pipeline.
>>>>
>>>> Joran
>>>>
>>>> On 11/05/15 13:13, C. Titus Brown wrote:
>>>>> OK, that's very weird - this must be a bug, but I'll be darned if I can
>>>>> figure out what might be causing it.  The numbers in load-into-counting
>>>>> should be correct but I'll have to independently confirm that.
>>>>>
>>>>> BTW, for the first round of diginorm I'd use C=20; see 3-pass diginorm
>>>>> in the dn paper.
>>>>>
>>>>> best,
>>>>> --titus
>>>>>
>>>>> On Mon, May 11, 2015 at 01:06:24PM +0200, Joran Martijn wrote:
>>>>>> Hej Titus,
>>>>>>
>>>>>> Thanks for the quick reply! Here are the report files, which are
>>>>>> basically the STDERR and STDOUT output of the scripts.
>>>>>> Quick note before the reports: I made a small mistake in my opening
>>>>>> post. The coverage threshold I tried for these reports was 5,
>>>>>> not 20.
>>>>>>
>>>>>> Here the report file of the first load-into-counting.py execution (on
>>>>>> the raw sequence data), test.ct.report:
>>>>>>
>>>>>> || This is the script 'load-into-counting.py' in khmer.
>>>>>> || You are running khmer version 1.3
>>>>>> || You are also using screed version 0.8
>>>>>> ||
>>>>>> || If you use this script in a publication, please cite EACH of the
>>>>>> following:
>>>>>> ||
>>>>>> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
>>>>>> ||   * Q Zhang et al., http://dx.doi.org/10.1371/journal.pone.0101271
>>>>>> ||   * A. Döring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
>>>>>> ||
>>>>>> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
>>>>>> details.
>>>>>>
>>>>>>
>>>>>> PARAMETERS:
>>>>>>     - kmer size =    20            (-k)
>>>>>>     - n tables =     4             (-N)
>>>>>>     - min tablesize = 1.6e+10      (-x)
>>>>>>
>>>>>> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
>>>>>> --------
>>>>>> Saving k-mer counting table to test.ct
>>>>>> Loading kmers from sequences in ['test.fastq.gz']
>>>>>> making k-mer counting table
>>>>>> consuming input test.fastq.gz
>>>>>> Total number of unique k-mers: 3102943887
>>>>>> saving test.ct
>>>>>> fp rate estimated to be 0.008
>>>>>> DONE.
>>>>>> wrote to: test.ct.info
>>>>>>
>>>>>> Here the report file of the normalize-by-median.py, test_k20_C5.report
>>>>>>
>>>>>> || This is the script 'normalize-by-median.py' in khmer.
>>>>>> || You are running khmer version 1.3
>>>>>> || You are also using screed version 0.8
>>>>>> ||
>>>>>> || If you use this script in a publication, please cite EACH of the
>>>>>> following:
>>>>>> ||
>>>>>> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
>>>>>> ||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]
>>>>>> ||
>>>>>> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
>>>>>> details.
>>>>>>
>>>>>>
>>>>>> PARAMETERS:
>>>>>>     - kmer size =    20            (-k)
>>>>>>     - n tables =     4             (-N)
>>>>>>     - min tablesize = 1.6e+10      (-x)
>>>>>>
>>>>>> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
>>>>>> --------
>>>>>> ... kept 58012 of 200000 or 29%
>>>>>> ... in file test.fastq.gz
>>>>>> ... kept 116210 of 400000 or 29%
>>>>>> ... in file test.fastq.gz
>>>>>>
>>>>>> ..... etc etc etc .....
>>>>>>
>>>>>> ... kept 90482098 of 346200000 or 26%
>>>>>> ... in file test.fastq.gz
>>>>>> ... kept 90529526 of 346400000 or 26%
>>>>>> ... in file test.fastq.gz
>>>>>> Total number of unique k-mers: 850221
>>>>>> loading k-mer counting table from test.ct
>>>>>> DONE with test.fastq.gz; kept 90547512 of 346477608 or 26%
>>>>>> output in test_k20_C5.fastq.gz.keep
>>>>>> fp rate estimated to be 0.008
>>>>>>
>>>>>> And here the second load-into-counting.py report, test2.ct.report
>>>>>>
>>>>>> || This is the script 'load-into-counting.py' in khmer.
>>>>>> || You are running khmer version 1.3
>>>>>> || You are also using screed version 0.8
>>>>>> ||
>>>>>> || If you use this script in a publication, please cite EACH of the
>>>>>> following:
>>>>>> ||
>>>>>> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
>>>>>> ||   * Q Zhang et al., http://dx.doi.org/10.1371/journal.pone.0101271
>>>>>> ||   * A. Döring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
>>>>>> ||
>>>>>> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
>>>>>> details.
>>>>>>
>>>>>>
>>>>>> PARAMETERS:
>>>>>>     - kmer size =    20            (-k)
>>>>>>     - n tables =     4             (-N)
>>>>>>     - min tablesize = 1.6e+10      (-x)
>>>>>>
>>>>>> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
>>>>>> --------
>>>>>> Saving k-mer counting table to test2.ct
>>>>>> Loading kmers from sequences in ['test_k20_C5.fastq.gz.keep']
>>>>>> making k-mer counting table
>>>>>> consuming input test_k20_C5.fastq.gz.keep
>>>>>> Total number of unique k-mers: 2822473008
>>>>>> saving test2.ct
>>>>>>
>>>>>> Hope this helps!
>>>>>>
>>>>>> Joran
>>>>>>
>>>>>> On 11/05/15 12:12, C. Titus Brown wrote:
>>>>>>> On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:
>>>>>>>> Dear Khmer mailing list,
>>>>>>>>
>>>>>>>> I'm trying to compare the number of unique k-mers (let's say 20-mers) in
>>>>>>>> the raw dataset and the diginormed dataset, similar to what was done in the
>>>>>>>> original diginorm paper.
>>>>>>> [ elided ]
>>>>>>>
>>>>>>> Hi Joran,
>>>>>>>
>>>>>>> that certainly doesn't sound good :). Would it be possible to convey the
>>>>>>> various report files to us, publicly or privately?
>>>>>>>
>>>>>>> thanks,
>>>>>>> --titus
>>>>>>>
>>>>>>> p.s. Thank you for the very detailed question!

-------------- next part --------------
#!/bin/bash

# rerun kalamazoo pipeline with khmer 1.4.

# step 1. Digital normalization (k=20, C=20)
echo "step 1 normalizing ..."
normalize-by-median.py \
    -p -t -q \
    -k 20 -C 20 \
    -N 4 -x 32e9 \
    -s step1.ct \
    -R step1.report \
    -u unpaired.fq \
    paired.fq 
echo "step 1 normalizing done"

# count k-mers
echo "counting k-mers after step 1..."
unique-kmers.py -q -k 20 -e 0.005 -R paired.fq.kcount      paired.fq        # prior step 1: 5,684,893,399
unique-kmers.py -q -k 20 -e 0.005 -R paired.fq.keep.kcount paired.fq.keep   # post  step 1: 5,539,021,593

unique-kmers.py -q -k 20 -e 0.005 -R unpaired.fq.kcount      unpaired.fq       # prior step 1: 869,664,270
unique-kmers.py -q -k 20 -e 0.005 -R unpaired.fq.keep.kcount unpaired.fq.keep  # post  step 1: 721,158,291
echo "counting k-mers after step 1 done"

# step 2. Trim low-abundance (abundance-1) k-mers from high-coverage reads
echo "doing filter-abund.py ..."
filter-abund.py -T 30 -V raw.ct *.keep
echo "filter-abund.py done"
echo "doing extract-paired-reads.py..."
extract-paired-reads.py paired.fq.keep.abundfilt
echo "extract-paired-reads.py done"

# count k-mers
echo "counting k-mers after step 2..."
unique-kmers.py -q -k 20 -e 0.005 -R   paired.fq.keep.abundfilt.pe.kcount   paired.fq.keep.abundfilt.pe   # post step 2: 5,392,852,944
unique-kmers.py -q -k 20 -e 0.005 -R   paired.fq.keep.abundfilt.se.kcount   paired.fq.keep.abundfilt.se   # post step 2:   123,145,801
unique-kmers.py -q -k 20 -e 0.005 -R unpaired.fq.keep.abundfilt.kcount    unpaired.fq.keep.abundfilt      # post step 2:   715,909,495
echo "counting k-mers after step 2 done"

# step 3. Digital normalization (k=20, C=5)
cat paired.fq.keep.abundfilt.se unpaired.fq.keep.abundfilt > unpairedAll.fq.keep.abundfilt.se
unique-kmers.py -q -k 20 -e 0.005 -R unpairedAll.fq.keep.abundfilt.se.kcount unpairedAll.fq.keep.abundfilt.se    # 807,008,522

echo "step 3 normalizing ..."
normalize-by-median.py \
    -p -t -q \
    -k 20 -C 5 \
    -N 4 -x 32e9 \
    -s step3.ct \
    -R step3.report \
    -u unpairedAll.fq.keep.abundfilt.se \
    paired.fq.keep.abundfilt.pe
echo "step 3 normalizing done"

# count k-mers
echo "counting k-mers after step 3..."
unique-kmers.py -q -k 20 -e 0.005 -R      paired.fq.keep.abundfilt.pe.keep.kcount      paired.fq.keep.abundfilt.pe.keep # post step 3: 5,334,399,264
unique-kmers.py -q -k 20 -e 0.005 -R unpairedAll.fq.keep.abundfilt.se.keep.kcount unpairedAll.fq.keep.abundfilt.se.keep # post step 3:   549,553,382 
echo "counting k-mers after step 3 done"

