[khmer] inconsistent unique k-mer counting

Joran Martijn joran.martijn at icm.uu.se
Thu May 21 06:51:31 PDT 2015


A quick update on the 3-pass diginorm I applied (see my earlier emails 
in this thread).

I have run Ray Meta with a couple of different k-mers on all datasets 
after each step of the 3-pass.
Unfortunately, the results indicate that it did not do the assembly any 
good. Especially after the third pass, the assemblies are really bad.

Although now I'm thinking that maybe I should not have run with the 
Meta flag on, now that the data is normalized.

See the attachment for the results. Rendering of the third, fourth, and 
fifth column titles seems to have gone wrong, but they are largest 
contig (Kbp), number of contigs (x1000), and total assembly size (Mbp).

What do you think?

Joran
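The retention figures quoted later in the thread can be summarized with a quick back-of-the-envelope check (plain Python; the read and 20-mer counts are the rounded numbers from the messages below):

```python
# Read vs. unique 20-mer retention across the 3-pass protocol,
# using the (rounded) figures quoted in this thread:
# (reads_before, reads_after, unique_kmers_before, unique_kmers_after)
datasets = {
    "paired":   (346_477_608, 163_000_000, 5_684_893_399, 5_334_399_264),
    "unpaired": ( 23_000_000,   9_600_000,   869_664_270,   549_553_382),
}

for name, (r0, r1, k0, k1) in datasets.items():
    print(f"{name}: kept {r1 / r0:.0%} of reads "
          f"but {k1 / k0:.0%} of unique 20-mers")
# paired: kept 47% of reads but 94% of unique 20-mers
# unpaired: kept 42% of reads but 63% of unique 20-mers
```

This is the pattern one would expect for a diverse metagenome: diginorm removes redundant high-coverage reads, while most unique k-mers live in the low-coverage tail and are retained.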

On 18/05/15 16:03, Joran Martijn wrote:
> Hej Titus,
>
> Ok, that's a relief :) I was afraid some kind of mistake or bug had 
> happened.
> I'm not sure whether the error rate was high or low; after Trimmomatic, 
> the data looked relatively clean in the FastQC reports.
> The data is derived from a 2x100 bp paired-end HiSeq run, so it's not 
> that new.
>
> I don't know how much detail I can go into about the samples, but let's 
> say they are aquatic :)
>
> Btw, as you can see in the script, I tried to use the new "-R" option 
> of the normalize-by-median script, but nothing was written to the 
> .report files.
> Perhaps a bug?
>
> Cheers,
>
> Joran
>
> On 18/05/15 12:20, C. Titus Brown wrote:
>> Hi Joran,
>>
>> The script looks good to me and the results seem reasonable. Sounds 
>> to me like you have a very low error rate, and a very diverse 
>> metagenome with a few high-abundance genomes.
>>
>> In the Howe et al (2014) paper
>> (http://www.pnas.org/content/111/13/4904.abstract) we see similar 
>> retention
>> of k-mers.  I don't offhand remember the number of k-mers we saw getting
>> removed, but we had much older and higher-error data.
>>
>> And thank you for some great error reports :)
>>
>> If I may ask - what did you sequence?
>>
>> cheers,
>> --titus
>>
>> On Mon, May 18, 2015 at 09:56:06AM +0200, Joran Martijn wrote:
>>> So, I did the three-pass protocol (same parameters as kalamazoo, with
>>> -N 4 -x 32e9) on my metagenome (Trimmomatic QC-processed) and counted
>>> the k-mers at each step.
>>> I've included the bash script for running the protocol in the 
>>> attachment.
>>>
>>> Quick summary:
>>> I started out with approximately 5.7 billion 20-mers in 346 million
>>> reads in the paired dataset, and 869 million 20-mers in 23 million reads
>>> in the unpaired dataset.
>>> After the 3rd pass I have 5.3 billion 20-mers left in 163 million reads
>>> for the paired data and 549 million 20-mers left in 9.6 million reads for
>>> the unpaired data.
>>>
>>> To me it seems that, although many reads were discarded, the
>>> normalization did not have as large an effect on the 20-mer count as I
>>> expected it would, based on Table 1 of the diginorm paper.
>>> To be fair, no metagenomes were tested there, so I'm not sure whether
>>> this behaviour is "normal" or not.
>>>
>>> What are your thoughts on this?
>>>
>>> Cheers,
>>>
>>> Joran
>>>
>>> On 14/05/15 13:24, C. Titus Brown wrote:
>>>> Hi Joran,
>>>>
>>>> absolutely.  Let us know how it goes for you!
>>>>
>>>> cheers,
>>>> --titus
>>>>
>>>> On Thu, May 14, 2015 at 01:14:34PM +0200, Joran Martijn wrote:
>>>>> I just saw the release of khmer 1.4, and it includes the 
>>>>> "unique-kmers.py" script in the sandbox.
>>>>> Do you think I can use this script for my purpose (comparing the 
>>>>> number of unique k-mers for a given k before and after the 
>>>>> different steps of the 3-pass normalization)?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Joran
>>>>>
>>>>> On 12/05/15 12:46, C. Titus Brown wrote:
>>>>>> Kalamazoo uses the three-pass :). We have pretty good evidence that
>>>>>> it works ok for metagenomes - it's not what we used in Howe et al.,
>>>>>> for two reasons (we didn't have the variable-coverage error 
>>>>>> trimming yet,
>>>>>> and the data set was very low coverage) but we've been using it 
>>>>>> since.
>>>>>>
>>>>>> best,
>>>>>> --titus
>>>>>>
>>>>>> On Mon, May 11, 2015 at 01:52:06PM +0200, Joran Martijn wrote:
>>>>>>> Thanks Titus, let me know when you figure something out!
>>>>>>>
>>>>>>> I was playing around with several different coverage thresholds.
>>>>>>> I won't use the 3-pass as I understood it does not work well for
>>>>>>> metagenomes.
>>>>>>> I was thinking of following the kalamazoo pipeline.
>>>>>>>
>>>>>>> Joran
>>>>>>>
>>>>>>> On 11/05/15 13:13, C. Titus Brown wrote:
>>>>>>>> OK, that's very weird - this must be a bug, but I'll be darned 
>>>>>>>> if I can
>>>>>>>> figure out what might be causing it.  The numbers in 
>>>>>>>> load-into-counting
>>>>>>>> should be correct but I'll have to independently confirm that.
>>>>>>>>
>>>>>>>> BTW, for the first round of diginorm I'd use C=20; see 3-pass 
>>>>>>>> diginorm
>>>>>>>> in the dn paper.
>>>>>>>>
>>>>>>>> best,
>>>>>>>> --titus
>>>>>>>>
>>>>>>>> On Mon, May 11, 2015 at 01:06:24PM +0200, Joran Martijn wrote:
>>>>>>>>> Hej Titus,
>>>>>>>>>
>>>>>>>>> Thanks for the quick reply! Here are the report files, which are
>>>>>>>>> basically the STDERR and STDOUT output of the scripts.
>>>>>>>>> Quick note before the reports: I made a small mistake in my
>>>>>>>>> opening post. The coverage threshold I tried for these reports 
>>>>>>>>> was 5, not 20.
>>>>>>>>>
>>>>>>>>> Here is the report file of the first load-into-counting.py 
>>>>>>>>> execution (on the raw sequence data), test.ct.report:
>>>>>>>>>
>>>>>>>>> || This is the script 'load-into-counting.py' in khmer.
>>>>>>>>> || You are running khmer version 1.3
>>>>>>>>> || You are also using screed version 0.8
>>>>>>>>> ||
>>>>>>>>> || If you use this script in a publication, please cite EACH 
>>>>>>>>> of the
>>>>>>>>> following:
>>>>>>>>> ||
>>>>>>>>> ||   * MR Crusoe et al., 2014. 
>>>>>>>>> http://dx.doi.org/10.6084/m9.figshare.979190
>>>>>>>>> ||   * Q Zhang et al., 
>>>>>>>>> http://dx.doi.org/10.1371/journal.pone.0101271
>>>>>>>>> ||   * A. Döring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
>>>>>>>>> ||
>>>>>>>>> || Please see 
>>>>>>>>> http://khmer.readthedocs.org/en/latest/citations.html for
>>>>>>>>> details.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> PARAMETERS:
>>>>>>>>>      - kmer size =    20            (-k)
>>>>>>>>>      - n tables =     4             (-N)
>>>>>>>>>      - min tablesize = 1.6e+10      (-x)
>>>>>>>>>
>>>>>>>>> Estimated memory usage is 6.4e+10 bytes (n_tables x 
>>>>>>>>> min_tablesize)
>>>>>>>>> --------
>>>>>>>>> Saving k-mer counting table to test.ct
>>>>>>>>> Loading kmers from sequences in ['test.fastq.gz']
>>>>>>>>> making k-mer counting table
>>>>>>>>> consuming input test.fastq.gz
>>>>>>>>> Total number of unique k-mers: 3102943887
>>>>>>>>> saving test.ct
>>>>>>>>> fp rate estimated to be 0.008
>>>>>>>>> DONE.
>>>>>>>>> wrote to: test.ct.info
>>>>>>>>>
>>>>>>>>> Here is the report file of normalize-by-median.py, 
>>>>>>>>> test_k20_C5.report:
>>>>>>>>>
>>>>>>>>> || This is the script 'normalize-by-median.py' in khmer.
>>>>>>>>> || You are running khmer version 1.3
>>>>>>>>> || You are also using screed version 0.8
>>>>>>>>> ||
>>>>>>>>> || If you use this script in a publication, please cite EACH 
>>>>>>>>> of the
>>>>>>>>> following:
>>>>>>>>> ||
>>>>>>>>> ||   * MR Crusoe et al., 2014. 
>>>>>>>>> http://dx.doi.org/10.6084/m9.figshare.979190
>>>>>>>>> ||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]
>>>>>>>>> ||
>>>>>>>>> || Please see 
>>>>>>>>> http://khmer.readthedocs.org/en/latest/citations.html for
>>>>>>>>> details.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> PARAMETERS:
>>>>>>>>>      - kmer size =    20            (-k)
>>>>>>>>>      - n tables =     4             (-N)
>>>>>>>>>      - min tablesize = 1.6e+10      (-x)
>>>>>>>>>
>>>>>>>>> Estimated memory usage is 6.4e+10 bytes (n_tables x 
>>>>>>>>> min_tablesize)
>>>>>>>>> --------
>>>>>>>>> ... kept 58012 of 200000 or 29%
>>>>>>>>> ... in file test.fastq.gz
>>>>>>>>> ... kept 116210 of 400000 or 29%
>>>>>>>>> ... in file test.fastq.gz
>>>>>>>>>
>>>>>>>>> ..... etc etc etc .....
>>>>>>>>>
>>>>>>>>> ... kept 90482098 of 346200000 or 26%
>>>>>>>>> ... in file test.fastq.gz
>>>>>>>>> ... kept 90529526 of 346400000 or 26%
>>>>>>>>> ... in file test.fastq.gz
>>>>>>>>> Total number of unique k-mers: 850221
>>>>>>>>> loading k-mer counting table from test.ct
>>>>>>>>> DONE with test.fastq.gz; kept 90547512 of 346477608 or 26%
>>>>>>>>> output in test_k20_C5.fastq.gz.keep
>>>>>>>>> fp rate estimated to be 0.008
>>>>>>>>>
>>>>>>>>> And here is the second load-into-counting.py report, test2.ct.report:
>>>>>>>>>
>>>>>>>>> || This is the script 'load-into-counting.py' in khmer.
>>>>>>>>> || You are running khmer version 1.3
>>>>>>>>> || You are also using screed version 0.8
>>>>>>>>> ||
>>>>>>>>> || If you use this script in a publication, please cite EACH 
>>>>>>>>> of the
>>>>>>>>> following:
>>>>>>>>> ||
>>>>>>>>> ||   * MR Crusoe et al., 2014. 
>>>>>>>>> http://dx.doi.org/10.6084/m9.figshare.979190
>>>>>>>>> ||   * Q Zhang et al., 
>>>>>>>>> http://dx.doi.org/10.1371/journal.pone.0101271
>>>>>>>>> ||   * A. Döring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
>>>>>>>>> ||
>>>>>>>>> || Please see 
>>>>>>>>> http://khmer.readthedocs.org/en/latest/citations.html for
>>>>>>>>> details.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> PARAMETERS:
>>>>>>>>>      - kmer size =    20            (-k)
>>>>>>>>>      - n tables =     4             (-N)
>>>>>>>>>      - min tablesize = 1.6e+10      (-x)
>>>>>>>>>
>>>>>>>>> Estimated memory usage is 6.4e+10 bytes (n_tables x 
>>>>>>>>> min_tablesize)
>>>>>>>>> --------
>>>>>>>>> Saving k-mer counting table to test2.ct
>>>>>>>>> Loading kmers from sequences in ['test_k20_C5.fastq.gz.keep']
>>>>>>>>> making k-mer counting table
>>>>>>>>> consuming input test_k20_C5.fastq.gz.keep
>>>>>>>>> Total number of unique k-mers: 2822473008
>>>>>>>>> saving test2.ct
>>>>>>>>>
>>>>>>>>> Hope this helps!
>>>>>>>>>
>>>>>>>>> Joran
>>>>>>>>>
>>>>>>>>> On 11/05/15 12:12, C. Titus Brown wrote:
>>>>>>>>>> On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:
>>>>>>>>>>> Dear Khmer mailing list,
>>>>>>>>>>>
>>>>>>>>>>> I'm trying to compare the number of unique k-mers (let's say 
>>>>>>>>>>> 20-mers) in the raw dataset and the diginormed dataset, 
>>>>>>>>>>> similar to what was done in the original diginorm paper.
>>>>>>>>>> [ elided ]
>>>>>>>>>>
>>>>>>>>>> Hi Joran,
>>>>>>>>>>
>>>>>>>>>> that certainly doesn't sound good :). Would it be possible to 
>>>>>>>>>> convey the
>>>>>>>>>> various report files to us, publicly or privately?
>>>>>>>>>>
>>>>>>>>>> thanks,
>>>>>>>>>> --titus
>>>>>>>>>>
>>>>>>>>>> p.s. Thank you for the very detailed question!
>>> #!/bin/bash
>>>
>>> # rerun kalamazoo pipeline with khmer 1.4.
>>>
>>> # step 1. Digital normalization (k=20, C=20)
>>> echo "step 1 normalizing ..."
>>> normalize-by-median.py \
>>>      -p -t -q \
>>>      -k 20 -C 20 \
>>>      -N 4 -x 32e9 \
>>>      -s step1.ct \
>>>      -R step1.report \
>>>      -u unpaired.fq \
>>>      paired.fq
>>> echo "step 1 normalizing done"
>>>
>>> # count k-mers
>>> echo "counting k-mers after step 1..."
>>> unique-kmers.py -q -k 20 -e 0.005 -R paired.fq.kcount paired.fq              # prior step 1: 5,684,893,399
>>> unique-kmers.py -q -k 20 -e 0.005 -R paired.fq.keep.kcount paired.fq.keep    # post  step 1: 5,539,021,593
>>>
>>> unique-kmers.py -q -k 20 -e 0.005 -R unpaired.fq.kcount unpaired.fq             # prior step 1: 869,664,270
>>> unique-kmers.py -q -k 20 -e 0.005 -R unpaired.fq.keep.kcount unpaired.fq.keep   # post  step 1: 721,158,291
>>> echo "counting k-mers after step 1 done"
>>>
>>> # step 2. Trim abundance-1 k-mers from high-coverage reads (variable-coverage trimming)
>>> echo "doing filter-abund.py ..."
>>> filter-abund.py -T 30 -V step1.ct *.keep   # counting table saved by step 1
>>> echo "filter-abund.py done"
>>> echo "doing extract-paired-reads.py..."
>>> extract-paired-reads.py paired.fq.keep.abundfilt
>>> echo "extract-paired-reads.py done"
>>>
>>> # count k-mers
>>> echo "counting k-mers after step 2..."
>>> unique-kmers.py -q -k 20 -e 0.005 -R paired.fq.keep.abundfilt.pe.kcount paired.fq.keep.abundfilt.pe   # post step 2: 5,392,852,944
>>> unique-kmers.py -q -k 20 -e 0.005 -R paired.fq.keep.abundfilt.se.kcount paired.fq.keep.abundfilt.se   # post step 2:   123,145,801
>>> unique-kmers.py -q -k 20 -e 0.005 -R unpaired.fq.keep.abundfilt.kcount unpaired.fq.keep.abundfilt     # post step 2:   715,909,495
>>> echo "counting k-mers after step 2 done"
>>>
>>> # step 3. Digital normalization (k=20, C=5)
>>> cat paired.fq.keep.abundfilt.se unpaired.fq.keep.abundfilt > unpairedAll.fq.keep.abundfilt.se
>>> unique-kmers.py -q -k 20 -e 0.005 -R unpairedAll.fq.keep.abundfilt.se.kcount unpairedAll.fq.keep.abundfilt.se   # 807,008,522
>>>
>>> echo "step 3 normalizing ..."
>>> normalize-by-median.py \
>>>      -p -t -q \
>>>      -k 20 -C 5 \
>>>      -N 4 -x 32e9 \
>>>      -s step3.ct \
>>>      -R step3.report \
>>>      -u unpairedAll.fq.keep.abundfilt.se \
>>>      paired.fq.keep.abundfilt.pe
>>> echo "step 3 normalizing done"
>>>
>>> # count k-mers
>>> echo "counting k-mers after step 3..."
>>> unique-kmers.py -q -k 20 -e 0.005 -R paired.fq.keep.abundfilt.pe.keep.kcount paired.fq.keep.abundfilt.pe.keep           # post step 3: 5,334,399,264
>>> unique-kmers.py -q -k 20 -e 0.005 -R unpairedAll.fq.keep.abundfilt.se.keep.kcount unpairedAll.fq.keep.abundfilt.se.keep # post step 3: 549,553,382
>>> echo "counting k-mers after step 3 done"
>>
>
>
> _______________________________________________
> khmer mailing list
> khmer at lists.idyll.org
> http://lists.idyll.org/listinfo/khmer

-------------- next part --------------
A non-text attachment was scrubbed...
Name: khmerResults.pdf
Type: application/pdf
Size: 391250 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20150521/f5a6a834/attachment-0001.pdf>
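One note on the unique k-mer numbers discussed above: khmer counts canonical k-mers, treating a k-mer and its reverse complement as the same, and unique-kmers.py estimates the count probabilistically (hence its -e error-rate option) so it can handle billions of k-mers in bounded memory. For small inputs, an exact reference version is simple (a sketch, not khmer's actual code):

```python
def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return min(kmer, rc)

def count_unique_kmers(seqs, k=20):
    """Exact number of distinct canonical k-mers across all sequences."""
    seen = set()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            seen.add(canonical(seq[i:i + k]))
    return len(seen)
```

Because of canonicalization, a sequence and its reverse complement contribute exactly the same k-mer set, so strandedness does not inflate the counts.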

