[khmer] inconsistent unique k-mer counting

Tue May 12 03:46:16 PDT 2015

Kalamazoo uses the three-pass :).  We have pretty good evidence that
it works ok for metagenomes - it's not what we used in Howe et al.,
for two reasons (we didn't have the variable-coverage error trimming yet,
and the data set was very low coverage) but we've been using it since.

best,
--titus

On Mon, May 11, 2015 at 01:52:06PM +0200, Joran Martijn wrote:
> Thanks Titus, let me know when you figure something out!
>
> I was playing around with several different Coverage thresholds.
> I won't use the 3-pass as I understood this does not work well for  
> metagenomes.
> I was thinking of following the kalamazoo pipeline.
>
> Joran
>
> On 11/05/15 13:13, C. Titus Brown wrote:
>> OK, that's very weird - this must be a bug, but I'll be darned if I can
>> figure out what might be causing it.  The numbers in load-into-counting
>> should be correct but I'll have to independently confirm that.
>>
>> BTW, for the first round of diginorm I'd use C=20; see 3-pass diginorm
>> in the dn paper.
>>
>> best,
>> --titus
>>
>> On Mon, May 11, 2015 at 01:06:24PM +0200, Joran Martijn wrote:
>>> Hej Titus,
>>>
>>> Thanks for the quick reply! Here are the report files, which are
>>> basically the STDERR and STDOUT output of the scripts.
>>> Quick note before the reports, I made a small mistake in my
>>> openingspost. The Coverage threshold I tried for these reports was 5,
>>> not 20.
>>>
>>> Here the report file of the first load-into-counting.py execution (on
>>> the raw sequence data), test.ct.report:
>>>
>>> || This is the script 'load-into-counting.py' in khmer.
>>> || You are running khmer version 1.3
>>> || You are also using screed version 0.8
>>> ||
>>> || If you use this script in a publication, please cite EACH of the
>>> following:
>>> ||
>>> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
>>> ||   * Q Zhang et al., http://dx.doi.org/10.1371/journal.pone.0101271
>>> ||   * A. D303266ring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
>>> ||
>>> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
>>> details.
>>>
>>>
>>> PARAMETERS:
>>>   - kmer size =    20            (-k)
>>>   - n tables =     4             (-N)
>>>   - min tablesize = 1.6e+10      (-x)
>>>
>>> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
>>> --------
>>> Saving k-mer counting table to test.ct
>>> Loading kmers from sequences in ['test.fastq.gz']
>>> making k-mer counting table
>>> consuming input test.fastq.gz
>>> Total number of unique k-mers: 3102943887
>>> saving test.ct
>>> fp rate estimated to be 0.008
>>> DONE.
>>> wrote to: test.ct.info
>>>
>>> Here the report file of the normalize-by-median.py, test_k20_C5.report
>>>
>>> || This is the script 'normalize-by-median.py' in khmer.
>>> || You are running khmer version 1.3
>>> || You are also using screed version 0.8
>>> ||
>>> || If you use this script in a publication, please cite EACH of the
>>> following:
>>> ||
>>> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
>>> ||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]
>>> ||
>>> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
>>> details.
>>>
>>>
>>> PARAMETERS:
>>>   - kmer size =    20            (-k)
>>>   - n tables =     4             (-N)
>>>   - min tablesize = 1.6e+10      (-x)
>>>
>>> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
>>> --------
>>> ... kept 58012 of 200000 or 29%
>>> ... in file test.fastq.gz
>>> ... kept 116210 of 400000 or 29%
>>> ... in file test.fastq.gz
>>>
>>> ..... etc etc etc .....
>>>
>>> ... kept 90482098 of 346200000 or 26%
>>> ... in file test.fastq.gz
>>> ... kept 90529526 of 346400000 or 26%
>>> ... in file test.fastq.gz
>>> Total number of unique k-mers: 850221
>>> loading k-mer counting table from test.ct
>>> DONE with test.fastq.gz; kept 90547512 of 346477608 or 26%
>>> output in test_k20_C5.fastq.gz.keep
>>> fp rate estimated to be 0.008
>>>
>>> And here the second load-into-counting.py report, test2.ct.report
>>>
>>> || This is the script 'load-into-counting.py' in khmer.
>>> || You are running khmer version 1.3
>>> || You are also using screed version 0.8
>>> ||
>>> || If you use this script in a publication, please cite EACH of the
>>> following:
>>> ||
>>> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
>>> ||   * Q Zhang et al., http://dx.doi.org/10.1371/journal.pone.0101271
>>> ||   * A. D303266ring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
>>> ||
>>> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
>>> details.
>>>
>>>
>>> PARAMETERS:
>>>   - kmer size =    20            (-k)
>>>   - n tables =     4             (-N)
>>>   - min tablesize = 1.6e+10      (-x)
>>>
>>> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
>>> --------
>>> Saving k-mer counting table to test2.ct
>>> Loading kmers from sequences in ['test_k20_C5.fastq.gz.keep']
>>> making k-mer counting table
>>> consuming input test_k20_C5.fastq.gz.keep
>>> Total number of unique k-mers: 2822473008
>>> saving test2.ct
>>>
>>> Hope this helps!
>>>
>>> Joran
>>>
>>> On 11/05/15 12:12, C. Titus Brown wrote:
>>>> On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:
>>>>> Dear Khmer mailing list,
>>>>>
>>>>> I'm trying to compare the number of unique k-mers (lets say 20-mers) in
>>>>> the raw dataset and diginormed dataset, similar as was done in the
>>>>> original diginorm paper.
>>>> [ elided ]
>>>>
>>>> Hi Joran,
>>>>
>>>> that certainly doesn't sound good :). Would it be possible to convey the
>>>> various report files to us, publicly or privately?
>>>>
>>>> thanks,
>>>> --titus
>>>>
>>>> p.s. Thank you for the very detailed question!
>

-- 
C. Titus Brown, ctbrown at ucdavis.edu