[khmer] inconsistent unique k-mer counting

Thu May 14 04:24:19 PDT 2015

Hi Joran,

absolutely.  Let us know how it goes for you!

cheers,
--titus

On Thu, May 14, 2015 at 01:14:34PM +0200, Joran Martijn wrote:
> I just saw the release of khmer 1.4, and it includes in the sandbox the
> "unique-kmer.py" script.
> Do you think I can use this script for my purpose (comparing the number
> of unique k-mers for a certain k before and after different steps of the
> 3-pass normalization)?
>
> Cheers,
>
> Joran
>
> On 12/05/15 12:46, C. Titus Brown wrote:
>> Kalamazoo uses the three-pass :).  We have pretty good evidence that
>> it works ok for metagenomes - it's not what we used in Howe et al.,
>> for two reasons (we didn't have the variable-coverage error trimming yet,
>> and the data set was very low coverage) but we've been using it since.
>>
>> best,
>> --titus
>>
>> On Mon, May 11, 2015 at 01:52:06PM +0200, Joran Martijn wrote:
>>> Thanks Titus, let me know when you figure something out!
>>>
>>> I was playing around with several different Coverage thresholds.
>>> I won't use the 3-pass as I understood this does not work well for
>>> metagenomes.
>>> I was thinking of following the kalamazoo pipeline.
>>>
>>> Joran
>>>
>>> On 11/05/15 13:13, C. Titus Brown wrote:
>>>> OK, that's very weird - this must be a bug, but I'll be darned if I can
>>>> figure out what might be causing it.  The numbers in load-into-counting
>>>> should be correct but I'll have to independently confirm that.
>>>>
>>>> BTW, for the first round of diginorm I'd use C=20; see 3-pass diginorm
>>>> in the dn paper.
>>>>
>>>> best,
>>>> --titus
>>>>
>>>> On Mon, May 11, 2015 at 01:06:24PM +0200, Joran Martijn wrote:
>>>>> Hej Titus,
>>>>>
>>>>> Thanks for the quick reply! Here are the report files, which are
>>>>> basically the STDERR and STDOUT output of the scripts.
>>>>> Quick note before the reports: I made a small mistake in my opening
>>>>> post. The coverage threshold I tried for these reports was 5, not 20.
>>>>>
>>>>> Here is the report file of the first load-into-counting.py execution (on
>>>>> the raw sequence data), test.ct.report:
>>>>>
>>>>> || This is the script 'load-into-counting.py' in khmer.
>>>>> || You are running khmer version 1.3
>>>>> || You are also using screed version 0.8
>>>>> ||
>>>>> || If you use this script in a publication, please cite EACH of the
>>>>> following:
>>>>> ||
>>>>> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
>>>>> ||   * Q Zhang et al., http://dx.doi.org/10.1371/journal.pone.0101271
>>>>> ||   * A. Döring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
>>>>> ||
>>>>> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
>>>>> details.
>>>>>
>>>>>
>>>>> PARAMETERS:
>>>>>    - kmer size =    20            (-k)
>>>>>    - n tables =     4             (-N)
>>>>>    - min tablesize = 1.6e+10      (-x)
>>>>>
>>>>> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
>>>>> --------
>>>>> Saving k-mer counting table to test.ct
>>>>> Loading kmers from sequences in ['test.fastq.gz']
>>>>> making k-mer counting table
>>>>> consuming input test.fastq.gz
>>>>> Total number of unique k-mers: 3102943887
>>>>> saving test.ct
>>>>> fp rate estimated to be 0.008
>>>>> DONE.
>>>>> wrote to: test.ct.info
>>>>>
>>>>> Here is the report file of normalize-by-median.py, test_k20_C5.report:
>>>>>
>>>>> || This is the script 'normalize-by-median.py' in khmer.
>>>>> || You are running khmer version 1.3
>>>>> || You are also using screed version 0.8
>>>>> ||
>>>>> || If you use this script in a publication, please cite EACH of the
>>>>> following:
>>>>> ||
>>>>> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
>>>>> ||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]
>>>>> ||
>>>>> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
>>>>> details.
>>>>>
>>>>>
>>>>> PARAMETERS:
>>>>>    - kmer size =    20            (-k)
>>>>>    - n tables =     4             (-N)
>>>>>    - min tablesize = 1.6e+10      (-x)
>>>>>
>>>>> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
>>>>> --------
>>>>> ... kept 58012 of 200000 or 29%
>>>>> ... in file test.fastq.gz
>>>>> ... kept 116210 of 400000 or 29%
>>>>> ... in file test.fastq.gz
>>>>>
>>>>> ..... etc etc etc .....
>>>>>
>>>>> ... kept 90482098 of 346200000 or 26%
>>>>> ... in file test.fastq.gz
>>>>> ... kept 90529526 of 346400000 or 26%
>>>>> ... in file test.fastq.gz
>>>>> Total number of unique k-mers: 850221
>>>>> loading k-mer counting table from test.ct
>>>>> DONE with test.fastq.gz; kept 90547512 of 346477608 or 26%
>>>>> output in test_k20_C5.fastq.gz.keep
>>>>> fp rate estimated to be 0.008
>>>>>
>>>>> And here is the second load-into-counting.py report, test2.ct.report:
>>>>>
>>>>> || This is the script 'load-into-counting.py' in khmer.
>>>>> || You are running khmer version 1.3
>>>>> || You are also using screed version 0.8
>>>>> ||
>>>>> || If you use this script in a publication, please cite EACH of the
>>>>> following:
>>>>> ||
>>>>> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
>>>>> ||   * Q Zhang et al., http://dx.doi.org/10.1371/journal.pone.0101271
>>>>> ||   * A. Döring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
>>>>> ||
>>>>> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
>>>>> details.
>>>>>
>>>>>
>>>>> PARAMETERS:
>>>>>    - kmer size =    20            (-k)
>>>>>    - n tables =     4             (-N)
>>>>>    - min tablesize = 1.6e+10      (-x)
>>>>>
>>>>> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
>>>>> --------
>>>>> Saving k-mer counting table to test2.ct
>>>>> Loading kmers from sequences in ['test_k20_C5.fastq.gz.keep']
>>>>> making k-mer counting table
>>>>> consuming input test_k20_C5.fastq.gz.keep
>>>>> Total number of unique k-mers: 2822473008
>>>>> saving test2.ct
>>>>>
>>>>> Hope this helps!
>>>>>
>>>>> Joran
>>>>>
>>>>> On 11/05/15 12:12, C. Titus Brown wrote:
>>>>>> On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:
>>>>>>> Dear Khmer mailing list,
>>>>>>>
>>>>>>> I'm trying to compare the number of unique k-mers (let's say 20-mers) in
>>>>>>> the raw dataset and the diginormed dataset, similar to what was done in
>>>>>>> the original diginorm paper.
>>>>>> [ elided ]
>>>>>>
>>>>>> Hi Joran,
>>>>>>
>>>>>> that certainly doesn't sound good :). Would it be possible to convey the
>>>>>> various report files to us, publicly or privately?
>>>>>>
>>>>>> thanks,
>>>>>> --titus
>>>>>>
>>>>>> p.s. Thank you for the very detailed question!
>
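
For an independent check of the "Total number of unique k-mers" figures
discussed in this thread, one option is to count k-mers exactly on a small
subsample of reads and compare that with what the scripts report for the
same subsample. The following is only a minimal sketch and not khmer's
implementation: the function names are hypothetical, it assumes 4-line
FASTQ records, it assumes a k-mer and its reverse complement should be
counted as one, and it skips k-mers containing anything other than A, C, G
or T (khmer may treat such reads differently). It keeps every k-mer in
memory, so it is only practical for a few hundred thousand reads.

```python
import gzip
from itertools import islice

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_DNA = frozenset("ACGT")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def unique_kmers(sequences, k=20):
    """Exact count of distinct canonical k-mers (assumption: strands are merged)."""
    seen = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption of this sketch: skip k-mers with N or other symbols.
            if _DNA.issuperset(kmer):
                seen.add(canonical(kmer))
    return len(seen)


def fastq_sequences(path, max_reads=None):
    """Yield sequence lines from a FASTQ file (plain or gzipped, 4 lines per record)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        seqs = islice(handle, 1, None, 4)  # 2nd line of every record
        if max_reads is not None:
            seqs = islice(seqs, max_reads)
        for line in seqs:
            yield line.strip()
```

A call such as unique_kmers(fastq_sequences("test.fastq.gz", max_reads=200000), k=20)
gives an exact number for the first 200000 reads. Because the counting
tables are probabilistic, agreement with the reported numbers on the same
subsample is expected to be approximate rather than exact.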

-- 
C. Titus Brown, ctbrown at ucdavis.edu