[khmer] inconsistent unique k-mer counting

C. Titus Brown ctbrown at ucdavis.edu
Mon May 11 04:13:26 PDT 2015


OK, that's very weird - this must be a bug, but I'll be darned if I can
figure out what might be causing it.  The numbers in load-into-counting
should be correct but I'll have to independently confirm that.
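
As a sketch of what such an independent check might look like (not part of khmer; the function names here are made up for illustration), an exact count on a small subsample can be done with a plain Python set. This assumes 4-line FASTQ records, strand-neutral counting (a k-mer and its reverse complement count once), and that k-mers containing non-ACGT characters are skipped, which may not match khmer's handling exactly:

```python
import gzip
from itertools import islice

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID = frozenset("ACGT")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def iter_fastq_sequences(path):
    """Yield the sequence line of each record (assumes unwrapped 4-line FASTQ)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # The sequence is the 2nd line of every 4-line record.
        for line in islice(handle, 1, None, 4):
            yield line.strip().upper()


def count_unique_kmers(sequences, k=20):
    """Exact number of distinct canonical k-mers; memory grows with that number."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if _VALID.issuperset(kmer):  # assumption: skip k-mers with N etc.
                seen.add(canonical(kmer))
    return len(seen)
```

Usage would be along the lines of `count_unique_kmers(iter_fastq_sequences("subsample.fastq.gz"), k=20)`, where the file name is hypothetical. Since every read in a .keep file also occurs in the input it was derived from, the exact count for the .keep file can never exceed the exact count for that input. The set holds every distinct k-mer in memory, so this is only practical on a subsample.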

BTW, for the first round of diginorm I'd use C=20; see the three-pass
diginorm protocol in the diginorm paper.

best,
--titus

On Mon, May 11, 2015 at 01:06:24PM +0200, Joran Martijn wrote:
> Hej Titus,
>
> Thanks for the quick reply! Here are the report files, which are
> basically the STDERR and STDOUT output of the scripts.
> Quick note before the reports: I made a small mistake in my
> opening post. The coverage threshold I tried for these reports was 5,
> not 20.
>
> Here is the report file from the first load-into-counting.py run (on
> the raw sequence data), test.ct.report:
>
> || This is the script 'load-into-counting.py' in khmer.
> || You are running khmer version 1.3
> || You are also using screed version 0.8
> ||
> || If you use this script in a publication, please cite EACH of the following:
> ||
> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
> ||   * Q Zhang et al., http://dx.doi.org/10.1371/journal.pone.0101271
> ||   * A. Döring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
> ||
> || Please see http://khmer.readthedocs.org/en/latest/citations.html for details.
>
>
> PARAMETERS:
>  - kmer size =    20            (-k)
>  - n tables =     4             (-N)
>  - min tablesize = 1.6e+10      (-x)
>
> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
> --------
> Saving k-mer counting table to test.ct
> Loading kmers from sequences in ['test.fastq.gz']
> making k-mer counting table
> consuming input test.fastq.gz
> Total number of unique k-mers: 3102943887
> saving test.ct
> fp rate estimated to be 0.008
> DONE.
> wrote to: test.ct.info
>
> Here is the report file from normalize-by-median.py, test_k20_C5.report:
>
> || This is the script 'normalize-by-median.py' in khmer.
> || You are running khmer version 1.3
> || You are also using screed version 0.8
> ||
> || If you use this script in a publication, please cite EACH of the following:
> ||
> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
> ||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]
> ||
> || Please see http://khmer.readthedocs.org/en/latest/citations.html for details.
>
>
> PARAMETERS:
>  - kmer size =    20            (-k)
>  - n tables =     4             (-N)
>  - min tablesize = 1.6e+10      (-x)
>
> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
> --------
> ... kept 58012 of 200000 or 29%
> ... in file test.fastq.gz
> ... kept 116210 of 400000 or 29%
> ... in file test.fastq.gz
>
> ..... etc etc etc .....
>
> ... kept 90482098 of 346200000 or 26%
> ... in file test.fastq.gz
> ... kept 90529526 of 346400000 or 26%
> ... in file test.fastq.gz
> Total number of unique k-mers: 850221
> loading k-mer counting table from test.ct
> DONE with test.fastq.gz; kept 90547512 of 346477608 or 26%
> output in test_k20_C5.fastq.gz.keep
> fp rate estimated to be 0.008
>
> And here is the second load-into-counting.py report, test2.ct.report:
>
> || This is the script 'load-into-counting.py' in khmer.
> || You are running khmer version 1.3
> || You are also using screed version 0.8
> ||
> || If you use this script in a publication, please cite EACH of the following:
> ||
> ||   * MR Crusoe et al., 2014. http://dx.doi.org/10.6084/m9.figshare.979190
> ||   * Q Zhang et al., http://dx.doi.org/10.1371/journal.pone.0101271
> ||   * A. Döring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
> ||
> || Please see http://khmer.readthedocs.org/en/latest/citations.html for details.
>
>
> PARAMETERS:
>  - kmer size =    20            (-k)
>  - n tables =     4             (-N)
>  - min tablesize = 1.6e+10      (-x)
>
> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
> --------
> Saving k-mer counting table to test2.ct
> Loading kmers from sequences in ['test_k20_C5.fastq.gz.keep']
> making k-mer counting table
> consuming input test_k20_C5.fastq.gz.keep
> Total number of unique k-mers: 2822473008
> saving test2.ct
>
> Hope this helps!
>
> Joran
>
> On 11/05/15 12:12, C. Titus Brown wrote:
>> On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:
>>> Dear Khmer mailing list,
>>>
>>> I'm trying to compare the number of unique k-mers (let's say 20-mers) in
>>> the raw dataset and the diginormed dataset, similar to what was done in the
>>> original diginorm paper.
>> [ elided ]
>>
>> Hi Joran,
>>
>> that certainly doesn't sound good :). Would it be possible to convey the
>> various report files to us, publicly or privately?
>>
>> thanks,
>> --titus
>>
>> p.s. Thank you for the very detailed question!
>
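
A quick arithmetic cross-check of the figures in the quoted reports above, as a sketch (variable names are illustrative; all values are copied from the reports):

```python
# Parameters shared by all three reports.
n_tables = 4
min_tablesize = 1.6e10
estimated_bytes = n_tables * min_tablesize  # reports print 6.4e+10 bytes

# Final line of the normalize-by-median.py report.
kept_reads = 90547512
total_reads = 346477608
kept_fraction = kept_reads / total_reads

# "Total number of unique k-mers" as printed by each script.
unique_raw = 3102943887        # load-into-counting.py on test.fastq.gz
unique_normalize = 850221      # normalize-by-median.py on test.fastq.gz
unique_keep = 2822473008       # load-into-counting.py on the .keep file

keep_vs_raw = unique_keep / unique_raw
keep_vs_normalize = unique_keep / unique_normalize

print(f"estimated memory: {estimated_bytes:.1e} bytes")
print(f"reads kept: {kept_fraction:.1%}")
print(f"unique k-mers, .keep vs raw: {keep_vs_raw:.1%}")
print(f".keep count / normalize-by-median count: {keep_vs_normalize:.0f}x")
```

The two load-into-counting.py counts are consistent with each other (the .keep count is below the raw count), while the normalize-by-median.py figure is more than three orders of magnitude smaller than either, which is the discrepancy under discussion.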

-- 
C. Titus Brown, ctbrown at ucdavis.edu


