[khmer] inconsistent unique k-mer counting

Joran Martijn joran.martijn at icm.uu.se
Mon May 11 02:29:31 PDT 2015


Dear Khmer mailing list,

I'm trying to compare the number of unique k-mers (let's say 20-mers) in 
the raw dataset and the diginormed dataset, similar to what was done in 
the original diginorm paper.

I have done this as follows. First I create the counting table:

load-into-counting.py \
     -t \
     -k 20 \
     -N 4 -x 16e9 \
     --threads 30 \
     test.ct \
     test.fastq.gz \
     &> test.ct.report

The script then reports ~3.1 billion unique 20-mers.
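
As a sanity check on what "unique k-mers" means here, I also put together 
a small brute-force script that counts distinct 20-mers exactly on a 
subsample of the reads. This is only a rough sketch of my own: it assumes 
standard 4-line FASTQ, and it treats a k-mer and its reverse complement 
as distinct (which I think khmer does not), so I don't expect the numbers 
to match exactly:

    import gzip
    import sys

    def count_unique_kmers(fastq_gz, k=20, max_reads=100000):
        """Exact distinct k-mer count over the first max_reads reads."""
        kmers = set()
        with gzip.open(fastq_gz, 'rt') as fh:
            for i, line in enumerate(fh):
                if i // 4 >= max_reads:
                    break
                if i % 4 != 1:   # the sequence is the 2nd line of each record
                    continue
                seq = line.strip().upper()
                for j in range(len(seq) - k + 1):
                    kmer = seq[j:j + k]
                    if 'N' not in kmer:
                        kmers.add(kmer)
        return len(kmers)

    if __name__ == '__main__':
        print(count_unique_kmers(sys.argv[1]))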

Then I perform the diginorm:

normalize-by-median.py \
     -p -t \
     -C 20 \
     --loadtable test.ct \
     -o test_k20_C20.fastq.gz.keep \
     test.fastq.gz \
     &> test_k20_C20.report

The script then reports approximately 1 million unique 20-mers (if true, 
this would mean a ~3000-fold reduction in unique k-mers, which sounds 
like too much to me).
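
To explain why that factor surprises me: my (possibly naive) mental model 
of the core digital normalization algorithm is roughly the sketch below, 
where a read is kept only if the median count of its k-mers seen so far is 
below C, and only kept reads add to the counts. (This is simplified; I 
understand the real script uses an approximate counting table and also 
handles read pairing, etc.)

    from collections import defaultdict

    def diginorm(reads, k=20, cutoff=20):
        """Keep a read only if the median count of its k-mers is < cutoff."""
        counts = defaultdict(int)
        for seq in reads:
            kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
            if not kmers:
                continue
            median = sorted(counts[km] for km in kmers)[len(kmers) // 2]
            if median < cutoff:
                for km in kmers:
                    counts[km] += 1
                yield seq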

Then I count the number of unique 20-mers in the normalized dataset, 
again using load-into-counting.py:

load-into-counting.py \
     -t \
     -k 20 \
     -N 4 -x 16e9 \
     --threads 30 \
     test2.ct \
     test_k20_C20.fastq.gz.keep \
     &> test2.ct.report

This time, however, the script reports approximately 2.8 billion unique 
20-mers. I am confused, since I was expecting around 1 million, as 
normalize-by-median reported.
Are the two scripts reporting a different kind of unique k-mer count? 
Which counts should I compare to get an estimate of the effect of 
diginorm on the dataset? Or is there an easier, more straightforward way 
to go about this?
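
(For completeness, the comparison I was ultimately after is just something 
like the following, using the count_unique_kmers() sketch from above on 
subsamples of both files, since exact counting of the full files would not 
fit in memory. I realize subsampling makes this only a rough consistency 
check, not a true measurement of the reduction.)

    raw  = count_unique_kmers('test.fastq.gz')
    kept = count_unique_kmers('test_k20_C20.fastq.gz.keep')
    print('raw: %d  kept: %d  reduction: %.1fx' % (raw, kept, float(raw) / kept))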

Kind regards,

Joran Martijn