[khmer] inconsistent unique k-mer counting
Joran Martijn
joran.martijn at icm.uu.se
Mon May 11 02:29:31 PDT 2015
Dear Khmer mailing list,
I'm trying to compare the number of unique k-mers (let's say 20-mers) in
the raw and digitally normalized (diginormed) datasets, similar to what
was done in the original diginorm paper.
I have done this as follows. First I create the counting table:
load-into-counting.py \
-t \
-k 20 \
-N 4 -x 16e9 \
--threads 30 \
test.ct \
test.fastq.gz \
&> test.ct.report
The script then reports ~3.1 billion unique 20-mers.
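To be clear about what I mean by "unique k-mers": each distinct 20-mer is counted once, no matter how often it occurs in the reads. A minimal exact illustration with a Python set (this is not how khmer counts them — khmer uses a probabilistic counting table, so its reported totals are estimates):

```python
# Exact unique-k-mer count with a Python set -- just to illustrate the
# quantity being compared; khmer itself uses a probabilistic table.
def count_unique_kmers(sequences, k=20):
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return len(seen)

# A periodic 28 bp read yields 9 windows of length 20, but only 4
# distinct 20-mers, since each window of "ACGTACGT..." is determined
# by its start position mod 4.
print(count_unique_kmers(["ACGT" * 7], k=20))  # -> 4
```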
Then I perform the diginorm:
normalize-by-median.py \
-p -t \
-C 20 \
--loadtable test.ct \
-o test_k20_C20.fastq.gz.keep \
test.fastq.gz \
&> test_k20_C20.report
The script then reports approximately 1 million unique 20-mers (if true,
this would mean a roughly 3000-fold reduction in unique k-mers, which
sounds like too much to me).
Then, I count the number of unique 20-mers again in the normalized
dataset using load-into-counting.py:
load-into-counting.py \
-t \
-k 20 \
-N 4 -x 16e9 \
--threads 30 \
test2.ct \
test_k20_C20.fastq.gz.keep \
&> test2.ct.report
This time, however, the script reports approximately 2.8 billion unique
20-mers. I am confused, since I was expecting around 1 million, as
normalize-by-median reported.
Are the two scripts reporting different kinds of unique k-mers? Which
counts should I compare to estimate the effect of diginorm on the
dataset? Or is there an easier, more straightforward way to go about
this?
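In case it helps, this is the kind of apples-to-apples comparison I have in mind: counting unique k-mers in both FASTQ files with one and the same exact method. A sketch with a plain Python set (the function name is mine, not khmer's, and an exact set would be far too memory-hungry at billions of k-mers — this is only to pin down the comparison I want):

```python
# Sketch: exact unique-k-mer count for a (possibly gzipped) FASTQ file.
# Running this on both the raw file and the .keep file would give two
# directly comparable numbers. Only practical for small test data.
import gzip

def unique_kmers_in_fastq(path, k=20):
    seen = set()
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # the sequence line of each 4-line FASTQ record
                seq = line.strip().upper()
                for j in range(len(seq) - k + 1):
                    seen.add(seq[j:j + k])
    return len(seen)
```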
Kind regards,
Joran Martijn