<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Dear Khmer mailing list,<br>
<br>
I'm trying to compare the number of unique k-mers (let's say 20-mers)
in the raw dataset and the diginormed dataset, similar to what was done
in the original diginorm paper.<br>
<br>
I have done this as follows. First I create the counting table:<br>
<font color="#990000"><br>
load-into-counting.py \<br>
-t \<br>
-k 20 \<br>
-N 4 -x 16e9 \<br>
--threads 30 \<br>
test.ct \<br>
test.fastq.gz \<br>
&> test.ct.report</font><br>
<br>
The script then reports ~3.1 billion unique 20-mers.<br>
<br>
Then I perform the diginorm:<br>
<br>
<font color="#990000">normalize-by-median.py \<br>
-p -t \<br>
-C 20 \<br>
--loadtable test.ct \<br>
-o test_k20_C20.fastq.gz.keep \<br>
test.fastq.gz \<br>
&> test_k20_C20.report</font><br>
<br>
The script then reports approximately 1 million unique 20-mers (if
true, this would mean roughly a 3,000-fold reduction in unique k-mers,
which sounds like too much to me).<br>
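Just to spell out the arithmetic behind that fold-reduction figure (using the rounded counts from the two reports above):

```python
# Fold reduction implied by the two reported unique-20-mer counts
# (numbers rounded from the script reports above).
raw_unique = 3.1e9    # reported by load-into-counting.py on the raw reads
norm_unique = 1e6     # reported by normalize-by-median.py (approximate)

fold = raw_unique / norm_unique
print(f"~{fold:.0f}-fold reduction")  # ~3100-fold reduction
```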
<br>
Then, I count the number of unique 20-mers again in the normalized
dataset using load-into-counting.py:<br>
<font color="#990000"><br>
load-into-counting.py \<br>
-t \<br>
-k 20 \<br>
-N 4 -x 16e9 \<br>
--threads 30 \<br>
test2.ct \<br>
test_k20_C20.fastq.gz.keep \<br>
&> test2.ct.report</font><br>
<br>
This time, however, the script reports approximately 2.8 billion
unique 20-mers. I am confused, since I was expecting around 1
million, as normalize-by-median reported.<br>
Are the two scripts reporting different kinds of unique k-mers?
Which counts should I compare to get an estimate of the effect of
diginorm on the dataset? Or is there an easier, more straightforward
way to go about this?<br>
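For what it's worth, on a small subsample the unique-k-mer count can be cross-checked exactly with a throwaway script. The sketch below counts distinct canonical 20-mers (the lexicographically smaller of each k-mer and its reverse complement, which is how I understand khmer counts them); a plain Python set obviously won't scale to billions of k-mers, which is exactly why khmer uses probabilistic counting tables:

```python
# Exact unique-k-mer count for a small read set (sketch only; a plain
# Python set will not scale to billions of k-mers).
K = 20
_COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMP)[::-1]
    return min(kmer, rc)

def unique_kmers(reads, k=K):
    """Count distinct canonical k-mers, skipping windows with non-ACGT characters."""
    seen = set()
    for seq in reads:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            if set(window) <= set("ACGT"):
                seen.add(canonical(window))
    return len(seen)

# Toy example: a 24 bp read of repeated "ACGT" yields 3 distinct canonical 20-mers.
print(unique_kmers(["ACGT" * 6]))  # 3
```

Running this on the raw subsample and on its diginormed output gives two counts that are guaranteed to be computed the same way, which sidesteps the question of what exactly each khmer script is reporting.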
<br>
Kind regards,<br>
<br>
Joran Martijn<br>
</body>
</html>