[khmer] Question: -C cutoff of estimating genome size

C. Titus Brown ctb at msu.edu
Sat Nov 15 13:48:33 PST 2014


On Mon, Nov 10, 2014 at 04:37:09PM +0800, luoxiao wrote:
> Dear Colleague,
> I am unsure how to set the cutoff [-C] when using the estimate-genome-size.py program.
> First, I used plot-abundance-dist.py to plot the 17-mer spectrum, as follows (with xlim and ylim set):
> 
> 
> From the picture, I assumed that the k-mers with abundance below 50X occur at high frequency and may be derived from sequencing errors.
> So I set [-C]=50 when using the estimate-genome-size.py program, and the Estimated (meta)genome size is: 53602214 bp (our data is from a metagenome and the sequence size is about 5G).
> However, following the guidance on the khmer website, I also set [-C]=20, leaving the other parameters unchanged, when using the estimate-genome-size.py program,
> but then the Estimated (meta)genome size is: 32765613 bp. What a big difference!
> So I am confused about how to choose the cutoff [-C]. I hope you can give me some useful advice.
> Thank you very much!

Hi, sorry about letting this languish for so long --

first, it's less than a factor of two in difference (about 1.6x), so I count
that as a victory ;).  I should probably add some language pointing out that
this is just about de Bruijn graph size estimation.

The cutoff specifies at what coverage value the genome graph is "saturated";
above that, it stops collecting reads.  So the higher the cutoff, the more
reads were collected with a coverage of less than or equal to that cutoff, and
the larger the estimate.
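
A toy sketch of the collection step described above may help. It is not the
code of estimate-genome-size.py: the function names, the use of the median
k-mer count as the coverage estimate, the strict "<" comparison, and the toy
sequences are all assumptions made for illustration. It only shows that a
higher cutoff lets more reads through before the graph counts as saturated.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def collect_reads(reads, k, cutoff):
    """Keep a read only while its estimated coverage is below cutoff.

    Coverage is estimated as the median count of the read's k-mers among
    the reads kept so far (an assumption; hypothetical helper, not khmer).
    """
    counts = Counter()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k, nothing to count
        coverage = median(counts[w] for w in words)
        if coverage < cutoff:
            kept.append(read)
            counts.update(words)
    return kept


# Toy "metagenome": one abundant sequence (30 copies), one rare (5 copies).
genome_a = "ACGTTGCATGGCCATAGCTAGGATCCGATTACA"
genome_b = "TTGACCGGTAACGTTAGCCTAGGCATCGAATGC"
reads = [genome_a] * 30 + [genome_b] * 5

for cutoff in (3, 10, 20):
    kept = collect_reads(reads, k=5, cutoff=cutoff)
    print(cutoff, len(kept), sum(len(r) for r in kept))
```

The number of retained reads and bases rises with the cutoff. How the script
turns the retained reads into a size figure is not covered by this sketch.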

best,
--titus
-- 
C. Titus Brown, ctb at msu.edu
