[khmer] suggestions for digital normalization parameters?

Tue May 28 17:18:32 PDT 2013

I've run the khmer digital normalization steps on very high-coverage 
Illumina HiSeq2000 data (140M paired-end reads for a 1.2Mb bacterial 
genome plus host insect DNA) in preparation for de novo assembly.  I 
used the pipeline suggested here: 
https://khmer.readthedocs.org/en/latest/guide.html#genome-assembly-including-mda-samples-and-highly-polymorphic-genomes

normalize-by-median C=20   (~25% of reads eliminated)
strip-and-split-for-assembly  (~50% of remaining reads eliminated)
filter-abund C=1  (not much reduction)
strip-and-split-for-assembly
normalize-by-median C=5  (~12% of remaining reads eliminated)
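
For a rough sense of how much data survives these steps, the reported 
percentages can be chained together.  This is only a minimal sketch: it 
assumes filter-abund removes nothing (only "not much reduction" was 
observed), assumes the second strip-and-split pass removes nothing (no 
figure is given for it), and takes 140M as the starting read count.  The 
function and variable names are illustrative and are not part of khmer.

```python
# Fractions of reads removed at each step, as reported in the list above.
# Assumption: steps with no reported figure are treated as removing nothing.
steps = [
    ("normalize-by-median C=20", 0.25),
    ("strip-and-split-for-assembly", 0.50),
    ("filter-abund C=1", 0.0),              # assumed: "not much reduction"
    ("strip-and-split-for-assembly", 0.0),  # assumed: no figure given
    ("normalize-by-median C=5", 0.12),
]


def reads_remaining(total_reads, steps):
    """Return (step name, reads left) after each step, in order."""
    remaining = float(total_reads)
    history = []
    for name, fraction_removed in steps:
        remaining *= 1.0 - fraction_removed
        history.append((name, remaining))
    return history


history = reads_remaining(140_000_000, steps)
for name, remaining in history:
    print(f"{name:32s} ~{remaining / 1e6:.1f}M reads left")

final_fraction = history[-1][1] / 140_000_000
print(f"overall fraction retained: ~{final_fraction:.2f}")
```

Under those assumptions, roughly 46M of the original 140M reads, about 
a third, would reach the assembler.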

Running the Ray assembler with a k-mer sweep of 21..49 results in quick 
assemblies (~30 min on 144 processors), but the maximum contig length is 
only 1081 bases.  In an earlier run on a similar sample with 
lower-coverage Illumina data, I was able to get an assembled contig of 
length 52926 without digital normalization.

Would you recommend different diginorm parameters to try to get longer 
contigs?  Or should I try partitioning instead?

Thanks for any ideas,
Susan Miller
ARL BioComputing