[khmer] suggestions for digital normalization parameters?

C. Titus Brown ctb at msu.edu
Tue May 28 17:29:31 PDT 2013


On Tue, May 28, 2013 at 05:18:32PM -0700, Susan Miller wrote:
> I've run khmer digital normalization steps on very high coverage  
> Illumina HiSeq2000 data (140M paired end reads for a 1.2Mb bacterial  
> genome plus host insect DNA) in preparation for de novo assembly.  I  
> used the pipeline suggested here:  
> https://khmer.readthedocs.org/en/latest/guide.html#genome-assembly-including-mda-samples-and-highly-polymorphic-genomes
>
> normalize-by-median C=20   (~25% of reads eliminated)
> strip-and-split-for-assembly  (~50% of remaining reads eliminated)
> filter-abund C=1  (not much reduction)
> strip-and-split-for-assembly
> normalize-by-median C=5  (~12% of remaining reads eliminated)
>
> Running the Ray assembler with kmer sweep 21..49 results in quick  
> assemblies (~30 min on 144 processors) but max contig length is only  
> 1081 bases.  With an earlier run on a similar sample, with lower  
> coverage coming from Illumina I was able to get an assembled contig of  
> length 52926 without digital normalization.
>
> Would you recommend different diginorm parameters to try to get longer  
> contigs?  Or should I try partitioning instead?

Hi Susan,

I haven't really tested Ray with digital normalization yet; normalized
data may not work that well with the coverage heuristics Ray uses.

Two suggestions --

 - try assembling right after the first normalize-by-median step (C=20),
   before the filtering and C=5 steps; some assemblers (like Trinity for
   mRNAseq) do much better at this coverage level;

 - try using the -p parameter with normalize-by-median, which forcibly retains
   both reads of any pair in paired-end data.  I gather Ray uses paired-end
   data well.
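
To make the two suggestions concrete, here is a rough Python sketch of what
median-based normalization with pair retention does.  It is only an
illustration, not khmer's code: the function names are made up, it uses an
exact in-memory counter where khmer uses a probabilistic counting table, it
ignores reverse complements, and the "keep the pair if either read is below
the cutoff" rule is my assumption about how paired mode behaves.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def median_count(seq, counts, k):
    """Median abundance of this read's k-mers among the reads kept so far."""
    ks = kmers(seq, k)
    if not ks:
        return 0
    return median(counts[km] for km in ks)


def normalize_pairs(pairs, k=20, cutoff=20):
    """Hypothetical paired normalization: keep both reads of a pair
    whenever either read's median k-mer count is below the cutoff."""
    counts = Counter()
    kept = []
    for r1, r2 in pairs:
        keep = False
        for seq in (r1, r2):
            if median_count(seq, counts, k) < cutoff:
                # Only reads that look under-covered add to the counts.
                counts.update(kmers(seq, k))
                keep = True
        if keep:
            kept.append((r1, r2))
    return kept
```

With cutoff=20 this mimics stopping after the first normalization pass; once
a region has been seen about 20 times, further pairs from it are dropped,
but a pair is never split.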

cheers,
--titus



