[khmer] suggestions for digital normalization parameters?
C. Titus Brown
ctb at msu.edu
Tue May 28 17:29:31 PDT 2013
On Tue, May 28, 2013 at 05:18:32PM -0700, Susan Miller wrote:
> I've run khmer digital normalization steps on very high coverage
> Illumina HiSeq2000 data (140M paired end reads for a 1.2Mb bacterial
> genome plus host insect DNA) in preparation for de novo assembly. I
> used the pipeline suggested here:
> normalize-by-median C=20 (~25% of reads eliminated)
> strip-and-split-for-assembly (~50% of remaining reads eliminated)
> filter-abund C=1 (not much reduction)
> normalize-by-median C=5 (~12% of remaining reads eliminated)
> Running the Ray assembler with kmer sweep 21..49 results in quick
> assemblies (~30 min on 144 processors) but max contig length is only
> 1081 bases. With an earlier run on a similar sample, with lower
> coverage coming from Illumina I was able to get an assembled contig of
> length 52926 without digital normalization.
> Would you recommend different diginorm parameters to try to get longer
> contigs? Or should I try partitioning instead?
I haven't really tested Ray with digital normalization yet; it may not
work that well with the coverage heuristics used by Ray.
Two suggestions --
- try assembling after the first normalize-by-median to a C of 20; some
assemblers (like Trinity for mRNAseq) do much better at this coverage level;
- try using the -p parameter with normalize-by-median, which forcibly retains
both reads of each pair in paired-end data. I gather Ray uses paired end data
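
For intuition about what the C cutoff and paired mode do, here is a toy sketch in Python. It is not khmer's implementation: it uses exact k-mer counts in a dictionary rather than khmer's memory-efficient counting structure, it ignores reverse complements, and the function names (kmers, median_count, normalize_pairs) are made up for illustration. It also assumes that paired mode keeps a pair whenever either read is still below the cutoff.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq (forward strand only)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def median_count(seq, counts, k):
    """Median count of the read's k-mers among the reads kept so far."""
    ks = kmers(seq, k)
    if not ks:
        return 0
    return median(counts[km] for km in ks)


def normalize_pairs(pairs, k=20, cutoff=20):
    """Toy paired-mode normalization (hypothetical helper, not khmer's API).

    A pair is kept if either read's median k-mer count is below the
    cutoff; both reads are then kept together and both are counted.
    """
    counts = Counter()
    kept = []
    for r1, r2 in pairs:
        lowest = min(median_count(r1, counts, k),
                     median_count(r2, counts, k))
        if lowest < cutoff:
            kept.append((r1, r2))
            counts.update(kmers(r1, k))
            counts.update(kmers(r2, k))
    return kept


if __name__ == "__main__":
    # Ten copies of the same pair, then a pair whose second read is novel.
    pairs = [("ACGTACGGTC", "TTGACCATGA")] * 10
    pairs.append(("ACGTACGGTC", "GGGGCCCCAA"))
    for pair in normalize_pairs(pairs, k=4, cutoff=3):
        print(pair)
```

In this sketch the redundant copies are dropped once the cutoff is reached, while the last pair survives because its second read is still novel, so the mate of a high-coverage read is not orphaned.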