[khmer] digital normalization clarification

Susan Miller sjmiller at email.arizona.edu
Mon May 20 14:23:35 PDT 2013


I have an Illumina HiSeq2000 data set with ~140M paired-end reads from a 
bacterial genome with some insect host DNA.  I would like to use digital 
normalization to reduce this data set in preparation for de novo 
assembly.  I see two slightly different suggestions, one in the 
khmer.readthedocs page:
https://khmer.readthedocs.org/en/latest/guide.html#genome-assembly-including-mda-samples-and-highly-polymorphic-genomes
and the other in the angus/diginorm-2012 tutorial:
http://ged.msu.edu/angus/diginorm-2012/tutorial.html

The khmer.readthedocs page (under 
genome-assembly-including-mda-samples-and-highly-polymorphic-genomes) 
suggests running normalize-by-median and filter-abund, followed by 
strip-and-split-for-assembly and another normalize-by-median.

The angus diginorm tutorial differs in the three-pass instructions, as 
it shows the 2nd normalize-by-median being done before 
strip-and-split-for-assembly.

Does it make a difference whether strip-and-split-for-assembly is run 
before or after the 2nd normalize-by-median step?

Not having /1 and /2 in the read names didn't seem to be a problem for 
normalize-by-median, but strip-and-split-for-assembly is unable to 
detect that my reads are paired.  If I need to go back to the original 
reads and add the paired end /1 and /2 suffixes, it would be nice if the 
"Preparing your sequences" section of khmer.readthedocs specified that.
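
If the suffixes do turn out to be needed, something like the following 
Python sketch might be enough to add them. It is only an illustration, 
not part of khmer: the function name add_pair_suffix is made up here, 
and it assumes each mate is in its own FASTQ file, that every record is 
exactly four lines (no wrapped sequences), and that anything after the 
first space in the header can be dropped.

```python
def add_pair_suffix(fastq_lines, mate):
    """Yield FASTQ lines with '/<mate>' appended to each read name.

    Hypothetical helper, not a khmer function. Assumes strict
    four-line FASTQ records and a single mate (1 or 2) per input.
    """
    if mate not in (1, 2):
        raise ValueError("mate must be 1 or 2")
    suffix = "/%d" % mate
    for index, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if index % 4 == 0:
            # Header line: must start with '@'.
            if not line.startswith("@"):
                raise ValueError("expected '@' at line %d" % (index + 1))
            # Keep only the read name (text before the first whitespace);
            # any trailing comment field is discarded (an assumption).
            name = line.split()[0]
            # Do not add the suffix twice if it is already present.
            if not name.endswith(suffix):
                name += suffix
            yield name
        else:
            # Sequence, '+' separator, and quality lines pass through.
            yield line
```

It would be applied once per file, with mate=1 for the forward reads and 
mate=2 for the reverse reads, writing each yielded line back out followed 
by a newline.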

Thanks,
Susan Miller
Arizona Research Labs BioComputing






