[khmer] digital normalization clarification

Tue May 21 19:59:10 PDT 2013

On Mon, May 20, 2013 at 02:23:35PM -0700, Susan Miller wrote:
> I have an Illumina HiSeq2000 data set with ~140M paired end reads from a  
> bacterial genome with some insect host DNA.  I would like to use digital  
> normalization to reduce this data set in preparation for de novo  
> assembly.  I see 2 slightly different suggestions, one in the  
> khmer.readthedocs page:
> https://khmer.readthedocs.org/en/latest/guide.html#genome-assembly-including-mda-samples-and-highly-polymorphic-genomes
> and the other in the angus/diginorm-2012 tutorial:
> http://ged.msu.edu/angus/diginorm-2012/tutorial.html
>
> The khmer.readthedocs page (under  
> genome-assembly-including-mda-samples-and-highly-polymorphic-genomes)  
> suggests running normalize-by-median, and filter-abund, followed by  
> strip-and-split-for-assembly and another normalize-by-median.
>
> The angus diginorm tutorial differs in the three-pass instructions, as  
> it shows the 2nd normalize-by-median being done before  
> strip-and-split-for-assembly.
>
> Does it make a difference whether strip-and-split-for-assembly is run  
> before or after the 2nd normalize-by-median step?
>
> Not having /1 and /2 in the read names didn't seem to be a problem for  
> normalize-by-median, but strip-and-split-for-assembly is unable to  
> detect that my reads are paired.  If I need to go back to the original  
> reads and add the paired end /1 and /2 suffixes, it would be nice if the  
> "Preparing your sequences" section of khmer.readthedocs specified that.
>
> Thanks,
> Susan Miller
> Arizona Research Labs BioComputing

Hi Susan, thanks!  I'll put the /1 and /2 issue on the TODO list; there are
too many tutorials floating around out there now :).  We're actually working
on better ways of handling that whole issue, too, but I think it's sound
advice to mention this stuff up front.
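
In the meantime, adding the suffixes by hand is not hard.  Here is a minimal
Python sketch, not part of khmer: the function name add_pair_suffix is made
up for illustration, and it assumes strict 4-line FASTQ records with the
forward and reverse reads in separate files.  It also assumes that anything
after the first space in the header line can be dropped, which may not suit
every pipeline.

```python
def add_pair_suffix(fastq_lines, suffix):
    """Yield FASTQ lines with `suffix` ("/1" or "/2") appended to read names.

    Assumptions (not from khmer): strict 4-line records, and any header
    text after the first space can be discarded.
    """
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            # Header line of a 4-line record: keep only the read name.
            name = line.split()[0]
            if not name.endswith(suffix):
                name += suffix
            yield name
        else:
            # Sequence, '+' separator, and quality lines pass through as-is.
            yield line
```

To use it, open the forward-read file, pass the file object in with "/1",
and write each yielded line plus a newline to a new file; then do the same
for the reverse reads with "/2".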

I think the guide should be followed.  Briefly, digital normalization
decides whether to keep each read based on the reads it has already seen, so
the order in which reads go in determines which ones are discarded: first in
are most likely to be retained.  So the general rule is "send in your most
valuable data first", i.e., your paired ends.  Thus each step that may
orphan reads should be followed by a split, so that you can keep
prioritizing any remaining pairs.  That is why the guide runs
strip-and-split-for-assembly before the second normalize-by-median.

I ... hope that rather convoluted paragraph makes sense...
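
If it helps, here is a toy Python sketch of the ordering effect.  It is not
khmer's implementation: it counts k-mers exactly with a Counter rather than
with khmer's memory-efficient counting structure, and the function name
toy_diginorm and the tiny k and cutoff values are chosen purely for
illustration.  The rule itself is the basic diginorm one: keep a read only
if the median count of its k-mers, among the reads kept so far, is below the
cutoff.

```python
from collections import Counter
from statistics import median


def toy_diginorm(reads, k=4, cutoff=2):
    """Return the names of the reads retained, in input order.

    `reads` is an iterable of (name, sequence) tuples.  Toy parameters;
    real runs use much larger k and cutoff values.
    """
    counts = Counter()  # k-mer counts from retained reads only
    kept = []
    for name, seq in reads:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        if not kmers:
            continue  # read shorter than k: nothing to evaluate
        if median(counts[km] for km in kmers) < cutoff:
            # Coverage is still low here: keep the read and count its k-mers.
            counts.update(kmers)
            kept.append(name)
    return kept


seq = "ACGTTGCA"  # three identical reads covering the same region
print(toy_diginorm([("pair_a", seq), ("pair_b", seq), ("orphan_c", seq)]))
print(toy_diginorm([("orphan_c", seq), ("pair_b", seq), ("pair_a", seq)]))
```

With the paired reads first, the orphan is the one discarded; with the
orphan first, one of the paired reads is lost instead.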

As with other issues, we're exploring better ways of doing this all 'round.
But for now...

cheers,
--titus
-- 
C. Titus Brown, ctb at msu.edu