[khmer] Query for digital normalization software package

Thu Aug 20 04:47:50 PDT 2015

On Wed, Aug 19, 2015 at 08:00:10PM -0500, Arghya Kusum Das wrote:
> Dear Dr. Brown,
> 
> Recently I went through the digital normalization approach and the
> corresponding paper (available in arxiv) authored by you. I am extremely
> interested to apply this method for our whole genome data set. Most of the
> time, I use ABySS for the assembly. I want to
> preprocess (trim/correct)  the reads with your digital normalization
> software package.
> 
> I have the following queries about the software:
> 1) Is the implementation fully parallel at this point? I will appreciate if
> you share the link for the latest release of the software package.
> 2) In particular, the size of the entire data set that we are trying to
> analyze is more than 400GB. How should I run the software? Should I run it
> on a single big memory machine (e.g. 256GB RAM) or can I run it on multiple
> small memory machines (e.g. 32GB RAM)?
> 
> --
> Thanks and regards,
> Arghya Kusum Das

Hi Arghya,

that sounds like a big genome - good luck!

* only a few of the scripts in khmer are parallelized.  If you're concerned
  about speed, I've heard good things about bbnorm, which has some of the
  features of khmer, and is apparently much faster.

* the latest released version of khmer is 1.4.1, and we're about to
  release 2.0.  Both will be available at

     https://github.com/dib-lab/khmer/releases

* if you are doing digital normalization only (with no k-mer abundance
  trimming) you can run that on as many different machines as you want,
  but be sure to run one combined diginorm at the end.

* you might be able to get as good or better results by doing k-mer error
  trimming instead - see https://peerj.com/preprints/890/ for our
  implementation of this (which is in both khmer 1.4.1 and 2.0, as the
  script trim-low-abund.py).  This should give you similar memory
  savings without downsampling the data, which some assemblers don't like.
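
To make the "normalize on many machines, then run one combined diginorm"
advice concrete, here is a toy sketch of the idea in Python. It is not
khmer's implementation: it uses an exact in-memory Counter instead of
khmer's fixed-memory probabilistic counting, it ignores reverse
complements, and the function names (kmers, normalize, normalize_in_parts)
and the tiny k and cutoff in the usage note are made up for illustration.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=21, cutoff=20):
    """Toy digital normalization.

    Keep a read only if the median abundance of its k-mers, among the
    reads kept so far, is below `cutoff`. Exact counting is an assumption
    made for clarity; it would not scale to a 400GB data set.
    """
    counts = Counter()
    kept = []
    for read in reads:
        km = kmers(read, k)
        if not km:
            continue  # read is shorter than k
        if median(counts[x] for x in km) < cutoff:
            kept.append(read)
            counts.update(km)  # only kept reads contribute to the counts
    return kept


def normalize_in_parts(parts, k=21, cutoff=20):
    """Normalize each part independently, then run one combined pass.

    Each part stands in for the subset of reads handled on one machine.
    """
    survivors = []
    for part in parts:
        survivors.extend(normalize(part, k, cutoff))
    # Each part was normalized in isolation, so the pooled survivors can
    # still exceed the target coverage. The final pass removes that excess.
    return normalize(survivors, k, cutoff)
```

For example, with read = "ACGTTGCAAG", k=4 and cutoff=5, normalizing two
parts of 25 identical copies each leaves 5 copies per part (10 in total),
and the combined pass brings that back down to 5.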

We have some guidance on parameters scattered all over the place, but
we find that normalizing to a coverage of 20 with a k-mer size of 20 or
21, and then doing k-mer abundance trimming with a cutoff of 1 or 2,
is good.  Basically, either

% normalize-by-median.py -k 21 -C 20 ...
% filter-abund.py -C 3 ...

or

% trim-low-abund.py -k 21 -Z 20 -C 3 ...
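
As a rough picture of what the abundance trimming step does to a read,
here is a toy sketch in Python. It is not khmer's code: it counts k-mers
exactly in a first pass over all reads (khmer uses fixed-memory
probabilistic counting, and trim-low-abund.py works in a mostly streaming
fashion), it ignores reverse complements, and the function names are made
up. It assumes the cutoff means "trim at the first k-mer whose abundance
is below C", and that reads whose median k-mer abundance is below Z are
left alone because their rare k-mers cannot be told apart from errors.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def trim_low_abund(reads, k=21, cutoff=3, trim_at_coverage=20):
    """Toy k-mer abundance trimming (two exact passes over the reads)."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    out = []
    for read in reads:
        km = kmers(read, k)
        if not km:
            continue  # read is shorter than k
        if median(counts[x] for x in km) < trim_at_coverage:
            out.append(read)  # low coverage: keep the read untouched
            continue
        trimmed = read
        for i, x in enumerate(km):
            if counts[x] < cutoff:
                # Keep every base covered by the k-mers before position i.
                trimmed = read[:i + k - 1]
                break
        if len(trimmed) >= k:
            out.append(trimmed)  # drop reads trimmed to less than one k-mer
    return out
```

For example, with k=4, cutoff=3 and trim_at_coverage=5, ten copies of
"ACGTTGCAAGCT" plus one copy ending in an erroneous base, "ACGTTGCAAGCA",
come back as the ten original reads plus the erroneous read trimmed to
"ACGTTGCAAGC".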

I hope that helps! I'm happy to provide more specific suggestions if you
like.

best,
--titus
-- 
C. Titus Brown, ctbrown at ucdavis.edu