[khmer] Query for digital normalization software package
C. Titus Brown
ctbrown at ucdavis.edu
Thu Aug 20 04:47:50 PDT 2015
On Wed, Aug 19, 2015 at 08:00:10PM -0500, Arghya Kusum Das wrote:
> Dear Dr. Brown,
>
> Recently I went through the digital normalization approach and the
> corresponding paper (available on arXiv) authored by you. I am extremely
> interested in applying this method to our whole genome data set. Most of the
> time, I use ABySS for the assembly. I want to
> preprocess (trim/correct) the reads with your digital normalization
> software package.
>
> I have the following queries about the software:
> 1) Is the implementation fully parallel at this point? I would appreciate it
> if you could share the link to the latest release of the software package.
> 2) In particular, the size of the entire data set that we are trying to
> analyze is more than 400GB. How should I run the software? Should I run it
> on a single big memory machine (e.g. 256GB RAM) or can I run it on multiple
> small memory machines (e.g. 32GB RAM)?
>
> --
> Thanks and regards,
> Arghya Kusum Das
Hi Arghya,
that sounds like a big genome - good luck!
* only a few of the scripts in khmer are parallelized. If you're concerned
about speed, I've heard good things about bbnorm, which has some of the
features of khmer, and is apparently much faster.
* the latest released version of khmer is 1.4.1, and we're about to
release 2.0. Both will be available at
https://github.com/dib-lab/khmer/releases
* if you are doing digital normalization only (with no k-mer abundance
trimming) you can run that on as many different machines as you want,
but be sure to run one combined diginorm at the end.
* you might be able to get as good or better results by doing k-mer error
trimming instead - see https://peerj.com/preprints/890/ for our
implementation of this (which is in both khmer 1.4.1 and 2.0, as the
script trim-low-abund.py). This should give you similar memory
savings without downsampling the data, which some assemblers don't like.
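
To make the "split across machines, then one combined diginorm" point concrete, here is a toy pure-Python sketch of the idea. It is not khmer's implementation: it uses an exact in-memory counter instead of a probabilistic one, and the function names, the example reads, and the small k and cutoff are all made up for illustration.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diginorm(reads, k, cutoff):
    """Toy digital normalization: keep a read only while the median
    count of its k-mers (among reads kept so far) is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read is shorter than k, nothing to count
        if median(counts[x] for x in ks) < cutoff:
            counts.update(ks)  # only kept reads contribute to the counts
            kept.append(read)
    return kept


# Hypothetical data: two distinct "genomic regions", each covered 10 times.
reads = ["ACGTTGCAAG"] * 10 + ["TTGACCAGTA"] * 10

# Pretend each half of the data lives on a different machine.
part_a = diginorm(reads[0::2], k=4, cutoff=3)
part_b = diginorm(reads[1::2], k=4, cutoff=3)

# Each machine only saw its own reads, so the merged output is still
# redundant; one combined pass brings it down to the target coverage.
merged = part_a + part_b
final = diginorm(merged, k=4, cutoff=3)

print(len(reads), len(merged), len(final))
```

The per-machine passes shrink the data enough that the final combined pass is cheap, but only the combined pass sees all k-mers at once, which is why it should not be skipped.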
We have some guidance on parameters scattered all over the place, but
we find that normalizing to a coverage of 20 with a k-mer size of 20 or
21, and then doing k-mer abundance trimming with a cutoff of 1 or 2,
is good. Basically, either

   % normalize-by-median.py -k 21 -C 20 ...
   % filter-abund.py -C 3 ...

or

   % trim-low-abund.py -k 21 -Z 20 -C 3 ...

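
As a rough picture of what k-mer abundance trimming does to a read, here is a simplified sketch, not the code behind filter-abund.py or trim-low-abund.py. It assumes the cutoff means "trim at the first k-mer seen fewer than cutoff times", uses exact counts, and ignores the coverage threshold that -Z adds; the function names and example reads are invented.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping k-mer across all reads (exact counts)."""
    counts = Counter()
    for read in reads:
        counts.update(read[i:i + k] for i in range(len(read) - k + 1))
    return counts


def trim_low_abundance(read, counts, k, cutoff):
    """Truncate the read so it no longer contains the first k-mer
    whose count is below cutoff; return it unchanged if none is."""
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] < cutoff:
            # Drop the last base of the offending k-mer and everything after.
            return read[:i + k - 1]
    return read


# Hypothetical data: five correct reads and one with an error in its last base.
good = "ACGTTGCAAG"
bad = "ACGTTGCAAT"
reads = [good] * 5 + [bad]

counts = count_kmers(reads, k=4)
trimmed = [trim_low_abundance(r, counts, k=4, cutoff=3) for r in reads]

# Reads trimmed to fewer than k bases carry no k-mers, so drop them.
trimmed = [r for r in trimmed if len(r) >= 4]
print(trimmed[-1])
```

Unlike normalization, this keeps every read that still has at least one k-mer after trimming, so coverage is preserved while the rare, likely erroneous k-mers are removed.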
I hope that helps! I'm happy to provide more specific suggestions if you
like.
best,
--titus
--
C. Titus Brown, ctbrown at ucdavis.edu