[khmer] Using khmer for producing k-mer frequency distribution

Tue Aug 27 13:41:53 PDT 2013

Hi Rajat,

sorry for the long delay in responding!

On Thu, Jul 18, 2013 at 03:32:39PM -0400, Rajat Shuvro Roy wrote:
> Hello Prof Brown,
> I was attempting to produce a k-mer frequency distribution using khmer and
> followed the instructions in (
> http://khmer.readthedocs.org/en/latest/scripts.html). I have a Zea mays
> library (SRR404240, 95.8 Gbp) and I executed the following command.
> 
> python load-into-counting.py -k 31 -x 5e10 out.kh SRR404240.fasta
> 
> I believe this counts k-mer frequencies and the script abundance-dist.py
> produces the distribution.
> 
> We stopped it after it had run for 2464 mins (41 hrs) using 187 GB of space. I
> tried smaller values for -x, but the computation failed to complete in
> less than 3 days. Could you please let us know whether this is expected and
> whether we should allow more time? And is there a more efficient way of
> using khmer?

Your e-mail actually triggered some doc changes and updates ;).

Briefly, khmer can count k-mers in either constant-memory mode or
accurate-large-counts mode.  In the former, counts saturate at 255
(k-mers seen more often than that are not counted further), but the memory
specified with the -N and -x parameters is the total amount used; in the
latter mode (which is the default), counts above 255 are kept, and memory
use can grow without bound.
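
To make the difference between the two modes concrete, here is a minimal
Python sketch. It is not khmer's actual code: the function names, the
single table, and the crc32-based hash are all assumptions made for
illustration. It only shows why one mode has a fixed memory footprint and
the other does not.

```python
from zlib import crc32

MAX_COUNT = 255  # largest value a one-byte counter can hold


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def slot_for(kmer, table_size):
    """Map a k-mer to a table slot (toy hash; collisions inflate counts)."""
    return crc32(kmer.encode("ascii")) % table_size


def count_constant_memory(seqs, k, table_size):
    """Fixed memory: one byte per slot, counts stop at MAX_COUNT."""
    table = bytearray(table_size)
    for seq in seqs:
        for kmer in kmers(seq, k):
            slot = slot_for(kmer, table_size)
            if table[slot] < MAX_COUNT:
                table[slot] += 1
    return table


def count_large_counts(seqs, k, table_size):
    """Same table, plus a side dictionary for counts above MAX_COUNT.

    The dictionary gains an entry for every slot that passes 255, so
    total memory is no longer bounded by table_size.
    """
    table = bytearray(table_size)
    overflow = {}
    for seq in seqs:
        for kmer in kmers(seq, k):
            slot = slot_for(kmer, table_size)
            if table[slot] < MAX_COUNT:
                table[slot] += 1
            else:
                overflow[slot] = overflow.get(slot, MAX_COUNT) + 1
    return table, overflow
```

In this sketch, a k-mer seen 297 times ends up as 255 in the first
function and as 297 (in the overflow dictionary) in the second.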

Both modes are easy to use in the latest khmer, on the bleeding-edge
branch, which you can get like so:

	git clone https://github.com/ged-lab/khmer.git -b bleeding-edge

Then use 'load-into-counting.py -b' to build the tables in constant-memory
mode, and 'abundance-dist.py' to generate the output.
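
If it helps to see what the final output represents, here is a small
sketch of a k-mer abundance distribution computed exactly with a Python
Counter. This is only an illustration for tiny inputs, not what
abundance-dist.py does internally; the function name is hypothetical, and
khmer works from the counting tables rather than an exact dictionary.

```python
from collections import Counter


def abundance_distribution(seqs, k):
    """Return sorted (abundance, number of distinct k-mers) pairs."""
    kmer_counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer_counts[seq[i:i + k]] += 1
    # Histogram of the counts: how many distinct k-mers occur n times.
    dist = Counter(kmer_counts.values())
    return sorted(dist.items())
```

For example, abundance_distribution(["ACGTACGT"], 4) returns
[(1, 3), (2, 1)]: three 4-mers occur once and one (ACGT) occurs twice.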

I'd suggest running it on a small test data set (data/25k.fq.gz, in the
khmer repo) just to make sure it all works for you, but it should - we use
this regularly.

Please let me know if you have any questions, and again, apologies for
the delay!

cheers,
--titus
-- 
C. Titus Brown, ctb at msu.edu