[khmer] speed up partitioning?

Wed Feb 12 11:54:49 PST 2014

Hello, thanks for making this software available, it's a big help.
I'm hoping you will be able to offer advice on how I might choose different settings to accelerate do-partition.py. It's been running for a month now, and that is just too slow for my needs.

What I have:
Environmental sample of 68 million paired, interleaved Illumina reads, each about 150 nt long. Expected diversity is high.
The computer I'm using has 48 cores (AMD 6176, 2.3 GHz) and 256 GB of RAM.

The problem:
The job has been running for about a month now, using 90% of memory and 4600% CPU, so it appears to be using the available resources. If I read the output (pasted below) correctly, I can expect 6566 subset tasks. After a month, it is working on number 2178, so I'm only about 1/3 of the way through.
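
For reference, here is the rough arithmetic behind that estimate as a small Python sketch. It assumes every subset takes about the same time, which is only my assumption and may well not hold; the variable names are mine and not part of khmer.

```python
# Rough run-time projection from the numbers in this message.
# Assumption: each subset task takes about the same time.
total_subsets = 6566    # from "enqueued 6566 subset tasks" in the log
current_subset = 2178   # subset being worked on after about a month
elapsed_months = 1.0    # approximate run time so far

fraction_done = current_subset / total_subsets
projected_total_months = elapsed_months / fraction_done
remaining_months = projected_total_months - elapsed_months

print(f"fraction done:   {fraction_done:.2f}")
print(f"projected total: {projected_total_months:.1f} months")
print(f"still to go:     {remaining_months:.1f} months")
```

Under that assumption the whole run would take roughly three months, which is where the figure in my last question comes from.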

My questions:
Given my dataset and computer, are the parameters I chose reasonable?
What effect does k-mer size have on speed and sensitivity?
What is the practical effect of varying subset-size?
Given my dataset and computer, what would you advise for params, given that 3 months is an unacceptable run time?

What I've tried:
do-partition.py -k 32 -x 330e9 -s 1e4 -T 46 SA_s13_500m_part SA_s13_500m_1_2.fasta.gz
PARAMETERS:
 - kmer size =    32 		(-k)
 - n hashes =     4 		(-N)
 - min hashsize = 3.3e+11 	(-x)

Estimated memory usage is 1.6e+11 bytes (n_hashes x min_hashsize / 8)
--------
Saving hashtable to SA_s13_500m_part
Loading kmers from sequences in ['SA_s13_500m_1_2.fasta.gz']
--
SUBSET SIZE 10000.0
N THREADS 46
--
making hashtable
consuming input SA_s13_500m_1_2.fasta.gz
fp rate estimated to be 0.000
** Traverse all the things: stop_big_traversals is false.
enqueued 6566 subset tasks
starting 46 threads
---
starting: SA_s13_500m_part 0
starting: SA_s13_500m_part 1
starting: SA_s13_500m_part 2
[snip]
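
As a sanity check on the memory line in the output above, this short Python sketch just re-applies the formula the log prints (n_hashes x min_hashsize / 8) to my parameters. Treating 256 GB as 256e9 bytes is my simplification, and the variable names are mine, not khmer's.

```python
# Re-apply the formula printed in the log:
#   estimated bytes = n_hashes x min_hashsize / 8
n_hashes = 4            # -N, as reported in the log
min_hashsize = 3.3e11   # -x 330e9, as reported in the log

estimated_bytes = n_hashes * min_hashsize / 8
estimated_gb = estimated_bytes / 1e9

# Assumption: treat the machine's 256 GB of RAM as 256e9 bytes.
ram_gb = 256
fraction_of_ram = estimated_gb / ram_gb

print(f"estimated table size: {estimated_gb:.0f} GB")
print(f"fraction of RAM:      {fraction_of_ram:.2f}")
```

So the hashtable by itself should account for about 165 GB, in line with the 1.6e+11 bytes the log reports.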

Thanks so much in advance for your help!
Cedar