[khmer] exceeding defined RAM limits?

C. Titus Brown ctb at msu.edu
Tue Dec 17 16:36:34 PST 2013


On Tue, Dec 17, 2013 at 07:53:18PM +0000, Oh, Julia (NIH/NHGRI) [F] wrote:
> Hopefully a simple error on my end that's causing a memory error: 
> 
> Starting with a fairly large file (estimated ~872400000 reads, ~185GB Illumina data): 
> 
> I'm running the following command on a large memory machine. From what I understand, the first normalization step should be consuming 240GB RAM and it does:
> 
> $python2.7 /home/ohjs/khmer/scripts/normalize-by-median.py -C 20 -k 20 -N 4 -x 60e9 --savehash round2.unaligned_ref.kh -R round2.unaligned_1.report round2.unaligned; 
> 
> Seems to end up removing ~33% of the reads, leaving ~118GB of sequence data
> 
> tail round2.unaligned_1.report
> 871500000 584890641 0.67113097074
> 871600000 584966095 0.671140540385
> 871700000 585039359 0.671147595503
> 871800000 585109434 0.671150991053
> 871900000 585174062 0.671148138548
> 872000000 585244067 0.671151452982
> 872100000 585314163 0.671154871001
> 872200000 585388191 0.671162796377
> 872300000 585459804 0.671167951393
> 872400000 585529439 0.671170837918
> 
> 
> Then I do the filtering step, which seems to run OK and makes the file a lot smaller, about 54GB of data. 
> $python2.7 /home/ohjs/khmer/scripts/filter-abund.py round2.unaligned_ref.kh round2.unaligned.keep; 

Hi Julia,

This looks like a fairly low-coverage metagenome, so I would run filter-abund
with the new -V (variable coverage) parameter.  For a complete protocol,
which includes retention of paired ends, please see:

https://khmer-protocols.readthedocs.org/en/latest/metagenomics/
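
For your data, that filtering step would look something like the command below
(just a sketch reusing your filenames from above; double-check the
filter-abund.py --help output on your install for the exact -V /
--variable-coverage spelling):

$python2.7 /home/ohjs/khmer/scripts/filter-abund.py -V round2.unaligned_ref.kh round2.unaligned.keep;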

Now, on to your real question :)

> $python2.7 /home/ohjs/khmer/scripts/normalize-by-median.py -C 5 -k 20 -N 4 -x 16e9 round2.unaligned.keep.abundfilt; 
> 
> I thought I would be maxing out at 64 GB ram for the hash table (I've also used 32e9), but I get the following RAM usage report of 
> 
> 4986693.biobos elapsed time:        23358 seconds
> 4986693.biobos walltime:         06:28:36 hh:mm:ss
> 4986693.biobos memory limit:       249.00 GB
> 4986693.biobos memory used:        249.76 GB
> 4986693.biobos cpupercent used:     98.00 %

What the heck!? That's not supposed to happen!

This is either a bug, or (most likely) it is being caused by an overabundance
of high-abundance k-mers.  (With -N 4 -x 16e9 the counting tables themselves
should only take about 4 x 16e9 bytes = 64 GB, as you expected; the extra
memory is the "bigcount" bookkeeping that tracks exact counts for k-mers seen
more than 255 times.)  That's easy to fix -- I've filed a bug report to address
it in the software itself [0] -- but for now it requires you to modify the
script by hand.  If you're up for that, put

	ht.set_use_bigcount(False)

at line 186 of normalize-by-median:

https://github.com/ged-lab/khmer/blob/master/scripts/normalize-by-median.py#L186
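
For orientation, here is a rough sketch of where that call sits.  The
surrounding lines are illustrative only -- the real script builds its counting
hash from the parsed command-line arguments -- and the set_use_bigcount(False)
call is the one actual change:

    import khmer

    # hypothetical construction mirroring -k 20 -N 4 -x 16e9; the real
    # normalize-by-median.py gets these values from its arguments
    ht = khmer.new_counting_hash(20, int(16e9), 4)

    # cap k-mer counts at 255 instead of tracking exact counts for
    # high-abundance k-mers in a separate, ever-growing structure
    ht.set_use_bigcount(False)

    # ... the normalization loop then proceeds unchanged ...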

Sorry about this; we'll get it fixed soon.

thanks,
--titus
-- 
C. Titus Brown, ctb at msu.edu



