[khmer] Fwd: How to speed up the filter-below-abund script ?

Alexis Groppi alexis.groppi at u-bordeaux2.fr
Fri Mar 15 08:05:02 PDT 2013


Hi Eric,

Good news: I may have found the solution to this tricky bug.

The bug comes from the hashtable construction with load-into-counting.py.
We used the following parameters: load-into-counting.py -k 20 -x 32e9
With -x 32e9, the hashtable grows until it reaches the maximum RAM 
available at that moment, independently of the size of the fasta.keep file.
But, for a reason I don't understand, the resulting file is not correct.
I confirmed this by repeating the two steps, load-into-counting.py and then 
filter-below-abund.py, on a very small subsample of 100,000 reads.
==> It generates a table.kh of 248.5 GB (!) and leads to the same error: 
Floating point exception (core dumped).

I then tried to perform these two steps on the whole data set (~2.5 
million reads) with load-into-counting.py -k 20 -x 5e7

==> It runs to completion, but I got a warning/error in the output file:
** ERROR: the counting hash is too small for
** this data set.  Increase hashsize/num ht.

Finally, I ran the two steps with load-into-counting.py -k 20 -x 
1e9... and it works perfectly, in a few minutes (~6 min), without 
any warning or error.

In my opinion, it would be useful (if possible) for load-into-counting.py 
to include a sanity check on the hashtable it creates.
By the way, how is this managed by normalize-by-median.py and the 
--savehash option?
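The kind of control suggested above could, in principle, estimate the expected false-positive rate of the counting hash from its occupancy. Below is a minimal, hypothetical sketch (the function name and the Bloom-filter-style approximation are mine, not khmer's actual code) of how such a check might decide whether a table of a given size is adequate for a given number of distinct k-mers:

```python
import math

def estimate_fp_rate(n_kmers, table_size, n_tables):
    """Hypothetical sketch: approximate false-positive rate of a
    counting hash with `n_tables` tables of `table_size` entries each,
    after inserting `n_kmers` distinct k-mers (Bloom-filter-style)."""
    # Expected fraction of occupied slots in one table
    occupancy = 1.0 - math.exp(-n_kmers / table_size)
    # A novel k-mer is miscounted only if it collides in every table
    return occupancy ** n_tables

# Example: 1e9 distinct k-mers spread over four tables of 8e9 entries
rate = estimate_fp_rate(1e9, 8e9, 4)
print("estimated fp rate: %.4f" % rate)
```

A script could then refuse to proceed, or at least warn (as load-into-counting.py did with -x 5e7), when this estimate crosses some chosen threshold.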

Now shifting to the next step (partitioning). I hope it goes more smoothly ;)

Thanks again for your responsiveness.

Have a nice weekend,

Alexis

On 15/03/2013 at 02:02, Eric McDonald wrote:
> I cannot reproduce your problem with a fairly large amount of data - 5 
> GB (50 million reads) of soil metagenomic data processed successfully 
> with 'sandbox/filter-below-abund.py'.  (I think the characteristics of 
> your data set are different though; I thought I noticed some sequences 
> with 'N' in them - those would be discarded. If you have many of those 
> then that could drastically reduce what is kept which might alter the 
> read-process-write "rhythm" between your threads some.)
>
> ... filtering 48400000
> done loading in sequences
> DONE writing.
> processed 48492066 / wrote 48441373 / removed 50693
> processed 3940396871 bp / wrote 3915266313 bp / removed 25130558 bp
> discarded 0.6%
>
> When I have a fresh mind tomorrow, I will suggest some next steps. 
> (Try to isolate which thread is dying, build a fresh Python 2.7 on a 
> machine which has access to your data, etc....)
>
>
>
> On Thu, Mar 14, 2013 at 8:10 PM, Eric McDonald <emcd.msu at gmail.com 
> <mailto:emcd.msu at gmail.com>> wrote:
>
>     Hi Alexis and Louise-Amélie,
>
>     Thank you both for the information. I am trying to reproduce your
>     problem with a large data set right now.
>     I agree that the problem may be a function of the amount of data.
>     However, if you were running out of memory, then I would expect to
>     see a segmentation fault rather than a FPE. I am still guessing
>     this problem may be threading-related (even if the number of
>     workers is reduced to 1, there is still the master thread which
>     supplies the groups of sequences and the writer thread which
>     outputs the kept sequences). But, my guesses have not proved to be
>     that useful with your problem thus far, so take my latest guess
>     with a grain of salt. :-)
>
>     Depending on whether I am able to reproduce the problem, I have
>     some more ideas which I intend to try tomorrow. If you find
>     anything else interesting, I would like to know. But, I feel bad
>     about how much time you have wasted on this. Hopefully I will be
>     able to reproduce the problem....
>
>     Thanks,
>     Eric
>
>
