[khmer] Fwd: How to speed up the filter-below-abund script?
Alexis Groppi
alexis.groppi at u-bordeaux2.fr
Fri Mar 15 08:05:02 PDT 2013
Hi Eric,
Good news: I may have found the solution to this tricky bug.
The bug comes from the hash table construction with load-into-counting.py.
We used the following parameters: load-into-counting.py -k 20 -x 32e9
With -x 32e9 the hash table grows until it reaches the maximum RAM
available at the moment, independently of the size of the fasta.keep file.
But, for a reason I do not understand, the resulting file is not correct.
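(For context, this is my understanding of khmer's memory model, so treat the exact numbers as an assumption: the counting hash allocates -N tables, 4 by default, of roughly -x bytes each, so the footprint is about N × x bytes regardless of input size. A quick sketch:)

```python
def counting_hash_bytes(x, n_tables=4):
    """Approximate memory footprint of khmer's counting hash:
    n_tables tables of roughly x bytes each (1 byte per counter).
    An estimate only -- khmer rounds each table size up to a
    nearby prime, so the real figure is slightly larger."""
    return int(n_tables * x)

# -x 32e9 with the default 4 tables asks for ~128 GB of RAM,
# while -x 1e9 asks for only ~4 GB.
print(counting_hash_bytes(32e9) / 1e9)  # ~128.0
print(counting_hash_bytes(1e9) / 1e9)   # ~4.0
```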
I realised this by repeating the two steps, load-into-counting.py and then
filter-below-abund.py, on a very small subsample of 100,000 reads.
==> It generates a table.kh of 248.5 GB (!) and leads to the same error:
Floating point exception (core dumped).
I tried to perform these two steps on the whole data set (~2.5
million reads) with load-into-counting.py -k 20 -x 5e7
==> It runs, but I got a warning/error in the output file:
** ERROR: the counting hash is too small for
** this data set. Increase hashsize/num ht.
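(If I understand correctly, this warning fires when the tables are too full: like a Bloom filter, a k-mer only collides if it collides in all N tables, so the error rate is roughly the fill fraction to the Nth power. A sketch of that idea, assuming khmer's exact check may differ:)

```python
def approx_false_positive_rate(unique_kmers, table_size, n_tables=4):
    """Bloom-filter-style estimate of a counting hash's error rate:
    each table is ~`fill` full, and a k-mer is miscounted only if it
    collides in every one of the n_tables tables.
    (Illustrative; khmer's internal threshold check may differ.)"""
    fill = unique_kmers / table_size
    return min(1.0, fill ** n_tables)

# ~40 M unique 20-mers in 5e7-entry tables -> badly overloaded,
# while 1e9-entry tables keep the error rate negligible.
print(approx_false_positive_rate(4e7, 5e7))  # ~0.41
print(approx_false_positive_rate(4e7, 1e9))  # ~2.6e-06
```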
Finally I ran the two steps with load-into-counting.py -k 20 -x 1e9....
And it works perfectly, in a few minutes (~6 min), without any warning
or error.
In my opinion, it would be useful (if possible) for load-into-counting.py
to include a sanity check on the hash table creation.
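(Such a control could be as simple as refusing to allocate more than the machine's physical RAM. A hypothetical pre-flight check, purely illustrative and not part of khmer:)

```python
import os

def check_hash_allocation(x, n_tables=4):
    """Hypothetical pre-flight check for load-into-counting.py:
    refuse to build a counting hash larger than physical RAM.
    (Illustrative only; not khmer's actual behaviour. Assumes a
    POSIX system where these sysconf names are available.)"""
    requested = int(n_tables * x)
    total_ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    if requested > total_ram:
        raise MemoryError(
            "counting hash needs %.1f GB but only %.1f GB of RAM present"
            % (requested / 1e9, total_ram / 1e9))
    return requested
```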
By the way, how is this managed by normalize-by-median.py and the
--savehash option?
Now moving on to the next step (partitioning); I hope it will be easier ;)
Thanks again for your responsiveness.
Have a nice weekend,
Alexis
On 15/03/2013 02:02, Eric McDonald wrote:
> I cannot reproduce your problem with a fairly large amount of data - 5
> GB (50 million reads) of soil metagenomic data processed successfully
> with 'sandbox/filter-below-abund.py'. (I think the characteristics of
> your data set are different though; I thought I noticed some sequences
> with 'N' in them - those would be discarded. If you have many of those
> then that could drastically reduce what is kept which might alter the
> read-process-write "rhythm" between your threads some.)
>
> ... filtering 48400000
> done loading in sequences
> DONE writing.
> processed 48492066 / wrote 48441373 / removed 50693
> processed 3940396871 bp / wrote 3915266313 bp / removed 25130558 bp
> discarded 0.6%
>
> When I have a fresh mind tomorrow, I will suggest some next steps.
> (Try to isolate which thread is dying, build a fresh Python 2.7 on a
> machine which has access to your data, etc....)
>
>
>
> On Thu, Mar 14, 2013 at 8:10 PM, Eric McDonald <emcd.msu at gmail.com
> <mailto:emcd.msu at gmail.com>> wrote:
>
> Hi Alexis and Louise-Amélie,
>
> Thank you both for the information. I am trying to reproduce your
> problem with a large data set right now.
> I agree that the problem may be a function of the amount of data.
> However, if you were running out of memory, then I would expect to
> see a segmentation fault rather than a FPE. I am still guessing
> this problem may be threading-related (even if the number of
> workers is reduced to 1, there is still the master thread which
> supplies the groups of sequences and the writer thread which
> outputs the kept sequences). But, my guesses have not proved to be
> that useful with your problem thus far, so take my latest guess
> with a grain of salt. :-)
>
> Depending on whether I am able to reproduce the problem, I have
> some more ideas which I intend to try tomorrow. If you find
> anything else interesting, I would like to know. But, I feel bad
> about how much time you have wasted on this. Hopefully I will be
> able to reproduce the problem....
>
> Thanks,
> Eric
>
>