<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Hi Eric,<br>
<br>
Good news: I may have found the solution to this tricky bug.<br>
<br>
The bug comes from the hash table construction with
load-into-counting.py.<br>
We used the following parameters: load-into-counting.py -k 20<b>
-x 32e9</b><br>
With -x 32e9 the hash table grows until it reaches the maximum RAM
available at the time, independently of the size of the fasta.keep
file.<br>
But, in a way I don't understand, the resulting file is not correct.<br>
I realized this by repeating the two steps, load-into-counting.py and
then filter-below-abund.py, on a very small subsample of 100,000
reads.<br>
==> It generates a table.kh of 248.5 GB (!) and leads to the same
error: Floating point exception (core dumped).<br>
<br>
I then tried performing these two steps on the whole data set (~2.5
million reads) with load-into-counting.py -k 20<b> -x 5e7</b><br>
<br>
==> It ran without crashing, but I got a warning/error in the output
file:<br>
** ERROR: the counting hash is too small for<br>
** this data set. Increase hashsize/num ht.<br>
<br>
Finally I ran the two steps with load-into-counting.py -k 20 <b>-x
1e9</b>... and it works perfectly, in a few minutes (~6 min),
without any warning or error.<br>
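For what it's worth, here is a rough back-of-the-envelope estimate of the memory the counting hash should need for each -x value I tried. It assumes the default of four hash tables (-N 4) and one byte per counter, which are my assumptions and should be checked against the khmer documentation:<br>

```python
# Rough estimate of counting-hash memory: assumed one byte per
# counter, n_tables tables of roughly x bytes each (default -N 4).
def counting_hash_gb(x, n_tables=4):
    return x * n_tables / 1e9  # total size in GB

for x in (32e9, 1e9, 5e7):
    print("-x %g  =>  ~%.1f GB" % (x, counting_hash_gb(x)))
```

Under these assumptions, -x 32e9 asks for roughly 128 GB of tables, which would explain why it ate all the available RAM, while -x 1e9 only needs ~4 GB.<br>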
<br>
In my opinion, it would be useful (if possible) to add a check on the
hash table creation in the load-into-counting.py script.<br>
By the way, how is this handled in normalize-by-median.py with the
--savehash option?<br>
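To sketch what I mean by such a check: something along these lines, comparing the requested table memory against the machine's physical RAM before allocating. The one-byte-per-counter and -N 4 defaults are my assumptions, and the sysconf calls are Linux-specific:<br>

```python
import os

def check_hash_size(x, n_tables=4):
    """Raise if the requested counting hash would exceed physical RAM.

    Assumes one byte per counter and n_tables tables of ~x bytes each.
    Returns the requested size in bytes if it fits.
    """
    requested = x * n_tables
    # Linux-specific way to query physical RAM without extra dependencies.
    phys_ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    if requested > phys_ram:
        raise MemoryError(
            "requested ~%.1f GB for the counting hash, but only "
            "%.1f GB of RAM is available"
            % (requested / 1e9, phys_ram / 1e9)
        )
    return requested
```

With something like this, -x 32e9 would fail immediately with a clear message instead of silently producing a corrupt table.kh.<br>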
<br>
Now I am moving on to the next steps (partitioning), hopefully with
less trouble ;)<br>
<br>
Thanks again for your responsiveness.<br>
<br>
Have a nice weekend,<br>
<br>
Alexis<br>
<br>
<div class="moz-cite-prefix">On 15/03/2013 at 02:02, Eric McDonald
wrote:<br>
</div>
<blockquote
cite="mid:CAGhFaV08Bg4HRA5DsufLjMTJq51K9ngN+1Q5bFqHtCecRmQC_w@mail.gmail.com"
type="cite">
<div dir="ltr">
<div style="">I cannot reproduce your problem with a fairly
large amount of data - 5 GB (50 million reads) of soil
metagenomic data processed successfully with
'sandbox/filter-below-abund.py'. (I think the characteristics
of your data set are different though; I thought I noticed
some sequences with 'N' in them - those would be discarded. If
you have many of those then that could drastically reduce what
is kept which might alter the read-process-write "rhythm"
between your threads some.)</div>
<div><br>
</div>
<div>... filtering 48400000</div>
<div>done loading in sequences</div>
<div>DONE writing.</div>
<div>processed 48492066 / wrote 48441373 / removed 50693</div>
<div>processed 3940396871 bp / wrote 3915266313 bp / removed
25130558 bp</div>
<div>discarded 0.6%</div>
<div><br>
</div>
<div>When I have a fresh mind tomorrow, I will suggest some next
steps. (Try to isolate which thread is dying, build a fresh
Python 2.7 on a machine which has access to your data,
etc....)</div>
<div><br>
</div>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Thu, Mar 14, 2013 at 8:10 PM, Eric
McDonald <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:emcd.msu@gmail.com" target="_blank">emcd.msu@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">Hi Alexis and <span
style="font-family:arial,sans-serif;font-size:13px">Louise-Amélie,</span>
<div><span
style="font-family:arial,sans-serif;font-size:13px"><br>
</span></div>
<div><span
style="font-family:arial,sans-serif;font-size:13px">Thank
you both for the information. I am trying to reproduce
your problem with a large data set right now.</span></div>
<div><span
style="font-family:arial,sans-serif;font-size:13px">I
agree that the problem may be a function of the amount
of data. However, if you were running out of memory,
then I would expect to see a segmentation fault rather
than a FPE. I am still guessing this problem may be
threading-related (even if the number of workers is
reduced to 1, there is still the master thread which
supplies the groups of sequences and the writer thread
which outputs the kept sequences). But, my guesses
have not proved to be that useful with your problem
thus far, so take my latest guess with a grain of
salt. :-)</span></div>
<div><span
style="font-family:arial,sans-serif;font-size:13px"><br>
</span></div>
<div><span
style="font-family:arial,sans-serif;font-size:13px">Depending
on whether I am able to reproduce the problem, I have
some more ideas which I intend to try tomorrow. If you
find anything else interesting, I would like to know.
But, I feel bad about how much time you have wasted on
this. Hopefully I will be able to reproduce the
problem....</span></div>
<div><span
style="font-family:arial,sans-serif;font-size:13px"><br>
</span></div>
<div><span
style="font-family:arial,sans-serif;font-size:13px">Thanks,</span></div>
<div><span
style="font-family:arial,sans-serif;font-size:13px">
Eric</span></div>
<div><span
style="font-family:arial,sans-serif;font-size:13px"><br>
</span></div>
</div>
<br>
</blockquote>
</div>
</div>
</blockquote>
<br>
<div class="moz-signature">-- <br>
<img src="cid:part2.03000403.09050809@u-bordeaux2.fr" border="0"></div>
</body>
</html>