[khmer] Fwd: How to speed up the filter-below-abund script ?

Eric McDonald emcd.msu at gmail.com
Thu Mar 14 05:56:56 PDT 2013


Alexis,

Sorry, I didn't mean to imply anything bad about David. As someone who has
previously worked as an HPC systems administrator, I know I have felt
annoyed when users filled up the wrong file system. So, if he was annoyed,
I understand.

I just realized that we didn't check something very basic... What is your
PYTHONPATH environment variable set to? What is the result if you add:
  echo $PYTHONPATH
before the commands in the script?
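For example, the diagnostic lines near the top of the job script could look
something like this (just a sketch; adjust as needed):
  echo "PYTHONPATH: $PYTHONPATH"
  which python
  python --version
That way the job's output file will record exactly which interpreter and
module search path were in effect when the script ran.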

Also, did you install 'khmer' into your virtualenv (i.e., did you do
"python setup.py install" at some point after you had done
". /mnt/var/home/ag/env/bin/activate")? If so, then we have likely been
using the wrong 'khmer' modules this whole time... To verify this, what
does the following command tell you:
  python -c "import khmer.thread_utils as tu; print tu.__file__"
We want to use the Python modules under your 'khmer-BETA/python' directory
and not the ones under your
'/mnt/var/home/ag/env/lib/python2.7/site-packages' directory. I should have
asked you to check this much earlier in the debugging process, especially
since I was helping someone else with a similar issue.
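
If it does turn out to be the site-packages copy, one quick way to force the
in-tree modules is to prepend the clone's 'python' directory to PYTHONPATH
before the script runs (a sketch; the khmer-BETA location below is a guess
based on your job script, so adjust it to wherever your clone actually lives):
  export PYTHONPATH=/mnt/var/home/ag/khmer-BETA/python:$PYTHONPATH
  python -c "import khmer.thread_utils as tu; print tu.__file__"
The second command should then report a path under 'khmer-BETA/python'.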

Thank you,
  Eric

On Thu, Mar 14, 2013 at 7:28 AM, Alexis Groppi <alexis.groppi at u-bordeaux2.fr
> wrote:

>  Hi Eric,
>
>
> On 14/03/2013 11:47, Eric McDonald wrote:
>
> Hi Alexis,
>
>  The 'coredump' file comes from a standard Unix feature: it is simply
> the image of the Python process as it was in memory at the time of the
> crash. No 'khmer' script produces it explicitly; it is written by your
> operating system.
>
>  You should be able to disable the core dumps if your systems engineer is
> getting upset about the space they are using. Please add the following
> before other commands in your job script:
>   ulimit -c 0
>
> Done, but for some mysterious reason, it's not taken into account...
> But David is a cool guy ;)
>
>
>  (Note: we could actually use the core dumps for debugging. I refrained
> from suggesting this to you yesterday, since describing the process can be
> somewhat complicated.)
>
>  Anyway, thanks for rerunning with the diagnostics I suggested. The exit
> code is 136, which is what you get if a process experiences a
> floating-point exception. If the exception had occurred within
> 'filter-below-abund.py' proper, the exit code would've been 1 rather than
> 136. This is what I wanted to double-check.
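>
> (Shell exit codes of the form 128+N mean the process was killed by signal
> N, and signal 8 is SIGFPE. If you are curious, you can confirm the mapping
> with:
>   kill -l 8
> which should print "FPE" in bash.)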
>
>  I suspect that something bad is happening within the Python interpreter.
> This may be due to some subtle bug involving Python's global interpreter
> lock and 'khmer' not doing something proper with regards to that. I will
> attempt to analyze the problem in more detail later today.
>
>  I know you must be getting tired of working on this problem
>
>
> Not at all. Thanks for your work!
>
>
>  , but if you want to try one more thing (for now), then it would be
> appreciated. Could you edit your copy of 'filter-below-abund.py' and change:
>   WORKER_THREADS = 8
> to:
>   WORKER_THREADS = 1
> and see if that helps?
>
> Done also... but unfortunately same result :(
>
>
> Alexis
>
>
>
>
>  Thanks,
>   Eric
>
>  P.S. If you want to reduce memory usage, you can decrease the Bloom
> filter size by adjusting the "-x" parameter that you use in the scripts to
> create your .kh files. Making this number smaller will reduce memory usage
> but will also increase the false positive rate, so be careful about tuning
> this too much.
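>
> (A rough sketch of that tradeoff, with hypothetical numbers and assuming
> the usual -N/-x options of 'load-into-counting.py': the memory used by a
> counting table is roughly the product of -N and -x, in bytes, so something
> like
>   load-into-counting.py -k 20 -N 4 -x 4e9 table.kh reads.fasta.keep
> would use about 16 GB. Halving -x roughly halves the memory footprint, at
> the cost of a higher false positive rate.)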
>
>
>
> On Thu, Mar 14, 2013 at 5:42 AM, Alexis Groppi <
> alexis.groppi at u-bordeaux2.fr> wrote:
>
>>  Hi Eric,
>>
>> I've tried all the suggestions you made,
>> but I get the same result (see attached e/o file).
>>
>> However, with the help of David (the system engineer of the lab), I think
>> we have found the bug:
>> ==> filter-below-abund.py fills the 'coredump' file under
>> /var/spool/abrt/ccpp-2013-03-14-10:24:13-26642.new/ until it consumes all
>> the available space (see below).
>> ==> Then it crashes.
>>
>> Is there a way to change this?
>>
>> Thanks again
>>
>> Alexis
>> **************************************************
>> [root at rainman ~]# ll -h /var/spool/abrt/ccpp-2013-03-14-10\:24\:13-26642.new/
>> total 12G
>> -rw-r----- 1 abrt users    4 14 mars  10:24 analyzer
>> -rw-r----- 1 abrt users    6 14 mars  10:24 architecture
>> -rw-r----- 1 abrt users  150 14 mars  10:24 cmdline
>> -rw-r----- 1 abrt users  12G 14 mars  10:24 coredump
>> -rw-r----- 1 abrt users 1,5K 14 mars  10:24 environ
>> -rw-r----- 1 abrt users   31 14 mars  10:24 executable
>> -rw-r----- 1 abrt users   27 14 mars  10:24 hostname
>> -rw-r----- 1 abrt users   26 14 mars  10:24 kernel
>> -rw-r----- 1 abrt users  13K 14 mars  10:24 maps
>> -rw-r----- 1 abrt users   26 14 mars  10:24 os_release
>> -rw-r----- 1 abrt users   71 14 mars  10:24 reason
>> -rw-r----- 1 abrt users   10 14 mars  10:24 time
>> -rw-r----- 1 abrt users    3 14 mars  10:24 uid
>>
>>
>> [root at rainman ~]# ll -h /var/spool/abrt/ccpp-2013-03-14-10\:24\:13-26642.new/
>> total 18G
>> -rw-r----- 1 abrt users    4 14 mars  10:24 analyzer
>> -rw-r----- 1 abrt users    6 14 mars  10:24 architecture
>> -rw-r----- 1 abrt users  150 14 mars  10:24 cmdline
>> -rw-r----- 1 abrt users  18G 14 mars  10:25 coredump
>> -rw-r----- 1 abrt users 1,5K 14 mars  10:24 environ
>> -rw-r----- 1 abrt users   31 14 mars  10:24 executable
>> -rw-r----- 1 abrt users   27 14 mars  10:24 hostname
>> -rw-r----- 1 abrt users   26 14 mars  10:24 kernel
>> -rw-r----- 1 abrt users  13K 14 mars  10:24 maps
>> -rw-r----- 1 abrt users   26 14 mars  10:24 os_release
>> -rw-r----- 1 abrt users   71 14 mars  10:24 reason
>> -rw-r----- 1 abrt users   10 14 mars  10:24 time
>> -rw-r----- 1 abrt users    3 14 mars  10:24 uid
>>
>>
>> On 13/03/2013 22:58, Eric McDonald wrote:
>>
>> Forwarding my earlier reply to the list, since I didn't reply-to-all
>> earlier.
>>
>>  Also, Alexis, you may wish to change the following in your job script:
>>   #PBS -l nodes=1:ppn=1
>> to
>>   #PBS -l nodes=1:ppn=8
>> assuming that you have 8-core nodes available. 'filter-below-abund.py'
>> uses 8 threads by default; if a 'khmer' job runs on the same node as
>> another job, it may try using more CPU cores than it was allocated and that
>> could create problems with your systems administrators. And, if a job's
>> threads are restricted to the requested number of cores, then you will also
>> not be getting optimal performance by using more threads (8) than available
>> cores (1).
>>
>> ---------- Forwarded message ----------
>> From: Eric McDonald <emcd.msu at gmail.com>
>> Date: Wed, Mar 13, 2013 at 3:12 PM
>> Subject: Re: [khmer] How to speed up the filter-below-abund script ?
>> To: alexis.groppi at u-bordeaux2.fr
>>
>>
>> Alexis,
>>
>>  I just realized that the floating-point exception is from inside the
>> Python interpreter itself. If the floating-point exception had appeared
>> from within the 'filter-below-abund.py' script, then we should have seen a
>> traceback from the exception, ending with:
>>   ZeroDivisionError: float division by zero
>> Instead, we are seeing:
>>    line 49: 54757 Floating point exception(core dumped)
>>  from your job shell. (I should've noticed that earlier.)
>>
>> Would you please add the following lines to your job script somewhere
>> before you invoke 'filter-below-abund.py':
>>   python --version
>>   which python
>>
>>  And would you please add the following line _immediately after_ you
>> invoke 'filter-below-abund.py':
>>   echo "Exit Code: $?"
>>
>>  Also, would you remove the 'time' command from in front of your
>> invocation of 'filter-below-abund.py'?
>>
>>  And, one more action before trying again... please run:
>>   git pull
>> in your 'khmer-BETA' directory. (I added another possible fix to the
>> 'bleeding-edge' branch. This command will pull that fix into your clone.)
>>
>>  Thank you,
>>   Eric
>>
>>
>> On Wed, Mar 13, 2013 at 10:13 AM, Alexis Groppi <
>> alexis.groppi at u-bordeaux2.fr> wrote:
>>
>>>  Hi,
>>>
>>> On 13/03/2013 14:12, Eric McDonald wrote:
>>>
>>> Hi Alexis,
>>>
>>>  First, let me say thank you for being patient and working with us in
>>> spite of all the problems you are encountering.
>>>
>>>
>>> That's a bioinformatician's life ;)
>>>
>>>
>>>
>>>  With regards to the floating point exception, I see several
>>> opportunities for a division-by-zero condition in the threading utilities
>>> used by the script. These opportunities exist if an input file is empty.
>>> (The problem may be coming from another place, but this would be my first
>>> guess.) What does the following command say:
>>>
>>>   ls -lh /scratch/ag/khmer/174r1_table.kh /mnt/var/home/ag/174r1_prinseq_good_bFr8.fasta.keep
>>>
>>>
>>>   The result (the files are not empty):
>>> -rw-r--r-- 1 ag users 299M 12 mars  20:54 /mnt/var/home/ag/174r1_prinseq_good_bFr8.fasta.keep
>>> -rw-r--r-- 1 ag users 141G 12 mars  21:05 /scratch/ag/khmer/174r1_table.kh
>>>
>>>
>>>
>>> Also, since you appear to be using TORQUE as your resource manager/batch
>>> system, could you please attach the complete output and error files for the
>>> job? (These files should be of the form <job_name>.o2693 and
>>> <job_name>.e2693, where <job_name> is the name of your job. There may only
>>> be one or the other of these files, depending on site defaults and whether
>>> you specified "-j oe" or "-j eo" in your job submission.)
>>>
>>>
>>>  I re-ran the job, since I had deleted the previous (2693) err/out files.
>>> Here is the new file (merged with the option -j oe in the bash script):
>>>
>>> #############################
>>> User: ag
>>> Date: Wed Mar 13 14:59:21 CET 2013
>>> Host: rainman.cbib.u-bordeaux2.fr
>>> Directory: /mnt/var/home/ag
>>> PBS_JOBID: 2695.rainman
>>> PBS_O_WORKDIR: /mnt/var/home/ag
>>> PBS_NODEFILE:  rainman
>>> #############################
>>> #############################
>>> Debut filter-below-abund: Wed Mar 13 14:59:21 CET 2013
>>>
>>> starting threads
>>> starting writer
>>> loading...
>>> ... filtering 0
>>> /var/lib/torque/mom_priv/jobs/2695.rainman.SC: line 49: 54757 Floating point exception(core dumped) ./khmer-BETA/sandbox/filter-below-abund.py /scratch/ag/khmer/174r1_table.kh /mnt/var/home/ag/174r1_prinseq_good_bFr8.fasta.keep
>>>
>>> real    3m54.873s
>>> user    0m0.085s
>>> sys     2m2.180s
>>> Date fin: Wed Mar 13 15:03:15 CET 2013
>>> Job finished
>>>
>>> Thanks again for your help :)
>>>
>>> Alexis
>>>
>>>
>>>
>>>  Thanks,
>>>   Eric
>>>
>>>
>>>
>>> On Wed, Mar 13, 2013 at 5:38 AM, Alexis Groppi <
>>> alexis.groppi at u-bordeaux2.fr> wrote:
>>>
>>>>  Hi Eric,
>>>>
>>>> Thanks for your answer.
>>>> But unfortunately, after many attempts, I'm getting this error:
>>>>
>>>> starting threads
>>>> starting writer
>>>> loading...
>>>> ... filtering 0
>>>> /var/lib/torque/mom_priv/jobs/2693.rainman.SC: line 46: 63657 Floating point exception(core dumped) ./khmer-BETA/sandbox/filter-below-abund.py /scratch/ag/khmer/174r1_table.kh /mnt/var/home/ag/174r1_prinseq_good_bFr8.fasta.keep
>>>>
>>>> real    3m30.163s
>>>> user    0m0.088s
>>>>
>>>> Your opinion?
>>>>
>>>> Thanks
>>>>
>>>> Alexis
>>>>
>>>>
>>>> On 13/03/2013 00:55, Eric McDonald wrote:
>>>>
>>>> Hi Alexis,
>>>>
>>>>  One way to get the 'bleeding-edge' branch is to clone it into a fresh
>>>> directory; for example:
>>>>   git clone http://github.com/ged-lab/khmer.git -b bleeding-edge khmer-BETA
>>>>
>>>>  Assuming you already have a clone of the 'ged-lab/khmer' repo, then
>>>> you should also be able to do:
>>>>   git fetch origin
>>>>   git checkout bleeding-edge
>>>> Depending on how old your Git client is and what its defaults are, you
>>>> may have to do the following instead:
>>>>   git checkout --track -b bleeding-edge origin/bleeding-edge
>>>>
>>>>  Hope this helps,
>>>>   Eric
>>>>
>>>>
>>>> On Tue, Mar 12, 2013 at 11:32 AM, Alexis Groppi <
>>>> alexis.groppi at u-bordeaux2.fr> wrote:
>>>>
>>>>>
>>>>> On 12/03/2013 16:16, C. Titus Brown wrote:
>>>>>
>>>>> On Tue, Mar 12, 2013 at 04:15:05PM +0100, Alexis Groppi wrote:
>>>>>
>>>>>  Hi Titus,
>>>>>
>>>>> Thanks for your answer.
>>>>> Actually, it's my second attempt with filter-below-abund.
>>>>> The first time, I thought the problem was coming from the location of my
>>>>> table.kh file: it was on a storage system with poor I/O performance.
>>>>> I killed the job after 24h, moved the file to a better place and re-ran it,
>>>>> but with the same result: no completion after 24h.
>>>>>
>>>>> Any idea?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Cheers From Bordeaux :)
>>>>>
>>>>> Alexis
>>>>>
>>>>> PS: The command line was the following:
>>>>>
>>>>> ./filter-below-abund.py 174r1_table.kh 174r1_prinseq_good_bFr8.fasta.keep
>>>>>
>>>>> Is this correct?
>>>>>
>>>>>  Yes, looks right... Can you try with the bleeding-edge branch, which now
>>>>> incorporates a potential fix for this issue?
>>>>>
>>>>>  From here : https://github.com/ged-lab/khmer/tree/bleeding-edge ?
>>>>> or
>>>>> here : https://github.com/ctb/khmer/tree/bleeding-edge ?
>>>>>
>>>>> Do I have to make a fresh install? And how?
>>>>> Or just replace all the files and folders?
>>>>>
>>>>> Thanks :)
>>>>>
>>>>> Alexis
>>>>>
>>>>>
>>>>>  thanks,
>>>>> --titus
>>>>>
>>>>>
>>>>>  On 12/03/2013 14:41, C. Titus Brown wrote:
>>>>>
>>>>>  On Tue, Mar 12, 2013 at 10:48:03AM +0100, Alexis Groppi wrote:
>>>>>
>>>>>  Metagenome assembly:
>>>>> My data:
>>>>> - original (quality-filtered) data: 4463243 reads (75 nt) (Illumina)
>>>>> 1/ Single-pass digital normalization with normalize-by-median (C=20)
>>>>> ==> .keep file of 2560557 reads
>>>>> 2/ Generated a hash table with load-into-counting on the .keep file
>>>>> ==> .kh file of ~16 GB (huge file?!)
>>>>> 3/ filter-below-abund with C=100 from the two previous files (table.kh
>>>>> and reads.keep)
>>>>> Still running after 24 hours :(
>>>>>
>>>>> Any advice to speed up this step? ... and the others (partitioning ...)?
>>>>>
>>>>> I have access to an HPC cluster with ~3000 cores.
>>>>>
>>>>>  Hi Alexis,
>>>>>
>>>>> filter-below-abund and filter-abund have occasional bugs that prevent them
>>>>> from completing.  I would kill and restart.  For that few reads it should
>>>>> take no more than a few hours to do everything.
>>>>>
>>>>> Note that most of what khmer does cannot easily be distributed across
>>>>> multiple chassis.
>>>>>
>>>>> best,
>>>>> --titus
>>>>>
>>>>>  --
>>>>>
>>>>>
>>>>>   --
>>>>>
>>>>> _______________________________________________
>>>>> khmer mailing list
>>>>> khmer at lists.idyll.org
>>>>> http://lists.idyll.org/listinfo/khmer
>>>>>
>>>>>
>>>>
>>>>
>>>>  --
>>>>  Eric McDonald
>>>> HPC/Cloud Software Engineer
>>>>   for the Institute for Cyber-Enabled Research (iCER)
>>>>   and the Laboratory for Genomics, Evolution, and Development (GED)
>>>> Michigan State University
>>>> P: 517-355-8733
>>>>
>>>>
>>>>   --
>>>>
>>>
>>>
>>>
>>>  --
>>>  Eric McDonald
>>> HPC/Cloud Software Engineer
>>>   for the Institute for Cyber-Enabled Research (iCER)
>>>   and the Laboratory for Genomics, Evolution, and Development (GED)
>>> Michigan State University
>>> P: 517-355-8733
>>>
>>>
>>>   --
>>>
>>
>>
>>
>>  --
>>  Eric McDonald
>> HPC/Cloud Software Engineer
>>   for the Institute for Cyber-Enabled Research (iCER)
>>   and the Laboratory for Genomics, Evolution, and Development (GED)
>> Michigan State University
>> P: 517-355-8733
>>
>>
>>
>>  --
>>  Eric McDonald
>> HPC/Cloud Software Engineer
>>   for the Institute for Cyber-Enabled Research (iCER)
>>   and the Laboratory for Genomics, Evolution, and Development (GED)
>> Michigan State University
>> P: 517-355-8733
>>
>>
>> _______________________________________________
>> khmer mailing list
>> khmer at lists.idyll.org
>> http://lists.idyll.org/listinfo/khmer
>>
>>
>>   --
>>
>
>
>
>  --
>  Eric McDonald
> HPC/Cloud Software Engineer
>   for the Institute for Cyber-Enabled Research (iCER)
>   and the Laboratory for Genomics, Evolution, and Development (GED)
> Michigan State University
> P: 517-355-8733
>
>
> --
>



-- 
Eric McDonald
HPC/Cloud Software Engineer
  for the Institute for Cyber-Enabled Research (iCER)
  and the Laboratory for Genomics, Evolution, and Development (GED)
Michigan State University
P: 517-355-8733