[khmer] Fwd: partition-graph memory requirements

C. Titus Brown ctb at msu.edu
Fri Apr 26 20:56:02 PDT 2013


Hi Jens-Konrad,

apologies for the late response. It's been quite a month.

partition-graph *really* shouldn't be doing that; the remaining
memory-ballooning script in there is find-knots, not partition-graph :).  I
looked at your scripts and didn't see anything obviously problematic,
but then again this shouldn't be happening at all.

Could you try with --no-big-traverse and tell me what happens?
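
Something like this should do it (the thread count and subset size are
just placeholders; use whatever you have been running with):

  ./khmer/scripts/partition-graph.py --no-big-traverse --threads 4 \
    --subset-size 1e5 graafik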

I don't suppose you can share a problematic data set with me?

thanks,
--titus

On Sat, Apr 13, 2013 at 11:16:50AM +0300, Jens-Konrad Preem wrote:
> Yes, the steady ballooning is quite obvious, especially if I take some
> time staring at top command output etc. Thank you for your time; I will
> hope that someone will look at this stuff here. As a note, might it
> be that my graafik.ht is corrupted somehow? It is even
> smaller in size than the 50m.ht, which I was able to partition without trouble. As
> additional information for anybody interested, the data used was ~36M 250
> bp reads.
> Jens-Konrad
> On 04/13/2013 05:35 AM, Eric McDonald wrote:
>> Jens-Konrad,
>>
>> Thanks for providing this information.
>> 15: resources_used.mem = 52379536kb
>> 30: resources_used.mem = 90676068kb
>> 45: resources_used.mem = 122543188kb
>> Definitely some ballooning memory use there.
>>
>> One more thing you may wish to examine from the command line is:
>>   qmgr -c "l s" | grep 'resources_'
>> This will tell you about any default resources (such as physical  
>> memory) that your PBS server is assigning to new jobs. That said, I do  
>> believe that your jobs are exhausting available memory.
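>>
>> (If it turns out that a default 'mem' limit is being applied, then one
>> workaround -- assuming your PBS/TORQUE setup accepts these resource
>> names -- is to request physical memory explicitly in the job script,
>> alongside the vmem request you already have, e.g.:
>>   #PBS -l mem=240gb
>>   #PBS -l vmem=240gb
>> The 240gb figure is just the amount you mentioned requesting before;
>> use whatever your queue actually allows.)
>>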
>> So, now the question is whether anything can be done about it. Unless  
>> someone with more experience with the partitioning code decides to  
>> speak up, I am going to have to analyze your chosen parameters and the
>> pieces of code in question to see if I can deduce anything. I might  
>> not be able to do this until Monday - I am too tired to do it tonight  
>> (here in US Eastern time) and have a busy weekend ahead of me.
>>
>> I promise I will get back to you with some better answers if no one
>> else decides to say anything. While you are waiting for a response, if
>> you want to test your hypothesis about the number of threads
>> correlating with increased memory use, I would recommend using a
>> smaller data set and seeing what kind of memory-use scaling you
>> see as you change the number of threads.
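>>
>> (One way to do that comparison, assuming GNU time is available on your
>> node -- the file names and subset size below are only placeholders:
>>   /usr/bin/time -v partition-graph.py --threads 4 --subset-size 1e5 testgraph 2> mem.4.log
>>   /usr/bin/time -v partition-graph.py --threads 24 --subset-size 1e5 testgraph 2> mem.24.log
>>   grep 'Maximum resident set size' mem.4.log mem.24.log
>> If the resident-size figure grows sharply with the thread count, that
>> would support your hypothesis.)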
>>
>> Have a good weekend,
>>   Eric
>>
>>
>>
>> On Fri, Apr 12, 2013 at 7:30 AM, Jens-Konrad Preem <jpreem at ut.ee> wrote:
>>
>>     On 04/11/2013 02:58 AM, Eric McDonald wrote:
>>>     Forgot to reply to all, in case the answer will help anyone else
>>>     on the list....
>>>
>>>     ---------- Forwarded message ----------
>>>     From: Eric McDonald <emcd.msu at gmail.com>
>>>     Date: Wed, Apr 10, 2013 at 7:57 PM
>>>     Subject: Re: [khmer] partition-graph memory requirements
>>>     To: Jens-Konrad Preem <jpreem at ut.ee>
>>>
>>>
>>>     Hi,
>>>
>>>     Sorry for the delayed reply.
>>>
>>>     Thanks for sharing your job scripts. I notice that you are
>>>     specifying the 'vmem' resource. However, if PBS is also enforcing
>>>     a limit on the 'mem' resource (physical memory), then you may be
>>>     encountering that limit. Do you know what default value is
>>>     assigned by your site's PBS server for the 'mem' resource?
>>>
>>>     Again, if you run:
>>>       qstat -f <job_id>
>>>     you should be able to determine both the resources allocated for
>>>     the job and how much the job is actually using. Please let us
>>>     know the results of this command if you would like help
>>>     interpreting them or figuring out how to change your PBS
>>>     resource request, if necessary.
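>>>
>>>     (To pull out just the memory-related lines, something like
>>>       qstat -f <job_id> | grep -E 'Resource_List|resources_used'
>>>     should work; comparing Resource_List.mem / Resource_List.vmem
>>>     against resources_used.mem / resources_used.vmem will show which
>>>     limit, if any, the job is bumping into. The exact attribute names
>>>     can vary a little between PBS variants.)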
>>>
>>>     As a side note, smaller k-mer lengths mean that more k-mers are
>>>     being extracted from each sequence. This means that the hash
>>>     tables are being more densely populated. And, that means that you
>>>     are more likely to need larger hash tables to avoid a significant
>>>     false positive rate. But, I think a better thing to say is that
>>>     the amount of memory used by the hash tables is independent of
>>>     k-mer size. So, changing k-mer length does not affect memory
>>>     usage for many parts of khmer. (I would have to look more closely
>>>     to see how this affects the partitioning code.)
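>>>
>>>     (As a rough illustration of the first point -- this is the
>>>     standard Bloom-filter-style approximation, not khmer's exact
>>>     accounting, and the table size and table count below are made up:
>>>
>>>       # back-of-envelope: k-mers per read and approximate false positive rate
>>>       import math
>>>       read_len, n_reads = 250, 36e6      # from the data set in this thread
>>>       table_size, n_tables = 16e9, 4     # slots per table and table count; made up
>>>       for k in (20, 32):
>>>           kmers = n_reads * (read_len - k + 1)           # upper bound, ignores duplicates
>>>           occupancy = 1 - math.exp(-kmers / table_size)  # fraction of slots set
>>>           print("k=%d  approx FP rate=%.3f" % (k, occupancy ** n_tables))
>>>
>>>     Smaller k means more k-mers per read, higher occupancy, and so a
>>>     higher false positive rate for the same table size -- but the
>>>     tables themselves stay the same size.)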
>>>
>>>     Hope that helps,
>>>       Eric
>>>
>>>
>>>
>>>     On Wed, Apr 10, 2013 at 4:23 AM, Jens-Konrad Preem <jpreem at ut.ee> wrote:
>>>
>>>         Hi,
>>>
>>>         In an extreme act of foolishness I seem to have lost my
>>>         error logs. (I have been messing with the different scripts
>>>         here a lot and got rid of some of the outputs in an
>>>         ill-thought-out "housekeeping" event.)
>>>
>>>         I attach here the PBS scripts that I used to get as
>>>         far as I am. I used a separate script for most of the
>>>         normalize and partition pipeline, so I'd have time to look at
>>>         the outputs and get a sense of the time taken by each step.
>>>         The scripts are in the following order: supkhme (normalize),
>>>         suprem (filter-below), supload (load-graph), and finally
>>>         supart (partition-graph). (As can be seen, I am trying to do
>>>         the metagenome analysis as per the guide.txt.)
>>>         All the previous scripts completed without complaint,
>>>         producing the 5.2 GB "graafik" graph.
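>>>
>>>         For reference, that pipeline corresponds roughly to the
>>>         following steps from the guide. The parameter values and file
>>>         names here are placeholders rather than the exact ones in my
>>>         scripts, and script locations and flags may differ between
>>>         khmer versions:
>>>
>>>           # supkhme: digital normalization
>>>           python khmer/scripts/normalize-by-median.py -k 20 -C 20 -x 16e9 --savehash normed.kh reads.fastq
>>>           # suprem: trim off very-high-abundance k-mers
>>>           python khmer/sandbox/filter-below-abund.py normed.kh reads.fastq.keep
>>>           # supload: build the graph presence table
>>>           python khmer/scripts/load-graph.py -k 32 -x 16e9 graafik reads.fastq.keep.below
>>>           # supart: partition the graph (the step that keeps failing)
>>>           python khmer/scripts/partition-graph.py --threads 24 --subset-size 1e4 graafik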
>>>
>>>         The partition-graph script had failed a few times after
>>>         running for an hour or so, always with error messages
>>>         concerning memory. The latest script there requests 240 GB of
>>>         memory, which is the maximum I can request in the near
>>>         future, and it still failed with an error message concerning
>>>         memory.
>>>
>>>         I am right now working on reproducing the error, so I can
>>>         then supply you with the .log and .error files; if no error
>>>         occurs, so much the better for me, of course.
>>>         I decided to try different k-values this time, as suggested by
>>>         https://khmer.readthedocs.org/en/latest/guide.html (20 for
>>>         normalization, and 32 for partitioning); those should make the
>>>         graph file all the bigger - I used the smaller ones to avoid
>>>         running out of memory, but as that doesn't seem to help,
>>>         what the heck. ;D  Right now I am at the load-graph stage
>>>         with the new set. As it will complete in a few hours, I'll
>>>         put partition-graph on the run and then we will see if it
>>>         dies within an hour. If so, I'll post a new set of scripts
>>>         and logs.
>>>
>>>         Thank you for your time,
>>>         Jens-Konrad
>>>
>>>
>>>
>>>
>>>         On 04/10/2013 04:18 AM, Eric McDonald wrote:
>>>>         Hi Jens-Konrad,
>>>>
>>>>         Sorry for the delayed response. (I was on vacation yesterday
>>>>         and hoping that someone more familiar with the partitioning
>>>>         code would answer.)
>>>>
>>>>         My understanding of the code is that decreasing the subset
>>>>         size will increase the number of partitions but will not
>>>>         change the overall graph coverage. Therefore, I would not
>>>>         expect it to lower memory requirements. (The overhead from
>>>>         additional partitions might raise them some, but I have not
>>>>         analyzed the code deeply enough to say one way or another
>>>>         about that.) As far as changing the number of threads goes,
>>>>         each thread does seem to maintain a local list of traversed
>>>>         k-mers (hidden in the C++ implementation) but I do not yet
>>>>         know how much that would impact memory usage. Have you tried
>>>>         using fewer threads?
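>>>>
>>>>         (As a very rough back-of-envelope -- the figures below are
>>>>         made up, since I have not measured the real per-thread
>>>>         overhead -- the extra cost of per-thread bookkeeping scales
>>>>         roughly like:
>>>>
>>>>           threads        = 24
>>>>           kmers_per_set  = 50e6   # traversed k-mers held per thread; a guess
>>>>           bytes_per_kmer = 16     # hashed value plus container overhead; a guess
>>>>           extra_gb = threads * kmers_per_set * bytes_per_kmer / 1e9
>>>>           print("%.1f GB of extra memory" % extra_gb)   # ~19 GB with these guesses
>>>>
>>>>         so dropping from 24 threads to 4 would cut that particular
>>>>         component by a factor of six, whatever its true size is.)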
>>>>
>>>>         But, rather than guessing about causation, let's try to get
>>>>         some more diagnostic information. Does the script die
>>>>         immediately? (How long does the PBS job execute before
>>>>         failure?) Can you attach the output and error files for a
>>>>         job, and also the job script? What does
>>>>           qstat -f <job_id>
>>>>         where <job_id> is the ID of your running job, tell you about
>>>>         memory usage?
>>>>
>>>>         Thanks,
>>>>           Eric
>>>>
>>>>
>>>>
>>>>
>>>>         On Mon, Apr 8, 2013 at 3:34 AM, Jens-Konrad Preem <jpreem at ut.ee> wrote:
>>>>
>>>>             Hi,
>>>>             I am having trouble completing a partition-graph.py job.
>>>>             No matter the configuration, it seems to terminate with
>>>>             error messages hinting at low memory etc.*
>>>>             Does lowering the subset size reduce the memory use? What
>>>>             about lowering the number of parallel threads?
>>>>             The graafik.ht file is 5.2 GB large; I had the script
>>>>             running as a PBS job with 240 GB RAM allocated. (That's
>>>>             as much as I can get; maybe I'll have an opportunity in
>>>>             the next week to double it, but I wouldn't count on it.)
>>>>             Is it expected for the script to require so much RAM, or
>>>>             is there some bug or some misuse on my part? Would there
>>>>             be any configuration to get past this?
>>>>
>>>>             Jens-Konrad Preem, MSc., University of Tartu
>>>>
>>>>
>>>>
>>>>             * the latest configuration, after I decided to try a
>>>>             smaller subset size:
>>>>             ./khmer/scripts/partition-graph.py --threads 24
>>>>             --subset-size 1e4 graafik
>>>>             terminated with
>>>>             cannot allocate memory for thread-local data: ABORT
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>         --
>>>>         Eric McDonald
>>>>         HPC/Cloud Software Engineer
>>>>           for the Institute for Cyber-Enabled Research (iCER)
>>>>           and the Laboratory for Genomics, Evolution, and Development (GED)
>>>>         Michigan State University
>>>>         P: 517-355-8733
>>>
>>>         --
>>>         Jens-Konrad Preem, MSc, University of Tartu
>>>
>>>
>>>
>>>
>>>
>>>
>>>     --
>>>     Eric McDonald
>>>     HPC/Cloud Software Engineer
>>>       for the Institute for Cyber-Enabled Research (iCER)
>>>       and the Laboratory for Genomics, Evolution, and Development (GED)
>>>     Michigan State University
>>>     P: 517-355-8733
>>>
>>     OK.
>>     I am posting a failed run, complete with the PBS script, error log,
>>     and qstat -f snapshots at different times.
>>     I find it weird that I managed to complete the test run on
>>     iowa-corn50M, which had an even larger graph file. Might the number
>>     of threads used pump up the memory? I used the sample commands
>>     from the web page for the corn data; those used 4 threads at most.
>>     Jens-Konrad Preem
>>
>>
>>
>>
>>
>> -- 
>> Eric McDonald
>> HPC/Cloud Software Engineer
>>   for the Institute for Cyber-Enabled Research (iCER)
>>   and the Laboratory for Genomics, Evolution, and Development (GED)
>> Michigan State University
>> P: 517-355-8733
>
> -- 
> Jens-Konrad Preem, MSc, University of Tartu
>

-- 
C. Titus Brown, ctb at msu.edu



