[khmer] Fwd: partition-graph memory requirements

Eric McDonald emcd.msu at gmail.com
Fri Apr 12 19:35:57 PDT 2013


Jens-Konrad,

Thanks for providing this information.
  15: resources_used.mem = 52379536kb
  30: resources_used.mem = 90676068kb
  45: resources_used.mem = 122543188kb
Definitely some ballooning memory use there.
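(For reference, those figures are roughly 50 GB, 86 GB, and 117 GB,
respectively - so usage is climbing steadily toward whatever limit the job
has been given.)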

One more thing you may wish to examine from the command line is:
  qmgr -c "l s" | grep 'resources_'
This will tell you about any default resources (such as physical memory)
that your PBS server is assigning to new jobs. That said, I do believe that
your jobs are exhausting available memory.
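If it turns out that a small default 'mem' limit is being applied, you can
override it explicitly in your job script. A minimal sketch, assuming a
Torque/PBS-style scheduler (the exact resource names your site accepts may
differ):
  #PBS -l nodes=1:ppn=24
  #PBS -l mem=240gb
  #PBS -l vmem=240gb
Requesting both 'mem' and 'vmem' avoids the case where the virtual memory
limit is generous but an inherited default physical memory limit is not.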
So, now the question is whether anything can be done about it. Unless
someone with more experience with the partitioning code decides to speak
up, I am going to have to analyze your chosen parameters and the pieces of
code in question to see if I can deduce anything. I might not be able to do
this until Monday - I am too tired to do it tonight (here in US Eastern
time) and have a busy weekend ahead of me.

I promise I will get back to you with some better answers if no one else
speaks up. In the meantime, if you want to test your hypothesis that the
number of threads correlates with increased memory use, I would recommend
using a smaller data set and seeing how memory use scales as you change the
number of threads.
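A minimal sketch of such a test (untested, and 'small' here is a
hypothetical base name for a graph built from a subset of your reads):
  for t in 1 2 4 8 16 24; do
      /usr/bin/time -v ./khmer/scripts/partition-graph.py \
          --threads $t --subset-size 1e4 small
  done
Comparing the 'Maximum resident set size' lines that GNU time reports should
show whether memory use grows with the thread count.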

Have a good weekend,
  Eric



On Fri, Apr 12, 2013 at 7:30 AM, Jens-Konrad Preem <jpreem at ut.ee> wrote:

>  On 04/11/2013 02:58 AM, Eric McDonald wrote:
>
> Forgot to reply to all; forwarding in case the answer helps anyone else on
> the list....
>
> ---------- Forwarded message ----------
> From: Eric McDonald <emcd.msu at gmail.com>
> Date: Wed, Apr 10, 2013 at 7:57 PM
> Subject: Re: [khmer] partition-graph memory requirements
> To: Jens-Konrad Preem <jpreem at ut.ee>
>
>
> Hi,
>
>  Sorry for the delayed reply.
>
>  Thanks for sharing your job scripts. I notice that you are specifying
> the 'vmem' resource. However, if PBS is also enforcing a limit on the 'mem'
> resource (physical memory), then you may be encountering that limit. Do you
> know what default value is assigned by your site's PBS server for the 'mem'
> resource?
>
>  Again, if you run:
>   qstat -f <job_id>
> you should be able to determine both the resources allocated for the job
> and how much the job is actually using. Please let us know the results of
> this command if you would like help interpreting them or figuring out how
> to change your PBS resource request, if necessary.
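> For instance, to pull out just the request and the current usage (a
> convenience filter; the exact field names can vary with the PBS/Torque
> version):
>   qstat -f <job_id> | grep -E 'Resource_List|resources_used'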
>
>  As a side note, smaller k-mer lengths mean that more k-mers are being
> extracted from each sequence. This means that the hash tables are being
> more densely populated, and that means you are more likely to need
> larger hash tables to avoid a significant false positive rate. But perhaps
> the better way to put it is that the amount of memory used by the hash
> tables themselves is independent of k-mer size: changing the k-mer length
> does not affect memory usage for many parts of khmer. (I would have to look
> more closely to see how this affects the partitioning code.)
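> (To make the density point concrete: a sequence of length L yields
> L - k + 1 k-mers, so a 100 bp read contributes 81 k-mers at k = 20 but
> only 69 at k = 32 - about 17% more entries hashed into the same-sized
> tables at the smaller k.)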
>
>  Hope that helps,
>   Eric
>
>
>
> On Wed, Apr 10, 2013 at 4:23 AM, Jens-Konrad Preem <jpreem at ut.ee> wrote:
>
>>  Hi,
>>
>> In an extreme act of foolishness I do seem to have lost my error logs. (I
>> have been messing with the different scripts here a lot, and so got rid of
>> some of the outputs in an ill-thought-out "housekeeping" event.)
>>
>> I am attaching a bunch of the PBS scripts that I used to get as far as I
>> am. I used a different script for most of the normalize-and-partition
>> pipeline, so I'd have time to look at the outputs and get a sense of the
>> time taken for each. The scripts are in the following order: supkhme
>> (normalize), suprem (filter-below), supload (load-graph), and finally
>> supart (partition-graph). (As you can see, I am trying to do the metagenome
>> analysis as per the guide.txt.)
>> All the previous scripts completed without complaint, producing the 5.2
>> GB "graafik" graph.
>>
>> The partition-graph step had failed a few times after running for an hour
>> or so, always with error messages concerning memory. The latest script there
>> requests 240 GB of memory, which is the maximum I can request in the near
>> future, and it still failed with an error message concerning memory.
>>
>> I am right now working on reproducing the error, so that I can then supply
>> you with the .log and .error files - though if no error occurs, so much the
>> better for me, of course.
>> I decided to try different k-values this time, as suggested by
>> https://khmer.readthedocs.org/en/latest/guide.html (20 for
>> normalization and 32 for partitioning); those should make the graph file
>> all the bigger. I had used the smaller ones to avoid running out of memory,
>> but as that doesn't seem to help, what the heck. ;D Right now I am at the
>> load-graph stage with the new set. As it will complete in a few hours, I'll
>> start the partition-graph run, and then we will see if it dies within an
>> hour. If so, I'll post a new set of scripts and logs.
>>
>> Thank you for your time,
>> Jens-Konrad
>>
>>
>>
>>
>> On 04/10/2013 04:18 AM, Eric McDonald wrote:
>>
>> Hi Jens-Konrad,
>>
>>  Sorry for the delayed response. (I was on vacation yesterday and hoping
>> that someone more familiar with the partitioning code would answer.)
>>
>>  My understanding of the code is that decreasing the subset size will
>> increase the number of partitions but will not change the overall graph
>> coverage. Therefore, I would not expect it to lower memory requirements.
>> (The overhead from additional partitions might raise them some, but I have
>> not analyzed the code deeply enough to say one way or another about that.)
>> As far as changing the number of threads goes, each thread does seem to
>> maintain a local list of traversed k-mers (hidden in the C++
>> implementation), but I do not yet know how much that would impact memory
>> usage. Have you tried using fewer threads?
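>> For instance, the same command as in your original message, with the
>> thread count dialed down:
>>   ./khmer/scripts/partition-graph.py --threads 4 --subset-size 1e4 graafik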
>>
>>  But, rather than guessing about causation, let's try to get some more
>> diagnostic information. Does the script die immediately? (How long does the
>> PBS job execute before failure?) Can you attach the output and error files
>> for a job, and also the job script? What does
>>   qstat -f <job_id>
>> (where <job_id> is the ID of your running job) tell you about memory usage?
>>
>>  Thanks,
>>   Eric
>>
>>
>>
>>
>> On Mon, Apr 8, 2013 at 3:34 AM, Jens-Konrad Preem <jpreem at ut.ee> wrote:
>>
>>> Hi,
>>> I am having trouble completing a partition-graph.py job.
>>> No matter the configuration, it seems to terminate with error messages
>>> hinting at low memory, etc.*
>>> Does lowering the subset size reduce memory use? What about lowering
>>> the number of parallel threads?
>>> The graafik.ht file is 5.2 GB; I had the script running as a PBS job
>>> with 240 GB of RAM allocated. (That's as much as I can get; maybe I'll have
>>> an opportunity next week to double it, but I wouldn't count on it.)
>>> Is it expected for the script to require so much RAM, or is there some
>>> bug or some misuse on my part? Would there be any configuration to get
>>> past this?
>>>
>>> Jens-Konrad Preem, MSc., University of Tartu
>>>
>>>
>>>
>>> * The latest configuration, after I decided to try a smaller subset size:
>>>   ./khmer/scripts/partition-graph.py --threads 24 --subset-size 1e4 graafik
>>> terminated with:
>>>   cannot allocate memory for thread-local data: ABORT
>>>
>>>
>>> _______________________________________________
>>> khmer mailing list
>>> khmer at lists.idyll.org
>>> http://lists.idyll.org/listinfo/khmer
>>>
>>
>>
>>
>>  --
>>  Eric McDonald
>> HPC/Cloud Software Engineer
>>   for the Institute for Cyber-Enabled Research (iCER)
>>   and the Laboratory for Genomics, Evolution, and Development (GED)
>> Michigan State University
>> P: 517-355-8733
>>
>>
>>   --
>> Jens-Konrad Preem, MSc, University of Tartu
>>
>>
>> _______________________________________________
>> khmer mailing list
>> khmer at lists.idyll.org
>> http://lists.idyll.org/listinfo/khmer
>>
>>
>
>
>  --
>  Eric McDonald
> HPC/Cloud Software Engineer
>   for the Institute for Cyber-Enabled Research (iCER)
>   and the Laboratory for Genomics, Evolution, and Development (GED)
> Michigan State University
> P: 517-355-8733
>
>
> _______________________________________________
> khmer mailing list
> khmer at lists.idyll.org
> http://lists.idyll.org/listinfo/khmer
>
>  OK.
> I am posting a failed run, complete with the PBS script, the error log, and
> qstat -f snapshots at different times.
> I find it weird that I managed to complete the test run on iowa-corn50M,
> which had an even larger graph file. Might the number of threads used be
> pumping up the memory? I used the sample commands from the web page for the
> corn data; those used at most 4 threads.
> Jens-Konrad Preem
>
> _______________________________________________
> khmer mailing list
> khmer at lists.idyll.org
> http://lists.idyll.org/listinfo/khmer
>
>


-- 
Eric McDonald
HPC/Cloud Software Engineer
  for the Institute for Cyber-Enabled Research (iCER)
  and the Laboratory for Genomics, Evolution, and Development (GED)
Michigan State University
P: 517-355-8733