[khmer] Fwd: partition-graph memory requirements
Jens-Konrad Preem
jpreem at ut.ee
Fri Apr 12 04:30:09 PDT 2013
On 04/11/2013 02:58 AM, Eric McDonald wrote:
> Forgot to reply to all, in case the answer will help anyone else on
> the list....
>
> ---------- Forwarded message ----------
> From: Eric McDonald <emcd.msu at gmail.com>
> Date: Wed, Apr 10, 2013 at 7:57 PM
> Subject: Re: [khmer] partition-graph memory requirements
> To: Jens-Konrad Preem <jpreem at ut.ee>
>
>
> Hi,
>
> Sorry for the delayed reply.
>
> Thanks for sharing your job scripts. I notice that you are specifying
> the 'vmem' resource. However, if PBS is also enforcing a limit on the
> 'mem' resource (physical memory), then you may be encountering that
> limit. Do you know what default value is assigned by your site's PBS
> server for the 'mem' resource?
>
> Again, if you run:
> qstat -f <job_id>
> you should be able to determine both the resources allocated for the
> job and how much the job is actually using. Please let us know the
> results of this command, if you would like help interpreting them and
> figuring out how to change your PBS resource request, if necessary.
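The check Eric describes can be scripted. Below is a minimal sketch in plain Python (not part of khmer or PBS; the field names follow the Torque-style qstat -f snapshots attached later in this thread) that pulls the memory fields out of a saved snapshot and converts them to gigabytes:

```python
# Minimal sketch, not part of khmer or PBS: parse saved `qstat -f`
# output and report the memory fields in GB, so the 'mem' and 'vmem'
# figures are easy to compare against the job's resource limits.

def parse_qstat(text):
    """Parse `qstat -f` output into a {field: value} dict.

    Wrapped continuation lines (e.g. long exec_host values) are
    ignored; the memory fields of interest fit on one line.
    """
    fields = {}
    for line in text.splitlines():
        if " = " in line:
            key, _, value = line.partition(" = ")
            fields[key.strip()] = value.strip()
    return fields

def kb_to_gb(size):
    """Convert a PBS size string such as '52379536kb' to GB (base 1024)."""
    return int(size.rstrip("kb")) / (1024 * 1024)

# Sample input, copied from the first snapshot in this thread.
snapshot = """\
Job Id: 19357.silinder.hpc.ut.ee
    resources_used.mem = 52379536kb
    resources_used.vmem = 53137536kb
    Resource_List.vmem = 180gb
"""

fields = parse_qstat(snapshot)
print(f"mem used:  {kb_to_gb(fields['resources_used.mem']):.1f} GB")
print(f"vmem used: {kb_to_gb(fields['resources_used.vmem']):.1f} GB")
```

With the first snapshot, this reports roughly 50 GB of physical memory in use only 15 minutes into the run.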
>
> As a side note, smaller k-mer lengths mean that more k-mers are
> extracted from each sequence, which means the hash tables are
> populated more densely, which in turn means you are more likely to
> need larger hash tables to avoid a significant false positive rate.
> But the more precise statement is that the amount of memory used by
> the hash tables is independent of k-mer size, so changing the k-mer
> length does not affect memory usage for many parts of khmer. (I
> would have to look more closely to see how this affects the
> partitioning code.)
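To make the first point concrete: a read of length L yields L - k + 1 k-mers, so a smaller k extracts more k-mers from each read. A quick illustration in plain Python (generic k-mer arithmetic, not khmer's implementation):

```python
# Generic k-mer arithmetic, not khmer's implementation: count the
# overlapping k-mers extracted from a single read for two k values.
def kmers(seq, k):
    """Return the list of overlapping k-mers in `seq`."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

read = "A" * 100  # a 100 bp read (the sequence content is irrelevant here)
print(len(kmers(read, 20)))  # 100 - 20 + 1 = 81 k-mers
print(len(kmers(read, 32)))  # 100 - 32 + 1 = 69 k-mers
```

For a 100 bp read, k=20 gives 81 k-mers while k=32 gives 69, which is why the same reads fill the hash tables more densely at smaller k.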
>
> Hope that helps,
> Eric
>
>
>
> On Wed, Apr 10, 2013 at 4:23 AM, Jens-Konrad Preem
> <jpreem at ut.ee> wrote:
>
> Hi,
>
> In an extreme act of foolishness, I seem to have lost my error
> logs. (I have been messing with the different scripts here a lot
> and got rid of some of the outputs in an ill-thought-out
> "housekeeping" event.)
>
> I have attached the PBS scripts I used to get as far as I am. I
> used a separate script for each stage of the normalize-and-partition
> pipeline, so I would have time to look at the outputs and get a
> sense of the time taken by each. The scripts are, in order:
> supkhme (normalize), suprem (filter-below), supload (load-graph),
> and finally supart (partition-graph). (As you can see, I am trying
> to do the metagenome analysis as per the guide.txt.) All the
> earlier scripts completed without complaint, producing the 5.2 GB
> "graafik" graph file.
>
> The partition-graph step has failed a few times after running for
> an hour or so, always with error messages concerning memory. The
> latest script requests 240 GB of memory, which is the maximum I can
> request in the near future, and it still failed with a memory
> error.
>
> I am now working on reproducing the error so that I can supply you
> with the .log and .error files; of course, if no error occurs, so
> much the better for me.
> This time I decided to try the k-values suggested by
> https://khmer.readthedocs.org/en/latest/guide.html (20 for
> normalization and 32 for partitioning). Those should make the graph
> file even bigger; I had been using smaller ones to avoid running
> out of memory, but since that does not seem to help, what the heck.
> ;D Right now I am at the load-graph stage with the new set. As it
> will complete in a few hours, I will start the partition-graph run,
> and then we will see whether it dies within an hour. If so, I will
> post a new set of scripts and logs.
>
> Thank you for your time,
> Jens-Konrad
>
>
>
>
> On 04/10/2013 04:18 AM, Eric McDonald wrote:
>> Hi Jens-Konrad,
>>
>> Sorry for the delayed response. (I was on vacation yesterday and
>> hoping that someone more familiar with the partitioning code
>> would answer.)
>>
>> My understanding of the code is that decreasing the subset size
>> will increase the number of partitions but will not change the
>> overall graph coverage. Therefore, I would not expect it to lower
>> memory requirements. (The overhead from additional partitions
>> might raise them some, but I have not analyzed the code deeply
>> enough to say one way or another about that.) As far as changing
>> the number of threads goes, each thread does seem to maintain a
>> local list of traversed k-mers (hidden in the C++ implementation)
>> but I do not yet know how much that would impact memory usage.
>> Have you tried using fewer threads?
>>
>> But, rather than guessing about causation, let's try to get some
>> more diagnostic information. Does the script die immediately?
>> (How long does the PBS job execute before failure?) Can you
>> attach the output and error files for a job, and also the job
>> script? What does
>> qstat -f <job_id>
>> where <job_id> is the ID of your running job, tell you about
>> memory usage?
>>
>> Thanks,
>> Eric
>>
>>
>>
>>
>> On Mon, Apr 8, 2013 at 3:34 AM, Jens-Konrad Preem
>> <jpreem at ut.ee> wrote:
>>
>> Hi,
>> I am having trouble completing a partition-graph.py job.
>> No matter the configuration, it seems to terminate with error
>> messages hinting at low memory, etc.*
>> Does lowering the subset size reduce the memory use? What about
>> lowering the number of parallel threads?
>> The graafik.ht file is 5.2 GB large; I had the script running as
>> a PBS job with 240 GB of RAM allocated. (That is as much as I can
>> get; maybe I will have an opportunity next week to double it, but
>> I would not count on it.)
>> Is it expected for the script to require so much RAM, or is there
>> some bug, or some misuse on my part? Is there any configuration
>> that would get past this?
>>
>> Jens-Konrad Preem, MSc., University of Tartu
>>
>>
>>
>> * the latest configuration, after I settled on a smaller subset size:
>> ./khmer/scripts/partition-graph.py --threads 24
>> --subset-size 1e4 graafik
>> terminated with
>> cannot allocate memory for thread-local data: ABORT
>>
>>
>> _______________________________________________
>> khmer mailing list
>> khmer at lists.idyll.org
>> http://lists.idyll.org/listinfo/khmer
>>
>>
>>
>>
>> --
>> Eric McDonald
>> HPC/Cloud Software Engineer
>> for the Institute for Cyber-Enabled Research (iCER)
>> and the Laboratory for Genomics, Evolution, and Development (GED)
>> Michigan State University
>> P: 517-355-8733
>
> --
> Jens-Konrad Preem, MSc, University of Tartu
>
>
>
>
>
>
>
>
OK.
I am posting a failed run, complete with the PBS script, the error
log, and qstat -f snapshots taken at different times.
I find it odd that I managed to complete the test run on iowa-corn50M,
which had an even larger graph file. Might the number of threads be
what drives up the memory? For the corn data I used the sample
commands from the web page, and those used at most 4 threads.
Jens-Konrad Preem
-------------- next part --------------
Job Id: 19357.silinder.hpc.ut.ee
Job_Name = partition-graph
Job_Owner = jpreem at silinder.hpc.ut.ee
resources_used.cput = 01:51:31
resources_used.mem = 52379536kb
resources_used.vmem = 53137536kb
resources_used.walltime = 00:15:12
job_state = R
queue = regular
server = silinder.hpc.ut.ee
Checkpoint = u
ctime = Fri Apr 12 12:06:29 2013
Error_Path = silinder.hpc.ut.ee:/tmp/jpreem/partition-graph.e19357
exec_host = silinder.hpc.ut.ee/18+silinder.hpc.ut.ee/17+silinder.hpc.ut.ee
/15+silinder.hpc.ut.ee/13+silinder.hpc.ut.ee/10+silinder.hpc.ut.ee/7+s
ilinder.hpc.ut.ee/5+silinder.hpc.ut.ee/3
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = abe
Mail_Users = jpreem at ut.ee
mtime = Fri Apr 12 12:06:35 2013
Output_Path = silinder.hpc.ut.ee:/tmp/jpreem/partition-graph.o19357
Priority = 0
qtime = Fri Apr 12 12:06:29 2013
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=8
Resource_List.vmem = 180gb
Resource_List.walltime = 08:00:00
session_id = 38881
Variable_List = PBS_O_QUEUE=regular,PBS_O_HOST=silinder.hpc.ut.ee,
PBS_O_HOME=/home/murakas/j/jpreem,PBS_O_LANG=en_US.utf8,
PBS_O_LOGNAME=jpreem,
PBS_O_PATH=/usr/lib64/qt-3.3/bin:/storage/software/bin:/usr/local/bin
:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin,
PBS_O_MAIL=/var/spool/mail/jpreem,PBS_O_SHELL=/bin/bash,
PBS_SERVER=silinder.hpc.ut.ee,PBS_O_INITDIR=/tmp/jpreem/,
PBS_O_WORKDIR=/tmp/jpreem
etime = Fri Apr 12 12:06:29 2013
submit_args = supart
start_time = Fri Apr 12 12:06:35 2013
Walltime.Remaining = 27840
start_count = 1
fault_tolerant = False
submit_host = silinder.hpc.ut.ee
init_work_dir = /tmp/jpreem
-------------- next part --------------
Job Id: 19357.silinder.hpc.ut.ee
Job_Name = partition-graph
Job_Owner = jpreem at silinder.hpc.ut.ee
resources_used.cput = 03:56:58
resources_used.mem = 90676068kb
resources_used.vmem = 92065952kb
resources_used.walltime = 00:30:57
job_state = R
queue = regular
server = silinder.hpc.ut.ee
Checkpoint = u
ctime = Fri Apr 12 12:06:29 2013
Error_Path = silinder.hpc.ut.ee:/tmp/jpreem/partition-graph.e19357
exec_host = silinder.hpc.ut.ee/18+silinder.hpc.ut.ee/17+silinder.hpc.ut.ee
/15+silinder.hpc.ut.ee/13+silinder.hpc.ut.ee/10+silinder.hpc.ut.ee/7+s
ilinder.hpc.ut.ee/5+silinder.hpc.ut.ee/3
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = abe
Mail_Users = jpreem at ut.ee
mtime = Fri Apr 12 12:06:35 2013
Output_Path = silinder.hpc.ut.ee:/tmp/jpreem/partition-graph.o19357
Priority = 0
qtime = Fri Apr 12 12:06:29 2013
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=8
Resource_List.vmem = 180gb
Resource_List.walltime = 08:00:00
session_id = 38881
Variable_List = PBS_O_QUEUE=regular,PBS_O_HOST=silinder.hpc.ut.ee,
PBS_O_HOME=/home/murakas/j/jpreem,PBS_O_LANG=en_US.utf8,
PBS_O_LOGNAME=jpreem,
PBS_O_PATH=/usr/lib64/qt-3.3/bin:/storage/software/bin:/usr/local/bin
:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin,
PBS_O_MAIL=/var/spool/mail/jpreem,PBS_O_SHELL=/bin/bash,
PBS_SERVER=silinder.hpc.ut.ee,PBS_O_INITDIR=/tmp/jpreem/,
PBS_O_WORKDIR=/tmp/jpreem
etime = Fri Apr 12 12:06:29 2013
submit_args = supart
start_time = Fri Apr 12 12:06:35 2013
Walltime.Remaining = 26922
start_count = 1
fault_tolerant = False
submit_host = silinder.hpc.ut.ee
init_work_dir = /tmp/jpreem
-------------- next part --------------
Job Id: 19357.silinder.hpc.ut.ee
Job_Name = partition-graph
Job_Owner = jpreem at silinder.hpc.ut.ee
resources_used.cput = 05:50:32
resources_used.mem = 122543188kb
resources_used.vmem = 123662496kb
resources_used.walltime = 00:45:12
job_state = R
queue = regular
server = silinder.hpc.ut.ee
Checkpoint = u
ctime = Fri Apr 12 12:06:29 2013
Error_Path = silinder.hpc.ut.ee:/tmp/jpreem/partition-graph.e19357
exec_host = silinder.hpc.ut.ee/18+silinder.hpc.ut.ee/17+silinder.hpc.ut.ee
/15+silinder.hpc.ut.ee/13+silinder.hpc.ut.ee/10+silinder.hpc.ut.ee/7+s
ilinder.hpc.ut.ee/5+silinder.hpc.ut.ee/3
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = abe
Mail_Users = jpreem at ut.ee
mtime = Fri Apr 12 12:06:35 2013
Output_Path = silinder.hpc.ut.ee:/tmp/jpreem/partition-graph.o19357
Priority = 0
qtime = Fri Apr 12 12:06:29 2013
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=8
Resource_List.vmem = 180gb
Resource_List.walltime = 08:00:00
session_id = 38881
Variable_List = PBS_O_QUEUE=regular,PBS_O_HOST=silinder.hpc.ut.ee,
PBS_O_HOME=/home/murakas/j/jpreem,PBS_O_LANG=en_US.utf8,
PBS_O_LOGNAME=jpreem,
PBS_O_PATH=/usr/lib64/qt-3.3/bin:/storage/software/bin:/usr/local/bin
:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin,
PBS_O_MAIL=/var/spool/mail/jpreem,PBS_O_SHELL=/bin/bash,
PBS_SERVER=silinder.hpc.ut.ee,PBS_O_INITDIR=/tmp/jpreem/,
PBS_O_WORKDIR=/tmp/jpreem
etime = Fri Apr 12 12:06:29 2013
submit_args = supart
start_time = Fri Apr 12 12:06:35 2013
Walltime.Remaining = 26044
start_count = 1
fault_tolerant = False
submit_host = silinder.hpc.ut.ee
init_work_dir = /tmp/jpreem
-------------- next part --------------
cannot allocate memory for thread-local data: ABORT
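For what it is worth, the three snapshots above show vmem growing roughly linearly with walltime. A quick extrapolation in plain Python (the figures are copied from the resources_used.vmem and resources_used.walltime fields above; linear growth is an assumption, though the three points fit it reasonably well) puts the 180 GB vmem limit a bit past the one-hour mark, which fits the earlier report of the job dying after running an hour or so:

```python
# Back-of-the-envelope extrapolation from the qstat -f snapshots
# above: vmem usage (kb) at three walltimes (minutes), copied from
# resources_used.vmem / resources_used.walltime.
samples = [
    (15 + 12 / 60, 53137536),   # walltime 00:15:12
    (30 + 57 / 60, 92065952),   # walltime 00:30:57
    (45 + 12 / 60, 123662496),  # walltime 00:45:12
]
limit_kb = 180 * 1024 * 1024    # Resource_List.vmem = 180gb

# Growth rate between the first and last snapshots, in kb per minute.
(t0, v0), (t2, v2) = samples[0], samples[-1]
rate = (v2 - v0) / (t2 - t0)

# Walltime at which vmem would cross the 180 GB limit, assuming the
# growth stays linear.
eta = t2 + (limit_kb - v2) / rate
print(f"growth: {rate / 1024 / 1024:.2f} GB/min")
print(f"limit reached at ~{eta:.0f} min of walltime")
```

With the figures above, this works out to a growth rate of about 2.24 GB per minute, crossing the 180 GB limit at roughly the 73-minute mark; the job appears to be steadily filling its allocation rather than spiking suddenly.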
-------------- next part --------------
#PBS -N partition-graph
#PBS -l nodes=1:ppn=8
#PBS -l vmem=180gb
#PBS -l walltime=8:00:00
# Mail is sent at the start of the job and after it finishes
#PBS -M jpreem at ut.ee
#PBS -m abe
# Set the job's home directory to your directory under /storage/hpchome/<kasutajanimi>.
# After entering the correct directory, remove the extra # from the start of the line
#PBS -d /tmp/jpreem/
# Write your commands here
source activate
./khmer/scripts/partition-graph.py --threads 8 --subset-size 1e6 graafik