[khmer] Fwd: partition-graph memory requirements
Jens-Konrad Preem
jpreem at ut.ee
Fri Apr 12 04:30:09 PDT 2013
On 04/11/2013 02:58 AM, Eric McDonald wrote:
> Forgot to reply to all, in case the answer will help anyone else on
> the list....
>
> ---------- Forwarded message ----------
> From: Eric McDonald <emcd.msu at gmail.com>
> Date: Wed, Apr 10, 2013 at 7:57 PM
> Subject: Re: [khmer] partition-graph memory requirements
> To: Jens-Konrad Preem <jpreem at ut.ee>
>
>
> Hi,
>
> Sorry for the delayed reply.
>
> Thanks for sharing your job scripts. I notice that you are specifying
> the 'vmem' resource. However, if PBS is also enforcing a limit on the
> 'mem' resource (physical memory), then you may be encountering that
> limit. Do you know what default value is assigned by your site's PBS
> server for the 'mem' resource?
>
> Again, if you run:
> qstat -f <job_id>
> you should be able to determine both the resources allocated for the
> job and how much the job is actually using. Please let us know the
> results of this command, if you would like help interpreting them and
> figuring out how to change your PBS resource request, if necessary.
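The check Eric describes can be scripted. Below is a minimal sketch in plain Python (not part of khmer or PBS; the field names follow the Torque-style qstat -f snapshots attached later in this thread) that pulls the memory fields out of a saved snapshot and converts them to gigabytes:

```python
# Minimal sketch, not part of khmer or PBS: parse saved `qstat -f`
# output and report the memory fields in GB, so the 'mem' and 'vmem'
# figures are easy to compare against the job's resource limits.

def parse_qstat(text):
    """Parse `qstat -f` output into a {field: value} dict.

    Wrapped continuation lines (e.g. long exec_host values) are
    ignored; the memory fields of interest fit on one line.
    """
    fields = {}
    for line in text.splitlines():
        if " = " in line:
            key, _, value = line.partition(" = ")
            fields[key.strip()] = value.strip()
    return fields

def kb_to_gb(size):
    """Convert a PBS size string such as '52379536kb' to GB (base 1024)."""
    return int(size.rstrip("kb")) / (1024 * 1024)

# Sample input, copied from the first snapshot in this thread.
snapshot = """\
Job Id: 19357.silinder.hpc.ut.ee
    resources_used.mem = 52379536kb
    resources_used.vmem = 53137536kb
    Resource_List.vmem = 180gb
"""

fields = parse_qstat(snapshot)
print(f"mem used:  {kb_to_gb(fields['resources_used.mem']):.1f} GB")
print(f"vmem used: {kb_to_gb(fields['resources_used.vmem']):.1f} GB")
```

With the first snapshot, this reports roughly 50 GB of physical memory in use only 15 minutes into the run.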
>
> As a side note, smaller k-mer lengths mean that more k-mers are
> extracted from each sequence, which means the hash tables are
> populated more densely, which in turn means you are more likely to
> need larger hash tables to avoid a significant false positive rate.
> But the more precise statement is that the amount of memory used by
> the hash tables is independent of k-mer size, so changing the k-mer
> length does not affect memory usage for many parts of khmer. (I
> would have to look more closely to see how this affects the
> partitioning code.)
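To make the first point concrete: a read of length L yields L - k + 1 k-mers, so a smaller k extracts more k-mers from each read. A quick illustration in plain Python (generic k-mer arithmetic, not khmer's implementation):

```python
# Generic k-mer arithmetic, not khmer's implementation: count the
# overlapping k-mers extracted from a single read for two k values.
def kmers(seq, k):
    """Return the list of overlapping k-mers in `seq`."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

read = "A" * 100  # a 100 bp read (the sequence content is irrelevant here)
print(len(kmers(read, 20)))  # 100 - 20 + 1 = 81 k-mers
print(len(kmers(read, 32)))  # 100 - 32 + 1 = 69 k-mers
```

For a 100 bp read, k=20 gives 81 k-mers while k=32 gives 69, which is why the same reads fill the hash tables more densely at smaller k.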
>
> Hope that helps,
> Eric
>
>
>
> On Wed, Apr 10, 2013 at 4:23 AM, Jens-Konrad Preem
> <jpreem at ut.ee> wrote:
>
> Hi,
>
> In an extreme act of foolishness, I seem to have lost my error
> logs. (I have been messing with the different scripts here a lot
> and got rid of some of the outputs in an ill-thought-out
> "housekeeping" event.)
>
> I have attached the PBS scripts I used to get as far as I am. I
> used a separate script for each stage of the normalize-and-partition
> pipeline, so I would have time to look at the outputs and get a
> sense of the time taken by each. The scripts are, in order:
> supkhme (normalize), suprem (filter-below), supload (load-graph),
> and finally supart (partition-graph). (As you can see, I am trying
> to do the metagenome analysis as per the guide.txt.) All the
> earlier scripts completed without complaint, producing the 5.2 GB
> "graafik" graph file.
>
> The partition-graph step has failed a few times after running for
> an hour or so, always with error messages concerning memory. The
> latest script requests 240 GB of memory, which is the maximum I can
> request in the near future, and it still failed with a memory
> error.
>
> I am now working on reproducing the error so that I can supply you
> with the .log and .error files; of course, if no error occurs, so
> much the better for me.
> This time I decided to try the k-values suggested by
> https://khmer.readthedocs.org/en/latest/guide.html (20 for
> normalization and 32 for partitioning). Those should make the graph
> file even bigger; I had been using smaller ones to avoid running
> out of memory, but since that does not seem to help, what the heck.
> ;D Right now I am at the load-graph stage with the new set. As it
> will complete in a few hours, I will start the partition-graph run,
> and then we will see whether it dies within an hour. If so, I will
> post a new set of scripts and logs.
>
> Thank you for your time,
> Jens-Konrad
>
>
>
>
> On 04/10/2013 04:18 AM, Eric McDonald wrote:
>> Hi Jens-Konrad,
>>
>> Sorry for the delayed response. (I was on vacation yesterday and
>> hoping that someone more familiar with the partitioning code
>> would answer.)
>>
>> My understanding of the code is that decreasing the subset size
>> will increase the number of partitions but will not change the
>> overall graph coverage. Therefore, I would not expect it to lower
>> memory requirements. (The overhead from additional partitions
>> might raise them some, but I have not analyzed the code deeply
>> enough to say one way or another about that.) As far as changing
>> the number of threads goes, each thread does seem to maintain a
>> local list of traversed k-mers (hidden in the C++ implementation)
>> but I do not yet know how much that would impact memory usage.
>> Have you tried using fewer threads?
>>
>> But, rather than guessing about causation, let's try to get some
>> more diagnostic information. Does the script die immediately?
>> (How long does the PBS job execute before failure?) Can you
>> attach the output and error files for a job, and also the job
>> script? What does
>> qstat -f <job_id>
>> where <job_id> is the ID of your running job, tell you about
>> memory usage?
>>
>> Thanks,
>> Eric
>>
>>
>>
>>
>> On Mon, Apr 8, 2013 at 3:34 AM, Jens-Konrad Preem
>> <jpreem at ut.ee> wrote:
>>
>> Hi,
>> I am having trouble completing a partition-graph.py job.
>> No matter the configuration, it seems to terminate with error
>> messages hinting at low memory, etc.*
>> Does lowering the subset size reduce the memory use? What about
>> lowering the number of parallel threads?
>> The graafik.ht file is 5.2 GB large; I had the script running as
>> a PBS job with 240 GB of RAM allocated. (That is as much as I can
>> get; maybe I will have an opportunity next week to double it, but
>> I would not count on it.)
>> Is it expected for the script to require so much RAM, or is there
>> some bug, or some misuse on my part? Is there any configuration
>> that would get past this?
>>
>> Jens-Konrad Preem, MSc., University of Tartu
>>
>>
>>
>> * the latest configuration, after I settled on a smaller subset size:
>> ./khmer/scripts/partition-graph.py --threads 24
>> --subset-size 1e4 graafik
>> terminated with
>> cannot allocate memory for thread-local data: ABORT
>>
>>
>> _______________________________________________
>> khmer mailing list
>> khmer at lists.idyll.org
>> http://lists.idyll.org/listinfo/khmer
>>
>>
>>
>>
>> --
>> Eric McDonald
>> HPC/Cloud Software Engineer
>> for the Institute for Cyber-Enabled Research (iCER)
>> and the Laboratory for Genomics, Evolution, and Development (GED)
>> Michigan State University
>> P: 517-355-8733
>
> --
> Jens-Konrad Preem, MSc, University of Tartu
>
>
>
>
>
>
>
>
OK.
I am posting a failed run, complete with the PBS script, the error
log, and qstat -f snapshots taken at different times.
I find it odd that I managed to complete the test run on iowa-corn50M,
which had an even larger graph file. Might the number of threads be
what drives up the memory? For the corn data I used the sample
commands from the web page, and those used at most 4 threads.
Jens-Konrad Preem
-------------- next part --------------
Job Id: 19357.silinder.hpc.ut.ee
Job_Name = partition-graph
Job_Owner = jpreem at silinder.hpc.ut.ee
resources_used.cput = 01:51:31
resources_used.mem = 52379536kb
resources_used.vmem = 53137536kb
resources_used.walltime = 00:15:12
job_state = R
queue = regular
server = silinder.hpc.ut.ee
Checkpoint = u
ctime = Fri Apr 12 12:06:29 2013
Error_Path = silinder.hpc.ut.ee:/tmp/jpreem/partition-graph.e19357
exec_host = silinder.hpc.ut.ee/18+silinder.hpc.ut.ee/17+silinder.hpc.ut.ee
/15+silinder.hpc.ut.ee/13+silinder.hpc.ut.ee/10+silinder.hpc.ut.ee/7+s
ilinder.hpc.ut.ee/5+silinder.hpc.ut.ee/3
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = abe
Mail_Users = jpreem at ut.ee
mtime = Fri Apr 12 12:06:35 2013
Output_Path = silinder.hpc.ut.ee:/tmp/jpreem/partition-graph.o19357
Priority = 0
qtime = Fri Apr 12 12:06:29 2013
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=8
Resource_List.vmem = 180gb
Resource_List.walltime = 08:00:00
session_id = 38881
Variable_List = PBS_O_QUEUE=regular,PBS_O_HOST=silinder.hpc.ut.ee,
PBS_O_HOME=/home/murakas/j/jpreem,PBS_O_LANG=en_US.utf8,
PBS_O_LOGNAME=jpreem,
PBS_O_PATH=/usr/lib64/qt-3.3/bin:/storage/software/bin:/usr/local/bin
:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin,
PBS_O_MAIL=/var/spool/mail/jpreem,PBS_O_SHELL=/bin/bash,
PBS_SERVER=silinder.hpc.ut.ee,PBS_O_INITDIR=/tmp/jpreem/,
PBS_O_WORKDIR=/tmp/jpreem
etime = Fri Apr 12 12:06:29 2013
submit_args = supart
start_time = Fri Apr 12 12:06:35 2013
Walltime.Remaining = 27840
start_count = 1
fault_tolerant = False
submit_host = silinder.hpc.ut.ee
init_work_dir = /tmp/jpreem
-------------- next part --------------
Job Id: 19357.silinder.hpc.ut.ee
Job_Name = partition-graph
Job_Owner = jpreem at silinder.hpc.ut.ee
resources_used.cput = 03:56:58
resources_used.mem = 90676068kb
resources_used.vmem = 92065952kb
resources_used.walltime = 00:30:57
job_state = R
queue = regular
server = silinder.hpc.ut.ee
Checkpoint = u
ctime = Fri Apr 12 12:06:29 2013
Error_Path = silinder.hpc.ut.ee:/tmp/jpreem/partition-graph.e19357
exec_host = silinder.hpc.ut.ee/18+silinder.hpc.ut.ee/17+silinder.hpc.ut.ee
/15+silinder.hpc.ut.ee/13+silinder.hpc.ut.ee/10+silinder.hpc.ut.ee/7+s
ilinder.hpc.ut.ee/5+silinder.hpc.ut.ee/3
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = abe
Mail_Users = jpreem at ut.ee
mtime = Fri Apr 12 12:06:35 2013
Output_Path = silinder.hpc.ut.ee:/tmp/jpreem/partition-graph.o19357
Priority = 0
qtime = Fri Apr 12 12:06:29 2013
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=8
Resource_List.vmem = 180gb
Resource_List.walltime = 08:00:00
session_id = 38881
Variable_List = PBS_O_QUEUE=regular,PBS_O_HOST=silinder.hpc.ut.ee,
PBS_O_HOME=/home/murakas/j/jpreem,PBS_O_LANG=en_US.utf8,
PBS_O_LOGNAME=jpreem,
PBS_O_PATH=/usr/lib64/qt-3.3/bin:/storage/software/bin:/usr/local/bin
:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin,
PBS_O_MAIL=/var/spool/mail/jpreem,PBS_O_SHELL=/bin/bash,
PBS_SERVER=silinder.hpc.ut.ee,PBS_O_INITDIR=/tmp/jpreem/,
PBS_O_WORKDIR=/tmp/jpreem
etime = Fri Apr 12 12:06:29 2013
submit_args = supart
start_time = Fri Apr 12 12:06:35 2013
Walltime.Remaining = 26922
start_count = 1
fault_tolerant = False
submit_host = silinder.hpc.ut.ee
init_work_dir = /tmp/jpreem
-------------- next part --------------
Job Id: 19357.silinder.hpc.ut.ee
Job_Name = partition-graph
Job_Owner = jpreem at silinder.hpc.ut.ee
resources_used.cput = 05:50:32
resources_used.mem = 122543188kb
resources_used.vmem = 123662496kb
resources_used.walltime = 00:45:12
job_state = R
queue = regular
server = silinder.hpc.ut.ee
Checkpoint = u
ctime = Fri Apr 12 12:06:29 2013
Error_Path = silinder.hpc.ut.ee:/tmp/jpreem/partition-graph.e19357
exec_host = silinder.hpc.ut.ee/18+silinder.hpc.ut.ee/17+silinder.hpc.ut.ee
/15+silinder.hpc.ut.ee/13+silinder.hpc.ut.ee/10+silinder.hpc.ut.ee/7+s
ilinder.hpc.ut.ee/5+silinder.hpc.ut.ee/3
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = abe
Mail_Users = jpreem at ut.ee
mtime = Fri Apr 12 12:06:35 2013
Output_Path = silinder.hpc.ut.ee:/tmp/jpreem/partition-graph.o19357
Priority = 0
qtime = Fri Apr 12 12:06:29 2013
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=8
Resource_List.vmem = 180gb
Resource_List.walltime = 08:00:00
session_id = 38881
Variable_List = PBS_O_QUEUE=regular,PBS_O_HOST=silinder.hpc.ut.ee,
PBS_O_HOME=/home/murakas/j/jpreem,PBS_O_LANG=en_US.utf8,
PBS_O_LOGNAME=jpreem,
PBS_O_PATH=/usr/lib64/qt-3.3/bin:/storage/software/bin:/usr/local/bin
:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin,
PBS_O_MAIL=/var/spool/mail/jpreem,PBS_O_SHELL=/bin/bash,
PBS_SERVER=silinder.hpc.ut.ee,PBS_O_INITDIR=/tmp/jpreem/,
PBS_O_WORKDIR=/tmp/jpreem
etime = Fri Apr 12 12:06:29 2013
submit_args = supart
start_time = Fri Apr 12 12:06:35 2013
Walltime.Remaining = 26044
start_count = 1
fault_tolerant = False
submit_host = silinder.hpc.ut.ee
init_work_dir = /tmp/jpreem
-------------- next part --------------
cannot allocate memory for thread-local data: ABORT
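For what it is worth, the three snapshots above show vmem growing roughly linearly with walltime. A quick extrapolation in plain Python (the figures are copied from the resources_used.vmem and resources_used.walltime fields above; linear growth is an assumption, though the three points fit it reasonably well) puts the 180 GB vmem limit a bit past the one-hour mark, which fits the earlier report of the job dying after running an hour or so:

```python
# Back-of-the-envelope extrapolation from the qstat -f snapshots
# above: vmem usage (kb) at three walltimes (minutes), copied from
# resources_used.vmem / resources_used.walltime.
samples = [
    (15 + 12 / 60, 53137536),   # walltime 00:15:12
    (30 + 57 / 60, 92065952),   # walltime 00:30:57
    (45 + 12 / 60, 123662496),  # walltime 00:45:12
]
limit_kb = 180 * 1024 * 1024    # Resource_List.vmem = 180gb

# Growth rate between the first and last snapshots, in kb per minute.
(t0, v0), (t2, v2) = samples[0], samples[-1]
rate = (v2 - v0) / (t2 - t0)

# Walltime at which vmem would cross the 180 GB limit, assuming the
# growth stays linear.
eta = t2 + (limit_kb - v2) / rate
print(f"growth: {rate / 1024 / 1024:.2f} GB/min")
print(f"limit reached at ~{eta:.0f} min of walltime")
```

With the figures above, this works out to a growth rate of about 2.24 GB per minute, crossing the 180 GB limit at roughly the 73-minute mark; the job appears to be steadily filling its allocation rather than spiking suddenly.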
-------------- next part --------------
#PBS -N partition-graph
#PBS -l nodes=1:ppn=8
#PBS -l vmem=180gb
#PBS -l walltime=8:00:00
# Mail is sent at the start of the job and after it finishes
#PBS -M jpreem at ut.ee
#PBS -m abe
# Set the job's home directory to your directory under /storage/hpchome/<kasutajanimi>.
# After entering the correct directory, remove the extra # from the start of the line
#PBS -d /tmp/jpreem/
# Write your commands here
source activate
./khmer/scripts/partition-graph.py --threads 8 --subset-size 1e6 graafik