[khmer] Duration of do-partition.py (very long !) (Alexis Groppi)

Eric McDonald emcd.msu at gmail.com
Thu Mar 21 05:50:45 PDT 2013


Thanks for the information, Alexis. If you are using 20 threads, then 441
hours of CPU time divided by 20 is about 22 hours, which matches your elapsed
(wall-clock) time. So it appears that all of the threads are working. (There
is the possibility that they could be busy-waiting somewhere, but I didn't see
any explicit opportunities for that when reading the 'do-partition.py' code.)
Since you haven't seen any .pmap files yet and since multithreaded execution
is occurring, I expect that execution is currently at the following place in
the script:

https://github.com/ged-lab/khmer/blob/bleeding-edge/scripts/do-partition.py#L57
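
In case it is useful, here is a rough sketch of the arithmetic above, i.e.
estimating how many threads are effectively busy from accumulated CPU time
versus wall-clock time. It is not khmer code; it only assumes a PBS/Torque-style
'qstat -f' whose output contains the resources_used fields you quoted.

#!/usr/bin/env python
# Rough sketch (illustrative, not part of khmer): estimate effective thread
# utilization for a batch job by comparing accumulated CPU time against
# wall-clock time, as reported by 'qstat -f <job-id>' on PBS/Torque.
import subprocess
import sys

def to_hours(hhmmss):
    # Convert an "HH:MM:SS" string (hours may exceed 24) to hours.
    h, m, s = (int(x) for x in hhmmss.split(':'))
    return h + m / 60.0 + s / 3600.0

job_id = sys.argv[1]
fields = {}
for line in subprocess.check_output(['qstat', '-f', job_id]).decode().splitlines():
    if '=' in line:
        key, _, value = line.partition('=')
        fields[key.strip()] = value.strip()

cput = to_hours(fields['resources_used.cput'])
wall = to_hours(fields['resources_used.walltime'])
print('CPU time %.1f h / walltime %.1f h => ~%.1f busy threads'
      % (cput, wall, cput / wall))

For your job this should report roughly 441 / 22, i.e. about 20 busy threads.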

I am not familiar with the 'do_subset_partition' code, but will try to
analyze it later today. However, I would also listen to what Adina is
saying - this step may just take a long time....

Eric

P.S. If you want to check on the output from the script, you could look in
/var/spool/PBS/mom_priv (or equivalent) on the node where the job is running
to see what the spooled output looks like so far. (There should be a file
named with the job ID and either a ".ER" or ".OU" extension, if I recall
correctly, though it has been a while since I administered your kind of batch
system.) You may need David to do this, as the permissions on that directory
are typically restrictive.
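
Something along these lines could list whatever has been spooled so far; the
spool directory and file-naming pattern below are assumptions ("or
equivalent"), so adjust them for your site:

# Rough sketch (illustrative): look for the spooled stdout/stderr of a running
# PBS/Torque job on its execution node. The directory and naming convention
# are guesses and vary by site; reading them usually needs elevated
# permissions, hence possibly David's help.
import glob
import os

job_id = '12345'                       # replace with your actual job ID
spool_dir = '/var/spool/PBS/mom_priv'  # or wherever your site keeps it

for suffix in ('OU', 'ER'):
    for pattern in (os.path.join(spool_dir, '%s*.%s' % (job_id, suffix)),
                    os.path.join(spool_dir, '*', '%s*.%s' % (job_id, suffix))):
        for path in glob.glob(pattern):
            print('%s (%d bytes)' % (path, os.path.getsize(path)))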



On Thu, Mar 21, 2013 at 5:40 AM, Alexis Groppi <alexis.groppi at u-bordeaux2.fr
> wrote:

>  One clarification:
>
> The file submitted to the script do-partition.py contains 2576771 reads
> (file.below)
> The job was launched with the following options:
> khmer-BETA/scripts/do-partition.py -k 20 -x 1e9 -T 20 file.graphbase
> file.below
>
> Alexis
>
>
> On 21/03/2013 10:13, Alexis Groppi wrote:
>
> Hi Eric,
>
> The script do-partition.py has now been running for 22 hours.
> Only the file.info has been generated. No .pmap files have been created.
>
> qstat -f gives:
>     resources_used.cput = 441:04:21
>     resources_used.mem = 12764228kb
>     resources_used.vmem = 13926732kb
>     resources_used.walltime = 22:05:56
>
> The server has 256 GB of RAM and 256 GB of swap space.
>
> Your opinion?
>
> Thanks
>
> Alexis
>
> On 20/03/2013 16:43, Alexis Groppi wrote:
>
> Hi Eric,
>
> Actually, the previous job was terminated because it hit the walltime limit.
> I relaunched the script.
> qstat -fr gives:
>     resources_used.cput = 93:23:08
>     resources_used.mem = 12341932kb
>     resources_used.vmem = 13271372kb
>     resources_used.walltime = 04:42:39
>
> At this moment only the file.info has been generated.
>
> Let's wait and see ...
>
> Thanks again
>
> Alexis
>
>
> On 19/03/2013 21:50, Eric McDonald wrote:
>
> Hi Alexis,
>
>  What does:
>   qstat -f <job-id>
> where <job-id> is the ID of your job tell you for the following fields:
>   resources_used.cput
>   resources_used.vmem
>
>  And how do those values compare to the actual amount of elapsed time for the
> job, the amount of physical memory on the node, and the total memory (RAM +
> swap space) on the node?
> Just checking to make sure that everything is running as it should be and
> that your process is not heavily swapping or something like that.
>
>  Thanks,
>   Eric
>
>
>
> On Tue, Mar 19, 2013 at 11:23 AM, Alexis Groppi <
> alexis.groppi at u-bordeaux2.fr> wrote:
>
>>  Hi Adina,
>>
>> First of all, thanks for your answer and your advice :)
>> The script extract-partitions.py works!
>> As for do-partition.py on my second set, it has been running for 32 hours.
>> Should it not have produced at least one temporary .pmap file?
>>
>> Thanks again
>>
>> Alexis
>>
>> On 19/03/2013 12:58, Adina Chuang Howe wrote:
>>
>>
>>
>>  Message: 1
>>> Date: Tue, 19 Mar 2013 10:41:45 +0100
>>> From: Alexis Groppi <alexis.groppi at u-bordeaux2.fr>
>>> Subject: [khmer] Duration of do-partition.py (very long !)
>>> To: khmer at lists.idyll.org
>>> Message-ID: <514832D9.7090207 at u-bordeaux2.fr>
>>> Content-Type: text/plain; charset="iso-8859-1"; Format="flowed"
>>>
>>> Hi Titus,
>>>
>>> After digital normalization and filter-below-abund, on your advice I
>>> ran do-partition.py on 2 sets of data (approx. 2.5 million
>>> reads of 75 nt each):
>>>
>>> /khmer-BETA/scripts/do-partition.py -k 20 -x 1e9
>>> /ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below.graphbase
>>> /ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below
>>> and
>>> /khmer-BETA/scripts/do-partition.py -k 20 -x 1e9
>>> /ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase
>>> /ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below
>>>
>>> For the first one, I got a
>>> 174r1_prinseq_good_bFr8.fasta.keep.below.graphbase.info file with the
>>> information: 33 subsets total.
>>> Thereafter, the 33 .pmap files (0.pmap through 32.pmap) were created at
>>> regular intervals, and finally I got a single file,
>>> 174r1_prinseq_good_bFr8.fasta.keep.below.part (all the .pmap files were
>>> deleted).
>>> This run lasted approximately 56 hours.
>>>
>>> For the second set (174r2), do-partition.py has been running for 32 hours,
>>> but so far I have only gotten the
>>> 174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase.info file with the
>>> information: 35 subsets total.
>>> And nothing more...
>>>
>>> Is this duration "normal"?
>>>
>>
>>  Yes, this is typical.  The longest I've had it run is 3 weeks for very
>> large datasets (billions of reads).  In general, partitioning is the most
>> time-consuming of all the steps.  Once it's finished, you'll have much
>> smaller files which can be assembled very quickly.  Since I run assembly
>> with multiple assemblers and multiple K lengths, this gain is often
>> significant for me.
>>
>>  To get the actual partitioned files, you can use the following script:
>>
>>
>> https://github.com/ged-lab/khmer/blob/master/scripts/extract-partitions.py
>>
>>> (The thread parameters were left at their defaults (4 threads).)
>>> 33 subsets and only one file at the end?
>>> Should I stop do-partition.py on the second set and re-run it with more
>>> threads?
>>>
>>>
>>  I'd suggest letting it run.
>>
>>  Best,
>> Adina
>>
>>
>>
>> --
>>
>> _______________________________________________
>> khmer mailing list
>> khmer at lists.idyll.org
>> http://lists.idyll.org/listinfo/khmer
>>
>>
>
>
>  --
>  Eric McDonald
> HPC/Cloud Software Engineer
>   for the Institute for Cyber-Enabled Research (iCER)
>   and the Laboratory for Genomics, Evolution, and Development (GED)
> Michigan State University
> P: 517-355-8733
>
>



-- 
Eric McDonald
HPC/Cloud Software Engineer
  for the Institute for Cyber-Enabled Research (iCER)
  and the Laboratory for Genomics, Evolution, and Development (GED)
Michigan State University
P: 517-355-8733

