[khmer] Duration of do-partition.py (very long !) (Alexis Groppi)

Alexis Groppi alexis.groppi at u-bordeaux2.fr
Thu Mar 21 07:15:33 PDT 2013


Hi Titus,

Thanks for your answer. The input file I am using should not contain 
these artifacts, because it was produced by the filter-below-abund 
treatment.
I will try find-knots and then filter-stoptags.
Regarding your last suggestion: what is the size limit?
A follow-up question: Eric told me that "Titus created a guide about 
what size hash table to generally use with certain kinds of data".
If possible, I would be very interested in having this guide.
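
If I understand the workflow correctly, it would be roughly the 
following (a sketch only: the script arguments and output file names 
are my best guess, so please correct me if I am wrong):

    # identify highly connected k-mers and write them out as stoptags
    khmer-BETA/scripts/find-knots.py file.graphbase
    # filter the reads against those stoptags (I assume the output is file.below.stopfilt)
    khmer-BETA/scripts/filter-stoptags.py -k 20 file.graphbase.stoptags file.below
    # then re-run partitioning on the filtered reads
    khmer-BETA/scripts/do-partition.py -k 20 -x 1e9 -T 20 file.graphbase file.below.stopfilt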

Thanks again

Alexis

On 21/03/2013 14:14, C. Titus Brown wrote:
> This long wait is probably a sign that you have a highly connected 
> graph. We usually attribute that to the presence of sequencing 
> artifacts, which have to be removed either via filter-below-abund or 
> find-knots; do-partition can't do it itself.  Take a look at the 
> handbook or the info on partitioning large data sets.
>
> In your case I think your data may be small enough to assemble just 
> after diginorm.
>
> ---
> C. Titus Brown, ctb at msu.edu
>
> On Mar 21, 2013, at 8:50, Eric McDonald <emcd.msu at gmail.com> wrote:
>
>> Thanks for the information, Alexis. If you are using 20 threads, then 
>> 441 / 20 is about 22 hours of elapsed time. So, it appears that all 
>> of the threads are working. (There is the possibility that they could 
>> be busy-waiting somewhere, but I didn't see any explicit 
>> opportunities for that from reading the 'do-partition.py' code.) 
>> Since you haven't seen .pmap files yet and since multithreaded 
>> execution is occurring, I expect that execution is currently at the 
>> following place in the script:
>> https://github.com/ged-lab/khmer/blob/bleeding-edge/scripts/do-partition.py#L57
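>>
>> As a quick sanity check (a rough sketch; the field names assume a 
>> PBS/Torque-style qstat):
>>
>>     # CPU time divided by the thread count should be close to the
>>     # walltime if all threads are doing useful work.
>>     qstat -f <job-id> | grep -E 'resources_used\.(cput|walltime)'
>>     # e.g. 441:04:21 / 20 threads ~= 22:03, vs. a walltime of 22:05:56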
>>
>> I am not familiar with the 'do_subset_partition' code, but will try 
>> to analyze it later today. However, I would also listen to what Adina 
>> is saying - this step may just take a long time....
>>
>> Eric
>>
>> P.S. If you want to check on the output from the script, you could 
>> look in /var/spool/PBS/mom_priv (or equivalent) on the node where the 
>> job is running to see what the spooled output looks like thus far. 
>> (There should be a file named with the job ID and either a ".ER" or 
>> ".OU" extension, if I recall correctly, though it has been awhile 
>> since I have administered your kind of batch system.) You may need 
>> David to do this as the permissions to the directory are typically 
>> restrictive.
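>>
>> Something along these lines, once you (or David) are on the right 
>> node (a sketch only; the exact spool path and file names vary between 
>> PBS installations):
>>
>>     cd /var/spool/PBS/mom_priv      # or the equivalent spool directory
>>     ls | grep <job-id>              # look for <job-id>.OU and <job-id>.ER
>>     tail <job-id>.OU                # stdout spooled so far
>>     tail <job-id>.ER                # stderr spooled so far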
>>
>>
>>
>> On Thu, Mar 21, 2013 at 5:40 AM, Alexis Groppi 
>> <alexis.groppi at u-bordeaux2.fr> wrote:
>>
>>     One clarification:
>>
>>     The file submitted to do-partition.py contains 2,576,771 reads
>>     (file.below).
>>     The job was launched with the following options:
>>     khmer-BETA/scripts/do-partition.py -k 20 -x 1e9 -T 20
>>     file.graphbase file.below
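>>
>>     For reference, my understanding of those options (please correct
>>     me if I am wrong):
>>
>>         -k 20    # k-mer size, matching the value used elsewhere in the pipeline
>>         -x 1e9   # hash table size
>>         -T 20    # number of partitioning threads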
>>
>>     Alexis
>>
>>
>>     On 21/03/2013 10:13, Alexis Groppi wrote:
>>>     Hi Eric,
>>>
>>>     The script do-partition.py has now been running for 22 hours.
>>>     Only the file.info has been generated. No .pmap files have been
>>>     created.
>>>
>>>     qstat -f gives:
>>>         resources_used.cput = 441:04:21
>>>         resources_used.mem = 12764228kb
>>>         resources_used.vmem = 13926732kb
>>>         resources_used.walltime = 22:05:56
>>>
>>>     The amount of RAM on the server is 256 GB and the swap space is
>>>     also 256 GB.
>>>
>>>     What is your opinion?
>>>
>>>     Thanks
>>>
>>>     Alexis
>>>
>>>     On 20/03/2013 16:43, Alexis Groppi wrote:
>>>>     Hi Eric,
>>>>
>>>>     In fact, the previous job was killed when it reached the
>>>>     walltime limit.
>>>>     I have relaunched the script.
>>>>     qstat -fr gives:
>>>>         resources_used.cput = 93:23:08
>>>>         resources_used.mem = 12341932kb
>>>>         resources_used.vmem = 13271372kb
>>>>         resources_used.walltime = 04:42:39
>>>>
>>>>     At the moment, only the file.info has been generated.
>>>>
>>>>     Let's wait and see ...
>>>>
>>>>     Thanks again
>>>>
>>>>     Alexis
>>>>
>>>>
>>>>     On 19/03/2013 21:50, Eric McDonald wrote:
>>>>>     Hi Alexis,
>>>>>
>>>>>     What does:
>>>>>       qstat -f <job-id>
>>>>>     (where <job-id> is the ID of your job) report for the
>>>>>     following fields:
>>>>>       resources_used.cput
>>>>>       resources_used.vmem
>>>>>
>>>>>     And how do those values compare to the actual amount of elapsed
>>>>>     time for the job, the amount of physical memory on the node,
>>>>>     and the total memory (RAM + swap space) on the node?
>>>>>     Just checking to make sure that everything is running as it
>>>>>     should be and that your process is not swapping heavily, or
>>>>>     something like that.
>>>>>
>>>>>     Thanks,
>>>>>       Eric
>>>>>
>>>>>
>>>>>
>>>>>     On Tue, Mar 19, 2013 at 11:23 AM, Alexis Groppi
>>>>>     <alexis.groppi at u-bordeaux2.fr> wrote:
>>>>>
>>>>>         Hi Adina,
>>>>>
>>>>>         First of all, thanks for your answer and your advice :)
>>>>>         The script extract-partitions.py works!
>>>>>         As for do-partition.py on my second set, it has been
>>>>>         running for 32 hours. Should it not have produced at least
>>>>>         one temporary .pmap file?
>>>>>
>>>>>         Thanks again
>>>>>
>>>>>         Alexis
>>>>>
>>>>>         On 19/03/2013 12:58, Adina Chuang Howe wrote:
>>>>>>
>>>>>>
>>>>>>             Message: 1
>>>>>>             Date: Tue, 19 Mar 2013 10:41:45 +0100
>>>>>>             From: Alexis Groppi <alexis.groppi at u-bordeaux2.fr>
>>>>>>             Subject: [khmer] Duration of do-partition.py (very
>>>>>>             long !)
>>>>>>             To: khmer at lists.idyll.org
>>>>>>
>>>>>>             Hi Titus,
>>>>>>
>>>>>>             After digital normalization and filter-below-abund,
>>>>>>             on your advice I ran do-partition.py on 2 sets of
>>>>>>             data (approx. 2.5 million reads of 75 nt each):
>>>>>>
>>>>>>             /khmer-BETA/scripts/do-partition.py -k 20 -x 1e9
>>>>>>             /ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below.graphbase
>>>>>>             /ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below
>>>>>>             and
>>>>>>             /khmer-BETA/scripts/do-partition.py -k 20 -x 1e9
>>>>>>             /ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase
>>>>>>             /ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below
>>>>>>
>>>>>>             For the first one I got a
>>>>>>             174r1_prinseq_good_bFr8.fasta.keep.below.graphbase.info
>>>>>>             file with the information: 33 subsets total.
>>>>>>             Thereafter, 33 .pmap files, from 0.pmap to 32.pmap,
>>>>>>             were created at regular intervals, and finally I got
>>>>>>             a single file,
>>>>>>             174r1_prinseq_good_bFr8.fasta.keep.below.part (all
>>>>>>             the .pmap files were deleted).
>>>>>>             This run took approximately 56 hours.
>>>>>>
>>>>>>             For the second set (174r2), do-partition.py has been
>>>>>>             running for 32 hours, but so far I have only got the
>>>>>>             174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase.info
>>>>>>             file with the information: 35 subsets total.
>>>>>>             And nothing more...
>>>>>>
>>>>>>             Is this duration "normal"?
>>>>>>
>>>>>>
>>>>>>         Yes, this is typical.  The longest I've had it run is 3
>>>>>>         weeks, for very large datasets (billions of reads).  In
>>>>>>         general, partitioning is the most time-consuming of all
>>>>>>         the steps.  Once it's finished, you'll have much smaller
>>>>>>         files which can be assembled very quickly.  Since I run
>>>>>>         assembly with multiple assemblers and multiple K lengths,
>>>>>>         this gain is often significant for me.
>>>>>>
>>>>>>         To get the actual partitioned files, you can use the
>>>>>>         following script:
>>>>>>
>>>>>>         https://github.com/ged-lab/khmer/blob/master/scripts/extract-partitions.py
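>>>>>>
>>>>>>         For example, roughly (a sketch; adjust the output prefix
>>>>>>         and file names to your own data):
>>>>>>
>>>>>>             python extract-partitions.py 174r1 \
>>>>>>                 174r1_prinseq_good_bFr8.fasta.keep.below.part
>>>>>>
>>>>>>         This should write the partitions out as 174r1.group*.fa
>>>>>>         files, plus a 174r1.dist file with the partition size
>>>>>>         distribution.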
>>>>>>
>>>>>>             (The thread parameter was left at its default of 4
>>>>>>             threads.)
>>>>>>             33 subsets and only one file at the end?
>>>>>>             Should I stop do-partition.py on the second set and
>>>>>>             re-run it with more threads?
>>>>>>
>>>>>>
>>>>>>         I'd suggest letting it run.
>>>>>>
>>>>>>         Best,
>>>>>>         Adina
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>     -- 
>>>>>     Eric McDonald
>>>>>     HPC/Cloud Software Engineer
>>>>>       for the Institute for Cyber-Enabled Research (iCER)
>>>>>       and the Laboratory for Genomics, Evolution, and Development
>>>>>     (GED)
>>>>>     Michigan State University
>>>>>     P: 517-355-8733
>>>>
>>>
>>
>>
>>
>>
>>
>> -- 
>> Eric McDonald
>> HPC/Cloud Software Engineer
>>   for the Institute for Cyber-Enabled Research (iCER)
>>   and the Laboratory for Genomics, Evolution, and Development (GED)
>> Michigan State University
>> P: 517-355-8733
