[khmer] Duration of do-partition.py (very long !) (Alexis Groppi)

C. Titus Brown ctb at msu.edu
Thu Mar 21 07:28:20 PDT 2013


On Thu, Mar 21, 2013 at 03:15:33PM +0100, Alexis Groppi wrote:
> Thanks for your answer. The input file I use should not have this  
> artifact, because it was produced by the filter-below-abund treatment.
> I will try find-knots and then filter-stoptags.
> Regarding your last suggestion: what is the size limit?
> A follow-up question: Eric told me that "Titus created a guide about  
> what size hash table to generally use with certain kinds of data".
> If possible, I would be very interested in having this guide.

http://khmer.readthedocs.org/en/latest/

http://khmer.readthedocs.org/en/latest/choosing-hash-sizes.html
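
Roughly speaking (I am recalling the per-entry sizes here, so please check
them against that guide): total memory is the table size times the number
of tables, with about 1 byte per entry for the counting tables and about
1 bit per entry for the graph presence tables that partitioning uses.

  # back-of-the-envelope estimate for -x 1e9 with the default of 4 tables;
  # the per-entry sizes are assumptions -- confirm in choosing-hash-sizes.html
  awk 'BEGIN { x = 1e9; N = 4;
               printf "graph tables: ~%.2f GB, counting tables: ~%.2f GB\n",
                      x*N/8/1e9, x*N/1e9 }'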

OK, you may have to use the find-knots stuff --

http://khmer.readthedocs.org/en/latest/partitioning-big-data.html
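
Roughly, the flow there is find-knots, then filter-stoptags, then a re-run
of do-partition -- something like the following, though I am writing the
exact arguments and output filenames from memory, so treat this as a
sketch and check the doc / script help first:

  # 1. locate highly-connected "knot" k-mers in the graph
  khmer-BETA/scripts/find-knots.py file.graphbase
  # 2. trim reads at those stoptag k-mers (output filename is my guess)
  khmer-BETA/scripts/filter-stoptags.py -k 20 file.graphbase.stoptags file.below
  # 3. partition the filtered reads
  khmer-BETA/scripts/do-partition.py -k 20 -x 1e9 -T 20 file.graphbase file.below.stopfilt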

cheers,
--titus

> On 21/03/2013 14:14, C. Titus Brown wrote:
>> This long wait is probably a sign that you have a highly connected  
>> graph. We usually attribute that to the presence of sequencing  
>> artifacts, which have to be removed either via filter-below-abund or  
>> find-knots; do-partition can't do it by itself.  Take a look at the  
>> handbook or the info on partitioning large data.
>>
>> In your case I think your data may be small enough to assemble just  
>> after diginorm.
>>
>> ---
>> C. Titus Brown, ctb at msu.edu
>>
>> On Mar 21, 2013, at 8:50, Eric McDonald <emcd.msu at gmail.com> wrote:
>>
>>> Thanks for the information, Alexis. If you are using 20 threads, then  
>>> 441 hours of CPU time divided by 20 threads is about 22 hours of  
>>> elapsed time, which is consistent with the walltime you reported. So,  
>>> it appears that all of the threads are working. (There is the  
>>> possibility that they could
>>> be busy-waiting somewhere, but I didn't see any explicit  
>>> opportunities for that from reading the 'do-partition.py' code.)  
>>> Since you haven't seen .pmap files yet and since multithreaded  
>>> execution is occurring, I expect that execution is currently at the  
>>> following place in the script:
>>> https://github.com/ged-lab/khmer/blob/bleeding-edge/scripts/do-partition.py#L57
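>>>
>>> If you want to double-check that the threads are doing real work rather
>>> than spinning, standard Linux tools on the compute node are enough (no
>>> khmer involvement; adjust the process name match if needed):
>>>
>>>   # show per-thread CPU usage for the newest do-partition.py process;
>>>   # ~20 threads each sitting near 100% CPU suggests genuine work
>>>   top -H -p "$(pgrep -n -f do-partition.py)"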
>>>
>>> I am not familiar with the 'do_subset_partition' code, but will try  
>>> to analyze it later today. However, I would also listen to what Adina 
>>> is saying - this step may just take a long time....
>>>
>>> Eric
>>>
>>> P.S. If you want to check on the output from the script, you could  
>>> look in /var/spool/PBS/mom_priv (or equivalent) on the node where the 
>>> job is running to see what the spooled output looks like thus far.  
>>> (There should be a file named with the job ID and either a ".ER" or  
>>> ".OU" extension, if I recall correctly, though it has been awhile  
>>> since I have administered your kind of batch system.) You may need  
>>> David to do this as the permissions to the directory are typically  
>>> restrictive.
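>>>
>>> For example, something like this on that node (the spool directory
>>> layout varies between PBS/Torque installations, so this is only a
>>> guess at the exact path):
>>>
>>>   ls /var/spool/PBS/mom_priv/
>>>   tail -f /var/spool/PBS/mom_priv/<job-id>.OU   # substitute the real job ID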
>>>
>>>
>>>
>>> On Thu, Mar 21, 2013 at 5:40 AM, Alexis Groppi  
>>> <alexis.groppi at u-bordeaux2.fr> wrote:
>>>
>>>     One clarification:
>>>
>>>     The file submitted to the script do-partition.py contains 2576771
>>>     reads (file.below)
>>>     The job was launched with the following options:
>>>     khmer-BETA/scripts/do-partition.py -k 20 -x 1e9 -T 20
>>>     file.graphbase file.below
>>>
>>>     Alexis
>>>
>>>
>>>     On 21/03/2013 10:13, Alexis Groppi wrote:
>>>>     Hi Eric,
>>>>
>>>>     The script do-partition.py has now been running for 22 hours.
>>>>     Only the file.info has been generated. No .pmap files have been
>>>>     created.
>>>>
>>>>     qstat -f gives:
>>>>         resources_used.cput = 441:04:21
>>>>         resources_used.mem = 12764228kb
>>>>         resources_used.vmem = 13926732kb
>>>>         resources_used.walltime = 22:05:56
>>>>
>>>>     The amount of RAM on the server is 256 GB and the swap space is
>>>>     also 256 GB.
>>>>
>>>>     What is your opinion?
>>>>
>>>>     Thanks
>>>>
>>>>     Alexis
>>>>
>>>>     On 20/03/2013 16:43, Alexis Groppi wrote:
>>>>>     Hi Eric,
>>>>>
>>>>>     Actually, the previous job was killed when it reached the
>>>>>     walltime limit.
>>>>>     I relaunched the script.
>>>>>     qstat -fr gives:
>>>>>         resources_used.cput = 93:23:08
>>>>>         resources_used.mem = 12341932kb
>>>>>         resources_used.vmem = 13271372kb
>>>>>         resources_used.walltime = 04:42:39
>>>>>
>>>>>     At this moment, only the file.info has been generated.
>>>>>
>>>>>     Let's wait and see ...
>>>>>
>>>>>     Thanks again
>>>>>
>>>>>     Alexis
>>>>>
>>>>>
>>>>>     On 19/03/2013 21:50, Eric McDonald wrote:
>>>>>>     Hi Alexis,
>>>>>>
>>>>>>     What does:
>>>>>>       qstat -f <job-id>
>>>>>>     (where <job-id> is the ID of your job) tell you for the
>>>>>>     following fields:
>>>>>>       resources_used.cput
>>>>>>       resources_used.vmem
>>>>>>
>>>>>>     And how do those values compare to the actual amount of elapsed
>>>>>>     time for the job, the amount of physical memory on the node,
>>>>>>     and the total memory (RAM + swap space) on the node?
>>>>>>     Just checking to make sure that everything is running as it
>>>>>>     should be and that your process is not heavily into swap or
>>>>>>     something like that.
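>>>>>>
>>>>>>     For instance (plain PBS/Linux commands, nothing khmer-specific):
>>>>>>
>>>>>>       # the qstat fields of interest for the job
>>>>>>       qstat -f <job-id> | grep -E 'resources_used\.(cput|mem|vmem|walltime)'
>>>>>>       # RAM and swap on the compute node itself (values in GB)
>>>>>>       free -g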
>>>>>>
>>>>>>     Thanks,
>>>>>>       Eric
>>>>>>
>>>>>>
>>>>>>
>>>>>>     On Tue, Mar 19, 2013 at 11:23 AM, Alexis Groppi
>>>>>>     <alexis.groppi at u-bordeaux2.fr> wrote:
>>>>>>
>>>>>>         Hi Adina,
>>>>>>
>>>>>>         First of all, thanks for your answer and your advice :)
>>>>>>         The script extract-partitions.py works!
>>>>>>         As for do-partition.py on my second set, it has been running
>>>>>>         for 32 hours. Should it not have produced at least one
>>>>>>         temporary .pmap file by now?
>>>>>>
>>>>>>         Thanks again
>>>>>>
>>>>>>         Alexis
>>>>>>
>>>>>>         On 19/03/2013 12:58, Adina Chuang Howe wrote:
>>>>>>>
>>>>>>>
>>>>>>>             Message: 1
>>>>>>>             Date: Tue, 19 Mar 2013 10:41:45 +0100
>>>>>>>             From: Alexis Groppi <alexis.groppi at u-bordeaux2.fr>
>>>>>>>             Subject: [khmer] Duration of do-partition.py (very long !)
>>>>>>>             To: khmer at lists.idyll.org
>>>>>>>
>>>>>>>             Hi Titus,
>>>>>>>
>>>>>>>             After digital normalization and filter-below-abund,
>>>>>>>             on your advice I ran do-partition.py on 2 sets of data
>>>>>>>             (approx. 2.5 million reads of 75 nt each):
>>>>>>>
>>>>>>>             /khmer-BETA/scripts/do-partition.py -k 20 -x 1e9
>>>>>>>             /ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below.graphbase
>>>>>>>             /ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below
>>>>>>>             and
>>>>>>>             /khmer-BETA/scripts/do-partition.py -k 20 -x 1e9
>>>>>>>             /ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase
>>>>>>>             /ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below
>>>>>>>
>>>>>>>             For the first one, I got a
>>>>>>>             174r1_prinseq_good_bFr8.fasta.keep.below.graphbase.info
>>>>>>>             file with the information: 33 subsets total.
>>>>>>>             Thereafter, 33 .pmap files (0.pmap to 32.pmap) were
>>>>>>>             created at regular intervals, and finally I got a single
>>>>>>>             file, 174r1_prinseq_good_bFr8.fasta.keep.below.part (all
>>>>>>>             the .pmap files were deleted).
>>>>>>>             This treatment lasted approx. 56 hours.
>>>>>>>
>>>>>>>             For the second set (174r2), do-partition.py has been
>>>>>>>             running for 32 hours, but I have only gotten the
>>>>>>>             174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase.info
>>>>>>>             file with the information: 35 subsets total.
>>>>>>>             And nothing more...
>>>>>>>
>>>>>>>             Is this duration "normal"?
>>>>>>>
>>>>>>>
>>>>>>>         Yes, this is typical.  The longest I've had it run is 3
>>>>>>>         weeks, for very large datasets (billions of reads).  In
>>>>>>>         general, partitioning is the most time-consuming of all the
>>>>>>>         steps.  Once it's finished, you'll have much smaller files
>>>>>>>         which can be assembled very quickly.  Since I run assembly
>>>>>>>         with multiple assemblers and multiple K lengths, this gain
>>>>>>>         is often significant for me.
>>>>>>>
>>>>>>>         To get the actual partitioned files, you can use the
>>>>>>>         following script:
>>>>>>>
>>>>>>>         https://github.com/ged-lab/khmer/blob/master/scripts/extract-partitions.py
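>>>>>>>
>>>>>>>         For example (from memory, so please check the script's usage
>>>>>>>         first; the first argument is just a prefix for the output
>>>>>>>         group files):
>>>>>>>
>>>>>>>           extract-partitions.py 174r1 174r1_prinseq_good_bFr8.fasta.keep.below.part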
>>>>>>>
>>>>>>>             (The thread count is at its default of 4 threads.)
>>>>>>>             33 subsets and only one file at the end?
>>>>>>>             Should I stop do-partition.py on the second set and
>>>>>>>             re-run it with more threads?
>>>>>>>
>>>>>>>
>>>>>>>         I'd suggest letting it run.
>>>>>>>
>>>>>>>         Best,
>>>>>>>         Adina
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>     --
>>>>>>     Eric McDonald
>>>>>>     HPC/Cloud Software Engineer
>>>>>>       for the Institute for Cyber-Enabled Research (iCER)
>>>>>>       and the Laboratory for Genomics, Evolution, and Development (GED)
>>>>>>     Michigan State University
>>>>>>     P: 517-355-8733
>>>>>
>>>
>>>
>>>
>>>
>>> -- 
>>> Eric McDonald
>>> HPC/Cloud Software Engineer
>>>   for the Institute for Cyber-Enabled Research (iCER)
>>>   and the Laboratory for Genomics, Evolution, and Development (GED)
>>> Michigan State University
>>> P: 517-355-8733
>
> _______________________________________________
> khmer mailing list
> khmer at lists.idyll.org
> http://lists.idyll.org/listinfo/khmer


-- 
C. Titus Brown, ctb at msu.edu



