[khmer] Duration of do-partition.py (very long!) (Alexis Groppi)
C. Titus Brown
ctb at msu.edu
Thu Mar 21 07:28:20 PDT 2013
On Thu, Mar 21, 2013 at 03:15:33PM +0100, Alexis Groppi wrote:
> Thanks for your answer. The input file I use should not have this
> artefact, because it comes after filter-below-abund treatment.
> I will try find-knots and then filter-stoptags.
> For your last suggestion: what is the size limit?
> A related question: Eric told me "Titus created a guide about what size
> hash table to generally use with certain kinds of data".
> If possible, I would be very interested in having this guide.
http://khmer.readthedocs.org/en/latest/
http://khmer.readthedocs.org/en/latest/choosing-hash-sizes.html
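Roughly, the memory those guides talk about scales as (number of tables) x
(table size). A minimal sketch of the arithmetic, assuming the default of 4
tables, ~1 bit per entry for the graph/presence tables that partitioning
uses, and ~1 byte per entry for the counting tables (see the guide above for
the authoritative numbers):

    # Rough khmer hash-table memory estimate (a sketch, not the exact
    # formula from the docs; assumes 4 tables by default).
    def khmer_table_memory_gb(tablesize, n_tables=4, bytes_per_entry=0.125):
        return n_tables * tablesize * bytes_per_entry / 1e9

    print(khmer_table_memory_gb(1e9))                       # graph table, -x 1e9: ~0.5 GB
    print(khmer_table_memory_gb(1e9, bytes_per_entry=1.0))  # counting table, -x 1e9: ~4 GB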
OK, you may have to use the find-knots stuff --
http://khmer.readthedocs.org/en/latest/partitioning-big-data.html
cheers,
--titus
> On 21/03/2013 14:14, C. Titus Brown wrote:
>> This long wait is probably a sign that you have a highly connected
>> graph. We usually attribute that to the presence of sequencing
>> artifacts, which have to be removed either via filter-below-abund or
>> find-knots; do-partition can't do it itself. Take a look at the
>> handbook or the info on partitioning large data.
>>
>> In your case I think your data may be small enough to assemble just
>> after diginorm.
>>
>> ---
>> C. Titus Brown, ctb at msu.edu
>>
>> On Mar 21, 2013, at 8:50, Eric McDonald <emcd.msu at gmail.com> wrote:
>>
>>> Thanks for the information, Alexis. If you are using 20 threads, then
>>> 441 / 20 is about 22 hours of elapsed time. So, it appears that all
>>> of the threads are working. (There is the possibility that they could
>>> be busy-waiting somewhere, but I didn't see any explicit
>>> opportunities for that from reading the 'do-partition.py' code.)
>>> Since you haven't seen .pmap files yet and since multithreaded
>>> execution is occurring, I expect that execution is currently at the
>>> following place in the script:
>>> https://github.com/ged-lab/khmer/blob/bleeding-edge/scripts/do-partition.py#L57
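>>> As a quick sanity check of that arithmetic (just a sketch, using the
>>> numbers you reported):
>>>
>>>     def qstat_hours(hms):
>>>         # convert an "HH:MM:SS" string from qstat into hours
>>>         h, m, s = (int(x) for x in hms.split(':'))
>>>         return h + m / 60.0 + s / 3600.0
>>>
>>>     cput = qstat_hours('441:04:21')   # resources_used.cput
>>>     wall = qstat_hours('22:05:56')    # resources_used.walltime
>>>     print(cput / wall)                # ~20, i.e. all 20 threads busy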
>>>
>>> I am not familiar with the 'do_subset_partition' code, but will try
>>> to analyze it later today. However, I would also listen to what Adina
>>> is saying - this step may just take a long time....
>>>
>>> Eric
>>>
>>> P.S. If you want to check on the output from the script, you could
>>> look in /var/spool/PBS/mom_priv (or equivalent) on the node where the
>>> job is running to see what the spooled output looks like thus far.
>>> (There should be a file named with the job ID and either a ".ER" or
>>> ".OU" extension, if I recall correctly, though it has been a while
>>> since I have administered your kind of batch system.) You may need
>>> David to do this as the permissions to the directory are typically
>>> restrictive.
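>>> For example, something like this (a rough sketch -- the job ID here is
>>> hypothetical, and the spool path and file naming are only what I
>>> recall, so adjust for your site):
>>>
>>>     import glob
>>>     job_id = '123456'   # substitute the real PBS job ID
>>>     for path in glob.glob('/var/spool/PBS/mom_priv/*' + job_id + '*'):
>>>         print(path)     # expect the <job-id>...OU and ...ER spool files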
>>>
>>>
>>>
>>> On Thu, Mar 21, 2013 at 5:40 AM, Alexis Groppi
>>> <alexis.groppi at u-bordeaux2.fr> wrote:
>>>
>>> One clarification:
>>>
>>> The file submitted to the script do-partition.py contains 2576771
>>> reads (file.below)
>>> The job was launched with the following options:
>>> khmer-BETA/scripts/do-partition.py -k 20 -x 1e9 -T 20
>>> file.graphbase file.below
>>>
>>> Alexis
>>>
>>>
>>> On 21/03/2013 10:13, Alexis Groppi wrote:
>>>> Hi Eric,
>>>>
>>>> The script do-partition.py has now been running for 22 hours.
>>>> Only the file.info has been generated. No .pmap files have been
>>>> created.
>>>>
>>>> qstat -f gives:
>>>> resources_used.cput = 441:04:21
>>>> resources_used.mem = 12764228kb
>>>> resources_used.vmem = 13926732kb
>>>> resources_used.walltime = 22:05:56
>>>>
>>>> The amount of RAM on the server is 256 GB, and the swap space is
>>>> also 256 GB.
>>>>
>>>> Your opinion?
>>>>
>>>> Thanks
>>>>
>>>> Alexis
>>>>
>>>> On 20/03/2013 16:43, Alexis Groppi wrote:
>>>>> Hi Eric,
>>>>>
>>>>> Actually, the previous job was terminated because it hit the
>>>>> walltime limit.
>>>>> I relaunched the script.
>>>>> qstat -fr gives:
>>>>> resources_used.cput = 93:23:08
>>>>> resources_used.mem = 12341932kb
>>>>> resources_used.vmem = 13271372kb
>>>>> resources_used.walltime = 04:42:39
>>>>>
>>>>> At the moment, only the file.info has been generated.
>>>>>
>>>>> Let's wait and see ...
>>>>>
>>>>> Thanks again
>>>>>
>>>>> Alexis
>>>>>
>>>>>
>>>>> On 19/03/2013 21:50, Eric McDonald wrote:
>>>>>> Hi Alexis,
>>>>>>
>>>>>> What does
>>>>>> qstat -f <job-id>
>>>>>> (where <job-id> is the ID of your job) tell you for the
>>>>>> following fields:
>>>>>> resources_used.cput
>>>>>> resources_used.vmem
>>>>>>
>>>>>> And how do those values compare to the actual amount of elapsed
>>>>>> time for the job, the amount of physical memory on the node,
>>>>>> and the total memory (RAM + swap space) on the node?
>>>>>> I'm just checking to make sure that everything is running as it
>>>>>> should be and that your process is not deep into swap or
>>>>>> something like that.
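>>>>>>
>>>>>> For illustration, the comparison I have in mind is just this
>>>>>> (placeholder numbers -- substitute your qstat output and the node's
>>>>>> actual sizes):
>>>>>>
>>>>>>     vmem_kb = 13000000                    # resources_used.vmem, in kb
>>>>>>     ram_kb  = 256 * 1024 * 1024           # physical RAM on the node
>>>>>>     swap_kb = 256 * 1024 * 1024           # swap space on the node
>>>>>>     print(vmem_kb / float(ram_kb))        # well under 1.0 -> not swapping
>>>>>>     print(vmem_kb / float(ram_kb + swap_kb))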
>>>>>>
>>>>>> Thanks,
>>>>>> Eric
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 19, 2013 at 11:23 AM, Alexis Groppi
>>>>>> <alexis.groppi at u-bordeaux2.fr> wrote:
>>>>>>
>>>>>> Hi Adina,
>>>>>>
>>>>>> First of all, thanks for your answer and your advice :)
>>>>>> The script extract-partitions.py works!
>>>>>> As for do-partition.py on my second set, it has been running for 32
>>>>>> hours now. Should it not have produced at least one temporary
>>>>>> .pmap file?
>>>>>>
>>>>>> Thanks again
>>>>>>
>>>>>> Alexis
>>>>>>
>>>>>> On 19/03/2013 12:58, Adina Chuang Howe wrote:
>>>>>>>
>>>>>>>
>>>>>>> Message: 1
>>>>>>> Date: Tue, 19 Mar 2013 10:41:45 +0100
>>>>>>> From: Alexis Groppi <alexis.groppi at u-bordeaux2.fr>
>>>>>>> Subject: [khmer] Duration of do-partition.py (very long!)
>>>>>>> To: khmer at lists.idyll.org
>>>>>>>
>>>>>>> Hi Titus,
>>>>>>>
>>>>>>> After digital normalization and filter-below-abund, and upon your
>>>>>>> advice, I ran do-partition.py on 2 sets of data (approx. 2.5
>>>>>>> million reads of 75 nt each):
>>>>>>>
>>>>>>> /khmer-BETA/scripts/do-partition.py -k 20 -x 1e9
>>>>>>> /ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below.graphbase
>>>>>>> /ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below
>>>>>>> and
>>>>>>> /khmer-BETA/scripts/do-partition.py -k 20 -x 1e9
>>>>>>> /ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase
>>>>>>> /ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below
>>>>>>>
>>>>>>> For the first one I got a
>>>>>>> 174r1_prinseq_good_bFr8.fasta.keep.below.graphbase.info file with
>>>>>>> the information: 33 subsets total.
>>>>>>> Thereafter, 33 .pmap files (0.pmap to 32.pmap) were created at
>>>>>>> regular intervals, and finally I got a single file,
>>>>>>> 174r1_prinseq_good_bFr8.fasta.keep.below.part (all the .pmap files
>>>>>>> were deleted).
>>>>>>> This treatment lasted approx. 56 hours.
>>>>>>>
>>>>>>> For the second set (174r2), do-partition.py has been running for
>>>>>>> 32 hours, but I only got the
>>>>>>> 174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase.info file with
>>>>>>> the information: 35 subsets total.
>>>>>>> And nothing more...
>>>>>>>
>>>>>>> Is this duration "normal"?
>>>>>>>
>>>>>>>
>>>>>>> Yes, this is typical. The longest I've had it run is 3
>>>>>>> weeks, for a very large data set (billions of reads). In general,
>>>>>>> partitioning is the most time-consuming of all the steps.
>>>>>>> Once it's finished, you'll have much smaller files which
>>>>>>> can be assembled very quickly. Since I run assemblies with
>>>>>>> multiple assemblers and multiple K values, this gain
>>>>>>> is often significant for me.
>>>>>>>
>>>>>>> To get the actual partitioned files, you can use the
>>>>>>> following script:
>>>>>>>
>>>>>>> https://github.com/ged-lab/khmer/blob/master/scripts/extract-partitions.py
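>>>>>>>
>>>>>>> If you just want a quick look at how many partitions you ended up
>>>>>>> with before running that script, something like this works on the
>>>>>>> .part file (a sketch; it assumes the partition ID is appended to
>>>>>>> each FASTA header after a tab, which is how my .part files look):
>>>>>>>
>>>>>>>     from collections import Counter
>>>>>>>
>>>>>>>     counts = Counter()
>>>>>>>     with open('174r1_prinseq_good_bFr8.fasta.keep.below.part') as fh:
>>>>>>>         for line in fh:
>>>>>>>             if line.startswith('>'):
>>>>>>>                 counts[line.rstrip().split('\t')[-1]] += 1
>>>>>>>     print(len(counts), 'partitions')
>>>>>>>     print(counts.most_common(5))   # the five largest partitions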
>>>>>>>
>>>>>>> (The thread parameter is at its default (4 threads).)
>>>>>>> 33 subsets and only one file at the end?
>>>>>>> Should I stop do-partition.py on the second set and re-run it
>>>>>>> with more threads?
>>>>>>>
>>>>>>>
>>>>>>> I'd suggest letting it run.
>>>>>>>
>>>>>>> Best,
>>>>>>> Adina
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Eric McDonald
>>>>>> HPC/Cloud Software Engineer
>>>>>> for the Institute for Cyber-Enabled Research (iCER)
>>>>>> and the Laboratory for Genomics, Evolution, and Development (GED)
>>>>>> Michigan State University
>>>>>> P: 517-355-8733
>>>>>
>>>
>>>
>>>
>>>
>>> --
>>> Eric McDonald
>>> HPC/Cloud Software Engineer
>>> for the Institute for Cyber-Enabled Research (iCER)
>>> and the Laboratory for Genomics, Evolution, and Development (GED)
>>> Michigan State University
>>> P: 517-355-8733
>
--
C. Titus Brown, ctb at msu.edu