[khmer] Duration of do-partition.py (very long!) (Alexis Groppi)
Alexis Groppi
alexis.groppi at u-bordeaux2.fr
Thu Mar 21 07:15:33 PDT 2013
Hi Titus,
Thanks for your answer. The input file I am using should not contain these
artifacts, because it was produced by the filter-below-abund treatment.
I will try find-knots and then filter-stoptags.
Regarding your last suggestion: what is the size limit?
A follow-up question: Eric told me that "Titus created a guide about what
size hash table to generally use with certain kinds of data".
If possible, I would be very interested in seeing this guide.
Thanks again
Alexis
On 21/03/2013 14:14, C. Titus Brown wrote:
> This long wait is probably a sign that you have a highly connected
> graph. We usually attribute that to the presence of sequencing
> artifacts, which have to be removed either via filter-below-abund or
> find-knots; do-partition can't do it by itself. Take a look at the
> handbook or the information on partitioning large data sets.
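>
> A minimal sketch of that workflow (a hedged example: the script names
> come from the khmer scripts directory, and the -x value is simply
> carried over from the commands used elsewhere in this thread):
>
>     khmer-BETA/scripts/find-knots.py -x 1e9 file.graphbase
>     khmer-BETA/scripts/filter-stoptags.py -k 20 file.graphbase.stoptags file.below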
>
> In your case I think your data may be small enough to assemble just
> after diginorm.
>
> ---
> C. Titus Brown, ctb at msu.edu
>
> On Mar 21, 2013, at 8:50, Eric McDonald <emcd.msu at gmail.com> wrote:
>
>> Thanks for the information, Alexis. If you are using 20 threads, then
>> 441 CPU-hours divided by 20 is about 22 hours of elapsed time. So, it
>> appears that all of the threads are working. (There is the possibility
>> that they could be busy-waiting somewhere, but I didn't see any
>> explicit opportunities for that when reading the 'do-partition.py' code.)
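>> A quick way to check that arithmetic against the qstat output (a
>> sketch, with this job's reported times hard-coded):
>>
>>     # cput = 441:04:21 ~ 441.07 CPU-hours; walltime = 22:05:56 ~ 22.10 hours
>>     echo "scale=1; 441.07 / 22.10" | bc    # prints 19.9, i.e. ~20 busy threads
>>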
>> Since you haven't seen .pmap files yet and since multithreaded
>> execution is occurring, I expect that execution is currently at the
>> following place in the script:
>> https://github.com/ged-lab/khmer/blob/bleeding-edge/scripts/do-partition.py#L57
>>
>> I am not familiar with the 'do_subset_partition' code, but will try
>> to analyze it later today. However, I would also listen to what Adina
>> is saying - this step may just take a long time....
>>
>> Eric
>>
>> P.S. If you want to check on the output from the script, you could
>> look in /var/spool/PBS/mom_priv (or the equivalent) on the node where
>> the job is running to see what the spooled output looks like so far.
>> (There should be a file named with the job ID and either a ".ER" or
>> ".OU" extension, if I recall correctly, though it has been a while
>> since I administered your kind of batch system.) You may need David
>> to do this, as the permissions on the directory are typically
>> restrictive.
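>>
>> For example (a sketch only; the exact spool path varies between
>> Torque/PBS versions and installs, and may require admin access):
>>
>>     ls /var/spool/PBS/mom_priv/
>>     tail /var/spool/PBS/mom_priv/<job-id>.OU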
>>
>>
>>
>> On Thu, Mar 21, 2013 at 5:40 AM, Alexis Groppi
>> <alexis.groppi at u-bordeaux2.fr> wrote:
>>
>> One clarification:
>>
>> The file submitted to do-partition.py (file.below) contains
>> 2,576,771 reads.
>> The job was launched with the following options:
>> khmer-BETA/scripts/do-partition.py -k 20 -x 1e9 -T 20
>> file.graphbase file.below
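>>
>> For scale, a rough sizing sketch (an assumption based on khmer's
>> 1-bit-per-entry presence tables and a default of 4 tables; not the
>> official sizing guide):
>>
>>     # -x 1e9 entries, 4 tables, 1 bit per entry:
>>     echo "4 * 10^9 / 8 / 1024^3" | bc -l    # ~0.47 GB for the graph alone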
>>
>> Alexis
>>
>>
>> On 21/03/2013 10:13, Alexis Groppi wrote:
>>> Hi Eric,
>>>
>>> The do-partition.py script has now been running for 22 hours.
>>> Only the file.info has been generated; no .pmap files have been
>>> created.
>>>
>>> qstat -f gives:
>>> resources_used.cput = 441:04:21
>>> resources_used.mem = 12764228kb
>>> resources_used.vmem = 13926732kb
>>> resources_used.walltime = 22:05:56
>>>
>>> The server has 256 GB of RAM and another 256 GB of swap space.
>>>
>>> What is your opinion?
>>>
>>> Thanks
>>>
>>> Alexis
>>>
>>> On 20/03/2013 16:43, Alexis Groppi wrote:
>>>> Hi Eric,
>>>>
>>>> In fact, the previous job was killed when it hit the walltime
>>>> limit.
>>>> I relaunched the script.
>>>> qstat -fr gives:
>>>> resources_used.cput = 93:23:08
>>>> resources_used.mem = 12341932kb
>>>> resources_used.vmem = 13271372kb
>>>> resources_used.walltime = 04:42:39
>>>>
>>>> At the moment, only the file.info has been
>>>> generated.
>>>>
>>>> Let's wait and see ...
>>>>
>>>> Thanks again
>>>>
>>>> Alexis
>>>>
>>>>
>>>> On 19/03/2013 21:50, Eric McDonald wrote:
>>>>> Hi Alexis,
>>>>>
>>>>> What does
>>>>> qstat -f <job-id>
>>>>> (where <job-id> is the ID of your job) report for the
>>>>> following fields:
>>>>> resources_used.cput
>>>>> resources_used.vmem
>>>>>
>>>>> And how do those values compare to the actual elapsed
>>>>> time for the job, the amount of physical memory on the node,
>>>>> and the total memory (RAM + swap space) on the node?
>>>>> I am just checking to make sure that everything is running as
>>>>> it should be and that your process is not swapping heavily or
>>>>> anything like that.
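>>>>>
>>>>> For example, something like:
>>>>>
>>>>>     qstat -f <job-id> | egrep 'resources_used\.(cput|vmem|walltime)'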
>>>>>
>>>>> Thanks,
>>>>> Eric
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 19, 2013 at 11:23 AM, Alexis Groppi
>>>>> <alexis.groppi at u-bordeaux2.fr> wrote:
>>>>>
>>>>> Hi Adina,
>>>>>
>>>>> First of all, thanks for your answer and your advice :)
>>>>> The extract-partitions.py script works!
>>>>> As for do-partition.py on my second set, it has been running
>>>>> for 32 hours. Should it not have produced at least one
>>>>> temporary .pmap file by now?
>>>>>
>>>>> Thanks again
>>>>>
>>>>> Alexis
>>>>>
>>>>> On 19/03/2013 12:58, Adina Chuang Howe wrote:
>>>>>>
>>>>>>
>>>>>> Message: 1
>>>>>> Date: Tue, 19 Mar 2013 10:41:45 +0100
>>>>>> From: Alexis Groppi <alexis.groppi at u-bordeaux2.fr>
>>>>>> Subject: [khmer] Duration of do-partition.py (very long!)
>>>>>> To: khmer at lists.idyll.org
>>>>>> Message-ID: <514832D9.7090207 at u-bordeaux2.fr>
>>>>>>
>>>>>> Hi Titus,
>>>>>>
>>>>>> After digital normalization and filter-below-abund, on
>>>>>> your advice I ran do-partition.py on 2 sets of data
>>>>>> (approx. 2.5 million reads of 75 nt each):
>>>>>>
>>>>>> /khmer-BETA/scripts/do-partition.py -k 20 -x 1e9
>>>>>> /ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below.graphbase
>>>>>> /ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below
>>>>>> and
>>>>>> /khmer-BETA/scripts/do-partition.py -k 20 -x 1e9
>>>>>> /ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase
>>>>>> /ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below
>>>>>>
>>>>>> For the first set I got a
>>>>>> 174r1_prinseq_good_bFr8.fasta.keep.below.graphbase.info
>>>>>> file with the information: 33 subsets total.
>>>>>> Thereafter, 33 .pmap files, from 0.pmap to 32.pmap, were
>>>>>> created at regular intervals, and finally I got a single
>>>>>> file, 174r1_prinseq_good_bFr8.fasta.keep.below.part (all
>>>>>> the .pmap files were deleted).
>>>>>> This run took approximately 56 hours.
>>>>>>
>>>>>> For the second set (174r2), do-partition.py has been
>>>>>> running for 32 hours, but I have only gotten the
>>>>>> 174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase.info
>>>>>> file with the information: 35 subsets total.
>>>>>> And nothing more...
>>>>>>
>>>>>> Is this duration "normal"?
>>>>>>
>>>>>>
>>>>>> Yes, this is typical. The longest I've had it run is 3
>>>>>> weeks, for a very large dataset (billions of reads). In
>>>>>> general, partitioning is the most time-consuming of all
>>>>>> the steps. Once it's finished, you'll have much smaller
>>>>>> files which can be assembled very quickly. Since I run
>>>>>> assemblies with multiple assemblers and multiple K
>>>>>> lengths, this gain is often significant for me.
>>>>>>
>>>>>> To get the actual partitioned files, you can use the
>>>>>> following script:
>>>>>>
>>>>>> https://github.com/ged-lab/khmer/blob/master/scripts/extract-partitions.py
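>>>>>>
>>>>>> A minimal invocation sketch (the output prefix and the -X
>>>>>> size cutoff here are illustrative, not required values):
>>>>>>
>>>>>>     python scripts/extract-partitions.py -X 100000 out_prefix file.below.part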
>>>>>>
>>>>>> (The thread parameters are at their defaults (4
>>>>>> threads).)
>>>>>> 33 subsets and only one file at the end?
>>>>>> Should I stop do-partition.py on the second set and
>>>>>> re-run it with more threads?
>>>>>>
>>>>>>
>>>>>> I'd suggest letting it run.
>>>>>>
>>>>>> Best,
>>>>>> Adina
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>>
>>
>>
>> --
>> Eric McDonald
>> HPC/Cloud Software Engineer
>> for the Institute for Cyber-Enabled Research (iCER)
>> and the Laboratory for Genomics, Evolution, and Development (GED)
>> Michigan State University
>> P: 517-355-8733