[khmer] Duration of do-partition.py (very long !) (Alexis Groppi)
Alexis Groppi
alexis.groppi at u-bordeaux2.fr
Thu Mar 21 08:51:30 PDT 2013
Sorry for bothering you, but it's still not clear to me:
To remove the artefacts, should I apply find-knots to my file.below
(after normalize-by-median.py, load-into-counting.py and
filter-below-abund.py), and then filter-stoptags?
And after that will my data be ready for assembly, or should I still run
do-partition.py (on these artefact-free data)?
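The ordering being asked about can be sketched as a dry run. This only echoes the step order and executes nothing; all file names, and the -k/-x/-C values, are hypothetical examples rather than commands confirmed by this thread:

```shell
# Dry-run sketch of the khmer artefact-removal ordering discussed here.
# File names (reads.fa etc.) and the -k/-C/-x values are hypothetical.
STEPS=(
  "normalize-by-median.py -k 20 -C 20 -x 1e9 reads.fa"
  "load-into-counting.py -k 20 -x 1e9 reads.kh reads.fa.keep"
  "filter-below-abund.py reads.kh reads.fa.keep"
  "find-knots.py reads.graphbase"
  "filter-stoptags.py -k 20 reads.graphbase.stoptags reads.fa.keep.below"
  "do-partition.py -k 20 -x 1e9 reads.graphbase reads.fa.keep.below.stopfilt"
)
# Print each step in order; nothing is actually run.
for s in "${STEPS[@]}"; do echo "$s"; done
```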
Thanks
Alexis
On 21/03/2013 15:28, C. Titus Brown wrote:
> On Thu, Mar 21, 2013 at 03:15:33PM +0100, Alexis Groppi wrote:
>> Thanks for your answer. The input file I use should not contain these
>> artefacts, because it has already been through the filter-below-abund
>> treatment.
>> I will try find-knots and then filter-stoptags.
>> For your last suggestion: what is the size limit?
>> A side question: Eric told me "Titus created a guide about what size
>> hash table to generally use with certain kinds of data".
>> If possible I would be very interested in having this guide.
> http://khmer.readthedocs.org/en/latest/
>
> http://khmer.readthedocs.org/en/latest/choosing-hash-sizes.html
>
> OK, you may have to use the find-knots stuff --
>
> http://khmer.readthedocs.org/en/latest/partitioning-big-data.html
>
> cheers,
> --titus
>
>> On 21/03/2013 14:14, C. Titus Brown wrote:
>>> This long wait is probably a sign that you have a highly connected
>>> graph. We usually attribute that to the presence of sequencing
>>> artifacts, which have to be removed either via filter-below-abund or
>>> find-knots; do-partition can't do it by itself. Take a look at the
>>> handbook or the information on partitioning large data.
>>>
>>> In your case I think your data may be small enough to assemble just
>>> after diginorm.
>>>
>>> ---
>>> C. Titus Brown, ctb at msu.edu
>>>
>>> On Mar 21, 2013, at 8:50, Eric McDonald <emcd.msu at gmail.com> wrote:
>>>
>>>> Thanks for the information, Alexis. If you are using 20 threads, then
>>>> 441 / 20 is about 22 hours of elapsed time. So, it appears that all
>>>> of the threads are working. (There is the possibility that they could
>>>> be busy-waiting somewhere, but I didn't see any explicit
>>>> opportunities for that from reading the 'do-partition.py' code.)
>>>> Since you haven't seen .pmap files yet and since multithreaded
>>>> execution is occurring, I expect that execution is currently at the
>>>> following place in the script:
>>>> https://github.com/ged-lab/khmer/blob/bleeding-edge/scripts/do-partition.py#L57
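Eric's "441 / 20 is about 22 hours" arithmetic can be checked directly from the qstat figures quoted later in this thread. A small sketch (pure arithmetic, no khmer-specific assumptions):

```python
# Estimate how many threads are actually busy from qstat's
# resources_used.cput and resources_used.walltime values.
def hms_to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string (hours may exceed 24) to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

cput = hms_to_seconds("441:04:21")     # total CPU time across all threads
walltime = hms_to_seconds("22:05:56")  # elapsed real time
busy_threads = cput / walltime         # ~20 => all 20 threads are working
print(f"effective busy threads: {busy_threads:.2f}")  # -> 19.96
```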
>>>>
>>>> I am not familiar with the 'do_subset_partition' code, but will try
>>>> to analyze it later today. However, I would also listen to what Adina
>>>> is saying - this step may just take a long time....
>>>>
>>>> Eric
>>>>
>>>> P.S. If you want to check on the output from the script, you could
>>>> look in /var/spool/PBS/mom_priv (or equivalent) on the node where the
>>>> job is running to see what the spooled output looks like thus far.
>>>> (There should be a file named with the job ID and either a ".ER" or
>>>> ".OU" extension, if I recall correctly, though it has been a while
>>>> since I administered your kind of batch system.) You may need
>>>> David to do this as the permissions to the directory are typically
>>>> restrictive.
>>>>
>>>>
>>>>
>>>> On Thu, Mar 21, 2013 at 5:40 AM, Alexis Groppi
>>>> <alexis.groppi at u-bordeaux2.fr> wrote:
>>>>
>>>> One clarification:
>>>>
>>>> The file submitted to the do-partition.py script contains 2576771
>>>> reads (file.below).
>>>> The job was launched with the following options:
>>>> khmer-BETA/scripts/do-partition.py -k 20 -x 1e9 -T 20
>>>> file.graphbase file.below
>>>>
>>>> Alexis
>>>>
>>>>
>>>> On 21/03/2013 10:13, Alexis Groppi wrote:
>>>>> Hi Eric,
>>>>>
>>>>> The do-partition.py script has now been running for 22 hours.
>>>>> Only the file.info has been generated; no .pmap files have been
>>>>> created yet.
>>>>>
>>>>> qstat -f gives :
>>>>> resources_used.cput = 441:04:21
>>>>> resources_used.mem = 12764228kb
>>>>> resources_used.vmem = 13926732kb
>>>>> resources_used.walltime = 22:05:56
>>>>>
>>>>> The amount of RAM on the server is 256 GB and the swap space is
>>>>> also 256 GB.
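Those qstat figures can be put next to the node's RAM to answer Eric's earlier swapping question; a minimal sketch using the numbers quoted above (qstat reports memory in kb):

```python
# Compare the job's memory usage (qstat reports kb) against the node's
# 256 GB of RAM to check whether the job could be swapping heavily.
def kb_to_gib(kb: int) -> float:
    """Convert kilobytes, as reported by qstat, to GiB."""
    return kb / (1024 * 1024)

mem_gib = kb_to_gib(12764228)    # resources_used.mem
vmem_gib = kb_to_gib(13926732)   # resources_used.vmem
ram_gib = 256.0                  # physical RAM on the node

print(f"mem: {mem_gib:.1f} GiB, vmem: {vmem_gib:.1f} GiB of {ram_gib:.0f} GiB")
# Both values are far below physical RAM, so heavy swapping is unlikely.
```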
>>>>>
>>>>> Your opinion ?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Alexis
>>>>>
>>>>> On 20/03/2013 16:43, Alexis Groppi wrote:
>>>>>> Hi Eric,
>>>>>>
>>>>>> Actually, the previous job was killed when it hit the walltime
>>>>>> limit.
>>>>>> I relaunched the script.
>>>>>> qstat -fr gives :
>>>>>> resources_used.cput = 93:23:08
>>>>>> resources_used.mem = 12341932kb
>>>>>> resources_used.vmem = 13271372kb
>>>>>> resources_used.walltime = 04:42:39
>>>>>>
>>>>>> At this moment only the file.info has been generated.
>>>>>>
>>>>>> Let's wait and see ...
>>>>>>
>>>>>> Thanks again
>>>>>>
>>>>>> Alexis
>>>>>>
>>>>>>
>>>>>> On 19/03/2013 21:50, Eric McDonald wrote:
>>>>>>> Hi Alexis,
>>>>>>>
>>>>>>> What does
>>>>>>> qstat -f <job-id>
>>>>>>> (where <job-id> is the ID of your job) report for the
>>>>>>> following fields:
>>>>>>> resources_used.cput
>>>>>>> resources_used.vmem
>>>>>>>
>>>>>>> And how do those values compare to the actual elapsed time
>>>>>>> for the job, the amount of physical memory on the node,
>>>>>>> and the total memory (RAM + swap space) on the node?
>>>>>>> I'm just checking that everything is running as it
>>>>>>> should be and that your process is not heavily into swap
>>>>>>> or something like that.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Eric
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 19, 2013 at 11:23 AM, Alexis Groppi
>>>>>>> <alexis.groppi at u-bordeaux2.fr> wrote:
>>>>>>>
>>>>>>> Hi Adina,
>>>>>>>
>>>>>>> First of all, thanks for your answer and your advice :)
>>>>>>> The extract-partitions.py script works!
>>>>>>> As for do-partition.py on my second set, it has been running
>>>>>>> for 32 hours. Should it not have produced at least one
>>>>>>> temporary .pmap file?
>>>>>>>
>>>>>>> Thanks again
>>>>>>>
>>>>>>> Alexis
>>>>>>>
>>>>>>> On 19/03/2013 12:58, Adina Chuang Howe wrote:
>>>>>>>>
>>>>>>>> Date: Tue, 19 Mar 2013 10:41:45 +0100
>>>>>>>> From: Alexis Groppi <alexis.groppi at u-bordeaux2.fr>
>>>>>>>> Subject: [khmer] Duration of do-partition.py (very long !)
>>>>>>>> To: khmer at lists.idyll.org
>>>>>>>>
>>>>>>>> Hi Titus,
>>>>>>>>
>>>>>>>> After digital normalization and filter-below-abund, upon your
>>>>>>>> advice I performed do-partition.py on 2 sets of data
>>>>>>>> (approx. 2.5 million reads of 75 nt):
>>>>>>>>
>>>>>>>> /khmer-BETA/scripts/do-partition.py -k 20 -x 1e9
>>>>>>>> /ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below.graphbase
>>>>>>>> /ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below
>>>>>>>> and
>>>>>>>> /khmer-BETA/scripts/do-partition.py -k 20 -x 1e9
>>>>>>>> /ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase
>>>>>>>> /ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below
>>>>>>>>
>>>>>>>> For the first one I got a
>>>>>>>> 174r1_prinseq_good_bFr8.fasta.keep.below.graphbase.info file
>>>>>>>> with the information: 33 subsets total.
>>>>>>>> Thereafter 33 .pmap files, from 0.pmap to 32.pmap, were
>>>>>>>> created at regular intervals, and finally I got a single
>>>>>>>> file, 174r1_prinseq_good_bFr8.fasta.keep.below.part (all
>>>>>>>> the .pmap files were deleted).
>>>>>>>> This treatment lasted approx. 56 hours.
>>>>>>>>
>>>>>>>> For the second set (174r2), do-partition.py has been running
>>>>>>>> for 32 hours, but so far I only got the
>>>>>>>> 174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase.info file
>>>>>>>> with the information: 35 subsets total.
>>>>>>>> And nothing more...
>>>>>>>>
>>>>>>>> Is this duration "normal" ?
>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, this is typical. The longest I've had it run is 3
>>>>>>>> weeks, for a very large dataset (billions of reads). In
>>>>>>>> general, partitioning is the most time-consuming of all the
>>>>>>>> steps. Once it's finished, you'll have much smaller files
>>>>>>>> which can be assembled very quickly. Since I run assemblies
>>>>>>>> with multiple assemblers and multiple K lengths, this gain
>>>>>>>> is often significant for me.
>>>>>>>>
>>>>>>>> To get the actual partitioned files, you can use the
>>>>>>>> following script:
>>>>>>>>
>>>>>>>> https://github.com/ged-lab/khmer/blob/master/scripts/extract-partitions.py
>>>>>>>>
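A typical invocation of that script can be sketched as a dry run. This only echoes a command string; the output prefix "out" and the input file name are hypothetical examples (check the script's --help for its actual options):

```shell
# Dry-run sketch: extract-partitions.py takes an output prefix and the
# .part file produced by do-partition.py, and writes grouped FASTA files.
# 'out' and the input file name are hypothetical examples.
CMD="extract-partitions.py out file.keep.below.part"
echo "$CMD"
```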
>>>>>>>> (The thread parameters are at their defaults (4
>>>>>>>> threads).)
>>>>>>>> 33 subsets and only one file at the end?
>>>>>>>> Should I stop do-partition.py on the second set and
>>>>>>>> re-run it with more
>>>>>>>> threads?
>>>>>>>>
>>>>>>>>
>>>>>>>> I'd suggest letting it run.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Adina
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> khmer mailing list
>>>>>>>> khmer at lists.idyll.org
>>>>>>>> http://lists.idyll.org/listinfo/khmer
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Eric McDonald
>>>>>>> HPC/Cloud Software Engineer
>>>>>>> for the Institute for Cyber-Enabled Research (iCER)
>>>>>>> and the Laboratory for Genomics, Evolution, and Development
>>>>>>> (GED)
>>>>>>> Michigan State University
>>>>>>> P: 517-355-8733
>>>>
>>>>
>>>>
>>>>
>