[khmer] Duration of do-partition.py (very long !) (Alexis Groppi)

Alexis Groppi alexis.groppi at u-bordeaux2.fr
Thu Mar 21 08:51:30 PDT 2013


Sorry to bother you, but this is still not clear to me:

To remove the artefacts:
Should I run find-knots on my file.below (i.e. after
normalize-by-median.py, load-into-counting.py and filter-below-abund.py),
and then filter-stoptags?
And after that, will the data be ready for assembly, or should I still
run do-partition.py on these artefact-free data?
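
To make sure I understand, here is the order I have in mind, written out
as commands (a rough sketch only: the -k/-x/-T values are simply copied
from my earlier runs, I am not sure of the exact options for
find-knots.py, and the output file names marked with "?" are my guesses):

   # already done: normalize-by-median.py, load-into-counting.py,
   # filter-below-abund.py   ->   file.below
   find-knots.py file.graphbase                                  # -> file.graphbase.stoptags ?
   filter-stoptags.py -k 20 file.graphbase.stoptags file.below   # -> file.below.stopfilt ?
   do-partition.py -k 20 -x 1e9 -T 20 file2.graphbase file.below.stopfilt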

Thanks

Alexis
On 21/03/2013 15:28, C. Titus Brown wrote:
> On Thu, Mar 21, 2013 at 03:15:33PM +0100, Alexis Groppi wrote:
>> Thanks for your answer. The input file I use should no longer contain
>> this artefact, because it is the output of the filter-below-abund
>> treatment.
>> I will try find-knots and then filter-stoptags.
>> Regarding your last suggestion: what is the size limit?
>> A follow-up question: Eric told me "Titus created a guide about what size
>> hash table to generally use with certain kinds of data".
>> If possible, I would be very interested in having this guide.
> http://khmer.readthedocs.org/en/latest/
>
> http://khmer.readthedocs.org/en/latest/choosing-hash-sizes.html
>
> OK, you may have to use the find-knots stuff --
>
> http://khmer.readthedocs.org/en/latest/partitioning-big-data.html
>
> cheers,
> --titus
>
>> On 21/03/2013 14:14, C. Titus Brown wrote:
>>> This long wait is probably a sign that you have a highly connected
>>> graph. We usually attribute that to the presence of sequencing
>>> artifacts, which have to be removed either via filter-below-abund or
>>> find-knots; do-partition can't do it by itself.  Take a look at the
>>> handbook or the info on partitioning large data.
>>>
>>> In your case I think your data may be small enough to assemble just
>>> after diginorm.
>>>
>>> ---
>>> C. Titus Brown, ctb at msu.edu
>>>
>>> On Mar 21, 2013, at 8:50, Eric McDonald <emcd.msu at gmail.com> wrote:
>>>
>>>> Thanks for the information, Alexis. If you are using 20 threads, then
>>>> 441 / 20 is about 22 hours of elapsed time. So, it appears that all
>>>> of the threads are working. (There is the possibility that they could
>>>> be busy-waiting somewhere, but I didn't see any explicit
>>>> opportunities for that from reading the 'do-partition.py' code.)
>>>> Since you haven't seen .pmap files yet and since multithreaded
>>>> execution is occurring, I expect that execution is currently at the
>>>> following place in the script:
>>>> https://github.com/ged-lab/khmer/blob/bleeding-edge/scripts/do-partition.py#L57
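>>>>
>>>> (As a rough sanity check of that arithmetic, purely from the qstat
>>>> numbers you posted; the command below is just my way of doing the
>>>> division, nothing khmer-specific:)
>>>>
>>>>   # 441:04:21 of CPU time spread over 20 threads ~= elapsed time per thread
>>>>   echo "scale=2; (441 + 4/60 + 21/3600) / 20" | bc   # ~22.05 h, close to walltime 22:05:56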
>>>>
>>>> I am not familiar with the 'do_subset_partition' code, but will try
>>>> to analyze it later today. However, I would also listen to what Adina
>>>> is saying - this step may just take a long time....
>>>>
>>>> Eric
>>>>
>>>> P.S. If you want to check on the output from the script, you could
>>>> look in /var/spool/PBS/mom_priv (or equivalent) on the node where the
>>>> job is running to see what the spooled output looks like thus far.
>>>> (There should be a file named with the job ID and either a ".ER" or
>>>> ".OU" extension, if I recall correctly, though it has been a while
>>>> since I last administered your kind of batch system.) You may need
>>>> David to do this as the permissions to the directory are typically
>>>> restrictive.
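>>>>
>>>> For example, something along these lines on that node (treat the
>>>> exact path and file names as assumptions for your site; this is just
>>>> the typical TORQUE/PBS layout I have in mind):
>>>>
>>>>   ls /var/spool/PBS/mom_priv/       # or the equivalent spool area on your cluster
>>>>   tail <jobid>.OU <jobid>.ER        # partial stdout / stderr spooled so far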
>>>>
>>>>
>>>>
>>>> On Thu, Mar 21, 2013 at 5:40 AM, Alexis Groppi
>>>> <alexis.groppi at u-bordeaux2.fr>
>>>> wrote:
>>>>
>>>>      One point of clarification:
>>>>
>>>>      The file submitted to the do-partition.py script contains 2,576,771
>>>>      reads (file.below).
>>>>      The job was launched with the following options:
>>>>      khmer-BETA/scripts/do-partition.py -k 20 -x 1e9 -T 20
>>>>      file.graphbase file.below
>>>>
>>>>      Alexis
>>>>
>>>>
>>>>      On 21/03/2013 10:13, Alexis Groppi wrote:
>>>>>      Hi Eric,
>>>>>
>>>>>      The do-partition.py script has now been running for 22 hours.
>>>>>      Only the file.info has been generated. No .pmap files have been
>>>>>      created.
>>>>>
>>>>>      qstat -f gives:
>>>>>          resources_used.cput = 441:04:21
>>>>>          resources_used.mem = 12764228kb
>>>>>          resources_used.vmem = 13926732kb
>>>>>          resources_used.walltime = 22:05:56
>>>>>
>>>>>      The amount of RAM on the server is 256 GB, and the swap space is
>>>>>      also 256 GB.
>>>>>
>>>>>      What is your opinion?
>>>>>
>>>>>      Thanks
>>>>>
>>>>>      Alexis
>>>>>
>>>>>      On 20/03/2013 16:43, Alexis Groppi wrote:
>>>>>>      Hi Eric,
>>>>>>
>>>>>>      Actually, the previous job was terminated because it hit the
>>>>>>      walltime limit.
>>>>>>      I relaunched the script.
>>>>>>      qstat -fr gives:
>>>>>>          resources_used.cput = 93:23:08
>>>>>>          resources_used.mem = 12341932kb
>>>>>>          resources_used.vmem = 13271372kb
>>>>>>          resources_used.walltime = 04:42:39
>>>>>>
>>>>>>      At the moment, only the file.info has been generated.
>>>>>>
>>>>>>      Let's wait and see ...
>>>>>>
>>>>>>      Thanks again
>>>>>>
>>>>>>      Alexis
>>>>>>
>>>>>>
>>>>>>      On 19/03/2013 21:50, Eric McDonald wrote:
>>>>>>>      Hi Alexis,
>>>>>>>
>>>>>>>      What does:
>>>>>>>        qstat -f <job-id>
>>>>>>>      where <job-id> is the ID of your job tell you for the
>>>>>>>      following fields:
>>>>>>>        resources_used.cput
>>>>>>>        resources_used.vmem
>>>>>>>
>>>>>>>      And how do those values compare to the actual amount of elapsed
>>>>>>>      time for the job, the amount of physical memory on the node,
>>>>>>>      and the total memory (RAM + swap space) on the node?
>>>>>>>      Just checking to make sure that everything is running as it
>>>>>>>      should be and that your process is not heavily into swap or
>>>>>>>      something like that.
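>>>>>>>
>>>>>>>      For example, something like this (just a sketch; substitute
>>>>>>>      your actual job ID):
>>>>>>>
>>>>>>>        qstat -f <job-id> | grep resources_used
>>>>>>>        free -g    # physical RAM and swap on the node, in GB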
>>>>>>>
>>>>>>>      Thanks,
>>>>>>>        Eric
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>      On Tue, Mar 19, 2013 at 11:23 AM, Alexis Groppi
>>>>>>>      <alexis.groppi at u-bordeaux2.fr> wrote:
>>>>>>>
>>>>>>>          Hi Adina,
>>>>>>>
>>>>>>>          First of all, thanks for your answer and your advice :)
>>>>>>>          The extract-partitions.py script works!
>>>>>>>          As for do-partition.py on my second set, it has now been
>>>>>>>          running for 32 hours. Should it not have produced at least
>>>>>>>          one temporary .pmap file?
>>>>>>>
>>>>>>>          Thanks again
>>>>>>>
>>>>>>>          Alexis
>>>>>>>
>>>>>>>          On 19/03/2013 12:58, Adina Chuang Howe wrote:
>>>>>>>>
>>>>>>>>              Message: 1
>>>>>>>>              Date: Tue, 19 Mar 2013 10:41:45 +0100
>>>>>>>>              From: Alexis Groppi <alexis.groppi at u-bordeaux2.fr>
>>>>>>>>              Subject: [khmer] Duration of do-partition.py (very
>>>>>>>>              long !)
>>>>>>>>              To: khmer at lists.idyll.org
>>>>>>>>
>>>>>>>>              Hi Titus,
>>>>>>>>
>>>>>>>>              After digital normalization and filter-below-abund, on
>>>>>>>>              your advice I ran do-partition.py on 2 sets of data
>>>>>>>>              (approx. 2.5 million reads of 75 nt each):
>>>>>>>>
>>>>>>>>              /khmer-BETA/scripts/do-partition.py -k 20 -x 1e9
>>>>>>>>              /ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below.graphbase
>>>>>>>>              /ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below
>>>>>>>>              and
>>>>>>>>              /khmer-BETA/scripts/do-partition.py -k 20 -x 1e9
>>>>>>>>              /ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase
>>>>>>>>              /ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below
>>>>>>>>
>>>>>>>>              For the first one, I got a
>>>>>>>>              174r1_prinseq_good_bFr8.fasta.keep.below.graphbase.info
>>>>>>>>              file with the information: 33 subsets total.
>>>>>>>>              Thereafter, 33 .pmap files (0.pmap to 32.pmap) were
>>>>>>>>              created at regular intervals, and finally I got a single
>>>>>>>>              file, 174r1_prinseq_good_bFr8.fasta.keep.below.part (all
>>>>>>>>              the .pmap files were deleted).
>>>>>>>>              This treatment lasted approx. 56 hours.
>>>>>>>>
>>>>>>>>              For the second set (174r2), do-partition.py has been
>>>>>>>>              running for 32 hours, but so far I only have the
>>>>>>>>              174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase.info
>>>>>>>>              file with the information: 35 subsets total.
>>>>>>>>              And nothing more...
>>>>>>>>
>>>>>>>>              Is this duration "normal"?
>>>>>>>>
>>>>>>>>
>>>>>>>>          Yes, this is typical.  The longest I've had it run is 3
>>>>>>>>          weeks, for a very large dataset (billions of reads).  In
>>>>>>>>          general, partitioning is the most time-consuming of all the
>>>>>>>>          steps.  Once it's finished, you'll have much smaller files
>>>>>>>>          which can be assembled very quickly.  Since I run assemblies
>>>>>>>>          with multiple assemblers and multiple K lengths, this gain
>>>>>>>>          is often significant for me.
>>>>>>>>
>>>>>>>>          To get the actual partitioned files, you can use the
>>>>>>>>          following script:
>>>>>>>>
>>>>>>>>          https://github.com/ged-lab/khmer/blob/master/scripts/extract-partitions.py
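>>>>>>>>
>>>>>>>>          Typical usage looks something like this (a sketch; the
>>>>>>>>          output prefix is up to you, and the exact output file
>>>>>>>>          naming is from memory, so double-check it):
>>>>>>>>
>>>>>>>>            extract-partitions.py 174r1 174r1_prinseq_good_bFr8.fasta.keep.below.part
>>>>>>>>            # should write grouped FASTA files like 174r1.group0000.fa, ready for assembly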
>>>>>>>>
>>>>>>>>              (The thread parameter was left at its default of 4
>>>>>>>>              threads.)
>>>>>>>>              33 subsets and only one file at the end?
>>>>>>>>              Should I stop do-partition.py on the second set and
>>>>>>>>              re-run it with more threads?
>>>>>>>>
>>>>>>>>
>>>>>>>>          I'd suggest letting it run.
>>>>>>>>
>>>>>>>>          Best,
>>>>>>>>          Adina
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>      --
>>>>>>>      Eric McDonald
>>>>>>>      HPC/Cloud Software Engineer
>>>>>>>        for the Institute for Cyber-Enabled Research (iCER)
>>>>>>>        and the Laboratory for Genomics, Evolution, and Development (GED)
>>>>>>>      Michigan State University
>>>>>>>      P: 517-355-8733
>>>>
>>>>
>>>>
>>>>
>>>> -- 
>>>> Eric McDonald
>>>> HPC/Cloud Software Engineer
>>>>    for the Institute for Cyber-Enabled Research (iCER)
>>>>    and the Laboratory for Genomics, Evolution, and Development (GED)
>>>> Michigan State University
>>>> P: 517-355-8733
>
