<br><br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Message: 1<br>
Date: Tue, 19 Mar 2013 10:41:45 +0100<br>
From: Alexis Groppi &lt;<a href="mailto:alexis.groppi@u-bordeaux2.fr">alexis.groppi@u-bordeaux2.fr</a>&gt;<br>
Subject: [khmer] Duration of do-partition.py (very long !)<br>
To: <a href="mailto:khmer@lists.idyll.org">khmer@lists.idyll.org</a><br>
Message-ID: &lt;514832D9.7090207@u-bordeaux2.fr&gt;<br>
Content-Type: text/plain; charset="iso-8859-1"; Format="flowed"<br>
<br>
Hi Titus,<br>
<br>
After digital normalization and filter-below-abund, upon your advice I<br>
performed do-partition.py on 2 sets of data (approx. 2.5 million<br>
reads (75 nt)):<br>
<br>
/khmer-BETA/scripts/do-partition.py -k 20 -x 1e9<br>
/ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below.graphbase<br>
/ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below<br>
and<br>
/khmer-BETA/scripts/do-partition.py -k 20 -x 1e9<br>
/ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase<br>
/ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below<br>
<br>
For the first one I got a<br>
174r1_prinseq_good_bFr8.fasta.keep.below.graphbase.info file with the<br>
information: 33 subsets total<br>
After that, 33 .pmap files (0.pmap to 32.pmap) were created at regular<br>
intervals, and finally I got a single file,<br>
174r1_prinseq_good_bFr8.fasta.keep.below.part (all the .pmap files were<br>
deleted).<br>
This run took approx. 56 hours.<br>
<br>
For the second set (174r2), do-partition.py has been running for 32 hours<br>
but so far I only got the<br>
174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase.info file with the<br>
information: 35 subsets total<br>
And nothing more...<br>
<br>
Is this duration "normal"?<br></blockquote><div><br></div><div>Yes, this is typical. The longest I've had it run is 3 weeks, for a very large dataset (billions of reads). In general, partitioning is the most time-consuming of all the steps. Once it's finished, you'll have much smaller files, which can be assembled very quickly. Since I run assembly with multiple assemblers and multiple k-mer lengths, this gain is often significant for me.</div>
<div><br></div><div>To get the actual partitioned files, you can use the following script:</div><div><br></div><div><a href="https://github.com/ged-lab/khmer/blob/master/scripts/extract-partitions.py">https://github.com/ged-lab/khmer/blob/master/scripts/extract-partitions.py</a></div>
<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
(The thread count was left at the default (4 threads).)<br>
33 subsets and only one file at the end?<br>
Should I stop do-partition.py on the second set and rerun it with more<br>
threads?<br>
<br></blockquote><div><br></div><div>I'd suggest letting it run.</div><div><br></div><div>Best,</div><div>Adina</div></div>