<br><br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Message: 1<br>

Date: Tue, 19 Mar 2013 10:41:45 +0100<br>

From: Alexis Groppi &lt;<a href="mailto:alexis.groppi@u-bordeaux2.fr">alexis.groppi@u-bordeaux2.fr</a>&gt;<br>

Subject: [khmer] Duration of do-partition.py (very long !)<br>

To: <a href="mailto:khmer@lists.idyll.org">khmer@lists.idyll.org</a><br>

Message-ID: &lt;<a href="mailto:514832D9.7090207@u-bordeaux2.fr">514832D9.7090207@u-bordeaux2.fr</a>&gt;<br>

Content-Type: text/plain; charset=&quot;iso-8859-1&quot;; Format=&quot;flowed&quot;<br>

<br>

Hi Titus,<br>

<br>

After digital normalization and filter-below-abund, upon your advice I<br>

performed do-partition.py on 2 sets of data (approx. 2.5 million<br>

reads (75 nt)):<br>

<br>

/khmer-BETA/scripts/do-partition.py -k 20 -x 1e9<br>

/ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below.graphbase<br>

/ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below<br>

and<br>

/khmer-BETA/scripts/do-partition.py -k 20 -x 1e9<br>

/ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase<br>

/ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below<br>

<br>

For the first one I got a<br>

174r1_prinseq_good_bFr8.fasta.keep.below.graphbase.info file with the<br>

information: 33 subsets total<br>

Thereafter 33 .pmap files, from 0.pmap to 32.pmap, were created at regular<br>

intervals, and finally I got a single file,<br>

174r1_prinseq_good_bFr8.fasta.keep.below.part (all the .pmap files were<br>

deleted)<br>

This run took approx. 56 hours.<br>

<br>

For the second set (174r2), do-partition.py has been running for 32 hours<br>

but I only got the<br>

174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase.info file with the<br>

information: 35 subsets total<br>

And nothing more...<br>

<br>

Is this duration &quot;normal&quot;?<br></blockquote><div><br></div><div>Yes, this is typical. The longest I&#39;ve had it run is 3 weeks, for a very large dataset (billions of reads). In general, partitioning is the most time-consuming of all the steps. Once it&#39;s finished, you&#39;ll have much smaller files that can be assembled very quickly. Since I run assembly with multiple assemblers and multiple k values, this gain is often significant for me.</div>

<div><br></div><div>To get the actual partitioned files, you can use the following script:</div><div><br></div><div><a href="https://github.com/ged-lab/khmer/blob/master/scripts/extract-partitions.py">https://github.com/ged-lab/khmer/blob/master/scripts/extract-partitions.py</a></div>

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

(The number of threads was left at the default (4 threads))<br>

33 subsets and only one file at the end?<br>

Should I stop do-partition.py on the second set and rerun it with more<br>

threads?<br>

<br></blockquote><div><br></div><div>I&#39;d suggest letting it run.</div><div><br></div><div>Best,</div><div>Adina</div></div>