[khmer] Duration of do-partition.py (very long !) (Alexis Groppi)

Adina Chuang Howe adina.chuang at gmail.com
Tue Mar 19 04:58:35 PDT 2013


Message: 1
> Date: Tue, 19 Mar 2013 10:41:45 +0100
> From: Alexis Groppi <alexis.groppi at u-bordeaux2.fr>
> Subject: [khmer] Duration of do-partition.py (very long !)
> To: khmer at lists.idyll.org
> Message-ID: <514832D9.7090207 at u-bordeaux2.fr>
> Content-Type: text/plain; charset="iso-8859-1"; Format="flowed"
>
> Hi Titus,
>
> After digital normalization and filter-below-abund, upon your advice I
> performed do-partition.py on 2 sets of data (approx. 2.5 million
> reads (75 nt)):
>
> /khmer-BETA/scripts/do-partition.py -k 20 -x 1e9
> /ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below.graphbase
> /ag/khmer/Sample_174/174r1_prinseq_good_bFr8.fasta.keep.below
> and
> /khmer-BETA/scripts/do-partition.py -k 20 -x 1e9
> /ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase
> /ag/khmer/Sample_174/174r2_prinseq_good_1lIQ.fasta.keep.below
>
> For the first one I got a
> 174r1_prinseq_good_bFr8.fasta.keep.below.graphbase.info with the
> information : 33 subsets total
> Thereafter 33 .pmap files, from 0.pmap to 32.pmap, were created at regular
> intervals, and finally I got a single file,
> 174r1_prinseq_good_bFr8.fasta.keep.below.part (all the .pmap files were
> deleted).
> This treatment lasted approx 56 hours.
>
> For the second set (174r2), do-partition.py has been running for 32 hours
> but I only got the
> 174r2_prinseq_good_1lIQ.fasta.keep.below.graphbase.info with the
> information : 35 subsets total
> And nothing more...
>
> Is this duration "normal" ?
>

Yes, this is typical.  The longest I've had it run is 3 weeks, for a very
large data set (billions of reads).  In general, partitioning is the most
time-consuming of all the steps.  Once it's finished, you'll have much smaller
files which can be assembled very quickly.  Since I run assembly with
multiple assemblers and with multiple K lengths, this gain is often
significant for me.

To get the actual partitioned files, you can use the following script:

https://github.com/ged-lab/khmer/blob/master/scripts/extract-partitions.py
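
Before extracting, it can be useful to see how the reads are spread across
partitions. Here is a minimal sketch, assuming the .part file is FASTA in
which each header line ends with a tab followed by the partition ID (that
layout is an assumption, so check a few headers in your own file first). The
function name and the example reads are made up for illustration.

```python
from collections import Counter


def partition_sizes(lines):
    """Count reads per partition ID from FASTA header lines."""
    sizes = Counter()
    for line in lines:
        if line.startswith(">") and "\t" in line:
            # Assumed header layout: '>read_name<TAB>partition_id'
            pid = line.rstrip("\n").rsplit("\t", 1)[1]
            sizes[pid] += 1
    return sizes


# Made-up records, for illustration only.
example = [">r1\t1\n", "ACGT\n", ">r2\t1\n", "GGCC\n", ">r3\t2\n", "TTAA\n"]
print(partition_sizes(example).most_common())
```

For a real file you would pass an open file handle instead of the list, e.g.
partition_sizes(open(path)), since the function only iterates over lines.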

> (The parameters for the threads are by default (4 threads))
> 33 subsets and only one file at the end ?
> Should I stop do-partition.py on the second set and re run it with more
> threads ?
>
>
I'd suggest letting it run.
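
If you want a rough idea of how far along it is while you wait, you can
compare the "N subsets total" line in the .info file with the number of
.pmap files written so far. This is only a sketch: the function names and
the example file names are hypothetical, and it assumes nothing about the
.info file beyond the "subsets total" line you quoted.

```python
import re


def subsets_total(info_text):
    """Return N from a line such as '33 subsets total', or None if absent."""
    match = re.search(r"(\d+)\s+subsets total", info_text)
    return int(match.group(1)) if match else None


def pmap_progress(info_text, filenames):
    """Return (number of .pmap files seen, number of subsets expected)."""
    done = sum(1 for name in filenames if name.endswith(".pmap"))
    return done, subsets_total(info_text)


# Made-up inputs, for illustration only.
done, total = pmap_progress("35 subsets total\n", ["x.0.pmap", "x.1.pmap", "x.info"])
print(f"{done} of {total} subsets written")
```

In practice you would pass the contents of the .info file and something like
os.listdir() of the output directory. Since the .pmap files are deleted once
the final .part file is written, a count of zero can mean either "not started
writing yet" or "already merged", so check for the .part file as well.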

Best,
Adina