[khmer] How to speed up the filter-below-abund script ?
Alexis Groppi
alexis.groppi at u-bordeaux2.fr
Wed Mar 13 09:24:15 PDT 2013
Hi,
Thanks for investigating ! ;)
Le 13/03/2013 16:54, Eric McDonald a écrit :
> Thanks for the information, Alexis.
>
> The .below is an output file; it is empty because the script crashed
> before anything was written to it.
>
> I have some more ideas about what may be happening. But, rather than
> speculating wildly, I want to reason in a slightly more data-driven
> manner. So, could you please attach a copy of your job script, which
> has all of the commands that you are running? (Among other things, I
> want to see the parameters that you are using to construct the .kh file.)
I attach the bash file with all the previous steps performed.
> Also, can you please show us the output of the following command:
> head -20 /mnt/var/home/ag/174r1_prinseq_good_bFr8.fasta.keep
Here it is :
>ILLUMINA-2AC670:34:64CKCAAXX:2:1:3210:1066
GATNGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATCTCGTATGCCGTCTTCTGCTTGGAAAAGAAGA
>ILLUMINA-2AC670:34:64CKCAAXX:2:1:7734:1064
GATNGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATATCGTATGCCGTCTTCTGCTTGAAAAAAAAGT
>ILLUMINA-2AC670:34:64CKCAAXX:2:1:8363:1068
CGCNGAAGCATTTGTCGCACGGCTTGCGAAAGCAGGCGTGATCGCGCGCGATCCAAGATCGGAAGAGCACACGTC
>ILLUMINA-2AC670:34:64CKCAAXX:2:1:9951:1066
GATNGGAAGAGCATACGTCTGAACTCCAGTCACATGTCAGAATCTCGTATGCCGTCTTCTGCTTGGAGCACACGT
>ILLUMINA-2AC670:34:64CKCAAXX:2:1:13354:1070
GATNGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATATCGTATGCCGTCTTCTGCTTGAAAAAAAAAA
>ILLUMINA-2AC670:34:64CKCAAXX:2:1:14899:1066
GATNGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATCTCGTATGCCGTCTTCTGCTTGGAGCACACGT
>ILLUMINA-2AC670:34:64CKCAAXX:2:1:15273:1065
GATNGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATCTCGTATGCCGTCTTCTGCTTGGAGCACAAGT
>ILLUMINA-2AC670:34:64CKCAAXX:2:1:17148:1066
GTCNAGAACCTCGCGAGCTCGCCGGCGTTCTACGAGAAGCTTGGATTCACCGTCTTCGGGGGAAATGCCTCACAA
>ILLUMINA-2AC670:34:64CKCAAXX:2:1:5125:1074
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATCTCGTATGCCGTCTTCTGCTTGGAAAAAACGT
>ILLUMINA-2AC670:34:64CKCAAXX:2:1:5173:1073
CTGCTTCGCGCACTTATTTGCAGGGCCGATTTCGGCAGTCAGATCTGAATGAGGATTTGCTGCGCTACCTGGTTC
I also attach the filter-below-abund.py file with the following changes
- I have added#! /usr/bin/env python
- CUTOFF = 100 (since I diginormed to C=20)
and I change the permisssions from 644 to 755 in the sandbox directory
Alexis
>
> Thanks,
>
>
>
> On Wed, Mar 13, 2013 at 10:48 AM, Alexis Groppi
> <alexis.groppi at u-bordeaux2.fr <mailto:alexis.groppi at u-bordeaux2.fr>>
> wrote:
>
> Another clue :
>
> A empty .below file is generated
> (174r1_prinseq_good_bFr8.fasta.keep.below)
>
> Alexis
>
>
> Le 13/03/2013 15:13, Alexis Groppi a écrit :
>> Hi,
>>
>> Le 13/03/2013 14:12, Eric McDonald a écrit :
>>> Hi Alexis,
>>>
>>> First, let me say thank you for being patient and working with
>>> us in spite of all the problems you are encountering.
>>
>> That's bioinformatician life ;)
>>
>>>
>>> With regards to the floating point exception, I see several
>>> opportunities for a division-by-zero condition in the threading
>>> utilities used by the script. These opportunities exist if an
>>> input file is empty. (The problem may be coming from another
>>> place, but this would be my first guess.) What does the
>>> following command say:
>>>
>>> ls -lh /scratch/ag/khmer/174r1_table.kh
>>> <http://174r1_table.kh/>
>>> /mnt/var/home/ag/174r1_prinseq_good_bFr8.fasta.keep
>>
>> The result : (the files are not empty)
>> -rw-r--r-- 1 ag users 299M 12 mars 20:54
>> /mnt/var/home/ag/174r1_prinseq_good_bFr8.fasta.keep
>> -rw-r--r-- 1 ag users 141G 12 mars 21:05
>> /scratch/ag/khmer/174r1_table.kh <http://174r1_table.kh>
>>
>>>
>>> Also, since you appear to be using TORQUE as your resource
>>> manager/batch system, could you please attach the complete
>>> output and error files for the job? (These files should be of
>>> the form <job_name>.o2693 and <job_name>.e2693, where <job_name>
>>> is the name of your job. There may only be one or the other of
>>> these files, depending on site defaults and whether you
>>> specified "-j oe" or "-j eo" in your job submission.)
>>
>> I re run the job since I have deleted previous (2693) err/out files.
>> Here is the new file (merged with the option -j oe in the bash
>> script) :
>>
>> #############################
>> User: ag
>> Date: Wed Mar 13 14:59:21 CET 2013
>> Host: rainman.cbib.u-bordeaux2.fr
>> <http://rainman.cbib.u-bordeaux2.fr>
>> Directory: /mnt/var/home/ag
>> PBS_JOBID: 2695.rainman
>> PBS_O_WORKDIR: /mnt/var/home/ag
>> PBS_NODEFILE: rainman
>> #############################
>> #############################
>> Debut filter-below-abund: Wed Mar 13 14:59:21 CET 2013
>> starting threads
>> starting writer
>> loading...
>> ... filtering 0
>> /var/lib/torque/mom_priv/jobs/2695.rainman.SC
>> <http://2695.rainman.SC>: line 49: 54757 Floating point
>> exception(core dumped) ./khmer-BETA/sandbox/fi
>> lter-below-abund.py /scratch/ag/khmer/174r1_table.kh
>> <http://174r1_table.kh>
>> /mnt/var/home/ag/174r1_prinseq_good_bFr8.fasta.keep
>>
>> real 3m54.873s
>> user 0m0.085s
>> sys 2m2.180s
>> Date fin: Wed Mar 13 15:03:15 CET 2013
>> Job finished
>>
>> Thanks again for your help :)
>>
>> Alexis
>>
>>>
>>> Thanks,
>>> Eric
>>>
>>>
>>>
>>> On Wed, Mar 13, 2013 at 5:38 AM, Alexis Groppi
>>> <alexis.groppi at u-bordeaux2.fr
>>> <mailto:alexis.groppi at u-bordeaux2.fr>> wrote:
>>>
>>> Hi Eric,
>>>
>>> Thanks for your answer.
>>> But unfortunately, after many attempts I'm getting this error :
>>>
>>> starting threads
>>> starting writer
>>> loading...
>>> ... filtering 0
>>> /var/lib/torque/mom_priv/jobs/2693.rainman.SC
>>> <http://2693.rainman.SC>: line 46: 63657 Floating point
>>> exception(core dumped)
>>> ./khmer-BETA/sandbox/filter-below-abund.py
>>> /scratch/ag/khmer/174r1_table.kh <http://174r1_table.kh>
>>> /mnt/var/home/ag/174r1_prinseq_good_bFr8.fasta.keep
>>>
>>> real 3m30.163s
>>> user 0m0.088s
>>>
>>> Your opinion ?
>>>
>>> Thanks
>>>
>>> Alexis
>>>
>>>
>>> Le 13/03/2013 00:55, Eric McDonald a écrit :
>>>> Hi Alexis,
>>>>
>>>> One way to get the 'bleeding-edge' branch is to clone it
>>>> into a fresh directory; for example:
>>>> git clone http://github.com/ged-lab/khmer.git -b
>>>> bleeding-edge khmer-BETA
>>>>
>>>> Assuming you already have a clone of the 'ged-lab/khmer'
>>>> repo, then you should also be able to do:
>>>> git fetch origin
>>>> git checkout bleeding-edge
>>>> Depending on how old your Git client is and what its
>>>> defaults are, you may have to do the following instead:
>>>> git checkout --track -b bleeding-edge origin/bleeding-edge
>>>>
>>>> Hope this helps,
>>>> Eric
>>>>
>>>>
>>>> On Tue, Mar 12, 2013 at 11:32 AM, Alexis Groppi
>>>> <alexis.groppi at u-bordeaux2.fr
>>>> <mailto:alexis.groppi at u-bordeaux2.fr>> wrote:
>>>>
>>>>
>>>> Le 12/03/2013 16:16, C. Titus Brown a écrit :
>>>>> On Tue, Mar 12, 2013 at 04:15:05PM +0100, Alexis Groppi wrote:
>>>>>> Hi Titus,
>>>>>>
>>>>>> Thanks for your answer
>>>>>> Actually it's my second attempt with filter-below-abund.
>>>>>> The first time, I thought the problem was coming from the location of my
>>>>>> table.kh <http://table.kh> file : in a storage element with poor level performance of I/O
>>>>>> I killed the job after 24h, moved the file in a best place and re run it
>>>>>> But with the same result : no completion after 24h
>>>>>>
>>>>>> Any Idea ?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Cheers From Bordeaux :)
>>>>>>
>>>>>> Alexis
>>>>>>
>>>>>> PS : The command line was the following :
>>>>>>
>>>>>> ./filter-below-abund.py174r1_table.kh <http://174r1_table.kh> 174r1_prinseq_good_bFr8.fasta.keep
>>>>>>
>>>>>> Is this correct ?
>>>>> Yes, looks right... Can you try with the bleeding-edge branch, which now
>>>>> incorporates a potential fix for this issue?
>>>> From here :
>>>> https://github.com/ged-lab/khmer/tree/bleeding-edge ?
>>>> or
>>>> here : https://github.com/ctb/khmer/tree/bleeding-edge ?
>>>>
>>>> Do I have to make a fresh install ? and How ?
>>>> Or just replace all the files and folders ?
>>>>
>>>> Thanks :)
>>>>
>>>> Alexis
>>>>
>>>>
>>>>> thanks,
>>>>> --titus
>>>>>
>>>>>> Le 12/03/2013 14:41, C. Titus Brown a ?crit :
>>>>>>> On Tue, Mar 12, 2013 at 10:48:03AM +0100, Alexis Groppi wrote:
>>>>>>>> Metagenome assembly :
>>>>>>>> My data :
>>>>>>>> - original (quality filtered) data : 4463243 reads (75 nt) (Illumina)
>>>>>>>> 1/ Single pass digital normalization with normalize-by-median (C=20)
>>>>>>>> ==> file .keep of 2560557 reads
>>>>>>>> 2/ generated a hash table by load-into-counting on the .keep file
>>>>>>>> ==> file .kh of ~16Go (huge file ?!)
>>>>>>>> 3/ filter-below-abund with C=100 from the two previous file (table.kh <http://table.kh>
>>>>>>>> and reads.keep)
>>>>>>>> Still running after 24 hours :(
>>>>>>>>
>>>>>>>> Any advice to speed up this step ? ... and the others (partitionning ...) ?
>>>>>>>>
>>>>>>>> I can have an access to a HPC : ~3000 cores.
>>>>>>> Hi Alexis,
>>>>>>>
>>>>>>> filter-below-abund and filter-abund have occasional bugs that prevent them
>>>>>>> from completing. I would kill and restart. For that few reads it should
>>>>>>> take no more than a few hours to do everything.
>>>>>>>
>>>>>>> Most of what khmer does cannot easily be distributed across multiple chassis,
>>>>>>> note.
>>>>>>>
>>>>>>> best,
>>>>>>> --titus
>>>>>> --
>>>>
>>>> --
>>>>
>>>> _______________________________________________
>>>> khmer mailing list
>>>> khmer at lists.idyll.org <mailto:khmer at lists.idyll.org>
>>>> http://lists.idyll.org/listinfo/khmer
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Eric McDonald
>>>> HPC/Cloud Software Engineer
>>>> for the Institute for Cyber-Enabled Research (iCER)
>>>> and the Laboratory for Genomics, Evolution, and
>>>> Development (GED)
>>>> Michigan State University
>>>> P: 517-355-8733 <tel:517-355-8733>
>>>
>>> --
>>>
>>>
>>>
>>>
>>> --
>>> Eric McDonald
>>> HPC/Cloud Software Engineer
>>> for the Institute for Cyber-Enabled Research (iCER)
>>> and the Laboratory for Genomics, Evolution, and Development (GED)
>>> Michigan State University
>>> P: 517-355-8733 <tel:517-355-8733>
>>
>> --
>
> --
>
>
>
>
> --
> Eric McDonald
> HPC/Cloud Software Engineer
> for the Institute for Cyber-Enabled Research (iCER)
> and the Laboratory for Genomics, Evolution, and Development (GED)
> Michigan State University
> P: 517-355-8733
--
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20130313/d33b83a2/attachment-0002.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 29033 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20130313/d33b83a2/attachment-0010.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 29033 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20130313/d33b83a2/attachment-0011.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 29033 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20130313/d33b83a2/attachment-0012.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 29033 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20130313/d33b83a2/attachment-0013.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Signature_Mail_A_Groppi.png
Type: image/png
Size: 29033 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20130313/d33b83a2/attachment-0014.png>
-------------- next part --------------
#!/bin/sh
#############################
# les directives PBS vont ici:
# Your job name (displayed by the queue)
#PBS -N step_1_and_2_Metagenome_assembly
# Specify the working directory
#PBS -d /scratch/ag/khmer
# walltime (hh:mm::ss)
#PBS -l walltime=48:00:00
# Specify the number of nodes(nodes=) and the number of cores per nodes(ppn=) to be used
#PBS -l nodes=1:ppn=1
# fin des directives PBS
#############################
# useful informations to print
echo "#############################"
echo "User:" $USER
echo "Date:" `date`
echo "Host:" `hostname`
echo "Directory:" `pwd`
echo "PBS_JOBID:" $PBS_JOBID
echo "PBS_O_WORKDIR:" $PBS_O_WORKDIR
echo "PBS_NODEFILE: " `cat $PBS_NODEFILE | uniq`
echo "#############################"
#############################
# What you actually want to launch
. /mnt/var/home/ag/env/bin/activate
echo "#############################"
echo "Debut normalize-by-median:" `date`
time ./khmer/scripts/normalize-by-median.py -N 4 -x 64e9 -k 20 /mnt/var/home/ag/174r1_prinseq_good_bFr8.fasta
echo "#############################"
echo "Debut load-into-counting:" `date`
time ./khmer/scripts/load-into-counting.py -k 20 -x 32e9 /scratch/ag/khmer/174r1_table.kh 174r1_prinseq_good_bFr8.fasta.keep
echo "#############################"
echo "Debut filter-below-abund:" `date`
time /mnt/var/home/ag/khmer-BETA/sandbox/filter-below-abund.py /scratch/ag/khmer/174r1_table.kh /scratch/ag/khmer/174r1_prinseq_good_bFr8.fasta.keep
echo "Date fin:" `date`
# all done
echo "Job finished"
-------------- next part --------------
#! /usr/bin/env python
import sys
import screed.fasta
import os
import khmer
from khmer.thread_utils import ThreadedSequenceProcessor, verbose_fasta_iter
WORKER_THREADS = 8
GROUPSIZE = 100
CUTOFF = 100
###
def main():
counting_ht = sys.argv[1]
infiles = sys.argv[2:]
print 'file with ht: %s' % counting_ht
print '-- settings:'
print 'N THREADS', WORKER_THREADS
print '--'
print 'making hashtable'
ht = khmer.load_counting_hash(counting_ht)
K = ht.ksize()
for infile in infiles:
print 'filtering', infile
outfile = os.path.basename(infile) + '.below'
outfp = open(outfile, 'w')
def process_fn(record, ht=ht):
name = record['name']
seq = record['sequence']
if 'N' in seq:
return None, None
trim_seq, trim_at = ht.trim_below_abundance(seq, CUTOFF)
if trim_at >= K:
return name, trim_seq
return None, None
tsp = ThreadedSequenceProcessor(process_fn, WORKER_THREADS, GROUPSIZE)
tsp.start(verbose_fasta_iter(infile), outfp)
if __name__ == '__main__':
main()
More information about the khmer
mailing list