[khmer] How to speed up the filter-below-abund script ?

Alexis Groppi alexis.groppi at u-bordeaux2.fr
Wed Mar 13 09:24:15 PDT 2013


Hi,

Thanks for investigating ! ;)

Le 13/03/2013 16:54, Eric McDonald a écrit :
> Thanks for the information, Alexis.
>
> The .below is an output file; it is empty because the script crashed 
> before anything was written to it.
>
> I have some more ideas about what may be happening. But, rather than 
> speculating wildly, I want to reason in a slightly more data-driven 
> manner. So, could you please attach a copy of your job script, which 
> has all of the commands that you are running? (Among other things, I 
> want to see the parameters that you are using to construct the .kh file.)

I attach the bash file with all the previous steps performed.

> Also, can you please show us the output of the following command:
>   head -20 /mnt/var/home/ag/174r1_prinseq_good_bFr8.fasta.keep

Here it is :

 >ILLUMINA-2AC670:34:64CKCAAXX:2:1:3210:1066
GATNGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATCTCGTATGCCGTCTTCTGCTTGGAAAAGAAGA
 >ILLUMINA-2AC670:34:64CKCAAXX:2:1:7734:1064
GATNGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATATCGTATGCCGTCTTCTGCTTGAAAAAAAAGT
 >ILLUMINA-2AC670:34:64CKCAAXX:2:1:8363:1068
CGCNGAAGCATTTGTCGCACGGCTTGCGAAAGCAGGCGTGATCGCGCGCGATCCAAGATCGGAAGAGCACACGTC
 >ILLUMINA-2AC670:34:64CKCAAXX:2:1:9951:1066
GATNGGAAGAGCATACGTCTGAACTCCAGTCACATGTCAGAATCTCGTATGCCGTCTTCTGCTTGGAGCACACGT
 >ILLUMINA-2AC670:34:64CKCAAXX:2:1:13354:1070
GATNGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATATCGTATGCCGTCTTCTGCTTGAAAAAAAAAA
 >ILLUMINA-2AC670:34:64CKCAAXX:2:1:14899:1066
GATNGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATCTCGTATGCCGTCTTCTGCTTGGAGCACACGT
 >ILLUMINA-2AC670:34:64CKCAAXX:2:1:15273:1065
GATNGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATCTCGTATGCCGTCTTCTGCTTGGAGCACAAGT
 >ILLUMINA-2AC670:34:64CKCAAXX:2:1:17148:1066
GTCNAGAACCTCGCGAGCTCGCCGGCGTTCTACGAGAAGCTTGGATTCACCGTCTTCGGGGGAAATGCCTCACAA
 >ILLUMINA-2AC670:34:64CKCAAXX:2:1:5125:1074
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATCTCGTATGCCGTCTTCTGCTTGGAAAAAACGT
 >ILLUMINA-2AC670:34:64CKCAAXX:2:1:5173:1073
CTGCTTCGCGCACTTATTTGCAGGGCCGATTTCGGCAGTCAGATCTGAATGAGGATTTGCTGCGCTACCTGGTTC

I also attach the filter-below-abund.py file with the following changes
- I have added#! /usr/bin/env python
- CUTOFF = 100 (since I diginormed to C=20)
and I change the permisssions from 644 to 755 in the sandbox directory

Alexis

>
> Thanks,
>
>
>
> On Wed, Mar 13, 2013 at 10:48 AM, Alexis Groppi 
> <alexis.groppi at u-bordeaux2.fr <mailto:alexis.groppi at u-bordeaux2.fr>> 
> wrote:
>
>     Another clue :
>
>     A empty .below file is generated
>     (174r1_prinseq_good_bFr8.fasta.keep.below)
>
>     Alexis
>
>
>     Le 13/03/2013 15:13, Alexis Groppi a écrit :
>>     Hi,
>>
>>     Le 13/03/2013 14:12, Eric McDonald a écrit :
>>>     Hi Alexis,
>>>
>>>     First, let me say thank you for being patient and working with
>>>     us in spite of all the problems you are encountering.
>>
>>     That's bioinformatician life ;)
>>
>>>
>>>     With regards to the floating point exception, I see several
>>>     opportunities for a division-by-zero condition in the threading
>>>     utilities used by the script. These opportunities exist if an
>>>     input file is empty. (The problem may be coming from another
>>>     place, but this would be my first guess.) What does the
>>>     following command say:
>>>
>>>       ls -lh /scratch/ag/khmer/174r1_table.kh
>>>     <http://174r1_table.kh/>
>>>     /mnt/var/home/ag/174r1_prinseq_good_bFr8.fasta.keep
>>
>>      The result : (the files are not empty)
>>     -rw-r--r-- 1 ag users 299M 12 mars  20:54
>>     /mnt/var/home/ag/174r1_prinseq_good_bFr8.fasta.keep
>>     -rw-r--r-- 1 ag users 141G 12 mars  21:05
>>     /scratch/ag/khmer/174r1_table.kh <http://174r1_table.kh>
>>
>>>
>>>     Also, since you appear to be using TORQUE as your resource
>>>     manager/batch system, could you please attach the complete
>>>     output and error files for the job? (These files should be of
>>>     the form <job_name>.o2693 and <job_name>.e2693, where <job_name>
>>>     is the name of your job. There may only be one or the other of
>>>     these files, depending on site defaults and whether you
>>>     specified "-j oe" or "-j eo" in your job submission.)
>>
>>     I re run the job since I have deleted previous (2693) err/out files.
>>     Here is the new file (merged with the option -j oe in the bash
>>     script) :
>>
>>     #############################
>>     User: ag
>>     Date: Wed Mar 13 14:59:21 CET 2013
>>     Host: rainman.cbib.u-bordeaux2.fr
>>     <http://rainman.cbib.u-bordeaux2.fr>
>>     Directory: /mnt/var/home/ag
>>     PBS_JOBID: 2695.rainman
>>     PBS_O_WORKDIR: /mnt/var/home/ag
>>     PBS_NODEFILE:  rainman
>>     #############################
>>     #############################
>>     Debut filter-below-abund: Wed Mar 13 14:59:21 CET 2013
>>     starting threads
>>     starting writer
>>     loading...
>>     ... filtering 0
>>     /var/lib/torque/mom_priv/jobs/2695.rainman.SC
>>     <http://2695.rainman.SC>: line 49: 54757 Floating point
>>     exception(core dumped) ./khmer-BETA/sandbox/fi
>>     lter-below-abund.py /scratch/ag/khmer/174r1_table.kh
>>     <http://174r1_table.kh>
>>     /mnt/var/home/ag/174r1_prinseq_good_bFr8.fasta.keep
>>
>>     real    3m54.873s
>>     user    0m0.085s
>>     sys     2m2.180s
>>     Date fin: Wed Mar 13 15:03:15 CET 2013
>>     Job finished
>>
>>     Thanks again for your help :)
>>
>>     Alexis
>>
>>>
>>>     Thanks,
>>>       Eric
>>>
>>>
>>>
>>>     On Wed, Mar 13, 2013 at 5:38 AM, Alexis Groppi
>>>     <alexis.groppi at u-bordeaux2.fr
>>>     <mailto:alexis.groppi at u-bordeaux2.fr>> wrote:
>>>
>>>         Hi Eric,
>>>
>>>         Thanks for your answer.
>>>         But unfortunately, after many attempts I'm getting this error :
>>>
>>>         starting threads
>>>         starting writer
>>>         loading...
>>>         ... filtering 0
>>>         /var/lib/torque/mom_priv/jobs/2693.rainman.SC
>>>         <http://2693.rainman.SC>: line 46: 63657 Floating point
>>>         exception(core dumped)
>>>         ./khmer-BETA/sandbox/filter-below-abund.py
>>>         /scratch/ag/khmer/174r1_table.kh <http://174r1_table.kh>
>>>         /mnt/var/home/ag/174r1_prinseq_good_bFr8.fasta.keep
>>>
>>>         real    3m30.163s
>>>         user    0m0.088s
>>>
>>>         Your opinion ?
>>>
>>>         Thanks
>>>
>>>         Alexis
>>>
>>>
>>>         Le 13/03/2013 00:55, Eric McDonald a écrit :
>>>>         Hi Alexis,
>>>>
>>>>         One way to get the 'bleeding-edge' branch is to clone it
>>>>         into a fresh directory; for example:
>>>>            git clone http://github.com/ged-lab/khmer.git -b
>>>>         bleeding-edge khmer-BETA
>>>>
>>>>         Assuming you already have a clone of the 'ged-lab/khmer'
>>>>         repo, then you should also be able to do:
>>>>           git fetch origin
>>>>           git checkout bleeding-edge
>>>>         Depending on how old your Git client is and what its
>>>>         defaults are, you may have to do the following instead:
>>>>           git checkout --track -b bleeding-edge origin/bleeding-edge
>>>>
>>>>         Hope this helps,
>>>>           Eric
>>>>
>>>>
>>>>         On Tue, Mar 12, 2013 at 11:32 AM, Alexis Groppi
>>>>         <alexis.groppi at u-bordeaux2.fr
>>>>         <mailto:alexis.groppi at u-bordeaux2.fr>> wrote:
>>>>
>>>>
>>>>             Le 12/03/2013 16:16, C. Titus Brown a écrit :
>>>>>             On Tue, Mar 12, 2013 at 04:15:05PM +0100, Alexis Groppi wrote:
>>>>>>             Hi Titus,
>>>>>>
>>>>>>             Thanks for your answer
>>>>>>             Actually it's my second attempt with filter-below-abund.
>>>>>>             The first time, I thought the problem was coming from the location of my
>>>>>>             table.kh  <http://table.kh>  file : in a storage element with poor level performance of I/O
>>>>>>             I killed the job after 24h, moved the file in a best place and re run it
>>>>>>             But with the same result : no completion after 24h
>>>>>>
>>>>>>             Any Idea ?
>>>>>>
>>>>>>             Thanks
>>>>>>
>>>>>>             Cheers From Bordeaux :)
>>>>>>
>>>>>>             Alexis
>>>>>>
>>>>>>             PS : The command line was the following :
>>>>>>
>>>>>>             ./filter-below-abund.py174r1_table.kh  <http://174r1_table.kh>  174r1_prinseq_good_bFr8.fasta.keep
>>>>>>
>>>>>>             Is this correct ?
>>>>>             Yes, looks right... Can you try with the bleeding-edge branch, which now
>>>>>             incorporates a potential fix for this issue?
>>>>             From here :
>>>>             https://github.com/ged-lab/khmer/tree/bleeding-edge ?
>>>>             or
>>>>             here : https://github.com/ctb/khmer/tree/bleeding-edge ?
>>>>
>>>>             Do I have to make a fresh install ? and How  ?
>>>>             Or just replace all the files and folders ?
>>>>
>>>>             Thanks :)
>>>>
>>>>             Alexis
>>>>
>>>>
>>>>>             thanks,
>>>>>             --titus
>>>>>
>>>>>>             Le 12/03/2013 14:41, C. Titus Brown a ?crit :
>>>>>>>             On Tue, Mar 12, 2013 at 10:48:03AM +0100, Alexis Groppi wrote:
>>>>>>>>             Metagenome assembly :
>>>>>>>>             My data :
>>>>>>>>             - original (quality filtered) data : 4463243 reads (75 nt) (Illumina)
>>>>>>>>             1/ Single pass digital normalization with normalize-by-median (C=20)
>>>>>>>>             ==> file .keep of 2560557 reads
>>>>>>>>             2/ generated a hash table by load-into-counting on the .keep file
>>>>>>>>             ==> file .kh of ~16Go (huge file ?!)
>>>>>>>>             3/ filter-below-abund with C=100 from the two previous file (table.kh  <http://table.kh>
>>>>>>>>             and reads.keep)
>>>>>>>>             Still running after 24 hours  :(
>>>>>>>>
>>>>>>>>             Any advice to speed up this step ? ... and the others (partitionning ...) ?
>>>>>>>>
>>>>>>>>             I can have an access to a HPC : ~3000 cores.
>>>>>>>             Hi Alexis,
>>>>>>>
>>>>>>>             filter-below-abund and filter-abund have occasional bugs that prevent them
>>>>>>>             from completing.  I would kill and restart.  For that few reads it should
>>>>>>>             take no more than a few hours to do everything.
>>>>>>>
>>>>>>>             Most of what khmer does cannot easily be distributed across multiple chassis,
>>>>>>>             note.
>>>>>>>
>>>>>>>             best,
>>>>>>>             --titus
>>>>>>             -- 
>>>>
>>>>             -- 
>>>>
>>>>             _______________________________________________
>>>>             khmer mailing list
>>>>             khmer at lists.idyll.org <mailto:khmer at lists.idyll.org>
>>>>             http://lists.idyll.org/listinfo/khmer
>>>>
>>>>
>>>>
>>>>
>>>>         -- 
>>>>         Eric McDonald
>>>>         HPC/Cloud Software Engineer
>>>>           for the Institute for Cyber-Enabled Research (iCER)
>>>>           and the Laboratory for Genomics, Evolution, and
>>>>         Development (GED)
>>>>         Michigan State University
>>>>         P: 517-355-8733 <tel:517-355-8733>
>>>
>>>         -- 
>>>
>>>
>>>
>>>
>>>     -- 
>>>     Eric McDonald
>>>     HPC/Cloud Software Engineer
>>>       for the Institute for Cyber-Enabled Research (iCER)
>>>       and the Laboratory for Genomics, Evolution, and Development (GED)
>>>     Michigan State University
>>>     P: 517-355-8733 <tel:517-355-8733>
>>
>>     -- 
>
>     -- 
>
>
>
>
> -- 
> Eric McDonald
> HPC/Cloud Software Engineer
>   for the Institute for Cyber-Enabled Research (iCER)
>   and the Laboratory for Genomics, Evolution, and Development (GED)
> Michigan State University
> P: 517-355-8733

-- 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20130313/d33b83a2/attachment-0002.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 29033 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20130313/d33b83a2/attachment-0010.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 29033 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20130313/d33b83a2/attachment-0011.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 29033 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20130313/d33b83a2/attachment-0012.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 29033 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20130313/d33b83a2/attachment-0013.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Signature_Mail_A_Groppi.png
Type: image/png
Size: 29033 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20130313/d33b83a2/attachment-0014.png>
-------------- next part --------------
#!/bin/sh

#############################
# les directives PBS vont ici:

# Your job name (displayed by the queue)
#PBS -N step_1_and_2_Metagenome_assembly

# Specify the working directory
#PBS -d /scratch/ag/khmer

# walltime (hh:mm::ss)
#PBS -l walltime=48:00:00

# Specify the number of nodes(nodes=) and the number of cores per nodes(ppn=) to be used
#PBS -l nodes=1:ppn=1

# fin des directives PBS
#############################

# useful informations to print
echo "#############################" 
echo "User:" $USER
echo "Date:" `date`
echo "Host:" `hostname`
echo "Directory:" `pwd`
echo "PBS_JOBID:" $PBS_JOBID
echo "PBS_O_WORKDIR:" $PBS_O_WORKDIR
echo "PBS_NODEFILE: " `cat $PBS_NODEFILE | uniq`
echo "#############################" 

#############################

# What you actually want to launch
. /mnt/var/home/ag/env/bin/activate
echo "#############################" 
echo "Debut normalize-by-median:" `date`
time ./khmer/scripts/normalize-by-median.py -N 4 -x 64e9 -k 20 /mnt/var/home/ag/174r1_prinseq_good_bFr8.fasta

echo "#############################" 
echo "Debut load-into-counting:" `date`
time ./khmer/scripts/load-into-counting.py -k 20 -x 32e9 /scratch/ag/khmer/174r1_table.kh 174r1_prinseq_good_bFr8.fasta.keep

echo "#############################" 
echo "Debut filter-below-abund:" `date`
time /mnt/var/home/ag/khmer-BETA/sandbox/filter-below-abund.py /scratch/ag/khmer/174r1_table.kh /scratch/ag/khmer/174r1_prinseq_good_bFr8.fasta.keep

echo "Date fin:" `date`

# all done
echo "Job finished" 
-------------- next part --------------
#! /usr/bin/env python

import sys
import screed.fasta
import os
import khmer
from khmer.thread_utils import ThreadedSequenceProcessor, verbose_fasta_iter

WORKER_THREADS = 8
GROUPSIZE = 100

CUTOFF = 100

###


def main():
    counting_ht = sys.argv[1]
    infiles = sys.argv[2:]

    print 'file with ht: %s' % counting_ht
    print '-- settings:'
    print 'N THREADS', WORKER_THREADS
    print '--'

    print 'making hashtable'
    ht = khmer.load_counting_hash(counting_ht)
    K = ht.ksize()

    for infile in infiles:
        print 'filtering', infile
        outfile = os.path.basename(infile) + '.below'

        outfp = open(outfile, 'w')

        def process_fn(record, ht=ht):
            name = record['name']
            seq = record['sequence']
            if 'N' in seq:
                return None, None

            trim_seq, trim_at = ht.trim_below_abundance(seq, CUTOFF)

            if trim_at >= K:
                return name, trim_seq

            return None, None

        tsp = ThreadedSequenceProcessor(process_fn, WORKER_THREADS, GROUPSIZE)

        tsp.start(verbose_fasta_iter(infile), outfp)

if __name__ == '__main__':
    main()


More information about the khmer mailing list