[khmer] RE: Lump release - find-knots step

C. Titus Brown ctb at msu.edu
Thu Dec 5 22:08:53 PST 2013


Hi Adi,

this is now officially out of my experience range :(.  Lump removal again
would seem to be the best option.

I would strongly suggest using the filter-below-abund script as in

  https://khmer-protocols.readthedocs.org/en/latest/metagenomics/3-partition.html

as it will be quick and easy compared to lump removal.  We are working on
revamping partitioning to be much, much faster and nicer but we don't have
beta-level code working for that.

best,
--titus

On Thu, Dec 05, 2013 at 08:56:27AM +0000, Adi Faigenboim wrote:
> Hello,
> 
> We merged all the 26 stoptags files together in the lump release analysis using:
> 
>    import khmer
> 
>    ht = khmer.new_hashbits(32, 1, 1)
>    ht.load_stop_tags('lump_1.stoptags', 0)
>    ht.load_stop_tags('lump_2.stoptags', 0)
>    ht.load_stop_tags('lump_3.stoptags', 0)
>    ...
>    ht.load_stop_tags('lump_26.stoptags', 0)
>    ht.save_stop_tags('merge.stoptags')
> 
> The file merge.stoptags was the input for the filter-stoptags.py script.
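
The 26 explicit load calls quoted above can also be written as a loop. This is only a minimal sketch: `merge_stoptags` is a hypothetical helper, the `lump_N.stoptags` names and the second argument `0` are copied from the message above, and the khmer calls appear only in a comment because they assume the 2013-era API used in this thread.

```python
def merge_stoptags(ht, stoptag_files, out_file):
    """Load every stoptags file into ht, then save the result to out_file.

    ht is assumed to be a khmer hashbits object exposing
    load_stop_tags(filename, 0) and save_stop_tags(filename),
    called the same way as in the quoted message.
    """
    for filename in stoptag_files:
        # The second argument 0 is copied from the thread; it is assumed
        # to keep the stop tags already loaded from earlier files.
        ht.load_stop_tags(filename, 0)
    ht.save_stop_tags(out_file)
    return out_file


# File names follow the lump_1 ... lump_26 pattern from the thread.
stoptag_files = ['lump_%d.stoptags' % i for i in range(1, 27)]

# Usage with khmer installed (assumption, mirrors the quoted calls):
#   import khmer
#   ht = khmer.new_hashbits(32, 1, 1)
#   merge_stoptags(ht, stoptag_files, 'merge.stoptags')
```

Passing the table in as `ht` keeps the helper independent of how the hashbits object is created, so the table parameters stay under the caller's control.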
> 
> Our initial lump file was 125 GB. After the lump release analysis completed, we got 88 groups plus one very large group of 100 GB. Is this another lump? Should we run the lump release analysis again?
> Our first step in the khmer analysis was normalization, so we assumed that a group this large, compared to the others (files of ~130 MB), should not be produced. Can it be? In other words, how do we distinguish a lump from a group that can be assembled?
> Thank you,
> Adi
> 
> -----Original Message-----
> From: C. Titus Brown [mailto:ctb at msu.edu] 
> Sent: Wednesday, November 20, 2013 5:32 PM
> To: Adi Faigenboim
> Cc: titus at idyll.org
> Subject: Re: RE: [khmer] Lump release - find-knots step
> 
> Should be no problem to merge them all together.
> 
> Could you send Qs to the khmer at lists.idyll.org mailing list in the future, please? thanks!
> 
> --titus
> 
> On Wed, Nov 20, 2013 at 01:45:10PM +0000, Adi Faigenboim wrote:
> > Hi Titus,
> > 
> > Thank you for your advice. We divided the remaining ~6300 pmaps into groups of 250 pmaps each and ran the find-knots.py script on different computers, and in parallel on the same computer. With this approach, the find-knots stage completed in a week instead of months! Thank you very much for your information.
> > We think that writing such a big stoptags file for all 8300 pmaps also contributed tremendously to the running time of the script. 
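
The split described above (groups of 250 pmap files, each handled by its own find-knots.py run) can be sketched as follows. `split_into_groups` and the placeholder file names are hypothetical; how each group is handed to find-knots.py is not shown in this thread and is left out here.

```python
def split_into_groups(pmap_files, group_size=250):
    """Return consecutive groups of at most group_size pmap files."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [pmap_files[i:i + group_size]
            for i in range(0, len(pmap_files), group_size)]


# Placeholder names standing in for the ~6300 remaining pmap files.
pmap_files = ['subset_%d.pmap' % i for i in range(6300)]
groups = split_into_groups(pmap_files)

# Each group can go to a separate machine or process; every run then
# writes its own stoptags file, to be merged afterwards.
print(len(groups), len(groups[0]), len(groups[-1]))
```

With exactly 6300 files this gives 26 groups (25 of 250 and one of 50), which is consistent with the 26 stoptags files merged later in the thread.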
> > 
> > Our next step is to merge all the stoptags files together. Would there be a problem merging all the files in one step, or should we merge some groups first and then merge those into one big file?
> > 
> > 
> > Best regards,
> > Adi
> > 
> > -----Original Message-----
> > From: C. Titus Brown [mailto:ctb at msu.edu]
> > Sent: Friday, November 08, 2013 1:49 PM
> > To: Adi Faigenboim
> > Cc: titus at idyll.org
> > Subject: Re: RE: [khmer] Lump release - find-knots step
> > 
> > You can choose the pmap files randomly (or however) to split across machines.
> > There is a way to merge the pmap using the Python API -- will send to 
> > you in the next few days, but it *should* be as simple as
> > 
> > 	ht = khmer.new_hashbits(K, 1, 1)
> > 	ht.load_stop_tags(filename1, 0)
> > 	ht.load_stop_tags(filename2, 0)
> > 	...
> > 	ht.save_stop_tags(newfilename)
> 
> > 
> > Running the loop on the pmap files however you want should be fine.
> > 
> > best,
> > --titus
> > 
> > On Thu, Nov 07, 2013 at 11:28:44AM +0000, Adi Faigenboim wrote:
> > > Hi Titus,
> > > Thank you again for your response and your sincere willingness to assist us. The assembly took us a few months before we were able to start the lump release pipeline, so we'd rather try to run find-knots.py in parallel. We would appreciate it if you could advise us on how to do that. If we have ~8000 pmap files overall and the script processes each pmap file individually, can we run some of the pmap files on one machine and the rest on a different machine (so that each machine creates its own stoptags file)? 
> > > If we understand correctly, the stoptags file is rewritten after each pmap partition rather than appended to. 
> > > Thus, following this stage (find-knots.py), can we run the filter-stoptags.py script on all the stoptags files created? Or should we merge the stoptags files, and if so, how? 
> > > Is there a preferable way to divide the pmap files, or could we choose the files randomly? Should we run the find-knots script on all the pmap files and change the loop itself (for n, subset_file in enumerate(pmap_files)) to go over only part of the files, or should we divide the files among a few machines and run the loop on all of the pmap files given to each machine?
> > > 
> > > Thank you,
> > > 
> > > Adi
> > > 
> > > -----Original Message-----
> > > From: C. Titus Brown [mailto:ctb at msu.edu]
> > > Sent: Thursday, October 31, 2013 3:48 AM
> > > To: Adi Faigenboim
> > > Cc: khmer at lists.idyll.org
> > > Subject: Re: RE: [khmer] Lump release - find-knots step
> > > 
> > > On Tue, Oct 29, 2013 at 05:16:23PM +0000, Adi Faigenboim wrote:
> > > > Hi Titus,
> > > > Thank you for your response. From what I understand, I can use a 
> > > > coverage other than 20, for example c=5, which will create a smaller lump. Is there anything I can do with the current lump (8352 pmaps) to speed up the process? Looking at find-knots.py, I wonder whether I can run pmaps 1 to 4000 on one computer and the rest on another computer.
> > > 
> > > Hi Adi,
> > > 
> > > Going to a lower coverage in one pass will fragment your assembly; see the diginorm paper for a discussion of this.  You would want to use the three-pass diginorm in the kalamazoo protocol, below.
> > > I'm not sure what effect it will have on the lump tho.
> > > 
> > > You can definitely run pmaps on different computers.  However,  I would suggest switching to the filter-below-abund approach first...
> > > 
> > > cheers,
> > > --titus
> > > 
> > > > _____________________________________
> > > > From: C. Titus Brown [ctb at msu.edu]
> > > > Sent: Sunday, October 27, 2013 21:03
> > > > To: Adi Faigenboim
> > > > Cc: khmer at lists.idyll.org
> > > > Subject: Re: [khmer] Lump release - find-knots step
> > > > 
> > > > On Sun, Oct 27, 2013 at 12:26:52PM +0000, Adi Faigenboim wrote:
> > > > > I have a metagenome of about 2.5G reads. I used the khmer pipeline with diginorm c=20, filtering and partitioning. After the partitioning step I got 345 groups and a very big knot (123 GB). Running the knot release pipeline on it produced 8352 pmaps. I'm currently in find-knots.py using -x 70e9 -N 4. After a month in this step, only 1450 pmaps have been processed... is it possible that this stage would take so long?
> > > > > Can I split this stage across different computers (run the loop over pmap_files in parallel)?
> > > > > Can you please shed some light on what could be causing this, and should I maybe do the partitioning in a different way?
> > > > > I tried lowering the coverage to c=10 in the diginorm step but got 20% less data, which I think is rather a lot.
> > > > 
> > > > Hi Adi,
> > > > 
> > > > we've got a faster approach in the works -- see 
> > > > 'filter-below-abund', as used in the partitioning section of this protocol:
> > > > 
> > > > https://khmer-protocols.readthedocs.org/en/latest/metagenomics/index.html
> > > > 
> > > > And yes, the problem is that lump removal in the find-knots script 
> > > > is dependent on exhaustively traversing all of the repetitive sequence.
> > > > It works really poorly on high-coverage data sets :(.
> > > > 
> > > > best,
> > > > --titus
> > > > --
> > > > C. Titus Brown, ctb at msu.edu
> > > > 
> > > > This mail was received via Mail-SeCure System.
> > > > 
> > > > 
> > > > 
> > > > This mail was sent via Mail-SeCure System.
> > > > 
> > > > 
> > > 
> > > --
> > > C. Titus Brown, ctb at msu.edu
> > > 
> > 
> > --
> > C. Titus Brown, ctb at msu.edu
> > 
> 
> --
> C. Titus Brown, ctb at msu.edu
> 
> 
> _______________________________________________
> khmer mailing list
> khmer at lists.idyll.org
> http://lists.idyll.org/listinfo/khmer

-- 
C. Titus Brown, ctb at msu.edu



