[khmer] Assembling tricky data sets (was: slice-reads-by-coverage.py on PE data question)

Sun Jun 7 04:32:42 PDT 2015

Hi Chris,

Sorry, let me explain further -- I meant that you should specifically try
the error correction or error trimming in khmer (trimming is probably about
as good as correction for this purpose).

Error correction:
http://ivory.idyll.org/blog/2015-wok-error-correction.html

Error trimming:
https://peerj.com/preprints/890/

The reason is NOT that our error trimming or error correction is great
(well, actually, the error trimming is pretty good :) but that khmer uses a
variable coverage approach, which is (to the best of my knowledge) unique to
khmer.  Essentially, you will be able to error trim/correct polyploid data.
I would suggest a high k (k=32 or 31), but that's just a guess.
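
To make "variable coverage" concrete, here is a toy sketch in plain Python.
It is NOT khmer's implementation; the function names, the exact in-memory
counting, and the parameters (cutoff, trim_at_coverage) are assumptions made
for illustration only.  The idea: only trim a read at its first low-abundance
k-mer if the read as a whole looks high coverage (by median k-mer count), and
leave low-coverage reads alone, since their rare k-mers may be real.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Exact k-mer counts (khmer itself uses probabilistic counting)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def trim_variable_coverage(read, counts, k, cutoff, trim_at_coverage):
    """Trim read at its first low-abundance k-mer, but only if the read's
    median k-mer abundance is at least trim_at_coverage."""
    abunds = [counts[km] for km in kmers(read, k)]
    if not abunds or median(abunds) < trim_at_coverage:
        return read  # too short or low coverage: leave untouched
    for i, abund in enumerate(abunds):
        if abund < cutoff:
            # Keep only the bases covered by the k-mers before position i.
            return read[:i + k - 1]
    return read


good = "ACGTTGCAAGGCTTAC"
bad = "ACGTTGCAAGGCTTAG"   # same read with an error in the last base
rare = "TTTTGGGGCCCCAAAA"  # genuinely low-coverage sequence
reads = [good] * 10 + [bad, rare]
counts = count_kmers(reads, 5)

for r in (good, bad, rare):
    print(r, "->", trim_variable_coverage(r, counts, 5, 2, 5))
```

The toy k=5 is only there to keep the example readable; for real data you
would use a much higher k, as suggested above.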

We have other evidence that diginorm does well in high-heterozygosity
situations:
http://www.nature.com/ng/journal/v47/n4/full/ng.3237.html

Happy to talk more about how to approach this problem.  My advice would
be to take a look at read-coverage spectra,

http://khmer-recipes.readthedocs.org/en/latest/005-estimate-total-genome-size/index.html

in addition to the k-mer spectra, and see if you can figure out what's going
on there...
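
As a rough illustration of the difference between the two kinds of spectra,
here is a small plain-Python sketch.  It is not the code from the recipe
linked above; the function names are made up, and it assumes that a read's
coverage can be estimated by the median abundance of its k-mers.  The k-mer
spectrum counts distinct k-mers at each abundance, while the read-coverage
spectrum counts reads at each estimated coverage.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_spectrum(reads, k):
    """Return (k-mer counts, histogram of abundance -> # distinct k-mers)."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return counts, Counter(counts.values())


def read_coverage_spectrum(reads, counts, k):
    """Histogram of estimated coverage -> # reads, where a read's coverage
    is estimated as the median abundance of its k-mers."""
    spectrum = Counter()
    for read in reads:
        abunds = [counts[km] for km in kmers(read, k)]
        if abunds:  # skip reads shorter than k
            spectrum[int(median(abunds))] += 1
    return spectrum


reads = ["ACGTTGCAAG"] * 4 + ["TTTTGGGGCC"]
counts, kspec = kmer_spectrum(reads, 5)
rspec = read_coverage_spectrum(reads, counts, 5)
print("k-mer spectrum:       ", sorted(kspec.items()))
print("read-coverage spectrum:", sorted(rspec.items()))
```

With several peaks in the k-mer spectrum, comparing where the reads
themselves pile up in the read-coverage spectrum may help separate
heterozygosity from duplication.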

HTH,
--titus

On Fri, Jun 05, 2015 at 04:07:20PM +0000, Fields, Christopher J wrote:
> We're currently going the EC route (in our spare time) but will definitely try diginorm.  Thx!
> 
> chris
> 
> > On Jun 5, 2015, at 10:32 AM, C. Titus Brown <ctbrown at ucdavis.edu> wrote:
> > 
> > My suggestion: try diginorm and/or error correction.
> > 
> > --titus
> > 
> > On Fri, Jun 05, 2015 at 03:30:37PM +0000, Fields, Christopher J wrote:
> >> Cool, that does seem like a better path!  Thanks Titus.
> >> 
> >> We have been planning on trying something like this on a particularly thorny plant genome assembly that's been on our back-burner for a bit.  We've been nicknaming it "Nessie", primarily b/c the kmer distribution resembles fuzzy pics of the Loch Ness monster (has at least three significant peaks).  Probably a nasty combination of being highly heterozygous and having large-scale genome duplications; it's supposed to be diploid but then again we know how that sometimes turns out w/ plants.
> >> 
> >> chris
> >> 
> >>> On Jun 5, 2015, at 9:44 AM, C. Titus Brown <ctbrown at ucdavis.edu> wrote:
> >>> 
> >>> Diane, Chris --
> >>> 
> >>> Hmm, the following could work if you don't mind losing orphaned reads:
> >>> 
> >>> interleave-reads.py => interleaved reads.
> >>> 
> >>> slice-reads-by-coverage.py => "broken paired" reads, where pairs remain
> >>>   next to each other but there are lots of orphans.
> >>> 
> >>> extract-paired-reads.py => separate into still-paired (.pe) and orphaned (.se)
> >>>   reads.
> >>> 
> >>> If you want to always retain the pair if either has the right coverage, that
> >>> would require modifications to the script or a more complex workflow.  While
> >>> modifying the script is probably a good idea, we may not have time to do so in
> >>> the next week or three, though.
> >>> 
> >>> Diane, how about this - see if you can get the workflow above to work and
> >>> give decent results (I would suggest plotting the coverage distribution of
> >>> the .pe file as one way to evaluate), and if not, we can do the script
> >>> modification for you.
> >>> 
> >>> --titus
> >>> 
> >>> On Fri, Jun 05, 2015 at 02:14:11PM +0000, Fields, Christopher J wrote:
> >>>> I have used split-paired-reads.py for this purpose when normalizing PE reads; I assume it should work the same here.
> >>>> 
> >>>> chris
> >>>> 
> >>>> On Jun 5, 2015, at 8:38 AM, Diane Hatziioanou <dianehioanou at gmail.com> wrote:
> >>>> 
> >>>> Hello all again,
> >>>> 
> >>>> I have a question.
> >>>> I want to use slice-reads-by-coverage.py, but I've got PE data which I would like to keep as PE data. Can slice-reads-by-coverage.py deal with interleaved PE data and keep it PE, can it manage it in another format, or am I asking for too much, so that I would have to use single ends and try pairing them back after it's done?
> >>>> 
> >>>> Thanks,
> >>>> Diane
> >>>> 
> >>>> --
> >>>> Dr Diane Hatziioanou
> >>>> Greek Mobile: (+30)6909403373
> >>>> UK Mobile: (+44)7779516625
> >>>> www.linkedin.com/in/dhatziioanou/
> >>>> https://twitter.com/DianeHIoanou
> >>>> 
> >>>> _______________________________________________
> >>>> khmer mailing list
> >>>> khmer at lists.idyll.org
> >>>> http://lists.idyll.org/listinfo/khmer
> >>>> 
> >>> 
> >>> 
> >>> 
> >>> -- 
> >>> C. Titus Brown, ctbrown at ucdavis.edu
> >> 
> > 
> > -- 
> > C. Titus Brown, ctbrown at ucdavis.edu
> 

-- 
C. Titus Brown, ctbrown at ucdavis.edu