[khmer] Assembling tricky data sets (was: slice-reads-by-coverage.py on PE data question)

Fields, Christopher J cjfields at illinois.edu
Fri Jun 12 11:08:32 PDT 2015


Titus, 

Thanks for the links!  I have used error trimming in the past to great effect, but haven’t yet tried error correction with khmer (currently using Lighter or Quake).  We also plan on assessing read coverage spectra on a few of our more recent assemblies, which have varying degrees of het.

The joys of working with non-model organisms :)

chris

> On Jun 7, 2015, at 6:32 AM, C. Titus Brown <ctbrown at ucdavis.edu> wrote:
> 
> Hi Chris,
> 
> sorry, let me explain further -- I meant you should try using the error
> correction or error trimming in khmer, specifically (trimming is probably about
> as good for this purpose).
> 
> Error correction:
> ivory.idyll.org/blog/2015-wok-error-correction.html
> 
> Error trimming:
> https://peerj.com/preprints/890/
> 
> The reason here is NOT because our error trimming or error correction is great
> (well, actually, the error trimming is pretty good :) but because of the
> variable coverage approach, which is (TTBMK) unique to khmer.  Essentially
> you will be able to error trim/correct polyploid data.  I would suggest
> a high k (k=32 or 31) but that's just a guess.
> 
> We have other evidence that diginorm does well in high het situations:
> http://www.nature.com/ng/journal/v47/n4/full/ng.3237.html
> 
> Happy to talk more about how to approach this problem.  My advice would
> be to take a look at read-coverage spectra,
> 
> http://khmer-recipes.readthedocs.org/en/latest/005-estimate-total-genome-size/index.html
> 
> in addition to the k-mer spectra, and see if you can figure out what's going
> on there...
> 
> HTH,
> --titus
> 
> On Fri, Jun 05, 2015 at 04:07:20PM +0000, Fields, Christopher J wrote:
>> We're currently going the EC route (in our spare time) but will definitely try diginorm.  Thx!
>> 
>> chris
>> 
>>> On Jun 5, 2015, at 10:32 AM, C. Titus Brown <ctbrown at ucdavis.edu> wrote:
>>> 
>>> My suggestion: try diginorm and/or error correction.
>>> 
>>> --titus
>>> 
>>> On Fri, Jun 05, 2015 at 03:30:37PM +0000, Fields, Christopher J wrote:
>>>> Cool, that does seem like a better path!  Thanks Titus.
>>>> 
>>>> We have been planning on trying something like this on a particularly thorny plant genome assembly that's been on our back-burner for a bit.  We've been nicknaming it "Nessie", primarily b/c the kmer distribution resembles fuzzy pics of the Loch Ness monster (has at least three significant peaks).  Probably a nasty combination of being highly heterozygous and having large-scale genome duplications; it's supposed to be diploid but then again we know how that sometimes turns out w/ plants.
>>>> 
>>>> chris
>>>> 
>>>>> On Jun 5, 2015, at 9:44 AM, C. Titus Brown <ctbrown at ucdavis.edu> wrote:
>>>>> 
>>>>> Diane, Chris --
>>>>> 
>>>>> Hmm, the following could work if you don't mind losing orphaned reads:
>>>>> 
>>>>> interleave-reads.py => interleaved reads.
>>>>> 
>>>>> slice-reads-by-coverage.py => "broken paired" reads, where pairs remain
>>>>>  next to each other but there are lots of orphans.
>>>>> 
>>>>> extract-paired-reads.py => separate into still-paired (.pe) and orphaned (.se)
>>>>>  reads.
>>>>> 
>>>>> If you want to always retain the pair if either has the right coverage, that
>>>>> would require modifications to the script or a more complex workflow.  While
>>>>> modifying the script is probably a good idea, we may not have time to do so in
>>>>> the next week or three, though.
>>>>> 
>>>>> Diane, how about this - see if you can get the workflow above to work and
>>>>> give decent results (I would suggest plotting the coverage distribution of
>>>>> the .pe file as one way to evaluate), and if not, we can do the script
>>>>> modification for you.
>>>>> 
>>>>> --titus
>>>>> 
>>>>> On Fri, Jun 05, 2015 at 02:14:11PM +0000, Fields, Christopher J wrote:
>>>>>> I have used split-paired-reads.py for this purpose when normalizing PE reads; I assume it should work the same here.
>>>>>> 
>>>>>> chris
>>>>>> 
>>>>>> On Jun 5, 2015, at 8:38 AM, Diane Hatziioanou <dianehioanou at gmail.com> wrote:
>>>>>> 
>>>>>> Hello all again,
>>>>>> 
>>>>>> I have a question.
>>>>>> I want to use slice-reads-by-coverage.py, but I've got PE data which I would like to keep as PE data. Can slice-reads-by-coverage.py deal with interleaved PE data and keep it PE, or can it manage it in another format? Or am I asking for too much, and would I have to use single ends and try pairing them back after it's done?
>>>>>> 
>>>>>> Thanks,
>>>>>> Diane
>>>>>> 
>>>>>> --
>>>>>> Dr Diane Hatziioanou
>>>>>> Greek Mobile: (+30)6909403373
>>>>>> UK Mobile: (+44)7779516625
>>>>>> www.linkedin.com/in/dhatziioanou/
>>>>>> https://twitter.com/DianeHIoanou
>>>>>> 
>>>>>> _______________________________________________
>>>>>> khmer mailing list
>>>>>> khmer at lists.idyll.org
>>>>>> http://lists.idyll.org/listinfo/khmer
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> C. Titus Brown, ctbrown at ucdavis.edu
>>>> 
>>> 
>>> -- 
>>> C. Titus Brown, ctbrown at ucdavis.edu
>> 
> 
> -- 
> C. Titus Brown, ctbrown at ucdavis.edu
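
A note on the workflow quoted above (interleave-reads.py, then slice-reads-by-coverage.py, then extract-paired-reads.py): the Python below is a minimal sketch of the logic only, and is not khmer's implementation. It assumes reads are (name, sequence) tuples in interleaved order and that mates are named with /1 and /2 suffixes; the function names and the per-read coverage dictionary are hypothetical, introduced purely for illustration. The first function separates "broken paired" reads into still-paired and orphaned reads; the second shows the alternative mentioned in the thread, where a pair is retained if either mate has the right coverage.

```python
def split_broken_pairs(records):
    """Split interleaved reads with missing mates into pairs and orphans.

    records: list of (name, sequence) tuples; an intact pair is a read named
    'X/1' immediately followed by a read named 'X/2' (assumed convention).
    """
    paired, orphans = [], []
    i = 0
    while i < len(records):
        name = records[i][0]
        has_next = i + 1 < len(records)
        if has_next and name.endswith("/1") and records[i + 1][0] == name[:-2] + "/2":
            # Both mates survived filtering: keep them together.
            paired.append((records[i], records[i + 1]))
            i += 2
        else:
            # The mate is missing, so this read is an orphan.
            orphans.append(records[i])
            i += 1
    return paired, orphans


def keep_pair_if_either(pairs, coverage, low, high):
    """Keep both mates of a pair when at least one mate has coverage in [low, high].

    coverage: hypothetical dict mapping read name -> estimated coverage.
    """
    def in_range(record):
        return low <= coverage[record[0]] <= high

    return [pair for pair in pairs if in_range(pair[0]) or in_range(pair[1])]


if __name__ == "__main__":
    reads = [("r1/1", "A"), ("r1/2", "C"), ("r2/1", "G"),
             ("r3/2", "T"), ("r4/1", "A"), ("r4/2", "C")]
    pe, se = split_broken_pairs(reads)
    print(len(pe), "pairs;", len(se), "orphans")
```

In practice the khmer scripts themselves should be used on real FASTA/FASTQ files; the sketch is only meant to make clear why slicing per read produces orphans and why a pair-aware filter would have to be applied before the pairs are broken.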


