[khmer] Digital normalization and assembly of millions of unmapped RNA-seq reads

Tue Oct 14 14:53:26 PDT 2014

On Oct 15, 2014, at 7:46 AM, Christian Frech <frech.christian at gmail.com> wrote:

> Hi all,
> 
> we have an RNA-seq data set (Illumina HiSeq 2000, 50 bp, single end reads) from a human cell line where 8.2 million reads (~25% of the total) do not map against the human reference genome. I figured out that almost all these unmapped reads are of viral origin. Including the identified viral genome into the reference and inspecting the re-mapped reads in IGV shows that this viral genome is almost completely covered by several thousands-fold (!). However, I can also see 2-3 gaps in coverage with no mapped reads.
> 
> To figure out exactly how the viral transcript looks like, I tried Minia for de novo assembly but failed, i.e. no contigs from the viral genome assembled. Minia works in principle, because the same Minia output contains an almost perfect assembly of the PhiX genome, which I know is also in the data. 
> 
> After corresponding with Rayan Chikhi (author of Minia) we came to the conclusion that the problem is that I have too HIGH sequence coverage for the viral genome and that this confuses the de bruin graph assembler. The reason is that with so many reads covering each position in the genome, inevitably you get quite a few reads (> 20) with sequencing errors at every position, which obviously wreaks havoc on the de bruin graph.
> 
> He suggested to use digital normalization as possible remedy, so I gave it a try. I followed the single-pass mRNA-seq pipeline at http://ged.msu.edu/angus/diginorm-2012/tutorial.html, with currently no success. I can still not see any major contigs from my viral genome in the Minia output. From the diginorm output, I see that about 80% of my 8.2 million unmapped reads were removed, which I guess is good.
> 
> Any suggestions how to proceed from here? Given my particular problem, is there anything to tweak with diginorm or should I just try different RNA-seq assemblers (Oasis, Trinity)?

Hi Christian,

if 20% of your reads remain, then you should have some reasonable amount of contigs in there, waiting to be assembled :). Either that, or a high error rate that’s making reads look different when they really aren’t.

There are a few reasons why you might not be assembling anything, mostly related to things like repeats or polymorphism.  The main question I have is: what is the coverage of your unassembled regions?  If it’s low, then you have low lying contamination and you shouldn’t expect anything (unlikely in this case). If it’s high, then something else is going on and banging on it with a repeat aware assembler might be useful.

To estimate the coverage without using mapping, try out the k-mer abundance and calc-median-distribution stuff, here:

http://khmer-recipes.readthedocs.org/en/latest/001-extract-reads-by-coverage/index.html

and ship the plots (or raw data files) to the list — let’s see what they look like!

cheers,
—titus