[khmer] Fwd: Digital normalization and assembly of millions of unmapped RNA-seq reads

Mon Nov 3 04:15:43 PST 2014

Hi Titus,

sorry for the late reply, but for some reason your e-mail never made it
into my inbox and I had to find it via google.

Ok, so attached is the plot you asked for and here is how it was generated:

load-into-counting.py -x 1e8 -k 20 18371_unmapped.kh 18371_unmapped.fastq
abundance-dist.py 18371_unmapped.kh 18371_unmapped.fastq 18371_unmapped.dist
plot-abundance-dist.py 18371_unmapped.dist 18371_unmapped.reads-dist.png
--ymax=300

Cheers,
Christian

> Hi Christian,
> if 20% of your reads remain, then you should have some reasonable amount
> of contigs in there, waiting to be assembled :). Either that, or a high
> error rate that’s making reads look different when they really aren’t.
> There are a few reasons why you might not be assembling anything, mostly
> related to things like repeats or polymorphism. The main question I have
> is: what is the coverage of your unassembled regions? If it’s low, then you
> have low lying contamination and you shouldn’t expect anything (unlikely in
> this case). If it’s high, then something else is going on and banging on it
> with a repeat aware assembler might be useful.
> To estimate the coverage without using mapping, try out the k-mer
> abundance and calc-median-distribution stuff, here:
>
> http://khmer-recipes.readthedocs.org/en/latest/001-extract-reads-by-coverage/index.html
> and ship the plots (or raw data files) to the list — let’s see what they
> look like!
> cheers,
> —titus

---------- Forwarded message ----------
From: Christian Frech <frech.christian at gmail.com>
Date: Tue, Oct 14, 2014 at 10:46 PM
Subject: Digital normalization and assembly of millions of unmapped RNA-seq
reads
To: khmer at lists.idyll.org

Hi all,

we have an RNA-seq data set (Illumina HiSeq 2000, 50 bp, single end reads)
from a human cell line where 8.2 million reads (~25% of the total) do not
map against the human reference genome. I figured out that almost all these
unmapped reads are of viral origin. Including the identified viral genome
into the reference and inspecting the re-mapped reads in IGV shows that
this viral genome is almost completely covered by several thousands-fold
(!). However, I can also see 2-3 gaps in coverage with no mapped reads.

To figure out exactly how the viral transcript looks like, I tried Minia
for de novo assembly but failed, i.e. no contigs from the viral genome
assembled. Minia works in principle, because the same Minia output contains
an almost perfect assembly of the PhiX genome, which I know is also in the
data.

After corresponding with Rayan Chikhi (author of Minia) we came to the
conclusion that the problem is that I have too HIGH sequence coverage for
the viral genome and that this confuses the de bruin graph assembler. The
reason is that with so many reads covering each position in the genome,
inevitably you get quite a few reads (> 20) with sequencing errors at every
position, which obviously wreaks havoc on the de bruin graph.

He suggested to use digital normalization as possible remedy, so I gave it
a try. I followed the single-pass mRNA-seq pipeline at
http://ged.msu.edu/angus/diginorm-2012/tutorial.html, with currently no
success. I can still not see any major contigs from my viral genome in the
Minia output. From the diginorm output, I see that about 80% of my 8.2
million unmapped reads were removed, which I guess is good.

Any suggestions how to proceed from here? Given my particular problem, is
there anything to tweak with diginorm or should I just try different
RNA-seq assemblers (Oasis, Trinity)?

Christian
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20141103/3bffbc6c/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 18371_unmapped.reads-dist.png
Type: image/png
Size: 31393 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20141103/3bffbc6c/attachment-0001.png>