Hi all,

we have an RNA-seq data set (Illumina HiSeq 2000, 50 bp, single end reads)
from a human cell line where 8.2 million reads (~25% of the total) do not
map against the human reference genome. I figured out that almost all these
unmapped reads are of viral origin. Adding the identified viral genome
to the reference and inspecting the re-mapped reads in IGV shows that
this viral genome is almost completely covered, at several thousand-fold
depth (!). However, I can also see 2-3 gaps in coverage with no mapped reads.

To figure out exactly what the viral transcript looks like, I tried Minia
for de novo assembly but failed, i.e. no contigs from the viral genome were
assembled. Minia works in principle, because the same Minia output contains
an almost perfect assembly of the PhiX genome, which I know is also in the

After corresponding with Rayan Chikhi (author of Minia) we came to the
conclusion that the problem is that I have too HIGH sequence coverage for
the viral genome and that this confuses the de Bruijn graph assembler. The
reason is that with so many reads covering each position in the genome,
inevitably you get quite a few reads (> 20) with sequencing errors at every
position, which obviously wreaks havoc on the de Bruijn graph.
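
To make that reasoning concrete, here is a rough back-of-the-envelope sketch in Python. All numbers in it are assumptions picked for illustration (a depth of 5000x, a 0.5% per-base error rate, and a k-mer abundance cutoff of 3); none of them were measured on this data set or taken from Minia's settings.

```python
# Back-of-the-envelope numbers; every value below is an assumption for
# illustration, not a measurement from this data set.
COVERAGE = 5000      # assumed depth ("several thousand-fold")
ERROR_RATE = 0.005   # assumed per-base substitution error rate
MIN_ABUNDANCE = 3    # assumed k-mer abundance cutoff of the assembler


def reads_with_error_at_position(coverage, error_rate):
    """Expected number of reads carrying an error at one given position."""
    return coverage * error_rate


def copies_of_one_wrong_base(coverage, error_rate):
    """Expected number of reads carrying the SAME wrong base at one position,
    assuming the three possible substitutions are equally likely."""
    return coverage * error_rate / 3


if __name__ == "__main__":
    any_error = reads_with_error_at_position(COVERAGE, ERROR_RATE)
    same_error = copies_of_one_wrong_base(COVERAGE, ERROR_RATE)
    print(f"reads with an error per position: {any_error:.1f}")
    print(f"reads with the same wrong base:   {same_error:.1f}")
    # If this is True, erroneous k-mers look as "real" as genuine ones
    # to a simple abundance filter.
    print("erroneous k-mers pass the cutoff:", same_error >= MIN_ABUNDANCE)
```

Under these assumed values, the same wrong base shows up often enough to pass a low abundance cutoff, so the erroneous k-mers are no longer filtered out and end up as extra branches in the graph.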

He suggested using digital normalization as a possible remedy, so I gave it
a try. I followed the single-pass mRNA-seq pipeline at
http://ged.msu.edu/angus/diginorm-2012/tutorial.html, so far without
success. I still cannot see any major contigs from my viral genome in the
Minia output. From the diginorm output, I see that about 80% of my 8.2
million unmapped reads were removed, which I guess is good.
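
For readers unfamiliar with the idea, this is a minimal sketch of how I understand digital normalization to work: a read is kept only if the median count of its k-mers, among the reads kept so far, is still below a cutoff. It is a simplification and not the code used in the tutorial. It uses an exact in-memory counter instead of a probabilistic counting structure, and the function names and the default values for `k` and `cutoff` are my own assumptions.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=20, cutoff=20):
    """Keep a read only while its estimated coverage is below `cutoff`.

    Coverage of a read is estimated as the median count of its k-mers
    among the reads kept so far. `k` and `cutoff` are illustrative
    defaults, not recommended settings.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to estimate from
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


if __name__ == "__main__":
    # Toy input: one sequence repeated many times plus one rare read.
    reads = ["ACGTTGCA"] * 100 + ["GGGGCCCC"]
    kept = normalize(reads, k=4, cutoff=5)
    print(f"kept {len(kept)} of {len(reads)} reads")
```

In this toy input the heavily repeated read is cut down to a handful of copies while the rare read is retained, which is the same effect as the large fraction of reads removed from my data.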

Any suggestions on how to proceed from here? Given my particular problem, is
there anything to tweak with diginorm, or should I just try different
RNA-seq assemblers (Oases, Trinity)?

