<div dir="ltr"><div>Hi Titus,</div><div><br></div><div>Sorry for the late reply, but for some reason your e-mail never made it into my inbox and I had to find it via Google.</div><div><br></div><div>OK, so attached is the plot you asked for, and here is how it was generated:</div><div><br></div><div><div>load-into-counting.py -x 1e8 -k 20 18371_unmapped.kh 18371_unmapped.fastq </div><div>abundance-dist.py 18371_unmapped.kh 18371_unmapped.fastq 18371_unmapped.dist<br></div><div>plot-abundance-dist.py 18371_unmapped.dist 18371_unmapped.reads-dist.png --ymax=300<br></div></div><div><br></div><div>Cheers,<br></div><div>Christian</div><div> </div><blockquote style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex" class="gmail_quote">Hi Christian,<br>

if 20% of your reads remain, then you should have some reasonable amount of contigs in there, waiting to be assembled :). Either that, or a high error rate that’s making reads look different when they really aren’t.<br>

There are a few reasons why you might not be assembling anything, mostly related to things like repeats or polymorphism. The main question I have is: what is the coverage of your unassembled regions? If it’s low, then you have low-level contamination and you shouldn’t expect anything (unlikely in this case). If it’s high, then something else is going on and banging on it with a repeat-aware assembler might be useful.<br>

To estimate the coverage without using mapping, try out the k-mer abundance and calc-median-distribution stuff, here:<br>

<a href="http://khmer-recipes.readthedocs.org/en/latest/001-extract-reads-by-coverage/index.html">http://khmer-recipes.readthedocs.org/en/latest/001-extract-reads-by-coverage/index.html<br></a>

and ship the plots (or raw data files) to the list — let’s see what they look like!<br>

cheers,<br>—titus</blockquote><pre style="color:rgb(0,0,0)"><br></pre><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Christian Frech</b> <span dir="ltr">&lt;<a href="mailto:frech.christian@gmail.com">frech.christian@gmail.com</a>&gt;</span><br>Date: Tue, Oct 14, 2014 at 10:46 PM<br>Subject: Digital normalization and assembly of millions of unmapped RNA-seq reads<br>To: <a href="mailto:khmer@lists.idyll.org">khmer@lists.idyll.org</a><br><br><br><div dir="ltr"><div><div><div><div>Hi all,<br><br></div>we have an RNA-seq data set (Illumina HiSeq 2000, 50 bp, single-end reads) from a human cell line where 8.2 million reads (~25% of the total) do not map against the human reference genome. I figured out that almost all these unmapped reads are of viral origin. Including the identified viral genome in the reference and inspecting the re-mapped reads in IGV shows that this viral genome is almost completely covered, several thousand-fold (!). However, I can also see 2-3 gaps in coverage with no mapped reads.<br><br></div>To figure out exactly what the viral transcript looks like, I tried Minia for de novo assembly but failed, i.e., no contigs from the viral genome were assembled. Minia works in principle, because the same Minia output contains an almost perfect assembly of the PhiX genome, which I know is also in the data. <br><br>After corresponding with Rayan Chikhi (author of Minia) we came to the conclusion that the problem is that I have too HIGH sequence coverage for the viral genome and that this confuses the de Bruijn graph assembler. The reason is that with so many reads covering each position in the genome, inevitably you get quite a few reads (&gt; 20) with sequencing errors at every position, which obviously wreaks havoc on the de Bruijn graph.<br><br>He suggested using digital normalization as a possible remedy, so I gave it a try. 
I followed the single-pass mRNA-seq pipeline at <a href="http://ged.msu.edu/angus/diginorm-2012/tutorial.html" target="_blank">http://ged.msu.edu/angus/diginorm-2012/tutorial.html</a>, so far without success. I still cannot see any major contigs from my viral genome in the Minia output. From the diginorm output, I see that about 80% of my 8.2 million unmapped reads were removed, which I guess is good.<br><br>Any suggestions on how to proceed from here? Given my particular problem, is there anything to tweak with diginorm, or should I just try different RNA-seq assemblers (Oases, Trinity)?<span class=""><font color="#888888"><br><br></font></span></div><span class=""><font color="#888888">Christian<br></font></span></div></div>

</div><br></div>