[khmer] diginorm on merged reads

Mon Oct 7 08:05:53 PDT 2013

Hi Titus and all,
Following up on my previous question: I ran a few different assemblies
exploring the effect of using khmer digital normalization (diginorm) and
FLASH to merge short reads. I compared the results of (1) running diginorm
only, (2) running diginorm, then attempting to merge the still-paired reads
with FLASH, and (3) first attempting to merge paired reads with FLASH,
followed by diginorm. In all cases, I used trimmed-and-filtered reads and
assembled with velvet-oases using a k-mer size of 21. Below are some
assembly statistics.

1) diginorm only

assembly stat            result
---------------------    ------------
Total Contigs            126812
Total Trimmed Contigs    126781
Total Length             109476821
Min contig size          100
Median contig size       365
Mean contig size         863
Max contig size          14314
N50 Contig               16370
N50 Length               1933
N90 Contig               66842
N90 Length               333

2) diginorm then FLASH

assembly stat            result
---------------------    ------------
Total Contigs            111434
Total Trimmed Contigs    111413
Total Length             111343478
Min contig size          100
Median contig size       447
Mean contig size         999
Max contig size          20427
N50 Contig               15236
N50 Length               2163
N90 Contig               58158
N90 Length               410

3) FLASH then diginorm

assembly stat            result
---------------------    ------------
Total Contigs            90612
Total Trimmed Contigs    90612
Total Length             86485229
Min contig size          119
Median contig size       586
Mean contig size         954
Max contig size          14006
N50 Contig               16436
N50 Length               1506
N90 Contig               60314
N90 Length               396
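
For reference, here is a minimal Python sketch of one common way to compute
the "N50 Contig" / "N50 Length" style rows in the tables above. The function
name `nx` and the exact definition (contigs sorted longest first, stopping
when the running total first reaches the given fraction of the total length)
are assumptions; the tool that produced the tables may define these
statistics slightly differently.

```python
def nx(lengths, fraction=0.5):
    """Return (contig_count, contig_length) for an Nx statistic.

    Assumed definition: sort contigs longest first and report the rank and
    length of the contig at which the cumulative length first reaches
    `fraction` of the total assembly length (0.5 for N50, 0.9 for N90).
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return count, length
    raise ValueError("no contig lengths given")


# Toy contig lengths, not taken from any of the assemblies above.
toy = [10, 8, 6, 4, 2]
n50 = nx(toy)        # (2, 8): the two longest contigs cover half of 30 bp
n90 = nx(toy, 0.9)   # (4, 4): four contigs are needed to cover 90%
```

As a quick consistency check on the tables, each reported mean matches
Total Length divided by Total Contigs (for example, 86485229 / 90612 is
about 954 for assembly 3).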

It's interesting, and seems to make sense, that merging reads prior to
diginorm results in the assembly with the fewest contigs (FYI: based on
the closest genome for this species, I expect ~17k genes, so there are still
far more transcripts than genes). I'm leaning towards using this as my final
assembly, as having fewer contigs that are longer on average (at least
compared with diginorm alone, by median and mean contig size) seems
preferable.
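
For context on what the diginorm step does to the reads in each pipeline,
the rule can be sketched in a few lines of Python: a read is kept only if
the median count of its k-mers, among the reads kept so far, is below a
coverage cutoff. This is a toy illustration only, not khmer's
implementation (khmer uses a memory-efficient probabilistic counting
structure rather than an exact dictionary). The function names and the `k`
and `cutoff` values are made up for the example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def toy_diginorm(reads, k=4, cutoff=3):
    """Keep a read only if its median k-mer count so far is below cutoff."""
    counts = Counter()  # exact k-mer counts over the reads kept so far
    kept = []
    for read in reads:
        mers = kmers(read, k)
        if not mers:
            continue  # read is shorter than k, nothing to count
        if median(counts[m] for m in mers) < cutoff:
            kept.append(read)
            counts.update(mers)  # only kept reads add to the counts
    return kept


# Five copies of one read plus one unrelated read (toy sequences).
reads = ["ACGTTGCAAG"] * 5 + ["GGATCCTAGA"]
kept = toy_diginorm(reads)
# Three copies of the repeated read are kept, then the rest are dropped;
# the unrelated read is kept because its k-mers have not been seen.
```

Under this rule, redundant coverage is discarded whether it comes from
merged or unmerged reads, but what counts as redundant depends on the
reads seen before it.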

thanks,
John

On Fri, Jul 26, 2013 at 4:10 PM, John Stanton-Geddes <johnsg at uvm.edu> wrote:

> Hi Titus and the khmer list,
> I'm working on transcriptome assembly with samples treated at 12 different
> temperatures to capture genes expressed across the thermal range of my
> favorite ant species. I pooled the samples and ran them in a single lane of
> 100 bp paired end HiSeq, so I have about 16 million reads per sample, 160
> million reads total.
>
> My question:
> is there any benefit to merging my paired-end reads (e.g. using FLASH
> http://bioinformatics.oxfordjournals.org/content/early/2011/09/07/bioinformatics.btr507)
> prior to running diginorm? A preliminary run of FLASH on some of my samples
> showed that about 65% of reads are merged (which is a bit surprising since
> the library was supposed to have been size-selected at 200 bp).
>
> My thought is to run diginorm on the merged reads, and also on the
> un-merged reads using the `-p` option as documented previously (
> http://lists.idyll.org/pipermail/khmer/2013-July/000123.html). I'd then
> combine all these and run a second pass of diginorm.
>
> Is this a valid approach, or is merging reads redundant with what diginorm
> does (since reads that add extra coverage would be tossed out anyway)?
>
> Apologies if this is a noob question.
>
> Thanks for the software!
>
> -John
>
> --
> Postdoctoral Research Associate
> Department of Biology, University of Vermont
> Room 211, Marsh Life Science Building
> 109 Carrigan Drive
> Burlington, Vermont 05405
> www.johnstantongeddes.org
>

-- 
Postdoctoral Research Associate
Department of Biology, University of Vermont
Room 211, Marsh Life Science Building
109 Carrigan Drive
Burlington, Vermont 05405
www.johnstantongeddes.org