[khmer] diginorm on merged reads
John Stanton-Geddes
johnsg at uvm.edu
Mon Oct 7 08:05:53 PDT 2013
Hi Titus and all,
Following up on my previous question - I ran a few different assemblies
exploring the effect of using khmer digital normalization (diginorm) and
FLASH to merge short reads. I compared the results of (1) running diginorm
only, (2) running diginorm and then attempting to merge the still-paired
reads with FLASH, and (3) first attempting to merge paired reads with FLASH
and then running diginorm. In all cases, I used trimmed-and-filtered reads
and performed the assembly with velvet-oases at a k-mer size of 21. Below
are some assembly statistics.
1) diginorm only
assembly stat            result
---------------------    ------------
Total Contigs            126812
Total Trimmed Contigs    126781
Total Length             109476821
Min contig size          100
Median contig size       365
Mean contig size         863
Max contig size          14314
N50 Contig               16370
N50 Length               1933
N90 Contig               66842
N90 Length               333
2) diginorm then FLASH
assembly stat            result
---------------------    ------------
Total Contigs            111434
Total Trimmed Contigs    111413
Total Length             111343478
Min contig size          100
Median contig size       447
Mean contig size         999
Max contig size          20427
N50 Contig               15236
N50 Length               2163
N90 Contig               58158
N90 Length               410
3) FLASH then diginorm
assembly stat            result
---------------------    ------------
Total Contigs            90612
Total Trimmed Contigs    90612
Total Length             86485229
Min contig size          119
Median contig size       586
Mean contig size         954
Max contig size          14006
N50 Contig               16436
N50 Length               1506
N90 Contig               60314
N90 Length               396
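
The N50 and N90 rows above can be reproduced from a list of contig lengths
roughly as follows. This is a minimal sketch, assuming that "N50 Contig" is
the number of contigs (longest first) needed to reach 50% of the total
length and "N50 Length" is the length of the contig at which that threshold
is crossed. The function name `nx` and the toy lengths are hypothetical and
are not taken from khmer or velvet-oases.

```python
def nx(lengths, fraction=0.5):
    """Return (contig_count, contig_length) for the Nx statistic.

    contig_count  : how many contigs, longest first, are needed to cover
                    `fraction` of the total assembly length
    contig_length : length of the contig at which that point is reached
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return count, length
    raise ValueError("no contig lengths given")


# Toy data only; real input would be the contig lengths of an assembly.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
n50_count, n50_length = nx(toy_lengths, 0.5)
n90_count, n90_length = nx(toy_lengths, 0.9)
```

Under this definition a higher "N50 Length" means that half of the assembled
bases sit in longer contigs, which is why it is reported alongside the
median and mean.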
It's interesting, and seems to make sense, that merging reads prior to
diginorm results in the assembly with the fewest contigs (FYI - based on
the closest genome for this species, I expect ~17k genes, so all three
assemblies have far more transcripts than genes). I'm leaning towards using
this as my final assembly, as having fewer contigs that are longer by median
and mean than with diginorm alone (median 586 vs. 365, mean 954 vs. 863)
seems preferable.
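
For context on what the diginorm step does to the reads in each workflow,
the core idea of digital normalization can be sketched as below. This is a
toy illustration only, not khmer's implementation: it assumes an exact
in-memory k-mer counter instead of khmer's probabilistic counting table, it
ignores reverse complements and read pairing, and the function names and the
default values of `k` and `cutoff` are chosen for illustration.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def toy_diginorm(reads, k=20, cutoff=20):
    """Keep a read only if its median k-mer count so far is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read is shorter than k, nothing to estimate from
        # Median abundance of this read's k-mers among the reads kept so far.
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)  # only kept reads contribute to the counts
    return kept


# Toy input: five copies of one read plus one unrelated read.
toy_reads = ["ACGTTGCAAG"] * 5 + ["TTTTGGGGCC"]
toy_kept = toy_diginorm(toy_reads, k=4, cutoff=2)
```

With a cutoff of 2, only the first two copies of the repeated read are kept,
while the unrelated read is retained because none of its k-mers has been
seen before.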
thanks,
John
On Fri, Jul 26, 2013 at 4:10 PM, John Stanton-Geddes <johnsg at uvm.edu> wrote:
> Hi Titus and the khmer list,
> I'm working on transcriptome assembly with samples treated at 12 different
> temperatures to capture genes expressed across the thermal range of my
> favorite ant species. I pooled the samples and ran them in a single lane of
> 100 bp paired-end HiSeq, so I have about 16 million reads per sample, 160
> million reads total.
>
> My question:
> is there any benefit to merging my paired-end reads (e.g. using FLASH
> http://bioinformatics.oxfordjournals.org/content/early/2011/09/07/bioinformatics.btr507)
> prior to running diginorm? A preliminary run of FLASH on some of my samples
> showed that about 65% of reads are merged (which is a bit surprising since
> the library was supposed to have been size-selected at 200 bp).
>
> My thought is to run diginorm on the merged reads, and also on the
> un-merged reads using the `-p` option as documented previously (
> http://lists.idyll.org/pipermail/khmer/2013-July/000123.html). I'd then
> combine all these and run a second pass of diginorm.
>
> Is this a valid approach, or is merging reads redundant with what diginorm
> does (since reads that add extra coverage would be tossed out anyway)?
>
> Apologies if this is a noob question.
>
> Thanks for the software!
>
> -John
>
> --
> Postdoctoral Research Associate
> Department of Biology, University of Vermont
> Room 211, Marsh Life Science Building
> 109 Carrigan Drive
> Burlington, Vermont 05405
> www.johnstantongeddes.org
>