[khmer] Normalization

C. Titus Brown ctb at msu.edu
Fri Aug 30 11:39:18 PDT 2013


On Fri, Aug 30, 2013 at 12:10:08PM -0300, Facundo Giorello wrote:
> Hi Titus,
> 
> I'm trying to build a consensus transcriptome from fourteen
> transcriptomic libraries.
> 
> To this end, I pooled the reads from all 14 libraries and then ran
> Diginorm and Trinity's in silico normalization (Trinitynorm) before
> assembling with Trinity.
> I also assembled the pooled reads from all 14 libraries without any
> normalization (the "multireads" strategy).
> 
> Unexpectedly, the multireads strategy was the best at reconstructing
> the coding sequences. I think this is because the multireads assembly
> produced the longest contigs (even though its median and average contig
> lengths are lower than those of the assemblies from both normalization
> algorithms).
> 
> On the other hand, the assemblies from the normalization algorithms
> seem to have "recovered" more (potential) isoforms.
> If I look at the number of contigs reconstructed for each coding
> sequence, the normalization strategies reconstructed more contigs than
> multireads (even though the normalization strategies reconstruct the
> distinct coding sequences less well) -- please see the attached file.
> In fact, I calculated the average number of repeated "comp_XXX" IDs
> (the IDs of Trinity-assembled contigs, which I believe correspond to
> alternative reconstructions of a given contig), and the Diginorm and
> Trinitynorm assemblies had double the number of the multireads assembly.
> 
> Another thing I noticed is that the assemblies after normalization
> recovered more distinct genes in total (based on blastx annotation).
> 
> Do you think my explanation of the results makes sense? Any idea why
> the assembly after normalization is more fragmented?
> (I ran both Diginorm and Trinitynorm with a k-mer size of 25.)
> 
> *Diginorm command:
> normalize-by-median.py -p -C 30 -k 25 -N 4 -x 2.5e9 reads_shuffled.fastq
> 
> *Trinitynorm command:
> 
> normalize_by_kmer_coverage.pl --seqType fq --JM 100G --max_cov 30
> --left left.fq --right right.fq --pairs_together --PARALLEL_STATS
> --JELLY_CPU 10
> 
> 
> The assembly was performed using paired-end reads. For running
> Diginorm, I shuffled the reads before normalization.

Hi Facundo,

thanks for this very thorough e-mail!

Your results mirror comments we've seen from others: with normalization
approaches, you seem to get (a) more isoforms, (b) better sensitivity,
and (c) less contiguity.  All three of these seem to depend on lots
of things, including data quality and preprocessing, transcriptome
complexity, and sequencing depth.

Side note: I already sent you the eel-pond mRNAseq protocol separately, 

	https://khmer-protocols.readthedocs.org/en/v0.8.2/

and I would be interested in your experiences with that, if you have the
time and energy.  No worries if not.

A few thoughts and comments:

* The main thing that is missing from your khmer use is some sort of
  error-trimming.  We have recently added the '-V' flag to 'filter-abund',
  which lets you do low-abundance k-mer trimming on variable-abundance
  data sets like transcriptomes.  Highly recommended; it should resolve
  the problem where diginorm preferentially collects errors because they
  look so shiny and novel.  (Rough commands below, after these bullets.)

* Your two metrics (number of conserved genes recovered; length of protein
  coding regions reconstructed) seem like the right ones to use.  I recognize
  that conflicting results are annoying and we're working on a resolution,
  but for now I don't have any remedy handy.

* I actually think of the recovery of all those isoforms as a bigger problem
  than you might; they're likely the product of noisy splicing (see
  Pickrell and Pritchard's paper on that).
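
To make the error-trimming concrete, something along these lines should
work.  Treat it as a sketch rather than gospel: the option names
(--savehash in particular) have moved around between khmer releases, and
'reads.kh' is just a placeholder name for the saved counting hash.

    # save the counting hash while normalizing
    normalize-by-median.py -p -C 30 -k 25 -N 4 -x 2.5e9 \
        --savehash reads.kh reads_shuffled.fastq

    # variable-coverage trimming of low-abundance k-mers (-V), so that
    # genuinely low-coverage reads are left alone
    filter-abund.py -V reads.kh reads_shuffled.fastq.keep

    # trimming can orphan mates; re-extract the pairs before Trinity
    extract-paired-reads.py reads_shuffled.fastq.keep.abundfilt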

So, umm, no specific directions for you.  We've seen and heard similar things
and are trying to figure out good ways of combining approaches to generate
the best of both worlds.  Hope that's not too depressing... ;)
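
One small practical note: 'normalize-by-median.py -p' expects the two
reads of each pair to be interleaved (i.e. adjacent) in a single file,
so as long as your "shuffling" step preserves that, you're fine; a truly
random shuffle would lose the pairing.  khmer ships a script for the
interleaving -- roughly the following, though the exact output handling
depends on your khmer version:

    interleave-reads.py left.fq right.fq > reads_shuffled.fastq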

cheers,
--titus
-- 
C. Titus Brown, ctb at msu.edu