[khmer] Questions about digital normalization

C. Titus Brown ctb at msu.edu
Sat Jul 13 20:19:29 PDT 2013


On Sat, Jul 13, 2013 at 10:24:09AM +0800, cy_jiang wrote:
>  Hello Professor C. Titus Brown,
> 
> I am trying to apply digital normalization to paired-end reads and I have some questions about how to use it. Would you be kind enough to help me with them? Since I am new to bioinformatics, please bear with me if I am not able to put it clearly.
> 
Hi Daniel,

no problem!

> I have forward and reverse reads in two different FASTA files. Each of them contains 16 million reads of length 100 bp (32 million reads in total). I would like to diginorm them and then assemble them with Velvet and Oases. I saw the example you give here (https://khmer.readthedocs.org/en/latest/scripts.html) and found only one file (test-abund-read-paired.fa) in the example. Does this file contain all the paired-end reads? If so, are there any format requirements for it (e.g. each read next to its mate)?

The file should be interleaved in order to use the '-p' option for
normalize-by-median.
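
For example, an interleaved FASTA file just alternates mates, with each
read immediately followed by its partner, something like:

    >read1/1
    ACCTGAGCTGAGGATTACA...
    >read1/2
    TTGACCAATTGGACCATGA...
    >read2/1
    GGCATTGCAGGATTCCAGT...
    >read2/2
    ACCTGTTAGGCATTACCGG...

(Note the old-style /1 and /2 suffixes on the read names; that's what the
pairing check looks for.)  Recent khmer checkouts should have a
scripts/interleave-reads.py that will build such a file from your two FASTA
files, or you can do it yourself with a few lines of Python.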

> What is more, I am curious how the normalization works when it is given paired-end reads. I was thinking of joining mate reads together into a single 200 bp sequence (16 million reads of length 200 in one file). According to the paper you published, digital normalization discards or accepts the whole sequence, so this would guarantee that no orphan reads are left. But this will of course increase the number of k-mers, which requires more RAM. Since I only have 16 GB of RAM, I am afraid there may not be enough for this solution. What should I set for the parameters -N and -x?

Hmm, two things --

First, the -p option to diginorm, implemented by Jared Simpson, keeps or
rejects both reads together, based on whether either one is novel.  This
should equate to keeping the entire "fragment" from which both reads came,
as long as that fragment hasn't been saturated yet.  Nobody has really
explored the effects of this on assembly, however.
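
If it helps to see it spelled out, here's a toy, untested Python sketch of
that paired logic -- not khmer's actual code, just the idea, with an exact
k-mer dictionary standing in for the counting hash tables:

    from collections import defaultdict

    K = 20        # k-mer size (the -k parameter)
    CUTOFF = 20   # coverage cutoff (the -C parameter)

    counts = defaultdict(int)

    def kmers(seq):
        return [seq[i:i + K] for i in range(len(seq) - K + 1)]

    def median_count(seq):
        c = sorted(counts[km] for km in kmers(seq))
        return c[len(c) // 2] if c else 0

    def keep_pair(read1, read2):
        # Keep BOTH reads if EITHER one still looks novel (median k-mer
        # coverage below the cutoff); otherwise discard both.  Either way
        # the fragment stays intact -- no orphans are created.
        if median_count(read1) < CUTOFF or median_count(read2) < CUTOFF:
            for km in kmers(read1) + kmers(read2):
                counts[km] += 1
            return True
        return False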

Second, you should definitely be normalizing all of your data in one hash
table.  There are good options for finding larger-memory machines if you
need them, but unless you're doing a very diverse metagenome, 16 GB will
probably be enough.
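
As for -N and -x: the counting tables take roughly N * x bytes of RAM in
total, so something like

    normalize-by-median.py -p -k 20 -C 20 -N 4 -x 4e9 interleaved.fa

should fit in about 16 GB (4 x 4e9 bytes), with the kept reads ending up in
interleaved.fa.keep.  Treat that as a starting point rather than a magic
number, and adjust -x to whatever your machine can actually spare.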

OK, and a third -- the default (without -p) for diginorm is just to discard
reads on an individual basis.  Prior to -p, we recommended interleaving your
reads, then running normalize-by-median, and then going through and
retrieving the pairs into one file and the orphan reads into another.
The script 'sandbox/strip-and-split-for-assembly.py' will do that last
splitting step.
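
Without -p, the whole dance looks roughly like this (untested sketch;
adjust paths and parameters for your setup, and check each script's --help,
since the exact invocations may differ between khmer versions):

    # 1. interleave the two FASTA files
    python /path/to/khmer/scripts/interleave-reads.py reads_1.fa reads_2.fa \
        > combined.fa

    # 2. normalize; kept reads go to combined.fa.keep
    python /path/to/khmer/scripts/normalize-by-median.py -k 20 -C 20 \
        -N 4 -x 4e9 combined.fa

    # 3. split the survivors back into proper pairs and orphans, using
    #    sandbox/strip-and-split-for-assembly.py on combined.fa.keep (see
    #    the script itself for its exact arguments).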

cheers,
--titus
-- 
C. Titus Brown, ctb at msu.edu



