[khmer] Questions about digital normalization

C. Titus Brown ctb at msu.edu
Wed Jul 17 04:29:10 PDT 2013


Hi Daniel,

no, filter-abund trims reads at low-abundance k-mers rather than discarding
them -- that's why shorter reads show up in the .abundfilt file!
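
Roughly, the trimming logic is something like this -- a simplified sketch,
not the actual khmer code; 'counts' stands in for the counting hash and
'cutoff' for the abundance cutoff:

    def trim_at_low_abundance(read, counts, k, cutoff):
        # scan k-mers left to right and truncate the read at the
        # first one whose abundance falls below the cutoff
        for i in range(len(read) - k + 1):
            if counts.get(read[i:i + k], 0) < cutoff:
                return read[:i + k - 1]
        return read

Trimmed reads that end up shorter than k get dropped entirely.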

cheers,
--titus

On Wed, Jul 17, 2013 at 05:35:41PM +0800, cy_jiang wrote:
> Hello Titus,
> 
> Thank you very much for your kind reply!
> 
> I followed your suggestion on my dataset, and it ran pretty well!
> 
> But another question came up when I tried the second step, which filters out unique k-mers (using this command: python /home/work/khmer/scripts/filter-abund.py mh.kh reads.fa.keep). I found there were lots of reads of shorter length (<100 bp) in the reads.fa.keep.abundfilt file. I wonder how this happened? Doesn't it discard or keep the entire read?
> 
> Hope to hear from you soon!
> 
> Best regards!
> 
> Daniel
> 
> At 2013-07-14 11:19:29, "C. Titus Brown" <ctb at msu.edu> wrote:
> >On Sat, Jul 13, 2013 at 10:24:09AM +0800, cy_jiang wrote:
> >>  Hello Professor C. Titus Brown,
> >> 
> >> I am trying to implement digital normalization on paired-end reads, and I have some questions about how to use it. Would you be kind enough to help me with them? Since I am new to bioinformatics, please bear with me if I am not able to put them clearly.
> >> 
> >Hi Daniel,
> >
> >no problem!
> >
> >> I have forward and reverse reads in two different FASTA files. Each of them contains 16 million reads of length 100 bp (32 million reads in total). I would like to diginorm them and then assemble them with Velvet and Oases. I saw the example you give here (https://khmer.readthedocs.org/en/latest/scripts.html) and found only one file (test-abund-read-paired.fa) in the example. Does this file contain all the paired-end reads? If that is the case, are there any format criteria for it (like each read being next to its mate)?
> >
> >The file should be interleaved in order to use the '-p' option for
> >normalize-by-median.
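> >
> >By "interleaved" I mean the pairs sit next to each other in a single
> >file, each read followed directly by its mate -- something like this
> >(hypothetical read names):
> >
> >  >seq1/1
> >  ACGTACGTACGT...
> >  >seq1/2
> >  TTGCATTGCATT...
> >  >seq2/1
> >  ...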
> >
> >> What is more, I am curious about how the normalization works when provided with paired-end reads. I was thinking of joining mate reads together into a single sequence of 200 bp (16 million reads of length 200 in one file). According to the paper you published, digital normalization discards or accepts the whole sequence, and this guarantees no orphan reads are left. But this will of course increase the number of k-mers, which requires more RAM. Since I only have 16 GB of RAM, I am afraid there may not be enough for this solution. What should I set for the parameters -N and -x?
> >
> >Hmm, two things --
> >
> >the -p option to diginorm, implemented by Jared Simpson, keeps or rejects
> >both reads based on whether either one is novel.  This should equate to
> >keeping the entire "fragment" from which both reads came if it hasn't been
> >saturated yet. Nobody has really explored the effects of this on assembly,
> >however.
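> >
> >In rough pseudocode, the pair rule is something like this (a sketch;
> >'median_coverage' stands in for the median k-mer abundance lookup,
> >and 'cutoff' is the coverage cutoff):
> >
> >    def keep_pair(read1, read2, median_coverage, cutoff):
> >        # keep BOTH reads if either one still looks novel, i.e.
> >        # its estimated coverage is below the cutoff
> >        return (median_coverage(read1) < cutoff or
> >                median_coverage(read2) < cutoff)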
> >
> >second, you should definitely be normalizing all your data in one hash
> >table.  There are good options for finding larger memory machines if you
> >need them, but unless you're doing a very diverse metagenome, 16 GB will
> >probably be enough.
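> >
> >Memory usage is roughly -N times -x bytes, so on your 16 GB machine
> >you could try something like this (leaving some headroom; filenames
> >hypothetical):
> >
> >  python /home/work/khmer/scripts/normalize-by-median.py -p -k 20 -C 20 \
> >      -N 4 -x 3e9 interleaved.fa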
> >
> >OK, and a third -- the default (without -p) for diginorm is just to discard
> >reads on an individual basis.  Prior to -p, we recommended interleaving your
> >reads, then running normalize-by-median, and then going through and
> >retrieving the pairs into one file and the orphan reads into another.
> >The script 'sandbox/strip-and-split-for-assembly.py' will do this.
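> >
> >That older workflow would look roughly like this (filenames
> >hypothetical; check each script's usage message):
> >
> >  python /home/work/khmer/scripts/normalize-by-median.py -k 20 -C 20 \
> >      -N 4 -x 3e9 interleaved.fa
> >  python /home/work/khmer/sandbox/strip-and-split-for-assembly.py \
> >      interleaved.fa.keep
> >
> >which should leave you with the still-paired reads in one file and
> >the orphaned reads in another.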
> >
> >cheers,
> >--titus
> >-- 
> >C. Titus Brown, ctb at msu.edu

-- 
C. Titus Brown, ctb at msu.edu



