[khmer] khmer Digest, Vol 25, Issue 2

Mon May 11 08:28:28 PDT 2015

I think this could be the bug of the normalize-by-median script. There
script with (-p) keep the both paired ends but only count the kmers of one
of them. This is why abound-dist.py after normalize-by-median show many
kmers with count 0.
I think @camillescott was working on fixing that.

Tamer

On Mon, May 11, 2015 at 6:13 AM, <khmer-request at lists.idyll.org> wrote:

> Send khmer mailing list submissions to
>         khmer at lists.idyll.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://lists.idyll.org/listinfo/khmer
> or, via email, send a message with subject or body 'help' to
>         khmer-request at lists.idyll.org
>
> You can reach the person managing the list at
>         khmer-owner at lists.idyll.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of khmer digest..."
>
>
> Today's Topics:
>
>    1. inconsistent unique k-mer counting (Joran Martijn)
>    2. Re: inconsistent unique k-mer counting (C. Titus Brown)
>    3. Re: inconsistent unique k-mer counting (Joran Martijn)
>    4. Re: inconsistent unique k-mer counting (C. Titus Brown)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 11 May 2015 11:29:31 +0200
> From: Joran Martijn <joran.martijn at icm.uu.se>
> Subject: [khmer] inconsistent unique k-mer counting
> To: <khmer at lists.idyll.org>
> Message-ID: <5550767B.4070501 at icm.uu.se>
> Content-Type: text/plain; charset="utf-8"; Format="flowed"
>
> Dear Khmer mailing list,
>
> I'm trying to compare the number of unique k-mers (lets say 20-mers) in
> the raw dataset and diginormed dataset, similar as was done in the
> original diginorm paper.
>
> I have done this as follows. First I create the counting table:
>
> load-into-counting.py \
>      -t \
>      -k 20 \
>      -N 4 -x 16e9 \
>      --threads 30 \
>      test.ct \
>      test.fastq.gz \
>      &> test.ct.report
>
> The script then reports ~3,1 billion unique 20-mers.
>
> Then I perform the diginorm:
>
> normalize-by-median.py \
>      -p -t \
>      -C 20 \
>      --loadtable test.ct \
>      -o test_k20_C20.fastq.gz.keep \
>      test.fastq.gz \
>      &> test_k20_C20.report
>
> The script then reports approximately 1 million unique 20-mers (if true,
> this would mean a 3000-fold reduction in unique kmers, that sounds like
> too much to me).
>
> Then, I count the number of unique 20-mers again in the normalized
> dataset using load-into-counting.py:
>
> load-into-counting.py \
>      -t \
>      -k 20 \
>      -N 4 -x 16e9 \
>      --threads 30 \
>      test2.ct \
>      test_k20_C20.fastq.gz.keep \
>      &> test2.ct.report
>
> This time however, the script reports approximately 2,8 billion unique
> 20-mers. I am confused, since I was expecting around 1 million, like
> normalize-by-median reported.
> Are the two scripts reporting a different kind of unique k-mers? Which
> counts should I compare to get an estimate of the effect of diginorm on
> the dataset? Or is there a more easy and straightforward way to go about
> this?
>
> Kind regards,
>
> Joran Martijn
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <
> http://lists.idyll.org/pipermail/khmer/attachments/20150511/9363769d/attachment-0001.html
> >
>
> ------------------------------
>
> Message: 2
> Date: Mon, 11 May 2015 03:12:52 -0700
> From: "C. Titus Brown" <ctbrown at ucdavis.edu>
> Subject: Re: [khmer] inconsistent unique k-mer counting
> To: Joran Martijn <joran.martijn at icm.uu.se>
> Cc: khmer at lists.idyll.org
> Message-ID: <20150511101252.GA3199 at idyll.org>
> Content-Type: text/plain; charset=us-ascii
>
> On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:
> > Dear Khmer mailing list,
> >
> > I'm trying to compare the number of unique k-mers (lets say 20-mers) in
> > the raw dataset and diginormed dataset, similar as was done in the
> > original diginorm paper.
>
> [ elided ]
>
> Hi Joran,
>
> that certainly doesn't sound good :). Would it be possible to convey the
> various report files to us, publicly or privately?
>
> thanks,
> --titus
>
> p.s. Thank you for the very detailed question!
>
>
>
> ------------------------------
>
> Message: 3
> Date: Mon, 11 May 2015 13:06:24 +0200
> From: Joran Martijn <joran.martijn at icm.uu.se>
> Subject: Re: [khmer] inconsistent unique k-mer counting
> To: <titus at idyll.org>
> Cc: khmer at lists.idyll.org
> Message-ID: <55508D30.1060604 at icm.uu.se>
> Content-Type: text/plain; charset="windows-1252"; Format="flowed"
>
> Hej Titus,
>
> Thanks for the quick reply! Here are the report files, which are
> basically the STDERR and STDOUT output of the scripts.
> Quick note before the reports, I made a small mistake in my
> openingspost. The Coverage threshold I tried for these reports was 5,
> not 20.
>
> Here the report file of the first load-into-counting.py execution (on
> the raw sequence data), test.ct.report:
>
> || This is the script 'load-into-counting.py' in khmer.
> || You are running khmer version 1.3
> || You are also using screed version 0.8
> ||
> || If you use this script in a publication, please cite EACH of the
> following:
> ||
> ||   * MR Crusoe et al., 2014.
> http://dx.doi.org/10.6084/m9.figshare.979190
> ||   * Q Zhang et al., http://dx.doi.org/10.1371/journal.pone.0101271
> ||   * A. D303266ring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
> ||
> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
> details.
>
>
> PARAMETERS:
>   - kmer size =    20            (-k)
>   - n tables =     4             (-N)
>   - min tablesize = 1.6e+10      (-x)
>
> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
> --------
> Saving k-mer counting table to test.ct
> Loading kmers from sequences in ['test.fastq.gz']
> making k-mer counting table
> consuming input test.fastq.gz
> Total number of unique k-mers: 3102943887
> saving test.ct
> fp rate estimated to be 0.008
> DONE.
> wrote to: test.ct.info
>
> Here the report file of the normalize-by-median.py, test_k20_C5.report
>
> || This is the script 'normalize-by-median.py' in khmer.
> || You are running khmer version 1.3
> || You are also using screed version 0.8
> ||
> || If you use this script in a publication, please cite EACH of the
> following:
> ||
> ||   * MR Crusoe et al., 2014.
> http://dx.doi.org/10.6084/m9.figshare.979190
> ||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]
> ||
> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
> details.
>
>
> PARAMETERS:
>   - kmer size =    20            (-k)
>   - n tables =     4             (-N)
>   - min tablesize = 1.6e+10      (-x)
>
> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
> --------
> ... kept 58012 of 200000 or 29%
> ... in file test.fastq.gz
> ... kept 116210 of 400000 or 29%
> ... in file test.fastq.gz
>
> ..... etc etc etc .....
>
> ... kept 90482098 of 346200000 or 26%
> ... in file test.fastq.gz
> ... kept 90529526 of 346400000 or 26%
> ... in file test.fastq.gz
> Total number of unique k-mers: 850221
> loading k-mer counting table from test.ct
> DONE with test.fastq.gz; kept 90547512 of 346477608 or 26%
> output in test_k20_C5.fastq.gz.keep
> fp rate estimated to be 0.008
>
> And here the second load-into-counting.py report, test2.ct.report
>
> || This is the script 'load-into-counting.py' in khmer.
> || You are running khmer version 1.3
> || You are also using screed version 0.8
> ||
> || If you use this script in a publication, please cite EACH of the
> following:
> ||
> ||   * MR Crusoe et al., 2014.
> http://dx.doi.org/10.6084/m9.figshare.979190
> ||   * Q Zhang et al., http://dx.doi.org/10.1371/journal.pone.0101271
> ||   * A. D303266ring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
> ||
> || Please see http://khmer.readthedocs.org/en/latest/citations.html for
> details.
>
>
> PARAMETERS:
>   - kmer size =    20            (-k)
>   - n tables =     4             (-N)
>   - min tablesize = 1.6e+10      (-x)
>
> Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
> --------
> Saving k-mer counting table to test2.ct
> Loading kmers from sequences in ['test_k20_C5.fastq.gz.keep']
> making k-mer counting table
> consuming input test_k20_C5.fastq.gz.keep
> Total number of unique k-mers: 2822473008
> saving test2.ct
>
> Hope this helps!
>
> Joran
>
> On 11/05/15 12:12, C. Titus Brown wrote:
> > On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:
> >> Dear Khmer mailing list,
> >>
> >> I'm trying to compare the number of unique k-mers (lets say 20-mers) in
> >> the raw dataset and diginormed dataset, similar as was done in the
> >> original diginorm paper.
> > [ elided ]
> >
> > Hi Joran,
> >
> > that certainly doesn't sound good :). Would it be possible to convey the
> > various report files to us, publicly or privately?
> >
> > thanks,
> > --titus
> >
> > p.s. Thank you for the very detailed question!
>
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <
> http://lists.idyll.org/pipermail/khmer/attachments/20150511/96dd402e/attachment-0001.htm
> >
>
> ------------------------------
>
> Message: 4
> Date: Mon, 11 May 2015 04:13:26 -0700
> From: "C. Titus Brown" <ctbrown at ucdavis.edu>
> Subject: Re: [khmer] inconsistent unique k-mer counting
> To: Joran Martijn <joran.martijn at icm.uu.se>
> Cc: khmer at lists.idyll.org
> Message-ID: <20150511111326.GG3199 at idyll.org>
> Content-Type: text/plain; charset=us-ascii
>
> OK, that's very weird - this must be a bug, but I'll be darned if I can
> figure out what might be causing it.  The numbers in load-into-counting
> should be correct but I'll have to independently confirm that.
>
> BTW, for the first round of diginorm I'd use C=20; see 3-pass diginorm
> in the dn paper.
>
> best,
> --titus
>
> On Mon, May 11, 2015 at 01:06:24PM +0200, Joran Martijn wrote:
> > Hej Titus,
> >
> > Thanks for the quick reply! Here are the report files, which are
> > basically the STDERR and STDOUT output of the scripts.
> > Quick note before the reports, I made a small mistake in my
> > openingspost. The Coverage threshold I tried for these reports was 5,
> > not 20.
> >
> > Here the report file of the first load-into-counting.py execution (on
> > the raw sequence data), test.ct.report:
> >
> > || This is the script 'load-into-counting.py' in khmer.
> > || You are running khmer version 1.3
> > || You are also using screed version 0.8
> > ||
> > || If you use this script in a publication, please cite EACH of the
> > following:
> > ||
> > ||   * MR Crusoe et al., 2014.
> http://dx.doi.org/10.6084/m9.figshare.979190
> > ||   * Q Zhang et al., http://dx.doi.org/10.1371/journal.pone.0101271
> > ||   * A. D303266ring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
> > ||
> > || Please see http://khmer.readthedocs.org/en/latest/citations.html for
> > details.
> >
> >
> > PARAMETERS:
> >  - kmer size =    20            (-k)
> >  - n tables =     4             (-N)
> >  - min tablesize = 1.6e+10      (-x)
> >
> > Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
> > --------
> > Saving k-mer counting table to test.ct
> > Loading kmers from sequences in ['test.fastq.gz']
> > making k-mer counting table
> > consuming input test.fastq.gz
> > Total number of unique k-mers: 3102943887
> > saving test.ct
> > fp rate estimated to be 0.008
> > DONE.
> > wrote to: test.ct.info
> >
> > Here the report file of the normalize-by-median.py, test_k20_C5.report
> >
> > || This is the script 'normalize-by-median.py' in khmer.
> > || You are running khmer version 1.3
> > || You are also using screed version 0.8
> > ||
> > || If you use this script in a publication, please cite EACH of the
> > following:
> > ||
> > ||   * MR Crusoe et al., 2014.
> http://dx.doi.org/10.6084/m9.figshare.979190
> > ||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]
> > ||
> > || Please see http://khmer.readthedocs.org/en/latest/citations.html for
> > details.
> >
> >
> > PARAMETERS:
> >  - kmer size =    20            (-k)
> >  - n tables =     4             (-N)
> >  - min tablesize = 1.6e+10      (-x)
> >
> > Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
> > --------
> > ... kept 58012 of 200000 or 29%
> > ... in file test.fastq.gz
> > ... kept 116210 of 400000 or 29%
> > ... in file test.fastq.gz
> >
> > ..... etc etc etc .....
> >
> > ... kept 90482098 of 346200000 or 26%
> > ... in file test.fastq.gz
> > ... kept 90529526 of 346400000 or 26%
> > ... in file test.fastq.gz
> > Total number of unique k-mers: 850221
> > loading k-mer counting table from test.ct
> > DONE with test.fastq.gz; kept 90547512 of 346477608 or 26%
> > output in test_k20_C5.fastq.gz.keep
> > fp rate estimated to be 0.008
> >
> > And here the second load-into-counting.py report, test2.ct.report
> >
> > || This is the script 'load-into-counting.py' in khmer.
> > || You are running khmer version 1.3
> > || You are also using screed version 0.8
> > ||
> > || If you use this script in a publication, please cite EACH of the
> > following:
> > ||
> > ||   * MR Crusoe et al., 2014.
> http://dx.doi.org/10.6084/m9.figshare.979190
> > ||   * Q Zhang et al., http://dx.doi.org/10.1371/journal.pone.0101271
> > ||   * A. D303266ring et al. http://dx.doi.org:80/10.1186/1471-2105-9-11
> > ||
> > || Please see http://khmer.readthedocs.org/en/latest/citations.html for
> > details.
> >
> >
> > PARAMETERS:
> >  - kmer size =    20            (-k)
> >  - n tables =     4             (-N)
> >  - min tablesize = 1.6e+10      (-x)
> >
> > Estimated memory usage is 6.4e+10 bytes (n_tables x min_tablesize)
> > --------
> > Saving k-mer counting table to test2.ct
> > Loading kmers from sequences in ['test_k20_C5.fastq.gz.keep']
> > making k-mer counting table
> > consuming input test_k20_C5.fastq.gz.keep
> > Total number of unique k-mers: 2822473008
> > saving test2.ct
> >
> > Hope this helps!
> >
> > Joran
> >
> > On 11/05/15 12:12, C. Titus Brown wrote:
> >> On Mon, May 11, 2015 at 11:29:31AM +0200, Joran Martijn wrote:
> >>> Dear Khmer mailing list,
> >>>
> >>> I'm trying to compare the number of unique k-mers (lets say 20-mers) in
> >>> the raw dataset and diginormed dataset, similar as was done in the
> >>> original diginorm paper.
> >> [ elided ]
> >>
> >> Hi Joran,
> >>
> >> that certainly doesn't sound good :). Would it be possible to convey the
> >> various report files to us, publicly or privately?
> >>
> >> thanks,
> >> --titus
> >>
> >> p.s. Thank you for the very detailed question!
> >
>
> --
> C. Titus Brown, ctbrown at ucdavis.edu
>
>
>
> ------------------------------
>
> _______________________________________________
> khmer mailing list
> khmer at lists.idyll.org
> http://lists.idyll.org/listinfo/khmer
>
>
> End of khmer Digest, Vol 25, Issue 2
> ************************************
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20150511/80568f33/attachment-0001.htm>