[khmer] Counting kmers and disabling reverse complement

Lester Mackey lmackey at stanford.edu
Fri Jun 14 10:33:51 PDT 2013


Thanks Jordan and Titus!

Am I correct that Titus's script will also work with kt =
khmer.new_counting_hash(KSIZE, starting_size)?  What is the difference
between new_counting_hash and new_hashtable?

Thanks again,
Lester


On Fri, Jun 14, 2013 at 7:36 AM, C. Titus Brown <ctb at msu.edu> wrote:

> Thanks, Jordan.
>
> Lester -- if you want to do standard pentamer signature analysis, here's
> a script I wrote --
>
> ---
>
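> #! /usr/bin/env python
> # Write one line of pentamer counts per sequence that is long enough.
> import sys
> import khmer
> import screed
>
> KSIZE = 5
>
> def main(inp_name, outp_name, min_seq_len):
>     min_seq_len = int(min_seq_len)
>
>     # The with block closes the output file even if a record fails.
>     with open(outp_name, 'w') as outfp:
>         for record in screed.open(inp_name):
>             # Skip sequences shorter than the required length.
>             if len(record.sequence) < min_seq_len:
>                 continue
>
>             # Count k-mers in the first min_seq_len bases only.
>             kt = khmer.new_ktable(KSIZE)
>             kt.consume(record.sequence[:min_seq_len])
>
>             # One count per possible k-mer, indexed 0 .. 4**KSIZE - 1.
>             x = [str(kt.get(i)) for i in range(4**KSIZE)]
>
>             outfp.write(" ".join(x) + "\n")
>
> if __name__ == '__main__':
>     main(*sys.argv[1:4])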
>
> ---
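For readers without khmer installed, the idea behind the script above (one exact count per possible k-mer, written out as a vector of length 4**k) can be sketched in plain Python. This is only an illustration: the names kmer_index and count_vector are hypothetical, the base-4 ordering A, C, G, T is an assumption that need not match khmer's internal bin order, and unlike ktable this sketch counts each strand separately.

```python
KSIZE = 5
ALPHABET = "ACGT"  # assumed ordering; khmer's own bin order may differ


def kmer_index(kmer):
    """Map a k-mer over ACGT to an integer in [0, 4**k) via base-4 digits."""
    index = 0
    for base in kmer:
        index = index * 4 + ALPHABET.index(base)
    return index


def count_vector(sequence, ksize=KSIZE):
    """Return a list of 4**ksize exact counts, one per possible k-mer."""
    counts = [0] * (4 ** ksize)
    sequence = sequence.upper()
    for start in range(len(sequence) - ksize + 1):
        kmer = sequence[start:start + ksize]
        # Skip windows that contain N or any other non-ACGT character.
        if all(base in ALPHABET for base in kmer):
            counts[kmer_index(kmer)] += 1
    return counts


if __name__ == "__main__":
    vec = count_vector("ACGTACGTAC", ksize=2)
    print(" ".join(str(c) for c in vec))
```

The vector has 4**k entries no matter how long the input is, so memory grows exponentially in k. That is why exact tables of this kind are only practical for small k.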
>
> On Fri, Jun 14, 2013 at 08:53:22AM -0400, Jordan Fish wrote:
> > Hi Lester,
> >
> > Unless you are working with fairly small k-values you will probably want to
> > use the CountingHash.  Ktable handles simple exact counting, so for
> > large-ish values of k (>12, according to
> > http://khmer.readthedocs.org/en/latest/ktable.html) it'll blow up.
> >
> > The counting hash uses a Bloom filter to limit memory usage at the cost of
> > inexact counting.  Hopefully Titus will jump in here with a link to some
> > documentation on the inexact counting.
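The trade-off described above (bounded memory in exchange for inexact counts) can be illustrated with a small Count-Min-style counter. This is a sketch of the general technique only, not khmer's implementation: the class name ApproxKmerCounter, the table sizes, and the use of SHA-1 for hashing are all assumptions made for the example.

```python
import hashlib


class ApproxKmerCounter:
    """Fixed-memory k-mer counter that can overcount but never undercounts."""

    def __init__(self, table_sizes=(1009, 1013, 1019)):
        # Several tables of different (here prime) sizes; memory is fixed
        # up front and does not depend on k or on the input size.
        self.table_sizes = tuple(table_sizes)
        self.tables = [[0] * size for size in self.table_sizes]

    def _hash(self, kmer):
        # Deterministic hash so results are reproducible across runs.
        digest = hashlib.sha1(kmer.encode("ascii")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, kmer):
        h = self._hash(kmer)
        for table, size in zip(self.tables, self.table_sizes):
            table[h % size] += 1

    def get(self, kmer):
        # Collisions can only inflate a slot, so the minimum over all
        # tables is the tightest available upper bound on the true count.
        h = self._hash(kmer)
        return min(table[h % size]
                   for table, size in zip(self.tables, self.table_sizes))

    def consume(self, sequence, ksize):
        for start in range(len(sequence) - ksize + 1):
            self.add(sequence[start:start + ksize])
```

With generously sized tables the reported counts are usually exact. Shrinking the tables makes collisions, and therefore overcounts, more likely. In the extreme case of a single one-slot table, ApproxKmerCounter(table_sizes=(1,)) reports the total number of k-mers consumed for every query.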
> >
> > Finally, if you want to force khmer to treat a kmer and its reverse
> > complement as distinct you will need to edit 'lib/Makefile' and change the
> > line
> >
> > NO_UNIQUE_RC=0
> >
> > to
> >
> > NO_UNIQUE_RC=1
> >
> > and rebuild khmer.
> >
> > Jordan
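To make the reverse-complement behaviour concrete, here is a minimal sketch of the two counting modes: collapsing a k-mer with its reverse complement into one bin, versus counting each strand separately. It does not use khmer. The function names are hypothetical, and picking the lexicographically smaller of the two strings as the shared key is an assumption for illustration; khmer may choose its representative differently.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an upper-case ACGT string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def count_kmers(sequence, ksize, collapse_rc=True):
    """Count k-mers, optionally merging each k-mer with its reverse complement."""
    counts = Counter()
    for start in range(len(sequence) - ksize + 1):
        kmer = sequence[start:start + ksize]
        if collapse_rc:
            # Assumed canonical form: the smaller of the k-mer and its RC.
            kmer = min(kmer, reverse_complement(kmer))
        counts[kmer] += 1
    return counts


if __name__ == "__main__":
    print(count_kmers("AACGTT", 3, collapse_rc=True))
    print(count_kmers("AACGTT", 3, collapse_rc=False))
```

For the sequence AACGTT and k = 3, the collapsed mode yields two bins (AAC and ACG, each with count 2), while the strand-specific mode yields four bins (AAC, ACG, CGT, GTT, each with count 1).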
> >
> > On Fri, Jun 14, 2013 at 3:22 AM, Lester Mackey <lmackey at stanford.edu> wrote:
> >
> > > Dear khmer Discussion List,
> > >
> > > If my goal is to obtain a vector of kmer counts quickly from a FASTA or
> > > FASTQ file, is there any reason to prefer ktable to one of your other data
> > > structures, like the counting hash table?
> > >
> > > I've noticed that ktable hashes a kmer and its reverse complement to the
> > > same bin.  Is there an easy way to disable this feature (and thereby count
> > > each kmer and reverse complement separately)?
> > >
> > > Thanks,
> > > Lester
> > >
> > > _______________________________________________
> > > khmer mailing list
> > > khmer at lists.idyll.org
> > > http://lists.idyll.org/listinfo/khmer
> > >
> > >
>
>
>
> --
> C. Titus Brown, ctb at msu.edu
>