[khmer] Counting kmers and disabling reverse complement

Jordan Fish jrdn.fish at gmail.com
Fri Jun 14 16:30:58 PDT 2013


The hashtable is a simple presence/absence Bloom filter: it can tell you
only whether you've (probably) seen a kmer.  The counting hashtable, by
contrast, is a counting Bloom filter: it can tell you (probably) how many
times you've seen a given kmer.  I say "probably" only because a Bloom
filter is a probabilistic data structure.
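
To make the "probably" concrete, here is a minimal pure-Python sketch of a
counting Bloom filter.  It is not khmer's implementation: the class name,
the table size, the number of hash functions and the salted-SHA-256 hashing
are all assumptions chosen for illustration.  What it shows is that
collisions can push a reported count up, but never down.

```python
import hashlib


class ToyCountingBloomFilter(object):
    """Illustrative counting Bloom filter; NOT khmer's implementation."""

    def __init__(self, size=10007, num_hashes=3):
        # size and num_hashes are arbitrary illustrative defaults.
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _slots(self, kmer):
        # Simulate several hash functions by salting a single digest.
        slots = []
        for salt in range(self.num_hashes):
            data = ("%d:%s" % (salt, kmer)).encode("ascii")
            slots.append(int(hashlib.sha256(data).hexdigest(), 16) % self.size)
        return slots

    def add(self, kmer):
        # set() so a k-mer whose hashes collide bumps that slot only once.
        for slot in set(self._slots(kmer)):
            self.counters[slot] += 1

    def get(self, kmer):
        # Other k-mers can only inflate a counter, so the minimum over the
        # slots is the best estimate and never undercounts.
        return min(self.counters[slot] for slot in self._slots(kmer))

    def consume(self, sequence, ksize):
        # Add every overlapping k-mer of the sequence.
        for i in range(len(sequence) - ksize + 1):
            self.add(sequence[i:i + ksize])


cbf = ToyCountingBloomFilter()
cbf.consume("ACGTACGTAC", 4)
print(cbf.get("ACGT"))  # at least 2; larger only if slots collide
```

A presence/absence filter is the same idea with one bit per slot instead of
a counter, so it can answer only "seen" or "not seen".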

As for replacing khmer.new_ktable with khmer.new_counting_hash: it will
work, but if you're going to stick with low ksizes the ktable will be
perfectly fine (and possibly preferable, since it is an exact data structure).
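
For comparison, exact counting at small k needs nothing more than a dense
array of 4**k counters.  The sketch below is plain Python, not the ktable
code: the function names, the A=0, C=1, G=2, T=3 encoding (khmer's internal
ordering may differ) and the collapse_rc flag are assumptions made up for
illustration.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}       # assumed encoding
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def kmer_index(kmer):
    # Read the k-mer as a base-4 number.
    index = 0
    for base in kmer:
        index = index * 4 + CODES[base]
    return index


def reverse_complement(kmer):
    return "".join(COMPLEMENT[base] for base in reversed(kmer))


def count_kmers(sequence, ksize, collapse_rc=False):
    # One exact counter per possible k-mer: 4**ksize entries.
    counts = [0] * (4 ** ksize)
    for i in range(len(sequence) - ksize + 1):
        kmer = sequence[i:i + ksize]
        if any(base not in CODES for base in kmer):
            continue  # skip k-mers containing N or other ambiguity codes
        if collapse_rc:
            # Count a k-mer and its reverse complement in the same bin.
            kmer = min(kmer, reverse_complement(kmer))
        counts[kmer_index(kmer)] += 1
    return counts


separate = count_kmers("ACGT", 2)                     # AC, CG, GT each once
collapsed = count_kmers("ACGT", 2, collapse_rc=True)  # GT folded into AC
```

With ksize=5 this is 1,024 counters; at ksize=12 it is already
4**12 = 16,777,216, which is why a dense exact table stops being practical
as k grows.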

Jordan


On Fri, Jun 14, 2013 at 1:33 PM, Lester Mackey <lmackey at stanford.edu> wrote:

> Thanks Jordan and Titus!
>
> Am I correct that Titus's script will also work with kt =
> khmer.new_counting_hash(KSIZE, starting_size)?  What is the difference
> between new_counting_hash and new_hashtable?
>
> Thanks again,
> Lester
>
>
> On Fri, Jun 14, 2013 at 7:36 AM, C. Titus Brown <ctb at msu.edu> wrote:
>
>> Thanks, Jordan.
>>
>> Lester -- if you want to do standard pentamer signature analysis, here's
>> a script I wrote --
>>
>> ---
>>
>> #! /usr/bin/env python
>> import sys
>> import khmer
>> import screed
>>
>> KSIZE=5
>>
>> def main(inp_name, outp_name, min_seq_len):
>>     outfp = open(outp_name, 'w')
>>
>>     min_seq_len = int(min_seq_len)
>>
>>     for record in screed.open(inp_name):
>>         if len(record.sequence) < min_seq_len:
>>             continue
>>
>>         kt = khmer.new_ktable(KSIZE)
>>         kt.consume(record.sequence[:min_seq_len])
>>
>>         x = []
>>         for i in range(4**KSIZE):
>>             x.append("%s" % (kt.get(i),))
>>
>>         print >>outfp, " ".join(x)
>>
>> if __name__ == '__main__':
>>     main(*sys.argv[1:4])
>>
>> ---
>>
>> On Fri, Jun 14, 2013 at 08:53:22AM -0400, Jordan Fish wrote:
>> > Hi Lester,
>> >
>> > Unless you are working with fairly small k-values you will probably
>> > want to use the CountingHash.  Ktable handles simple exact counting, so
>> > for large-ish values of k (>12, according to
>> > http://khmer.readthedocs.org/en/latest/ktable.html) it'll blow up.
>> >
>> > The counting hash uses a Bloom filter to limit memory usage at the cost
>> > of inexact counting.  Hopefully Titus will jump in here with a link to
>> > some documentation on the inexact counting.
>> >
>> > Finally, if you want to force khmer to treat a kmer and its reverse
>> > complement as distinct, you will need to edit 'lib/Makefile' and change
>> > the line
>> >
>> > NO_UNIQUE_RC=0
>> >
>> > to
>> >
>> > NO_UNIQUE_RC=1
>> >
>> > and rebuild khmer
>> >
>> > Jordan
>> >
>> > On Fri, Jun 14, 2013 at 3:22 AM, Lester Mackey <lmackey at stanford.edu> wrote:
>> >
>> > > Dear khmer Discussion List,
>> > >
>> > > If my goal is to obtain a vector of kmer counts quickly from a FASTA
>> > > or FASTQ file, is there any reason to prefer ktable to one of your
>> > > other data structures, like the counting hash table?
>> > >
>> >
>> > > I've noticed that ktable hashes a kmer and its reverse complement to
>> > > the same bin.  Is there an easy way to disable this feature (and
>> > > thereby count each kmer and reverse complement separately)?
>> > >
>> > > Thanks,
>> > > Lester
>> > >
>> > > _______________________________________________
>> > > khmer mailing list
>> > > khmer at lists.idyll.org
>> > > http://lists.idyll.org/listinfo/khmer
>> > >
>> > >
>>
>>
>>
>> --
>> C. Titus Brown, ctb at msu.edu
>>
>
>