[khmer] Partitioning based on abundance using an already-loaded hash

Thu Aug 7 10:55:19 PDT 2014

We have a large plant genome project where we used khmer to count k-mer abundance. The k-mer spectrum shows three maxima (~60x, 120x, 180x), probably reflecting large-scale genome duplication events or alloploidy.

In short, we would like to partition the read data by abundance and assess how many paired-end (PE) and mate-pair (MP) reads are retained in each abundance peak. The goals are to see whether each partition can be assembled more efficiently on its own and to determine how mosaic the genome structure is (the idea being that a lower number of retained PE/MP reads would indicate more problems).
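To make "retained" concrete, here is a minimal sketch of the tally we have in mind. It assumes each read has already been given a bin label (the peak it was assigned to, or None if unassigned) and that the two lists are in matching pair order. The function name and the label scheme are placeholders for illustration, not anything from khmer.

```python
from collections import Counter

def pair_retention(bins_r1, bins_r2):
    """Per-bin fraction of reads whose mate landed in the same bin.

    bins_r1 / bins_r2: bin labels for read 1 and read 2 of each pair
    (e.g. 60, 120, 180, or None for unassigned), in matching order.
    """
    total = Counter()     # reads assigned to each bin (either mate)
    retained = Counter()  # pairs with both mates in the same bin
    for b1, b2 in zip(bins_r1, bins_r2):
        for b in (b1, b2):
            if b is not None:
                total[b] += 1
        if b1 is not None and b1 == b2:
            retained[b1] += 1
    # each retained pair accounts for two reads in its bin
    return {b: 2.0 * retained[b] / total[b] for b in total}
```

A low fraction for a given peak would mean that many of its reads have lost their mate to another peak (or to no peak at all), which is the signal we would read as "more mosaic".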

We found bin-reads-by-abundance.py in the sandbox, but it seems to rebuild the hash from scratch. Is there anything that will do the same using an already-generated hash? We could probably hack something together based on this script, but I was curious whether something already exists.
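For reference, this is roughly the logic we would hack together, as a sketch only. A plain Python dict stands in for the already-loaded hash, so nothing here is a khmer call; the function names, the 25% tolerance, and the use of the median k-mer count per read are all our own assumptions. The sketch also ignores reverse complements.

```python
def kmers(seq, k):
    """All overlapping k-mers of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def median_abundance(seq, counts, k):
    """Median k-mer count of a read, looked up in a pre-built table.

    counts: dict of k-mer -> abundance (stand-in for the loaded hash).
    Uses the upper median when the number of k-mers is even.
    """
    vals = sorted(counts.get(km, 0) for km in kmers(seq, k))
    if not vals:
        return 0  # read shorter than k
    return vals[len(vals) // 2]

def assign_bin(median, peaks, tolerance=0.25):
    """Nearest peak if the median is within tolerance * peak, else None."""
    nearest = min(peaks, key=lambda p: abs(p - median))
    if abs(nearest - median) <= tolerance * nearest:
        return nearest
    return None

# Example with the three peaks from our spectrum:
# assign_bin(median_abundance(read, counts, k), [60, 120, 180])
```

The point is that the counting step is skipped entirely: only lookups are done against the existing table, one pass over the reads.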

chris