[khmer] Partitioning based on abundance using an already-loaded hash

Fri Aug 8 03:34:28 PDT 2014

On Thu, Aug 07, 2014 at 05:55:19PM +0000, Fields, Christopher J wrote:
> We have a large plant genome project where we used khmer to perform k-mer abundance and found (in the k-mer spectrum) there are three maxima present (~60x, 120x, 180x), probably representing large-scale genome duplication events or alloploidy.
> 
> In short, we would like to partition the read data based on abundance to assess how many paired-end and mate-pair reads are retained in each abundance peak, basically to assess whether we can assemble each partition more efficiently and to determine how mosaic the genome structure is (idea being that lower number of retained PE/MP data would indicate more problems).  
> 
> We found bin-reads-by-abundance.py in the sandbox but this seems to rebuild the hash from scratch; is there anything that will take an already-generated hash to do the same?  Could probably hack something together based on this but I was curious whether something else already exists.

Hi Chris,

I don't think there's anything that does this from an already-loaded
hash, sorry!

I would very much suggest avoiding k-mer abundance spectra and using
read coverage spectra instead (e.g. the median k-mer abundance within
each read).  The script scripts/count-median.py will calculate this for
each sequence, for example; you'd just need to modify that script to
output sequences based on your chosen cutoffs.
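
To make the idea concrete, here is a minimal sketch of binning reads by
median k-mer abundance. It is plain Python and does not use khmer at all:
the Counter stands in for the already-loaded hash, and the function names
(count_kmers, median_kmer_abundance, bin_reads) and the cutoff values are
hypothetical, chosen only for illustration. With khmer, the counts[...]
lookup would be replaced by whatever count lookup the loaded hash provides.

```python
import bisect
from collections import Counter
from statistics import median


def count_kmers(reads, k):
    """Count every k-mer across all reads (stand-in for a loaded hash)."""
    counts = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def median_kmer_abundance(seq, counts, k):
    """Median abundance of the k-mers in one read; 0 if the read is shorter than k."""
    abundances = [counts[seq[i:i + k]] for i in range(len(seq) - k + 1)]
    return median(abundances) if abundances else 0


def bin_reads(reads, counts, k, cutoffs):
    """Split reads into len(cutoffs) + 1 bins by median k-mer abundance.

    cutoffs must be sorted ascending. Bin 0 holds reads with a median below
    cutoffs[0], bin i holds reads with cutoffs[i-1] <= median < cutoffs[i],
    and the last bin holds reads at or above the final cutoff.
    """
    bins = [[] for _ in range(len(cutoffs) + 1)]
    for seq in reads:
        med = median_kmer_abundance(seq, counts, k)
        bins[bisect.bisect_right(cutoffs, med)].append(seq)
    return bins
```

For three peaks at roughly 60x, 120x and 180x, cutoffs midway between the
peaks (say [90, 150]) would give one bin per peak; the right values depend
on the actual spectrum. Paired reads would need extra handling to keep
mates together, which this sketch does not attempt.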

I can potentially help with this in a week, too.

cheers,
--titus
-- 
C. Titus Brown, ctb at msu.edu