[khmer] Partitioning based on abundance using an already-loaded hash

Fields, Christopher J cjfields at illinois.edu
Fri Aug 8 05:41:51 PDT 2014


> On Aug 8, 2014, at 5:34 AM, "C. Titus Brown" <ctb at msu.edu> wrote:
> 
>> On Thu, Aug 07, 2014 at 05:55:19PM +0000, Fields, Christopher J wrote:
>> We have a large plant genome project in which we used khmer to perform k-mer abundance counting and found three maxima in the k-mer spectrum (~60x, 120x, 180x), probably representing large-scale genome duplication events or allopolyploidy.
>> 
>> In short, we would like to partition the read data based on abundance to assess how many paired-end and mate-pair reads are retained in each abundance peak: partly to see whether we can assemble each partition more efficiently, and partly to gauge how mosaic the genome structure is (the idea being that fewer retained PE/MP reads would indicate more problems).
>> 
>> We found bin-reads-by-abundance.py in the sandbox, but it seems to rebuild the hash from scratch; is there anything that will do the same with an already-generated hash?  We could probably hack something together based on this script, but I was curious whether something else already exists.
> 
> Hi Chris,
> 
> I don't think there's anything, sorry!
> 
> I would very much suggest avoiding k-mer abundance spectra and instead
> using read coverage spectra (e.g. using the median k-mer abundance within
> each read).  The script scripts/count-median.py will calculate this for
> each sequence, for example; you'd just need to modify that script to
> output sequences based on your chosen cutoffs.
> 
> I can potentially help with this in a week, too.
> 
> cheers,
> --titus
> -- 
> C. Titus Brown, ctb at msu.edu

I can have a look at count-median.py as a basis; I should be able to get it working, so no worries.  Using the median cutoff makes perfect sense as well; thanks for pointing that out!
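
For concreteness, here is a rough, untested sketch of what I have in mind, loosely based on count-median.py.  It assumes khmer's load_counting_hash/get_median_count plus screed for parsing, and the cutoffs and output names below are just placeholders between our three peaks:

    #!/usr/bin/env python
    # Rough, untested sketch: bin reads by median k-mer coverage using an
    # already-saved counting hash, loosely based on scripts/count-median.py.
    # Assumes khmer's load_counting_hash/get_median_count and screed for
    # parsing; cutoffs and output filenames are placeholders.
    import sys
    import khmer
    import screed

    CUTOFFS = [90, 150]  # placeholder boundaries between the ~60x/120x/180x peaks

    def main(hash_file, reads_file):
        htable = khmer.load_counting_hash(hash_file)  # reuse the existing hash
        ksize = htable.ksize()
        outfps = [open('bin%d.fa' % i, 'w') for i in range(len(CUTOFFS) + 1)]

        for record in screed.open(reads_file):
            seq = record.sequence.upper()
            if len(seq) < ksize or 'N' in seq:
                continue  # skip short/ambiguous reads; real handling may differ
            med, _, _ = htable.get_median_count(seq)
            idx = sum(1 for c in CUTOFFS if med >= c)  # which coverage bin?
            outfps[idx].write('>%s\n%s\n' % (record.name, record.sequence))

        for fp in outfps:
            fp.close()

    if __name__ == '__main__':
        main(sys.argv[1], sys.argv[2])

Keeping PE/MP pairs together (e.g. binning by a pair's combined median) would still need to be handled separately, and that's really the part we care about for assessing retention.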

Chris

