[khmer] use of filter-abund.py ... how's the cutoff applied?

C. Titus Brown ctb at msu.edu
Thu Aug 15 13:28:53 PDT 2013


On Thu, Aug 15, 2013 at 01:04:44PM -0700, Joseph Fass wrote:
> Hi khmer-ers,
> 
> I've got a data set that's extremely high coverage (>20,000 base coverage)
> for sequences I want, and also quite high (>2,000) for sequences that are
> problematic for a final assembly that I'm attempting. What I *think* I want
> is more of a band-pass filter from normalize-by-median.py, but, failing
> that, I thought I'd try to use filter-abund.py. So I ran it on the raw read
> set (*before* a first pass with normalize-by-median.py), with a cutoff
> (-C) of 6200, which I thought would have retained the many reads that
> (should) have k-mers with coverages above 20,000 across most or all of the
> read ... but instead, 100% of the reads were discarded.
> 
> What am I missing?

Hah, took me a second to figure this out :)

We only count up to 255 in khmer, by default; the counters are one byte
each.  So no k-mer can ever reach a cutoff of 6200, and filter-abund ends
up trimming every read down to nothing.
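
You can see this if you look at a k-mer abundance histogram of your
data: everything above 255 gets lumped into the top bin.  Roughly
(untested; the k-mer size, table size, and file names are placeholders
you'd tune for your data):

  # count k-mers in the raw reads
  load-into-counting.py -k 20 -x 4e9 raw.ct reads.fq

  # histogram of k-mer abundances; with one-byte counters nothing can
  # go past 255, so the 20,000x and 2,000x k-mers all pile up in the
  # 255 bin
  abundance-dist.py raw.ct reads.fq reads.dist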

So, you could subset your data to ~20x base coverage (just discard
99.9% of it) and then do filter-abund to get rid of the problematic
stuff, which would then be at a coverage of ~2.  Might that work?
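
Something like this, maybe (untested, off the top of my head; seqtk is
just one way to subsample, and the -k/-x values are placeholders):

  # keep ~0.1% of the reads, so ~20,000x drops to ~20x and the
  # problematic ~2,000x stuff drops to ~2x
  seqtk sample -s 100 reads.fq 0.001 > reads.sub.fq

  # count k-mers in the subsampled reads
  load-into-counting.py -k 20 -x 4e9 sub.ct reads.sub.fq

  # trim/discard reads whose k-mers fall below the scaled-down cutoff
  # (your 6200, divided by the same factor of 1000, is ~6); output
  # should end up in reads.sub.fq.abundfilt
  filter-abund.py -C 6 sub.ct reads.sub.fq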

cheers,
--titus
-- 
C. Titus Brown, ctb at msu.edu



