[khmer] khmer stripped header information from RNA-seq reads, rendering them unusable

Erich Marquard Schwarz ems394 at cornell.edu
Thu Jul 17 13:42:18 PDT 2014


On Jul 17, 2014, at 4:21 PM, C. Titus Brown <ctb at msu.edu> wrote:

> For now, we've settled on the following set of advice:
> https://khmer-protocols.readthedocs.org/en/latest/mrnaseq/index.html
> 
> Note the use of scripts in
> https://khmer-protocols.readthedocs.org/en/latest/mrnaseq/1-quality.html

    Fair enough.

    It might not be a bad idea to note, explicitly in the docs, that interleave-reads.py and extract-paired-reads.py have the side effect of appending '/1' and '/2' to read names, and that this behavior is *necessary* in order to avoid problems down the road with khmer stripping away the now-standard Illumina spaced suffixes (" 1" and " 2").  Although I am sure that this seems obvious to people who've been developing khmer for, literally, years, it is in fact *not* obvious, and it's a nontrivial detail.  So adding that warning to the docs might be time well spent.


> We are nearing a point where we can fix this behavior in khmer itself, but
> in the past the parsing code has been a major pain point in terms of new
> bugs, so we've avoided making too many changes there...

    Sure, that makes sense.  Not a huge deal on my end.  I tend to retro-suffix my Illumina reads anyway, since doing so avoids problems with programs like bowtie that were developed in the paleo Illumina suffix era.  I just made the mistake of trying to be a young hipster this time.
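
    For anyone wondering what I mean by retro-suffixing, here is a minimal sketch of the idea.  This is not khmer code, and the names retro_suffix and retro_suffix_fastq are just made up for illustration.  It assumes plain four-line FASTQ records and headers of the newer form "@name 1:N:0:INDEX"; anything that doesn't match that pattern is passed through untouched.

```python
import re

# Assumed header shape: "@<name> <read>:<filtered Y/N>:<control bits>:<index>"
# (the newer spaced Illumina style). Headers of any other shape are left alone.
_SPACED_HEADER = re.compile(r"^(@\S+)\s+([12]):[YN]:\d+:\S*$")


def retro_suffix(header):
    """Return the header with an old-style '/1' or '/2' suffix.

    The spaced comment field is dropped; headers that do not match the
    assumed pattern are returned unchanged.
    """
    match = _SPACED_HEADER.match(header)
    if match is None:
        return header
    return "{}/{}".format(match.group(1), match.group(2))


def retro_suffix_fastq(lines):
    """Yield FASTQ lines, rewriting only the header line of each record.

    Assumes strict four-line records, so the header is every 4th line
    starting from the first. Trailing newlines are stripped.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield retro_suffix(line) if i % 4 == 0 else line
```

    Picking out headers by line position rather than by a leading '@' is deliberate, since quality strings can also start with '@'.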


--Erich




