[khmer] khmer stripped header information from RNA-seq reads, rendering them unusable
Erich Marquard Schwarz
ems394 at cornell.edu
Thu Jul 17 13:42:18 PDT 2014
On Jul 17, 2014, at 4:21 PM, C. Titus Brown <ctb at msu.edu> wrote:
> For now, we've settled on the following set of advice:
> https://khmer-protocols.readthedocs.org/en/latest/mrnaseq/index.html
>
> Note the use of scripts in
> https://khmer-protocols.readthedocs.org/en/latest/mrnaseq/1-quality.html
Fair enough.
It might not be a bad idea to note, explicitly in the docs, that interleave-reads.py and extract-paired-reads.py have the side effect of appending '/1' and '/2' to reads, and that this behavior is *necessary* in order to avoid problems down the road with khmer stripping away the now-standard Illumina format of spaced suffixes (" 1" and " 2"). Although I am sure that this seems obvious behavior to people who've been developing khmer for, literally, years, it is *not* obvious behavior in fact, and it's a nontrivial detail. So adding that warning to the docs might be time well spent.
> We are nearing a point where we can fix this behavior in khmer itself, but
> in the past the parsing code has been a major pain point in terms of new
> bugs, so we've avoided making too many changes there...
Sure, that makes sense. Not a huge deal on my end. I have been tending to retro-suffix my Illumina reads anyway, since doing so avoids problems with programs like bowtie that were developed in the paleo Illumina suffix era. I just made the mistake of trying to be a young hipster this time.
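For anyone who wants to do the same retro-suffixing by hand, here is a minimal sketch in Python. It is not khmer's own code and not what interleave-reads.py does internally; the function names are hypothetical, and it assumes strict four-line FASTQ records whose headers look like "@<read id> <1|2>:<rest>" (the spaced Illumina style). Headers that do not match that pattern are passed through unchanged.

```python
import re

# Assumption: spaced headers look like "@<read id> 1:N:0:INDEX" or "@<read id> 2".
_SPACED_HEADER = re.compile(r"^(@\S+)\s+([12])(?::\S*)?$")


def retro_suffix(header):
    """Turn '@id 1:N:0:ACGT' into '@id/1'; leave other headers as they are."""
    header = header.rstrip("\n")
    match = _SPACED_HEADER.match(header)
    if match is None:
        return header
    return "{}/{}".format(match.group(1), match.group(2))


def retro_suffix_fastq(lines):
    """Rewrite only the header line (every 4th line) of a FASTQ line stream."""
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        # Quality strings may also begin with '@', so rely on position, not content.
        yield retro_suffix(line) if index % 4 == 0 else line
```

Note that the text after the read number (filter flag, index sequence) is discarded, so keep the original files if you need that information later.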
--Erich