[khmer] khmer stripped header information from RNA-seq reads, rendering them unusable

C. Titus Brown ctb at msu.edu
Fri Jul 18 04:58:39 PDT 2014


On Thu, Jul 17, 2014 at 08:42:18PM +0000, Erich Marquard Schwarz wrote:
> On Jul 17, 2014, at 4:21 PM, C. Titus Brown <ctb at msu.edu> wrote:
> 
> > For now, we've settled on the following set of advice:
> > https://khmer-protocols.readthedocs.org/en/latest/mrnaseq/index.html
> > 
> > Note the use of scripts in
> > https://khmer-protocols.readthedocs.org/en/latest/mrnaseq/1-quality.html
> 
>     Fair enough.
> 
>     It might not be a bad idea to note, explicitly in the docs, that interleave-reads.py and extract-paired-reads.py have the side effect of appending '/1' and '/2' to reads, and that this behavior is *necessary* in order to avoid problems down the road with khmer stripping away the now-standard Illumina format of spaced suffixes (" 1" and " 2").  Although I am sure that this seems obvious behavior to people who've been developing khmer for, literally, years, it is *not* obvious behavior in fact, and it's a nontrivial detail.  So adding that warning to the docs might be time well spent.

+1
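
For anyone who wants to retro-suffix reads by hand in the meantime, the idea can be sketched roughly as below. This is only an illustration, not what interleave-reads.py or extract-paired-reads.py do internally: the function name retro_suffix is made up, and it assumes read names follow the newer Illumina layout of "<id> <read>:<filter>:<control>:<index>". Names that already end in /1 or /2, or that do not match that layout, are returned unchanged.

```python
import re

# Assumed layout of a newer-style Illumina read name:
#   "<id> <read>:<filter>:<control>:<index>", e.g. "... 1:N:0:ATCACG"
_SPACED = re.compile(r"^(\S+) ([12]):[YN]:\d+:\S*$")


def retro_suffix(name):
    """Return `name` with an old-style '/1' or '/2' suffix (hypothetical helper).

    The comment field after the space is dropped, so the read number
    survives in tools that cut the name at the first whitespace.
    """
    if name.endswith(("/1", "/2")):
        return name  # already old-style; leave alone
    m = _SPACED.match(name)
    if m is None:
        return name  # unrecognized layout; do not guess
    return "{}/{}".format(m.group(1), m.group(2))


if __name__ == "__main__":
    print(retro_suffix("@HWI-ST:1:FC:1:1101:1234:5678 1:N:0:ATCACG"))
```

Note that this throws away the filter flag and index sequence, so keep the original files around if you need those fields later.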

> > We are nearing a point where we can fix this behavior in khmer itself, but
> > in the past the parsing code has been a major pain point in terms of new
> > bugs, so we've avoided making too many changes there...
> 
>     Sure, that makes sense.  Not a huge deal on my end.  I have been tending to retro-suffix my Illumina reads anyway, since doing so avoids problems with programs like bowtie that were developed in the paleo Illumina suffix era.  I just made the mistake of trying to be a young hipster this time.

+1

--titus


