[khmer] khmer stripped header information from RNA-seq reads, rendering them unusable

C. Titus Brown ctb at msu.edu
Thu Jul 17 13:21:11 PDT 2014


On Thu, Jul 17, 2014 at 06:36:40PM +0000, Erich Marquard Schwarz wrote:
> On Jul 17, 2014, at 2:25 PM, Philipp Schiffer <philipp.schiffer at gmail.com> wrote:
> 
> > you can indeed do whatever you like there. However, as I tried to indicate, it might really make sense to go with -1, -2 or /1, /2.
> > My guess is that a lot of scripts could struggle with the "#" you are using.
> 
>     Any script that can handle older-format Illumina reads will do fine, which is all of mine, along with many standard programs (e.g., older bowtie works fine with the older format, because that older format was the standard when bowtie was first designed!).  So, you may be right in many cases, but in this case I don't really need to worry about using the older Illumina format.  As with all things Unix, it is a matter of *which* pesky details are going to be lethal in a given context.

Erich,

sorry about the mess; we struggle constantly with this kind of question.

For now, we've settled on the following set of advice:

https://khmer-protocols.readthedocs.org/en/latest/mrnaseq/index.html

Note the use of scripts in

https://khmer-protocols.readthedocs.org/en/latest/mrnaseq/1-quality.html

particularly interleave-reads.py and extract-paired-reads.py.  These
will append /1 and /2 to the ends of your read names, and thereafter
everything else will work fine.
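
To illustrate the naming convention only, here is a minimal sketch of
interleaving two FASTQ files while tagging read names with /1 and /2.
This is not the code in interleave-reads.py; the function names
(read_fastq, tag_name, interleave) are hypothetical, and it assumes
plain four-line FASTQ records and that anything after the first
whitespace in the header can be dropped.

```python
import io
from itertools import zip_longest


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or a trailing blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def tag_name(name, mate):
    """Return the read name ending in /1 or /2 (assumption: description is dropped)."""
    parts = name.split()
    base = parts[0] if parts else ""
    # Avoid doubling the suffix if the name is already tagged.
    if base.endswith(("/1", "/2")):
        base = base[:-2]
    return "%s/%d" % (base, mate)


def interleave(left, right, out):
    """Write records from left and right alternately, tagged /1 and /2."""
    for r1, r2 in zip_longest(read_fastq(left), read_fastq(right)):
        if r1 is None or r2 is None:
            raise ValueError("input files have different numbers of reads")
        for mate, (name, seq, qual) in enumerate((r1, r2), start=1):
            out.write("@%s\n%s\n+\n%s\n" % (tag_name(name, mate), seq, qual))


if __name__ == "__main__":
    # Tiny in-memory example with newer-style Illumina headers.
    left = io.StringIO("@read1 1:N:0:ACGT\nACGT\n+\nIIII\n")
    right = io.StringIO("@read1 2:N:0:ACGT\nTTTT\n+\nJJJJ\n")
    out = io.StringIO()
    interleave(left, right, out)
    print(out.getvalue(), end="")
```

Note that dropping the description field loses the index and filter
flags; if you need those, keep a copy of the raw reads.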

We are nearing a point where we can fix this behavior in khmer itself, but
in the past the parsing code has been a major pain point in terms of new
bugs, so we've avoided making too many changes there...

cheers,
--titus

> 
> 
> > Meanwhile it is also possible to just "repair" the reads in the .keep file by comparison with the raw reads file where headers have been fixed. Might save some time....
> 
>     Ugh.  I recognize that you are correct, and in some circumstances I would do that, but I think trying to fix a munged file is inherently more error-prone than just making a file that will be bullet-proof.  Again, you're certainly not wrong, but it's a question of what particular gotchas one is trying to steer clear of.
> 
>     I recently told a bioinformatics class in Yerevan, Armenia: "A great deal of bioinformatics consists of converting data from one file format to another, rather than actually doing computations on the data."  Sad, but true.
> 
> 
> > Good luck
> 
>     Thanks!
> 
> 
> --Erich

-- 
C. Titus Brown, ctb at msu.edu


