[khmer] khmer stripped header information from RNA-seq reads, rendering them unusable

Thu Jul 17 09:19:30 PDT 2014

Hi all,

I used khmer to begin normalizing RNA-seq data with this command:

    normalize-by-median.py -k 20 -C 20 -x 2e9 -N 4 --savehash Csp1_rna_2014.07.16.filt.jumbled.kh Csp1_rna_2014.07.16.filt.jumbled.fa ;

which produced Csp1_rna_2014.07.16.filt.jumbled.fa.keep.

Unfortunately, I was not aware that khmer has the nasty side effect of stripping header information.  Here are two header texts -- the first from Csp1_rna_2014.07.16.filt.jumbled.fa, the second from its khmer product Csp1_rna_2014.07.16.filt.jumbled.fa.keep:

    >DHKW5DQ1:285:D1T8EACXX:7:1101:1397:2177 1:N:0:TATGTGGC

    >DHKW5DQ1:285:D1T8EACXX:7:1101:1397:2177

The first header line has paired-end information using Illumina's new format (with trailing ' 1' and ' 2' -- which I agree is less robust than the old-style '#1' and '#2' suffixes that Illumina used to use, but Illumina is the 800-pound gorilla here, and we are its mere servant chimps).

That header-stripping 'feature' of khmer totally trashed my later work on the data.  I will have to retroname the reads (give them '#1' and '#2' old-style suffixes) so that I can get khmer to work with them without wrecking their usability for later re-sorting and subsequent uses (in this case, genome RNA-scaffolding).
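
For anyone hitting the same problem, the renaming could be sketched roughly as below. This is only a sketch under assumptions: the function names (to_old_style, rename_fasta) are made up for illustration, the regular expression assumes new-style headers shaped like the example above, and the trailing filter/index fields are discarded in favor of a bare '#1' or '#2' suffix.

```python
import re

# Assumed shape of a new-style header after the '>':
#   <read name> <read number>:<Y or N>:<control number>[:<index>]
_NEW_STYLE = re.compile(r"^(?P<name>\S+)\s+(?P<read>[12]):[YN]:\d+(?::\S*)?$")


def to_old_style(header, sep="#"):
    """Turn '>NAME 1:N:0:INDEX' into '>NAME#1'.

    Headers that do not match the assumed new-style pattern are
    returned unchanged.
    """
    if not header.startswith(">"):
        return header
    match = _NEW_STYLE.match(header[1:])
    if match is None:
        return header
    return ">" + match.group("name") + sep + match.group("read")


def rename_fasta(lines, sep="#"):
    """Yield FASTA lines with header lines rewritten; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield to_old_style(line, sep)
        else:
            yield line
```

Running rename_fasta over the lines of the original .fa file and writing the results to a new file, before normalization, would give read names with no whitespace in them.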

Lost time, roughly one day.

The version I have of khmer was installed on 9/4/2012.  If this side-effect has been fixed since then, that's good news; if not, then it'd be good if it *were* fixed.

Thank you,

--Erich