[khmer] khmer stripped header information from RNA-seq reads, rendering them unusable

Philipp Schiffer philipp.schiffer at gmail.com
Thu Jul 17 11:25:43 PDT 2014


Dear Erich,

You can indeed do whatever you like there. However, as I tried to 
indicate, it might really make sense to go with -1, -2 or /1, /2.
My guess is that a lot of scripts could struggle with the "#" you are using.
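
A minimal sketch of such a renaming, in case it is useful. It assumes the headers look like the new-style Illumina example quoted below (read ID, whitespace, then a "1:N:0:INDEX" comment field); the function name to_slash_suffix is made up for illustration, and headers that do not match are returned untouched:

```python
import re

# New-style Illumina comment field: "<read 1|2>:<filtered Y|N>:<control>:<index>".
# Assumption: the header is passed without the leading ">" or "@".
_NEW_STYLE = re.compile(r"^(\S+)\s+([12]):[YN]:\d+:\S*$")


def to_slash_suffix(header):
    """Turn 'ID 1:N:0:INDEX' into 'ID/1'; leave other headers unchanged."""
    match = _NEW_STYLE.match(header)
    if match is None:
        return header
    read_id, mate = match.groups()
    return f"{read_id}/{mate}"
```

Because the suffix is glued to the read ID without whitespace, a tool that only keeps the first word of the header should still carry the mate number along.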
In the meantime, it is also possible to just "repair" the reads in the 
.keep file by comparing them with the raw reads file in which the headers 
have been fixed. That might save some time.
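
A rough sketch of what I mean by "repair", under a few assumptions: both files are FASTA, khmer left the sequences themselves unchanged, and the read ID contains no space, "#" or "/". Since both mates share the same stripped ID, the lookup uses ID plus sequence; mates with identical sequences would stay ambiguous. All function names here are made up for illustration:

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def base_id(header):
    """Read ID without pair suffix: text before the first space, '#' or '/'."""
    return re.split(r"[\s#/]", header, maxsplit=1)[0]


def repair_headers(raw_records, keep_records):
    """Yield the .keep records with full headers looked up in the raw records."""
    full_headers = {}
    for header, seq in raw_records:
        full_headers[(base_id(header), seq)] = header
    for header, seq in keep_records:
        # Records without a match keep their stripped header.
        yield full_headers.get((base_id(header), seq), header), seq
```

With two open file handles this would be called as repair_headers(read_fasta(raw_file), read_fasta(keep_file)). Note that the dictionary holds every raw read in memory, so for a large data set one would first collect the keys from the .keep file and store only those.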

Good luck

Philipp

> Erich Marquard Schwarz <mailto:ems394 at cornell.edu>
> 17 July 2014 20:12
>
> For some value of "supposed", but it is not optimal. If it could be 
> corrected in a future version of khmer, that would be good indeed!
>
> I am running a Perl script to retrofit the FastQ headers with '#0/1' 
> and '#0/2' (which was Illumina's older style of distinguishing 
> paired-end reads).
>
>
> --Erich
>
>
>
> _______________________________________________
> khmer mailing list
> khmer at lists.idyll.org
> http://lists.idyll.org/listinfo/khmer
> Erich Marquard Schwarz <mailto:ems394 at cornell.edu>
> 17 July 2014 18:19
> Hi all,
>
> I used khmer to begin normalizing RNA-seq data with this command:
>
> normalize-by-median.py -k 20 -C 20 -x 2e9 -N 4 --savehash 
> Csp1_rna_2014.07.16.filt.jumbled.kh Csp1_rna_2014.07.16.filt.jumbled.fa ;
>
> which produced Csp1_rna_2014.07.16.filt.jumbled.fa.keep.
>
> Unfortunately, I was not aware that khmer has the nasty side effect of 
> stripping header information. Here are two header texts -- the first 
> from Csp1_rna_2014.07.16.filt.jumbled.fa, the second from its khmer 
> product Csp1_rna_2014.07.16.filt.jumbled.fa.keep:
>
> >DHKW5DQ1:285:D1T8EACXX:7:1101:1397:2177 1:N:0:TATGTGGC
>
> >DHKW5DQ1:285:D1T8EACXX:7:1101:1397:2177
>
> The first header line has paired-end information using Illumina's new 
> format (with trailing ' 1' and ' 2' -- which I agree is less robust 
> than the old-style '#1' and '#2' suffixes that Illumina used to use, 
> but Illumina is the 800-pound gorilla here, and we are its mere 
> servant chimps).
>
> That header-stripping 'feature' of khmer totally trashed my later work 
> on the data. I will have to retroname the reads (give them '#1' and 
> '#2' old-style suffixes) so that I can get khmer to work with them 
> without wrecking their usability for later re-sorting and subsequent 
> uses (in this case, genome RNA-scaffolding).
>
> Lost time, roughly one day.
>
> The version I have of khmer was installed on 9/4/2012. If this 
> side-effect has been fixed since then, that's good news; if not, then 
> it'd be good if it *were* fixed.
>
> Thank you,
>
>
> --Erich