[ngs-2012] Fwd: coping with reads in two libraries

Wed Jun 13 13:06:47 PDT 2012

Neal asked me and Erich how to deal with assembling multiple libraries with different insert sizes; here's Erich's very detailed response :)

This is kind of the typical complexity for an actual assembly effort, FYI.

cheers,
--titus

Begin forwarded message:

> From: Erich Schwarz <schwarz at tenaya.caltech.edu>
> Subject: coping with reads in two libraries
> Date: June 13, 2012 10:58:39 AM EDT
> To: Neal Platt <neal.platt at gmail.com>
> Cc: Titus Brown <ctb at msu.edu>
> 
> Hi Neal,
> 
>    Here's how I think I'd deal with the problem you described.  
> 
>    I've attached an archive, schwarz_scripts_13jun2012.tar.gz, 
> which contains the Perl scripts invoked below.
> 
> 1. Take each set of R1 and R2 individual reads and merge them into a 
> single interleaved read file, using the shuffleSequences_fastq.pl 
> Perl script provided in velvet:
> 
>    shuffleSequences_fastq.pl small_insert_R1.fq    small_insert_R2.fq    small_insert_R12.fq ;
>    shuffleSequences_fastq.pl jumping_library_R1.fq jumping_library_R2.fq jumping_library_R12.fq ;
> 
> 2. If your jumping library reads are facing outward, make a version 
> of the FASTQ file in which they're actually facing inward!!
> 
>    mv -i jumping_library_R12.fq jumping_library_R12.outward.fq ;
>    revcomp_fastq.pl -f jumping_library_R12.outward.fq > jumping_library_R12.fq ;
>    rm jumping_library_R12.outward.fq ;
> 
> 3. Your reads are probably from the newer Illumina software.  This 
> new software has an awful new naming scheme.  To make everything 
> work better, get all the FASTQ reads renamed from the newer style of 
> R1 and R2 ends, as follows:
> 
>    mv -i small_insert_R12.fq    small_insert_R12.awful.fq ;
>    mv -i jumping_library_R12.fq jumping_library_R12.awful.fq ;
> 
>    retroname_fastq_reads.pl small_insert_R12.awful.fq    > small_insert_R12.retro.fq ;
>    retroname_fastq_reads.pl jumping_library_R12.awful.fq > jumping_library_R12.retro.fq ;
> 
> [Note that the newer Illumina read-naming scheme is like this:
> 
> @HWI-ST0787:106:D0549ACXX:7:1101:1479:2152 1:Y:0:GGCTAC
> ATTNAACATTCTTAGTGGNTNNNNNTTACATCTNNNNNNNNNNNNNNNNCGTAAATGTCTCACTGTTTTTCAAACCCCACTTCTAGAAGGGGGACCCCCTC
> +
> <<<#2@@@@@@@@@@@@@#3#####21@@@???################--;=?????@@@@@@@@??????>??=<<<<<========<<::6:<<<9::
> @HWI-ST0787:106:D0549ACXX:7:1101:1479:2152 2:Y:0:GGCTAC
> AAGCCTATTTCCTCAATTCTCGCCGCACTTCATTTTCTATCGCCTGGAGTGTGAAGCACCATTTAGACCAGGACGACAACCCCGATCACTCCTACTGTTTG
> +
> CCCFFFFFHHHHHJJJJJIJJJJJJJJJJJJIJJJJJJIJJJJJIJJJJIBCGIJJJJJHHHHHHFFFFFEDDDDDDDDDDDDDDDDDDDDDDDDDDEDDD
> 
> [The older Illumina read-naming scheme is like this:
> 
> @HWI-ST0787:106:D0549ACXX:7:1101:1585:2216#0/1
> ATTTTGCTCGGACCTTGCCTTTAACAGAATTATTCACATTATATTCATCACATCTATTTCAAACATGGGTTCAAAAATTCTGGATAGGATGGTATTCGGT
> +
> CCCFFFFFGHHGHJJJJIJJJJIJIJJJIJJIJJJIJJIJIJJJJEIIIGIIJIIJJJJJJJIJIIIIJDGHGEHFCHFFFFFFDEEEEEDDCECEDDDB
> @HWI-ST0787:106:D0549ACXX:7:1101:1585:2216#0/2
> TGAATGTAGCTGAGTTCAGACGACAGCTGGATGATGGTTTATTGACGTGGGTACACACAATGGTACAGTTAAGAATGAAACCTACTCCCTCATGCGAGTA
> +
> @CCFFDDFHHHHHIIHGGIGGIEGJJJJJJJIIIIIJFFHGIIEDHHHGI>FFHHGIJJJGHG=CEEFDDFFFACCCCEEDDDDDDDDDDDBCCCDDD>A
> 
> and is *much* easier to manipulate, because it puts the R1 and R2 
> labels as '#0/1' and '#0/2' suffixes inside the obvious name of the 
> read, where they belong.]
> 
> 4. Your reads are probably not selected for basic quality.  In other 
> words, they probably have not been subjected to chastity filtering, 
> minimal requirement of a quality score of >= 3, or the absense of N 
> residues.  You can (and probably should) impose all of these quality 
> filters by running the following step.  Note that, by doing so, you 
> will jumble the order of the paired reads, which will need to be 
> re-sorted into paired and unpaired reads later on:
> 
>    quality_trim_fastq.pl -q 33 -n -i small_insert_R12.retro.fq    -o small_insert_R12.retro.jumbled.fq ;
>    quality_trim_fastq.pl -q 33 -n -i jumping_library_R12.retro.fq -o jumping_library_R12.retro.jumbled.fq ;
> 
> To get a fuller understanding of the quality_trim_fastq.pl options, 
> type "quality_trim_fastq.pl -h":
> 
> usage: quality_trim_fastq.pl
> 
>       -i|--input      <in>       input file name (fastq); opt. stream input via '-'
>       -o|--output     <out>      output file name (fastq), required
>       -q|--quality    <integer>  quality score offset for FASTQ: in Illumina, was 64 (for Phred+64) for 2009-2011; shifted to 33 (for Phred+33) by Oct. 2011; required
> 
>       [at least one of the following arguments is required]
>       -e|--end_trim   <integer>  1. Before anything else, trim this many residues off end (typically, 1 last unreliable residue)
>       -u|--uppermost  <integer>  2. Maximum residue length - trim all reads to this length, immediately after --end_trim
>       -n|--no_Ns                 3. Trim all *starting* N residues, and then all 3' residues to last non-N
>       -t|--threshold  <integer>  4. Quality threshold - trim all 3'-ward residues below this score
>       -m|--minimum    <integer>  5. Minimum residue length - after all other filters, censor any reads trimmed below this length
> 
>       [the following option is allowed, but not recommended]
>       -c|--chastity              6. Override the automatic filtering out of any read with 1:Y:\d or 2:Y:\d in the header
> 
>       [print this message:]
>       -h|--help
> 
> Note that quality_trim_fastq.pl deletes non-chastity-qualified reads 
> by *default*.  Also note that in this instance, I'm assuming you 
> have the newer-style Illumina reads with a quality score offset of 
> 33 (rather then 64).
> 
> 5. Copy all the retro-named, basic quality-filtered FASTQ reads into 
> a single monolithic FASTA file which can be easily parsed by khmer.
> 
>    fastq2fa_simple.pl small_insert_R12.retro.jumbled.fq    >  monolith.fa ;
> 
>    # Note '>>', not '>' -- you want to append reads to an existing file:
>    fastq2fa_simple.pl jumping_library_R12.retro.jumbled.fq >> monolith.fa ;
> 
> 6. Carry out digital normalization on the FASTA file.  Get a new 
> FASTA file which has only the selected reads.
> 
>    /PATH/khmer/scripts/normalize-by-median.py -k 20 -C 20 -x 1e10 -N 4 --savehash monolith.kh monolith.fa ;
> 
>    /PATH/khmer/scripts/filter-abund.py -C 2 monolith.kh monolith.fa.keep ;
> 
>    /PATH/khmer/scripts/normalize-by-median.py -k 20 -C 5 -x 1e10 -N 4 monolith.fa.keep.abundfilt ;
> 
> 7. Deal with the silly and uninformative hard-coded output names from khmer:
> 
>    mv -i monolith.fa.keep                monolith.khmer_C20.fa ;
>    mv -i monolith.fa.keep.abundfilt      monolith.khmer_C20-2.fa ;
>    mv -i monolith.fa.keep.abundfilt.keep monolith.khmer_C20-2-5.fa ;
> 
> 8. From the new, khmer-approved FASTA file, extract a (loooong) list 
> of read names.
> 
>    extract_fasta_namelist.pl monolith.khmer_C20-2-5.fa > khmer_approved_C20-2-5_reads.txt ;
> 
> 9. Go back to your original retro-named FASTQ files.  For each file, 
> use the approved read name-list to pull out only those reads which 
> were approved, and export them to a separate FASTQ file.  This 
> preserves the information of which reads belong to which sized 
> library.
> 
>    extract_fastq_subset.pl -l khmer_approved_C20-2-5_reads.txt -f small_insert_R12.retro.jumbled.fq    > small_insert_R12.khmer_C20-2-5.retro.jumbled.fq ;
> 
>    extract_fastq_subset.pl -l khmer_approved_C20-2-5_reads.txt -f jumping_library_R12.retro.jumbled.fq > jumping_library_R12.khmer_C20-2-5.retro.jumbled.fq ;
> 
> Note that, in the process of doing this, you *will* jumble the reads 
> if you haven't already, and will need to re-sort them into paired- 
> and single-end files.
> 
> 10. So, do just that, like so:
> 
>    paired_vs_unp_fastq.or.a.pl --r1 "#0\/1" --r2 "#0\/2" -i small_insert_R12.khmer_C20-2-5.retro.jumbled.fq -u small_insert_R12.khmer_C20-2-5.retro.se.fq -p small_insert_R12.khmer_C20-2-5.retro.pe.fq ;
> 
>    paired_vs_unp_fastq.or.a.pl --r1 "#0\/1" --r2 "#0\/2" -i jumping_library_R12.khmer_C20-2-5.retro.jumbled.fq -u jumping_library_R12.khmer_C20-2-5.retro.se.fq -p jumping_library_R12.khmer_C20-2-5.retro.pe.fq ;
> 
> 11. Finally, if you really want to have individual R1 and R2 
> reads in separate files, split them out again:
> 
>    split_paired_fastq.or.a.pl -i small_insert_R12.khmer_C20-2-5.retro.pe.fq --fastq --r1 "#0\/1" --r2 "#0\/2" --o1 small_insert_R12.khmer_C20-2-5.retro.R1.fq --o2 small_insert_R12.khmer_C20-2-5.retro.R2.fq ;
> 
>    split_paired_fastq.or.a.pl -i jumping_library_R12.khmer_C20-2-5.retro.pe.fq --fastq --r1 "#0\/1" --r2 "#0\/2" --o1 jumping_library_R12.khmer_C20-2-5.retro.R1.fq --o2 jumping_library_R12.khmer_C20-2-5.retro.R2.fq ;
> 
> 12. You should now have a set of reads that have been subjected to 
> digital normalization, but are still in distinct library sizes, are 
> properly sorted into paired- and single-end reads, and which are 
> usable for genome assembly.
> 
>    Note that you can get small help messages with the '-h' argument 
> for the following scripts: extract_fastq_subset.pl, 
> paired_vs_unp_fastq.or.a.pl, quality_trim_fastq.pl, and 
> retroname_fastq_reads.pl.
> 
>    Also note that this would probably be considered a clumsy 
> approach by a lot of people -- but these are the tools I've 
> hand-coded, and they do at least do the job.
> 
> 
> --Erich
-------------- next part --------------
A non-text attachment was scrubbed...
Name: schwarz_scripts_13jun2012.tar.gz
Type: application/x-gzip
Size: 9031 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/ngs-2012/attachments/20120613/7a30c3a6/attachment.bin>
-------------- next part --------------