[ngs-2012] Fwd: coping with reads in two libraries
C. Titus Brown
ctb at msu.edu
Wed Jun 13 13:06:47 PDT 2012
Neal asked me and Erich how to deal with assembling multiple libraries with different insert sizes; here's Erich's very detailed response :)
This is kind of the typical complexity for an actual assembly effort, FYI.
cheers,
--titus
Begin forwarded message:
> From: Erich Schwarz <schwarz at tenaya.caltech.edu>
> Subject: coping with reads in two libraries
> Date: June 13, 2012 10:58:39 AM EDT
> To: Neal Platt <neal.platt at gmail.com>
> Cc: Titus Brown <ctb at msu.edu>
>
> Hi Neal,
>
> Here's how I think I'd deal with the problem you described.
>
> I've attached an archive, schwarz_scripts_13jun2012.tar.gz,
> which contains the Perl scripts invoked below.
>
> 1. Take each set of R1 and R2 individual reads and merge them into a
> single interleaved read file, using the shuffleSequences_fastq.pl
> Perl script provided in velvet:
>
> shuffleSequences_fastq.pl small_insert_R1.fq small_insert_R2.fq small_insert_R12.fq ;
> shuffleSequences_fastq.pl jumping_library_R1.fq jumping_library_R2.fq jumping_library_R12.fq ;
>
> 2. If your jumping library reads are facing outward, make a version
> of the FASTQ file in which they're actually facing inward!!
>
> mv -i jumping_library_R12.fq jumping_library_R12.outward.fq ;
> revcomp_fastq.pl -f jumping_library_R12.outward.fq > jumping_library_R12.fq ;
> rm jumping_library_R12.outward.fq ;
>
> 3. Your reads are probably from the newer Illumina software. This
> new software has an awful new naming scheme. To make everything
> work better, get all the FASTQ reads renamed from the newer style of
> R1 and R2 ends, as follows:
>
> mv -i small_insert_R12.fq small_insert_R12.awful.fq ;
> mv -i jumping_library_R12.fq jumping_library_R12.awful.fq ;
>
> retroname_fastq_reads.pl small_insert_R12.awful.fq > small_insert_R12.retro.fq ;
> retroname_fastq_reads.pl jumping_library_R12.awful.fq > jumping_library_R12.retro.fq ;
>
> [Note that the newer Illumina read-naming scheme is like this:
>
> @HWI-ST0787:106:D0549ACXX:7:1101:1479:2152 1:Y:0:GGCTAC
> ATTNAACATTCTTAGTGGNTNNNNNTTACATCTNNNNNNNNNNNNNNNNCGTAAATGTCTCACTGTTTTTCAAACCCCACTTCTAGAAGGGGGACCCCCTC
> +
> <<<#2@@@@@@@@@@@@@#3#####21@@@???################--;=?????@@@@@@@@??????>??=<<<<<========<<::6:<<<9::
> @HWI-ST0787:106:D0549ACXX:7:1101:1479:2152 2:Y:0:GGCTAC
> AAGCCTATTTCCTCAATTCTCGCCGCACTTCATTTTCTATCGCCTGGAGTGTGAAGCACCATTTAGACCAGGACGACAACCCCGATCACTCCTACTGTTTG
> +
> CCCFFFFFHHHHHJJJJJIJJJJJJJJJJJJIJJJJJJIJJJJJIJJJJIBCGIJJJJJHHHHHHFFFFFEDDDDDDDDDDDDDDDDDDDDDDDDDDEDDD
>
> [The older Illumina read-naming scheme is like this:
>
> @HWI-ST0787:106:D0549ACXX:7:1101:1585:2216#0/1
> ATTTTGCTCGGACCTTGCCTTTAACAGAATTATTCACATTATATTCATCACATCTATTTCAAACATGGGTTCAAAAATTCTGGATAGGATGGTATTCGGT
> +
> CCCFFFFFGHHGHJJJJIJJJJIJIJJJIJJIJJJIJJIJIJJJJEIIIGIIJIIJJJJJJJIJIIIIJDGHGEHFCHFFFFFFDEEEEEDDCECEDDDB
> @HWI-ST0787:106:D0549ACXX:7:1101:1585:2216#0/2
> TGAATGTAGCTGAGTTCAGACGACAGCTGGATGATGGTTTATTGACGTGGGTACACACAATGGTACAGTTAAGAATGAAACCTACTCCCTCATGCGAGTA
> +
> @CCFFDDFHHHHHIIHGGIGGIEGJJJJJJJIIIIIJFFHGIIEDHHHGI>FFHHGIJJJGHG=CEEFDDFFFACCCCEEDDDDDDDDDDDBCCCDDD>A
>
> and is *much* easier to manipulate, because it puts the R1 and R2
> labels as '#0/1' and '#0/2' suffixes inside the obvious name of the
> read, where they belong.]
>
> 4. Your reads are probably not selected for basic quality. In other
> words, they probably have not been subjected to chastity filtering,
> minimal requirement of a quality score of >= 3, or the absense of N
> residues. You can (and probably should) impose all of these quality
> filters by running the following step. Note that, by doing so, you
> will jumble the order of the paired reads, which will need to be
> re-sorted into paired and unpaired reads later on:
>
> quality_trim_fastq.pl -q 33 -n -i small_insert_R12.retro.fq -o small_insert_R12.retro.jumbled.fq ;
> quality_trim_fastq.pl -q 33 -n -i jumping_library_R12.retro.fq -o jumping_library_R12.retro.jumbled.fq ;
>
> To get a fuller understanding of the quality_trim_fastq.pl options,
> type "quality_trim_fastq.pl -h":
>
> usage: quality_trim_fastq.pl
>
> -i|--input <in> input file name (fastq); opt. stream input via '-'
> -o|--output <out> output file name (fastq), required
> -q|--quality <integer> quality score offset for FASTQ: in Illumina, was 64 (for Phred+64) for 2009-2011; shifted to 33 (for Phred+33) by Oct. 2011; required
>
> [at least one of the following arguments is required]
> -e|--end_trim <integer> 1. Before anything else, trim this many residues off end (typically, 1 last unreliable residue)
> -u|--uppermost <integer> 2. Maximum residue length - trim all reads to this length, immediately after --end_trim
> -n|--no_Ns 3. Trim all *starting* N residues, and then all 3' residues to last non-N
> -t|--threshold <integer> 4. Quality threshold - trim all 3'-ward residues below this score
> -m|--minimum <integer> 5. Minimum residue length - after all other filters, censor any reads trimmed below this length
>
> [the following option is allowed, but not recommended]
> -c|--chastity 6. Override the automatic filtering out of any read with 1:Y:\d or 2:Y:\d in the header
>
> [print this message:]
> -h|--help
>
> Note that quality_trim_fastq.pl deletes non-chastity-qualified reads
> by *default*. Also note that in this instance, I'm assuming you
> have the newer-style Illumina reads with a quality score offset of
> 33 (rather then 64).
>
> 5. Copy all the retro-named, basic quality-filtered FASTQ reads into
> a single monolithic FASTA file which can be easily parsed by khmer.
>
> fastq2fa_simple.pl small_insert_R12.retro.jumbled.fq > monolith.fa ;
>
> # Note '>>', not '>' -- you want to append reads to an existing file:
> fastq2fa_simple.pl jumping_library_R12.retro.jumbled.fq >> monolith.fa ;
>
> 6. Carry out digital normalization on the FASTA file. Get a new
> FASTA file which has only the selected reads.
>
> /PATH/khmer/scripts/normalize-by-median.py -k 20 -C 20 -x 1e10 -N 4 --savehash monolith.kh monolith.fa ;
>
> /PATH/khmer/scripts/filter-abund.py -C 2 monolith.kh monolith.fa.keep ;
>
> /PATH/khmer/scripts/normalize-by-median.py -k 20 -C 5 -x 1e10 -N 4 monolith.fa.keep.abundfilt ;
>
> 7. Deal with the silly and uninformative hard-coded output names from khmer:
>
> mv -i monolith.fa.keep monolith.khmer_C20.fa ;
> mv -i monolith.fa.keep.abundfilt monolith.khmer_C20-2.fa ;
> mv -i monolith.fa.keep.abundfilt.keep monolith.khmer_C20-2-5.fa ;
>
> 8. From the new, khmer-approved FASTA file, extract a (loooong) list
> of read names.
>
> extract_fasta_namelist.pl monolith.khmer_C20-2-5.fa > khmer_approved_C20-2-5_reads.txt ;
>
> 9. Go back to your original retro-named FASTQ files. For each file,
> use the approved read name-list to pull out only those reads which
> were approved, and export them to a separate FASTQ file. This
> preserves the information of which reads belong to which sized
> library.
>
> extract_fastq_subset.pl -l khmer_approved_C20-2-5_reads.txt -f small_insert_R12.retro.jumbled.fq > small_insert_R12.khmer_C20-2-5.retro.jumbled.fq ;
>
> extract_fastq_subset.pl -l khmer_approved_C20-2-5_reads.txt -f jumping_library_R12.retro.jumbled.fq > jumping_library_R12.khmer_C20-2-5.retro.jumbled.fq ;
>
> Note that, in the process of doing this, you *will* jumble the reads
> if you haven't already, and will need to re-sort them into paired-
> and single-end files.
>
> 10. So, do just that, like so:
>
> paired_vs_unp_fastq.or.a.pl --r1 "#0\/1" --r2 "#0\/2" -i small_insert_R12.khmer_C20-2-5.retro.jumbled.fq -u small_insert_R12.khmer_C20-2-5.retro.se.fq -p small_insert_R12.khmer_C20-2-5.retro.pe.fq ;
>
> paired_vs_unp_fastq.or.a.pl --r1 "#0\/1" --r2 "#0\/2" -i jumping_library_R12.khmer_C20-2-5.retro.jumbled.fq -u jumping_library_R12.khmer_C20-2-5.retro.se.fq -p jumping_library_R12.khmer_C20-2-5.retro.pe.fq ;
>
> 11. Finally, if you really want to have individual R1 and R2
> reads in separate files, split them out again:
>
> split_paired_fastq.or.a.pl -i small_insert_R12.khmer_C20-2-5.retro.pe.fq --fastq --r1 "#0\/1" --r2 "#0\/2" --o1 small_insert_R12.khmer_C20-2-5.retro.R1.fq --o2 small_insert_R12.khmer_C20-2-5.retro.R2.fq ;
>
> split_paired_fastq.or.a.pl -i jumping_library_R12.khmer_C20-2-5.retro.pe.fq --fastq --r1 "#0\/1" --r2 "#0\/2" --o1 jumping_library_R12.khmer_C20-2-5.retro.R1.fq --o2 jumping_library_R12.khmer_C20-2-5.retro.R2.fq ;
>
> 12. You should now have a set of reads that have been subjected to
> digital normalization, but are still in distinct library sizes, are
> properly sorted into paired- and single-end reads, and which are
> usable for genome assembly.
>
> Note that you can get small help messages with the '-h' argument
> for the following scripts: extract_fastq_subset.pl,
> paired_vs_unp_fastq.or.a.pl, quality_trim_fastq.pl, and
> retroname_fastq_reads.pl.
>
> Also note that this would probably be considered a clumsy
> approach by a lot of people -- but these are the tools I've
> hand-coded, and they do at least do the job.
>
>
> --Erich
-------------- next part --------------
A non-text attachment was scrubbed...
Name: schwarz_scripts_13jun2012.tar.gz
Type: application/x-gzip
Size: 9031 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/ngs-2012/attachments/20120613/7a30c3a6/attachment.bin>
-------------- next part --------------
More information about the ngs-2012
mailing list