[ngs] counting

Tue Jun 18 10:36:19 PDT 2013

Will also be on the site later.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.idyll.org/pipermail/ngs-2013/attachments/20130618/c0bbc10d/attachment.htm>
-------------- next part --------------
==========================
Counting reads for RNA-seq
==========================

We now have reads mapped to the reference transcriptome. The next step is counting: how many reads mapped to each transcript? These numbers are what we need to be able to identify differentially expressed genes.

One simple way to do this step is to use bedtools with the BAM file you generated (sorted and indexed) to do the counting. This approach also requires a reference file, and an annotation file containing the coordinates of all the transcripts/exons/whatever features you're interested in. bedtools will tell you how many reads map to each feature.

If you've mapped your reads to an annotated reference genome, you can probably download a pre-computed annotation file. We've done that for the Drosophila genome and placed it into the snapshot you're using.

However, in many cases nobody has already created this annotation file for you. For example, you may be mapping reads to a transcriptome that you assembled yourself (maybe even using the same RNA-seq dataset). In that case, to use bedtools, you'll need to create an annotation file yourself. We've written a simple script that will create a simple BED file that has one entry for each contig in your transcriptome, so that you can use bedtools to do the counting for you (note: this script requires the BioPython package)::

  '''
  Generates a BED file from a fasta file
  The BED file contains one annotation feature for each sequence in the fasta,
   each one spanning the entire sequence
  Can then use this BED file to count the number of reads mapping to each sequence
   in the fasta file with existing tools

  Usage:

  python make_bed_from_fasta.py <sequences.fasta>

  '''

  import sys
  from Bio import SeqIO

  fasta_handle = open(sys.argv[1], "rU") #Open the fasta file specified by the user on the command line

  #Go through all the records in the fasta file
  for record in SeqIO.parse(fasta_handle, "fasta"):
      cur_id = record.id #Name of the current feature
      cur_length = len(record.seq) #Size of the current feature
      print "%s\t%d\t%d" % (cur_id, 0, cur_length) #Output the name, start, and end coordinates to the screen

In this case, we mapped reads to the Drosophila transcriptome with BWA (rather than to the Drosophila genome with tophat). So let's create a BED file and do the counting (as always, make sure you're doing all of this in the right directory!)::

  python make_bed_from_fasta.py dmel-all-transcript-r5.51.fasta > dmel_transcriptome.bed
  multiBamCov -q 30 -p -bams ${bam1} ${bam2} ${bam3} ${bam4} -bed dmel_transcriptome.bed > bwa_transcriptome_counts.txt

In the multiBamCov line, the -q 30 option tells the software to only count reads that have a mapping quality of 30 or better, while the -p option specifies that you only want to use properly mapping read pairs.

If you take a look at the file::

  head bwa_transcriptome_counts.txt

you'll see that it is a readable flat text file, e.g.::

  FBtr0086024	0	2202	0	0	0	0
  FBtr0082362	0	898	0	0	0	0
  FBtr0300409	0	4704	284	458	645	389
  FBtr0308603	0	4307	18	14	26	6
  FBtr0080982	0	1284	1355	1342	2178	1172
  FBtr0305683	0	1463	2	2	4	4
  FBtr0299694	0	635	0	0	0	0
  FBtr0299695	0	587	0	0	0	0
  FBtr0085737	0	2643	0	0	0	0
  FBtr0309194	0	2398	0	0	0	0 

The first column is transcript IDs, and the second and third columns are the start and end positions of each transcript that reads could map to (they all start at 0 and span the entire length of each transcript, because we wanted to count all reads that map to anywhere in the transcript). The following four columns are the read counts for each sample.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: analyze_edgeR_bwa_transcriptome.R
Type: application/octet-stream
Size: 790 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/ngs-2013/attachments/20130618/c0bbc10d/attachment.obj>