[pygr-notify] [pygr commit] r91 - wiki
codesite-noreply at google.com
Tue Aug 12 14:12:45 PDT 2008
Author: ramccreary
Date: Tue Aug 12 14:11:24 2008
New Revision: 91
Added:
wiki/LocatingIntergenicRegionsWithinAGenome.wiki
Log:
Created wiki page through web user interface.
Added: wiki/LocatingIntergenicRegionsWithinAGenome.wiki
==============================================================================
--- (empty file)
+++ wiki/LocatingIntergenicRegionsWithinAGenome.wiki Tue Aug 12 14:11:24
2008
@@ -0,0 +1,105 @@
+#summary pygr provides convenient and simple methods for analyzing genomes
+
+= Introduction =
+
+This code, which uses some of pygr's storage databases (seqdb and
AnnotationDB), demonstrates one way to locate intergenic regions within a
genome and then store and manipulate them using pygr.
+
+
+=A Guide to the Script=
+
+First, once the necessary modules are imported, the yeast genome (the .fna
file) is loaded into a BLAST database through pygr's seqdb module. BlastDB
unpacks the FASTA ID and any other IDs contained in the NCBI header. The ID
is obtained by splitting the key located at the beginning of the genome
sequence. The key looks like this: gi|50593115|ref|NC_001134.7|, but the
desired ID is only NC_001134.7, which is the fourth field when the key is
split on the | character. Next, the class Intergenic is created to describe
each intergenic region. Its attributes are before_gene (the gene preceding
the region), after_gene (the gene following it), start and stop (the integer
coordinates at which the sequence begins and ends), orientation (the strand
on which the region is located), length, and id.
+
+{{{
+import csv
+import string
+from pygr.seqdb import BlastDB
+from pygr import seqdb
+from pygr import cnestedlist
+import cogs2
+
+
+# Load up the genome into a BlastDB
+genome = BlastDB('data/NC_001134.fna')
+# Get the ID from the genome
+for key, val in genome.iteritems():
+    genome_id = key.split('|')[3]
+    break # We only have one (Why isn't there a keys() method?!)
+
+# Create a class for the intergenic region
+class Intergenic(object):
+    def __init__(self, before_gene, after_gene, start, stop, orientation,
+                 id, length):
+        self.before_gene = before_gene
+        self.after_gene = after_gene
+        self.start = start
+        self.stop = stop
+        self.orientation = orientation
+        self.id = id
+        self.length = length
+
+Intergenic_Regions = {}
+
+}}}
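+
+The key-splitting step described above can be tried on its own, without
+pygr installed. This is only a minimal sketch: the helper name
+accession_from_fasta_key is hypothetical (it is not part of pygr), and it
+assumes the key has the gi|number|db|accession| layout shown earlier.
+

```python
def accession_from_fasta_key(key):
    """Return the accession field from an NCBI-style FASTA key.

    Hypothetical helper, not part of pygr. Assumes the layout
    'gi|<number>|<db>|<accession>|'; other header layouts would need
    different handling.
    """
    fields = key.split('|')
    # The accession is expected to be the fourth '|'-separated field.
    if len(fields) < 4 or not fields[3]:
        raise ValueError('unexpected FASTA key layout: %r' % key)
    return fields[3]


genome_id = accession_from_fasta_key('gi|50593115|ref|NC_001134.7|')
print(genome_id)  # NC_001134.7
```

+
+Raising an error on an unexpected layout, rather than silently indexing,
+makes it easier to notice when a different FASTA header format is used.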
+
+Next, the cogs2 module provides several handy classes for iterating
through the regions between known genes, including CogsFileContent. This
class reads the .ptt file and converts the gene coordinates to integers
across the length of the yeast genome stored in the BlastDB. The function
return_intergenic_regions returns a dictionary that maps the names of the
previous and subsequent genes to the start and stop coordinates of the
intergenic region between them.
+
+The start and stop coordinates of each region are then extracted and
saved, so that the dictionary Intergenic_Regions holds all of the pertinent
data for the intergenic sequences.
+
+{{{
+yeastregions = cogs2.CogsFileContent('data/NC_001134.ptt', len(genome))
+
+def return_intergenic_regions(ptt_info):
+    intergenic = cogs2.IntergenicRegionsByFootprint(ptt_info)
+    yeast_between = {}
+    for (start, end, name) in intergenic:
+        if "-" in name:
+            continue
+        key = tuple(name)
+        if end - start > 0:
+            yeast_between[key] = (start, end)
+    return yeast_between
+
+
+yeast = return_intergenic_regions(yeastregions)
+id = genome.seqLenDict.keys()[0]
+yeast_genome = genome[id]
+
+
+for i, (gene, site) in enumerate(yeast.iteritems()):
+    regions = str(yeast_genome[site[0]:site[1]])
+    start = site[0]
+    stop = site[1]
+    Intergenic_Regions[i] = Intergenic(gene[0], gene[1], start, stop, 1,
+                                       genome_id, stop - start + 1)
+}}}
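+
+To make the idea of an intergenic region concrete, the following sketch
+computes the gaps between consecutive genes in plain Python. It is not the
+cogs2 implementation, whose internals are not shown on this page: the
+function name intergenic_gaps, the sample coordinates, and the 0-based,
+half-open coordinate convention are all assumptions made for illustration.
+

```python
def intergenic_gaps(genes):
    """Return (start, end, (before, after)) for each gap between genes.

    Hypothetical helper for illustration only. `genes` is a list of
    (start, end, name) tuples in 0-based, half-open coordinates (the same
    convention as Python slicing). Overlapping genes produce no gap.
    """
    gaps = []
    ordered = sorted(genes)
    if not ordered:
        return gaps
    # Track the furthest gene end seen so far and the gene that reached it.
    _, prev_end, prev_name = ordered[0]
    for start, end, name in ordered[1:]:
        if start > prev_end:
            gaps.append((prev_end, start, (prev_name, name)))
        if end > prev_end:
            prev_end, prev_name = end, name
    return gaps


# Made-up coordinates: geneA and geneB overlap, geneC lies further along.
sample_genes = [(10, 50, 'geneA'), (40, 60, 'geneB'), (100, 150, 'geneC')]
sample_gaps = intergenic_gaps(sample_genes)
print(sample_gaps)  # [(60, 100, ('geneB', 'geneC'))]
```

+
+With half-open coordinates the length of a gap is simply end - start, which
+matches the number of characters returned by slicing a sequence with
+[start:end].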
+
+An annotation database is created to store the intergenic regions together
with the yeast genome. Annotation databases are especially useful because
they can be used in several different ways. For example, a simple use shown
here is to assign the intergenic region stored under key 1 to
intergenic_region_1, then print the preceding gene, the following gene, and
the actual sequence of the region. AnnotationDB has many more complex uses,
but the ease with which it finds and stores information makes it handy for
simple tasks as well.
+
+Finally, a reverse mapping is built, tying the intergenic regions to the
genome and storing them as aligned intervals in a nested list (NLMSA), so
that the pairs take up very little storage space yet remain easy to search.
The sequences can then be queried simply: slicing the genome with the
genome id and the start and stop coordinates of the desired region gives a
query interval, and calling the keys method on the result of the lookup
returns the matching annotations.
+
+{{{
+# Create a pygr AnnotationDB
+annot_db = seqdb.AnnotationDB(Intergenic_Regions, genome)
+# Do cool things with it
+intergenic_region_1 = annot_db[1] # The key is the order
+print(intergenic_region_1.before_gene) # The gene that comes before
+print(intergenic_region_1.after_gene) # The gene that comes after
+print(intergenic_region_1.sequence) # The intergenic sequence
+# And so on
+
+# Build a reverse mapping
+annot_map = cnestedlist.NLMSA('genes', mode='memory', use_virtual_lpo=True)
+for v in annot_db.values():
+    annot_map.addAnnotation(v)
+annot_map.build()
+query_sequence = genome[genome_id][intergenic_region_1.start:intergenic_region_1.stop]
+annotations = annot_map[query_sequence]
+annotation = annotations.keys()[0]
+print(annotation.before_gene)
+}}}
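+
+The reverse mapping answers the question "which annotations overlap this
+interval?". The sketch below shows that question in plain Python using a
+simple linear scan. This is a deliberately simpler technique than the
+nested list that NLMSA uses, and the Region tuple, the overlapping
+function, and the sample data are hypothetical names chosen for
+illustration.
+

```python
from collections import namedtuple

# Hypothetical stand-in for an annotation: two flanking genes plus
# 0-based, half-open coordinates.
Region = namedtuple('Region', 'before_gene after_gene start stop')


def overlapping(regions, start, stop):
    """Return the regions that overlap the half-open interval [start, stop).

    Linear scan for clarity; a nested list or interval tree would answer
    the same query faster on large annotation sets.
    """
    return [r for r in regions if r.start < stop and start < r.stop]


sample_regions = [
    Region('geneA', 'geneB', 60, 100),
    Region('geneB', 'geneC', 150, 200),
]
hits = overlapping(sample_regions, 90, 120)
print(hits[0].before_gene)  # geneA
```

+
+A query interval that only touches a region's boundary, such as
+[100, 150) here, overlaps nothing under the half-open convention.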
+
+=Conclusion=
+
+While this script is fairly simple, it can be modified to serve varied
purposes. For example, an intergenic region of interest can be quickly
found and iterated through, so concentrating on a small portion of a large
genome becomes an easy task. Furthermore, AnnotationDB can be used in
various ways, according to the user's skill level and desired use.
+
+
+==Note==
+
+cogs2.py and return_intergenic_regions are based upon code written by
Dr. C. Titus Brown, which can be found in the article "Using pygr to study
orthologous binding sites in bacterial genomes", available here:
http://bio.scipy.org/wiki/index.php/Using_pygr_to_study_orthologous_binding_sites_in_bacterial_genomes.
\ No newline at end of file