[pygr-notify] [pygr commit] r50 - wiki

Mon Jun 23 09:29:00 PDT 2008

Author: ramccreary
Date: Mon Jun 23 09:28:41 2008
New Revision: 50

Modified:
   wiki/DataStorageUsingpygr.wiki

Log:
Edited wiki page through web user interface.

Modified: wiki/DataStorageUsingpygr.wiki
==============================================================================

--- wiki/DataStorageUsingpygr.wiki	(original)
+++ wiki/DataStorageUsingpygr.wiki	Mon Jun 23 09:28:41 2008
@@ -1,8 +1,8 @@
-#summary Storing data in a MySQL table using pygr
+#summary Storing data in a MySQL table and pygr.Data

 = Introduction =

-	This article is an in-depth explanation of a script in which a genome 
and the accompanying annotations are stored in multiple ways, including 
a MySQL database. Storing the data this way enables it to be easily 
manipulated using pygr and prevents potential errors by allowing ease 
of access to the necessary genomic information.
+	This article is an in-depth explanation of a script in which a genome 
and the accompanying annotations are stored in multiple ways, including 
a MySQL database and pygr.Data. Storing the data this way enables it to 
be easily manipulated using pygr and prevents potential errors by 
allowing ease of access to the necessary genomic information.


 = Step-By-Step Example =
@@ -122,7 +122,7 @@
 }}}


-In creating the dictionary for the annotations, the csv module can be 
used to read the .gff file. The csv module is able to differentiate the 
unique fields. Furthermore, the fields in the subsequent database are 
created and identified.
+In creating the dictionary for the annotations, the csv module can be 
used to read the .gff file. The csv module is able to differentiate the 
unique fields. Furthermore, the fields in the subsequent database are 
created and identified. The

 {{{
 annot_dict = {}
@@ -151,7 +151,7 @@
 }}}


-The following uses the csv dict reader to read the file.
+The following uses the csv dict reader to read the file. Since the 
annotations are separated by whitespace, it can be difficult to 
differentiate the data in the separate fields, which is why DictReader 
is used.

 {{{
 for row in reader:
@@ -180,7 +180,7 @@
 }}}


-Each row that is read is entered into the features2 table:
+Each row that is read is entered into the features2 table. 'Start' 
and 'stop' identify the beginning and end of each interval on the 
sequence (given with respect to the positive strand), 'orientation' is 
the orientation of the interval on the sequence (will return a value of 
1 or -1, depending on whether it is the positive or negative 
strand), 'chr' is the identification of the sequence the intervals are 
contained within, and 'info' is the gene identifier.

 {{{
          c.execute('''INSERT INTO features2 (start, stop, orientation, 
chr, info)
@@ -215,12 +215,12 @@
  Finally, the MySQL database for the annotation is built, and saved as 
the supplied database name. conn.commit() commits the transaction and 
makes the changes permanent.


-Here, slicedb uses slices (intervals) from the SQLTable, which 
correspond to the gene sequences in the BLAST database previously 
constructed that contains the genome.
+Here, slicedb uses slices (intervals) from the SQLTable, which 
correspond to the gene sequences in the BLAST database previously 
constructed that contains the genome. AnnotationDB uses the sequence 
intervals as keys within a dictionary, and the values are the 
annotation objects, which are similar to sequence intervals in that 
they represent segments of the genome, but have annotation data 
associated with them. The two containers supplied for AnnotationDB are 
the slicedb, which contains the SQL table that holds the list of 
annotation intervals, and the sequence database for the E. coli sequence intervals.

 {{{
 slicedb = SQLTable('ecoli.features2', c)

-annot_db = seqdb.AnnotationDB(slicedb, genome,
+annots = seqdb.AnnotationDB(slicedb, genome,

                               sliceAttrDict=dict(id='chr'))
 }}}
@@ -237,7 +237,7 @@
  Finally, an annotation map is created, with the annotations added. The 
nested list format for data structure shortens the time needed to scan 
the intervals by storing overlapping intervals in a more efficient and 
hierarchical format. The annotations are then mapped to the segment of 
the genome to which they correspond.

 {{{
-annot_map = 
cnestedlist.NLMSA('/home/mccreary/Projects/pygr/data/annot_map', 'w', genomeUnion,
+annot_map = cnestedlist.NLMSA('annotationmap', 'w', genomeUnion,

                               pairwiseMode=True)

@@ -257,4 +257,26 @@
 print 'building...'

 annot_map.build()
-}}}
\ No newline at end of file
+}}}
+
+Docstrings are then created for the genome, the annotations, and the 
annotation map so they may be stored in pygr.Data. pygr.Data requires 
docstrings to be assigned to every resource stored within, to allow a 
more transparent storage of data and to allow easier access.
+
+{{{
+genome.__doc__ = 'ecoli genome'
+annots.__doc__ = 'ecoli annotations'
+annot_map.__doc__ = 'annotation map'
+}}}
+
+Finally, the genome, the annotations, and the annotation map are stored 
in pygr.Data. Since the annotation map relates the genome to its 
annotations, it can also be stored in pygr.Data as a schema. In order to 
store a schema in pygr.Data, the type of relationship must be defined 
(Many-To-Many or One-To-Many). The annotation map is saved in pygr.Data 
first, then again with the schema assignment. When saving the map as a 
schema, the relationship between the schema and the resources it 
references must also be made clear, and the resources must be available 
in pygr.Data as well (you must save the genome and annotations along 
with the annotation map).
+
+bindAttrs can have up to three attribute names, although only one is 
used here. 'annots' is bound to the objects of the source database (the 
annotations are keys for the annotation map). The pygr.Data resources 
are then stored to pygr.Data using the save() command, which is 
essential for any session that modifies or adds pygr.Data resources.
+
+{{{
+pygr.Data.Bio.Seq.Genome.ecoli = genome
+pygr.Data.Bio.Annotation.ecoli.annotations = annots
+pygr.Data.Bio.Annotation.ecoli.annotationmap = annot_map
+pygr.Data.schema.Bio.Annotation.ecoli.annotationmap = \
+    pygr.Data.ManyToManyRelation(genome,annots,bindAttrs=('annots',))
+}}}
+
+pygr.Data.save()
\ No newline at end of file
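
The hunks above show only fragments of the step that reads the .gff file and fills the features2 table. The following is a minimal, self-contained sketch of that step, not the script from the wiki page. It makes several assumptions: it uses Python 3 and the standard-library sqlite3 module as a stand-in for MySQL (so the placeholders are `?` rather than MySQLdb's `%s`), it assumes tab-delimited GFF columns, and the names `GFF_FIELDS` and `load_features` as well as the sample rows are hypothetical. Coordinates are stored exactly as read; any conversion the real script performs is not shown here.

```python
import csv
import io
import sqlite3

# Assumed names for the nine standard GFF columns (not taken from the wiki script).
GFF_FIELDS = ['seqid', 'source', 'type', 'start', 'end',
              'score', 'strand', 'phase', 'attributes']


def load_features(gff_file, conn, delimiter='\t'):
    """Read GFF rows with csv.DictReader and insert them into features2.

    Hypothetical helper: returns the number of rows inserted.
    """
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS features2
                 (id INTEGER PRIMARY KEY, start INTEGER, stop INTEGER,
                  orientation INTEGER, chr TEXT, info TEXT)''')
    reader = csv.DictReader(gff_file, fieldnames=GFF_FIELDS,
                            delimiter=delimiter)
    count = 0
    for row in reader:
        # Skip comment and directive lines such as "##gff-version 3".
        if row['seqid'].startswith('#'):
            continue
        # Map the GFF strand symbol to the 1 / -1 orientation convention.
        orientation = -1 if row['strand'] == '-' else 1
        c.execute('''INSERT INTO features2
                     (start, stop, orientation, chr, info)
                     VALUES (?, ?, ?, ?, ?)''',
                  (int(row['start']), int(row['end']), orientation,
                   row['seqid'], row['attributes']))
        count += 1
    # Commit the transaction so the inserted rows become permanent.
    conn.commit()
    return count


# Made-up sample data, held in memory instead of a real .gff file.
sample = io.StringIO(
    "##gff-version 3\n"
    "chr1\tdemo\tgene\t190\t255\t.\t+\t.\tID=geneA\n"
    "chr1\tdemo\tgene\t337\t2799\t.\t-\t.\tID=geneB\n"
)
conn = sqlite3.connect(':memory:')
n = load_features(sample, conn)
rows = conn.execute('''SELECT start, stop, orientation, chr, info
                       FROM features2 ORDER BY id''').fetchall()
print(n, rows)
```

Passing explicit `fieldnames` to DictReader lets each column be addressed by name even though a GFF file has no header row, which is the property the wiki text relies on when it says the fields can be differentiated.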