[pygr-notify] [pygr commit] r50 - wiki
codesite-noreply at google.com
Mon Jun 23 09:29:00 PDT 2008
Author: ramccreary
Date: Mon Jun 23 09:28:41 2008
New Revision: 50
Modified:
wiki/DataStorageUsingpygr.wiki
Log:
Edited wiki page through web user interface.
Modified: wiki/DataStorageUsingpygr.wiki
==============================================================================
--- wiki/DataStorageUsingpygr.wiki (original)
+++ wiki/DataStorageUsingpygr.wiki Mon Jun 23 09:28:41 2008
@@ -1,8 +1,8 @@
-#summary Storing data in a MySQL table using pygr
+#summary Storing data in a MySQL table and pygr.Data
= Introduction =
- This article is an in-depth explanation of a script in which a genome
and the accompanying annotations are stored in multiple ways, including
a MySQL database. Storing the data this way enables it to be easily
manipulated using pygr and prevents potential errors by allowing ease
of access to the necessary genomic information.
+ This article is an in-depth explanation of a script in which a genome
and the accompanying annotations are stored in multiple ways, including
a MySQL database and pygr.Data. Storing the data this way enables it to
be easily manipulated using pygr and prevents potential errors by
allowing ease of access to the necessary genomic information.
= Step-By-Step Example =
@@ -122,7 +122,7 @@
}}}
-In creating the dictionary for the annotations, the csv module can be
used to read the .gff file. The csv module is able to differentiate the
unique fields. Furthermore, the fields in the subsequent database are
created and identified.
+In creating the dictionary for the annotations, the csv module can be
used to read the .gff file. The csv module is able to differentiate the
unique fields. Furthermore, the fields in the subsequent database are
created and identified.
{{{
annot_dict = {}
@@ -151,7 +151,7 @@
}}}
-The following uses the csv dict reader to read the file.
+The following uses the csv dict reader to read the file. Since the
annotations are separated by whitespace, it can be difficult to
differentiate the data in the separate fields, which is why DictReader
is used.
{{{
for row in reader:
@@ -180,7 +180,7 @@
}}}
-Each row that is read is entered into the features2 table:
+Each row that is read is entered into the features2 table. 'start'
and 'stop' identify the beginning and end of each interval on the
sequence (given with respect to the positive strand), 'orientation' is
the orientation of the interval on the sequence (stored as 1 or -1,
depending on whether it is on the positive or negative
strand), 'chr' identifies the sequence the intervals are
contained within, and 'info' is the gene identifier.
{{{
c.execute('''INSERT INTO features2 (start, stop, orientation,
chr, info)
@@ -215,12 +215,12 @@
Finally, the MySQL database for the annotation is built, and saved as
the supplied database name. conn.commit() commits the
transaction and makes the changes permanent.
-Here, slicedb uses slices (intervals) from the SQLTable, which
correspond to the gene sequences in the BLAST database previously
constructed that contains the genome.
+Here, slicedb uses slices (intervals) from the SQLTable, which
correspond to the gene sequences in the BLAST database previously
constructed that contains the genome. AnnotationDB uses the sequence
intervals as keys within a dictionary, and the values are the
annotation objects, which are similar to sequence intervals in that
they represent segments of the genome, but have annotation data
associated with them. The two containers supplied for AnnotationDB are
the slicedb, which contains the SQL table that holds the list of
annotation intervals, and the sequence database for the E. coli sequence intervals.
{{{
slicedb = SQLTable('ecoli.features2', c)
-annot_db = seqdb.AnnotationDB(slicedb, genome,
+annots = seqdb.AnnotationDB(slicedb, genome,
sliceAttrDict=dict(id='chr'))
}}}
@@ -237,7 +237,7 @@
Finally, an annotation map is created, with the annotations added. The
nested list data structure shortens the time needed to scan
the intervals by storing overlapping intervals in a more efficient,
hierarchical format. The annotations are then mapped to the segment of
the genome to which they correspond.
{{{
-annot_map =
cnestedlist.NLMSA('/home/mccreary/Projects/pygr/data/annot_map', 'w', genomeUnion,
+annot_map = cnestedlist.NLMSA('annotationmap', 'w', genomeUnion,
pairwiseMode=True)
@@ -257,4 +257,26 @@
print 'building...'
annot_map.build()
-}}}
\ No newline at end of file
+}}}
+
+Docstrings are then created for the genome, the annotations, and the
annotation map so they may be stored in pygr.Data. pygr.Data requires
a docstring to be assigned to every resource stored within it, to allow a
more transparent storage of data and to allow easier access.
+
+{{{
+genome.__doc__ = 'ecoli genome'
+annots.__doc__ = 'ecoli annotations'
+annot_map.__doc__ = 'annotation map'
+}}}
+
+Finally, the genome, the annotations, and the annotation map are stored
in pygr.Data. Since the annotation map is a schema, it can be stored
in pygr.Data as a schema. In order to store a schema in pygr.Data, the
relationship it describes must be defined (Many-To-Many or
One-To-Many). The annotation map is saved in pygr.Data first, then
again with the schema assignment. When saving the map as a schema, the
relationship between the schema and the resources it references must
also be made clear, and the resources must be available in pygr.Data as
well (you must save the genome and annotations along with the
annotation map).
+
+bindAttrs can have up to three attribute names, although only one is
used here. 'annots' is bound to the objects of the source database (the
annotations are keys for the annotation map). The pygr.Data resources
are then stored to pygr.Data using the save() command, which is
essential for any session that modifies or adds pygr.Data resources.
+
+{{{
+pygr.Data.Bio.Seq.Genome.ecoli = genome
+pygr.Data.Bio.Annotation.ecoli.annotations = annots
+pygr.Data.Bio.Annotation.ecoli.annotationmap = annot_map
+pygr.Data.schema.Bio.Annotation.ecoli.annotationmap = \
+ pygr.Data.ManyToManyRelation(genome,annots,bindAttrs=('annots',))
+}}}
+
+pygr.Data.save()
\ No newline at end of file
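
The csv-reading and INSERT steps described in the wiki text above can be sketched in a self-contained form. This is only an illustration under stated assumptions, not the script from the wiki page: the standard-library sqlite3 module stands in for MySQL so the example runs without a server, the two GFF lines are made-up sample data, the column names in GFF_FIELDS and the helper load_features are hypothetical, and the input is assumed to be tab-delimited. It is written for current Python rather than the Python 2 used in the original script.

```python
import csv
import io
import sqlite3

# Hypothetical two-line GFF excerpt (tab-separated); not data from the wiki script.
GFF_TEXT = (
    "chr1\tsrc\tgene\t190\t255\t.\t+\t.\tthrL\n"
    "chr1\tsrc\tgene\t5683\t6459\t.\t-\t.\tyaaA\n"
)

# Standard nine-column GFF layout; the field names are chosen here for illustration.
GFF_FIELDS = ["seqid", "source", "type", "start", "end",
              "score", "strand", "phase", "attributes"]


def load_features(gff_text, conn):
    """Parse GFF text and insert one row per feature into features2."""
    c = conn.cursor()
    c.execute("""CREATE TABLE features2 (
                     id INTEGER PRIMARY KEY,
                     start INTEGER, stop INTEGER, orientation INTEGER,
                     chr TEXT, info TEXT)""")
    # DictReader maps each delimited column to a named field.
    reader = csv.DictReader(io.StringIO(gff_text),
                            fieldnames=GFF_FIELDS, delimiter="\t")
    for row in reader:
        # Encode the strand as 1 (positive) or -1 (negative).
        orientation = -1 if row["strand"] == "-" else 1
        # Placeholders let the driver quote the values safely.
        # Coordinates are stored exactly as read; no conversion is applied.
        c.execute("INSERT INTO features2 (start, stop, orientation, chr, info) "
                  "VALUES (?, ?, ?, ?, ?)",
                  (int(row["start"]), int(row["end"]), orientation,
                   row["seqid"], row["attributes"]))
    conn.commit()  # make the inserts permanent; the connection stays open
    return c


conn = sqlite3.connect(":memory:")
cursor = load_features(GFF_TEXT, conn)
rows = cursor.execute(
    "SELECT start, stop, orientation, chr, info FROM features2 ORDER BY id"
).fetchall()
print(rows)
```

With a MySQL driver such as MySQLdb, the placeholder style would be %s instead of ?, and the connection would point at a real database rather than ":memory:".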