[pygr-notify] [pygr commit] r101 - wiki

Wed Sep 10 19:01:38 PDT 2008

Author: jqian.ubc
Date: Wed Sep 10 19:01:27 2008
New Revision: 101

Modified:
    wiki/PygrOnEnsembl.wiki

Log:
Edited wiki page through web user interface.

Modified: wiki/PygrOnEnsembl.wiki
==============================================================================

--- wiki/PygrOnEnsembl.wiki	(original)
+++ wiki/PygrOnEnsembl.wiki	Wed Sep 10 19:01:27 2008
@@ -1,8 +1,8 @@
-#summary using pygr to develop an Ensembl API
+#summary using Pygr to develop an Ensembl API

  = Introduction =

-The Ensembl database system is a central data repository for various  
eukaryotic genome sequences and their annotated information  
[http://www.ensembl.org Ensembl Home].  The screenshots of schema diagrams  
for the four basic types of databases (core, compara, variation and  
funcgen) can be found at:  
[http://groups.google.com/group/pygr-dev/files?hl=en pygr-dev files].  They  
were created using the files in the sql/
+The Ensembl database system is a central data repository for various  
eukaryotic genome sequences and their annotated information  
[http://www.ensembl.org Ensembl Home].  The screen shots of schema diagrams  
for the four basic types of databases (core, compara, variation and  
funcgen) can be found at:  
[http://groups.google.com/group/pygr-dev/files?hl=en pygr-dev files].  They  
were created using the files in the sql/
  directory of the ensembl CVS module. The  
[http://pygr-dev.googlegroups.com/web/table.sql?gda=M3ILbjoAAABJgcRQ_B738LYip0lXSox5BrGVnIRWNUQzXUPZ5KyWuGG1qiJ7UbTIup-M2XPURDTDvhSABxKrnfEc_FGQElaK  
table.sql] file gives the table
  definitions and the  
[http://pygr-dev.googlegroups.com/web/foreign_keys.sql?gda=K02TckEAAABJgcRQ_B738LYip0lXSox5BrGVnIRWNUQzXUPZ5KyWuGG1qiJ7UbTIup-M2XPURDRvOefWPvoIMlEIkd9UdRbQLTxVVTd9FLrlvrrz00ZndA  
foreign_keys.sql] gives the foreign key definitions.  Being able to access  
its numerous large databases efficiently is indispensable to any genome  
research project. Currently, the Ensembl databases are mostly accessed  
through a Perl API or a (less developed) Java API. No equivalent Python API  
is yet available.

@@ -65,37 +65,42 @@

  *Framework*

-*1.* the datamodel.py module
-a BaseModel super class and its subclasses.  Each subclass represents a  
biological entity.
+*1.* the datamodel module (datamodel.py):
+- a generic datamodel (BaseModel) class (super class).  It is a subclass  
of Pygr's sqlgraph.TupleO
+- specialized datamodel classes (subclasses of BaseModel).  Each subclass  
represents a biological entity, or an Ensembl row/item object.
+- a generic Feature class.  It represents a generic Ensembl feature.  An  
Ensembl feature refers to an object that has the attributes of  
seq_region_id, seq_region_start, seq_region_end and seq_region_strand.  The  
get_sequence() method is implemented using Pygr's seqdb.AnnotationDB
+- specialized feature classes (subclasses of Feature).  The schema between  
features is implemented using Pygr's sqlgraph.SQLGraph
+
+*2.* the adaptor module (adaptor.py):
+- a Registry class: provides a connection to the ensembl SQL server
+- specialized adaptor classes (subclasses of Pygr's sqlgraph.SQLTable  
class): each provides access to a specific sql table in an ensembl core database.
+- private module methods: provide automatic saving of the Ensembl database  
schema to pygr.Data

-*2.* the adaptor.py module
-a Registry class, a generic adaptor class (super class) and many  
specialized adaptor classes (sub classes).  Each specialized adaptor class  
employs pygr modules (mainly the sqlgraph and seqdb module) and provides  
access to its corresponding sql table in an ensembl core database.
+*3.* the featuremapping module (featuremapping.py): provides mapping  
between ensembl features

-*3.* the featuremapping.py module
-
-*4.* the supporting module (seqregion.py): extensions of the pygr core  
modules.
+*4.* the supporting module (seqregion.py): provides mapping between a  
sequence slice and the set of Ensembl features in the slice.

  *Design Pattern*

-The Driver class in the adaptor module is implemented as a singleton  
class, since making a connection to the database is expensive.
+The Registry class in the adaptor module is implemented as a singleton  
class, since making a connection to the server is expensive.

  = Implemented Functionality =

-The latest ensembl API allows the user to perform the following tasks:
+The latest Ensembl API allows the user to perform the following tasks:

  *General methods*

  Create a connection to the ensembl MySQL server:

-serverRegistry = get_registry(host='ensembldb.ensembl.org',  
user='anonymous')
+`serverRegistry = get_registry(host='ensembldb.ensembl.org',  
user='anonymous')`

  Create access to an ensembl core database:

-coreDBAdaptor =  
serverRegistry.get_DBAdaptor('homo_sapiens', 'core', '47_36i')
+`coreDBAdaptor =  
serverRegistry.get_DBAdaptor('homo_sapiens', 'core', '47_36i')`

  Retrieve a sequence object:

-coreDBAdaptor.fetch_slice_by_seqregion(coordSystemName, seqregionName)
+`coreDBAdaptor.fetch_slice_by_seqregion(coordSystemName, seqregionName)`

  -coordSystemName: 'chromosome' or 'contig'
  -seqregionName: a chromosome name, such as '1'
@@ -105,17 +110,17 @@
  Create access to any table in an ensembl core database:

  e.g.
-transcriptAdaptor = coreDBAdaptor.get_adaptor('transcript') will return a  
transcriptAdaptor object that can be used to access any record/item in the  
transcript table.
+`transcriptAdaptor = coreDBAdaptor.get_adaptor('transcript')` will return  
a transcriptAdaptor object that can be used to access any record/item in  
the transcript table.

  Create access to any record in an ensembl sql table:

  e.g.
-transcript = transcriptAdaptor[1] will return a transcript item with the  
unique dbID 1
+`transcript = transcriptAdaptor[1]` will return a transcript item with the  
unique dbID 1

  Create access to any column of an ensembl sql table record:

  e.g.
-transcript.seq_region_start will return the seq_region_start value of the  
given transcript
+`transcript.seq_region_start` will return the seq_region_start value of  
the given transcript


  *Methods for an ensembl feature object*
@@ -123,69 +128,69 @@
  An ensembl feature refers to an object that has the attributes of  
seq_region_id, seq_region_start, seq_region_end and seq_region_strand.

  Retrieve the sequence of an ensembl feature:
-get_sequence()
+`get_sequence()`

  e.g.
-gene.get_sequence() will return a sequence object of the given gene.
+`gene.get_sequence()` will return a sequence object of the given gene.

  optional argument for this method: the length of the flanking region on  
both sides of the feature sequence:

  e.g.
-gene.get_sequence(500) will return the sequence of the gene plus 500bp  
flanking regions on both sides of the gene.
+`gene.get_sequence(500)` will return the sequence of the gene plus 500bp  
flanking regions on both sides of the gene.

  Find all the feature objects in a particular slice:

-fetch_all_by_slice(slice)
+`fetch_all_by_slice(slice)`

  e.g.
-transcriptAdaptor.fetch_all_by_slice(slice) will retrieve all the  
transcripts in the given slice.
+`transcriptAdaptor.fetch_all_by_slice(slice)` will retrieve all the  
transcripts in the given slice.

  Retrieve the stable_id, created_date, modified_date or the version for a  
gene/transcript/translation/exon

  e.g.
-gene.get_stable_id() will return the ensembl stable_id for the given gene
+`gene.get_stable_id()` will return the ensembl stable_id for the given gene

  Obtain a gene object:

-transcript.get_gene()
-geneAdaptor.fetch_by_stable_id(geneStableID)
+`transcript.get_gene()`
+`geneAdaptor.fetch_by_stable_id(geneStableID)`

  Obtain transcript objects:

-gene.get_transcripts()
-exon.get_all_transcripts()
-translation.get_transcript()
-transcriptAdaptor.fetch_by_stable_id(transcriptStableID)
+`gene.get_transcripts()`
+`exon.get_all_transcripts()`
+`translation.get_transcript()`
+`transcriptAdaptor.fetch_by_stable_id(transcriptStableID)`

  Obtain exon objects:

-transcript.get_all_exons()
-exonAdaptor.fetch_by_stable_id(exonStableID)
+`transcript.get_all_exons()`
+`exonAdaptor.fetch_by_stable_id(exonStableID)`

  Obtain a translation object:

-transcript.get_translation()
-translationAdaptor.fetch_by_stable_id(translationStableID)
+`transcript.get_translation()`
+`translationAdaptor.fetch_by_stable_id(translationStableID)`

  Obtain a spliced sequence object:

-transcript.get_spliced_seq()
+`transcript.get_spliced_seq()`

  Obtain a five-prime untranslated region:

-transcript.get_five_utr()
+`transcript.get_five_utr()`

  Obtain a three-prime untranslated region:

-transcript.get_three_utr()
+`transcript.get_three_utr()`

  Obtain a prediction_transcript object:

-predictionExon.get_prediction_transcript()
+`predictionExon.get_prediction_transcript()`

  Obtain prediction_exon objects:

-predictionTranscript.get_all_prediction_exons()
+`predictionTranscript.get_all_prediction_exons()`


  Additional sample code can be found under major methods in both the  
adaptor.py module and the datamodel.py module, in the form of doctests.
@@ -195,7 +200,7 @@
  *1.* The latest Ensembl API tarball Qing_Qian.tar.gz can be downloaded  
from  
[http://code.google.com/p/google-summer-of-code-2008-psf/downloads/list#].
  For the prerequisites and installation details, please refer to the README  
file.

-Alternatively, the current ensembl API code, together with pygr, can be  
retrieved from the public git repository.  To check out a copy, run the  
following instruction on the command line:
+Alternatively, the current ensembl API code, together with Pygr, can be  
retrieved from the public git repository.  To check out a copy, run the  
following instruction on the command line:

  `git clone git://iorich.caltech.edu/git/public/pygr-jenny <dirname of your  
choice>`
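
Note on the Design Pattern section: the page says the Registry class is a singleton because connecting to the server is expensive. The sketch below shows one way such a class might look. It is not the project's code. Only the names `Registry` and `get_registry` and the `host`/`user` arguments come from the page; `_connect`, `connect_count` and the dictionary standing in for a connection are hypothetical placeholders. It also keeps one shared instance per (host, user) pair, which is a small variation on a strict singleton.

```python
class Registry(object):
    """Illustrative stand-in for the Registry class (assumed design)."""

    _instances = {}      # (host, user) -> shared Registry instance
    connect_count = 0    # hypothetical counter, used only to show reuse

    def __new__(cls, host, user):
        key = (host, user)
        if key not in cls._instances:
            # First request for this server: build the instance and pay
            # the connection cost exactly once.
            inst = super().__new__(cls)
            inst.host = host
            inst.user = user
            inst.connection = cls._connect(host, user)
            cls._instances[key] = inst
        return cls._instances[key]

    @classmethod
    def _connect(cls, host, user):
        # Placeholder for the expensive step; a real version would open
        # a MySQL connection here instead of returning a dict.
        cls.connect_count += 1
        return {"host": host, "user": user}


def get_registry(host, user):
    """Return the shared Registry for this server, creating it on first use."""
    return Registry(host, user)


first = get_registry(host='ensembldb.ensembl.org', user='anonymous')
second = get_registry(host='ensembldb.ensembl.org', user='anonymous')
print(first is second)         # both names refer to the same object
print(Registry.connect_count)  # the placeholder connect ran once
```

With this shape, repeated `get_registry(...)` calls in different modules reuse one connection instead of opening a new one each time.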