[pygr-notify] [pygr commit] r186 - Edited wiki page through web user interface.

codesite-noreply at google.com codesite-noreply at google.com
Wed Apr 15 17:41:06 PDT 2009


Author: marecki
Date: Wed Apr 15 17:40:12 2009
New Revision: 186

Modified:
    wiki/MegatestSetup.wiki

Log:
Edited wiki page through web user interface.

Modified: wiki/MegatestSetup.wiki
==============================================================================
--- wiki/MegatestSetup.wiki	(original)
+++ wiki/MegatestSetup.wiki	Wed Apr 15 17:40:12 2009
@@ -21,147 +21,66 @@

   * Everything needed by Pygr itself;

- * (optional) A local pygr.Data XML-RPC server, so that the data-download  
test is not affected by the quality of your connection to the UCLA one.
+ * _(optional)_ A local pygr.Data XML-RPC server, so that the  
data-download test is not affected by the quality of your connection to the  
UCLA one;

-
-XXX
-
-== Structure of the directory tree ==
-
-The directory structure for megatest setup is as follows.
-
- _/result/pygr_megatest_ - parent directory for megatests. All inputs and  
pre-calculated outputs should be saved here. This includes sub-directories  
_axt_data_, _maf_data_, _maf_data3_, _maf_test_ and _maf_test3_.
-
- _/result/pygr_megatest/src_save_ - megatest running directory. Log files  
and sendmail scripts are saved here.
+ * Sequence data and other input used by megatests, along with reference  
output; obtaining and installing these is described below.


  == Downloading and preparing data ==

-The first step here is to obtain appropriate genome sequences and store  
them as _seqdb.BlastDB_ files. The easiest way of doing this involves using  
pygr.Data to fetch the relevant BlastDB files over XML-RPC from  
_biodb2.bioinformatics.ucla.edu_, see the PygrResourceDownloader page for  
more details on how to do this.
-
-The following sequences must be obtained:
-
-* for dm2 megatests
- * Bio.Seq.Genome.ANOGA.anoGam1
- * Bio.Seq.Genome.APIME.apiMel2
- * Bio.Seq.Genome.DROME.dm2
- * Bio.Seq.Genome.DROPS.dp4
- * Bio.Seq.Genome.DROAN.droAna3
- * Bio.Seq.Genome.DROER.droEre2
- * Bio.Seq.Genome.DROGR.droGri2
- * Bio.Seq.Genome.DROMO.droMoj3
- * Bio.Seq.Genome.DROPE.droPer1
- * Bio.Seq.Genome.DROSE.droSec1
- * Bio.Seq.Genome.DROSI.droSim1
- * Bio.Seq.Genome.DROVI.droVir3
- * Bio.Seq.Genome.DROWI.droWil1
- * Bio.Seq.Genome.DROYA.droYak2
- * Bio.Seq.Genome.TRICA.triCas2
-
-
-* for hg18 "annotation" and "NLMSA" megatests
-  * 'anoCar1', 'bosTau3', 'canFam2', 'cavPor2', 'danRer4', 'dasNov1', 'echTel1', 'equCab1', 'eriEur1', 'felCat3', 'fr2', 'galGal3', 'gasAcu1', 'hg18', 'loxAfr1', 'mm8', 'monDom4', 'ornAna1', 'oryCun1', 'oryLat1', 'otoGar1', 'panTro2', 'rheMac2', 'rn4', 'sorAra1', 'tetNig1', 'tupBel1', 'xenTro2'
-
-* for hg18 "pairwise alignment" megatest
- * 'canFam2', 'hg18', 'mm8', 'panTro2', 'rn4'
-
-
-At the top of each dm2/hg18 megatest files in tests directory, you can see  
following lines. You may need to change following lines to  
_PYGRDATADOWNLOAD_ path. If you want to maintain original *megatest.py*  
files intact, you may need to create following directories on your megatest  
machine.
-
- seqDir = '/result/pygr_megatest/seq_data'
-
- seqDir = '/result/pygr_megatest/seq_data3'
-
-All the necessary pre-built NLMSA and pre-calculated results are available  
at http://biodb.bioinformatics.ucla.edu/MEGATEST/
-
-And, there are some sets of pre-build NLMSA. These pre-built NLMSA files  
should be saved in your machine.
-
- msaDir = '/result/pygr_megatest/maf_test'  # maf_test.tar in biodb  
MEGATEST URL.
-
- msaDir = '/result/pygr_megatest/maf_test3' # maf_test3.tar in biodb  
MEGATEST URL.
-
-Note that the files _dm2_multiz15way.seqDictP_ from maf_test and  
_hg18_multiz28way.seqDictP_ from maf_test3 contain hardcoded paths which  
will need to be changed should your directory structure be different from  
the one described here. This can be done using an ordinary text editor.
-
-All the test results and NLMSA will be saved subdirectory of  
_/usr/tmp/deepreds_. You can find _/usr/tmp/deepreds_ path in all of  
*megatest.py*. You may need to change this directory or create one for  
megatest.
-
-----
-
-For mutigenome NLMSA megatest, download the archives
+Data files needed by Pygr megatests fall into three categories:  
sequence data in Pygr's _seqdb.BlastDB_ format, NLMSA files for different  
tests, and miscellaneous input/output files. The last two are installed  
differently from the first; both procedures are described here.

- http://biodb.bioinformatics.ucla.edu/MEGATEST/maf_data.tar
+Presently there are two distinct classes of megatests, each named after  
the primary genome it uses: _dm2_ (_Drosophila melanogaster_, the common  
fruit fly) and _hg18_ (_Homo sapiens_, human). Each class uses its own set  
of input and output data; it is recommended to keep them in separate  
directories.

- http://biodb.bioinformatics.ucla.edu/MEGATEST/maf_data3.tar

-then extract them to the following two directories on your machine,  
respectively:
+=== BlastDB files ===

- nlmsa_dm2_megatest.py:mafDir = '/result/pygr_megatest/maf_data'
+The easiest way of obtaining BlastDB sequence-data files is to fetch them  
using Pygr itself, from the UCLA XML-RPC server. That way the downloaded  
files are automatically registered in the local Pygr resource database.  
Information on how to do this can be found on the PygrResourceDownloader  
page; for your convenience, the lists below provide data-set names in the  
format understood by Pygr.

- nlmsa_hg18_megatest.py:mafDir = '/result/pygr_megatest/maf_data3'
-
-
-For pairwise NLMSA megatest, you need to extract files from
-
-  http://biodb.bioinformatics.ucla.edu/MEGATEST/axt_data.tar
-
-to
-
- pairwise_hg18_megatest.py:axtDir = '/result/pygr_megatest/axt_data'
-
-
-The final archive to download is
-
- http://biodb.bioinformatics.ucla.edu/MEGATEST/input_and_results.tar
-
-On leelab2, all megatest is running /result/pygr_megatest directory. Thus,
-input_and_results.tar should be extracted in /result/pygr_megatest  
directory
-or create one for your machine.
-
-input_and_results.tar contains the following files:
-
- * Annotation_ConservedElement_Exons_chrYh_dm2.txt
- * Annotation_ConservedElement_Exons_chrY_hg18.txt
- * Annotation_ConservedElement_Exons_dm2.txt
- * Annotation_ConservedElement_Exons_hg18.txt
- * Annotation_ConservedElement_Introns_chrYh_dm2.txt
- * Annotation_ConservedElement_Introns_chrY_hg18.txt
- * Annotation_ConservedElement_Introns_dm2.txt
- * Annotation_ConservedElement_Introns_hg18.txt
- * Annotation_ConservedElement_Stop_chrY_hg18.txt
- * Annotation_ConservedElement_Stop_hg18.txt
- * phastConsElements15way_chrYh_dm2.txt
- * phastConsElements15way_dm2.txt
- * phastConsElements28way_chrY_hg18.txt
- * phastConsElements28way_hg18.txt
- * refGene_cdsAnnot_chrY_hg18.txt
- * refGene_cdsAnnot_hg18.txt
- * refGene_exonAnnot_chrYh_dm2.txt
- * refGene_exonAnnot_chrY_hg18.txt
- * refGene_exonAnnot_dm2.txt
- * refGene_exonAnnot_hg18.txt
- * refGene_spliceAnnot_chrYh_dm2.txt
- * refGene_spliceAnnot_chrY_hg18.txt
- * refGene_spliceAnnot_dm2.txt
- * refGene_spliceAnnot_hg18.txt
- * snp126_chrY_hg18.txt
- * snp126_hg18.txt
- * splicesite_dm2_chr4h_multiz15way.txt
- * splicesite_dm2_chr4h.txt
- * splicesite_dm2_multiz15way.txt
- * splicesite_dm2.txt
- * splicesite_hg18_chrY_multiz28way.txt
- * splicesite_hg18_chrY_pairwise5way.txt
- * splicesite_hg18_chrY.txt
- * splicesite_hg18_multiz28way.txt
- * splicesite_hg18_pairwise5way.txt
- * splicesite_hg18.txt
+The following sequences must be obtained:

-As you can see, there are full versions and _chrY(hg18)_ and _chrYh(dm2)_  
versions. Current version of pygr megatest uses only short _(chrY for hg18  
and chrYh for dm2)_ versions in order to reduce overhead (both CPU and  
disc-space). If you want to test full version, you need to change  
*megatest.py* files, i.e. remove all _chrY_/_chrYh_.
+ # For _dm2_ megatests
+  * Bio.Seq.Genome.ANOGA.anoGam1
+  * Bio.Seq.Genome.APIME.apiMel2
+  * Bio.Seq.Genome.DROME.dm2
+  * Bio.Seq.Genome.DROPS.dp4
+  * Bio.Seq.Genome.DROAN.droAna3
+  * Bio.Seq.Genome.DROER.droEre2
+  * Bio.Seq.Genome.DROGR.droGri2
+  * Bio.Seq.Genome.DROMO.droMoj3
+  * Bio.Seq.Genome.DROPE.droPer1
+  * Bio.Seq.Genome.DROSE.droSec1
+  * Bio.Seq.Genome.DROSI.droSim1
+  * Bio.Seq.Genome.DROVI.droVir3
+  * Bio.Seq.Genome.DROWI.droWil1
+  * Bio.Seq.Genome.DROYA.droYak2
+  * Bio.Seq.Genome.TRICA.triCas2
+ # For _hg18_ "annotation" and "NLMSA" megatests (FIXME - verify and  
reformat!)
+   * 'anoCar1', 'bosTau3', 'canFam2', 'cavPor2', 'danRer4', 'dasNov1', 'echTel1', 'equCab1', 'eriEur1', 'felCat3', 'fr2', 'galGal3', 'gasAcu1', 'hg18', 'loxAfr1', 'mm8', 'monDom4', 'ornAna1', 'oryCun1', 'oryLat1', 'otoGar1', 'panTro2', 'rheMac2', 'rn4', 'sorAra1', 'tetNig1', 'tupBel1', 'xenTro2'
+ # For _hg18_ "pairwise alignment" megatest (FIXME - verify and reformat!)
+  * 'canFam2', 'hg18', 'mm8', 'panTro2', 'rn4'
+
+Once the files have been downloaded they require no further attention.
+
+
+=== NLMSA and other files ===
+
+The necessary files are available (as tar archives) on the Web, at  
http://biodb.bioinformatics.ucla.edu/MEGATEST/ . Download the archives and  
unpack them into directories of your choice. You need the following files:
+
+ # NLMSA for _dm2_ megatests
+  * maf_data.tar
+  * maf_test.tar
+ # NLMSA for _hg18_ megatests
+  * axt_data.tar
+  * maf_data3.tar
+  * maf_test3.tar
+ # Miscellaneous files
+  * input_and_results.tar (note: doesn't create its own directory!)
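
As a rough sketch of the unpacking step (not part of the megatest scripts), the archives listed above could be extracted with Python's standard `tarfile` module. The function name `unpack_archive` and any target directories are hypothetical; the only detail taken from this page is that input_and_results.tar does not create its own directory, so it is worth giving it a dedicated one.

```python
import os
import tarfile


def unpack_archive(archive_path, target_dir):
    """Extract a tar archive into target_dir, creating the directory if needed.

    Returns the sorted list of member names found in the archive.
    """
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path) as archive:
        names = sorted(archive.getnames())
        # Use the safer "data" extraction filter where this Python supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(target_dir, **kwargs)
    return names
```

Calling `unpack_archive("input_and_results.tar", "/some/dir/input_and_results")` (an illustrative path) would keep the loose files of that archive together instead of scattering them in the current directory.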

-On leelab2 (2.8 GHz dual-core Opteron CPU), the short version of megatests  
runs for about 5 minutes. On the other hand, the full version takes  
approximately 30 hours.
+This time some post-installation steps are necessary before the data can  
be used: the files _dm2_multiz15way.seqDictP_ (from maf_test.tar) and  
_hg18_multiz28way.seqDictP_ (from maf_test3.tar) contain hardcoded paths  
which need to be changed to reflect your directory structure. Assuming the  
final path components stay the same (i.e. you keep the data in the  
directories into which the archives unpack them), simply open the files in  
question in an ordinary text editor and replace every occurrence of  
_/result/pygr_megatest_ (FIXME: double-check this!) with the path of your  
choice.
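
The same replacement could be scripted instead of done by hand. The following is a minimal sketch, assuming the paths are stored as plain text in the seqDictP files (which the text-editor advice implies); the helper name `rewrite_path_prefix` is hypothetical, and the old prefix to pass in is the unverified one mentioned above. Keep a backup copy of each file before rewriting it.

```python
def rewrite_path_prefix(file_path, old_prefix, new_prefix):
    """Replace every occurrence of old_prefix in the file with new_prefix.

    Works on raw bytes so that no text encoding has to be guessed.
    Returns the number of occurrences that were replaced.
    """
    with open(file_path, "rb") as handle:
        data = handle.read()
    old = old_prefix.encode()
    count = data.count(old)
    if count:
        # Only rewrite the file when there is something to change.
        with open(file_path, "wb") as handle:
            handle.write(data.replace(old, new_prefix.encode()))
    return count
```

A return value of 0 means the prefix was not found, which would suggest that the hardcoded path differs from the one assumed here.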

-Comparison between new results and pre-built result will be done by
-md5.digest().

+XXX

  == Setting up the Database ==

@@ -180,6 +99,18 @@
  The same NLMSA-building megatest will run twice, once for the file-saving
  version and once for the MySQL version. Each run includes a
  text-to-binary conversion test.
+
+
+== Choosing the variant ==
+
+The files in input_and_results.tar come in full versions and in short  
_chrY_ (hg18) and _chrYh_ (dm2) versions. The current version of the Pygr  
megatests uses only the short versions in order to reduce overhead (both  
CPU and disk space). To test the full versions, you need to edit the  
*megatest.py* files, i.e. remove all occurrences of _chrY_/_chrYh_.
+
+On leelab2 (2.8 GHz dual-core Opteron CPU), the short version of the  
megatests runs for about 5 minutes, whereas the full version takes  
approximately 30 hours.
+
+
+== The config file ==
+
+The latest incarnation of the megatest code and support scripts is quite  
flexible about where input comes from and where output goes: you could  
distribute the relevant directories all over the system if you wanted to.  
Regardless of where they go, however, Pygr needs to be told where to look  
for them. This, along with some other megatest-related settings, is done  
using a Pygr configuration file.


  == Shell script ==

