[pygr-notify] [pygr commit] r69 - wiki
codesite-noreply at google.com
Tue Jul 8 12:36:21 PDT 2008
Author: ramccreary
Date: Tue Jul 8 12:35:27 2008
New Revision: 69
Modified:
wiki/GenomeCalculationsUsingpygr.wiki
Log:
Edited wiki page through web user interface.
Modified: wiki/GenomeCalculationsUsingpygr.wiki
==============================================================================
--- wiki/GenomeCalculationsUsingpygr.wiki (original)
+++ wiki/GenomeCalculationsUsingpygr.wiki Tue Jul 8 12:35:27 2008
@@ -50,7 +50,7 @@
The files were loaded using the OptionParser class (from the optparse
module), which takes the command line arguments and stores each one
under its designated option, so it can be read back as an attribute
(like options.ecoli_fna_filename). Each argument is assigned its own
option, which is what differentiates between them.
-The ecoli genome (ecoli_fna_filename) is stored in a BlastDB. The
BlastDB unpacks the FASTA id and potentiall several other ids
potentially contained in the NCBI genome.
+The ecoli genome (ecoli_fna_filename) is stored in a BlastDB. The
BlastDB unpacks the FASTA id and potentially several other ids
potentially contained in the NCBI genome.
DictReader (from the csv module) opens the .gff file and reads the
tab-separated entries, assigning each a string name. Since the entries
are separated by tabs, the delimiter used to differentiate the fields
is a tab.
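As a rough illustration of the DictReader step described above, here is
a minimal sketch of reading tab-separated GFF lines. The column names in
GFF_FIELDS, the read_gff helper, and the sample line are assumptions
made for this example and are not taken from the wiki page's code. GFF
files carry no header row, so the field names are passed in explicitly.

```python
import csv
import io

# Hypothetical column names chosen for this sketch; a GFF line has nine
# tab-separated columns and the file has no header row.
GFF_FIELDS = ["seqid", "source", "type", "start", "end",
              "score", "strand", "phase", "attributes"]


def read_gff(handle):
    """Return a list of dicts, one per non-comment GFF line."""
    # Comment and pragma lines start with '#', so filter them out first.
    lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(lines, fieldnames=GFF_FIELDS, delimiter="\t")
    return list(reader)


# Made-up sample standing in for a real .gff file.
sample = io.StringIO(
    "##gff-version 3\n"
    "chr_demo\tdemo\tgene\t1\t9\t.\t+\t.\tlocus_tag=demo0001\n"
)
rows = read_gff(sample)
```

Each row is a dictionary keyed by the field names, so rows[0]["start"]
gives the start coordinate as a string; it has to be converted with
int() before it is used as an interval boundary.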
@@ -78,9 +78,9 @@
Here comes the fun part. In order to perform our desired calculations
on this genome, the various gene intervals denoted in the annotations
must be linked to their corresponding sequence in the genome. In short,
the specific nucleotides must be able to be retrieved. First, a
dictionary is created, annots, which will hold the gene intervals found
earlier in the code keyed by the locus_tag. It will also store the
corresponding values (start, stop, id, etc.) for the gene intervals.
-The annots dictionary and E. coli genome are then both stored in an
AnnotationDB, which, as mentioned earlier, is finicky about the
sequence ID given, so ensure the field assigned to the id in the .gff
file is actually the desired field that stores the id.
+The annots dictionary and E. coli genome are then both stored in
seqDB, which, as mentioned earlier, is finicky about the sequence ID
given, so ensure the field assigned to the id in the .gff file is
actually the desired field that stores the id.
-Next, another dictionary is created, this time to store the counts of
each nucleotide per gene (ex: In the gene '1364', there are 226 As).
iteritems() came in handy because it iterated over all the actual
genome sequence corresponding to each annotation, and returned the
value for each genome sequence. nucs was the key for the ec_count, and
each nucleotide base was a value (A, T, G, C,).
+Next, another dictionary is created, this time to store the counts of
each nucleotide per gene (ex: In the gene '1364', there are 226 As).
iteritems() came in handy because it iterated over the dictionary, and
returned the value for each genome sequence. nucs was the key for the
ec_count, and each nucleotide base was a value (A, T, G, C,).
{{{
annots = {}
@@ -96,18 +96,15 @@
for gene, annot in annot_db.iteritems():
nucs = str(annot.sequence)
- ec_count = {}
+ ec_count = dict(A=0, C=0, T=0, G=0)
for nuc in nucs:
- if ec_count.has_key(nuc):
ec_count[nuc] = ec_count[nuc] + 1
- else:
- ec_count[nuc] = 1
ecoli_nuc_count[gene] = ec_count
}}}
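For readers without pygr at hand, the counting loop above can be
sketched on plain strings. This is only an illustration under stated
assumptions: the gene_sequences dictionary and the count_nucleotides
helper are hypothetical stand-ins for the AnnotationDB and its
annot.sequence values, and the sketch uses Python 3's items() where the
wiki code uses iteritems(). Unlike the loop above, it skips any
character other than A, C, T, or G (an ambiguous N, for example)
instead of raising a KeyError.

```python
def count_nucleotides(sequence):
    """Count A, C, T and G in a sequence string, ignoring other symbols."""
    # Pre-populate the counts so every base has an entry, even when absent.
    counts = dict(A=0, C=0, T=0, G=0)
    for nuc in sequence.upper():
        if nuc in counts:
            counts[nuc] += 1
    return counts


# Made-up sequences keyed by hypothetical locus tags.
gene_sequences = {"demo0001": "ATGGCA", "demo0002": "ttgacc"}

# One count dictionary per gene, mirroring ecoli_nuc_count in the wiki code.
ecoli_nuc_count = {gene: count_nucleotides(seq)
                   for gene, seq in gene_sequences.items()}
```

Initializing all four counts to zero is what makes the has_key() check
unnecessary, and it also guarantees that later code can look up any
base without testing for its presence first.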
I then create another dictionary, sum, to hold the counts of the
number of nucleotides per genome. The initial sums are all set at 0.0 to
initialize the count.
-Next, iteritems() is used again to add up the total number of bases,
each instance of a nucleotide increasing the count for that base by
one. And finally, the reason for the code: the sum of each base is
divided by the total number (producing the average number of bases for
that gene) and prints the values found for the calculations.
+Next, we iterate over the genes and nucleotide counts, each instance
of a nucleotide increasing the count for that base by one. And finally,
the reason for the code: the sum of each base is divided by the total
number (producing the average number of bases for that gene) and prints
the values found for the calculations.
{{{
sum = {}