[bip] Pygr questions
Christopher Lee
leec at chem.ucla.edu
Thu Oct 4 09:55:40 PDT 2007
Hi Diane,
a few answers:
- regarding the "orientation" attribute. This is actually optional;
if you want you can simply use negative (start,stop) coordinates to
specify reverse orientation. Maybe Titus should add an example to
his tutorial that illustrates this use of negative coordinates. From
the AnnotationDB docs:
"Note: the start,stop coordinates should follow the SeqPath sign
convention, i.e. positive coordinates mean an interval on the
positive strand, and negative coordinates mean an interval on the
negative strand, i.e. the reverse complement of the positive strand
(see the reference documentation on SeqPath above for details).
If the sliceAttrDict (or sliceInfo object directly) provides an
orientation attribute, it will be used to change positive
intervals to negative intervals if the orientation attribute is
negative. This gives the user an alternative method to represent
orientation: give all coordinates in positive orientation (positive
integer values), and give an orientation attribute that is a negative
value if the interval should be reversed (to negative orientation). "
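To make the two equivalent representations concrete, here is a minimal sketch in plain Python (it does not import pygr). The helper name signed_interval is hypothetical, and the mapping of a reversed interval (start, stop) to (-stop, -start) is my reading of the SeqPath sign convention quoted above, so treat it as an assumption rather than pygr's actual code:

```python
def signed_interval(start, stop, orientation=1):
    """Return (start, stop) in signed-coordinate form.

    Hypothetical helper for illustration only, not part of pygr.
    Assumption: reversing the positive interval [start, stop) gives
    [-stop, -start), so start < stop still holds afterwards.
    """
    if orientation < 0 and start >= 0:
        # positive coordinates plus a negative orientation attribute:
        # convert to a negative-strand interval
        return -stop, -start
    # already signed, or forward orientation: leave unchanged
    return start, stop


forward = signed_interval(100, 250, +1)
reverse = signed_interval(100, 250, -1)
already_negative = signed_interval(-250, -100, -1)
```

Under that assumption, giving (100, 250) with orientation -1 and giving (-250, -100) directly describe the same reverse-strand interval.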
- you are right to ask for more details (relative to the tutorial)
about how to get query sequence objects from seqDict. The key issue
is making sure that your query sequence (slices) are from the same
sequence database object as the NLMSA is using (it opens its bound
seqdbs upon __init__). Unless the sequence database is known to
pygr.Data, just reopening the same sequence database file (creating a
new seqdb python object) won't work; this is one reason to use
pygr.Data, which eliminates this whole issue. Assuming you're NOT
using pygr.Data, then you should access the sequence database via the
NLMSA's seqDict as you said. For any NLMSA that involves multiple
sequence or annotation databases (as this one does), the seqDict is a
PrefixUnionDict that assigns each sequence database a distinct string
prefix and accepts keys of the form "prefix.id" (this follows UCSC's
convention for how they identify sequences in their multigenome
alignments). When constructing the original NLMSA you can directly
pass it such a PrefixUnionDict containing your own prefix
assignments. Alternatively, you can let the NLMSA build a
PrefixUnionDict for you automatically (this is what Titus did), in
which case it will try to choose a sensible prefix based either on
each sequence database's pygr.Data ID, its filename, or if all else
fails a generic but unique ID.
The bottom line is, you can get the list of assigned prefixes via
nlmsa.seqDict.prefixDict.keys(). Then obtain an individual sequence
by concatenating a string key of the form "prefix.id", where prefix
is the sequence database's prefix, and id is the identifier of the
desired sequence in that database. Then get the sequence via
s = nlmsa.seqDict[my_key]
or alternatively
s = nlmsa.seqDict.prefixDict[prefix][id]
You can then slice s any way you want and use it or its slices as a
query to the NLMSA.
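The lookup described above can be sketched without pygr installed. The class below is a hypothetical stand-in that only mimics the "prefix.id" key behaviour; it is NOT pygr's PrefixUnionDict, and the prefixes, IDs and sequences are made-up illustration data:

```python
class PrefixUnionSketch(object):
    """Hypothetical stand-in that mimics "prefix.id" lookups.

    Not pygr's PrefixUnionDict; it only illustrates the key scheme.
    """

    def __init__(self, prefixDict, separator="."):
        self.prefixDict = prefixDict  # prefix -> sequence database (a dict here)
        self.separator = separator

    def __getitem__(self, key):
        # split only on the first separator, so IDs may contain "."
        prefix, seq_id = key.split(self.separator, 1)
        return self.prefixDict[prefix][seq_id]


# made-up databases; plain strings stand in for sequence objects
seqDict = PrefixUnionSketch({
    "hg18": {"chr1": "ACGTACGTAC"},
    "mm8": {"chr1": "TTGACCA"},
})

prefixes = sorted(seqDict.prefixDict.keys())   # list the assigned prefixes
my_key = "hg18" + "." + "chr1"                 # build a "prefix.id" key
s = seqDict[my_key]                            # lookup by combined key
same = seqDict.prefixDict["hg18"]["chr1"]      # lookup via the prefix's database
piece = s[2:6]                                 # slice to use as a query
```

Both lookup forms return the same object, which is the point: the query sequence comes from the database object the NLMSA already holds.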
It occurs to me we could implement a __hash__ for sequence databases
such that any pair of sequence database objects that actually are
derived from the same file would be treated as the "same database"
for NLMSA.seqDict and other purposes... In this case you could open
the same database separately and it would work for querying the
NLMSA... but only if it was exactly the same filepath (which seems
fragile, given the prevalence of automount these days, constructing
arbitrary path prefixes). Does that seem worthwhile?
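As a rough sketch of that idea (and of why it is fragile), here is what path-based identity might look like. The class SeqDBHandle and the file paths are hypothetical; nothing like this exists in pygr as far as this message says:

```python
import os


class SeqDBHandle(object):
    """Hypothetical sequence database handle identified by its file path."""

    def __init__(self, filepath):
        # normalise the path so "./x" and "x" compare equal
        self.filepath = os.path.abspath(filepath)

    def __hash__(self):
        return hash(self.filepath)

    def __eq__(self, other):
        return (isinstance(other, SeqDBHandle)
                and self.filepath == other.filepath)

    def __ne__(self, other):
        return not self.__eq__(other)


first = SeqDBHandle("genomes/hg18.fa")
reopened = SeqDBHandle("./genomes/hg18.fa")            # same file, opened again
automounted = SeqDBHandle("/net/host/genomes/hg18.fa")  # same file, other path

registry = {first: "hg18"}
found_reopened = reopened in registry
found_automounted = automounted in registry
```

A separately opened handle is recognised only when the normalised path strings match; the same file reached through a different mount prefix is treated as a different database.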
- currently NLMSA is read-only. Once built, you can't add more data
to it. We could change this behavior (forcing a rebuild after new
data was added), or we could implement a truly dynamic version of
NLMSA (using tree structures rather than sorted arrays).
Yours,
Chris
On Oct 3, 2007, at 1:46 PM, Diane Trout wrote:
> Hi,
>
> I was using pygr and instead of waiting (potentially forever) to
> properly prepare my questions and observations I just wanted to
> toss out what I've gotten so far.
>
> http://bio.scipy.org/wiki/index.php/Retrieving_sequence_annotations_by_location_with_pygr
>
> is useful, though it might've been nice if it also included talking
> about setting orientation. AKA the annot class should have an
> attribute orientation set to either -1 for reverse strand and +1
> for "forward" strand.
>
> Once I build a NLMSA should I stick to using its seqDict for
> building queries? (At least until I get to understanding pygr.Data.)
>
> Is it possible to add more annotations to a currently existing NLMSA?
>
> diane
> _______________________________________________
> biology-in-python mailing list
> biology-in-python at lists.idyll.org
> http://lists.idyll.org/listinfo/biology-in-python