[bip] Pygr questions

Thu Oct 4 09:55:40 PDT 2007

Hi Diane,
a few answers:

- regarding the "orientation" attribute: this is actually optional.
If you prefer, you can simply use negative (start,stop) coordinates to
specify reverse orientation.  Maybe Titus should add an example to
his tutorial that illustrates this use of negative coordinates.  From
the AnnotationDB docs:

"Note: the start,stop coordinates should follow the SeqPath sign  
convention, i.e. positive coordinates mean an interval on the  
positive strand, and negative coordinates mean an interval on the  
negative strand (i.e. the reverse complement of the positive strand.  
See the reference documentation on SeqPath above for details).

If the sliceAttrDict (or sliceInfo object directly) provides an
orientation attribute, it will be used to change positive
intervals to negative intervals if the orientation attribute is
negative. This gives the user an alternative method to represent  
orientation: give all coordinates in positive orientation (positive  
integer values), and give an orientation attribute that is a negative  
value if the interval should be reversed (to negative orientation). "
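
To make the two conventions concrete, here is a minimal sketch in plain
Python (it does not use pygr itself).  The function name
to_signed_interval is made up for illustration, and the sketch assumes
that the negative-strand counterpart of a positive interval
(start, stop) is (-stop, -start).

```python
def to_signed_interval(start, stop, orientation=1):
    """Return (start, stop) in signed-coordinate form.

    Hypothetical helper, not part of pygr.  Assumption: a positive
    interval (start, stop) with negative orientation maps to
    (-stop, -start) on the negative strand.
    """
    if start >= 0 and orientation < 0:
        # Positive coordinates plus a negative orientation attribute:
        # flip the interval onto the negative strand.
        return (-stop, -start)
    # Already signed, or forward orientation: leave unchanged.
    return (start, stop)


forward = to_signed_interval(10, 20, orientation=1)
reverse = to_signed_interval(10, 20, orientation=-1)
already_negative = to_signed_interval(-20, -10)
```

Under that assumption, giving (10, 20) with orientation -1 and giving
(-20, -10) with no orientation attribute describe the same interval.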

- you are right to ask for more details (relative to the tutorial)
about how to get query sequence objects from seqDict.  The key issue
is making sure that your query sequences (or slices of them) come from
the same sequence database object that the NLMSA is using (it opens
its bound seqdbs upon __init__).  Unless the sequence database is known
to pygr.Data, just reopening the same sequence database file (creating
a new seqdb Python object) won't work; this is one reason to use
pygr.Data, which eliminates this whole issue.  Assuming you're NOT
using pygr.Data, you should access the sequence database via the
NLMSA's seqDict, as you said.  For any NLMSA that involves multiple
sequence or annotation databases (as this one does), the seqDict is a  
PrefixUnionDict that assigns each sequence database a distinct string  
prefix and accepts keys of the form "prefix.id" (this follows UCSC's  
convention for how they identify sequences in their multigenome  
alignments).  When constructing the original NLMSA you can directly  
pass it such a PrefixUnionDict containing your own prefix  
assignments.  Alternatively, you can let the NLMSA build a
PrefixUnionDict for you automatically (this is what Titus did), in
which case it will try to choose a sensible prefix based on each
sequence database's pygr.Data ID, its filename, or, if all else
fails, a generic but unique ID.

The bottom line is, you can get the list of assigned prefixes via
nlmsa.seqDict.prefixDict.keys().  To obtain an individual sequence,
build a string key of the form "prefix.id", where prefix is the
sequence database's prefix and id is the identifier of the desired
sequence in that database.  Then get the sequence via
s = nlmsa.seqDict[my_key]
or alternatively
s = nlmsa.seqDict.prefixDict[prefix][id]
You can then slice s any way you want and use it or its slices as a  
query to the NLMSA.
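
To illustrate how the "prefix.id" keys relate to prefixDict, here is a
toy stand-in written in plain Python.  It is NOT pygr's PrefixUnionDict;
the class name PrefixLookup, the prefixes "hg18" and "mm8", and the
short strings standing in for sequence objects are all assumptions made
up for this sketch.  It also assumes the key is split at the first "."
separator.

```python
class PrefixLookup:
    """Toy stand-in for a prefix-union dictionary (not pygr's class)."""

    def __init__(self, prefixDict, separator="."):
        self.prefixDict = prefixDict  # maps prefix -> sequence database
        self.separator = separator

    def __getitem__(self, key):
        # Split "prefix.id" at the first separator (an assumption here).
        prefix, sep, seq_id = key.partition(self.separator)
        if not sep:
            raise KeyError(key)
        return self.prefixDict[prefix][seq_id]


# Plain dicts of strings stand in for real sequence databases.
seq_dict = PrefixLookup({
    "hg18": {"chr1": "ACGTACGTAC"},
    "mm8": {"chr1": "TTGACCA"},
})

prefixes = sorted(seq_dict.prefixDict.keys())
my_key = "hg18" + "." + "chr1"
s = seq_dict[my_key]                      # lookup by "prefix.id"
s2 = seq_dict.prefixDict["hg18"]["chr1"]  # equivalent two-step lookup
```

Both lookups return the same object, which is the point: whichever
route you take, the sequence comes from the database object the
dictionary already holds.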

It occurs to me we could implement a __hash__ for sequence databases  
such that any pair of sequence database objects that actually are  
derived from the same file would be treated as the "same database"  
for NLMSA.seqDict and other purposes...  In this case you could open  
the same database separately and it would work for querying the  
NLMSA... but only if it was exactly the same filepath (which seems  
fragile, given the prevalence of automount these days, constructing  
arbitrary path prefixes).  Does that seem worthwhile?
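
A rough sketch of the idea, again in plain Python rather than pygr: the
class FileBackedDB and the file paths are hypothetical, and identity is
based purely on the absolute path string, which is exactly where the
fragility comes from.

```python
import os


class FileBackedDB:
    """Hypothetical stand-in for a sequence database opened from a file."""

    def __init__(self, filepath):
        self.filepath = os.path.abspath(filepath)

    def __hash__(self):
        # Hash on the path, so separately opened objects can collide.
        return hash(self.filepath)

    def __eq__(self, other):
        return (isinstance(other, FileBackedDB)
                and self.filepath == other.filepath)


db_a = FileBackedDB("/data/genomes/hg18")
db_b = FileBackedDB("/data/genomes/hg18")             # opened separately
db_c = FileBackedDB("/net/server/data/genomes/hg18")  # other mount path

registry = {db_a: "hg18"}
found_by_separate_open = registry.get(db_b)
found_by_other_path = registry.get(db_c)
```

Here db_b is found even though it is a different Python object, while
db_c is not, even if both paths happen to point at the same file on
disk.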

- currently NLMSA is read-only.  Once built, you can't add more data  
to it.  We could change this behavior (forcing a rebuild after new  
data was added), or we could implement a truly dynamic version of  
NLMSA (using tree structures rather than sorted arrays).

Yours,

Chris

On Oct 3, 2007, at 1:46 PM, Diane Trout wrote:

> Hi,
>
> I was using pygr and instead of waiting (potentially forever) to  
> properly prepare my questions and observations I just wanted to  
> toss out what I've gotten so far.
>
> http://bio.scipy.org/wiki/index.php/Retrieving_sequence_annotations_by_location_with_pygr
>
> is useful, though it might've been nice if it also talked about
> setting orientation, i.e. the annot class should have an
> attribute orientation set to either -1 for the reverse strand or +1
> for the "forward" strand.
>
> Once I build a NLMSA should I stick to using its seqDict for  
> building queries? (At least until I get to understanding pygr.Data.)
>
> Is it possible to add more annotations to a currently existing NLMSA?
>
> diane