[bip] Pygr questions

Thu Oct 4 14:51:31 PDT 2007

Hi Diane,
answers below...

On Oct 4, 2007, at 10:37 AM, Diane Trout wrote:

> On Thu, Oct 04, 2007 at 09:55:40AM -0700, Christopher Lee wrote:
>> Hi Diane,
>> a few answers:
>>
>> - regarding the "orientation" attribute.  This is actually optional;
>> if you want you can simply use negative (start,stop) coordinates to
>> specify reverse orientation.
>
> Ok, I was a bit confused about the proper values for negative  
> coordinates which is why I stuck with the familiar  
> (start,stop,orientation).
>
> Imagining a 1000 bp sequence is a (start, stop, orientation) of  
> (800, 900, -1) == (-200, -100) AKA [start-len(seq):stop-len(seq)]?  
> It might've been useful to have an illustrative example showing how  
> the various coordinate systems match up in the pygr reference docs.
>
It's actually simpler than that: a (start, stop) pair of (-900, -800)
means the negative-strand interval corresponding to (800, 900), i.e.
its reverse-complement interval.  The advantage is that this doesn't
depend on the sequence length, so extending the sequence won't affect
existing coordinates at all.  This allows a sequence to be mutable,
like a Python list, rather than immutable, like a Python tuple, which
is important for things like union coordinate systems that you can
append new sequences to.

From the SeqPath docs: attribute "start: start coordinate of the
interval. NB: SeqPath stores coordinates relative to the start of the
forward strand. This is necessary for allowing resizing of the
top-level SeqPath; if coordinates were relative to the end of the
sequence, they would have to be recomputed every time the length of
the sequence changed. The main consequence of this is that
coordinates for forward intervals are always positive, whereas
coordinates for reverse intervals are always negative (i.e. following
the Python convention that negative coordinates count backwards from
the end, and the fact that the end of the reverse strand corresponds
to the start of the forward strand)."

Just giving an orientation value, as you suggested, eliminates having  
to think about these issues at all...
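
For anyone who does want to see how the two notations line up, here is
a minimal sketch in plain Python.  It does not use pygr at all: the
helper names (revcomp, to_signed, slice_signed) are made up for this
illustration, and ordinary strings stand in for sequence objects.  It
only shows that (800, 900, -1) maps to (-900, -800), and that the
signed pair keeps pointing at the same bases after the sequence grows.

```python
import random

# Complement table for an unambiguous DNA alphabet (assumption: ACGT only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a plain DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def to_signed(start, stop, orientation=1):
    """Convert (start, stop, orientation) to a signed (start, stop) pair.

    Hypothetical helper: reverse intervals become (-stop, -start).
    """
    if orientation < 0:
        return (-stop, -start)
    return (start, stop)

def slice_signed(fwd, start, stop):
    """Slice a forward-strand string using a signed (start, stop) pair."""
    if start >= 0:
        return fwd[start:stop]
    # Negative coordinates count back from the end of the reverse strand,
    # which corresponds to the start of the forward strand.
    rc = revcomp(fwd)
    return rc[start:(stop or None)]  # a stop of 0 means "to the very end"

rng = random.Random(0)
fwd = "".join(rng.choice("ACGT") for _ in range(1000))

coords = to_signed(800, 900, -1)
before = slice_signed(fwd, *coords)

# Append 500 more bases to the forward strand; the signed pair is unchanged.
longer = fwd + "".join(rng.choice("ACGT") for _ in range(500))
after = slice_signed(longer, *coords)
```

Both before and after equal revcomp(fwd[800:900]), whereas an
end-relative pair such as (-200, -100) would have to be recomputed once
the sequence length changed.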

>
>> s = nlmsa.seqDict.prefixDict[prefix][id]
>> You can then slice s any way you want and use it or its slices as a
>> query to the NLMSA.
>
> I wonder if making a slice that way is faster than constructing the  
> string prefix+"."+id?
It probably is faster, as it would avoid the string join / split  
operations...
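
To make the difference concrete, here is a toy model in plain Python.
ToyPrefixUnion is a hypothetical stand-in, not pygr's actual class, and
the "databases" are ordinary dicts of strings; the only thing borrowed
from the thread is the prefixDict attribute name and the
prefix + "." + id key form.  The string route has to build the key and
split it again, while the direct route is two dictionary lookups.

```python
class ToyPrefixUnion:
    """Toy union of named sequence databases (illustration only)."""

    def __init__(self, prefix_dict, separator="."):
        self.prefixDict = prefix_dict   # maps prefix -> {id: sequence}
        self.separator = separator

    def __getitem__(self, key):
        # String route: split "prefix.id" back into its two parts.
        prefix, seq_id = key.split(self.separator, 1)
        return self.prefixDict[prefix][seq_id]

# Made-up example data.
genomes = {
    "hg18": {"chr1": "ACGTACGTAC"},
    "mm8": {"chr1": "TTGACCA"},
}
union = ToyPrefixUnion(genomes)

prefix, seq_id = "hg18", "chr1"
via_string = union[prefix + "." + seq_id]       # join, then split again
direct = union.prefixDict[prefix][seq_id]       # no string handling
piece = direct[2:6]                             # slice it any way you want
```

Both routes return the same object; the direct one just skips the
string handling.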

>
>> It occurs to me we could implement a __hash__ for sequence databases
>> such that any pair of sequence database objects that actually are
>> derived from the same file would be treated as the "same database"
>> for NLMSA.seqDict and other purposes...  In this case you could open
>> the same database separately and it would work for querying the
>> NLMSA... but only if it was exactly the same filepath (which seems
>> fragile, given the prevalence of automount these days, constructing
>> arbitrary path prefixes).  Does that seem worthwhile?
>
> That would've simplified my fumbling around but if some of the  
> introductory tutorials illustrated seqDict I would've used it.
>
OK, we need to add this to the tutorials...
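
On the __hash__ idea quoted above, a rough sketch of what path-based
identity could look like, purely as an assumption and not anything
pygr implements: FileBackedDB is a hypothetical class, the paths are
invented, and os.path.realpath is used as one possible way to
normalize them.  It also shows the fragility mentioned there: the same
file reached through a different mount prefix would not match.

```python
import os

class FileBackedDB:
    """Hypothetical sequence database identified by its file path."""

    def __init__(self, filepath):
        self.filepath = filepath
        # Normalize "..", redundant separators and symlinks where possible.
        self._key = os.path.realpath(filepath)

    def __eq__(self, other):
        if not isinstance(other, FileBackedDB):
            return NotImplemented
        return self._key == other._key

    def __hash__(self):
        # Must agree with __eq__: equal objects need equal hashes.
        return hash(self._key)

# Two separately opened handles on what is textually the same file.
db1 = FileBackedDB("/data/genomes/hg18")
db2 = FileBackedDB("/data/genomes/../genomes/hg18")

registry = {db1: "alignment index"}
found = registry[db2]           # lookup succeeds with the second handle

# A different mount prefix cannot be recognized from the path alone.
db3 = FileBackedDB("/net/server/genomes/hg18")
same_file_guess = (db3 == db1)
```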

>> - currently NLMSA is read-only.  Once built, you can't add more data
>> to it.  We could change this behavior (forcing a rebuild after new
>> data was added), or we could implement a truly dynamic version of
>> NLMSA (using tree structures rather than sorted arrays).
>
> A possibly simpler, but still useful, solution would be some way of  
> combining some NLMSAs together into a new NLMSA.
>
That's a nice idea.  I'll have to think about this.
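
As a very rough picture of why combining could be simpler than a fully
dynamic structure: if each alignment index is a sorted array, two of
them can be merged into a new sorted array in one linear pass.  The
sketch below is plain Python with made-up data and hypothetical helper
names; it says nothing about how NLMSA actually lays out its data on
disk.

```python
import heapq

def merge_sorted_intervals(*interval_lists):
    """Merge already-sorted lists of (start, stop, label) into one sorted list."""
    return list(heapq.merge(*interval_lists))

def overlapping(intervals, start, stop):
    """Return intervals overlapping [start, stop); linear scan for clarity."""
    return [iv for iv in intervals if iv[0] < stop and iv[1] > start]

# Two independently built, read-only "indexes" (invented example data).
index_a = [(0, 100, "s1"), (500, 600, "s2")]
index_b = [(50, 150, "s3"), (700, 800, "s4")]

combined = merge_sorted_intervals(index_a, index_b)
hits = overlapping(combined, 90, 120)
```

The inputs are left untouched, so the read-only property of each
original index is preserved; only the new combined index is built.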

--Chris