[bip] pygr future features?
Christopher Lee
leec at chem.ucla.edu
Mon Sep 24 20:30:28 PDT 2007
Hi,
given that we're nearing completion of the 0.7 release, I thought it
would be useful to summarize some thoughts about possible next
features to implement, to get feedback about what people think would
be useful / higher priority. These are some fairly narrow ideas for
things that hopefully could be done quickly. A lot of these things
are more complicated to explain than to implement... I'd love to
hear your ideas as well. I'll divide this into several areas:
ALIGNMENT / ANNOTATION:
1. XMLRPC AnnotationDB: currently pygr doesn't support serving
AnnotationDB over XMLRPC. Should be easy though. That would give
pretty good coverage for remote services: sequence databases (done);
alignments (done); annotations (needed).
2. XMLRPC blast methods: current BlastDB will not serve its blast()
or megablast() methods over XMLRPC. Again, seems pretty
straightforward. But I'm not sure how useful this is for people.
3. perl XMLRPC client for NLMSA, BlastDB. Don't laugh. Folks at
UCSC asked for this with high priority, because it would allow them
to say that NLMSA provides a reasonably comprehensive remote service
for serving UCSC genome alignments to both python and perl users.
Should be easy to use a perl XMLRPC client to just get back
(src_start, src_stop, sequence ID, start, stop) tuples for any given
alignment query...
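To make the tuple interface concrete, here is a self-contained Python sketch: the query_alignment function, its signature, and the toy server are hypothetical stand-ins for the real NLMSA XMLRPC protocol. A perl client would make the same call via RPC::XML or similar.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical stand-in for an NLMSA alignment query: return
# (src_start, src_stop, target sequence ID, start, stop) tuples.
def query_alignment(seq_id, start, stop):
    return [(start, stop, "mm8.chr7", 1200, 1290)]

# Toy server on a random free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(query_alignment)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Any XMLRPC client (python, perl, ...) can now fetch plain tuples;
# note that XMLRPC marshals tuples as arrays (lists) on the wire.
proxy = ServerProxy("http://127.0.0.1:%d" % port)
hits = proxy.query_alignment("hg17.chr1", 100, 200)
print(hits)  # [[100, 200, 'mm8.chr7', 1200, 1290]]
server.shutdown()
```

Since the results are just flat tuples of ints and strings, no pygr-specific client code is needed on the perl side.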
4. sequence indexing for fast NLMSA joins: this is an internal issue
of how we assign IDs to sequences in an NLMSA. Currently, to join
two or more NLMSAs requires performing a query against one NLMSA,
taking the results and querying them against the second NLMSA, etc.
Each step that pulls results back out to Python (from the C
code in NLMSASlice) is much slower than staying in C, which hurts performance. In
particular, having to look up sequence IDs in disk index files, and
then create Python Sequence objects, is unnecessary for intermediate
steps in a join (since those intermediate steps just get thrown away
and not returned to the user). It would be far better to be able to
keep sequence IDs organized in a way that can be directly converted
between different NLMSAs without any lookup. There is a
straightforward way to do this, which just requires a slight change
to how nlmsa_id values are allocated to sequences in an NLMSA. This is a
purely internal issue, but makes possible additional features...
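As a rough sketch of the idea (the names and layout are mine, not pygr's): if each sequence database is assigned one contiguous block of nlmsa_id values, then translating an ID between two NLMSAs that index the same database is pure integer arithmetic, with no index-file lookup and no intermediate Python Sequence objects.

```python
class IDBlockAllocator:
    """Assign each sequence DB one contiguous block of integer IDs."""
    def __init__(self):
        self.next_id = 0
        self.blocks = {}  # db name -> (start, stop) half-open ID range

    def register(self, db_name, n_seqs):
        start = self.next_id
        self.next_id += n_seqs
        self.blocks[db_name] = (start, self.next_id)
        return start

# Two NLMSAs that index the same database at different offsets:
msa1, msa2 = IDBlockAllocator(), IDBlockAllocator()
msa1.register("hg17", 46)   # hg17 occupies IDs 0..46 in msa1
msa2.register("mm8", 21)    # mm8 occupies IDs 0..21 in msa2
msa2.register("hg17", 46)   # hg17 occupies IDs 21..67 in msa2

def translate(seq_id, db_name, src, dst):
    # Pure offset arithmetic -- no disk lookup, no Sequence object
    # creation for the intermediate steps of a join.
    return seq_id - src.blocks[db_name][0] + dst.blocks[db_name][0]

print(translate(10, "hg17", msa1, msa2))  # 31
```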
5. fast & easy filtering of results: the current mechanism by which
groupBySequences allows you to filter what sequences you want
reported is too slow in my opinion, because it uses Python
dictionaries. Often you want to get results from just one target
genome (or list of genomes). Using the sequence indexing proposal
above, this could be done in C as just an interval range check (we
could actually use IntervalDB for this), which would be so fast you
could use it for all sorts of interesting applications. For example,
you could store many different kinds of annotations in a single NLMSA
(rather than a separate NLMSA for each), each kind as a separate
annotation database with a distinct range of IDs, and easily /
quickly do queries that are restricted to just a desired subset of
annotation types.
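In pure Python the filtering idea looks like the following sketch (the ID ranges and result tuples are invented for illustration); the point of the proposal is that the same comparison could run in C, e.g. via IntervalDB, instead of hashing every hit through a Python dictionary.

```python
# Hypothetical: each annotation type owns one contiguous nlmsa_id range.
ID_RANGES = {
    "exon": (0, 100000),
    "snp": (100000, 200000),
    "repeat": (200000, 300000),
}

def filter_hits(hits, wanted):
    """Keep only hits whose nlmsa_id falls in a wanted type's ID range."""
    ranges = [ID_RANGES[name] for name in wanted]
    return [h for h in hits
            if any(lo <= h[0] < hi for lo, hi in ranges)]

# toy hits: (nlmsa_id, start, stop)
hits = [(42, 10, 50), (100007, 30, 31), (200099, 5, 400)]
print(filter_hits(hits, ["exon", "repeat"]))
# [(42, 10, 50), (200099, 5, 400)]
```

Because a range check is just two integer comparisons, restricting a query to a subset of annotation types costs essentially nothing per hit.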
6. fast joining of multiple NLMSAs: currently NLMSASlice either
performs a one-step (pairwiseMode) or two-step (true MSA) query
process. By merely generalizing this to an N-step process, it could
be made to perform fast joins of multiple NLMSAs entirely in C,
without having to go back out into Python (i.e. create Python objects
for the intermediate results during the join process). My intuitive
feeling is that this would give a big performance increase.
Furthermore, it gives a useful new feature (joins) that otherwise
users would have to write a fair amount of code to do (nested for-
loops and proper intersection code, which is easy to get wrong). I
think this generalization is straightforward, but it would need to be
done carefully. One key simplifying idea would be to require that
all the source intervals (query) be from exactly one NLMSA, and all
the final intervals (results) be from exactly one NLMSA. (Currently
both query and results have to be from the same NLMSA). One could
also go further and explore the idea of storing queries, results, or
intermediate results as "temporary tables" that would just be NLMSAs
(in memory or on disk). This is an area where one can generate as
much work as one wants (by just making up new ideas), so the key is
to find some clear first step that is both doable and valuable.
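For concreteness, here is roughly the nested-loop join a user must write today (the data layout is invented for illustration): query NLMSA #1, then intersect each intermediate interval against NLMSA #2. Pushing this loop down into C, keyed on raw nlmsa_ids, is the proposed speedup. Note the simplification: this intersects on intermediate-sequence coordinates only; a real join must then map each overlap into target coordinates.

```python
def intersect(s1, e1, s2, e2):
    """Overlap of two half-open intervals, or None if empty."""
    s, e = max(s1, s2), min(e1, e2)
    return (s, e) if s < e else None

# Step 1 results: intervals on the intermediate genome (toy data)
step1 = [("mm8.chr7", 1100, 1200), ("mm8.chr7", 1400, 1500)]
# Second NLMSA index: intermediate seq -> (target seq, start, stop)
msa2 = {"mm8.chr7": [("rn4.chr1", 1150, 1450), ("rn4.chr5", 1480, 1600)]}

def join(step1_hits, msa2_index):
    out = []
    for seq, start, stop in step1_hits:
        for target, ts, te in msa2_index.get(seq, []):
            iv = intersect(start, stop, ts, te)
            if iv:  # easy to get the emptiness test wrong here
                out.append((target,) + iv)
    return out

print(join(step1, msa2))
# [('rn4.chr1', 1150, 1200), ('rn4.chr1', 1400, 1450),
#  ('rn4.chr5', 1480, 1500)]
```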
7. keep BLAST edge info? NLMSA does not store separate edge
information for each alignment relation, so BLAST information like
the expectation score for each hit is not kept. There are several
ways this could be resolved, but I'm unsure how important people
think this is.
8. nucleotide-to-protein (e.g. ORF) annotation. There are many
reasons for needing a separate coordinate system for nucleotide to
protein alignments. For example, what if the query is a nucleotide
sequence and finds a reverse-complement homology to a protein
sequence? I.e. when the query is reverse-complemented, it has a
translated-homology to that protein sequence. The result of any
alignment query must always be returned in the same orientation as
the user-supplied query, so the homologous protein interval must be
returned in "negative orientation" -- which of course does not exist
for a true protein sequence. A related issue is how to robustly
represent the reading frame "phase" for any given part of such an
alignment (i.e. the ability to represent alignment to a "partial
codon"). For example, aligning a protein against exons (nucleotide
coordinate system) may split a single codon across two different exons!
One elegant solution to this is to create an annotation for the
protein sequence interval, but using a nucleotide coordinate system.
This coordinate system solves the otherwise vexing issues of how to
represent "phase" and orientation for an aligned protein. This
annotation coordinate system is usable in all the normal ways
(slicing; NLMSA etc). If the user is interested in the phase and
orientation, s/he can see that directly from the annotation's
coordinates. If s/he wants the corresponding protein sequence
interval that it represents, s/he just gets the annotation's sequence
attribute (as usual). We could map such ORF annotations directly
onto genomic sequence, use them to store tblastn or blastx results, etc.
To see the underlying value of this, consider the alternatives. For
example, NLMSA uses integer coordinates. In the absence of such an
"ORF annotation" coordinate system, we would have to convert NLMSA to
a floating point alignment representation (to deal with all the
possible permutations of protein vs. nucleotide query vs. result
alignment scaling and phasing). I personally think that this is a
really bad idea, and would be likely to lead to endless trouble (for
example, floating point rounding errors).
9. tblastn and blastx support. Currently BlastDB doesn't support
these flavors of blastall, because of the above phase / orientation
issues for how to hand the results back to the user. Based on #8, we
could easily provide this.
10. read other alignment formats. Currently we read only the UCSC
MAF and axtNet formats, on the assumption that biopython and others
provide parsers for other alignment formats. However, to make things
easy for users, we should decide either a) exactly how they should
use those external parsers to read alignments into pygr NLMSAs, or b)
provide some of our own reading methods if a) seems too complicated
for users.
11. interface with other alignment programs like CLUSTAL, etc. Right
now, the only external "alignment" program we interface to is
blastall. It would be nice to provide multiple alignment services in
a convenient pygr framework.
PYGR.DATA FEATURES
12. download=True option. Currently pygr.Data provides access to
things like NLMSA and BlastDB either locally (using files already
accessible to your computer) or remotely (via XMLRPC). The remote
service was intended to let users jump directly into using pygr
without having to go through all the complexities of obtaining big
databases and figuring out how to set them up. However, remote
services are not adequate for large scale usage (e.g. genome-wide
queries), for reasons that are obvious. One idea I've wanted to
implement is to accept a download=True option to any pygr.Data
constructor, e.g.
nlmsa = pygr.Data.Bio.MSA.UCSC.hg17_multiz_28way(download=True)
This option would cause pygr.Data to download the data to your
computer (including dependencies!), and automatically build it ready
for use as a LOCAL resource. The user gains the performance of local
resources, combined with the convenience of a remote resource (i.e. a
remote server passes your computer all the data for constructing the
local resource, and the user has to do... nada). The neat thing is
how incredibly easy this would be to implement (at least for existing
data types like NLMSA and BlastDB).
13. signed-pickle security: sad to say, Python unpickling has
security vulnerabilities (because it has access to import), so as
soon as you start encouraging people to share lots of data with each
other, you need to start thinking about security. This concern
applies to ANY pygr.Data access path that involves pickled data from
somebody else, regardless of whether the transport mechanism is
XMLRPC, MySQL, or the user just copying a file. This is probably not
an immediate issue (i.e. I trust data from Titus, and maybe he trusts
data from me), but if usage grows and random people start loading
data from other random people, security becomes a serious issue.
Pygr.Data already has a stub for keeping a security code with each
pygr.Data ID, but it currently is unused.
One simple solution would be to use PGP-based signatures. A data
producer signs a given pickle using their private key (i.e. they
produce a checksum for the data that is signed with their private
key, and verifiable with their public key). Then the unpickler first
verifies the data before unpickling it, by getting the producer's
public key (from a trusted PGP key repository), and computing a
checksum on the actual pickle data and verifying it with the public
key. Users have to keep a list of who they are willing to accept
data from (i.e. "trusted data producers"). Obviously this issue has
been solved before for the web, so we could use those certificates
instead of PGP if that is preferable.
This obviously isn't a pygr issue; it's a Python issue. But it seems
simple enough to "just do it" (or convince someone in the Python
development team to do it?). It seems to me that pickle is an
undervalued and under-utilized Python feature, that deserves to be
given real respect (and usage!)...
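The verify-before-unpickle flow described above can be sketched in a few lines. This uses a shared-secret HMAC in place of real PGP/public-key signatures so the example stays self-contained; a real deployment would instead verify a signature against the producer's public key from a trusted repository.

```python
import hashlib
import hmac
import pickle

KEY = b"stand-in shared secret"  # a real system would use PGP key pairs

def sign_pickle(obj, key=KEY):
    data = pickle.dumps(obj)
    sig = hmac.new(key, data, hashlib.sha256).digest()
    return sig + data  # 32-byte signature prepended to the pickle

def verify_and_load(blob, key=KEY):
    sig, data = blob[:32], blob[32:]
    good = hmac.new(key, data, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, good):
        # never call pickle.loads() on unverified data
        raise ValueError("bad signature; refusing to unpickle")
    return pickle.loads(data)

blob = sign_pickle({"resource": "Bio.MSA.UCSC.hg17_multiz_28way"})
print(verify_and_load(blob))
try:
    verify_and_load(bytes([blob[0] ^ 1]) + blob[1:])  # tampered
except ValueError as e:
    print(e)  # bad signature; refusing to unpickle
```

The essential rule is that the checksum is verified before the pickle bytes ever reach the unpickler, so malicious pickles from untrusted producers are rejected without being executed.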
14. schema extensions: one obvious idea is to start formalizing the
pygr.Data namespace where appropriate. For example, we have chosen
the UniProt species nomenclature as the IDs we will use for species,
e.g. in pygr.Data.Bio.Seq.Genome.HUMAN.hg17, HUMAN is the unique key
assigned by UniProt to H. sapiens. It therefore makes sense to make
this schema explicit (i.e. known to pygr.Data, by telling it that for
the namespace Bio.Seq.Genome.X, X must be a primary key from the
UniProt nomenclature). In this case, if a user attempts to add
something not allowed by the namespace, they get an exception
explaining that they need to use the right species code, and how to
look that up. We can easily provide an override that makes it accept
the name you wanted to assign, for cases where you have an organism
that no one's ever heard of before... The principle here is that if
we want a whole population of users to work together in a rational
way (e.g. use a consistent naming system), we need to make it doggone
easy for users to do the right thing! And that is the beauty of
making schema explicit: if pygr.Data knows what the schema is
supposed to be, it can help users follow a correct, rational,
consistent policy, by making that policy immediately visible and usable.
This could be implemented in two easy steps. First, write some code
supporting such "namespace hints". Second, add such namespace hints
to our standard XMLRPC server that every pygr.Data user defaults to
(if their PYGRDATAPATH is blank). I guess we should also consider
what to do if the server is inaccessible (e.g. if the user is not
connected to the internet).
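A sketch of what such a "namespace hint" check might look like (the function, the tiny species table, and the force= override are all illustrative, not actual pygr.Data API):

```python
# Tiny excerpt of the UniProt species mnemonics, for illustration.
UNIPROT_SPECIES = {"HUMAN", "MOUSE", "RAT", "YEAST", "DROME"}

def check_genome_id(res_id, force=False):
    """Enforce Bio.Seq.Genome.<SPECIES>.<build> naming, unless forced."""
    parts = res_id.split(".")
    if parts[:3] == ["Bio", "Seq", "Genome"] and len(parts) >= 4:
        species = parts[3]
        if species not in UNIPROT_SPECIES and not force:
            raise KeyError(
                "%r is not a UniProt species code; look up the correct "
                "mnemonic in the UniProt species list, or pass "
                "force=True for an unlisted organism" % species)
    return res_id

print(check_genome_id("Bio.Seq.Genome.HUMAN.hg17"))   # accepted
try:
    check_genome_id("Bio.Seq.Genome.Human.hg17")      # wrong case
except KeyError as e:
    print(e)
print(check_genome_id("Bio.Seq.Genome.WEIRDBUG.v1", force=True))
```

The error message is the point: a user who picks a nonstandard name is told immediately what the policy is and how to comply with it.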
MISCELLANEOUS ISSUES
15. change license from GPL to BSD: people seem to think this would
be an improvement... As Andrew Dalke points out, whether people will
contribute changes they made to pygr code back is probably more a
function of culture (i.e. whether they want to contribute) than
something that can just be turned on or off by the legal terms of the
license.
16. switch to a public subversion server so everyone can get the
latest code and participate in pygr development if they want to. The
main issue is where to host it. I've heard some complaints about
SourceForge as a lackluster platform, and suggestions that we switch
to Google Code or host it ourselves (Titus, want to host it on
idyll.org along with BIP?).
Please send feedback to this mailing list (biology-in-
python at lists.idyll.org).
Yours,
Chris