[bip] pygr future features?
Christopher Lee
leec at chem.ucla.edu
Mon Sep 24 20:30:28 PDT 2007
Hi,
given that we're nearing completion of the 0.7 release, I thought it
would be useful to summarize some thoughts about possible next
features to implement, to get feedback about what people think would
be useful / higher priority. These are some fairly narrow ideas for
things that hopefully could be done quickly. A lot of these things
are more complicated to explain than to implement... I'd love to
hear your ideas as well. I'll divide this into several areas:
ALIGNMENT / ANNOTATION:
1. XMLRPC AnnotationDB: currently pygr doesn't support serving
AnnotationDB over XMLRPC. Should be easy though. That would give
pretty good coverage for remote services: sequence databases (done);
alignments (done); annotations (needed).
2. XMLRPC blast methods: current BlastDB will not serve its blast()
or megablast() methods over XMLRPC. Again, seems pretty
straightforward. But I'm not sure how useful this is for people.
3. perl XMLRPC client for NLMSA, BlastDB. Don't laugh. Folks at
UCSC asked for this with high priority, because it would allow them
to say that NLMSA provides a reasonably comprehensive remote service
for serving UCSC genome alignments to both python and perl users.
Should be easy to use a perl XMLRPC client to just get back
(src_start, src_stop, sequence ID, start, stop) tuples for any given
alignment query...
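To make the tuple interface concrete, here is a self-contained Python sketch: the query_alignment function, its signature, and the toy server are hypothetical stand-ins for the real NLMSA XMLRPC protocol. A perl client would make the same call via RPC::XML or similar.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical stand-in for an NLMSA alignment query: return
# (src_start, src_stop, target sequence ID, start, stop) tuples.
def query_alignment(seq_id, start, stop):
    return [(start, stop, "mm8.chr7", 1200, 1290)]

# Toy server on a random free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(query_alignment)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Any XMLRPC client (python, perl, ...) can now fetch plain tuples;
# note that XMLRPC marshals tuples as arrays (lists) on the wire.
proxy = ServerProxy("http://127.0.0.1:%d" % port)
hits = proxy.query_alignment("hg17.chr1", 100, 200)
print(hits)  # [[100, 200, 'mm8.chr7', 1200, 1290]]
server.shutdown()
```

Since the results are just flat tuples of ints and strings, no pygr-specific client code is needed on the perl side.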
4. sequence indexing for fast NLMSA joins: this is an internal issue
of how we assign IDs to sequences in an NLMSA. Currently, to join
two or more NLMSAs requires performing a query against one NLMSA,
taking the results and querying them against the second NLMSA, etc.
Each step that pulls results back out to Python (from the C
code in NLMSASlice) is much slower than staying in C, which hurts performance. In
particular, having to look up sequence IDs in disk index files, and
then create Python Sequence objects, is unnecessary for intermediate
steps in a join (since those intermediate steps just get thrown away
and not returned to the user). It would be far better to be able to
keep sequence IDs organized in a way that can be directly converted
between different NLMSAs without any lookup. There is a
straightforward way to do this, which just requires a slight change
to how nlmsa_id values are allocated to sequences in an NLMSA. This is a
purely internal issue, but makes possible additional features...
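As a rough sketch of the idea (the names and layout are mine, not pygr's): if each sequence database is assigned one contiguous block of nlmsa_id values, then translating an ID between two NLMSAs that index the same database is pure integer arithmetic, with no index-file lookup and no intermediate Python Sequence objects.

```python
class IDBlockAllocator:
    """Assign each sequence DB one contiguous block of integer IDs."""
    def __init__(self):
        self.next_id = 0
        self.blocks = {}  # db name -> (start, stop) half-open ID range

    def register(self, db_name, n_seqs):
        start = self.next_id
        self.next_id += n_seqs
        self.blocks[db_name] = (start, self.next_id)
        return start

# Two NLMSAs that index the same database at different offsets:
msa1, msa2 = IDBlockAllocator(), IDBlockAllocator()
msa1.register("hg17", 46)   # hg17 occupies IDs 0..46 in msa1
msa2.register("mm8", 21)    # mm8 occupies IDs 0..21 in msa2
msa2.register("hg17", 46)   # hg17 occupies IDs 21..67 in msa2

def translate(seq_id, db_name, src, dst):
    # Pure offset arithmetic -- no disk lookup, no Sequence object
    # creation for the intermediate steps of a join.
    return seq_id - src.blocks[db_name][0] + dst.blocks[db_name][0]

print(translate(10, "hg17", msa1, msa2))  # 31
```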
5. fast & easy filtering of results: the current mechanism by which
groupBySequences allows you to filter what sequences you want
reported is too slow in my opinion, because it uses Python
dictionaries. Often you want to get results from just one target
genome (or list of genomes). Using the sequence indexing proposal
above, this could be done in C as just an interval range check (we
could actually use IntervalDB for this), which would be so fast you
could use it for all sorts of interesting applications. For example,
you could store many different kinds of annotations in a single NLMSA
(rather than a separate NLMSA for each), each kind as a separate
annotation database with a distinct range of IDs, and easily /
quickly do queries that are restricted to just a desired subset of
annotation types.
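In pure Python the filtering idea looks like the following sketch (the ID ranges and result tuples are invented for illustration); the point of the proposal is that the same comparison could run in C, e.g. via IntervalDB, instead of hashing every hit through a Python dictionary.

```python
# Hypothetical: each annotation type owns one contiguous nlmsa_id range.
ID_RANGES = {
    "exon": (0, 100000),
    "snp": (100000, 200000),
    "repeat": (200000, 300000),
}

def filter_hits(hits, wanted):
    """Keep only hits whose nlmsa_id falls in a wanted type's ID range."""
    ranges = [ID_RANGES[name] for name in wanted]
    return [h for h in hits
            if any(lo <= h[0] < hi for lo, hi in ranges)]

# toy hits: (nlmsa_id, start, stop)
hits = [(42, 10, 50), (100007, 30, 31), (200099, 5, 400)]
print(filter_hits(hits, ["exon", "repeat"]))
# [(42, 10, 50), (200099, 5, 400)]
```

Because a range check is just two integer comparisons, restricting a query to a subset of annotation types costs essentially nothing per hit.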
6. fast joining of multiple NLMSAs: currently NLMSASlice either
performs a one-step (pairwiseMode) or two-step (true MSA) query
process. By merely generalizing this to an N-step process, it could
be made to perform fast joins of multiple NLMSAs entirely in C,
without having to go back out into Python (i.e. create Python objects
for the intermediate results during the join process). My intuitive
feeling is that this would give a big performance increase.
Furthermore, it gives a useful new feature (joins) that otherwise
users would have to write a fair amount of code to do (nested for-
loops and proper intersection code, which is easy to get wrong). I
think this generalization is straightforward, but it would need to be
done carefully. One key simplifying idea would be to require that
all the source intervals (query) be from exactly one NLMSA, and all
the final intervals (results) be from exactly one NLMSA. (Currently
both query and results have to be from the same NLMSA). One could
also go further and explore the idea of storing queries, results, or
intermediate results as "temporary tables" that would just be NLMSAs
(in memory or on disk). This is an area where one can generate as
much work as one wants (by just making up new ideas), so the key is
to find some clear first step that is both doable and valuable.
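For concreteness, here is roughly the nested-loop join a user must write today (the data layout is invented for illustration): query NLMSA #1, then intersect each intermediate interval against NLMSA #2. Pushing this loop down into C, keyed on raw nlmsa_ids, is the proposed speedup. Note the simplification: this intersects on intermediate-sequence coordinates only; a real join must then map each overlap into target coordinates.

```python
def intersect(s1, e1, s2, e2):
    """Overlap of two half-open intervals, or None if empty."""
    s, e = max(s1, s2), min(e1, e2)
    return (s, e) if s < e else None

# Step 1 results: intervals on the intermediate genome (toy data)
step1 = [("mm8.chr7", 1100, 1200), ("mm8.chr7", 1400, 1500)]
# Second NLMSA index: intermediate seq -> (target seq, start, stop)
msa2 = {"mm8.chr7": [("rn4.chr1", 1150, 1450), ("rn4.chr5", 1480, 1600)]}

def join(step1_hits, msa2_index):
    out = []
    for seq, start, stop in step1_hits:
        for target, ts, te in msa2_index.get(seq, []):
            iv = intersect(start, stop, ts, te)
            if iv:  # easy to get the emptiness test wrong here
                out.append((target,) + iv)
    return out

print(join(step1, msa2))
# [('rn4.chr1', 1150, 1200), ('rn4.chr1', 1400, 1450),
#  ('rn4.chr5', 1480, 1500)]
```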
7. keep BLAST edge info? NLMSA does not store separate edge
information for each alignment relation, so BLAST information like
the expectation score for each hit is not kept. There are several
ways this could be resolved, but I'm unsure how important people
think this is.
8. nucleotide-to-protein (e.g. ORF) annotation. There are many
reasons for needing a separate coordinate system for nucleotide to
protein alignments. For example, what if the query is a nucleotide
sequence and finds a reverse-complement homology to a protein
sequence? I.e. when the query is reverse-complemented, it has a
translated-homology to that protein sequence. The result of any
alignment query must always be returned in the same orientation as
the user-supplied query, so the homologous protein interval must be
returned in "negative orientation" -- which of course does not exist
for a true protein sequence. A related issue is how to robustly
represent the reading frame "phase" for any given part of such an
alignment (i.e. the ability to represent alignment to a "partial
codon"). For example, aligning a protein against exons (nucleotide
coordinate system) may split a single codon across two different exons!
One elegant solution to this is to create an annotation for the
protein sequence interval, but using a nucleotide coordinate system.
This coordinate system solves the otherwise vexing issues of how to
represent "phase" and orientation for an aligned protein. This
annotation coordinate system is usable in all the normal ways
(slicing; NLMSA etc). If the user is interested in the phase and
orientation, s/he can see that directly from the annotation's
coordinates. If s/he wants the corresponding protein sequence
interval that it represents, s/he just gets the annotation's sequence
attribute (as usual). We could map such ORF annotations directly
onto genomic sequence, use them to store tblastn or blastx results, etc.
To see the underlying value of this, consider the alternatives. For
example, NLMSA uses integer coordinates. In the absence of such an
"ORF annotation" coordinate system, we would have to convert NLMSA to
a floating point alignment representation (to deal with all the
possible permutations of protein vs. nucleotide query vs. result
alignment scaling and phasing). I personally think that this is a
really bad idea, and would be likely to lead to endless trouble (for
example, floating point rounding errors).
9. tblastn and blastx support. Currently BlastDB doesn't support
these flavors of blastall, because of the above phase / orientation
issues for how to hand the results back to the user. Based on #8, we
could easily provide this.
10. read other alignment formats. Currently we read only the UCSC
MAF and axtNet formats, on the assumption that biopython and others
provide parsers for other alignment formats. However, to make things
easy for users, we should decide either a) exactly how they should
use those external parsers to read alignments into pygr NLMSAs, or b)
provide some of our own reading methods if a) seems too complicated
for users.
11. interface with other alignment programs like CLUSTAL, etc. Right
now, the only external "alignment" program we interface to is
blastall. It would be nice to provide multiple alignment services in
a convenient pygr framework.
PYGR.DATA FEATURES
12. download=True option. Currently pygr.Data provides access to
things like NLMSA and BlastDB either locally (using files already
accessible to your computer) or remotely (via XMLRPC). The remote
service was intended to let users jump directly into using pygr
without having to go through all the complexities of obtaining big
databases and figuring out how to set them up. However, remote
services are not adequate for large scale usage (e.g. genome-wide
queries), for reasons that are obvious. One idea I've wanted to
implement is to accept a download=True option to any pygr.Data
constructor, e.g.
nlmsa = pygr.Data.Bio.MSA.UCSC.hg17_multiz_28way(download=True)
This option would cause pygr.Data to download the data to your
computer (including dependencies!), and automatically build it ready
for use as a LOCAL resource. The user gains the performance of local
resources, combined with the convenience of a remote resource (i.e. a
remote server passes your computer all the data for constructing the
local resource, and the user has to do... nada). The neat thing is
how incredibly easy this would be to implement (at least for existing
data types like NLMSA and BlastDB).
13. signed-pickle security: sad to say, Python unpickling has
security vulnerabilities (because it has access to import), so as
soon as you start encouraging people to share lots of data with each
other, you need to start thinking about security. This concern
applies to ANY pygr.Data access path that involves pickled data from
somebody else, regardless of whether the transport mechanism is
XMLRPC, MySQL, or the user just copying a file. This is probably not
an immediate issue (i.e. I trust data from Titus, and maybe he trusts
data from me), but if usage grows and random people start loading
data from other random people, security becomes a serious issue.
Pygr.Data already has a stub for keeping a security code with each
pygr.Data ID, but it currently is unused.
One simple solution would be to use PGP-based signatures. A data
producer signs a given pickle using their private key (i.e. they
produce a checksum for the data that is signed with their private
key, and verifiable with their public key). Then the unpickler first
verifies the data before unpickling it, by getting the producer's
public key (from a trusted PGP key repository), and computing a
checksum on the actual pickle data and verifying it with the public
key. Users have to keep a list of who they are willing to accept
data from (i.e. "trusted data producers"). Obviously this issue has
been solved before for the web, so we could use those certificates
instead of PGP if that is preferable.
This obviously isn't a pygr issue; it's a Python issue. But it seems
simple enough to "just do it" (or convince someone in the Python
development team to do it?). It seems to me that pickle is an
undervalued and under-utilized Python feature, that deserves to be
given real respect (and usage!)...
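The verify-before-unpickle flow described above can be sketched in a few lines. This uses a shared-secret HMAC in place of real PGP/public-key signatures so the example stays self-contained; a real deployment would instead verify a signature against the producer's public key from a trusted repository.

```python
import hashlib
import hmac
import pickle

KEY = b"stand-in shared secret"  # a real system would use PGP key pairs

def sign_pickle(obj, key=KEY):
    data = pickle.dumps(obj)
    sig = hmac.new(key, data, hashlib.sha256).digest()
    return sig + data  # 32-byte signature prepended to the pickle

def verify_and_load(blob, key=KEY):
    sig, data = blob[:32], blob[32:]
    good = hmac.new(key, data, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, good):
        # never call pickle.loads() on unverified data
        raise ValueError("bad signature; refusing to unpickle")
    return pickle.loads(data)

blob = sign_pickle({"resource": "Bio.MSA.UCSC.hg17_multiz_28way"})
print(verify_and_load(blob))
try:
    verify_and_load(bytes([blob[0] ^ 1]) + blob[1:])  # tampered
except ValueError as e:
    print(e)  # bad signature; refusing to unpickle
```

The essential rule is that the checksum is verified before the pickle bytes ever reach the unpickler, so malicious pickles from untrusted producers are rejected without being executed.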
14. schema extensions: one obvious idea is to start formalizing the
pygr.Data namespace where appropriate. For example, we have chosen
the UniProt species nomenclature as the IDs we will use for species,
e.g. in pygr.Data.Bio.Seq.Genome.HUMAN.hg17, HUMAN is the unique key
assigned by UniProt to H. sapiens. It therefore makes sense to make
this schema explicit (i.e. known to pygr.Data, by telling it that for
the namespace Bio.Seq.Genome.X, X must be a primary key from the
UniProt nomenclature). In this case, if a user attempts to add
something not allowed by the namespace, they get an exception
explaining that they need to use the right species code, and how to
look that up. We can easily provide an override that makes it accept
the name you wanted to assign, for cases where you have an organism
that no one's ever heard of before... The principle here is that if
we want a whole population of users to work together in a rational
way (e.g. use a consistent naming system), we need to make it doggone
easy for users to do the right thing! And that is the beauty of
making schema explicit: if pygr.Data knows what the schema is
supposed to be, it can help users follow a correct, rational,
consistent policy, by making that policy immediately visible and usable.
This could be implemented in two easy steps. First, write some code
supporting such "namespace hints". Second, add such namespace hints
to our standard XMLRPC server that every pygr.Data user defaults to
(if their PYGRDATAPATH is blank). I guess we should also consider
what to do if the server is inaccessible (e.g. if the user is not
connected to the internet).
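A sketch of what such a "namespace hint" check might look like (the function, the tiny species table, and the force= override are all illustrative, not actual pygr.Data API):

```python
# Tiny excerpt of the UniProt species mnemonics, for illustration.
UNIPROT_SPECIES = {"HUMAN", "MOUSE", "RAT", "YEAST", "DROME"}

def check_genome_id(res_id, force=False):
    """Enforce Bio.Seq.Genome.<SPECIES>.<build> naming, unless forced."""
    parts = res_id.split(".")
    if parts[:3] == ["Bio", "Seq", "Genome"] and len(parts) >= 4:
        species = parts[3]
        if species not in UNIPROT_SPECIES and not force:
            raise KeyError(
                "%r is not a UniProt species code; look up the correct "
                "mnemonic in the UniProt species list, or pass "
                "force=True for an unlisted organism" % species)
    return res_id

print(check_genome_id("Bio.Seq.Genome.HUMAN.hg17"))   # accepted
try:
    check_genome_id("Bio.Seq.Genome.Human.hg17")      # wrong case
except KeyError as e:
    print(e)
print(check_genome_id("Bio.Seq.Genome.WEIRDBUG.v1", force=True))
```

The error message is the point: a user who picks a nonstandard name is told immediately what the policy is and how to comply with it.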
MISCELLANEOUS ISSUES
15. change license from GPL to BSD: people seem to think this would
be an improvement... As Andrew Dalke points out, whether people will
contribute changes they made to pygr code back is probably more a
function of culture (i.e. whether they want to contribute) than
something that can just be turned on or off by the legal terms of the
license.
16. switch to a public subversion server so everyone can get the
latest code and participate in pygr development if they want to. The
main issue is where to host it. I've heard some complaints about
SourceForge as a lackluster platform, and suggestions that we switch
to Google Code or host it ourselves (Titus, want to host it on
idyll.org along with BIP?).
Please send feedback to this mailing list (biology-in-
python at lists.idyll.org).
Yours,
Chris