[pygr-notify] Issue 44 in pygr: tblastn and blastx support

Fri Sep 26 18:16:40 PDT 2008

Issue 44: tblastn and blastx support
http://code.google.com/p/pygr/issues/detail?id=44

New issue report by cjlee112:
Right now pygr is restricted to 1:1 alignment relations, which works fine
for blastn and blastp, but not tblastn (protein query vs. nucleotide
database translated to protein sequence) or blastx (nucleotide query vs.
protein database).

tblastn and blastx are problematic for several reasons:
- the returned alignment is not of the actual query sequence and database
sequences, but instead of a *translation* (possibly after
reverse-complementing!) of one side or the other.  Thus the alignment
results are NOT in the coordinate system of the query and the database
seqs; instead they involve a new coordinate system (a translation) created
on the fly.

- this involves a 3:1 alignment relation between nucleotide and protein
sequence.  This is problematic in all sorts of ways, the most fundamental
of which is how to robustly represent the reading frame "phase" for any
given part of the alignment (i.e. the ability to represent alignment to a
"partial codon", which can easily occur when aligning protein against exons
that may split a single codon across an exon-exon junction).

- I think tblastn/blastx imply the need for a separate coordinate system for
this nucleotide vs. protein alignment problem.  For example, what if the
query is a nucleotide sequence and finds a reverse-complement homology to a
protein sequence?  I.e. when the query is reverse-complemented, it has a
translated-homology to the protein sequence.  The result of any alignment
query must always be returned in the same orientation as the user-supplied
query, which means that the homologous protein interval must be returned in
"negative orientation" -- which of course does not exist for a true protein
sequence.
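
To make the coordinate-system problem concrete, here is a minimal sketch in
plain Python (not pygr code) of mapping an interval reported on a translated
frame back onto the original nucleotide sequence.  The function name
translated_to_nucleotide and the frame convention (+1..+3 on the forward
strand, -1..-3 on the reverse complement, with frame -1 starting at the first
base of the reverse complement) are assumptions made for illustration only.

```python
def translated_to_nucleotide(aa_start, aa_stop, frame, nt_length):
    """Map a 0-based, half-open amino-acid interval on a translated frame
    back to (start, stop, orientation) on the forward nucleotide sequence.

    Hypothetical helper: the frame convention is an assumption, see above.
    """
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of +/-1, +/-2, +/-3")
    # Offset of the first translated base on the strand that was translated.
    offset = abs(frame) - 1
    # Each amino acid covers exactly three bases (the 3:1 relation).
    start = offset + 3 * aa_start
    stop = offset + 3 * aa_stop
    if stop > nt_length:
        raise ValueError("interval extends past the end of the sequence")
    if frame > 0:
        return start, stop, 1
    # Reverse frames: flip the interval back onto forward-strand coordinates.
    return nt_length - stop, nt_length - start, -1
```

For a 12 bp sequence, translated_to_nucleotide(0, 2, -1, 12) gives the
forward-strand interval 6..12 in negative orientation, which is exactly the
kind of result that has no counterpart on a plain protein sequence.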

POSSIBLE SOLUTIONS:

I think this would be easy to resolve by using an annotation to represent
the open reading frame on the protein sequence. The key idea is that an
annotation is an independent coordinate system, but can be converted to the
corresponding sequence interval by requesting its sequence attribute.  So
we could have tblastn return 1:1 alignments of nucleotide sequence to an
ORF annotation (whose coordinate system would be expressed in bp, not aa).
The user would request its sequence attribute to obtain the corresponding
protein sequence interval.  This would work well in both directions (i.e.
tblastn, and blastx).

The ORF annotation idea solves the "intermediate coordinate system" problem
nicely: it is a nucleotide coordinate system (which can correctly represent
either orientation).  But it is bound to the protein sequence that it
represents, and you can always convert a slice of an ORF annotation to the
corresponding slice of protein sequence by simply accessing its "sequence"
attribute.  We could even map such ORF annotations directly onto genomic
sequence.
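
A rough sketch of the ORF annotation idea, again in plain Python rather than
pygr's actual annotation classes: the class name OrfInterval, its fields, and
the phase property are all hypothetical, chosen only to show how a bp-based
interval can carry an orientation and a partial-codon phase while still
resolving to a slice of the protein through a "sequence" attribute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrfInterval:
    """Hypothetical ORF annotation slice: bp coordinates bound to a protein."""
    protein: str          # amino-acid sequence the ORF represents
    start: int            # bp offset within the ORF, 0-based, inclusive
    stop: int             # bp offset within the ORF, exclusive
    orientation: int = 1  # +1 or -1; meaningful here, unlike on a protein

    def __neg__(self):
        # Same interval, opposite orientation (the "negative orientation"
        # that a true protein sequence cannot express).
        return OrfInterval(self.protein, self.start, self.stop,
                           -self.orientation)

    @property
    def phase(self):
        # Position of the first base within its codon (0, 1 or 2);
        # non-zero means the interval starts on a partial codon.
        return self.start % 3

    @property
    def sequence(self):
        # Protein slice covering every codon the bp interval touches,
        # including partial codons at either end.
        first = self.start // 3
        last = (self.stop + 2) // 3  # ceiling division by 3
        return self.protein[first:last]


def whole_orf(protein):
    """Build the interval spanning the full ORF of a protein (3 bp per aa)."""
    return OrfInterval(protein, 0, 3 * len(protein))
```

With protein "MKVLA", OrfInterval("MKVLA", 4, 8) starts in phase 1 and its
sequence attribute yields "KV"; negating it flips the orientation without
changing the protein slice it is bound to.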

Issue attributes:
	Status: Assigned
	Owner: cjlee112
	CC: ti... at idyll.org,  deepreds
	Labels: Type-Enhancement Priority-Medium
