[pygr-notify] Issue 44 in pygr: tblastn and blastx support

Wed Jan 7 13:50:12 PST 2009

Updates:
	Cc: marecki
	Labels: -Priority-Medium Priority-Low Milestone-Release0.8

Comment #3 on issue 44 by cjlee112: tblastn and blastx support
http://code.google.com/p/pygr/issues/detail?id=44

Titus requested some usage examples, so here goes:

# BASIC USAGE: creating a mapping object for any seqDB
# that you want to search using BLAST
blastmap = BlastMapping(seqDB)
# blastmap acts like a Pygr graph (i.e. dict)
# that takes any sequence object as a source node (key)
# and maps them to homologous sequence intervals from seqDB
# as target nodes, with edge information about each alignment,
# following the generic pattern blastmap[src][target] = edgeInfo.
# You can use it either in the mapping style:
# myquery can be any Pygr sequence interval object...
for src,target,edge in blastmap[myquery].edges():
     # print the aligned intervals or whatever you want to do...
     print '%s %s\n%s %s\n' %(src,repr(src),target,repr(target))

# or you use it as a callable, so you can pass
# search parameters like expmax etc.
# This also lets you save many search results into
# one NLMSA...
for myquery in lotsaQueries:
     blastmap(myquery, al=myNLMSA)
myNLMSA.build()

Since this uses NLMSA to store the results, any of the NLMSASlice group-by  
operators
can be applied as usual.

BlastMapping determines which flavor of BLAST (blastp or blastn) to run  
based on the
sequence type (nucleotide vs. protein) of the query and database.   
Currently, tblastn
and blastx won't work correctly, because of the translation issues outlined  
above.

The new proposal would make tblastn work exactly the same as blastp and  
blastn (i.e.
as in the example code above), but with one extra feature: the aligned  
target
intervals (which look like protein sequence intervals, even though the  
target
database is nucleotide) would have a "sequence" attribute which yields the  
associated
nucleotide sequence interval from the target sequence database, e.g. to  
print the
target *nucleotide sequence coordinates* next to the target *protein  
sequence
string*, you would just change the code trivially:

for src,target,edge in blastmap[myquery].edges():
     # print the aligned intervals or whatever you want to do...
     print '%s %s\n%s %s %2.1f\n' %(src,repr(src),target,
                                    repr(target.sequence),
                                    100.*edge.pIdentity())
This example also prints the percent identity based of the *protein*  
sequence of
target vs. src.

Explanation: the target sequence interval is not an interval of seqDB of  
course,
because tblastn translates the subject nucleotide sequence into a protein  
sequence.
target represents that protein sequence interval; you can slice it, extract  
its
protein sequence string (by str()) etc. in all the usual ways.  But it is  
actually an
Annotation sequence object, bound to a real sequence object, namely its  
associated
interval of nucleotide sequence from seqDB, which as always with an  
annotation, you
can obtain by requesting the "sequence" attribute.  This takes advantage of  
the fact
that in Pygr an annotation is actually a subclass of sequence (SeqPath), so  
all the
sequence operation methods come along for free...

--
You received this message because you are listed in the owner
or CC fields of this issue, or because you starred this issue.
You may adjust your issue notification preferences at:
http://code.google.com/hosting/settings