[bip] parallel recipe and bio libraries.
Paul Davis
paul.joseph.davis at gmail.com
Tue Feb 5 06:56:59 PST 2008
Eric,
The issue at hand is the 'effective' database length. Effective
database length is dependant on the query sequence as well.
Check out: BLAST_ComputeLengthAdjustment
Line 4555 in src/algo/blast/core/blast_stat.cpp
And then on line 644 in file src/algo/blast/core/blast_setup.cpp you
can find this comment:
/* if the database length was specified, do not
adjust it when calculating the search space;
it's counter-intuitive to specify a value and
not have that value be used */
If I'm not mistaken, this means that if you blast more than one
sequence at a time, you can only have one user specified database
size, which is incorrectly used as the effective database size. Sooo,
long story short, this will affect e-values.
Simple test:
Blasted gi 15604718 against my version of nr twice. Once with -z, once without.
davisp at cnode1:~/blast$ blastall -p blastp -d /mnt/data/biores/blast/nr
-i test.fasta -o test1.blast -z 1537870565 -a 16 -e 1e-5 -m 8
davisp at cnode1:~/blast$ blastall -p blastp -d /mnt/data/biores/blast/nr
-i test.fasta -o test2.blast -a 16 -e 1e-5 -m 8
I get the following values:
test1.blast
gi|15604718|ref|NP_219502.1| gi|15604718|ref|NP_219502.1| 100.00
591 0 0 1 591 1 591 0.0 1028
gi|15604718|ref|NP_219502.1| gi|76788712|ref|YP_327798.1| 97.97
591 10 1 1 591 1 589 0.0 998
gi|15604718|ref|NP_219502.1| gi|15834888|ref|NP_296647.1| 48.33
629 263 10 1 591 10 614 3e-145 519
test2.blast
gi|15604718|ref|NP_219502.1| gi|15604718|ref|NP_219502.1| 100.00
591 0 0 1 591 1 591 0.0 1028
gi|15604718|ref|NP_219502.1| gi|76788712|ref|YP_327798.1| 97.97
591 10 1 1 591 1 589 0.0 998
gi|15604718|ref|NP_219502.1| gi|15834888|ref|NP_296647.1| 48.33
629 263 10 1 591 10 614 2e-145 519
Moral of this story:
when I specify the database size, the evalue of the third hit is
inflated. This is because the effective database size is not
calculated correctly. (There's no db_size - length_adj * num_seqs)
This is with NCBI blast. No idea about WU-blast.
This just goes to show that we need to figure out a measure that is
not sequence dependant. I sure do like the looks of z-values, but they
look orders of magnitude harder to compute.
And to the others, hopefully this weekend I'll have time to play with
running a couple cursory IO tests.
HTH,
Paul Davis
On 2/5/08, Erich Schwarz <emsch at its.caltech.edu> wrote:
> On Mon, 4 Feb 2008, Paul Davis wrote:
>
> > Also, remember that your e-values are going to change if you
> > change the database size (by splitting it). And to my knowledge,
> > its impossible to exactly match the single database e-values when
> > blasting multiple sequences at once.
>
> Shouldn't you be able to get a size for the overall database,
> then feed it to the individual blast runs? For NCBI's blastall, the
> argument "-z [number]" allows the effective length of the
> full database to be provided independently of whatever subset is
> being scanned in the blastall run. NCBI blastpgp uses "-z" as well;
> WU-BLAST uses "Z=[number]".
>
> Or am I not understanding the problem here (always
> possible...)?
>
>
> --Erich
>
>
> _______________________________________________
> biology-in-python mailing list - bip at lists.idyll.org.
>
> See http://bio.scipy.org/ for our Wiki.
>
More information about the biology-in-python
mailing list