[bip] parallel recipe and bio libraries.

Tue Feb 5 06:56:59 PST 2008

Eric,

The issue at hand is the 'effective' database length. Effective
database length is dependant on the query sequence as well.

Check out: BLAST_ComputeLengthAdjustment
Line 4555 in src/algo/blast/core/blast_stat.cpp

And then on line 644 in file src/algo/blast/core/blast_setup.cpp you
can find this comment:

/* if the database length was specified, do not
                adjust it when calculating the search space;
                it's counter-intuitive to specify a value and
                not have that value be used */

If I'm not mistaken, this means that if you blast more than one
sequence at a time, you can only have one user specified database
size, which is incorrectly used as the effective database size. Sooo,
long story short, this will affect e-values.

Simple test:

Blasted gi 15604718 against my version of nr twice. Once with -z, once without.

davisp at cnode1:~/blast$ blastall -p blastp -d /mnt/data/biores/blast/nr
-i test.fasta -o test1.blast -z 1537870565 -a 16 -e 1e-5 -m 8

davisp at cnode1:~/blast$ blastall -p blastp -d /mnt/data/biores/blast/nr
-i test.fasta -o test2.blast -a 16 -e 1e-5 -m 8

I get the following values:

test1.blast
gi|15604718|ref|NP_219502.1|    gi|15604718|ref|NP_219502.1|    100.00
 591     0       0       1       591     1       591     0.0     1028
gi|15604718|ref|NP_219502.1|    gi|76788712|ref|YP_327798.1|    97.97
 591     10      1       1       591     1       589     0.0      998
gi|15604718|ref|NP_219502.1|    gi|15834888|ref|NP_296647.1|    48.33
 629     263     10      1       591     10      614     3e-145   519

test2.blast
gi|15604718|ref|NP_219502.1|    gi|15604718|ref|NP_219502.1|    100.00
 591     0       0       1       591     1       591     0.0     1028
gi|15604718|ref|NP_219502.1|    gi|76788712|ref|YP_327798.1|    97.97
 591     10      1       1       591     1       589     0.0      998
gi|15604718|ref|NP_219502.1|    gi|15834888|ref|NP_296647.1|    48.33
 629     263     10      1       591     10      614     2e-145   519

Moral of this story:

when I specify the database size, the evalue of the third hit is
inflated. This is because the effective database size is not
calculated correctly. (There's no db_size - length_adj * num_seqs)

This is with NCBI blast. No idea about WU-blast.

This just goes to show that we need to figure out a measure that is
not sequence dependant. I sure do like the looks of z-values, but they
look orders of magnitude harder to compute.

And to the others, hopefully this weekend I'll have time to play with
running a couple cursory IO tests.

HTH,
Paul Davis

On 2/5/08, Erich Schwarz <emsch at its.caltech.edu> wrote:
> On Mon, 4 Feb 2008, Paul Davis wrote:
>
> > Also, remember that your e-values are going to change if you
> > change the database size (by splitting it). And to my knowledge,
> > its impossible to exactly match the single database e-values when
> > blasting multiple sequences at once.
>
>     Shouldn't you be able to get a size for the overall database,
> then feed it to the individual blast runs?  For NCBI's blastall, the
> argument "-z [number]" allows the effective length of the
> full database to be provided independently of whatever subset is
> being scanned in the blastall run.  NCBI blastpgp uses "-z" as well;
> WU-BLAST uses "Z=[number]".
>
>     Or am I not understanding the problem here (always
> possible...)?
>
>
> --Erich
>
>
> _______________________________________________
> biology-in-python mailing list - bip at lists.idyll.org.
>
> See http://bio.scipy.org/ for our Wiki.
>