[bip] BLAST and FASTA performance

Mon Jul 21 16:11:33 PDT 2008

It's interesting reading that searching the world's genetic data has  
taken effectively the same time for the last 15 years.  I'm assuming  
of course that's using weighted-for-inflation cost of hardware.   
GenBank size doubles every 18 months, and CPU performance doubles at  
roughly the same rate.

(This coincidence is suggestive - is the growth in sequencing data  
dependent on processing power?  Eg, how much of the assembly process  
is CPU bound?)

My technical optimism says that the algorithms should also be  
improving, so similarity search time should have gone down.  Given  
the number of people who have worked on sequence similarity  
algorithms, I'm surprised this hasn't happened.

Bruce:
> The use of different instruction sets is probably somewhat limited
> because the you need a compiler to support them and a related
> application that can use them. So far these have been orientated to
> multimedia applications and can make a significant difference if
> supported or not.

Granted, "MMX" was named after "multimedia" but the SSE instructions
are "Streaming SIMD Extensions" and the PowerPC's AltiVec are similar
in intent.  gcc has supported those extensions for years, through
a couple of mechanisms.  For this case you would use the gcc builtin
functions, like those listed at
   http://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/X86-Built_002din- 
Functions.html
Remember, C is a high-level assembler.  ;)

I pointed out that the Wikipedia page claims some speedups by taking
advantage of SSE2.  One of the claims is:
   http://en.wikipedia.org/wiki/Smith-Waterman_algorithm

    A SSE2 vectorization of the algorithm (Farrar, 2007) is
    now now available providing an 8-fold speedup on Intel/AMD
    processors with SSE2 extensions.[3] When running on Intel
    processor using the new Intel Core microarchitecture the
    SSE2 implementation achieves a 20-fold increase

Following the link to the Farrar paper: "Striped Smith–Waterman  
speeds database searches six times over other SIMD implementations"  
it concludes

http://bioinformatics.oxfordjournals.org/cgi/reprint/23/2/156.pdf

    On the whole, the striped Smith–Waterman was faster than
    FASTA, more than 50% faster when ktup=2  and four time
    faster when ktup=1. SSEARCH averaged about 30% slower runtimes
    when compared to BLAST.

In thinking about this I have a suspicion that people use BLAST  
because people use BLAST.  Something I've seen in other places is  
that people like using widely-used tools because there's less need to  
justify the choice.  If you use some other tool then the first  
question from an article reviewer, or at a conference, will be "why  
did you use method-X instead of BLAST"?

Another possibility is that people trust NCBI to be right, and don't  
trust other programs.  Or that BLAST code isn't written to be easily  
optimized.  Or that NCBI doesn't take in external patches.  There are  
too many vagaries for me to be sure about any of these.

Why do people here use BLAST over at 30% slower but presumably more  
accurate Smith-Waterman implementation?  The code is available at:
   http://farrar.michael.googlepages.com/Smith-waterman

				Andrew
				dalke at dalkescientific.com