[bip] BLAST and FASTA performance
Andrew Dalke
dalke at dalkescientific.com
Mon Jul 21 16:11:33 PDT 2008
It's interesting reading that searching the world's genetic data has
taken effectively the same time for the last 15 years. I'm assuming
of course that's using weighted-for-inflation cost of hardware.
GenBank size doubles every 18 months, and CPU performance doubles at
roughly the same rate.
(This coincidence is suggestive - is the growth in sequencing data
dependent on processing power? Eg, how much of the assembly process
is CPU bound?)
My technical optimism says that the algorithms should also be
improving, so similarity search time should have gone down. Given
the number of people who have worked on sequence similarity
algorithms, I'm surprised this hasn't happened.
Bruce:
> The use of different instruction sets is probably somewhat limited
> because the you need a compiler to support them and a related
> application that can use them. So far these have been orientated to
> multimedia applications and can make a significant difference if
> supported or not.
Granted, "MMX" was named after "multimedia" but the SSE instructions
are "Streaming SIMD Extensions" and the PowerPC's AltiVec are similar
in intent. gcc has supported those extensions for years, through
a couple of mechanisms. For this case you would use the gcc builtin
functions, like those listed at
http://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/X86-Built_002din-
Functions.html
Remember, C is a high-level assembler. ;)
I pointed out that the Wikipedia page claims some speedups by taking
advantage of SSE2. One of the claims is:
http://en.wikipedia.org/wiki/Smith-Waterman_algorithm
A SSE2 vectorization of the algorithm (Farrar, 2007) is
now now available providing an 8-fold speedup on Intel/AMD
processors with SSE2 extensions.[3] When running on Intel
processor using the new Intel Core microarchitecture the
SSE2 implementation achieves a 20-fold increase
Following the link to the Farrar paper: "Striped Smith–Waterman
speeds database searches six times over other SIMD implementations"
it concludes
http://bioinformatics.oxfordjournals.org/cgi/reprint/23/2/156.pdf
On the whole, the striped Smith–Waterman was faster than
FASTA, more than 50% faster when ktup=2 and four time
faster when ktup=1. SSEARCH averaged about 30% slower runtimes
when compared to BLAST.
In thinking about this I have a suspicion that people use BLAST
because people use BLAST. Something I've seen in other places is
that people like using widely-used tools because there's less need to
justify the choice. If you use some other tool then the first
question from an article reviewer, or at a conference, will be "why
did you use method-X instead of BLAST"?
Another possibility is that people trust NCBI to be right, and don't
trust other programs. Or that BLAST code isn't written to be easily
optimized. Or that NCBI doesn't take in external patches. There are
too many vagaries for me to be sure about any of these.
Why do people here use BLAST over at 30% slower but presumably more
accurate Smith-Waterman implementation? The code is available at:
http://farrar.michael.googlepages.com/Smith-waterman
Andrew
dalke at dalkescientific.com
More information about the biology-in-python
mailing list