[bip] IO tests
Bruce Southey
bsouthey at gmail.com
Wed Feb 6 08:45:53 PST 2008
Hi
Let me clarify, correct and state some of the implicit issues more clearly here.
First, one assumes that the blast application is single-threaded.
Second, it is assumed there is sufficient memory to hold the database
and sequences. Third, it is assumed that the work is exactly divisible
by the number of processors so each processor gets exactly the same
amount of work. Fourth, I will ignore physical aspects of processor
design that may limit the ability to fully use all processors,
especially cores, because these have to share the same resources (like
access to memory). Fifth, this assumes an SMP system, not a cluster
(communication is far more critical in a cluster than in a
shared-memory system like SMP).
The times were measured on an Intel machine with two dual-core
processors (so 4 cores), 8GB of memory and blast version 2.2.17 on
Linux x86_64 (Fedora 8). The defaults of blastall were used. The same
database of bovine unigene sequences (1.1G of diskspace; 1,368,959
sequences; 858,973,497 total letters) and the same input sequences
were used in all cases. The input sequences occur in the database
itself and are probably near the end of the database, but not
sequential. I don't think that the actual database and sequences are
critical.
Single sequence on one processor
There is a single blast against one database. In that case the
sequence, database and other inputs (like scoring matrices) are read
into memory and the results written.
Multiple sequences on one processor processed sequentially
The same as a single sequence on one processor multiplied by the
number of sequences. Note that caching may reduce this slightly
because the processor may not always have to retrieve everything from
disk.
Single sequence with the database split across processors
The database can be split by the number of processors available and
each processor deals with its own database section. There are costs here -
database splitting, all blast instances need to be started on all
processors, communication of the sequence, database fragment and other
inputs to each processor and finally sending the results back to be
collated and written to disk. However, provided these are smaller than
the actual blast search, this is closer to linear speedup over a
single sequence and a single processor.
For example, here is the output of time for a single sequence, running
blastall from the command line using blastn and changing only the
number of cores or processors used (-a):
N cores real user sys
1 0m1.075s 0m1.024s 0m0.050s
2 0m0.574s 0m1.042s 0m0.068s
3 0m0.431s 0m1.046s 0m0.075s
4 0m0.333s 0m1.072s 0m0.076s
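A quick way to turn the 'real' times in the table above into speedup
and parallel-efficiency numbers is a small awk helper (the function
name speedup is just a convenience I made up; the hard-coded times are
the 1-, 2- and 4-core 'real' values from the table):

```shell
# Compute speedup (t1/tN) and parallel efficiency (speedup/N) with awk.
# Arguments: t1 = 1-core real time, tn = N-core real time, n = N cores.
speedup() {
  awk -v t1="$1" -v tn="$2" -v n="$3" \
    'BEGIN { s = t1 / tn; printf "N=%d speedup=%.2f efficiency=%.0f%%\n", n, s, 100 * s / n }'
}
speedup 1.075 0.574 2   # 2 cores vs 1 core
speedup 1.075 0.333 4   # 4 cores vs 1 core
```

The efficiency column makes the "closer to linear" claim concrete:
anything near 100% is linear speedup.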
Multiple sequences with the database split across processors
There are probably different ways to think about this but I will try
to keep the same logic as in the single sequence case. The database
can be split by the number of processors available and each processor
deals with its own database section. The reading and communication of
the input sequences and results still occurs for each processor. But
while the same costs are present, many are incurred only once -
database splitting, reading inputs, starting blast instances and
sending the database component and inputs to each processor.
With 4 sequences:
N cores real user sys
1 0m6.545s 0m6.466s 0m0.079s
2 0m3.658s 0m6.834s 0m0.091s
3 0m2.514s 0m6.759s 0m0.066s
4 0m2.033s 0m6.760s 0m0.100s
With 16 sequences, to try to keep the processors busy a little longer:
N cores real user sys
1 0m13.354s 0m13.206s 0m0.145s
2 0m7.585s 0m13.587s 0m0.176s
3 0m5.484s 0m13.645s 0m0.153s
4 0m4.493s 0m13.598s 0m0.170s
Multiple sequences split across processors
Compared to multiple sequences on one processor, we have just divided
the work across processors. The obvious cost is that we have to
perform the splitting of the sequences and communicate the sequences
to each processor. The hidden costs include disk IO because only one
processor can read from the disk or write to the disk at once.
However, if each blast search takes sufficient time then such
contention will be rare, and disk access is generally very quick anyway.
I used xargs to send a single sequence to each processor at a time (-P
controls how many processors). This is faster than just running a
script that sends each blast job into the background. Yes, this does
spawn the four instances of blast, but as given here it does not save
the output for all sequences.
Just to show individual times for a set of 4 sequences (given by
{filenames used} so you can use your own):
$ ls {filenames used} | xargs -L1 -P4 time blastall -p blastn -d
blast_database -o blast_output -a 1 -i
1.21user 0.06system 0:01.29elapsed
1.29user 0.05system 0:01.36elapsed
2.04user 0.05system 0:02.11elapsed
4.46user 0.05system 0:04.52elapsed
The total time from
$ time ls {filenames used} | xargs -L1 -P4 blastall -p blastn -d
blast_database -o blast_output -a 1 -i
N cores real user sys
1 0m9.204s 0m8.998s 0m0.206s
2 0m5.781s 0m8.951s 0m0.269s
3 0m4.526s 0m8.992s 0m0.249s
4 0m4.507s 0m9.019s 0m0.212s
Okay, let's use 16 sequences in a file 'go.sequences':
$ time cat go.sequences | xargs -L1 -P4 blastall -p blastn -d
blast_database -o blast_output -a 1 -i
N cores real user sys
1 0m25.792s 0m24.905s 0m0.885s
2 0m13.072s 0m24.932s 0m1.012s
3 0m8.837s 0m24.921s 0m1.018s
4 0m7.291s 0m25.055s 0m0.963s
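One detail about the commands above: every job writes to the same -o
blast_output file, so parallel jobs overwrite each other's results.
The usual fix is to give each job its own output file derived from its
input name. A minimal runnable sketch of the pattern follows, with tr
standing in for blastall so it works anywhere (the seq_*.fasta names
are made up; with blast the worker line would be something like
blastall -p blastn -d blast_database -a 1 -i "$1" -o "$1.blast"):

```shell
# Run one worker per input file with xargs -P, giving each job its own
# output file so results are not clobbered. 'tr' is a stand-in for
# blastall here; it just uppercases each file into <name>.out.
printf 'acgt\n' > seq_1.fasta
printf 'ttgg\n' > seq_2.fasta
ls seq_*.fasta | xargs -P4 -I{} \
  sh -c 'tr a-z A-Z < "$1" > "$1.out"' _ {}
```

With -I{}, xargs runs one command per input line and substitutes the
file name, so each of the four parallel workers gets its own output.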
The conclusion?
Well, benchmarks should be repeated multiple times, with multiple
sequences that each take about the same amount of time for the blast
search.
If you have multiple processors, use a multi-threaded blast application.
I do see an advantage in giving blast all sequences at once but the
speedup is not linear.
Really, for most scenarios most of the processing time is spent doing
the actual blast search, so the other costs are rather minor provided
that each processor gets about the same amount of work.
There is one key flaw in the splitting of the database: it assumes
that each split involves the same amount of work. So if only one
sequence can be run at a time, all processors must wait until the
slowest processor has finished with that sequence. This wait is
avoided by, obviously, splitting the sequences across processors
instead.
Regards
Bruce
On Feb 5, 2008 3:49 PM, Paul Davis <paul.joseph.davis at gmail.com> wrote:
> fastacmd -I output:
>
> Database: fetcher
> 31,077 sequences; 9,423,040 total letters
>
> Original fasta sequence file size: 13 M
>
>
> 8 Processes:
>
> Ran this from a script that put 8 blastall instances into the
> background and then waited for each to finish. Used time to give the
> command, processor utilization (approximation) and elapsed wall clock
> time. Ran the script with time to get overall time and processor
> utilization.
>
> Notice the discrepancy, 8 * 99% != 692%. Not sure why that is exactly.
> The script does absolutely nothing after it waits for its background
> jobs.
>
> Each fasta file is approximately the same number of sequences. All 8
> are 1.4-1.7 MiB
>
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_6.fasta -o fp_6.blast
> 99% 549.40
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_3.fasta -o fp_3.blast
> 99% 572.45
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_5.fasta -o fp_5.blast
> 99% 575.52
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_7.fasta -o fp_7.blast
> 99% 579.57
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_0.fasta -o fp_0.blast
> 99% 590.05
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_4.fasta -o fp_4.blast
> 99% 597.35
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_2.fasta -o fp_2.blast
> 99% 633.53
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_1.fasta -o fp_1.blast
> 99% 691.82
> 4782.93user 6.48system 11:31.82elapsed 692%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+3983719minor)pagefaults 0swaps
>
> 8 Threads:
>
> /usr/bin/time blastall -p blastp -d fetcher -e 1e-5 -m 8 -i
> fetcher.fasta -o fetcher.blast -a 8
> 4794.22user 8.47system 19:33.96elapsed 409%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+4064461minor)pagefaults 0swaps
>
> 16 Threads:
>
> /usr/bin/time blastall -p blastp -d fetcher -e 1e-5 -m 8 -i
> fetcher.fasta -o fetcher.blast -a 16
> 4783.51user 8.31system 19:40.08elapsed 406%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+4101291minor)pagefaults 0swaps
>
> For some reason, with this small of a database, processor utilization
> never gets to 99% when using threads.
>
> Ran the 8 process and 8 thread test on a tmpfs ramdisk:
>
> 8 processes:
>
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_6.fasta -o fp_6.blast
> 99% 549.39
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_3.fasta -o fp_3.blast
> 99% 572.67
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_5.fasta -o fp_5.blast
> 99% 576.13
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_7.fasta -o fp_7.blast
> 99% 579.53
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_0.fasta -o fp_0.blast
> 99% 590.65
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_4.fasta -o fp_4.blast
> 99% 597.14
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_2.fasta -o fp_2.blast
> 99% 634.84
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_1.fasta -o fp_1.blast
> 99% 692.08
> 4785.76user 6.60system 11:32.09elapsed 692%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+3983695minor)pagefaults 0swaps
>
> 8 threads:
>
> /usr/bin/time blastall -p blastp -d fetcher -e 1E-5 -m 8 -a 8 -i
> fetcher.fasta -o fetcher.blast
> 4795.48user 8.82system 19:33.90elapsed 409%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+4066324minor)pagefaults 0swaps
>
> ####
>
> Obviously this isn't IO bound. Times and stats are similar.
>
> Perhaps later when I really care I'll do the tests against NR.
>
> Paul
>
> _______________________________________________
> biology-in-python mailing list - bip at lists.idyll.org.
>
> See http://bio.scipy.org/ for our Wiki.
>