[bip] IO tests
Bruce Southey
bsouthey at gmail.com
Wed Feb 6 08:45:53 PST 2008
Hi
Let me clarify, correct and state some of the implicit issues more clearly here.
First, one assumes that the blast application is single-threaded.
Second, it is assumed there is sufficient memory to hold the database
and sequences. Third, it is assumed that the work is exactly divisible
by the number of processors so each processor gets exactly the same
amount of work. Fourth, I will ignore physical aspects of processor
design that may limit the ability to fully use all processors,
especially cores, because these have to share the same resources (like
access to memory). Fifth, this assumes an SMP system, not a cluster
(communication is far more critical in a cluster than in a
shared-memory system like SMP).
The times were measured on an Intel machine with two dual-core
processors (so 4 cores), 8GB of memory and blast version 2.2.17 on
Linux x86_64 (Fedora 8). The defaults of blastall were used. The same
database of bovine unigene sequences (1.1G of diskspace; 1,368,959
sequences; 858,973,497 total letters) and the same input sequences
were used in all cases. The input sequences occur in the database
itself and are probably near the end of the database, but not
sequential. I don't think that the actual database and sequences are
critical.
Single sequence on one processor
There is a single blast against one database. In that case the
sequence, database and other inputs (like scoring matrices) are read
into memory and the results written.
Multiple sequences on one processor processed sequentially
The same as a single sequence on one processor multiplied by the
number of sequences. Note that caching may reduce this slightly
because the processor may not always have to retrieve everything from
disk.
Single sequence with the database split across processors
The database can be split by the number of processors available and
each processor deals with its own database section. There are costs here -
database splitting, all blast instances need to be started on all
processors, communication of the sequence, database fragment and other
inputs to each processor and finally sending the results back to be
collated and written to disk. However, provided these are smaller than
the actual blast search, this is closer to linear speedup over a
single sequence and a single processor.
For example, here is the output of time for a single sequence, running
blastall from the command line using blastn and changing only the
number of cores or processors used (-a):
N cores real user sys
1 0m1.075s 0m1.024s 0m0.050s
2 0m0.574s 0m1.042s 0m0.068s
3 0m0.431s 0m1.046s 0m0.075s
4 0m0.333s 0m1.072s 0m0.076s
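A quick way to turn the 'real' times in the table above into speedup
and parallel-efficiency numbers is a small awk helper (the function
name speedup is just a convenience I made up; the hard-coded times are
the 1-, 2- and 4-core 'real' values from the table):

```shell
# Compute speedup (t1/tN) and parallel efficiency (speedup/N) with awk.
# Arguments: t1 = 1-core real time, tn = N-core real time, n = N cores.
speedup() {
  awk -v t1="$1" -v tn="$2" -v n="$3" \
    'BEGIN { s = t1 / tn; printf "N=%d speedup=%.2f efficiency=%.0f%%\n", n, s, 100 * s / n }'
}
speedup 1.075 0.574 2   # 2 cores vs 1 core
speedup 1.075 0.333 4   # 4 cores vs 1 core
```

The efficiency column makes the "closer to linear" claim concrete:
anything near 100% is linear speedup.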
Multiple sequences with the database split across processors
There are probably different ways to think about this but I will try
to keep the same logic as in the single sequence case. The database
can be split by the number of processors available and each processor
deals with its own database section. The reading and communication of
the input sequences and results still occurs for each processor. But
while the same costs are present, many are incurred only once -
database splitting, reading inputs, starting blast instances and
sending the database component and inputs to each processor.
With 4 sequences:
N cores real user sys
1 0m6.545s 0m6.466s 0m0.079s
2 0m3.658s 0m6.834s 0m0.091s
3 0m2.514s 0m6.759s 0m0.066s
4 0m2.033s 0m6.760s 0m0.100s
With 16 sequences, to try to keep the processors busy a little longer:
N cores real user sys
1 0m13.354s 0m13.206s 0m0.145s
2 0m7.585s 0m13.587s 0m0.176s
3 0m5.484s 0m13.645s 0m0.153s
4 0m4.493s 0m13.598s 0m0.170s
Multiple sequences split across processors
Compared to multiple sequences on one processor, we have just divided
the work across processors. The obvious cost is that we have to
perform the splitting of the sequences and communicate the sequences
to each processor. The hidden costs include disk IO because only one
processor can read from the disk or write to the disk at once.
However, if each blast search takes sufficient time then such
contention will be rare, and disk access is generally very quick anyway.
I used xargs to send a single sequence to each processor at a time (-P
controls how many processors). This is faster than just running a
script that sends each blast job into the background. Yes, this does
spawn the four instances of blast, but as given here it does not save
the output for all sequences.
Just to show individual times for a set of 4 sequences (given by
{filenames used} so you can use your own):
$ ls {filenames used} | xargs -L1 -P4 time blastall -p blastn -d
blast_database -o blast_output -a 1 -i
1.21user 0.06system 0:01.29elapsed
1.29user 0.05system 0:01.36elapsed
2.04user 0.05system 0:02.11elapsed
4.46user 0.05system 0:04.52elapsed
The total time from
$ time ls {filenames used} | xargs -L1 -P4 blastall -p blastn -d
blast_database -o blast_output -a 1 -i
N cores real user sys
1 0m9.204s 0m8.998s 0m0.206s
2 0m5.781s 0m8.951s 0m0.269s
3 0m4.526s 0m8.992s 0m0.249s
4 0m4.507s 0m9.019s 0m0.212s
Okay, let's use 16 sequences in a file 'go.sequences':
$ time cat go.sequences | xargs -L1 -P4 blastall -p blastn -d
blast_database -o blast_output -a 1 -i
N cores real user sys
1 0m25.792s 0m24.905s 0m0.885s
2 0m13.072s 0m24.932s 0m1.012s
3 0m8.837s 0m24.921s 0m1.018s
4 0m7.291s 0m25.055s 0m0.963s
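One detail about the commands above: every job writes to the same -o
blast_output file, so parallel jobs overwrite each other's results.
The usual fix is to give each job its own output file derived from its
input name. A minimal runnable sketch of the pattern follows, with tr
standing in for blastall so it works anywhere (the seq_*.fasta names
are made up; with blast the worker line would be something like
blastall -p blastn -d blast_database -a 1 -i "$1" -o "$1.blast"):

```shell
# Run one worker per input file with xargs -P, giving each job its own
# output file so results are not clobbered. 'tr' is a stand-in for
# blastall here; it just uppercases each file into <name>.out.
printf 'acgt\n' > seq_1.fasta
printf 'ttgg\n' > seq_2.fasta
ls seq_*.fasta | xargs -P4 -I{} \
  sh -c 'tr a-z A-Z < "$1" > "$1.out"' _ {}
```

With -I{}, xargs runs one command per input line and substitutes the
file name, so each of the four parallel workers gets its own output.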
The conclusion?
Well, benchmarks should be repeated multiple times, with multiple
sequences that each take about the same amount of time for the blast
search.
If you have multiple processors, use a multi-threaded blast application.
I do see an advantage in giving blast all sequences at once but the
speedup is not linear.
Really, for most scenarios most of the processing time is spent doing
the actual blast search, so the other costs are rather minor provided
that each processor gets about the same amount of work.
There is one key flaw in the splitting of the database: it assumes
that each split involves the same amount of work. So if only one
sequence can be run at a time, all processors must wait until the
slowest processor has finished with that sequence. This wait is
avoided by, obviously, splitting the sequences across processors
instead.
Regards
Bruce
On Feb 5, 2008 3:49 PM, Paul Davis <paul.joseph.davis at gmail.com> wrote:
> fastacmd -I output:
>
> Database: fetcher
> 31,077 sequences; 9,423,040 total letters
>
> Original fasta sequence file size: 13 M
>
>
> 8 Processes:
>
> Ran this from a script that put 8 blastall instances into the
> background and then waited for each to finish. Used time to give the
> command, processor utilization (approximation) and elapsed wall clock
> time. Ran the script with time to get overall time and processor
> utilization.
>
> Notice the discrepancy, 8 * 99% != 692%. Not sure why that is exactly.
> The script does absolutely nothing after it waits for its background
> jobs.
>
> Each fasta file is approximately the same number of sequences. All 8
> are 1.4-1.7 MiB
>
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_6.fasta -o fp_6.blast
> 99% 549.40
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_3.fasta -o fp_3.blast
> 99% 572.45
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_5.fasta -o fp_5.blast
> 99% 575.52
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_7.fasta -o fp_7.blast
> 99% 579.57
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_0.fasta -o fp_0.blast
> 99% 590.05
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_4.fasta -o fp_4.blast
> 99% 597.35
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_2.fasta -o fp_2.blast
> 99% 633.53
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_1.fasta -o fp_1.blast
> 99% 691.82
> 4782.93user 6.48system 11:31.82elapsed 692%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+3983719minor)pagefaults 0swaps
>
> 8 Threads:
>
> /usr/bin/time blastall -p blastp -d fetcher -e 1e-5 -m 8 -i
> fetcher.fasta -o fetcher.blast -a 8
> 4794.22user 8.47system 19:33.96elapsed 409%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+4064461minor)pagefaults 0swaps
>
> 16 Threads:
>
> /usr/bin/time blastall -p blastp -d fetcher -e 1e-5 -m 8 -i
> fetcher.fasta -o fetcher.blast -a 16
> 4783.51user 8.31system 19:40.08elapsed 406%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+4101291minor)pagefaults 0swaps
>
> For some reason, with this small of a database, processor utilization
> never gets to 99% when using threads.
>
> Ran the 8 process and 8 thread test on a tmpfs ramdisk:
>
> 8 processes:
>
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_6.fasta -o fp_6.blast
> 99% 549.39
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_3.fasta -o fp_3.blast
> 99% 572.67
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_5.fasta -o fp_5.blast
> 99% 576.13
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_7.fasta -o fp_7.blast
> 99% 579.53
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_0.fasta -o fp_0.blast
> 99% 590.65
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_4.fasta -o fp_4.blast
> 99% 597.14
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_2.fasta -o fp_2.blast
> 99% 634.84
> blastall -p blastp -d fetcher -e 1e-5 -m 8 -i fp_1.fasta -o fp_1.blast
> 99% 692.08
> 4785.76user 6.60system 11:32.09elapsed 692%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+3983695minor)pagefaults 0swaps
>
> 8 threads:
>
> /usr/bin/time blastall -p blastp -d fetcher -e 1E-5 -m 8 -a 8 -i
> fetcher.fasta -o fetcher.blast
> 4795.48user 8.82system 19:33.90elapsed 409%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+4066324minor)pagefaults 0swaps
>
> ####
>
> Obviously this isn't IO bound. Times and stats are similar.
>
> Perhaps later when I really care I'll do the tests against NR.
>
> Paul
>
> _______________________________________________
> biology-in-python mailing list - bip at lists.idyll.org.
>
> See http://bio.scipy.org/ for our Wiki.
>