[bip] parallel recipe and bio libraries.

Brent Pedersen bpederse at gmail.com
Mon Feb 4 20:40:06 PST 2008


On Feb 4, 2008 2:53 PM, Paul Davis <paul.joseph.davis at gmail.com> wrote:
> A couple of caveats with this: the claims about which is more
> efficient definitely depend on your input-to-database size ratio.
> Whichever of the two options you pick, you're going to have to read
> either the database or the query multiple times. If your db size and
> query size are approximately equal (all-vs-all blasting) then
> (theoretically) the two methods would be equivalent in terms of disk
> IO efficiency.
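
To make that tradeoff concrete, here is a back-of-envelope IO model
(toy numbers I'm assuming, nothing measured): splitting the query into
k chunks rescans the full database k times, while splitting the
database into k chunks rereads the full query k times, so the totals
tie when query and db are the same size.

# a toy IO model: total MB read from disk for each splitting strategy
def mb_read(query_mb, db_mb, k, split):
    if split == "query":
        # each of the k query chunks scans the whole database
        return query_mb + k * db_mb
    # the whole query is scanned against each of the k db chunks
    return db_mb + k * query_mb

print(mb_read(1000, 1000, 4, "query"))  # 5000
print(mb_read(1000, 1000, 4, "db"))     # 5000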
>
> Also, remember that your e-values are going to change if you change
> the database size (by splitting it). And to my knowledge, it's
> impossible to exactly match the single-database e-values when blasting
> multiple sequences at once.
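
For anyone following along, the scaling comes from the Karlin-Altschul
statistic: E = K * m * n * exp(-lambda * S), where n is the effective
database length, so shrinking the database directly shrinks every
E-value. (The edge-effect corrections BLAST applies to m and n are, as
far as I know, why the single-database numbers can't be exactly
recovered from split-database runs.) A toy calculation, using
roughly-BLOSUM62 gapped parameters:

import math

# Karlin-Altschul: E = K * m * n * exp(-lambda * S)
# m = effective query length, n = effective database length
def evalue(K, lam, m, n, S):
    return K * m * n * math.exp(-lam * S)

# the same raw score against the full database vs. a quarter of it
full  = evalue(K=0.041, lam=0.267, m=300, n=4e6, S=80)
piece = evalue(K=0.041, lam=0.267, m=300, n=1e6, S=80)
print(full / piece)  # 4.0: a quarter the database, a quarter the E-value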
>
> In other news, I've noticed that the concurrent version of blast
> (i.e., -a #threads, not mpi-blast) is unable to keep all cores busy.
> My first guess is that blast has some silly behavior when writing to
> the output file. As near as I can tell, blast wants to write output
> sequences in the same order they were read.
>
> So in general, blast's parallel algorithm would be something like this:
>
> Create N sequence chunks of roughly equal size
> Let N threads each process one chunk
> Sort and write the chunk results to the file in input order
>
> Which hopefully isn't the case, but it sure would fit the symptoms.
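
If that is what's happening, the stall is easy to reproduce in
miniature with Python's multiprocessing (a toy sketch of the general
pattern, not BLAST's actual internals): imap hands results back in
input order, so everything queues behind one slow chunk, while
imap_unordered yields them as they finish.

import time
from multiprocessing import Pool

def process(chunk):
    # pretend chunk 0 is a pathologically slow sequence
    time.sleep(5 if chunk == 0 else 1)
    return chunk

if __name__ == "__main__":
    with Pool(4) as pool:
        # ordered: chunks 1-7 finish but can't be written until chunk 0
        # is done; with a bounded output buffer the workers would stall
        for result in pool.imap(process, range(8)):
            print("ordered:", result)
        # unordered: results are written as they complete,
        # so the writer never waits behind a slow chunk
        for result in pool.imap_unordered(process, range(8)):
            print("unordered:", result)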
>
> So that's a roundabout way of saying that I'm contemplating running
> multiple copies of non-parallel blast. AFAICT, this would end up being
> faster, assuming that the size of your input is ~= db size, that you
> have a large number of sequences (> 100K), and that you have enough
> RAM to hold all copies of the database in memory.
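
A minimal sketch of that plan, assuming legacy NCBI blastall on the
PATH and made-up chunk and database names: split the query FASTA into
one chunk per core and launch a single-threaded blastall on each.

import subprocess

# hypothetical chunk files from pre-splitting the query FASTA, one per core
chunks = ["query.0.fa", "query.1.fa", "query.2.fa", "query.3.fa"]

# one single-threaded blastall (no -a) per chunk; the OS schedules one
# per core, and per the RAM assumption above the db pages stay cached
procs = [subprocess.Popen(["blastall", "-p", "blastp", "-d", "nr",
                           "-i", c, "-o", c + ".out"])
         for c in chunks]
for p in procs:
    p.wait()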
>
> I'll post test results at some point if anyone cares.
>
> HTH,
> Paul
>

I've noticed that with -a as well, but never fully diagnosed it. I'd be
interested to see your results.
As two data points: with the script as pasted, I'm able to keep my
8-core machine at about 95% CPU usage pretty constantly while writing
to a non-NFS, non-RAID drive and reading from a different drive.
On an aging 24-node cluster, the bottleneck was definitely IO (or
network) because of the NFS; the head node would be at full load. Add
to that the time to scp sequences and output to/from the cluster.


