[bip] parallel recipe and bio libraries.

Paul Davis paul.joseph.davis at gmail.com
Mon Feb 4 14:53:21 PST 2008


A couple of caveats with this: the claims about which is more efficient
definitely depend on your input-to-database size ratio. Whichever of
the two options you pick, you're going to have to read either the
database or the query multiple times. If your db size and query size are
approximately equal (all-vs-all blasting) then (theoretically) the two
methods would be equivalent in terms of disk I/O efficiency.
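
A back-of-the-envelope sketch of that I/O argument, assuming no OS
caching and that every worker re-reads whichever input it was not given
a private piece of. The function name and the sizes are made up for
illustration:

```python
def total_bytes_read(query_size, db_size, n_workers, split="query"):
    """Rough total disk reads across all workers (no caching assumed)."""
    if split == "query":
        # Each worker gets 1/n of the queries but scans the whole database.
        return query_size + n_workers * db_size
    if split == "db":
        # Each worker gets 1/n of the database but reads every query.
        return db_size + n_workers * query_size
    raise ValueError("split must be 'query' or 'db'")


# All-vs-all: query and database are the same size, so the totals match.
same = 10**9
print(total_bytes_read(same, same, 4, "query") ==
      total_bytes_read(same, same, 4, "db"))

# A small query set against a big database favours splitting the database.
print(total_bytes_read(10**6, 10**9, 4, "query") >
      total_bytes_read(10**6, 10**9, 4, "db"))
```

Both comparisons print True under these assumptions; with real disks and
a warm page cache the picture can look quite different.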

Also, remember that your e-values are going to change if you change
the database size (by splitting it). And to my knowledge, it's
impossible to exactly match the single-database e-values when blasting
multiple sequences at once.
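
To see why the database size matters, here is a minimal sketch using the
textbook Karlin-Altschul form E = K * m * n * exp(-lambda * S), where m
is the query length and n the database length. The K and lambda values
below are placeholders, not real BLAST parameters, and the length
corrections that BLAST applies are ignored:

```python
import math


def evalue(score, query_len, db_len, K, lam):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return K * query_len * db_len * math.exp(-lam * score)


K, LAM = 0.1, 0.3  # placeholder values, for illustration only

full = evalue(50, 300, 4_000_000, K, LAM)     # hit against the whole database
quarter = evalue(50, 300, 1_000_000, K, LAM)  # same hit, database split in 4

# The same alignment score looks more significant against the smaller piece.
print(full / quarter)
```

In this simplified model the e-value scales linearly with database
length, so the ratio printed is just the ratio of the two database
sizes.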

In other news, I've noticed that the concurrent version of blast (i.e.,
-a #threads, not mpi-blast) is unable to keep all cores busy. My first
guess is that blast has some silly behavior when writing to the output
file. As near as I can tell, blast wants to write output sequences in
the same order they were read.

So in general, blast's parallel algorithm would be something like this:

Create N sequence chunks of roughly equal size
Let each of N threads process one chunk
Sort and write the chunk results to file

Which hopefully isn't the case but it sure would fit the symptoms.
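
A toy model of that guess (not blast's actual code, just the pattern
described above): one chunk per thread, with results collected in input
order. The sleep stands in for search time, and all names are
hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
import time


def process_chunk(chunk):
    """Stand-in for searching one chunk; cost grows with the numbers in it."""
    time.sleep(0.01 * sum(chunk))
    return sorted(chunk)


def run_in_input_order(chunks, n_threads):
    """Return per-chunk results in the order the chunks were submitted."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() yields results in submission order, so a slow first chunk
        # holds up the output of every later chunk, and threads that
        # finish early have nothing left to do.
        return list(pool.map(process_chunk, chunks))


chunks = [[3, 1, 2], [1, 0], [0, 2]]
print(run_in_input_order(chunks, n_threads=3))
```

With uneven chunks, the wall-clock time is set by the slowest chunk
while the other threads sit idle, which would fit the symptoms.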

So that's a roundabout way to say that I'm contemplating running
multiple copies of non-parallel blast. AFAICT, this would end up
being faster, assuming that the size of your input is ~= db size, that
you have a large number of sequences (> 100K), and that you
have enough RAM to hold all copies of the database in memory.
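
A minimal sketch of that plan, assuming the legacy blastall command line
(-p program, -d database, -i input, -o output, -a number of processors).
The chunk file names, the choice of blastn, and the helper names are
placeholders for illustration:

```python
import subprocess


def split_fasta(text, n_chunks):
    """Deal FASTA records round-robin into at most n_chunks strings."""
    records = []
    for line in text.splitlines(keepends=True):
        if line.startswith(">"):
            records.append([line])      # header line starts a new record
        elif records:
            records[-1].append(line)    # sequence line joins current record
    chunks = [[] for _ in range(n_chunks)]
    for i, rec in enumerate(records):
        chunks[i % n_chunks].append("".join(rec))
    return ["".join(c) for c in chunks if c]


def blast_commands(db, n_chunks, prefix="chunk"):
    """One single-threaded blastall command line per chunk file."""
    return [["blastall", "-p", "blastn", "-d", db,
             "-i", f"{prefix}{i}.fa", "-o", f"{prefix}{i}.out", "-a", "1"]
            for i in range(n_chunks)]


def run_all(text, db, n_chunks, prefix="chunk"):
    """Write the chunk files, start one blastall per chunk, wait for all."""
    chunks = split_fasta(text, n_chunks)
    for i, chunk in enumerate(chunks):
        with open(f"{prefix}{i}.fa", "w") as fh:
            fh.write(chunk)
    procs = [subprocess.Popen(cmd)
             for cmd in blast_commands(db, len(chunks), prefix)]
    return [p.wait() for p in procs]
```

Calling run_all(open("queries.fa").read(), "mydb", 4) would start four
independent processes, each writing its own output file. Round-robin
dealing balances record counts rather than total sequence length, so
chunks can still be uneven if sequence lengths vary a lot.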

I'll post test results at some point if anyone cares.

HTH,
Paul

On 2/4/08, Bruce Southey <bsouthey at gmail.com> wrote:
> Hi,
> Ignoring using a parallel version of blast, I think that there are two
> ways to go:
> 1) Split the sequences across processors (one sequence against the
> database for every processor) which I think you are doing.
> 2) Split the database across processors into the same number of pieces
> as number of processors available and blast each piece by the same
> sequence.
>
> Obviously the second approach used by Blast is useful for a single
> sequence. However, there may not be that much of a difference, but the
> second can be more efficient because the database is only read once
> (there is a point about this in the documentation). But that doesn't
> necessarily mean it is faster or significantly faster when multiple
> sequences are used. Also, as you pointed out, you can get separate
> output files for each sequence rather than parsing a single output.
>
> > but is there a builtin way to queue jobs with ncbi blast?
> Not that I know of except by doing what you have done.
>
> > would things go faster if i blasted against a single sequence?
> I really do not know. If the 10kmers are overlapping parts of the
> genome, then I would think you would have a smaller database and hence
> be quicker. But using something like Bl2seq might be a better option.
> Perhaps the best alternative is to use Blat, see the FAQ at
> http://genome.ucsc.edu/FAQ/FAQblat for some info.
>
> Regards
> Bruce
>
>
> On Feb 3, 2008 8:30 PM, Brent Pedersen <bpederse at gmail.com> wrote:
> > On Feb 3, 2008 6:11 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> > > Hi,
> > > I have not really looked at the code in depth but Blast uses as many
> > > cpus as told to. Also it handles multiple sequences in a single file
> > > which, in theory, is meant to be more efficient. Disk IO is also
> > > a limiting factor, especially with SMP (dual processors/cores), so I
> > > usually find there is no advantage in doing database formatting in
> > > parallel.
> > >
> > > So what am I missing here?
> > >
> > > Regards
> > > Bruce
> > >
> >
> > good point. though i find that the -a option doesn't always work as (i
> > think) it should. but is there a builtin way to queue jobs with ncbi
> > blast?
> > even if i were to use only a single header per chromosome (no 10kmers),
> > that'd be 144 jobs. and, i prefer to have the blast output go to
> > separate files.
> > i forgot to explain the 10kmers...
> > we store genomic sequence in our database as 10kmers so i often just
> > use it like that. especially with poorly annotated genomes where
> > trusting gene models is not a good idea.
> > would things go faster if i blasted against a single sequence?
> >
> > thanks,
> > -b
> >
>
> _______________________________________________
> biology-in-python mailing list - bip at lists.idyll.org.
>
> See http://bio.scipy.org/ for our Wiki.
>


