[bip] parallel recipe and bio libraries.

Mon Feb 4 11:09:48 PST 2008

Hi,
Setting aside a parallel version of blast, I think that there are two
ways to go:
1) Split the sequences across processors (each processor runs one
sequence against the whole database), which I think is what you are
doing.
2) Split the database into as many pieces as there are processors
available and blast every piece with the same sequence.
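
A rough sketch of option 1 in Python, only to make the idea concrete.
It assumes the queries arrive as FASTA text; read_fasta and
split_round_robin are names made up for this example, not part of any
Blast or Biopython API. Each chunk would then be written to its own
file and blasted separately.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_round_robin(records, n_chunks):
    """Deal records into n_chunks lists, one list per processor."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks
```

Dealing the records out round-robin keeps the chunks close to the same
size in number of sequences, though not necessarily in total length.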

Obviously the second approach, which is the one Blast itself uses, is
useful for a single sequence. There may not be much difference between
the two, but the second can be more efficient because the database is
only read once (there is a point about this in the documentation). That
doesn't necessarily mean it is faster, or significantly faster, when
multiple sequences are used. Also, as you pointed out, splitting by
sequence gives you a separate output file for each sequence rather than
a single output that has to be parsed.

> but is there a builtin way to queue jobs with ncbi blast?
Not that I know of except by doing what you have done.
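
For what it is worth, a home-made queue can be only a few lines of
Python. This is a minimal sketch, not a tested recipe: it assumes the
legacy blastall program is on the PATH with its usual -p, -d, -i and -o
options, and the function names and the ".blast" output suffix are
invented for the example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def blast_command(query_path, db, out_path):
    # Assumed legacy NCBI command line: -p program, -d database,
    # -i query file, -o output file.
    return ["blastall", "-p", "blastn", "-d", db,
            "-i", query_path, "-o", out_path]


def run_queue(query_paths, db, n_workers,
              build=blast_command, run=subprocess.call):
    """Run one job per query file, at most n_workers at a time.

    Returns a dict mapping each query path to its exit status.
    """
    def job(query_path):
        out_path = query_path + ".blast"  # one output file per query
        return query_path, run(build(query_path, db, out_path))

    # Threads are enough here: each one just waits on an external process.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return dict(pool.map(job, query_paths))
```

Something like run_queue(paths, "mydb", 4) would then keep four
searches going until the list of query files is exhausted, with each
result in its own file.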

> would things go faster if i blasted against a single sequence?
I really do not know. If the 10kmers are overlapping parts of the
genome, then I would think a single sequence would give you a smaller
database and hence a quicker search. But using something like Bl2seq
might be a better option. Perhaps the best alternative is to use Blat;
see the FAQ at http://genome.ucsc.edu/FAQ/FAQblat for some info.

Regards
Bruce

On Feb 3, 2008 8:30 PM, Brent Pedersen <bpederse at gmail.com> wrote:
> On Feb 3, 2008 6:11 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> > Hi,
> > I have not really looked at the code in depth but Blast uses as many
> > cpus as it is told to. It also handles multiple sequences in a single
> > file which, in theory, is meant to be more efficient. Disk IO is also
> > a limiting factor, especially with SMP (dual processors/cores), so I
> > usually find there is no advantage in doing database formatting in
> > parallel.
> >
> > So what am I missing here?
> >
> > Regards
> > Bruce
> >
>
> good point. though i find that the -a option doesn't always work as (i
> think) it should. but is there a builtin way to queue jobs with ncbi
> blast?
> even if i were to use only a single header per chromosome (no 10kmers),
> that'd be 144 jobs. and, i prefer to have the blast output go to
> separate files.
> i forgot to explain the 10kmers...
> we store genomic sequence in our database as 10kmers so i often just
> use it like that. especially with poorly annotated genomes where
> trusting gene models is not a good idea.
> would things go faster if i blasted against a single sequence?
>
> thanks,
> -b
>