[bip] parallel recipe and bio libraries.

Bruce Southey bsouthey at gmail.com
Mon Feb 4 18:29:20 PST 2008


Hi,
I agree that there are many factors involved and most are not easy to
quantify. For SMP systems (i.e. multiple cores), disk I/O is a major
bottleneck because there is usually only one disk. Processors therefore
have to wait their turn to use the disk, whether reading or writing,
and that can make the other aspects rather moot.
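
As a rough illustration of that point, here is a toy model in Python. It
assumes (my simplification, not a measurement) that all disk I/O is fully
serialized on the one disk while the CPU work divides evenly across
processors. The function names and the numbers in the usage note are
hypothetical.

```python
def estimated_wall_time(io_seconds, cpu_seconds, n_procs):
    """Toy estimate: serialized disk I/O plus evenly divided CPU work.

    Assumption: only one process can use the disk at a time, so the
    I/O time does not shrink as processors are added.
    """
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return io_seconds + cpu_seconds / n_procs


def estimated_speedup(io_seconds, cpu_seconds, n_procs):
    """Speedup over a single processor under the same toy model."""
    serial = estimated_wall_time(io_seconds, cpu_seconds, 1)
    return serial / estimated_wall_time(io_seconds, cpu_seconds, n_procs)
```

With made-up inputs of 60 s of I/O and 240 s of CPU work, four processors
give an estimated 120 s instead of 300 s, a speedup of 2.5 rather than 4.
The larger the I/O share, the less the extra cores help.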

Bruce

On Feb 4, 2008 4:53 PM, Paul Davis <paul.joseph.davis at gmail.com> wrote:
> A couple of caveats with this: the claims about which is more efficient
> definitely depend on your input-to-database size ratio. With either
> option, you're going to have to read either the database or
> the query multiple times. If your db size and query size are
> approximately equal (all-vs-all blasting) then (theoretically) the two
> methods would be equivalent in terms of disk I/O efficiency.
>
> Also, remember that your e-values are going to change if you change
> the database size (by splitting it). And to my knowledge, it's
> impossible to exactly match the single-database e-values when blasting
> multiple sequences at once.
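
A minimal sketch of why the e-values shift, assuming the standard
Karlin-Altschul relation in which, for a fixed alignment score, the e-value
is proportional to the size of the search space. The function name is
hypothetical, and the sketch ignores blast's effective-length corrections,
so it only approximates the single-database value.

```python
def rescale_evalue(evalue_piece, piece_db_length, full_db_length):
    """Approximate the full-database e-value from a hit found in one piece.

    Assumption: e-value scales linearly with database length for a fixed
    score; effective-length (edge) corrections are ignored.
    """
    if piece_db_length <= 0 or full_db_length <= 0:
        raise ValueError("database lengths must be positive")
    return evalue_piece * (full_db_length / piece_db_length)
```

For example, a hit reported against one quarter of a database would be
scaled up by a factor of four under this assumption.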
>
> In other news, I've noticed that the concurrent version of blast (i.e.,
> -a #threads, not mpi-blast) is unable to keep all cores busy. My first
> guess is that blast has some silly behavior when writing to the output
> file. As near as I can tell, blast wants to write output sequences in
> the same order they were read.
>
> So in general, blast's parallel algorithm would be like this:
>
> Create N sequence chunks of roughly equal size
> Let each of the N threads process one chunk
> Sort and write chunk results to file
>
> Which hopefully isn't the case but it sure would fit the symptoms.
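
The guessed scheme above can be sketched in Python as follows. This is only
an illustration of the hypothesis, not blast's actual implementation;
`worker` is a hypothetical stand-in for searching one query sequence.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous chunks of roughly equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_in_input_order(items, worker, n_threads):
    """Process one chunk per thread and return results in input order."""
    chunks = split_into_chunks(items, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() yields chunk results in submission order, so output from
        # later chunks waits until every earlier chunk has finished.
        chunk_results = pool.map(
            lambda chunk: [worker(item) for item in chunk], chunks
        )
        return [result for chunk in chunk_results for result in chunk]
```

If one chunk happens to hold the slow queries, the threads that finish early
sit idle until it completes, which would match the idle cores described
above.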
>
> So that's a roundabout way to say that I'm contemplating running
> multiple copies of non-parallel blast. AFAICT, this would end up
> being faster assuming that the size of your input is ~= db size,
> you have a large number of sequences (> 100K), and
> you have enough RAM to hold all copies of the database in memory.
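
One way this could be set up, as a sketch only: split the query FASTA into
one group per job and build a single-threaded legacy `blastall` command for
each group. The helper names, file names, and database name are
hypothetical; the flags used (`-p`, `-d`, `-i`, `-o`, `-a`) are the legacy
`blastall` options for program, database, input, output, and processor
count.

```python
def read_fasta_records(text):
    """Return FASTA records as strings, each starting with its '>' header."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            # A new header closes the record collected so far.
            records.append("\n".join(current) + "\n")
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


def deal_round_robin(records, n_jobs):
    """Deal records into n_jobs groups of similar size."""
    return [records[i::n_jobs] for i in range(n_jobs)]


def blastall_command(program, db, query_path, out_path):
    """Build argv for one single-threaded blastall run (-a 1)."""
    return ["blastall", "-p", program, "-d", db,
            "-i", query_path, "-o", out_path, "-a", "1"]
```

Each group would be written to its own query file and each command started
as a separate process (for instance with `subprocess.Popen`), giving one
output file per job. Whether this beats `-a` depends on the RAM and disk
caveats already mentioned.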
>
> I'll post test results at some point if anyone cares.
>
> HTH,
> Paul
>
>
> On 2/4/08, Bruce Southey <bsouthey at gmail.com> wrote:
> > Hi,
> > Ignoring using a parallel version of blast, I think that there are two
> > ways to go:
> > 1) Split the sequences across processors (one sequence against the
> > database for every processor) which I think you are doing.
> > 2) Split the database across processors into the same number of pieces
> > as number of processors available and blast each piece by the same
> > sequence.
> >
> > Obviously the second approach, which Blast uses, is useful for a single
> > sequence. There may not be much of a difference, but the
> > second can be more efficient because the database is only read once
> > (there is a point about this in the documentation). But that doesn't
> > necessarily mean it is faster, or significantly faster, when multiple
> > sequences are used. Also, as you pointed out, you can get separate
> > output files for each sequence rather than parsing a single output.
> >
> > > but is there a builtin way to queue jobs with ncbi blast?
> > Not that I know of except by doing what you have done.
> >
> > > would things go faster if i blasted against a single sequence?
> > I really do not know. If the 10kmers are overlapping parts of the
> > genome, then I would think you would have a smaller database and hence
> > be quicker. But using something like Bl2seq might be a better option.
> > Perhaps the best alternative is to use Blat, see the FAQ at
> > http://genome.ucsc.edu/FAQ/FAQblat for some info.
> >
> > Regards
> > Bruce
> >
> >
> > On Feb 3, 2008 8:30 PM, Brent Pedersen <bpederse at gmail.com> wrote:
> > > On Feb 3, 2008 6:11 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> > > > Hi,
> > > > I have not really looked at the code in depth but Blast uses as many
> > > > cpus as told to. It also handles multiple sequences in a single file
> > > > which, in theory, is meant to be more efficient. Disk I/O is also
> > > > a limiting factor, especially with SMP (dual processors/cores), so I
> > > > usually find there is no advantage in doing database formatting in
> > > > parallel.
> > > >
> > > > So what am I missing here?
> > > >
> > > > Regards
> > > > Bruce
> > > >
> > >
> > > good point. though i find that the -a option doesn't always work as (i
> > > think) it should. but is there a builtin way to queue jobs with ncbi
> > > blast?
> > > even if i were to use only a single header per chromosome (no 10kmers),
> > > that'd be 144 jobs. and, i prefer to have the blast output go to
> > > separate files.
> > > i forgot to explain the 10kmers...
> > > we store genomic sequence in our database as 10kmers so i often just
> > > use it like that. especially with poorly annotated genomes where
> > > trusting gene models is not a good idea.
> > > would things go faster if i blasted against a single sequence?
> > >
> > > thanks,
> > > -b
> > >
> >
>
> > _______________________________________________
> > biology-in-python mailing list - bip at lists.idyll.org.
> >
> > See http://bio.scipy.org/ for our Wiki.
> >
>


