[socal-piggies] Treaps vs heaps for best m of n problems
Dan Stromberg
strombrg at gmail.com
Sun Jan 17 16:27:30 PST 2010
Darius Bacon wrote:
>> From: Dan Stromberg <strombrg at gmail.com>
>>
>> It was suggested that a heap or priority queue might work well for what
>> I was using treaps for.
>>
>> Reminder: n is the number of files examined, m is the number of the
>> largest files you're interested in, and m is often much smaller than n.
>>
>> IINM, a heap would allow a nice O(n*log(n)) algorithm, but wouldn't work
>> that well for an O(n*log(m)) - because I believe that finding and
>> evicting the smallest value in a max heap is O(n). In "highest", the
>> smallest value is evicted after an insertion, to keep the datastructure
>> from becoming larger than it needs to be - we can do this, because we
>> know we only care about the 1000 (or whatever) largest values.
>>
>
> I'm likely misunderstanding since I missed the meeting, but:
>
> Isn't the idea to keep a *min*-heap of the up-to-m largest values seen
> so far? On considering each of the n inputs, you compare to the
> smallest of the m largest values so far and replace it if the current
> input is larger. That's O(n*log(m)) time with a good heap type like
> heapq.
>
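If I follow, the min-heap approach Darius describes looks roughly like this (just a sketch, not the actual code in highest):

```python
import heapq

def m_largest(values, m):
    """Return the m largest of the input values, largest first.

    Keeps a min-heap of the m largest values seen so far, so each of
    the n inputs costs at most O(log(m)): O(n*log(m)) overall.
    """
    heap = []
    for value in values:
        if len(heap) < m:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # heap[0] is the smallest of the m largest so far;
            # evict it and insert the new, larger value.
            heapq.heappushpop(heap, value)
    return sorted(heap, reverse=True)

print(m_largest([5, 1, 9, 3, 7, 8, 2], 3))  # → [9, 8, 7]
```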
It could be me misunderstanding.
There were two programs we discussed that would use this treap module
(highest does now, gprog soon will if time permits), though at this
point I'm pretty confident there are others that could benefit. I wrote
the module for gprog; highest was a nice real-life test bed for it:
1. http://stromberg.dnsalias.org/~strombrg/highest/ is the simpler
   example. I think heapq would yield an asymptotically equivalent
   implementation, though I'm not sure yet what the constants would
   be like. heapq would almost certainly require a little less
   memory, proportional to m.
2. For the other application (not yet on the web), named gprog, I
   believe the sometimes-useful trick of storing a min-heap of the m
   largest values wouldn't work, because you often need to evict old,
   weak values and also often need a copy of the strongest value -
   that is, you need both the least and the greatest to obtain a
   safe blocksize choice and a new blocksize addition.
gprog is a sort of bandwidth meter for pipes - a GUI one that uses
a cache-oblivious algorithm to avoid becoming a bottleneck. It forms
lots of guesses at a good transfer size and keeps the best m guesses
out of n (where n is the number of block size guesses tried) for
later use. It also benefits from being able to detect collisions
between the randomly generated current transfer size and a
previously-guessed transfer size.
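To make that two-ended requirement concrete, here's a minimal sketch (not the actual gprog code) of a best-m container that exposes both the weakest and strongest values plus collision detection. It uses a plain sorted list via bisect, which makes inserts O(m); a treap supports the same operations in O(log(m)):

```python
import bisect

class BestM:
    """Keep the m best scores, with O(1) access to both the least and
    the greatest value - the two queries gprog needs."""

    def __init__(self, m):
        self.m = m
        self.scores = []  # kept in ascending order

    def add(self, score):
        bisect.insort(self.scores, score)
        if len(self.scores) > self.m:
            # evict the weakest (smallest) value
            self.scores.pop(0)

    def weakest(self):
        return self.scores[0]

    def strongest(self):
        return self.scores[-1]

    def __contains__(self, score):
        # collision detection: has this exact value been kept already?
        i = bisect.bisect_left(self.scores, score)
        return i < len(self.scores) and self.scores[i] == score

best = BestM(3)
for score in [4, 9, 1, 7, 6]:
    best.add(score)
print(best.weakest(), best.strongest())  # → 6 9
print(7 in best, 5 in best)              # → True False
```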
You may find this interesting - a performance comparison of a
number of search tree data structures:
http://en.wikipedia.org/wiki/Binary_search_tree#cite_note-1
The result was that treaps gave the best average performance but a
fairly high standard deviation, while red-black trees gave good
performance with a low standard deviation.
Unfortunately, standard heaps don't appear to have been compared in the
paper - quite reasonably, as they don't really have the same purpose.
Today I added a third mode to "highest" that uses heapq (not yet on
the web); a pair of performance graphs should complete within a few
days. /Initial/ indications are that, at least as n varies, the
heapq implementation falls between the insort and treap
implementations in performance, though closer to the insort times
than the treap times.
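For anyone who wants to try a comparison like that locally, a rough timing harness along these lines works (the insort and heapq functions here are stand-ins for illustration, not the actual code in highest; absolute times will vary by machine):

```python
import bisect
import heapq
import random
import timeit

def best_m_insort(values, m):
    """Best m of n via a sorted list: O(m) per insert."""
    best = []
    for v in values:
        bisect.insort(best, v)
        if len(best) > m:
            best.pop(0)  # evict the smallest
    return best

def best_m_heapq(values, m):
    """Best m of n via a min-heap: O(log(m)) per insert."""
    heap = []
    for v in values:
        if len(heap) < m:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heappushpop(heap, v)
    return sorted(heap)

values = [random.random() for _ in range(20000)]
for name, func in [('insort', best_m_insort), ('heapq', best_m_heapq)]:
    seconds = timeit.timeit(lambda: func(values, 500), number=1)
    print('%s: %.3f seconds' % (name, seconds))
```

Both functions return the same m values (in ascending order), so the comparison measures only the data structure overhead.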
If I'm off in any of this, please do correct me!