[socal-piggies] Treaps vs heaps for best m of n problems

Sun Jan 17 16:27:30 PST 2010

Darius Bacon wrote:
>> From: Dan Stromberg <strombrg at gmail.com>
>>
>> It was suggested that a heap or priority queue might work well for what 
>> I was using treaps for.
>>
>> Reminder: n is the number of files examined, m is the number of the 
>> largest files you're interested in, and m is often much smaller than n.
>>
>> IINM, a heap would allow a nice O(n*log(n)) algorithm, but wouldn't work 
>> that well for an O(n*log(m)) - because I believe that finding and 
>> evicting the smallest value in a max heap is O(n).  In "highest", the 
>> smallest value is evicted after an insertion, to keep the datastructure 
>> from becoming larger than it needs to be - we can do this, because we 
>> know we only care about the 1000 (or whatever) largest values.
>>     
>
> I'm likely misunderstanding since I missed the meeting, but:
>
> Isn't the idea to keep a *min*-heap of the up-to-m largest values seen
> so far? On considering each of the n inputs, you compare to the
> smallest of the m largest values so far and replace it if the current
> input is larger. That's O(n*log(m)) time with a good heap type like
> heapq.
>   
It could be me misunderstanding.

There were two programs we discussed that would use this treap module 
(highest does now, gprog soon will if time permits), though at this 
point I'm pretty confident there are others that could benefit.  I wrote 
the module for gprog; highest was a nice real-life test bed for it:

   1. http://stromberg.dnsalias.org/~strombrg/highest/ is the simplest
      example, and... I think use of heapq will yield an asymptotically
      equivalent implementation; I'm not sure what the constants would
      be like yet.  heapq would almost certainly require a little less
      memory, proportionate to m.
   2. For the other application (not yet on the web) named gprog, I
      believe the sometimes useful trick of storing a minheap of the m
      largest values wouldn't work because you often need to evict old,
      weak values and also often need to get a copy of the strongest
      value - that is, you need both the least and the greatest to
      obtain a safe blocksize choice and a new blocksize addition). 
      gprog is a sort of bandwidth meter for pipes - a GUI one that uses
      a cache oblivious algorithm to avoid becoming a bottleneck... So
      it forms lots of guesses as to good transfer size and keeps around
      the best m guesses out of n (where n is the number of block size
      guesses tried) for later   It also benefits from being able to
      detect collisions between the randomly generated current transfer
      size, and a previously-guessed transfer size.

You may find this interesting.  It's a performance comparison of a 
number of search tree datastructures:
http://en.wikipedia.org/wiki/Binary_search_tree#cite_note-1
The result was that treaps gave the best average performance but had a 
highish standard deviation, and red black trees gave good performance 
with a low standard deviation.
Unfortunately, standard heaps don't appear to have been compared in the 
paper - quite reasonably, as they don't really have the same purpose.

I've today added a third mode to "highest" that uses heapq (not yet on 
the web); a pair of performance graphs should complete within a few 
days.  /Initial/ indications are that, at least for varied n, heapq 
yields an implementation that's in between the insort and treap 
implementations in performance, though closer to the insort times than 
the treap times.

If I'm off in any of this, please do correct me!

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.idyll.org/pipermail/socal-piggies/attachments/20100117/975b491b/attachment.htm>