[socal-piggies] Treaps vs heaps for best m of n problems

Sat Jan 16 12:12:35 PST 2010

> From: Dan Stromberg <strombrg at gmail.com>
> 
> It was suggested that a heap or priority queue might work well for what 
> I was using treaps for.
> 
> Reminder: n is the number of files examined, m is the number of the 
> largest files you're interested in, and m is often much smaller than n.
> 
> IINM, a heap would allow a nice O(n*log(n)) algorithm, but wouldn't work 
> that well for an O(n*log(m)) - because I believe that finding and 
> evicting the smallest value in a max heap is O(n).  In "highest", the 
> smallest value is evicted after an insertion, to keep the datastructure 
> from becoming larger than it needs to be - we can do this, because we 
> know we only care about the 1000 (or whatever) largest values.

I'm likely misunderstanding since I missed the meeting, but:

Isn't the idea to keep a *min*-heap of the up-to-m largest values seen
so far? On considering each of the n inputs, you compare to the
smallest of the m largest values so far and replace it if the current
input is larger. That's O(n*log(m)) time with a good heap type like
heapq.

Darius