Keeping things in (partial) order

I this thread I will discuss about making order in our life… well… at least in a long vector of objects… OK, just a small part of that vector.

This is a fundamental aspect of search engines: you are interested in getting only the k most relevant results, whatever the size of the collection you are searching on is.
The k value is usually very small with respect to the collection size, e.g. on the Web one is likely to look at just the first ten-twenty results. Search engines typically set a z value of maximum returned results that is designed to be larger than the largest part of possible k values, e.g. Google returns at most one thousand results.

Let us say our query returns six millions results, each one with a relevance score, and we are interested in getting the one thousand results with the highest relevance score.

The first way to get that one thousand results is to sort the whole six millions vector of results by their relevance score and then get the first one thousand. This is very inefficient.

The simple observation is that we do not care much about the set of 5,999,000 results we do not want to show, and instead we are sorting them, wasting a lot of time and space. The only thing we want to be sure is that all of them have a relevance score less than the thousandth result.

A second way to accomplish our mission is to just keep a vector of just one thousand results.
We fill it with the first one thousand results, then we scroll it to find the result with the lowest relevance value.
We then start looking at each of the remaining results, replacing the current lowest-relevance result in the vector whenever it has a lower relevance than the one currently observed. When the replacement happens we have also to recheck the vector to find the new lowest-relevance result.

Good, we are now space-efficient but we can still improve on time efficiency.
A first lazy way to get the lowest-relevance result after a replacement consists in doing a linear scan of the vector.
An eager way to solve the task consists in keeping the vector sorted by relevance and reordering it after any replacement. Although this method is not efficient, we are on the good way to reach our goal. The fact is that in this phase we just need to keep track of which is the lowest-relevance result, and only at the end we have to completely sort the vector of results.

The right tool that allow us to perform the task is a heap. A heap allow us to efficiently keep track of the lowest-ranked result with logarithmic cost with respect. Finally, the heapsort algorithm can be used to sort the vector.

Now that we are happy about our space and time efficiency we may look at the code. The following class implements four static methods that can be used to create and maintain a heap structure on top of a vector of IComparable objects, to efficiently produce a completely ordered vector from a heap-structured vector and, finally, to get the top k element from an arbitrarily long enumeration of objects.

// Base - The toolbox of the non-project-specific things.
// Copyright (C) 2007-2009 Andrea Esuli
//
// This program is free software; you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation; either version 2 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
//
// Andrea Esuli ( a n d r e a _ at _ e s u l i . i t )

using System;
using System.Collections.Generic;

namespace Esuli.Base
{
    public class HeapUtils
    {
        public static void Heapify(List items, IComparer comparer)
        {
            if (items.Count < 2)
                return;
            int parent = (items.Count - 2) / 2;
            while (parent >= 0)
            {
                SubHeapify(items, parent, items.Count, comparer);
                --parent;
            }
        }

        private static void SubHeapify(List items, int parent, int size, IComparer comparer)
        {
            int lastChild = 2 * (parent + 1);
            while (lastChild < size)
            {
                if (comparer.Compare(items[lastChild - 1], items[lastChild]) > 0)
                    --lastChild;
                if (comparer.Compare(items[parent], items[lastChild]) > 0)
                    break;

                T temp = items[parent];
                items[parent] = items[lastChild];
                items[lastChild] = temp;

                parent = lastChild;
                lastChild = 2 * (parent + 1);
            }
            if (lastChild == size)
            {
                --lastChild;
                if (comparer.Compare(items[parent], items[lastChild]) < 0)
                {
                    T temp = items[parent];
                    items[parent] = items[lastChild];
                    items[lastChild] = temp;
                }
            }
        }

        public static void HeapSort(List items, IComparer comparer, bool isHeap)
        {
            if (!isHeap)
                Heapify(items, comparer);

            int lastPosition = items.Count - 1;
            while (lastPosition > 0)
            {
                T temp = items[lastPosition];
                items[lastPosition] = items[0];
                items[0] = temp;
                SubHeapify(items, 0, lastPosition, comparer);
                --lastPosition;
            }
        }

        public static void ReplaceItem(T newItem, int position, List items, IComparer comparer)
        {
            items[position] = newItem;
            while (true)
            {
                SubHeapify(items, position, items.Count, comparer);
                if (position == 0)
                    return;
                position = (position - 1) / 2;
            }
        }

        public static List GetTopK(int k, IEnumerator resultsEnumerator, IComparer comparer)
        {
            List topResults = new List(k);
            while (resultsEnumerator.MoveNext())
            {
                T result = resultsEnumerator.Current;
                if (topResults.Count < k)
                {
                    topResults.Add(result);
                    if (topResults.Count == k)
                        HeapUtils.Heapify(topResults, comparer);
                }
                else if (comparer.Compare(result,topResults[0])<0)
                    HeapUtils.ReplaceItem(result, 0, topResults, comparer);
            }

            if (topResults.Count < k)
                HeapUtils.Heapify(topResults, comparer);
            HeapUtils.HeapSort(topResults, comparer, true);
            return topResults;
        }
    }
}

One thought on “Keeping things in (partial) order

  1. Pingback: Keeping things in (partial) order, 2nd part | esuli.it

Leave a Reply

Your email address will not be published.


8 − one =

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>