Quick Select Recap: 6 Different Methods, Median of Medians Algorithm

Overview

I discuss 6 methods (sort, heap, hash, .etc) to return kth item in an unordered array, median of medians algorithm even gives O(N) bound for the worst case.

The quick selection problem is discussed as an application of the quick sort in the Algorithm Part 1 course from Coursera, which gives me a good chance to recap the problem. I actually have discussed this problem one year ago in my posts: post1 and post2, in Chinese though. I would like to refresh all the stuff in English and give even more compact code.

6 Different methods

To tackle this problem, as far as I know there are 6 different methods:

  1. Sort the array and direct return a[k], this method put all the elements in its own right position and takes O(NlgN) time, since we only need to put the k-th item in its correct position, it turns out to be unnecessary to sort them all.
  2. Binray search [Min, Max] recursively where Min and Max are the minimum and maximum value in aray a[nLi, nHi];
  3. Make a maximum heap of size K, scan the whole array to compare every item with the top one in the heap, if the item >= top, it is ignored, if the item < top, the top one is popped out and the new item is pushed into the heap, it takes (Nlog K) time
  4. Use Hash table to calculate the counts of each item in the array, and liear scan the hash table to get the k-th item, it takes O(N + M) time where M is the maximum number in the input
  5. Use the Partition function in the quick sort and we only deal with one side. It takes O(N) average when random shuffle is adopted and O(N^2) time in the worst case
  6. Median of Medians algorithm which takes O(N) time in the worst case, however the constant is too large and make it not practical in real applications

I will discuss method 5 and 6 in detail.

Quick Selection

First, let’s take a look at the implementation of method 5, I wrote it in C++ based on the Java version in that course:

int partition(int a[], int nLo, int nHi) {
	int nPivot = a[nLo];
	int i = nLo, j = nHi + 1;

	while (true) {
		while (i < nHi && a[++i] < nPivot);
		while (j > nLo && a[--j] > nPivot);

 		if (i >= j) break;  //
		std::swap(a[i], a[j]);
	}
	std::swap(a[nLo], a[j]);
	return j;
}

int selectKthElement(int k, int naryNumbers[], int nLen) {
        if (k >= nLen) throw invalid_arugement_exception;

	std::random_shuffle(naryNumbers, naryNumbers + nLen);
	int nLo = 0, nHi = nLen - 1;
	int j = 0;
	while (nHi > nLo) {
		j = partition(naryNumbers, nLo, nHi);
		if      (j < k) nLo = j + 1;
		else if (j > k) nHi = j - 1;
		else    return naryNumbers[k]; //
	}
	return naryNumbers[k];
}

The partition function is just the normal 2-way partition I have discussed in the previous posts, and the while loop in the selectKthElement() function is the key.

We know after partition the j-th element is in its right position, and all the items before j are smaller than or queal to a[j], for those after j, they are largeer or equal, so it becomes obvious, if j < k, we need to find j in the smaller part and if j > k the k-th element has to be in later part, if j == k, then we already put k-th element in its right position, we are done,

**Remarks: **

While (nHi > nLo) is important and cannot be replaced with say while (nHi >= nLo). Note we return a[k] after while, we could safely  do this because

  1. after while (nHi > nLo) it has to be nHi == nLo
  2. k must be in its right position in the final statement “return naryNumbers[k]“, otherwise, it would be in left or right part and returned in the while loop
  3. the above code is more compact than the one in my old post1 in Chinese, in that old one, I have to calculate the relative k in the recursive call (i.e., we find a different value k-th element in the subproblems, while here it is always consistent to identify the original k-th element) and that would be error-prone, this should a noteworthy point to learn.

Median of Medians Algorithm

The median of medians algorithm tries to use the median of medians of groups (each groups is of size 5) as the pivot so that the depth of the recursive tree won’t be too big (will be logN), here is the approach.

  1. divide the whole array into groups, each group is of size 5 except the last one. O(N) time
  2. pick the median of each group and make a new array consisting of all these medians O(N) with large constant think about why?
  3. recursively call the current algorithm to identify the median of these medians in the new array, i.e., find the (M/2)th element where M is the number of groups
  4. pick the median from step 3 and put it in the left most position so we can directly call the partition function, this could be done by swapping in O(N) time, think about how?
  5. Do regular quick select as method 5

We can derive recursive formula to get the time complexity, the formal proof is quite complicated and is beyond my discussion here but roughly, step 3 and 4 try to guarantee that there are 0.3N items before the pivot and 0.3N items after that pivot and thus guarantee the final time complexity is linear although the constant becomes quite large and make this approach not practical.

The C++ code could be found in my old post2 here and you can read that for further reference.

Summary

6 methods (sort, heap, hash, .etc) to return kth item in an unordered array are discussed, median of medians algorithm even gives O(N) bound for the worst case.

So these are all what I want to summarize here in this post and I treat it as a recap for myself because I have already discussed it in Chinese one year ago. Please feel free to leave any comments or corrections. thanks a lot!


(Please specify the source  烟客旅人 sigmainfy — http://www.sigmainfy.com  as well as the original permalink

URL for any reprints,  and please do not use it for commercial purpose)

Written on October 22, 2014