Random Generate Subset with Length K By Reservoir Sampling

Overview:

We discuss the problem to generate K (K <= N) items out of N total items uniformly at random while only storing at most K of them. Reservoir Sampling  will be utilized and the probability will be proved here. Special care needs to be taken to correctly implement the algorithm and make the sampling truly uniformly random.

Problem Definition:

This problem is adopted from Algorithms, Part I on coursera:

Subset client. Write a client program Subset.java that takes a command-line integer k; reads in a sequence of N strings from standard input using StdIn.readString(); and prints out exactly k of them, uniformly at random. Each item from the sequence can be printed out at most once. You may assume that 0 ≤ kN, where N is the number of string on standard input.

% echo A B C D E F G H I | java Subset 3       % echo AA BB BB BB BB BB CC CC | java Subset 8
C                                              BB
G                                              AA
A                                              BB
                                               CC
% echo A B C D E F G H I | java Subset 3       BB
E                                              BB
F                                              CC
G                                              BB

The running time of Subset must be linear in the size of the input. You may use only a constant amount of memory plus either one Deque or RandomizedQueue object of maximum size at most N, where N is the number of strings on standard input. (For an extra challenge, use only one Deque or RandomizedQueue object of maximum size at most k.)

Solutions and Analysis:

It would be difficult for people to come up with the solution which stores at most k objects if they never touched statistics stuff like sampling, and the solution to get the bonus point for this problem turns out to be using the so called Reservoir Sampling which I actually have already discussed it one year ago in my old post: “Reservoir Sampling” (it was written in Chinese though). The idea is simple,  just follow the steps (I assume the index starts from zero):

1) for the first k objects, just store them into the queue

2) for any later objects you read from the input you need to determine whether to put it into the queue or not

3) for the j-th object where j >= k (index starts from zero), we randomly generate an integer r within the range [0, j] (both sides inclusively), if r < k, we then just replace the r-th objects in the queue with the newly encountered j-th object. Otherwise, we continue and the j-th object will not go into the queue.

Now let’s prove that this algorithm will generate the subset in an uniformly random way, that is, each object in the input would have the same probability to get into the subset.

1) The probability for any j-th object get into the final queue (or subset) is k / (j + 1), you should figure this out by yourself

2) For the j-th object to remain in the final subset, it has to be that all the t-th objects where t > j, they won’t replace the j-th object

3) The probability for any t-th object where t > j to replace j-th object is 1 / (t + 1), i.e., the probability it goes into the subset and goes into exactly the position where the j-th object resides. Think about why?

4) So the probablity for any t-th object where t > j  not to replace j-th object is 1 – 1 / (t + 1)

5) All the things from 1) to 4) need to happen at the same time, so the final probability for any j-th object to remain in the final subset is:

k / (j + 1)  * (1 – 1 / (j + 2)) * (1 – 1 / (j + 3)) * … * (1 – 1 / (N + 1)) = k / (N + 1)

which turns out to be constant for any object.

And note another advantage of this approach is that it is online and does not need to know the size of the input in advance.

The tricky part:

Actually, the implementation is quite tricky. The key point is that the random number r has to be within the closed interval [0, j] rather in the half-open interval [0, j). Otherwise, for the special case k = 1, the first object will always be replace by the second object and thus making it not uniformly random. You should figure this out by yourself and keep it in your mind that the index starts from zero.

By the way, I summarize all the key points I have in my mind for the dequeue and randomized queue problems too to pass that assignment 2:

0) You might fail to pass the subset problem if you do not use any of the randomized queue or dequeue although from the description of the Reservoir Sampling, it actually does not need any queue data structure at all

1) How to read the input and stop when encountering EOF symbol in eclipse: Call the API while (!StdIn.isEmpty()) and use Ctrl + Z as EOF

2) Randomized queue actually works more like a bag that a real queue

3) Do not try to randomize enqueue() instead of dequeue(), in the test, dequeue() need to return objects differently each time it is called

4) Do not forget to shrink the array when implementing dequeue for the randomized queue

5) Do not forget to set  Node.next = Node.prev = null  to avoid Loitering

6) Update: Someone in the forum got stuck because of different behavior due to the implementation of randomized queue when trying the Reservoir Sampling. To keep the whole system uniform, you need a bit different modification based on the original Reservoir Sampling. Here is my way to resolve that: dequeue first every time you got a new item to determine enqueue it or not, and if we got r <= k, we enqueue the new item, otherwise, we enqueue the dequeued one!

Summary:

To summarize, I described  Reservoir Sampling algorithm to tackle the problem to generate K (K <= N) items out of N total items uniformly at random while only storing at most K of them. The analysis and proof of the probability is also given and a special case is pointed out which needs to be carefully taken care of.

(End. Original articles, please associate the author information and original URL with any reprints)


(Please specify the source  烟客旅人 sigmainfy — http://www.sigmainfy.com for any reprints,  

and please do not use it for commercial purpose)

Written on October 1, 2014