Reservoir sampling is a family of randomized algorithms for choosing a simple random sample, without replacement, of items from a population of unknown size in a single pass over the items.

The size of the population is not known to the algorithm, and is typically too large to fit in the memory at once. As a result, the input must be processed as a stream, and the algorithm cannot revisit previous items once they’ve been seen. Image Source: Nvidia

Motivation

Consider we have unknown number of cards in a deck and we draw cards one at a time. Each time getting a card, we can decide to hold it or discard it. If we already hold a card and decide to hold a new card, we will discard the held card and replace it with a new card.

And at the end we want to ensures all cards have been given an equal chance to be selected.

Picking One Sample

If we discard a card at a fixed probability, then we are much more likely to keep a new card and discard an old card. For example, if we discard card with a probability of , then the probability of keeping the last card is , while the probability of holding the first card is .

To compensate this, we instead accept each new card at a decreasing probability .

  • The first card is always kept with probability
  • For the second card, there is probability of it replacing the old card, and both card have probability of being retained
  • For the third card, there is probability it will replace the previous remaining card, so the first two card will each have probability remaining.

This pattern continue. For the -th card, the probability of it replacing previous card is , and the probability that each of the previous card get retained is

This ensures each card has exactly a chance of being selected.

Picking Multiple Samples

We can extend this algorithm to support sampling samples with equal probability from the stream. We will make the following change

  • We keep an array of elements in memory. This is called a reservoir (thus the algorithm name)
  • The first elements of the stream are placed directly into the reservoir
  • Rather than new element has probability to be selected, now it has probability to be selected
  • When we decide to replace an old element, we choose one of the existing elements in the array at random to replace This way, each element has equal probability to be included in the final reservoir.

Below is the pseudocode:

fn reservoir_sample(stream: Stream<T>, n: i32) {
  let mut reservoir = T[k]
  for i in 0..k {
    reservoir[i] = stream.next()
  }
 
  let mut i = k
  while element = stream.next() {
    j = random(0, i)
    if j < k:
        reservoir[j] = element
    i += 1
  }
}

See Also

References