Reservoir sampling is a family of randomized algorithms for choosing a simple random sample, without replacement, of
The size of the population
Motivation
Consider we have unknown number of cards in a deck and we draw cards one at a time. Each time getting a card, we can decide to hold it or discard it. If we already hold a card and decide to hold a new card, we will discard the held card and replace it with a new card.
And at the end we want to ensures all cards have been given an equal chance to be selected.
Picking One Sample
If we discard a card at a fixed probability, then we are much more likely to keep a new card and discard an old card. For example, if we discard card with a probability of
To compensate this, we instead accept each new card
- The first card is always kept with probability
- For the second card, there is
probability of it replacing the old card, and both card have probability of being retained - For the third card, there is
probability it will replace the previous remaining card, so the first two card will each have probability remaining.
This pattern continue. For the
This ensures each card has exactly a
Picking Multiple Samples
We can extend this algorithm to support sampling
- We keep an array of
elements in memory. This is called a reservoir (thus the algorithm name) - The first
elements of the stream are placed directly into the reservoir - Rather than new element has
probability to be selected, now it has probability to be selected - When we decide to replace an old element, we choose one of the existing
elements in the array at random to replace This way, each element has equal probability to be included in the final reservoir.
Below is the pseudocode:
fn reservoir_sample(stream: Stream<T>, n: i32) {
let mut reservoir = T[k]
for i in 0..k {
reservoir[i] = stream.next()
}
let mut i = k
while element = stream.next() {
j = random(0, i)
if j < k:
reservoir[j] = element
i += 1
}
}
See Also
References
- Reservoir Sampling - A good visual introduction
- Reservoir sampling - Wikipedia - Goes into more efficient implementation and weighted reservoir sampling