Exploring Hypergeometric Random Sampling: A Guide to random.RandomState.hypergeometric()
Concept and Purpose
- The hypergeometric() function draws samples from the hypergeometric distribution, a discrete probability distribution that models situations where you draw a fixed number of items (without replacement) from a population containing two types of objects: "good" and "bad."
- Each draw returns the number of "good" items that ended up in the sample. The function generates random counts; it does not compute probabilities itself.
Parameters
ngood (int or array-like of ints): The number of "good" items in the population (must be non-negative).
nbad (int or array-like of ints): The number of "bad" items in the population (must be non-negative).
nsample (int or array-like of ints): The number of items to draw in the sample (must be at least 1 and at most ngood + nbad).
size (int or tuple of ints, optional): The output shape. If provided, it determines the number of independent samples to draw (e.g., (m, n) generates m * n samples). If None (default), a single sample is returned when all input parameters are scalars. Otherwise, the number of samples drawn is the broadcast size of ngood, nbad, and nsample.
Returns
ndarray or scalar: An array containing the number of "good" items drawn in each sample (if size is provided or any parameter is array-like), or a single integer value (if size is None and all parameters are scalars).
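As a rough illustration of how size and array-valued parameters shape the result, the sketch below uses np.random.default_rng() with an arbitrary seed. The variable names (single, batch, per_population) are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability

# All parameters scalar and size=None: a single integer comes back
single = rng.hypergeometric(5, 10, 3)

# size=(2, 4): 2 * 4 = 8 independent samples arranged in a 2x4 array
batch = rng.hypergeometric(5, 10, 3, size=(2, 4))

# Array-valued parameters with size=None: one sample per broadcast element
per_population = rng.hypergeometric([5, 50], [10, 100], 3)

print(np.ndim(single), batch.shape, per_population.shape)
```

Every value stays between 0 and nsample, since a sample cannot contain more "good" items than were drawn.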
Random Sampling Context
- hypergeometric() is used for hypergeometric random sampling, which involves drawing items from a population without replacement and with two distinct categories.
- It helps simulate scenarios where the probability of drawing a "good" item changes as you take more samples because the pool of remaining "good" items shrinks.
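To make the shrinking-pool effect concrete, here is a minimal sketch that computes the chance that the next item drawn is "good", given what has already been removed. The helper prob_next_good() is hypothetical and exists only for this illustration.

```python
from fractions import Fraction

def prob_next_good(ngood, nbad, good_drawn, bad_drawn):
    """Chance that the next draw is "good" after some items were removed."""
    remaining_good = ngood - good_drawn
    remaining_total = ngood + nbad - good_drawn - bad_drawn
    return Fraction(remaining_good, remaining_total)

# Population of 5 good and 10 bad items; every draw so far was "good"
first = prob_next_good(5, 10, 0, 0)   # 5 good left out of 15
second = prob_next_good(5, 10, 1, 0)  # 4 good left out of 14
third = prob_next_good(5, 10, 2, 0)   # 3 good left out of 13

print(first, second, third)
```

With replacement, all three probabilities would stay at 1/3, which is the case the binomial distribution covers.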
Example
import numpy as np
# Population: 5 good items (red balls), 10 bad items (blue balls)
ngood = 5
nbad = 10
# Draw 3 items per sample, without replacement
nsample = 3
samples = np.random.default_rng().hypergeometric(ngood, nbad, nsample, size=10)
print(samples)
This code might output something like:
[1 3 2 1 0 2 2 1 2 1]
Each value in the output array represents the number of "good" items (red balls) drawn in a single sample. The results will vary due to randomness.
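One way to sanity-check output like this is to compare the average of many draws with the distribution's mean, nsample * ngood / (ngood + nbad). The sketch below assumes the same population as the example above; the seed and the number of draws are arbitrary choices.

```python
import numpy as np

ngood, nbad, nsample = 5, 10, 3
rng = np.random.default_rng(42)  # arbitrary seed

# Many independent samples, each counting the "good" items among 3 draws
samples = rng.hypergeometric(ngood, nbad, nsample, size=100_000)

expected_mean = nsample * ngood / (ngood + nbad)  # 3 * 5 / 15
empirical_mean = samples.mean()

print("Expected mean: ", expected_mean)
print("Empirical mean:", empirical_mean)
```

The two numbers should agree closely, though not exactly, because the empirical mean is itself random.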
- Use hypergeometric() when items are drawn without replacement, so that each draw changes the probability of selecting a given type.
- Consider alternative distributions (e.g., binomial) if the sampling process involves replacement.
- Knowing the properties of the hypergeometric distribution helps when interpreting the sampled counts.
Probability of Drawing Specific Number of "Good" Items
This code calculates the probability of drawing exactly 2 "good" items (red balls) in a sample of 4 from a population of 8 "good" and 12 "bad" items:
import numpy as np
from scipy.stats import hypergeom
ngood = 8
nbad = 12
nsample = 4
k = 2 # Number of "good" items we're interested in
# SciPy's hypergeom expects (k, M, n, N): M = population size, n = number of "good" items, N = number drawn
p = hypergeom.pmf(k, ngood + nbad, ngood, nsample)
print("Probability of drawing exactly 2 good items:", p)
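The same probability can be cross-checked without SciPy from the closed-form expression C(ngood, k) * C(nbad, nsample - k) / C(ngood + nbad, nsample). The helper name hypergeom_pmf() below is an assumption made for this sketch, not part of NumPy or SciPy.

```python
from math import comb

def hypergeom_pmf(k, ngood, nbad, nsample):
    """P(exactly k "good" items among nsample draws without replacement)."""
    # Outcomes that are impossible for this population get probability 0
    if k < 0 or k > ngood or nsample - k < 0 or nsample - k > nbad:
        return 0.0
    ways_good = comb(ngood, k)            # ways to pick the k good items
    ways_bad = comb(nbad, nsample - k)    # ways to pick the remaining bad items
    ways_total = comb(ngood + nbad, nsample)
    return ways_good * ways_bad / ways_total

p_exact = hypergeom_pmf(2, 8, 12, 4)
print("Probability of drawing exactly 2 good items:", p_exact)
```

For this population the result is 1848 / 4845, or about 0.381.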
Drawing Samples with Different Sample Sizes
This code draws one sample for each of two sample sizes (2 and 5) from a fixed population by passing a list as nsample:
import numpy as np
ngood = 10
nbad = 15
sample_sizes = [2, 5]
samples = np.random.default_rng().hypergeometric(ngood, nbad, sample_sizes)
print("Samples with size 2:", samples[0])
print("Samples with size 5:", samples[1])
Multiple Independent Samples
This code draws 5 independent samples, each with 3 items drawn from a population of 6 "good" and 9 "bad" items:
import numpy as np
ngood = 6
nbad = 9
nsample = 3
num_samples = 5
samples = np.random.default_rng().hypergeometric(ngood, nbad, nsample, size=num_samples)
print("5 independent samples:")
print(samples)
Comparison with Binomial Distribution (for Replacement)
This code demonstrates the difference between hypergeometric and binomial distributions. It draws 10 samples of size 2 from a population of 4 "good" and 6 "bad" items twice: once without replacement (hypergeometric) and once with replacement (binomial):
import numpy as np
from scipy.stats import binom
ngood = 4
nbad = 6
nsample = 2
num_samples = 10
# Hypergeometric (without replacement)
hyper_samples = np.random.default_rng().hypergeometric(ngood, nbad, nsample, size=num_samples)
# Binomial (with replacement)
p_success = ngood / (ngood + nbad) # Probability of drawing a "good" item
binom_samples = binom.rvs(nsample, p_success, size=num_samples)
print("Hypergeometric samples (without replacement):")
print(hyper_samples)
print("\nBinomial samples (with replacement):")
print(binom_samples)
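With only 10 draws the two outputs look similar. The difference shows up more clearly in the variances: the hypergeometric variance equals the binomial variance multiplied by the finite population correction (N - nsample) / (N - 1), where N = ngood + nbad. A short sketch with the same numbers as above:

```python
ngood, nbad, nsample = 4, 6, 2
population = ngood + nbad
p_success = ngood / population

# Binomial variance (sampling with replacement)
binom_var = nsample * p_success * (1 - p_success)

# Finite population correction for sampling without replacement
fpc = (population - nsample) / (population - 1)
hyper_var = binom_var * fpc

print("Binomial variance:      ", binom_var)
print("Hypergeometric variance:", hyper_var)
```

The correction factor is always at most 1, so sampling without replacement never has the larger spread.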
SciPy's hypergeom.rvs()
- This function from the scipy.stats module draws random samples from the hypergeometric distribution.
- It offers the same functionality as random.RandomState.hypergeometric(), but its arguments are ordered differently: hypergeom.rvs(M, n, N), where M is the total population size (ngood + nbad), n is the number of "good" items, and N is the number of items drawn.
Example
import numpy as np
from scipy.stats import hypergeom
ngood = 5
nbad = 10
nsample = 3
# SciPy expects (M, n, N) = (population size, number of "good" items, number drawn)
samples = hypergeom.rvs(ngood + nbad, ngood, nsample, size=10) # Draw 10 samples
print(samples)
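Because SciPy parameterizes the distribution as (M, n, N) rather than (ngood, nbad, nsample), a small conversion helper can reduce mix-ups. The helper scipy_hypergeom_args() below is hypothetical, and the seed is arbitrary; random_state is the standard SciPy argument for reproducible draws.

```python
from scipy.stats import hypergeom

def scipy_hypergeom_args(ngood, nbad, nsample):
    """Map NumPy-style (ngood, nbad, nsample) to SciPy's (M, n, N)."""
    return ngood + nbad, ngood, nsample

M, n, N = scipy_hypergeom_args(5, 10, 3)

# The same integer seed gives the same sequence of samples
first = hypergeom.rvs(M, n, N, size=10, random_state=123)
second = hypergeom.rvs(M, n, N, size=10, random_state=123)

print(first)
print((first == second).all())
```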
Custom Implementation (for Educational Purposes)
- If you want to understand the logic behind hypergeometric sampling, you can create your own implementation. However, this approach is generally not recommended for production due to potential errors and inefficiencies.
import random
def hypergeometric_sample(ngood, nbad, nsample):
    """
    Draws a single sample from the hypergeometric distribution.
    Args:
        ngood (int): Number of "good" items in the population.
        nbad (int): Number of "bad" items in the population.
        nsample (int): Number of items to draw in the sample.
    Returns:
        int: Number of "good" items drawn in the sample.
    """
    population = ["good"] * ngood + ["bad"] * nbad
    random.shuffle(population)  # Shuffle the population
    return sum(item == "good" for item in population[:nsample])  # Count "good" items
# Example usage
ngood = 8
nbad = 12
nsample = 4
sample = hypergeometric_sample(ngood, nbad, nsample)
print("Sample:", sample)
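A slightly shorter variant of the same idea, offered only as a sketch, lets random.sample() from the standard library do the drawing without replacement instead of shuffling the whole population. The function name hypergeometric_sample_builtin() is made up for this example.

```python
import random

def hypergeometric_sample_builtin(ngood, nbad, nsample):
    """Count "good" items among nsample items drawn without replacement."""
    # True marks a "good" item, False a "bad" one
    population = [True] * ngood + [False] * nbad
    # random.sample() never picks the same position twice
    drawn = random.sample(population, nsample)
    return sum(drawn)

sample = hypergeometric_sample_builtin(8, 12, 4)
print("Sample:", sample)
```

random.sample() raises ValueError when nsample exceeds the population size, which matches the constraint nsample <= ngood + nbad.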
Looping over random.choice() (for Simple Cases)
- For very simple cases, you could use random.choice() from the random module in a loop to simulate hypergeometric sampling without replacement. However, this approach can be inefficient and error-prone for larger populations or sample sizes.
Example
import random
def hypergeometric_sample_loop(ngood, nbad, nsample):
    """
    Simulates hypergeometric sampling using a loop (not efficient).
    Args:
        ngood (int): Number of "good" items in the population.
        nbad (int): Number of "bad" items in the population.
        nsample (int): Number of items to draw in the sample.
    Returns:
        int: Number of "good" items drawn in the sample.
    """
    population = ["good"] * ngood + ["bad"] * nbad
    sample = []
    for _ in range(nsample):
        item = random.choice(population)
        population.remove(item)
        sample.append(item)
    return sum(item == "good" for item in sample)
# Example usage (same as previous example)
ngood = 8
nbad = 12
nsample = 4
sample = hypergeometric_sample_loop(ngood, nbad, nsample)
print("Sample:", sample)
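Much of the cost in the loop above comes from building the population list and calling remove() on it. One possible alternative, sketched here under the assumption that only the counts matter, tracks how many "good" and "bad" items remain and draws each item with the current probability of being "good". The function name is hypothetical.

```python
import random

def hypergeometric_sample_counts(ngood, nbad, nsample):
    """Simulate draws without replacement by tracking remaining counts."""
    if not 0 <= nsample <= ngood + nbad:
        raise ValueError("nsample must be between 0 and ngood + nbad")
    good_left, bad_left, good_drawn = ngood, nbad, 0
    for _ in range(nsample):
        # Probability that the next item is "good", given what is left
        if random.random() < good_left / (good_left + bad_left):
            good_left -= 1
            good_drawn += 1
        else:
            bad_left -= 1
    return good_drawn

sample = hypergeometric_sample_counts(8, 12, 4)
print("Sample:", sample)
```

This keeps memory use constant regardless of population size, but it is still a teaching aid rather than a replacement for the library functions.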
- Avoid using random.choice() in a loop for large datasets, as it's less efficient and more error-prone.
- If you're new to hypergeometric sampling and want to understand the logic, a custom implementation can be helpful for learning purposes.
- For most practical applications, scipy.stats.hypergeom.rvs() is the recommended alternative due to its efficiency and reliability.