Exploring Hypergeometric Random Sampling: A Guide to random.RandomState.hypergeometric()


Concept and Purpose

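  • The hypergeometric() function draws samples from the hypergeometric distribution, a discrete probability distribution that models drawing a fixed number of items, without replacement, from a finite population containing two types of objects: "good" and "bad."
  • Each draw returns the number of "good" items that ended up in the sample. The function generates random outcomes; it does not compute probabilities (for exact probabilities, see the SciPy example further down).
  • The examples here call the method on np.random.default_rng(), the Generator interface that NumPy recommends for new code; the legacy random.RandomState.hypergeometric() takes the same arguments.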
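
To make the distribution concrete, the sketch below evaluates its probability mass function, P(X = k) = C(ngood, k) * C(nbad, nsample - k) / C(ngood + nbad, nsample), using only the standard library. The helper name hypergeom_pmf is an illustrative assumption, not part of NumPy.

```python
from math import comb

def hypergeom_pmf(k, ngood, nbad, nsample):
    """Probability of exactly k "good" items in a sample of nsample items.

    Hypothetical helper for illustration; not a NumPy function.
    """
    # Counts outside 0..nsample are impossible.
    if k < 0 or k > nsample:
        return 0.0
    # math.comb(n, r) returns 0 when r > n, so other impossible counts also give 0.
    return comb(ngood, k) * comb(nbad, nsample - k) / comb(ngood + nbad, nsample)

# Distribution of "good" counts for 5 good items, 10 bad items, and 3 items drawn.
pmf = [hypergeom_pmf(k, 5, 10, 3) for k in range(4)]
for k, p in enumerate(pmf):
    print(f"P(X = {k}) = {p:.4f}")
```

The four probabilities sum to 1, and a count of 1 is the most likely outcome for this population.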

Parameters

  • ngood (int or array-like of ints): The number of "good" items in the population (must be non-negative).
  • nbad (int or array-like of ints): The number of "bad" items in the population (must be non-negative).
  • nsample (int or array-like of ints): The number of items drawn in each sample (must be at least 1 and at most ngood + nbad).
  • size (int or tuple of ints, optional): The output shape. For example, (m, n) draws m * n independent samples. If None (default), a single value is returned when ngood, nbad, and nsample are all scalars; otherwise one sample is drawn per element of the broadcast of ngood, nbad, and nsample.
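
The effect of size and of array-like parameters on the output shape can be sketched as follows. The seed value is arbitrary and is there only to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, used only for repeatability

# All parameters scalar, size omitted: a single count.
single = rng.hypergeometric(5, 10, 3)

# size=(2, 4): a 2 x 4 array holding 8 independent counts.
grid = rng.hypergeometric(5, 10, 3, size=(2, 4))

# Array-like nsample, size omitted: one count per element, so shape (3,).
per_nsample = rng.hypergeometric(5, 10, [1, 2, 3])

print(np.ndim(single), grid.shape, per_nsample.shape)
```

Each entry of per_nsample is bounded by the matching nsample value, because a sample cannot contain more "good" items than it has items.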

Returns

  • ndarray or scalar: The number of "good" items drawn in each sample. An array is returned when size is given or any parameter is array-like; a single integer is returned when size is None and all parameters are scalars.

Random Sampling Context

  • hypergeometric() performs hypergeometric random sampling: drawing items without replacement from a population split into two distinct categories.
  • Because drawn items are not returned to the pool, the probability of drawing a "good" item changes from one draw to the next. Each "good" item drawn leaves fewer "good" items behind, which lowers the chance that the next draw is "good."
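
The shifting probability can be worked out by hand for a small pool. This sketch uses exact fractions; prob_next_good is a hypothetical helper introduced only for illustration.

```python
from fractions import Fraction

def prob_next_good(ngood_left, nbad_left):
    """Chance that the next item drawn is "good", given what remains in the pool."""
    return Fraction(ngood_left, ngood_left + nbad_left)

# Pool of 5 good and 10 bad items.
first = prob_next_good(5, 10)       # before any draw
after_good = prob_next_good(4, 10)  # one "good" item has been removed
after_bad = prob_next_good(5, 9)    # one "bad" item has been removed

print(first, after_good, after_bad)
```

Removing a "good" item lowers the chance from 1/3 to 2/7, while removing a "bad" item raises it to 5/14. With replacement, it would stay at 1/3 on every draw.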

Example

import numpy as np

# Population: 5 good items (red balls), 10 bad items (blue balls)
ngood = 5
nbad = 10

# Draw 10 samples, each consisting of 3 items taken without replacement
nsample = 3
samples = np.random.default_rng().hypergeometric(ngood, nbad, nsample, size=10)

print(samples)

This code might output something like:

[1 3 2 1 0 2 2 1 2 1]

Each value in the output array represents the number of "good" items (red balls) drawn in a single sample. The results will vary due to randomness.

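  • Use hypergeometric() when items are drawn without replacement from a finite population, so that each draw changes the probability of drawing a "good" item on the next.
  • Consider the binomial distribution instead when items are drawn with replacement, or when the population is so large relative to the sample that the change is negligible.
  • To interpret the sampled counts, compare them with the exact probabilities of the hypergeometric distribution.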
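
One way to connect sampled counts to probabilities is to compare simulated frequencies with the closed-form values. The sketch below assumes the same population as the example above (5 good, 10 bad, 3 drawn) and an arbitrary seed.

```python
from math import comb
import numpy as np

ngood, nbad, nsample = 5, 10, 3
rng = np.random.default_rng(seed=42)  # arbitrary seed

draws = rng.hypergeometric(ngood, nbad, nsample, size=100_000)

# Relative frequency of each possible count 0..nsample.
empirical = np.bincount(draws, minlength=nsample + 1) / draws.size

# Exact probabilities from the closed-form PMF.
exact = np.array([
    comb(ngood, k) * comb(nbad, nsample - k) / comb(ngood + nbad, nsample)
    for k in range(nsample + 1)
])

for k in range(nsample + 1):
    print(f"k={k}: simulated {empirical[k]:.4f}, exact {exact[k]:.4f}")
```

With this many draws, the simulated frequencies should agree with the exact values to about two decimal places.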


Probability of Drawing Specific Number of "Good" Items

This code calculates the probability of drawing exactly 2 "good" items (red balls) in a sample of 4 from a population of 8 "good" and 12 "bad" items:

from scipy.stats import hypergeom

ngood = 8
nbad = 12
nsample = 4
k = 2  # Number of "good" items we're interested in

# SciPy's argument order is (k, M, n, N): M = population size, n = number of "good" items, N = sample size
p = hypergeom.pmf(k, ngood + nbad, ngood, nsample)
print("Probability of drawing exactly 2 good items:", p)
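
The same figure can be cross-checked without SciPy, and extended to "at least k" questions by summing the counting formula. This is a standard-library sketch; prob_at_least is a hypothetical helper, not a SciPy or NumPy function.

```python
from math import comb

def prob_at_least(k_min, ngood, nbad, nsample):
    """P(X >= k_min) for a hypergeometric count X (illustrative helper)."""
    total = comb(ngood + nbad, nsample)  # number of equally likely samples
    k_max = min(ngood, nsample)          # largest possible "good" count
    favourable = sum(
        comb(ngood, k) * comb(nbad, nsample - k) for k in range(k_min, k_max + 1)
    )
    return favourable / total

# Population of 8 good and 12 bad items, 4 items drawn.
p_exact_2 = comb(8, 2) * comb(12, 2) / comb(20, 4)
p_at_least_2 = prob_at_least(2, 8, 12, 4)
print(f"P(X = 2)  = {p_exact_2:.4f}")
print(f"P(X >= 2) = {p_at_least_2:.4f}")
```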

Drawing Samples with Different Sample Sizes

This code draws one sample for each of two sample sizes (2 and 5) from a fixed population by passing a list as nsample. The result holds one count per sample size:

import numpy as np

ngood = 10
nbad = 15

sample_sizes = [2, 5]
samples = np.random.default_rng().hypergeometric(ngood, nbad, sample_sizes)

print("Good items in the sample of size 2:", samples[0])
print("Good items in the sample of size 5:", samples[1])
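
An array-like nsample can also be combined with size, as long as the shapes broadcast. The sketch below assumes a (1000, 2) output so that each column corresponds to one sample size, then compares the column means with the theoretical mean nsample * ngood / (ngood + nbad). The seed is arbitrary.

```python
import numpy as np

ngood, nbad = 10, 15
sample_sizes = np.array([2, 5])
rng = np.random.default_rng(seed=1)  # arbitrary seed

# Shape (1000, 2): column 0 uses nsample=2, column 1 uses nsample=5.
counts = rng.hypergeometric(ngood, nbad, sample_sizes, size=(1000, 2))

observed_means = counts.mean(axis=0)
expected_means = sample_sizes * ngood / (ngood + nbad)  # theoretical means
print("observed:", observed_means)
print("expected:", expected_means)
```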

Multiple Independent Samples

This code draws 5 independent samples, each with 3 items drawn from a population of 6 "good" and 9 "bad" items:

import numpy as np

ngood = 6
nbad = 9
nsample = 3
num_samples = 5

samples = np.random.default_rng().hypergeometric(ngood, nbad, nsample, size=num_samples)

print("5 independent samples:")
print(samples)
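
Because the draws are random, the output changes from run to run. When repeatable results are needed, for example in tests, a seed can be passed to default_rng(). The seed value below is arbitrary.

```python
import numpy as np

# Two generators created with the same (arbitrary) seed produce identical draws.
first = np.random.default_rng(seed=2024).hypergeometric(6, 9, 3, size=5)
second = np.random.default_rng(seed=2024).hypergeometric(6, 9, 3, size=5)

print(first)
print(np.array_equal(first, second))
```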

Comparison with Binomial Distribution (for Replacement)

This code contrasts the hypergeometric and binomial distributions. It draws 10 samples of 2 items from a population of 4 "good" and 6 "bad" items twice: once without replacement (hypergeometric) and once with replacement (binomial):

import numpy as np
from scipy.stats import binom

ngood = 4
nbad = 6
nsample = 2
num_samples = 10

# Hypergeometric (without replacement)
hyper_samples = np.random.default_rng().hypergeometric(ngood, nbad, nsample, size=num_samples)

# Binomial (with replacement)
p_success = ngood / (ngood + nbad)  # Probability of drawing a "good" item
binom_samples = binom.rvs(nsample, p_success, size=num_samples)

print("Hypergeometric samples (without replacement):")
print(hyper_samples)
print("\nBinomial samples (with replacement):")
print(binom_samples)
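
Ten draws are too few to see the difference reliably, so it can help to compare the theoretical moments. Both distributions share the same mean, but sampling without replacement shrinks the variance by the finite population correction (N - n) / (N - 1). A sketch with exact fractions, using the same population as above:

```python
from fractions import Fraction

ngood, nbad, nsample = 4, 6, 2
population = ngood + nbad
p = Fraction(ngood, population)  # chance of "good" on the first draw

mean = nsample * p                 # identical for both distributions
binom_var = nsample * p * (1 - p)  # with replacement
# Without replacement, the variance shrinks by the finite population correction.
hyper_var = binom_var * Fraction(population - nsample, population - 1)

print("mean:", mean)
print("binomial variance:", binom_var)
print("hypergeometric variance:", hyper_var)
```

The hypergeometric variance comes out smaller, and the gap widens as the sample takes up a larger share of the population.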


SciPy's hypergeom.rvs()

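  • hypergeom.rvs() from the scipy.stats module draws random samples from the hypergeometric distribution.
  • It produces the same kind of output as random.RandomState.hypergeometric(), but with a different parameterization: SciPy takes (M, n, N), where M is the total population size (ngood + nbad), n is the number of "good" items, and N is the sample size. The hypergeom object also provides pmf(), cdf(), and summary statistics, which makes it convenient for statistical analysis.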

Example

from scipy.stats import hypergeom

ngood = 5
nbad = 10
nsample = 3

# SciPy's argument order is (M, n, N): population size, number of "good" items, sample size
samples = hypergeom.rvs(ngood + nbad, ngood, nsample, size=10)  # Draw 10 samples

print(samples)

Custom Implementation (for Educational Purposes)

  • If you want to understand the logic behind hypergeometric sampling, you can write your own implementation. This approach is generally not recommended for production: it builds the whole population in memory and is much slower than a library sampler.

import random

def hypergeometric_sample(ngood, nbad, nsample):
  """
  Draws a single sample from the hypergeometric distribution.

  Args:
      ngood (int): Number of "good" items in the population.
      nbad (int): Number of "bad" items in the population.
      nsample (int): Number of items to draw in the sample.

  Returns:
      int: Number of "good" items drawn in the sample.
  """
  population = ["good"] * ngood + ["bad"] * nbad
  random.shuffle(population)  # Shuffle the population
  return sum(item == "good" for item in population[:nsample])  # Count "good" items

# Example usage
ngood = 8
nbad = 12
nsample = 4

sample = hypergeometric_sample(ngood, nbad, nsample)
print("Sample:", sample)
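
A quick sanity check for a hand-written sampler is to average many draws and compare with the theoretical mean nsample * ngood / (ngood + nbad). The sketch below uses a hypothetical variant built on random.sample(), which picks without replacement and leaves the list unchanged. The function name and the seed are assumptions for illustration.

```python
import random

def hypergeometric_sample_no_shuffle(ngood, nbad, nsample, rng=random):
    """Count "good" items among nsample items picked without replacement.

    Hypothetical variant for illustration: random.sample() picks distinct
    positions and leaves the population list unchanged.
    """
    population = [True] * ngood + [False] * nbad  # True marks a "good" item
    return sum(rng.sample(population, nsample))

rng = random.Random(0)  # arbitrary seed for repeatable output
draws = [hypergeometric_sample_no_shuffle(8, 12, 4, rng) for _ in range(20_000)]
average = sum(draws) / len(draws)
expected = 4 * 8 / (8 + 12)  # nsample * ngood / (ngood + nbad)
print(f"average {average:.3f}, expected {expected:.3f}")
```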

Looping over random.choice() (for Simple Cases)

  • For very simple cases, you could call random.choice() from the random module in a loop and remove each chosen item, which simulates sampling without replacement. This approach becomes inefficient for larger populations or sample sizes, because every list.remove() call scans the list.

Example

import random

def hypergeometric_sample_loop(ngood, nbad, nsample):
  """
  Simulates hypergeometric sampling using a loop (not efficient).

  Args:
      ngood (int): Number of "good" items in the population.
      nbad (int): Number of "bad" items in the population.
      nsample (int): Number of items to draw in the sample.

  Returns:
      int: Number of "good" items drawn in the sample.
  """
  population = ["good"] * ngood + ["bad"] * nbad
  sample = []
  for _ in range(nsample):
    item = random.choice(population)
    population.remove(item)
    sample.append(item)
  return sum(item == "good" for item in sample)

# Example usage (same as previous example)

ngood = 8
nbad = 12
nsample = 4

sample = hypergeometric_sample_loop(ngood, nbad, nsample)
print("Sample:", sample)
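
The list of items is not strictly needed: tracking how many "good" and "bad" items remain is enough to simulate the draws one by one. The following is a minimal sketch of that idea, not how NumPy or SciPy implement their samplers. The function name and seed are illustrative assumptions.

```python
import random

def hypergeometric_sample_counts(ngood, nbad, nsample, rng=random):
    """Simulate the draws one by one using only the remaining counts.

    Illustrative sketch: no population list is built, so memory use does not
    depend on the population size.
    """
    if not 0 <= nsample <= ngood + nbad:
        raise ValueError("nsample must be between 0 and ngood + nbad")
    good_left, bad_left, good_drawn = ngood, nbad, 0
    for _ in range(nsample):
        # The next item is "good" with probability good_left / (items left).
        if rng.random() * (good_left + bad_left) < good_left:
            good_left -= 1
            good_drawn += 1
        else:
            bad_left -= 1
    return good_drawn

rng = random.Random(7)  # arbitrary seed
print("Sample:", hypergeometric_sample_counts(8, 12, 4, rng))
```

Each draw costs constant time, so the loop runs in time proportional to nsample regardless of the population size.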

  • Avoid random.choice() in a loop for large populations: each list.remove() call scans the list, so the cost grows with both the population and the sample size.
  • If you are new to hypergeometric sampling, a custom implementation is a useful way to learn the logic.
  • For most practical applications, scipy.stats.hypergeom.rvs() is the recommended alternative because it is efficient and reliable; keep its (M, n, N) argument order in mind.