SVY_PRINT_WIDTH = 65 # Set the rich console withSample Selection
In survey sampling, the target population is the entire group of individuals or units that you want to draw conclusions about. Defining the target population is one of the first steps in designing a survey. Specifying the target population will require answering many important questions such as:
- What units are of interest (e.g., individuals, households, schools)
- Where are they located (e.g., a country, region, or sector)
- When should they be observed (e.g., during a specific year or survey cycle)
For practical purposes, a sampling frame is often used to select a sample for a study. A sampling frame is a list of all units in the target population from which a sample can be drawn.
Ideally, the sampling frame should fully represent the target population—including all units that belong to it and excluding any that do not. However, in practice, sampling frames are often incomplete, outdated, or may contain duplicates or ineligible units. For more, see Section 1.2 of Lohr (2021).
In this tutorial, we do not address how to construct a high-quality sampling frame or assess its completeness and accuracy. Instead, we focus on how to use svy to select samples for a survey.
This tutorial utilizes a two-stage cluster sampling design. The sampling process consists of two stages:
First Stage: We select Primary Sampling Units (PSUs), which are geographic clusters of households that cover the entire country without overlap. These PSUs are be selected using probability proportional to size (PPS) without replacement, meaning larger clusters have a higher chance of being included.
Second Stage: Within each selected PSU, we then select households, or Secondary Sampling Units (SSUs), using simple random sampling without replacement.
First Stage: Selection of PSUs
As noted earlier, we use the World Bank synthetic population for an imaginary country to illustrate the sample-selection workflow. We begin by loading the enumeration-area (EA) frame, the listing of all primary sampling units (PSUs) in the target population. The EA frame was derived from the population data, and we intentionally perturbed the number of EAs per cluster to mirror a common reality: frames built from an earlier census are often out of date by the time a survey is fielded.
import numpy as np
from svy import load_dataset
ea_frame = load_dataset(name="ea_frame_wb_2023", limit=None)
print(ea_frame.head(5))shape: (5, 5)
┌────────┬───────────┬────────┬───────┬───────────────┐
│ geo1 ┆ geo2 ┆ urbrur ┆ ea ┆ n_hlds_census │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i64 ┆ i64 │
╞════════╪═══════════╪════════╪═══════╪═══════════════╡
│ geo_01 ┆ geo_01_01 ┆ Urban ┆ 11001 ┆ 265 │
│ geo_01 ┆ geo_01_01 ┆ Urban ┆ 11002 ┆ 466 │
│ geo_01 ┆ geo_01_01 ┆ Urban ┆ 11003 ┆ 443 │
│ geo_01 ┆ geo_01_01 ┆ Urban ┆ 11004 ┆ 500 │
│ geo_01 ┆ geo_01_01 ┆ Urban ┆ 11005 ┆ 455 │
└────────┴───────────┴────────┴───────┴───────────────┘
import polars as pl
print(f"The number of clusters is {ea_frame.shape[0]}")
print(
f"The average cluster size is {round(ea_frame['n_hlds_census'].mean(), 1)}"
)The number of clusters is 5940
The average cluster size is 374.1
In the imaginary country, geo_01 contains only urban areas, whereas the other regions include both urban and rural areas. We stratify by admin-1 and residence type, sample clusters with probability proportional to size based on the number of households, and then interview 20 households per selected cluster.
from svy import Design, Sample
# Set a seed for reproducible random sample generation
seed = 12345
rng = np.random.default_rng(seed)
# Specify the sampling design
ea_design = Design(
mos="n_hlds_census",
stratum=["geo1", "urbrur"],
psu=["ea"],
)
# Create the sample object
# In this case we can view it as the frame (aal EAs prior to selection)
ea_frame = Sample(data=ea_frame, design=ea_design)
# Select the sample
ea_sample = ea_frame.sampling.pps_sys(n=20, drop_nulls=True, rstate=rng)
print(ea_sample)╭─────────────────────────── Sample ────────────────────────────╮ │ Survey Data: │ │ Number of rows: 379 │ │ Number of columns: 11 │ │ Number of strata: 19 │ │ Number of PSUs: 379 │ │ │ │ Survey Design: │ │ │ │ Field Value │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ Row index svy_row_index │ │ Stratum (geo1, urbrur) │ │ PSU (ea,) │ │ SSU None │ │ Weight svy_sample_weight │ │ With replacement False │ │ Prob svy_prob_selection │ │ Hit svy_number_of_hits │ │ MOS n_hlds_census │ │ Population size None │ │ Replicate weights None │ │ │ ╰───────────────────────────────────────────────────────────────╯
Variables start with “svy_” were automatically created by the library.
The sample data has 379 rows that is 379 clusters. Therefore, one large cluster was hit twice. Using the show_records() method, we can filter the data to the records that were hit twice.
eas_hit_twice = ea_sample.show_records(
where={"svy_number_of_hits": 2},
columns=["geo1", "ea", "svy_number_of_hits"],
)
print(eas_hit_twice)shape: (1, 3)
┌────────┬───────┬────────────────────┐
│ geo1 ┆ ea ┆ svy_number_of_hits │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞════════╪═══════╪════════════════════╡
│ geo_08 ┆ 84045 ┆ 2 │
└────────┴───────┴────────────────────┘
We can show a slice of the data by using the show_data() method.
print(
ea_sample.show_data(
columns=[
"ea",
"geo1",
"urbrur",
"n_hlds_census",
"svy_prob_selection",
"svy_sample_weight",
],
how="sample",
sort_by="geo1",
rstate=seed,
).sort(["geo1"])
)shape: (5, 6)
┌───────┬────────┬────────┬───────────────┬────────────────────┬───────────────────┐
│ ea ┆ geo1 ┆ urbrur ┆ n_hlds_census ┆ svy_prob_selection ┆ svy_sample_weight │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 ┆ f64 ┆ f64 │
╞═══════╪════════╪════════╪═══════════════╪════════════════════╪═══════════════════╡
│ 13145 ┆ geo_01 ┆ Urban ┆ 460 ┆ 0.030319 ┆ 32.982717 │
│ 26021 ┆ geo_02 ┆ Urban ┆ 306 ┆ 0.113319 ┆ 8.824673 │
│ 41170 ┆ geo_04 ┆ Rural ┆ 288 ┆ 0.057882 ┆ 17.276563 │
│ 63038 ┆ geo_06 ┆ Rural ┆ 308 ┆ 0.062772 ┆ 15.930682 │
│ 72003 ┆ geo_07 ┆ Urban ┆ 585 ┆ 0.140628 ┆ 7.11094 │
└───────┴────────┴────────┴───────────────┴────────────────────┴───────────────────┘
If the sample size isn’t the same in every stratum, we pass a Python dictionary that maps each stratum to its sample size. For example:
{
("geo_01", "urban"): 23,
("geo_02", "urban"): 20,
("geo_02", "rural"): 15,
...
}Ensure the dictionary keys exactly match the stratum labels in your data (spelling, case, and type).
For strata defined by multiple columns, use tuples in the same column order. Example: for strata defined by region and sex, we can provide dictionaries similar to {(“North”, “Female”): 20, (“North”, “Male”): 25, …}.
Second Stage: Selection of Households
To select the second stage sample, we use the household listing1 as the frame.
ea_selected = ea_sample.show_data(columns="ea", n=None).unique().to_series().to_list()
print(f"Number of clusters selected: {len(ea_selected)}")
hld_frame = load_dataset(
name="hld_pop_wb_2023",
where={"ea": ("in", tuple(ea_selected))},
limit=None,
force_local=True,
)
print(f"Number of listed households from all selected clusters: {hld_frame.shape[0]}")Number of clusters selected: 379
Number of listed households from all selected clusters: 174527
Let’s use simple random sampling (SRS) to select the household. If we had to select 25 households in every EA then, we would just run the following code hh_sample = Sample(data=hld_frame).sampling.srs(n=25, by="ea"). However, we want to select 50 from the EA with 2 hits and 25 every where else. To do so, we write a simple python code to create the sample size object as a dictionary.
# base: 25 per EA
hh_sample_size = {ea: 25 for ea in ea_selected}
# bump to 50 for EAs hit twice
for ea in eas_hit_twice["ea"].to_list():
hh_sample_size[ea] = 50
hh_sample = Sample(data=hld_frame).sampling.srs(n=hh_sample_size, by="ea")
print(hh_sample)╭─────────────────────────── Sample ────────────────────────────╮ │ Survey Data: │ │ Number of rows: 9500 │ │ Number of columns: 51 │ │ Number of strata: None │ │ Number of PSUs: None │ │ │ │ Survey Design: │ │ │ │ Field Value │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ Row index svy_row_index │ │ Stratum None │ │ PSU None │ │ SSU None │ │ Weight svy_sample_weight │ │ With replacement False │ │ Prob svy_prob_selection │ │ Hit svy_number_of_hits │ │ MOS None │ │ Population size None │ │ Replicate weights None │ │ │ ╰───────────────────────────────────────────────────────────────╯
Combine the two sampling stages
COMING SOON!!!
Next, we derive and adjust sample weights: Weighting
References
Footnotes
Household listing is the field operation that systematically identifies and records all dwelling units/households in a defined area to create an up-to-date sampling frame—usually capturing minimal info (location/ID, head’s name, household size, contact) for subsequent sample selection.↩︎