Sample Selection

Two-stage cluster sampling with PPS and simple random sampling

Learn how to implement two-stage cluster sampling using Python and the svy library. Step-by-step tutorial covering PSU selection with PPS sampling and household selection with SRS.

Keywords

two-stage cluster sampling, survey sampling Python, probability proportional to size, PPS sampling, cluster sampling tutorial, svy library, stratified sampling, simple random sampling, sampling frame

Two-stage cluster sampling is one of the most widely used survey sampling designs in practice. This tutorial walks you through implementing a complete two-stage cluster sampling workflow in Python using the svy library, covering everything from defining your target population to selecting Primary Sampling Units (PSUs) and households.

Understanding Target Populations and Sampling Frames

In survey sampling, the target population is the entire group of individuals or units about which you want to draw conclusions. Defining the target population is one of the first steps in designing any survey. You’ll need to answer several key questions:

What units are of interest? (e.g., individuals, households, schools, businesses)
Where are they located? (e.g., a country, region, administrative district, or economic sector)
When should they be observed? (e.g., during a specific year, quarter, or survey cycle)

What Is a Sampling Frame?

A sampling frame is a list of all units in the target population from which a sample can be drawn. For practical purposes, researchers use sampling frames to operationalize the target population.

Ideally, the sampling frame should fully represent the target population—including all units that belong to it and excluding any that do not. However, in practice, sampling frames are often:

Incomplete (missing some population members)
Outdated (based on older census data)
Containing duplicates or ineligible units

For a comprehensive treatment of sampling frame construction and assessment, see Section 1.2 of Lohr (2021), Sampling: Design and Analysis.

Note: This tutorial focuses on how to use the svy library to select samples, rather than on constructing or validating sampling frames.

Two-Stage Cluster Sampling Design Overview

This tutorial demonstrates a two-stage cluster sampling design, which is commonly used in household surveys, health surveys, and demographic research.

Stage 1: Selecting Primary Sampling Units (PSUs)

In the first stage, we select Primary Sampling Units (PSUs)—geographic clusters of households that cover the entire country without overlap. These PSUs are selected using probability proportional to size (PPS) sampling without replacement, meaning larger clusters have a higher probability of being included in the sample.

Stage 2: Selecting Secondary Sampling Units (SSUs)

In the second stage, within each selected PSU, we select Secondary Sampling Units (SSUs)—typically households—using simple random sampling (SRS) without replacement.

Setting Up Your Python Environment

If needed, configure the output display width for the svy library:

SVY_PRINT_WIDTH = 65  # Set the rich console width

Import the required libraries:

import numpy as np
import svy

Stage 1: Selecting Primary Sampling Units (PSUs)

Loading the Enumeration Area Frame

We begin by loading the enumeration area (EA) frame—a listing of all primary sampling units in the target population. This example uses the World Bank synthetic population dataset for an imaginary country.

Important: We intentionally perturbed the number of EAs per cluster to mirror a common real-world challenge: sampling frames built from an earlier census are often outdated by the time a survey is fielded.

ea_frame = svy.load_dataset(name="ea_frame_wb_2023", limit=None)
print(ea_frame.head(5))

shape: (5, 5)
┌────────┬───────────┬────────┬───────┬───────────────┐
│ geo1   ┆ geo2      ┆ urbrur ┆ ea    ┆ n_hlds_census │
│ ---    ┆ ---       ┆ ---    ┆ ---   ┆ ---           │
│ str    ┆ str       ┆ str    ┆ i64   ┆ i64           │
╞════════╪═══════════╪════════╪═══════╪═══════════════╡
│ geo_01 ┆ geo_01_01 ┆ Urban  ┆ 11001 ┆ 265           │
│ geo_01 ┆ geo_01_01 ┆ Urban  ┆ 11002 ┆ 466           │
│ geo_01 ┆ geo_01_01 ┆ Urban  ┆ 11003 ┆ 443           │
│ geo_01 ┆ geo_01_01 ┆ Urban  ┆ 11004 ┆ 500           │
│ geo_01 ┆ geo_01_01 ┆ Urban  ┆ 11005 ┆ 455           │
└────────┴───────────┴────────┴───────┴───────────────┘

Let’s examine the frame’s basic characteristics:

import polars as pl

print(f"The number of clusters is {ea_frame.shape[0]}")
print(f"The average cluster size is {round(ea_frame['n_hlds_census'].mean(), 1)}")

The number of clusters is 5940
The average cluster size is 374.1

Defining the Stratification Strategy

We stratify by administrative region (geo1) and residence type (urbrur), sample clusters with probability proportional to size based on the number of households, and plan to interview 20 households per selected cluster.

In this imaginary country:

The first region (geo_01) contains only urban areas
Other regions include both urban and rural areas

Implementing PPS Systematic Sampling

# Set a seed for reproducible random sample generation
seed = 12345
rng = np.random.default_rng(seed)

# Specify the sampling design
ea_design = svy.Design(
    mos="n_hlds_census",      # Measure of size variable
    stratum=["geo1", "urbrur"], # Stratification variables
    psu=["ea"],                # Primary sampling unit identifier
)

# Create the sample object (viewing it as the frame with all EAs prior to selection)
ea_frame = svy.Sample(data=ea_frame, design=ea_design)

# Select the sample using PPS systematic sampling
ea_sample = ea_frame.sampling.pps_sys(n=20, drop_nulls=True, rstate=rng)

print(ea_sample)

╭─────────────────────────── Sample ────────────────────────────╮
│ Survey Data:                                                  │
│   Number of rows: 380                                         │
│   Number of columns: 11                                       │
│   Number of strata: 19                                        │
│   Number of PSUs: 380                                         │
│                                                               │
│ Survey Design:                                                │
│                                                               │
│    Field               Value                                  │
│    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                     │
│    Row index           svy_row_index                          │
│    Stratum             (geo1, urbrur)                         │
│    PSU                 (ea,)                                  │
│    SSU                 None                                   │
│    Weight              svy_sample_weight                      │
│    With replacement    False                                  │
│    Prob                svy_prob_selection                     │
│    Hit                 svy_number_of_hits                     │
│    MOS                 n_hlds_census                          │
│    Population size     None                                   │
│    Replicate weights   None                                   │
│                                                               │
╰───────────────────────────────────────────────────────────────╯

Note: Variables starting with svy_ are automatically created by the library to track selection probabilities and weights.

Handling Clusters Selected Multiple Times

The sample data has 379 rows (379 clusters). One large cluster was selected twice (“hit twice”). We can identify it using the show_records() method:

eas_hit_twice = ea_sample.show_records(
    where={"svy_number_of_hits": 2},
    columns=["geo1", "ea", "svy_number_of_hits"],
)

print(eas_hit_twice)

shape: (0, 3)
┌──────┬─────┬────────────────────┐
│ geo1 ┆ ea  ┆ svy_number_of_hits │
│ ---  ┆ --- ┆ ---                │
│ str  ┆ i64 ┆ i64                │
╞══════╪═════╪════════════════════╡
└──────┴─────┴────────────────────┘

Viewing Sample Data with Selection Probabilities

Use the show_data() method to examine a slice of the selected sample, including selection probabilities and sample weights:

print(
    ea_sample.show_data(
        columns=[
            "ea",
            "geo1",
            "urbrur",
            "n_hlds_census",
            "svy_prob_selection",
            "svy_sample_weight",
        ],
        how="sample",
        sort_by="geo1",
        rstate=seed,
    ).sort(["geo1"])
)

shape: (5, 6)
┌───────┬────────┬────────┬───────────────┬────────────────────┬───────────────────┐
│ ea    ┆ geo1   ┆ urbrur ┆ n_hlds_census ┆ svy_prob_selection ┆ svy_sample_weight │
│ ---   ┆ ---    ┆ ---    ┆ ---           ┆ ---                ┆ ---               │
│ i64   ┆ str    ┆ str    ┆ i64           ┆ f64                ┆ f64               │
╞═══════╪════════╪════════╪═══════════════╪════════════════════╪═══════════════════╡
│ 26024 ┆ geo_02 ┆ Rural  ┆ 290           ┆ 0.035237           ┆ 28.379138         │
│ 32111 ┆ geo_03 ┆ Urban  ┆ 400           ┆ 0.04896            ┆ 20.424875         │
│ 31096 ┆ geo_03 ┆ Rural  ┆ 201           ┆ 0.04702            ┆ 21.267413         │
│ 41003 ┆ geo_04 ┆ Urban  ┆ 378           ┆ 0.032769           ┆ 30.516534         │
│ 63088 ┆ geo_06 ┆ Urban  ┆ 369           ┆ 0.105028           ┆ 9.521274          │
└───────┴────────┴────────┴───────────────┴────────────────────┴───────────────────┘

Specifying Different Sample Sizes by Stratum

If you need different sample sizes for each stratum, pass a Python dictionary mapping each stratum to its target sample size:

{
    ("geo_01", "urban"): 23,
    ("geo_02", "urban"): 20,
    ("geo_02", "rural"): 15,
    # ... additional strata
}

Tips for Stratum-Specific Sample Sizes:

Ensure dictionary keys exactly match the stratum labels in your data (spelling, case, and type)

For strata defined by multiple columns, use tuples in the same column order as specified in the design

Example: for strata defined by region and sex, use {("North", "Female"): 20, ("North", "Male"): 25, ...} ?/ ##;]| bN kls,cd[]sxac # dkljfvnj
# cm ¸, Xx z # X ds # sx Stage 2: Selecting Households Using Simple Random Sampling

Loading the Household Listing Frame

To select the second-stage sample, we use the household listing as the sampling frame. A household listing is the field operation that systematically identifies and records all dwelling units/households in a defined area to create an up-to-date sampling frame—typically capturing minimal information such as location/ID, head’s name, household size, and contact details.

ea_selected = ea_sample.show_data(columns="ea", n=None).unique().to_series().to_list()

print(f"Number of clusters selected: {len(ea_selected)}")

hld_frame = svy.load_dataset(
    name="hld_pop_wb_2023",
    where={"ea": ("in", tuple(ea_selected))},
    limit=None,
    force_local=True,
)

print(f"Number of listed households from all selected clusters: {hld_frame.shape[0]}")

Number of clusters selected: 380
Number of listed households from all selected clusters: 173511

Implementing Simple Random Sampling for Households

If we wanted to select the same number of households (e.g., 25) in every EA, we could simply run:

hh_sample = Sample(data=hld_frame).sampling.srs(n=25, by="ea")

However, we need to select 50 households from the EA that was hit twice and 25 households everywhere else. Here’s how to create a variable sample size dictionary:

# Base: 25 households per EA
hh_sample_size = {ea: 25 for ea in ea_selected}

# Increase to 50 for EAs hit twice
for ea in eas_hit_twice["ea"].to_list():
    hh_sample_size[ea] = 50

# Select the household sample
hh_sample = svy.Sample(data=hld_frame).sampling.srs(n=hh_sample_size, by="ea")

print(hh_sample)

╭─────────────────────────── Sample ────────────────────────────╮
│ Survey Data:                                                  │
│   Number of rows: 9500                                        │
│   Number of columns: 51                                       │
│   Number of strata: None                                      │
│   Number of PSUs: None                                        │
│                                                               │
│ Survey Design:                                                │
│                                                               │
│    Field               Value                                  │
│    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                     │
│    Row index           svy_row_index                          │
│    Stratum             None                                   │
│    PSU                 None                                   │
│    SSU                 None                                   │
│    Weight              svy_sample_weight                      │
│    With replacement    False                                  │
│    Prob                svy_prob_selection                     │
│    Hit                 svy_number_of_hits                     │
│    MOS                 None                                   │
│    Population size     None                                   │
│    Replicate weights   None                                   │
│                                                               │
╰───────────────────────────────────────────────────────────────╯

Combining the Two Sampling Stages

Coming Soon: The next section will cover how to merge the PSU and household selection stages to create final survey weights.

Next Steps: Survey Weighting

After selecting your sample, the next step is to derive and adjust sample weights. Continue to the Weighting Tutorial to learn about:

Base weight calculation
Nonresponse adjustments
Post-stratification and calibration

Next Steps

After selecting your sample, the next step is to derive and adjust sample weights—including base weight calculation, nonresponse adjustments, and post-stratification.

Ready to continue?
Learn how to calculate and adjust weights in Survey Weighting →

References

Lohr, S. L. (2021). Sampling: Design and Analysis (3rd ed.). CRC Press.