Sample Selection

In survey sampling, the target population is the entire group of individuals or units that you want to draw conclusions about. Defining the target population is one of the first steps in designing a survey. Specifying the target population will require answering many important questions such as:

For practical purposes, a sampling frame is often used to select a sample for a study. A sampling frame is a list of all units in the target population from which a sample can be drawn.

Ideally, the sampling frame should fully represent the target population—including all units that belong to it and excluding any that do not. However, in practice, sampling frames are often incomplete, outdated, or may contain duplicates or ineligible units. For more, see Section 1.2 of Lohr (2021).

In this tutorial, we do not address how to construct a high-quality sampling frame or assess its completeness and accuracy. Instead, we focus on how to use svy to select samples for a survey.

This tutorial utilizes a two-stage cluster sampling design. The sampling process consists of two stages:

First Stage: We select Primary Sampling Units (PSUs), which are geographic clusters of households that cover the entire country without overlap. These PSUs are be selected using probability proportional to size (PPS) without replacement, meaning larger clusters have a higher chance of being included.

Second Stage: Within each selected PSU, we then select households, or Secondary Sampling Units (SSUs), using simple random sampling without replacement.

First Stage: Selection of PSUs

As noted earlier, we use the World Bank synthetic population for an imaginary country to illustrate the sample-selection workflow. We begin by loading the enumeration-area (EA) frame, the listing of all primary sampling units (PSUs) in the target population. The EA frame was derived from the population data, and we intentionally perturbed the number of EAs per cluster to mirror a common reality: frames built from an earlier census are often out of date by the time a survey is fielded.

SVY_PRINT_WIDTH = 65  # Set the rich console with
import numpy as np

from svy import load_dataset

ea_frame = load_dataset(name="ea_frame_wb_2023", limit=None)

print(ea_frame.head(5))
shape: (5, 5)
┌────────┬───────────┬────────┬───────┬───────────────┐
│ geo1   ┆ geo2      ┆ urbrur ┆ ea    ┆ n_hlds_census │
│ ---    ┆ ---       ┆ ---    ┆ ---   ┆ ---           │
│ str    ┆ str       ┆ str    ┆ i64   ┆ i64           │
╞════════╪═══════════╪════════╪═══════╪═══════════════╡
│ geo_01 ┆ geo_01_01 ┆ Urban  ┆ 11001 ┆ 265           │
│ geo_01 ┆ geo_01_01 ┆ Urban  ┆ 11002 ┆ 466           │
│ geo_01 ┆ geo_01_01 ┆ Urban  ┆ 11003 ┆ 443           │
│ geo_01 ┆ geo_01_01 ┆ Urban  ┆ 11004 ┆ 500           │
│ geo_01 ┆ geo_01_01 ┆ Urban  ┆ 11005 ┆ 455           │
└────────┴───────────┴────────┴───────┴───────────────┘
import polars as pl

print(f"The number of clusters is {ea_frame.shape[0]}")

print(
    f"The average cluster size is {round(ea_frame['n_hlds_census'].mean(), 1)}"
)
The number of clusters is 5940
The average cluster size is 374.1

In the imaginary country, geo_01 contains only urban areas, whereas the other regions include both urban and rural areas. We stratify by admin-1 and residence type, sample clusters with probability proportional to size based on the number of households, and then interview 20 households per selected cluster.

from svy import Design, Sample

# Set a seed for reproducible random sample generation
seed = 12345
rng = np.random.default_rng(seed)

# Specify the sampling design
ea_design = Design(
    mos="n_hlds_census",
    stratum=["geo1", "urbrur"],
    psu=["ea"],
)

# Create the sample object
# In this case we can view it as the frame (aal EAs prior to selection)
ea_frame = Sample(data=ea_frame, design=ea_design)

# Select the sample
ea_sample = ea_frame.sampling.pps_sys(n=20, drop_nulls=True, rstate=rng)

print(ea_sample)
╭─────────────────────────── Sample ────────────────────────────╮
 Survey Data:                                                  
   Number of rows: 379                                         
   Number of columns: 11                                       
   Number of strata: 19                                        
   Number of PSUs: 379                                         
                                                               
 Survey Design:                                                
                                                               
    Field               Value                                  
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                     
    Row index           svy_row_index                          
    Stratum             (geo1, urbrur)                         
    PSU                 (ea,)                                  
    SSU                 None                                   
    Weight              svy_sample_weight                      
    With replacement    False                                  
    Prob                svy_prob_selection                     
    Hit                 svy_number_of_hits                     
    MOS                 n_hlds_census                          
    Population size     None                                   
    Replicate weights   None                                   
                                                               
╰───────────────────────────────────────────────────────────────╯

Variables start with “svy_” were automatically created by the library.

The sample data has 379 rows that is 379 clusters. Therefore, one large cluster was hit twice. Using the show_records() method, we can filter the data to the records that were hit twice.

eas_hit_twice = ea_sample.show_records(
    where={"svy_number_of_hits": 2},
    columns=["geo1", "ea", "svy_number_of_hits"],
)

print(eas_hit_twice)
shape: (1, 3)
┌────────┬───────┬────────────────────┐
│ geo1   ┆ ea    ┆ svy_number_of_hits │
│ ---    ┆ ---   ┆ ---                │
│ str    ┆ i64   ┆ i64                │
╞════════╪═══════╪════════════════════╡
│ geo_08 ┆ 84045 ┆ 2                  │
└────────┴───────┴────────────────────┘

We can show a slice of the data by using the show_data() method.

print(
    ea_sample.show_data(
        columns=[
            "ea",
            "geo1",
            "urbrur",
            "n_hlds_census",
            "svy_prob_selection",
            "svy_sample_weight",
        ],
        how="sample",
        sort_by="geo1",
        rstate=seed,
    ).sort(["geo1"])
)
shape: (5, 6)
┌───────┬────────┬────────┬───────────────┬────────────────────┬───────────────────┐
│ ea    ┆ geo1   ┆ urbrur ┆ n_hlds_census ┆ svy_prob_selection ┆ svy_sample_weight │
│ ---   ┆ ---    ┆ ---    ┆ ---           ┆ ---                ┆ ---               │
│ i64   ┆ str    ┆ str    ┆ i64           ┆ f64                ┆ f64               │
╞═══════╪════════╪════════╪═══════════════╪════════════════════╪═══════════════════╡
│ 13145 ┆ geo_01 ┆ Urban  ┆ 460           ┆ 0.030319           ┆ 32.982717         │
│ 26021 ┆ geo_02 ┆ Urban  ┆ 306           ┆ 0.113319           ┆ 8.824673          │
│ 41170 ┆ geo_04 ┆ Rural  ┆ 288           ┆ 0.057882           ┆ 17.276563         │
│ 63038 ┆ geo_06 ┆ Rural  ┆ 308           ┆ 0.062772           ┆ 15.930682         │
│ 72003 ┆ geo_07 ┆ Urban  ┆ 585           ┆ 0.140628           ┆ 7.11094           │
└───────┴────────┴────────┴───────────────┴────────────────────┴───────────────────┘

If the sample size isn’t the same in every stratum, we pass a Python dictionary that maps each stratum to its sample size. For example:

{
    ("geo_01", "urban"): 23,
    ("geo_02", "urban"): 20,
    ("geo_02", "rural"): 15,
    ...
}
  • Ensure the dictionary keys exactly match the stratum labels in your data (spelling, case, and type).

  • For strata defined by multiple columns, use tuples in the same column order. Example: for strata defined by region and sex, we can provide dictionaries similar to {(“North”, “Female”): 20, (“North”, “Male”): 25, …}.

Second Stage: Selection of Households

To select the second stage sample, we use the household listing1 as the frame.

ea_selected = ea_sample.show_data(columns="ea", n=None).unique().to_series().to_list()

print(f"Number of clusters selected: {len(ea_selected)}")

hld_frame = load_dataset(
    name="hld_pop_wb_2023",
    where={"ea": ("in", tuple(ea_selected))},
    limit=None,
    force_local=True,
)

print(f"Number of listed households from all selected clusters: {hld_frame.shape[0]}")
Number of clusters selected: 379
Number of listed households from all selected clusters: 174527

Let’s use simple random sampling (SRS) to select the household. If we had to select 25 households in every EA then, we would just run the following code hh_sample = Sample(data=hld_frame).sampling.srs(n=25, by="ea"). However, we want to select 50 from the EA with 2 hits and 25 every where else. To do so, we write a simple python code to create the sample size object as a dictionary.

# base: 25 per EA
hh_sample_size = {ea: 25 for ea in ea_selected}

# bump to 50 for EAs hit twice
for ea in eas_hit_twice["ea"].to_list():
    hh_sample_size[ea] = 50

hh_sample = Sample(data=hld_frame).sampling.srs(n=hh_sample_size, by="ea")

print(hh_sample)
╭─────────────────────────── Sample ────────────────────────────╮
 Survey Data:                                                  
   Number of rows: 9500                                        
   Number of columns: 51                                       
   Number of strata: None                                      
   Number of PSUs: None                                        
                                                               
 Survey Design:                                                
                                                               
    Field               Value                                  
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                     
    Row index           svy_row_index                          
    Stratum             None                                   
    PSU                 None                                   
    SSU                 None                                   
    Weight              svy_sample_weight                      
    With replacement    False                                  
    Prob                svy_prob_selection                     
    Hit                 svy_number_of_hits                     
    MOS                 None                                   
    Population size     None                                   
    Replicate weights   None                                   
                                                               
╰───────────────────────────────────────────────────────────────╯

Combine the two sampling stages

COMING SOON!!!

Next, we derive and adjust sample weights: Weighting

References

Lohr, Sharon L. 2021. Sampling: Design and Analysis, Third Edition. Chapman; Hall/CRC. https://doi.org/10.1201/9780429298899.

Footnotes

  1. Household listing is the field operation that systematically identifies and records all dwelling units/households in a defined area to create an up-to-date sampling frame—usually capturing minimal info (location/ID, head’s name, household size, contact) for subsequent sample selection.↩︎