Survey Planning

Planning[1] involves several steps, including developing a protocol with clearly defined primary objectives, the target population, and sample size requirements. In this tutorial, we use the World Bank synthetic population as our target, then state the primary objectives and compute the minimum required sample sizes.

World Bank Synthetic Data

In this first tutorial, we introduce the dataset used throughout the series and outline the study we will simulate end-to-end. Our goal is to build practical intuition for using svy, from defining a target population and choosing a design to estimating parameters and quantifying uncertainty, via a clear, step-by-step workflow.

In these tutorials, we use the World Bank synthetic census (World Bank 2023a) and sample (World Bank 2023b) datasets.

The census dataset represents a hypothetical middle-income country with over 10 million individuals in 2.5 million households. It is organized into two files:

  • Household-level file: variables measured at the household level.
  • Individual-level file: variables measured for each household member.

The dataset includes variables typically found in population censuses: demography, education, occupation, dwelling characteristics, fertility, mortality, and migration. It also includes additional measures often collected in household surveys, such as household expenditure, child anthropometrics, and asset ownership. Only ordinary households are included (community households are excluded). We will use this dataset as the current state of the population of the imaginary country.

The sample dataset, drawn from the census, consists of 8,000 households and over 32,000 individuals. Like the census, it is provided in two files (household-level and individual-level) and contains the same range of variables.

Objective Planning

As researchers, our goal is to estimate expenditure levels and the poverty rate (Poverty Headcount Ratio) for the imaginary country. Specifically, we focus on two main objectives:

  1. Average household expenditure: Produce disaggregated estimates by admin 1 level (first-level administrative units).

  2. Poverty rate comparisons: Compare urban vs. rural poverty rates within each admin 1 unit.

In this study, we define the poverty rate as the proportion of individuals with per capita expenditure below the poverty line, where the poverty line is defined as 60% of the national median per capita expenditure.
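The definition above can be illustrated with a minimal, unweighted sketch. The expenditure values are hypothetical and chosen only for illustration; a real survey estimate would also apply sampling weights, which this sketch deliberately leaves out.

```python
from statistics import median

# Hypothetical per capita expenditures for ten individuals (illustrative only).
per_capita_exp = [1200, 2500, 3100, 4000, 5200, 6100, 7500, 9000, 12000, 20000]

# Poverty line: 60% of the median per capita expenditure.
poverty_line = 0.6 * median(per_capita_exp)

# Poverty headcount ratio: share of individuals strictly below the line.
n_poor = sum(x < poverty_line for x in per_capita_exp)
poverty_rate = n_poor / len(per_capita_exp)

print(f"Poverty line: {poverty_line:.0f}")
print(f"Poverty rate: {poverty_rate:.0%}")
```

Here the median is 5,650, so the line falls at 3,390 and three of the ten individuals are counted as poor.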

Sample Size Calculation

After defining the study objectives, we will calculate the minimum required sample size. Because we have multiple primary objectives, each can lead to a different sample size. There are two common approaches:

  • Choose one objective to drive the calculation, or
  • Compute a sample size for each objective and take the largest (conservative).

We will use the second approach: compute the required sample size for each objective and then adopt the maximum as the study’s sample size.

For each objective, the procedure for computing the sample size is as follows:

  1. Specify target precision (e.g., half-width of CI) or power.
  2. Compute the effective sample size under the simple random sampling design (\(n_0\)).
  3. Inflate the effective sample size to account for the design effect.
  4. Inflate the effective sample size to account for potential non-response or attrition.

svy combines these steps and reports both the effective sample size (\(n_0\)) and the final required sample size (\(n\)).
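To make the steps concrete, here is a hand-rolled sketch for a mean, written in plain Python rather than with svy. The function name mean_sample_size and the choice to round up after every step are assumptions made for illustration; they are not taken from the svy source. The finite population correction is left out.

```python
from math import ceil
from statistics import NormalDist


def mean_sample_size(sigma, moe, deff=1.0, resp_rate=1.0, conf_level=0.95):
    """Return (n0, n_deff, n) for estimating a mean; rounds up at each step."""
    # Step 1: the target precision (moe) and confidence level fix the
    # two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    # Step 2: size under simple random sampling.
    n0 = ceil((z * sigma / moe) ** 2)
    # Step 3: inflate by the design effect.
    n_deff = ceil(n0 * deff)
    # Step 4: inflate for anticipated non-response.
    n = ceil(n_deff / resp_rate)
    return n0, n_deff, n


print(mean_sample_size(sigma=7000, moe=1000, deff=1.2, resp_rate=0.9))
```

With the inputs that Objective 1 adopts (sigma of 7,000, margin of error of 1,000, design effect of 1.2, response rate of 0.9), the sketch returns (189, 227, 253), which agrees with the n0, n2_deff, and n columns reported for that objective.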

We can classify the objectives above into two categories: estimation and comparison.

  • Estimation objectives: specify the target half-width of the confidence interval (CI), also called the margin of error (MOE), along with the anticipated point estimate (e.g., a mean or a proportion) and the confidence level (e.g., 95%).

  • Comparison objectives: specify the desired power (e.g., 80%), significance level (e.g., 5%), and a minimum detectable effect (MDE). For example, an MDE can be the difference in means or in proportions between groups (e.g., urban vs. rural at admin-1).

Objective 1: Average Household Expenditure

svy provides the svy.SampSize class to compute required sample sizes. When instantiating a SampSize object, you may supply:

  • pop_size: target population size, used for the finite population correction (FPC); not needed when the sampling fraction is small,
  • deff: assumed design effect to account for clustering, stratification, and unequal weighting,
  • resp_rate: anticipated response rate.

Values for deff and resp_rate are usually obtained from previous survey rounds or from similar studies. For this tutorial, we assume a design effect of 1.2 and an anticipated response rate of 0.9. The pop_size parameter can be omitted, since we expect the sampling fraction to be small and therefore the finite population correction negligible. Furthermore, based on previous surveys and censuses, we expect a standard deviation (sigma) of 7,000 and we want a margin of error (moe) of 1,000.
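A short sketch shows why the finite population correction can be ignored here. It uses the textbook adjustment n0 / (1 + (n0 - 1) / N); whether svy applies exactly this form is an assumption. The helper apply_fpc and the population sizes are hypothetical, with 250,000 standing in for the household count of one admin-1 area if the 2.5 million households were split evenly across 10 areas.

```python
from math import ceil


def apply_fpc(n0, pop_size):
    """Textbook finite population correction of an SRS sample size."""
    return n0 / (1 + (n0 - 1) / pop_size)


# SRS size for sigma=7000 and moe=1000 at 95% confidence.
n0 = 189

# Hypothetical population sizes, from small to roughly one admin-1 area.
for pop_size in (1_000, 10_000, 250_000):
    print(pop_size, ceil(apply_fpc(n0, pop_size)))
```

The corrected size is 160 for a population of 1,000 and 186 for 10,000, but it stays at 189 for 250,000, so omitting pop_size changes nothing at this scale.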

from svy import SampSize

obj1_samp_size = SampSize(deff=1.2, resp_rate=0.9).estimate_mean(
    sigma=7000, moe=1000
)

print("The effective sample size is:")

print(obj1_samp_size)
The effective sample size is:
╭──────────────────────────────────── Sample Size ─────────────────────────────────────╮
                                                                                      
     n0   n1_fpc   n2_deff     n                                                      
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━                                                      
    189      189       227   253                                                      
                                                                                      
╰──────────────────────────────────────────────────────────────────────────────────────╯

The minimum required sample size computed above is per admin-1 area. Since the country has 10 admin-1 areas, a simple national target under equal allocation is to multiply by 10, giving 253 × 10 = 2,530 households.

If you need area-specific assumptions, pass per-area mappings (Python dicts) when creating SampSize. For example:

# from svy import SampSize

deff = {"region1": 1.2, "region2": 1, "region3": 1.05}
resp_rate = {"region1": 0.90, "region2": 0.90, "region3": 0.85}
sigma = {"region1": 7000, "region2": 11000, "region3": 5000}
moe = {"region1": 1000, "region2": 1300, "region3": 700}

samp_size = SampSize(
    deff=deff, resp_rate=resp_rate, stratified=True
).estimate_mean(sigma=sigma, moe=moe)

print(samp_size)
╭──────────────────────────────────── Sample Size ─────────────────────────────────────╮
                                                                                      
    stratum    n0   n1_fpc   n2_deff     n                                            
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                                            
    region1   189      189       227   253                                            
    region2   276      276       276   307                                            
    region3   196      196       206   243                                            
                                                                                      
╰──────────────────────────────────────────────────────────────────────────────────────╯

Objective 2: Poverty Rates Comparisons

The poverty line is defined as 60% of the national median per capita expenditure, here set at 6,000. We want to test for a statistically significant difference in poverty rates between urban and rural populations.

obj2_samp_size = SampSize(deff=1.1, resp_rate=0.9).compare_props(
    p1=0.4, p2=0.5, two_sides=True
)

print("The effective sample size is:")
print(obj2_samp_size)
The effective sample size is:
╭──────────────────────────────────── Sample Size ─────────────────────────────────────╮
                                                                                      
    group    n0   n1_fpc   n2_deff     n                                              
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                                              
    g1      385      385       424   472                                              
    g2      385      385       424   472                                              
                                                                                      
╰──────────────────────────────────────────────────────────────────────────────────────╯
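The figures in this table can be checked by hand. The sketch below is plain Python, not svy; it assumes a two-sided test at the 5% level with 80% power, the unpooled-variance formula for two proportions, and rounding up after every step. The function name compare_props_size and its ratio argument (size of g1 divided by size of g2) are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def compare_props_size(p1, p2, alpha=0.05, power=0.80, ratio=1.0,
                       deff=1.0, resp_rate=1.0):
    """Per-group (n0, n_deff, n) for a two-sided test of p1 against p2.

    ratio is n_g1 / n_g2; the unpooled-variance formula is assumed.
    """
    std_normal = NormalDist()
    z_total = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
    # SRS size for group 2; group 1 is scaled by the allocation ratio.
    n2 = z_total**2 * (p1 * (1 - p1) / ratio + p2 * (1 - p2)) / (p1 - p2) ** 2
    sizes = {}
    for group, n_srs in (("g1", ratio * n2), ("g2", n2)):
        n0 = ceil(n_srs)
        n_deff = ceil(n0 * deff)  # inflate by the design effect
        sizes[group] = (n0, n_deff, ceil(n_deff / resp_rate))  # non-response
    return sizes


print(compare_props_size(0.4, 0.5, deff=1.1, resp_rate=0.9))
```

Under these assumptions the sketch gives (385, 424, 472) for both groups, in line with the n0, n2_deff, and n columns above. Passing ratio=1.5 yields the 60/40 split considered next.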

Suppose we want a larger sample in urban areas to support additional secondary analyses. We can then specify an unequal allocation between groups:

obj2_samp_size = SampSize(deff=1.1, resp_rate=0.9).compare_props(
    p1=0.4, p2=0.5, alloc_ratio=60 / 40, two_sides=True
)


print("The effective sample size is:")
print(obj2_samp_size)
The effective sample size is:
╭──────────────────────────────────── Sample Size ─────────────────────────────────────╮
                                                                                      
    group    n0   n1_fpc   n2_deff     n                                              
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                                              
    g1      483      483       532   592                                              
    g2      322      322       355   395                                              
                                                                                      
╰──────────────────────────────────────────────────────────────────────────────────────╯

The minimum required sample size is 592 households in the urban areas and 395 in the rural areas, per region, assuming a 60/40 allocation.

The design is a stratified two-stage cluster sample. We select 20 households per cluster, for a total of 30 clusters in urban areas and 20 clusters in rural areas, per region.
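The cluster counts follow from dividing the required households by the cluster take and rounding up. The sketch below also compares the two objectives on a per admin-1 basis, which is one reading of the "adopt the maximum" rule stated earlier and should be treated as an assumption; the variable names are illustrative.

```python
from math import ceil

HOUSEHOLDS_PER_CLUSTER = 20

# Required households per admin-1 area under the 60/40 allocation.
required = {"urban": 592, "rural": 395}
# Objective 1 requirement per admin-1 area.
obj1_required = 253

# The larger of the two per-area requirements drives the design.
governing = max(obj1_required, sum(required.values()))

for area, n in required.items():
    clusters = ceil(n / HOUSEHOLDS_PER_CLUSTER)
    # Realized sample is slightly larger because clusters are whole units.
    print(area, clusters, clusters * HOUSEHOLDS_PER_CLUSTER)
print("governing requirement per admin-1 area:", governing)
```

Rounding up to whole clusters gives 30 urban clusters (600 households) and 20 rural clusters (400 households) per area, slightly above the 987 households that Objective 2 requires.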

Next Steps

Next, we select the samples according to our study design: Selection.

References

World Bank. 2023a. “Synthetic Data for an Imaginary Country, Full Population, 2023.” World Bank, Development Data Group. https://doi.org/10.48529/78M1-AE09.
———. 2023b. “Synthetic Data for an Imaginary Country, Sample, 2023.” World Bank, Development Data Group. https://doi.org/10.48529/MC1F-QH23.

Footnotes

  1. The datasets are entirely synthetic and are not intended for real-world applications. They are provided solely for educational purposes. The study objectives and design are intentionally simplified to illustrate survey analysis concepts in these tutorials. The concepts, definitions, targets, and other specifications used in these tutorials do not necessarily reflect those of the World Bank, the World Health Organization (WHO), the United Nations Children’s Fund (UNICEF), or any other referenced institutions.