Survey Parameter Estimation in Python with svy

Means, totals, proportions, ratios, and medians with proper variance estimation

Tutorials
Estimation
Variance Estimation
Python
Estimate means, totals, proportions, ratios, and medians from complex survey data in Python. Learn Taylor linearization and replicate-based variance estimation with design effects using the svy library.
Author

Mamadou S. Diallo, Ph.D.

Published

January 18, 2026

Modified

January 29, 2026

Keywords

survey estimation Python, Taylor linearization variance, replicate weights variance, survey mean estimation, survey proportion estimation, design effect DEFF, complex survey analysis, svy estimation tutorial, survey total estimation, survey ratio estimation, confidence interval survey, bootstrap jackknife BRR

The purpose of a survey is to draw inferences about the target population from a sampled subset. Because most surveys employ some combination of stratification, clustering, unequal probabilities of selection, and post-collection adjustments, treating observations as independent and identically distributed (i.i.d.) can misstate uncertainty.

Proper variance estimation uses design information (weights, strata, PSUs) via Taylor linearization or replication methods (BRR, jackknife, bootstrap) to produce accurate standard errors and confidence intervals.
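As a design-independent illustration of why the weights matter, here is a minimal NumPy sketch (with made-up values, not the svy API) of the weighted mean and total that all of the estimators below build on:

```python
import numpy as np

# Toy sample: heavily sampled groups carry smaller design weights
y = np.array([12_000.0, 9_500.0, 15_200.0, 8_100.0, 11_300.0])  # expenditure
w = np.array([250.0, 410.0, 180.0, 520.0, 300.0])               # design weights

# Weighted (Horvitz-Thompson style) total and mean
total_hat = np.sum(w * y)
mean_hat = total_hat / np.sum(w)

# The unweighted mean ignores the design and can be badly biased
naive_mean = y.mean()
```

Here the weighted mean (about 10,381) sits well below the unweighted mean (11,220) because the larger weights belong to lower-expenditure units.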

This tutorial demonstrates how to use the svy library to produce point estimates for means, totals, proportions, ratios, and medians, along with corresponding Taylor-based and replication-based measures of uncertainty.

Setting Up the Sample

We’ll use the imaginary country household dataset from World Bank (2023):

import numpy as np
import svy

svy.Estimate.PRINT_WIDTH = 95

from rich import print as rprint

rng = np.random.default_rng(12345)

hld_data = svy.load_dataset(name="hld_sample_wb_2023", limit=None)
hld_design = svy.Design(stratum=("geo1", "urbrur"), psu="ea", wgt="hhweight")
hld_sample = svy.Sample(data=hld_data, design=hld_design)

Taylor-Based Estimation

Taylor linearization is the standard approach for variance estimation in complex surveys. It accounts for stratification, clustering, and unequal weights without requiring replicate weight columns.
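To make the idea concrete, here is a textbook-style sketch of the Taylor-linearized variance of a weighted mean under the usual with-replacement (ultimate cluster) approximation: linearized scores are summed to PSU totals, and between-PSU variability is accumulated within strata. This is plain NumPy with invented toy data, not svy's internal implementation, which may differ in details such as finite-population corrections:

```python
import numpy as np

def taylor_se_mean(y, w, stratum, psu):
    """Taylor-linearization SE of a weighted mean under the
    with-replacement (ultimate cluster) approximation."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    stratum, psu = np.asarray(stratum), np.asarray(psu)
    wsum = w.sum()
    ybar = (w * y).sum() / wsum
    # Linearized scores for the mean estimator
    z = w * (y - ybar) / wsum
    var = 0.0
    for h in np.unique(stratum):
        in_h = stratum == h
        psus = np.unique(psu[in_h])
        n_h = len(psus)
        # PSU-level totals of the linearized scores
        t = np.array([z[in_h & (psu == c)].sum() for c in psus])
        var += n_h / (n_h - 1) * ((t - t.mean()) ** 2).sum()
    return ybar, float(np.sqrt(var))
```

With two strata of two PSUs each, only the four PSU totals of the scores enter the variance, which is why the number of PSUs per stratum, not the number of households, drives the degrees of freedom.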

Tip: Implementation in svy
  • Use estimation.mean() to estimate means
  • Use estimation.prop() to estimate proportions
  • Use estimation.total() to estimate totals
  • Use estimation.ratio() to estimate ratios
  • Use estimation.median() to estimate medians

Estimating Means

Estimate the average total household expenditure:

tot_exp_mean = hld_sample.estimation.mean(y="tot_exp")

print(tot_exp_mean)
╭─────────────────── Estimate: MEAN (TAYLOR) ───────────────────╮
                                                               
          est         se           lci           uci   cv (%)  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  12,048.9638   229.9865   11,596.3788   12,501.5488     1.91  
                                                               
╰───────────────────────────────────────────────────────────────╯

Estimate means by region using the by parameter:

tot_exp_mean_admin1 = hld_sample.estimation.mean(y="tot_exp", by="geo1")

print(tot_exp_mean_admin1)
╭──────────────────────── Estimate: MEAN (TAYLOR) ─────────────────────────╮
                                                                          
  geo1             est           se           lci           uci   cv (%)  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  geo_09   12,265.4656     690.6409   10,906.3695   13,624.5616     5.63  
  geo_03   12,892.2285     716.3554   11,482.5295   14,301.9274     5.56  
  geo_10   10,460.2557     672.6037    9,136.6547   11,783.8567     6.43  
  geo_01   14,288.0552     675.5344   12,958.6869   15,617.4235     4.73  
  geo_07   11,255.0069     924.1644    9,436.3654   13,073.6484     8.21  
  geo_08   11,633.2305   1,290.4820    9,093.7212   14,172.7397    11.09  
  geo_05   10,288.6744     623.8749    9,060.9657   11,516.3831     6.06  
  geo_04   12,508.1484     609.8082   11,308.1212   13,708.1757     4.88  
  geo_06   11,866.6574     726.9371   10,436.1349   13,297.1798     6.13  
  geo_02   11,332.3640     661.5220   10,030.5703   12,634.1577     5.84  
                                                                          
╰──────────────────────────────────────────────────────────────────────────╯

Estimate means by multiple grouping variables (geo1 and urbrur):

tot_exp_mean_admin1_urbrur = hld_sample.estimation.mean(y="tot_exp", by=("geo1", "urbrur"))

print(tot_exp_mean_admin1_urbrur)
╭──────────────────────────────── Estimate: MEAN (TAYLOR) ────────────────────────────────╮
                                                                                         
  geo1           urbrur           est           se           lci           uci   cv (%)  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  geo_10_Rural             8,817.8009   1,021.0338    6,808.5323   10,827.0694    11.58  
  geo_09_Urban            16,115.2087   1,126.7975   13,897.8102   18,332.6071     6.99  
  geo_03_Rural             8,419.1458     681.3013    7,078.4289    9,759.8627     8.09  
  geo_03_Urban            15,249.3535     990.8078   13,299.5661   17,199.1409     6.50  
  geo_07_Urban            14,353.0549   1,163.4578   12,063.5135   16,642.5962     8.11  
  geo_02_Rural             9,260.2908     589.2201    8,100.7783   10,419.8032     6.36  
  geo_09_Rural             9,436.2621     743.0018    7,974.1263   10,898.3980     7.87  
  geo_06_Urban            14,786.6872   1,050.4005   12,719.6287   16,853.7456     7.10  
  geo_04_Rural             8,513.7156   1,182.0717    6,187.5445   10,839.8868    13.88  
  geo_07_Rural             7,756.9065   1,206.5539    5,382.5573   10,131.2556    15.55  
  geo_02_Urban            17,755.5353   1,590.1787   14,626.2600   20,884.8105     8.96  
  geo_05_Urban            13,146.8476   1,394.8279   10,401.9985   15,891.6966    10.61  
  geo_04_Urban            14,265.1253     732.6696   12,823.3220   15,706.9285     5.14  
  geo_10_Urban            11,776.0426     893.9162   10,016.9259   13,535.1593     7.59  
  geo_05_Rural             9,128.2380     650.8814    7,847.3837   10,409.0924     7.13  
  geo_08_Urban            17,918.1376   3,553.5066   10,925.2753   24,910.9999    19.83  
  geo_01_Urban            14,288.0552     675.5344   12,958.6869   15,617.4235     4.73  
  geo_06_Rural             9,806.3231     994.6903    7,848.8953   11,763.7509    10.14  
  geo_08_Rural            10,640.8358   1,451.9525    7,783.5725   13,498.0991    13.65  
                                                                                         
╰─────────────────────────────────────────────────────────────────────────────────────────╯

Estimating Proportions

We’ll estimate the share of households living below the poverty line.

In this imaginary country, assume the person-level poverty threshold is 1,800 local currency units. We convert it to a household-level threshold by multiplying by household size; a household is classified as poor if its total expenditure (tot_exp) falls below this threshold.

Note: Method note

This household scaling is for illustration only. Applied analyses commonly use adult-equivalence scales, economies-of-scale adjustments, and regional price indices.

First, create the poverty status variable using mutate():

# Create household poverty line and binary poverty status
hld_sample = hld_sample.wrangling.mutate(
    {
        "hhpovline": svy.col("hhsize") * 1800,
        "pov_status": svy.when(svy.col("tot_exp") < svy.col("hhpovline")).then(1).otherwise(0),
    }
)

rprint(
    hld_sample.show_data(
        columns=[
            "hid",
            "geo1",
            "urbrur",
            "hhsize",
            "tot_exp",
            "hhpovline",
            "pov_status",
            "hhweight",
        ],
        how="sample",
        n=9,
        rstate=rng,
    ),
)
shape: (9, 8)
┌─────────────┬────────┬────────┬────────┬─────────┬───────────┬────────────┬────────────┐
│ hid         ┆ geo1   ┆ urbrur ┆ hhsize ┆ tot_exp ┆ hhpovline ┆ pov_status ┆ hhweight   │
│ ---         ┆ ---    ┆ ---    ┆ ---    ┆ ---     ┆ ---       ┆ ---        ┆ ---        │
│ str         ┆ str    ┆ str    ┆ i64    ┆ i64     ┆ i64       ┆ i32        ┆ f64        │
╞═════════════╪════════╪════════╪════════╪═════════╪═══════════╪════════════╪════════════╡
│ c3c6fb8128b ┆ geo_02 ┆ Rural  ┆ 5      ┆ 6629    ┆ 9000      ┆ 1          ┆ 351.538056 │
│ 48e50c183b8 ┆ geo_09 ┆ Rural  ┆ 5      ┆ 7878    ┆ 9000      ┆ 1          ┆ 237.332226 │
│ a627906a158 ┆ geo_09 ┆ Urban  ┆ 2      ┆ 7786    ┆ 3600      ┆ 0          ┆ 297.582281 │
│ a3ca4cb8183 ┆ geo_07 ┆ Rural  ┆ 7      ┆ 8685    ┆ 12600     ┆ 1          ┆ 260.100388 │
│ 7a4dbad563f ┆ geo_08 ┆ Urban  ┆ 7      ┆ 12135   ┆ 12600     ┆ 1          ┆ 333.592341 │
│ 2ecbbf32149 ┆ geo_07 ┆ Rural  ┆ 12     ┆ 13366   ┆ 21600     ┆ 1          ┆ 262.554165 │
│ 1a442ef36ce ┆ geo_09 ┆ Rural  ┆ 6      ┆ 12220   ┆ 10800     ┆ 0          ┆ 296.665283 │
│ 179a0a159d1 ┆ geo_04 ┆ Rural  ┆ 4      ┆ 5080    ┆ 7200      ┆ 1          ┆ 279.200835 │
│ f0ab4534b79 ┆ geo_10 ┆ Rural  ┆ 2      ┆ 13831   ┆ 3600      ┆ 0          ┆ 389.558197 │
└─────────────┴────────┴────────┴────────┴─────────┴───────────┴────────────┴────────────┘

Estimate poverty rates with the design effect (deff):

hld_pov_ratio = hld_sample.estimation.prop(y="pov_status", deff=True, drop_nulls=True)

print(hld_pov_ratio)
╭───────────────────── Estimate: PROP (TAYLOR) ──────────────────────╮
                                                                    
  pov_status      est       se      lci      uci   cv (%)     deff  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  0            0.7712   0.0140   0.7425   0.7977     1.82   8.9479  
  1            0.2288   0.0140   0.2023   0.2575     6.13   8.9479  
                                                                    
╰────────────────────────────────────────────────────────────────────╯
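The design effect (deff) reported above is the ratio of the design-based variance to the variance a simple random sample of the same size would have produced; for a proportion, the SRS benchmark is p(1-p)/n. A deff near 9 means the clustering and weighting inflate the variance roughly nine-fold. The small helper below makes the definition explicit (illustrative only; the sample size and the exact SRS benchmark svy uses are not shown in the output above):

```python
def deff_proportion(p_hat, se_design, n):
    """Design effect for a proportion: design-based variance
    relative to the binomial (SRS) variance p(1 - p) / n."""
    v_srs = p_hat * (1.0 - p_hat) / n
    return se_design ** 2 / v_srs

def effective_sample_size(n, deff):
    """An SRS of this size would be as informative as the design-based sample."""
    return n / deff
```

The square root of the design effect (often called DEFT) gives the corresponding inflation factor for standard errors.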

Estimating Totals

Estimate the total count of poor households:

hld_pov_count = hld_sample.estimation.total(y="pov_status", deff=True, drop_nulls=True)

print(hld_pov_count)
╭────────────────────────── Estimate: TOTAL (TAYLOR) ──────────────────────────╮
                                                                              
           est            se            lci            uci   cv (%)     deff  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  572,308.8749   36,062.1842   501,342.9490   643,274.8008     6.30   9.4508  
                                                                              
╰──────────────────────────────────────────────────────────────────────────────╯
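The method list above also includes estimation.ratio(), which this tutorial does not demonstrate. As a rough sketch of the underlying point estimator only, a survey ratio is the ratio of two weighted totals; the variable names and values below are invented for illustration and are not part of the svy API:

```python
import numpy as np

# Hypothetical per-household food and total expenditure
food = np.array([4_200.0, 3_100.0, 5_600.0, 2_800.0])
total = np.array([11_000.0, 8_400.0, 14_500.0, 7_900.0])
w = np.array([300.0, 450.0, 220.0, 510.0])

# Ratio of two weighted totals: the share of spending going to food
r_hat = np.sum(w * food) / np.sum(w * total)
```

Because a ratio is non-linear in the two totals, its standard error requires Taylor linearization (or replication), which is exactly what estimation.ratio() is for.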

Estimating Medians

Estimate the median household total expenditure:

hld_median_exp = hld_sample.estimation.median(y="tot_exp")

print(hld_median_exp)
╭───────────────── Estimate: MEDIAN (TAYLOR) ──────────────────╮
                                                              
          est         se          lci           uci   cv (%)  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  10,268.0000   220.5454   9,844.0000   10,712.0123     2.15  
                                                              
╰──────────────────────────────────────────────────────────────╯
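Conceptually, a weighted median is the point where the cumulative weight reaches half the total weight. The sketch below uses that simple discrete definition in NumPy; svy's median estimate and its Taylor-based standard error may use interpolation and will generally not match this helper exactly:

```python
import numpy as np

def weighted_median(y, w):
    """Smallest y value whose cumulative weight reaches half the total weight."""
    order = np.argsort(y)
    y = np.asarray(y, float)[order]
    w = np.asarray(w, float)[order]
    cum = np.cumsum(w)
    # First position where the cumulative weight meets the halfway point
    return float(y[np.searchsorted(cum, 0.5 * w.sum())])
```

With equal weights this reduces to the ordinary sample median; unequal weights shift the halfway point toward heavily weighted observations.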

Replicate-Based Estimation

Replicate-based variance estimation uses replicate weights (bootstrap, BRR, or jackknife) instead of Taylor linearization. This approach is especially useful for non-linear statistics where linearization may be inaccurate.
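Until those examples land, here is a generic sketch of how replicate-based standard errors are computed once replicate estimates exist. It shows the JK1 (delete-one-PSU) jackknife formula, with deviations taken around the replicate mean; BRR, Fay's method, and the bootstrap follow the same pattern with different replicate weights and scaling factors. This is not svy code:

```python
import numpy as np

def jackknife_se(theta_reps):
    """JK1 jackknife SE: theta_reps[k] is the point estimate recomputed
    with PSU k removed and the remaining weights rescaled."""
    t = np.asarray(theta_reps, float)
    n = t.size
    # Scaled sum of squared deviations around the replicate mean
    return float(np.sqrt((n - 1) / n * np.sum((t - t.mean()) ** 2)))
```

The appeal of replication is that the same formula applies to any statistic, however non-linear, as long as it can be recomputed under each set of replicate weights.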

Coming Soon: Examples of replicate-based estimation will be added in a future update.

Next Steps

Continue to Categorical Data Analysis to learn how to create weighted cross-tabulations and perform chi-square tests.

Ready for more analysis?
Learn categorical analysis in Categorical Data Analysis →

References

World Bank. 2023. “Synthetic Data for an Imaginary Country, Sample, 2023.” World Bank, Development Data Group. https://doi.org/10.48529/MC1F-QH23.