import numpy as np
import svy
hld_data = svy.load_dataset(name="hld_sample_wb_2023", limit=None)
hld_design = svy.Design(stratum=("geo1", "urbrur"), psu="ea", wgt="hhweight")
hld_sample = svy.Sample(data=hld_data, design=hld_design)
# Create household poverty line and binary poverty status
hld_sample = hld_sample.wrangling.mutate(
{
"hhpovline": svy.col("hhsize") * 1800,
"pov_status": svy.when(svy.col("tot_exp") < svy.col("hhpovline")).then(1).otherwise(0),
}
)Categorical Data Analysis for Complex Surveys in Python
Design-aware tabulations, cross-tabulations, and hypothesis tests
categorical data analysis Python, survey cross-tabulation, weighted contingency table, design-adjusted t-test, chi-square test survey, svy library, complex survey analysis, two-sample t-test survey, one-sample t-test survey, survey tabulation Python, weighted frequency table
Categorical data analysis spans descriptive techniques—contingency tables and cross-tabulations with design-adjusted tests—and model-based approaches for categorical outcomes (logistic, multinomial, loglinear, and mixed-effects GLMs; see Agresti (2013)).
In complex survey applications, ignoring stratification, clustering, or unequal weights can misstate uncertainty. This tutorial shows how to use svy to produce design-aware tabulations, cross-tabulations, and t-tests for group differences, with standard errors that respect the sample design.
Setting Up the Sample
We’ll use the imaginary country household dataset from World Bank (2023):
One-Way Tabulation
The tabulate() method produces weighted frequency tables that account for the survey design.
Tabulate the first-level administrative unit (geo1):
hld_admin1_tab = hld_sample.categorical.tabulate(rowvar="geo1")
print(hld_admin1_tab)╭────────────────────── Table: geo1 ───────────────────────╮ │ Type=One-Way │ │ Alpha=0.05 │ │ │ │ │ │ Row Estimate Std Err CV Lower Upper │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ geo_01 0.1370 0.0041 0.0296 0.1292 0.1452 │ │ geo_02 0.0956 0.0038 0.0402 0.0883 0.1034 │ │ geo_03 0.1126 0.0033 0.0296 0.1062 0.1193 │ │ geo_04 0.1481 0.0047 0.0319 0.1391 0.1577 │ │ geo_05 0.0843 0.0031 0.0373 0.0783 0.0907 │ │ geo_06 0.0747 0.0032 0.0425 0.0687 0.0812 │ │ geo_07 0.0727 0.0040 0.0557 0.0651 0.0810 │ │ geo_08 0.0411 0.0039 0.0938 0.0342 0.0494 │ │ geo_09 0.1148 0.0040 0.0351 0.1071 0.1229 │ │ geo_10 0.1191 0.0047 0.0398 0.1101 0.1287 │ │ │ ╰──────────────────────────────────────────────────────────╯
Changing Output Units
By default, tabulate() produces proportions. Use the units parameter to get counts or percentages:
# Counts
hld_admin1_tab_count = hld_sample.categorical.tabulate(
rowvar="geo1",
units=svy.TableUnits.COUNT,
)
print("Table with counts")
print(hld_admin1_tab_count)Table with counts
╭────────────────────────────── Table: geo1 ───────────────────────────────╮ │ Type=One-Way │ │ Alpha=0.05 │ │ │ │ │ │ Row Estimate Std Err CV Lower Upper │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ geo_01 342733.0000 10680.8501 0.0312 321714.4057 363751.5943 │ │ geo_02 239113.0000 10113.6242 0.0423 219210.6363 259015.3637 │ │ geo_03 281600.0000 8476.1378 0.0301 264920.0074 298279.9926 │ │ geo_04 370596.0000 12852.9351 0.0347 345303.0107 395888.9893 │ │ geo_05 210960.0000 8080.7698 0.0383 195058.0427 226861.9573 │ │ geo_06 186992.0000 8198.4041 0.0438 170858.5530 203125.4470 │ │ geo_07 181766.0000 10639.1062 0.0585 160829.5526 202702.4474 │ │ geo_08 102927.0000 9977.5362 0.0969 83292.4407 122561.5593 │ │ geo_09 287141.0000 10651.1400 0.0371 266180.8715 308101.1285 │ │ geo_10 297927.0000 12817.4855 0.0430 272703.7711 323150.2289 │ │ │ ╰──────────────────────────────────────────────────────────────────────────╯
# Percentages
hld_admin1_tab_percent = hld_sample.categorical.tabulate(
rowvar="geo1",
units=svy.TableUnits.PERCENT,
)
print("Table with percentages:")
print(hld_admin1_tab_percent)Table with percentages:
╭─────────────────────── Table: geo1 ────────────────────────╮ │ Type=One-Way │ │ Alpha=0.05 │ │ │ │ │ │ Row Estimate Std Err CV Lower Upper │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ geo_01 13.6997 0.4269 0.0312 12.8595 14.5399 │ │ geo_02 9.5578 0.4043 0.0423 8.7623 10.3533 │ │ geo_03 11.2561 0.3388 0.0301 10.5894 11.9228 │ │ geo_04 14.8134 0.5138 0.0347 13.8024 15.8245 │ │ geo_05 8.4325 0.3230 0.0383 7.7968 9.0681 │ │ geo_06 7.4744 0.3277 0.0438 6.8295 8.1193 │ │ geo_07 7.2655 0.4253 0.0585 6.4287 8.1024 │ │ geo_08 4.1142 0.3988 0.0969 3.3294 4.8990 │ │ geo_09 11.4776 0.4257 0.0371 10.6398 12.3154 │ │ geo_10 11.9087 0.5123 0.0430 10.9005 12.9169 │ │ │ ╰────────────────────────────────────────────────────────────╯
Scaling Counts to a Custom Total
Use count_total to express counts on an arbitrary total (useful for scaled headcounts while preserving shares):
# Scale counts so the total sums to 1,000
hld_admin1_tab_n = hld_sample.categorical.tabulate(
rowvar="geo1",
count_total=1_000,
)
print(hld_admin1_tab_n)╭──────────────────────── Table: geo1 ─────────────────────────╮ │ Type=One-Way │ │ Alpha=0.05 │ │ │ │ │ │ Row Estimate Std Err CV Lower Upper │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ geo_01 136.9970 4.2693 0.0312 128.5955 145.3986 │ │ geo_02 95.5781 4.0426 0.0423 87.6227 103.5335 │ │ geo_03 112.5610 3.3881 0.0301 105.8937 119.2283 │ │ geo_04 148.1344 5.1376 0.0347 138.0243 158.2445 │ │ geo_05 84.3248 3.2300 0.0383 77.9685 90.6811 │ │ geo_06 74.7443 3.2771 0.0438 68.2955 81.1932 │ │ geo_07 72.6554 4.2527 0.0585 64.2867 81.0241 │ │ geo_08 41.1419 3.9882 0.0969 33.2936 48.9902 │ │ geo_09 114.7758 4.2575 0.0371 106.3977 123.1540 │ │ geo_10 119.0872 5.1234 0.0430 109.0050 129.1694 │ │ │ ╰──────────────────────────────────────────────────────────────╯
Two-Way Tabulation (Cross-Tabulation)
Cross-tabulations examine the relationship between two categorical variables. Use the colvar parameter to create a two-way table:
# Cross-tabulate urban/rural status by electricity access
urbrur_elec_tab = hld_sample.categorical.tabulate(
rowvar="urbrur",
colvar="electricity",
)
print(urbrur_elec_tab)╭───────────────── Table: urbrur × electricity ─────────────────╮ │ Type=Two-Way │ │ Alpha=0.05 │ │ │ │ │ │ Row Col Estimate Std Err CV Lower Upper │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ Rural No 0.1520 0.0113 0.0746 0.1310 0.1757 │ │ Rural Yes 0.2970 0.0120 0.0403 0.2739 0.3211 │ │ Urban No 0.0186 0.0037 0.2012 0.0125 0.0275 │ │ Urban Yes 0.5325 0.0074 0.0140 0.5178 0.5471 │ │ │ ╰───────────────────────────────────────────────────────────────╯
Viewing as a Crosstab Matrix
The crosstab() method returns a Polars DataFrame in a familiar matrix format:
# Get crosstab matrix (estimates only)
urbrur_elec_crosstab = urbrur_elec_tab.crosstab()
print(urbrur_elec_crosstab)shape: (2, 3)
┌────────┬──────────┬──────────┐
│ urbrur ┆ No ┆ Yes │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞════════╪══════════╪══════════╡
│ Rural ┆ 0.151987 ┆ 0.296956 │
│ Urban ┆ 0.018563 ┆ 0.532494 │
└────────┴──────────┴──────────┘
# Include standard errors with the estimates
urbrur_elec_crosstab_se = urbrur_elec_tab.crosstab(stats=("est", "se"))
print(urbrur_elec_crosstab_se)shape: (2, 3)
┌────────┬───────────────┬───────────────┐
│ urbrur ┆ No ┆ Yes │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞════════╪═══════════════╪═══════════════╡
│ Rural ┆ 0.152 ± 0.011 ┆ 0.297 ± 0.012 │
│ Urban ┆ 0.019 ± 0.004 ┆ 0.532 ± 0.007 │
└────────┴───────────────┴───────────────┘
Two-Way Tables with Different Units
Just like one-way tables, you can change the output units:
# Cross-tabulation with percentages
urbrur_elec_pct = hld_sample.categorical.tabulate(
rowvar="urbrur",
colvar="electricity",
units=svy.TableUnits.PERCENT,
)
print("Cross-tabulation with percentages:")
print(urbrur_elec_pct.crosstab())Cross-tabulation with percentages:
shape: (2, 3)
┌────────┬───────────┬───────────┐
│ urbrur ┆ No ┆ Yes │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞════════╪═══════════╪═══════════╡
│ Rural ┆ 15.198691 ┆ 29.695594 │
│ Urban ┆ 1.856347 ┆ 53.249369 │
└────────┴───────────┴───────────┘
# Cross-tabulation with counts
urbrur_elec_count = hld_sample.categorical.tabulate(
rowvar="urbrur",
colvar="electricity",
units=svy.TableUnits.COUNT,
)
print("Cross-tabulation with counts:")
print(urbrur_elec_count.crosstab())Cross-tabulation with counts:
shape: (2, 3)
┌────────┬───────────────┬───────────────┐
│ urbrur ┆ No ┆ Yes │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞════════╪═══════════════╪═══════════════╡
│ Rural ┆ 380234.000686 ┆ 742910.999314 │
│ Urban ┆ 46441.251273 ┆ 1.3322e6 │
└────────┴───────────────┴───────────────┘
Crosstab Display Options
The crosstab() method offers several formatting options:
# Show estimates with confidence intervals
print(urbrur_elec_tab.crosstab(stats=("est", "lci", "uci"), precision=4))shape: (2, 3)
┌────────┬─────────────────────────┬─────────────────────────┐
│ urbrur ┆ No ┆ Yes │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞════════╪═════════════════════════╪═════════════════════════╡
│ Rural ┆ 0.1520 [0.1310, 0.1757] ┆ 0.2970 [0.2739, 0.3211] │
│ Urban ┆ 0.0186 [0.0125, 0.0275] ┆ 0.5325 [0.5178, 0.5471] │
└────────┴─────────────────────────┴─────────────────────────┘
# Export full table details to Polars DataFrame
df = urbrur_elec_tab.to_polars()
print(df)shape: (4, 8)
┌────────┬────────┬──────────┬──────────┬──────────┬──────────┬────────────┬───────┐
│ rowvar ┆ colvar ┆ est ┆ se ┆ lci ┆ uci ┆ table_type ┆ alpha │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str ┆ f64 │
╞════════╪════════╪══════════╪══════════╪══════════╪══════════╪════════════╪═══════╡
│ Rural ┆ No ┆ 0.151987 ┆ 0.011336 ┆ 0.130996 ┆ 0.175662 ┆ Two-Way ┆ 0.05 │
│ Rural ┆ Yes ┆ 0.296956 ┆ 0.011976 ┆ 0.27394 ┆ 0.321051 ┆ Two-Way ┆ 0.05 │
│ Urban ┆ No ┆ 0.018563 ┆ 0.003736 ┆ 0.012477 ┆ 0.027537 ┆ Two-Way ┆ 0.05 │
│ Urban ┆ Yes ┆ 0.532494 ┆ 0.007446 ┆ 0.517817 ┆ 0.547115 ┆ Two-Way ┆ 0.05 │
└────────┴────────┴──────────┴──────────┴──────────┴──────────┴────────────┴───────┘
T-Tests for Survey Data
The ttest() method performs design-adjusted t-tests that properly account for stratification, clustering, and weighting.
One-Sample T-Test
Test whether a population mean equals a hypothesized value:
# Test: Is the poverty rate different from 25%?
pov_status_mean_h0 = hld_sample.categorical.ttest(
y="pov_status",
mean_h0=0.25,
)
print(pov_status_mean_h0)╭────────────── T-Test: One-sample ───────────────╮ │ Y = 'pov_status' │ │ H₀: μ = 0.2500 │ │ │ │ │ │ Estimate Std Err CV Lower Upper │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ 0.2288 0.0140 0.0613 0.2012 0.2564 │ │ │ │ │ │ Test statistic │ │ │ │ diff t df p_value │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ -0.0212 -1.5141 300.0000 0.1310 │ │ │ ╰─────────────────────────────────────────────────╯
Accessing Results Programmatically
# Test statistics
print(f"t-statistic: {pov_status_mean_h0.stats.t:.4f}")
print(f"Degrees of freedom: {pov_status_mean_h0.stats.df}")
print(f"p-value: {pov_status_mean_h0.stats.p_value:.4f}")t-statistic: -1.5141
Degrees of freedom: 300.0
p-value: 0.1310
# Difference from hypothesized mean
diff = pov_status_mean_h0.diff[0]
print(f"Difference: {diff.diff:.4f}")
print(f"95% CI: [{diff.lci:.4f}, {diff.uci:.4f}]")Difference: -0.0212
95% CI: [-0.0488, 0.0064]
Export to DataFrame
# Combined test results (default)
pov_status_mean_h0.to_polars()| y | diff | se | lci | uci | t | df | p_value |
|---|---|---|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| "pov_status" | -0.021237 | 0.014026 | -0.048839 | 0.006365 | -1.514116 | 300.0 | 0.131049 |
# Raw estimates only
pov_status_mean_h0.to_polars("estimates")| y | est | se | cv | lci | uci |
|---|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 | f64 |
| "pov_status" | 0.228763 | 0.014026 | 0.061313 | 0.201161 | 0.256364 |
Two-Sample T-Test
Compare means between two groups:
# Test: Does total expenditure differ between urban and rural areas?
exp_by_urbrur = hld_sample.categorical.ttest(
y="tot_exp",
group="urbrur",
)
print(exp_by_urbrur)╭─────────────────────── T-Test: Two-sample (unpaired) ───────────────────────╮ │ Y = 'tot_exp' │ │ Groups: urbrur = ['Urban' vs 'Rural'] │ │ │ │ │ │ Group Level Estimate Std Err CV Lower Upper │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ urbrur Urban 14437.9184 326.4021 0.0226 13795.5994 15080.2375 │ │ urbrur Rural 9116.6293 305.5200 0.0335 8515.4038 9717.8549 │ │ │ │ │ │ Test statistic │ │ │ │ diff t df p_value │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ 5321.2891 11.9023 300.0000 <0.0001 │ │ │ ╰─────────────────────────────────────────────────────────────────────────────╯
Two-Sample Results
# Group difference
diff = exp_by_urbrur.diff[0]
print(f"Difference: {diff.diff:.2f}")
print(f"95% CI: [{diff.lci:.2f}, {diff.uci:.2f}]")Difference: 5321.29
95% CI: [4441.48, 6201.10]
# Individual group estimates
for est in exp_by_urbrur.estimates:
print(f"{est.group_level}: {est.est:.2f} (SE: {est.se:.2f})")Rural: 9116.63 (SE: 305.52)
Urban: 14437.92 (SE: 326.40)
# Export to DataFrame
exp_by_urbrur.to_polars()| y | group_var | paired | diff | se | lci | uci | t | df | p_value |
|---|---|---|---|---|---|---|---|---|---|
| str | str | bool | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| "tot_exp" | "urbrur" | false | 5321.289092 | 447.080293 | 4441.478438 | 6201.099746 | 11.902312 | 300.0 | 0.0 |
Alternative Hypotheses
Specify one-sided tests with the alternative parameter:
# Test: Is urban expenditure GREATER than rural expenditure?
exp_by_urbrur_greater = hld_sample.categorical.ttest(
y="tot_exp",
group="urbrur",
alternative="greater",
)
print(f"One-sided p-value: {exp_by_urbrur_greater.stats.p_value:.6f}")
print(f"Alternative: {exp_by_urbrur_greater.alternative}")One-sided p-value: 0.000000
Alternative: greater
Options:
alternative |
Null Hypothesis | Alternative Hypothesis |
|---|---|---|
"two-sided" |
μ₁ = μ₂ | μ₁ ≠ μ₂ |
"less" |
μ₁ ≥ μ₂ | μ₁ < μ₂ |
"greater" |
μ₁ ≤ μ₂ | μ₁ > μ₂ |
For one-sample tests, the comparison is against mean_h0.
Domain Estimation with by
Perform separate t-tests for each level of a domain variable:
# Test poverty rate vs 25% separately by region
pov_by_region = hld_sample.categorical.ttest(
y="pov_status",
mean_h0=0.25,
by="geo1",
)
# Returns a list of TTestOneGroup objects
for r in pov_by_region:
diff = r.diff[0]
print(f"Region {diff.by_level}: diff={diff.diff:.4f}, p={r.stats.p_value:.4f}")Region geo_01: diff=-0.1641, p=0.0000
Region geo_02: diff=0.0916, p=0.1203
Region geo_03: diff=-0.0314, p=0.4125
Region geo_04: diff=-0.0785, p=0.0320
Region geo_05: diff=0.0732, p=0.2454
Region geo_06: diff=-0.0313, p=0.5516
Region geo_07: diff=-0.0223, p=0.6682
Region geo_08: diff=0.0028, p=0.9736
Region geo_09: diff=0.0701, p=0.1258
Region geo_10: diff=-0.0229, p=0.5386
Degrees of Freedom
For complex surveys, degrees of freedom depend on the design:
| Design | Degrees of Freedom |
|---|---|
| Weights only (SRS) | n - 1 |
| Stratified | n - n_strata - 1 |
| Clustered | n_psu - 1 |
| Stratified + Clustered | n_psu - n_strata - 1 |
Where:
n= number of observationsn_strata= number of stratan_psu= number of primary sampling units
Next Steps
Continue to Generalized Linear Models to learn how to fit linear and logistic regression models with design-adjusted standard errors.
Ready for regression modeling?
Learn GLMs in Generalized Linear Models →