Categorical Data Analysis for Complex Surveys in Python

Design-aware tabulations, cross-tabulations, and hypothesis tests

Tutorials
Categorical Analysis
Hypothesis Testing
Python
Analyze categorical survey data in Python with design-adjusted tabulations, cross-tabulations, and hypothesis tests. Learn to create weighted contingency tables and t-tests using the svy library.
Author

Mamadou S. Diallo, Ph.D.

Published

January 18, 2026

Modified

January 18, 2026

Keywords

categorical data analysis Python, survey cross-tabulation, weighted contingency table, design-adjusted t-test, chi-square test survey, svy library, complex survey analysis, two-sample t-test survey, one-sample t-test survey, survey tabulation Python, weighted frequency table

Categorical data analysis spans descriptive techniques—contingency tables and cross-tabulations with design-adjusted tests—and model-based approaches for categorical outcomes (logistic, multinomial, loglinear, and mixed-effects GLMs; see Agresti (2013)).

In complex survey applications, ignoring stratification, clustering, or unequal weights can misstate uncertainty. This tutorial shows how to use svy to produce design-aware tabulations, cross-tabulations, and t-tests for group differences, with standard errors that respect the sample design.

Setting Up the Sample

We’ll use the imaginary country household dataset from World Bank (2023):

import numpy as np
import svy

hld_data = svy.load_dataset(name="hld_sample_wb_2023", limit=None)

hld_design = svy.Design(stratum=("geo1", "urbrur"), psu="ea", wgt="hhweight")
hld_sample = svy.Sample(data=hld_data, design=hld_design)

# Create household poverty line and binary poverty status
hld_sample = hld_sample.wrangling.mutate(
    {
        "hhpovline": svy.col("hhsize") * 1800,
        "pov_status": svy.when(svy.col("tot_exp") < svy.col("hhpovline")).then(1).otherwise(0),
    }
)

One-Way Tabulation

The tabulate() method produces weighted frequency tables that account for the survey design.

Tabulate the first-level administrative unit (geo1):

hld_admin1_tab = hld_sample.categorical.tabulate(rowvar="geo1")
print(hld_admin1_tab)
╭────────────────────── Table: geo1 ───────────────────────╮
 Type=One-Way                                             
 Alpha=0.05                                               
                                                          
                                                          
  Row      Estimate   Std Err       CV    Lower    Upper  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  geo_01     0.1370    0.0041   0.0296   0.1292   0.1452  
  geo_02     0.0956    0.0038   0.0402   0.0883   0.1034  
  geo_03     0.1126    0.0033   0.0296   0.1062   0.1193  
  geo_04     0.1481    0.0047   0.0319   0.1391   0.1577  
  geo_05     0.0843    0.0031   0.0373   0.0783   0.0907  
  geo_06     0.0747    0.0032   0.0425   0.0687   0.0812  
  geo_07     0.0727    0.0040   0.0557   0.0651   0.0810  
  geo_08     0.0411    0.0039   0.0938   0.0342   0.0494  
  geo_09     0.1148    0.0040   0.0351   0.1071   0.1229  
  geo_10     0.1191    0.0047   0.0398   0.1101   0.1287  
                                                          
╰──────────────────────────────────────────────────────────╯

Changing Output Units

By default, tabulate() produces proportions. Use the units parameter to get counts or percentages:

# Counts
hld_admin1_tab_count = hld_sample.categorical.tabulate(
    rowvar="geo1",
    units=svy.TableUnits.COUNT,
)
print("Table with counts")
print(hld_admin1_tab_count)
Table with counts
╭────────────────────────────── Table: geo1 ───────────────────────────────╮
 Type=One-Way                                                             
 Alpha=0.05                                                               
                                                                          
                                                                          
  Row         Estimate      Std Err       CV         Lower         Upper  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  geo_01   342733.0000   10680.8501   0.0312   321714.4057   363751.5943  
  geo_02   239113.0000   10113.6242   0.0423   219210.6363   259015.3637  
  geo_03   281600.0000    8476.1378   0.0301   264920.0074   298279.9926  
  geo_04   370596.0000   12852.9351   0.0347   345303.0107   395888.9893  
  geo_05   210960.0000    8080.7698   0.0383   195058.0427   226861.9573  
  geo_06   186992.0000    8198.4041   0.0438   170858.5530   203125.4470  
  geo_07   181766.0000   10639.1062   0.0585   160829.5526   202702.4474  
  geo_08   102927.0000    9977.5362   0.0969    83292.4407   122561.5593  
  geo_09   287141.0000   10651.1400   0.0371   266180.8715   308101.1285  
  geo_10   297927.0000   12817.4855   0.0430   272703.7711   323150.2289  
                                                                          
╰──────────────────────────────────────────────────────────────────────────╯
# Percentages
hld_admin1_tab_percent = hld_sample.categorical.tabulate(
    rowvar="geo1",
    units=svy.TableUnits.PERCENT,
)
print("Table with percentages:")
print(hld_admin1_tab_percent)
Table with percentages:
╭─────────────────────── Table: geo1 ────────────────────────╮
 Type=One-Way                                               
 Alpha=0.05                                                 
                                                            
                                                            
  Row      Estimate   Std Err       CV     Lower     Upper  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  geo_01    13.6997    0.4269   0.0312   12.8595   14.5399  
  geo_02     9.5578    0.4043   0.0423    8.7623   10.3533  
  geo_03    11.2561    0.3388   0.0301   10.5894   11.9228  
  geo_04    14.8134    0.5138   0.0347   13.8024   15.8245  
  geo_05     8.4325    0.3230   0.0383    7.7968    9.0681  
  geo_06     7.4744    0.3277   0.0438    6.8295    8.1193  
  geo_07     7.2655    0.4253   0.0585    6.4287    8.1024  
  geo_08     4.1142    0.3988   0.0969    3.3294    4.8990  
  geo_09    11.4776    0.4257   0.0371   10.6398   12.3154  
  geo_10    11.9087    0.5123   0.0430   10.9005   12.9169  
                                                            
╰────────────────────────────────────────────────────────────╯

Scaling Counts to a Custom Total

Use count_total to express counts on an arbitrary total (useful for scaled headcounts while preserving shares):

# Scale counts so the total sums to 1,000
hld_admin1_tab_n = hld_sample.categorical.tabulate(
    rowvar="geo1",
    count_total=1_000,
)
print(hld_admin1_tab_n)
╭──────────────────────── Table: geo1 ─────────────────────────╮
 Type=One-Way                                                 
 Alpha=0.05                                                   
                                                              
                                                              
  Row      Estimate   Std Err       CV      Lower      Upper  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  geo_01   136.9970    4.2693   0.0312   128.5955   145.3986  
  geo_02    95.5781    4.0426   0.0423    87.6227   103.5335  
  geo_03   112.5610    3.3881   0.0301   105.8937   119.2283  
  geo_04   148.1344    5.1376   0.0347   138.0243   158.2445  
  geo_05    84.3248    3.2300   0.0383    77.9685    90.6811  
  geo_06    74.7443    3.2771   0.0438    68.2955    81.1932  
  geo_07    72.6554    4.2527   0.0585    64.2867    81.0241  
  geo_08    41.1419    3.9882   0.0969    33.2936    48.9902  
  geo_09   114.7758    4.2575   0.0371   106.3977   123.1540  
  geo_10   119.0872    5.1234   0.0430   109.0050   129.1694  
                                                              
╰──────────────────────────────────────────────────────────────╯

Two-Way Tabulation (Cross-Tabulation)

Cross-tabulations examine the relationship between two categorical variables. Use the colvar parameter to create a two-way table:

# Cross-tabulate urban/rural status by electricity access
urbrur_elec_tab = hld_sample.categorical.tabulate(
    rowvar="urbrur",
    colvar="electricity",
)
print(urbrur_elec_tab)
╭───────────────── Table: urbrur × electricity ─────────────────╮
 Type=Two-Way                                                  
 Alpha=0.05                                                    
                                                               
                                                               
  Row     Col   Estimate   Std Err       CV    Lower    Upper  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  Rural   No      0.1520    0.0113   0.0746   0.1310   0.1757  
  Rural   Yes     0.2970    0.0120   0.0403   0.2739   0.3211  
  Urban   No      0.0186    0.0037   0.2012   0.0125   0.0275  
  Urban   Yes     0.5325    0.0074   0.0140   0.5178   0.5471  
                                                               
╰───────────────────────────────────────────────────────────────╯

Viewing as a Crosstab Matrix

The crosstab() method returns a Polars DataFrame in a familiar matrix format:

# Get crosstab matrix (estimates only)
urbrur_elec_crosstab = urbrur_elec_tab.crosstab()
print(urbrur_elec_crosstab)
shape: (2, 3)
┌────────┬──────────┬──────────┐
│ urbrur ┆ No       ┆ Yes      │
│ ---    ┆ ---      ┆ ---      │
│ str    ┆ f64      ┆ f64      │
╞════════╪══════════╪══════════╡
│ Rural  ┆ 0.151987 ┆ 0.296956 │
│ Urban  ┆ 0.018563 ┆ 0.532494 │
└────────┴──────────┴──────────┘
# Include standard errors with the estimates
urbrur_elec_crosstab_se = urbrur_elec_tab.crosstab(stats=("est", "se"))
print(urbrur_elec_crosstab_se)
shape: (2, 3)
┌────────┬───────────────┬───────────────┐
│ urbrur ┆ No            ┆ Yes           │
│ ---    ┆ ---           ┆ ---           │
│ str    ┆ str           ┆ str           │
╞════════╪═══════════════╪═══════════════╡
│ Rural  ┆ 0.152 ± 0.011 ┆ 0.297 ± 0.012 │
│ Urban  ┆ 0.019 ± 0.004 ┆ 0.532 ± 0.007 │
└────────┴───────────────┴───────────────┘

Two-Way Tables with Different Units

Just like one-way tables, you can change the output units:

# Cross-tabulation with percentages
urbrur_elec_pct = hld_sample.categorical.tabulate(
    rowvar="urbrur",
    colvar="electricity",
    units=svy.TableUnits.PERCENT,
)
print("Cross-tabulation with percentages:")
print(urbrur_elec_pct.crosstab())
Cross-tabulation with percentages:
shape: (2, 3)
┌────────┬───────────┬───────────┐
│ urbrur ┆ No        ┆ Yes       │
│ ---    ┆ ---       ┆ ---       │
│ str    ┆ f64       ┆ f64       │
╞════════╪═══════════╪═══════════╡
│ Rural  ┆ 15.198691 ┆ 29.695594 │
│ Urban  ┆ 1.856347  ┆ 53.249369 │
└────────┴───────────┴───────────┘
# Cross-tabulation with counts
urbrur_elec_count = hld_sample.categorical.tabulate(
    rowvar="urbrur",
    colvar="electricity",
    units=svy.TableUnits.COUNT,
)
print("Cross-tabulation with counts:")
print(urbrur_elec_count.crosstab())
Cross-tabulation with counts:
shape: (2, 3)
┌────────┬───────────────┬───────────────┐
│ urbrur ┆ No            ┆ Yes           │
│ ---    ┆ ---           ┆ ---           │
│ str    ┆ f64           ┆ f64           │
╞════════╪═══════════════╪═══════════════╡
│ Rural  ┆ 380234.000686 ┆ 742910.999314 │
│ Urban  ┆ 46441.251273  ┆ 1.3322e6      │
└────────┴───────────────┴───────────────┘

Crosstab Display Options

The crosstab() method offers several formatting options:

# Show estimates with confidence intervals
print(urbrur_elec_tab.crosstab(stats=("est", "lci", "uci"), precision=4))
shape: (2, 3)
┌────────┬─────────────────────────┬─────────────────────────┐
│ urbrur ┆ No                      ┆ Yes                     │
│ ---    ┆ ---                     ┆ ---                     │
│ str    ┆ str                     ┆ str                     │
╞════════╪═════════════════════════╪═════════════════════════╡
│ Rural  ┆ 0.1520 [0.1310, 0.1757] ┆ 0.2970 [0.2739, 0.3211] │
│ Urban  ┆ 0.0186 [0.0125, 0.0275] ┆ 0.5325 [0.5178, 0.5471] │
└────────┴─────────────────────────┴─────────────────────────┘
# Export full table details to Polars DataFrame
df = urbrur_elec_tab.to_polars()
print(df)
shape: (4, 8)
┌────────┬────────┬──────────┬──────────┬──────────┬──────────┬────────────┬───────┐
│ rowvar ┆ colvar ┆ est      ┆ se       ┆ lci      ┆ uci      ┆ table_type ┆ alpha │
│ ---    ┆ ---    ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---        ┆ ---   │
│ str    ┆ str    ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ str        ┆ f64   │
╞════════╪════════╪══════════╪══════════╪══════════╪══════════╪════════════╪═══════╡
│ Rural  ┆ No     ┆ 0.151987 ┆ 0.011336 ┆ 0.130996 ┆ 0.175662 ┆ Two-Way    ┆ 0.05  │
│ Rural  ┆ Yes    ┆ 0.296956 ┆ 0.011976 ┆ 0.27394  ┆ 0.321051 ┆ Two-Way    ┆ 0.05  │
│ Urban  ┆ No     ┆ 0.018563 ┆ 0.003736 ┆ 0.012477 ┆ 0.027537 ┆ Two-Way    ┆ 0.05  │
│ Urban  ┆ Yes    ┆ 0.532494 ┆ 0.007446 ┆ 0.517817 ┆ 0.547115 ┆ Two-Way    ┆ 0.05  │
└────────┴────────┴──────────┴──────────┴──────────┴──────────┴────────────┴───────┘

T-Tests for Survey Data

The ttest() method performs design-adjusted t-tests that properly account for stratification, clustering, and weighting.

One-Sample T-Test

Test whether a population mean equals a hypothesized value:

# Test: Is the poverty rate different from 25%?
pov_status_mean_h0 = hld_sample.categorical.ttest(
    y="pov_status",
    mean_h0=0.25,
)
print(pov_status_mean_h0)
╭────────────── T-Test: One-sample ───────────────╮
 Y = 'pov_status'                                
 H₀: μ = 0.2500                                  
                                                 
                                                 
  Estimate   Std Err       CV    Lower    Upper  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
    0.2288    0.0140   0.0613   0.2012   0.2564  
                                                 
                                                 
 Test statistic                                  
                                                 
     diff         t         df   p_value         
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━         
  -0.0212   -1.5141   300.0000    0.1310         
                                                 
╰─────────────────────────────────────────────────╯

Accessing Results Programmatically

# Test statistics
print(f"t-statistic: {pov_status_mean_h0.stats.t:.4f}")
print(f"Degrees of freedom: {pov_status_mean_h0.stats.df}")
print(f"p-value: {pov_status_mean_h0.stats.p_value:.4f}")
t-statistic: -1.5141
Degrees of freedom: 300.0
p-value: 0.1310
# Difference from hypothesized mean
diff = pov_status_mean_h0.diff[0]
print(f"Difference: {diff.diff:.4f}")
print(f"95% CI: [{diff.lci:.4f}, {diff.uci:.4f}]")
Difference: -0.0212
95% CI: [-0.0488, 0.0064]

Export to DataFrame

# Combined test results (default)
pov_status_mean_h0.to_polars()
shape: (1, 8)
y diff se lci uci t df p_value
str f64 f64 f64 f64 f64 f64 f64
"pov_status" -0.021237 0.014026 -0.048839 0.006365 -1.514116 300.0 0.131049
# Raw estimates only
pov_status_mean_h0.to_polars("estimates")
shape: (1, 6)
y est se cv lci uci
str f64 f64 f64 f64 f64
"pov_status" 0.228763 0.014026 0.061313 0.201161 0.256364

Two-Sample T-Test

Compare means between two groups:

# Test: Does total expenditure differ between urban and rural areas?
exp_by_urbrur = hld_sample.categorical.ttest(
    y="tot_exp",
    group="urbrur",
)
print(exp_by_urbrur)
╭─────────────────────── T-Test: Two-sample (unpaired) ───────────────────────╮
 Y = 'tot_exp'                                                               
 Groups: urbrur = ['Urban' vs 'Rural']                                       
                                                                             
                                                                             
  Group    Level     Estimate    Std Err       CV        Lower        Upper  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  urbrur   Urban   14437.9184   326.4021   0.0226   13795.5994   15080.2375  
  urbrur   Rural    9116.6293   305.5200   0.0335    8515.4038    9717.8549  
                                                                             
                                                                             
 Test statistic                                                              
                                                                             
       diff         t         df   p_value                                   
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                                   
  5321.2891   11.9023   300.0000   <0.0001                                   
                                                                             
╰─────────────────────────────────────────────────────────────────────────────╯

Two-Sample Results

# Group difference
diff = exp_by_urbrur.diff[0]
print(f"Difference: {diff.diff:.2f}")
print(f"95% CI: [{diff.lci:.2f}, {diff.uci:.2f}]")
Difference: 5321.29
95% CI: [4441.48, 6201.10]
# Individual group estimates
for est in exp_by_urbrur.estimates:
    print(f"{est.group_level}: {est.est:.2f} (SE: {est.se:.2f})")
Rural: 9116.63 (SE: 305.52)
Urban: 14437.92 (SE: 326.40)
# Export to DataFrame
exp_by_urbrur.to_polars()
shape: (1, 10)
y group_var paired diff se lci uci t df p_value
str str bool f64 f64 f64 f64 f64 f64 f64
"tot_exp" "urbrur" false 5321.289092 447.080293 4441.478438 6201.099746 11.902312 300.0 0.0

Alternative Hypotheses

Specify one-sided tests with the alternative parameter:

# Test: Is urban expenditure GREATER than rural expenditure?
exp_by_urbrur_greater = hld_sample.categorical.ttest(
    y="tot_exp",
    group="urbrur",
    alternative="greater",
)
print(f"One-sided p-value: {exp_by_urbrur_greater.stats.p_value:.6f}")
print(f"Alternative: {exp_by_urbrur_greater.alternative}")
One-sided p-value: 0.000000
Alternative: greater

Options:

alternative Null Hypothesis Alternative Hypothesis
"two-sided" μ₁ = μ₂ μ₁ ≠ μ₂
"less" μ₁ ≥ μ₂ μ₁ < μ₂
"greater" μ₁ ≤ μ₂ μ₁ > μ₂

For one-sample tests, the comparison is against mean_h0.

Domain Estimation with by

Perform separate t-tests for each level of a domain variable:

# Test poverty rate vs 25% separately by region
pov_by_region = hld_sample.categorical.ttest(
    y="pov_status",
    mean_h0=0.25,
    by="geo1",
)

# Returns a list of TTestOneGroup objects
for r in pov_by_region:
    diff = r.diff[0]
    print(f"Region {diff.by_level}: diff={diff.diff:.4f}, p={r.stats.p_value:.4f}")
Region geo_01: diff=-0.1641, p=0.0000
Region geo_02: diff=0.0916, p=0.1203
Region geo_03: diff=-0.0314, p=0.4125
Region geo_04: diff=-0.0785, p=0.0320
Region geo_05: diff=0.0732, p=0.2454
Region geo_06: diff=-0.0313, p=0.5516
Region geo_07: diff=-0.0223, p=0.6682
Region geo_08: diff=0.0028, p=0.9736
Region geo_09: diff=0.0701, p=0.1258
Region geo_10: diff=-0.0229, p=0.5386

Degrees of Freedom

For complex surveys, degrees of freedom depend on the design:

Design Degrees of Freedom
Weights only (SRS) n - 1
Stratified n - n_strata - 1
Clustered n_psu - 1
Stratified + Clustered n_psu - n_strata - 1

Where:

  • n = number of observations
  • n_strata = number of strata
  • n_psu = number of primary sampling units

Next Steps

Continue to Generalized Linear Models to learn how to fit linear and logistic regression models with design-adjusted standard errors.

Ready for regression modeling?
Learn GLMs in Generalized Linear Models →

References

Agresti, Alan. 2013. Categorical Data Analysis, 3rd edn. John Wiley & Sons, Hoboken, New Jersey.
World Bank. 2023. “Synthetic Data for an Imaginary Country, Sample, 2023.” World Bank, Development Data Group. https://doi.org/10.48529/MC1F-QH23.