Categorical Data Analysis

Design-aware tabulations, cross-tabulations, and hypothesis tests

Analyze categorical survey data in Python with design-adjusted tabulations, cross-tabulations, and hypothesis tests. Learn to build weighted contingency tables and run design-adjusted t-tests with the svy library.

Categorical data analysis spans descriptive techniques—contingency tables and cross-tabulations with design-adjusted tests—and model-based approaches for categorical outcomes such as logistic, multinomial, loglinear, and mixed-effects GLMs (see Agresti 2013).

In complex survey applications, ignoring stratification, clustering, or unequal weights can misstate uncertainty. This tutorial shows how to use svy to produce design-aware tabulations and to run a design-adjusted t-test, with standard errors that respect the sample design.
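Before turning to svy, a toy example (not svy code, made-up numbers) shows why unequal weights matter even for a simple point estimate: an observation carrying a larger sampling weight pulls the estimate toward its value.

```python
# Toy illustration: unequal sampling weights can move a point estimate.
y = [1, 1, 0, 0, 0]            # binary outcome for five respondents
w = [4.0, 1.0, 1.0, 1.0, 1.0]  # sampling weights; the first respondent
                               # represents four times as many households

unweighted = sum(y) / len(y)                                 # 2/5 = 0.4
weighted = sum(yi * wi for yi, wi in zip(y, w)) / sum(w)     # 5/8 = 0.625

print(unweighted, weighted)
```

Standard errors are affected analogously, which is why the design (strata, PSUs, weights) is declared up front in the next section.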

Setting Up the Sample

We’ll use the imaginary country household dataset from World Bank (2023):

import numpy as np
import svy

hld_data = svy.load_dataset(name="hld_sample_wb_2023", limit=None)
hld_design = svy.Design(stratum=("geo1", "urbrur"), psu="ea", wgt="hhweight")
hld_sample = svy.Sample(data=hld_data, design=hld_design)

# Create household poverty line and binary poverty status
hld_sample = hld_sample.wrangling.mutate(
    {
        "hhpovline": svy.col("hhsize") * 1800,
        "pov_status": svy.when(svy.col("tot_exp") < svy.col("hhpovline")).then(1).otherwise(0),
    }
)
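The mutate step is plain row-wise arithmetic: the household poverty line is hhsize × 1800, and pov_status flags households whose total expenditure falls below it. A plain-Python sketch of the same logic (toy rows, made-up values, not the actual survey file):

```python
# Plain-Python equivalent of the mutate step above (toy data):
# a household is poor when tot_exp < hhsize * 1800.
households = [
    {"hhsize": 4, "tot_exp": 6500.0},
    {"hhsize": 2, "tot_exp": 4100.0},
]

for hh in households:
    hh["hhpovline"] = hh["hhsize"] * 1800
    hh["pov_status"] = 1 if hh["tot_exp"] < hh["hhpovline"] else 0

print(households)
# first household: poverty line 7200 > 6500, so pov_status = 1
```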

Tabulation

The tabulate() method produces weighted frequency tables, with standard errors and confidence intervals that account for the survey design. By default, estimates are reported as proportions.

Tabulate the first-level administrative unit (geo1):

hld_admin1_tab = hld_sample.categorical.tabulate(rowvar="geo1")

print(hld_admin1_tab)
╭──────────────────────── Table ─────────────────────────╮
 Type=One-Way                                           
 Alpha=0.05                                             
                                                        
 Row      Estimate   Std Err       CV    Lower    Upper 
 ────────────────────────────────────────────────────── 
 geo_01     0.1370    0.0041   0.0296   0.1292   0.1452 
 geo_02     0.0956    0.0038   0.0402   0.0883   0.1034 
 geo_03     0.1126    0.0033   0.0296   0.1062   0.1193 
 geo_04     0.1481    0.0047   0.0319   0.1391   0.1577 
 geo_05     0.0843    0.0031   0.0373   0.0783   0.0907 
 geo_06     0.0747    0.0032   0.0425   0.0687   0.0812 
 geo_07     0.0727    0.0040   0.0557   0.0651   0.0810 
 geo_08     0.0411    0.0039   0.0938   0.0342   0.0494 
 geo_09     0.1148    0.0040   0.0351   0.1071   0.1229 
 geo_10     0.1191    0.0047   0.0398   0.1101   0.1287 
╰────────────────────────────────────────────────────────╯
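In these tables, CV is the coefficient of variation, i.e. the standard error divided by the estimate. Recomputing it for geo_01 from the displayed (rounded) values reproduces the table up to display rounding—illustrative arithmetic, not svy code:

```python
# CV = standard error / estimate, checked against the geo_01 row above.
estimate, std_err = 0.1370, 0.0041
cv = std_err / estimate

print(round(cv, 4))  # ~0.0299, vs the displayed 0.0296 (rounding of inputs)
```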

Changing Output Units

By default, tabulate() produces proportions. Use the units parameter to get counts or percentages:

# Counts
hld_admin1_tab_count = hld_sample.categorical.tabulate(
    rowvar="geo1",
    units=svy.TableUnits.COUNT,
)

print("Table with counts")
print(hld_admin1_tab_count)

# Percentages
hld_admin1_tab_percent = hld_sample.categorical.tabulate(
    rowvar="geo1",
    units=svy.TableUnits.PERCENT,
)

print("Table with percentages:")
print(hld_admin1_tab_percent)
Table with counts
╭──────────────────────────────── Table ─────────────────────────────────╮
 Type=One-Way                                                           
 Alpha=0.05                                                             
                                                                        
 Row         Estimate      Std Err       CV         Lower         Upper 
 ────────────────────────────────────────────────────────────────────── 
 geo_01   342733.0000   10680.8501   0.0312   321714.4057   363751.5943 
 geo_02   239113.0000   10113.6242   0.0423   219210.6363   259015.3637 
 geo_03   281600.0000    8476.1378   0.0301   264920.0074   298279.9926 
 geo_04   370596.0000   12852.9351   0.0347   345303.0107   395888.9893 
 geo_05   210960.0000    8080.7698   0.0383   195058.0427   226861.9573 
 geo_06   186992.0000    8198.4041   0.0438   170858.5530   203125.4470 
 geo_07   181766.0000   10639.1062   0.0585   160829.5526   202702.4474 
 geo_08   102927.0000    9977.5362   0.0969    83292.4407   122561.5593 
 geo_09   287141.0000   10651.1400   0.0371   266180.8715   308101.1285 
 geo_10   297927.0000   12817.4855   0.0430   272703.7711   323150.2289 
╰────────────────────────────────────────────────────────────────────────╯

Table with percentages:
╭───────────────────────── Table ──────────────────────────╮
 Type=One-Way                                             
 Alpha=0.05                                               
                                                          
 Row      Estimate   Std Err       CV     Lower     Upper 
 ──────────────────────────────────────────────────────── 
 geo_01    13.6997    0.4269   0.0312   12.8595   14.5399 
 geo_02     9.5578    0.4043   0.0423    8.7623   10.3533 
 geo_03    11.2561    0.3388   0.0301   10.5894   11.9228 
 geo_04    14.8134    0.5138   0.0347   13.8024   15.8245 
 geo_05     8.4325    0.3230   0.0383    7.7968    9.0681 
 geo_06     7.4744    0.3277   0.0438    6.8295    8.1193 
 geo_07     7.2655    0.4253   0.0585    6.4287    8.1024 
 geo_08     4.1142    0.3988   0.0969    3.3294    4.8990 
 geo_09    11.4776    0.4257   0.0371   10.6398   12.3154 
 geo_10    11.9087    0.5123   0.0430   10.9005   12.9169 
╰──────────────────────────────────────────────────────────╯
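The three unit choices re-express the same weighted totals: under the usual design-based estimator, a category's count is the sum of its weights, the proportion is that count over the overall total, and the percent is the proportion times 100. Checking this against the geo_01 row and the column sum of the count table above (illustrative arithmetic, not svy code):

```python
# Proportions, counts, and percentages are re-expressions of one another.
count = 342733.0      # estimated geo_01 households, from the count table
total = 2501755.0     # sum of the ten category counts above
proportion = count / total
percent = proportion * 100

print(round(proportion, 4), round(percent, 4))  # ~0.1370 and ~13.6997
```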

Scaling Counts to a Custom Total

Use count_total to rescale the estimated counts to an arbitrary total while preserving category shares (useful, for example, for headcounts per 1,000):

# Scale counts so the total sums to 1,000
hld_admin1_tab_n = hld_sample.categorical.tabulate(
    rowvar="geo1",
    count_total=1_000,
)

print(hld_admin1_tab_n)
╭────────────────────────── Table ───────────────────────────╮
 Type=One-Way                                               
 Alpha=0.05                                                 
                                                            
 Row      Estimate   Std Err       CV      Lower      Upper 
 ────────────────────────────────────────────────────────── 
 geo_01   136.9970    4.2693   0.0312   128.5955   145.3986 
 geo_02    95.5781    4.0426   0.0423    87.6227   103.5335 
 geo_03   112.5610    3.3881   0.0301   105.8937   119.2283 
 geo_04   148.1344    5.1376   0.0347   138.0243   158.2445 
 geo_05    84.3248    3.2300   0.0383    77.9685    90.6811 
 geo_06    74.7443    3.2771   0.0438    68.2955    81.1932 
 geo_07    72.6554    4.2527   0.0585    64.2867    81.0241 
 geo_08    41.1419    3.9882   0.0969    33.2936    48.9902 
 geo_09   114.7758    4.2575   0.0371   106.3977   123.1540 
 geo_10   119.0872    5.1234   0.0430   109.0050   129.1694 
╰────────────────────────────────────────────────────────────╯
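Each scaled estimate is simply the category proportion times count_total, so the shares from the proportion table carry over unchanged—a quick check with geo_01 (illustrative arithmetic, not svy code):

```python
# With count_total=1_000, scaled estimate = proportion * 1_000.
proportion_geo_01 = 0.136997   # from the one-way proportion table
scaled = proportion_geo_01 * 1_000

print(scaled)  # 136.997, matching the table's 136.9970 up to rounding
```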

One-Sample T-Test

The ttest() method performs a design-adjusted one-sample t-test, comparing a sample estimate to a hypothesized value:

pov_status_mean_h0 = hld_sample.categorical.ttest(
    y="pov_status",
    mean_h0=0.25,
)

print(pov_status_mean_h0)
╭─────────────────────────────────── t-test results ───────────────────────────────────╮
   One-sample t-test                                                                  
   Y = 'pov_status'                                                                   
   H₀: μ = 0.2500                                                                     
                                                                                      
   Estimate   Std Err       CV    Lower    Upper                                      
   ─────────────────────────────────────────────                                      
     0.2288    0.0140   0.0613   0.2012   0.2564                                      
                                                                                      
   Test statistic                                                                     
         t         df     p(<)     p(>)     p(≠)                                      
   ─────────────────────────────────────────────                                      
   -1.5141   301.0000   0.0655   0.9345   0.1310                                      
╰──────────────────────────────────────────────────────────────────────────────────────╯
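The test statistic is the distance of the estimate from the hypothesized mean in standard-error units, and the two-sided p-value is twice the smaller one-sided p-value. Recomputing both from the printed output (illustrative arithmetic, not svy code):

```python
# t = (estimate - mean_h0) / std_err, using the printed values above.
estimate, std_err, mean_h0 = 0.2288, 0.0140, 0.25
t = (estimate - mean_h0) / std_err
p_two_sided = 2 * 0.0655   # twice the smaller one-sided p-value

print(round(t, 2), p_two_sided)  # ~ -1.51 and 0.131, matching the table
```

At the default alpha of 0.05, this test does not reject H₀: the estimated poverty rate of 0.2288 is not significantly different from 0.25.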

Next Steps

Continue to Generalized Linear Models to learn how to fit linear and logistic regression models with design-adjusted standard errors.


References

Agresti, Alan. 2013. Categorical Data Analysis, 3rd edn. John Wiley & Sons, Hoboken, New Jersey.
World Bank. 2023. “Synthetic Data for an Imaginary Country, Sample, 2023.” World Bank, Development Data Group. https://doi.org/10.48529/MC1F-QH23.