Generalized Linear Models (GLMs)

Linear, logistic, and count regression with design-adjusted standard errors

Fit linear and logistic regression models to complex survey data in Python. Learn design-adjusted GLMs with proper standard errors using the svy library.
Keywords

GLM survey data Python, logistic regression survey weights, linear regression complex survey, survey weighted regression, binomial GLM, Gaussian GLM, svy library

Generalized Linear Models (GLMs) extend ordinary linear regression to accommodate response variables with non-normal error distributions, such as binary, categorical, or count data.

In complex survey analysis, fitting these models requires special attention. While point estimates (coefficients) are often computed using weighted estimating equations, standard variance estimation methods (like OLS) generally underestimate uncertainty because they assume observations are independent and identically distributed (i.i.d.).

The svy GLM module provides wrappers to fit common regression models—Linear (Gaussian) and Logistic (Binomial)—while correctly estimating standard errors using the survey design information (stratification, clustering, and weighting).

Setting Up the Sample

We’ll use the World Bank (2023) synthetic sample data:

import numpy as np
import svy

# Load data and define design
hld_data = svy.load_dataset(name="hld_sample_wb_2023", limit=None)
hld_design = svy.Design(stratum=("geo1", "urbrur"), psu="ea", wgt="hhweight")
hld_sample = svy.Sample(data=hld_data, design=hld_design)

# Create a binary poverty status variable for Logistic Regression examples
hld_sample = hld_sample.wrangling.mutate(
    {
        "hhpovline": svy.col("hhsize") * 1800,
        "pov_status": svy.when(svy.col("tot_exp") < svy.col("hhpovline"))
        .then(1)
        .otherwise(0),
    }
)
TipImplementation in svy

Use Sample.glm.fit() to fit regression models with design-adjusted standard errors.

Linear Regression

Linear regression is used when the outcome variable is continuous. In survey analysis, this is equivalent to solving weighted least squares, but with variance estimates that account for the complex design.

Estimate a model predicting total household expenditure (tot_exp) based on household size (hhsize) and number of rooms (rooms):

# Fit linear regression model
lin_model = hld_sample.glm.fit(
    y="pov_status",
    x=["hhsize", "rooms"],
    family="Gaussian",
)

print(lin_model)
╭─────────────────────────── GLM: Gaussian (identity) ───────────────────────────╮
 Modeling: pov_status                                                           
                                                                                
 Observations       8000  AIC           -16384.6473                             
 Df Residuals      310.0  BIC           -16363.6857                             
 Deviance      1031.0840  Scale           1040.1288                             
 R-squared       0.26948  R-sq (adj)        0.26930                             
                          Iterations              2                             
 F-stat (adj)  201.56316  Prob (F-adj)       <0.001                             
                                                                                
                                                                                
  Term             Coef.   Std.Err.           t    P>|t|     [0.025     0.975]  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  _intercept_    0.02697    0.02562     1.05271   0.2933   -0.02344    0.07739  
  hhsize         0.09471    0.00518    18.28602   <0.001    0.08452    0.10490  
  rooms         -0.06363    0.00630   -10.10491   <0.001   -0.07601   -0.05124  
                                                                                
╰────────────────────────────────────────────────────────────────────────────────╯

The output includes estimated coefficients, design-based standard errors, t-statistics, and confidence intervals.

Logistic Regression

Logistic regression is used when the outcome variable is binary (0/1), such as whether a household is below the poverty line. It models the log-odds of the outcome as a linear combination of the predictors.

TipImplementation in svy

Use family="binomial" and link="logit" to fit a logistic regression model.

Model the likelihood of a household being poor (pov_status) using household size (hhsize) and number of rooms (rooms) as continuous predictors, and urban/rural status (urbrur) as a categorical predictor:

# Fit logistic regression model
logit_model = hld_sample.glm.fit(
    y="pov_status",
    x=["hhsize", "rooms", svy.Cat("urbrur", ref="Urban")],
    family="binomial",
    link="logit",
)

print(logit_model)
╭───────────────────────────── GLM: Binomial (logit) ─────────────────────────────╮
 Modeling: pov_status                                                            
                                                                                 
 Observations       8000  AIC           5243.7968                                
 Df Residuals      310.0  BIC           5271.7456                                
 Deviance      5235.7968  Scale            1.0000                                
 R-squared       0.39150  R-sq (adj)      0.39127                                
                          Iterations            7                                
 F-stat (adj)  192.61847  Prob (F-adj)     <0.001                                
                                                                                 
                                                                                 
  Term              Coef.   Std.Err.           t    P>|t|     [0.025     0.975]  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  _intercept_    -4.05200    0.22800   -17.77188   <0.001   -4.50063   -3.60338  
  hhsize          0.73361    0.04018    18.25830   <0.001    0.65456    0.81267  
  rooms          -0.67688    0.05836   -11.59835   <0.001   -0.79171   -0.56205  
  urbrur_Rural    2.05946    0.13098    15.72320   <0.001    1.80173    2.31718  
                                                                                 
╰─────────────────────────────────────────────────────────────────────────────────╯
Note

Interpretation: The coefficients are on the log-odds scale. To interpret them as Odds Ratios (OR), exponentiate the coefficients (e^β).

Understanding GLM Parameters

When fitting a GLM in svy, you must specify the structure of your data and statistical assumptions:

Specifying Predictors (x)

The x parameter accepts a list of predictor variables. You can mix continuous and categorical variables:

x=["hhsize", "rooms", svy.Cat("urbrur")]

Categorical Variables with svy.Cat()

Unlike continuous variables, categorical variables (like urbrur or region) need special handling to create dummy (indicator) variables. Wrap categorical predictors in svy.Cat():

x=[svy.Cat("sex"), svy.Cat("region")]

Setting Reference Categories

When creating dummy variables, one category is omitted to serve as the baseline or “reference” group. Coefficients for other categories are interpreted relative to this reference. By default, svy selects the first category alphabetically.

To specify a different reference category, use the ref parameter:

x=[svy.Cat("urbrur", ref="Urban")]  # Sets "Urban" as the baseline

The resulting coefficient will represent the effect of being “Rural” compared to “Urban”.

Distribution Family (family)

Specifies the probability distribution of your outcome variable (y):

Family Use Case
"gaussian" Continuous outcomes (Linear Regression)
"binomial" Binary (0/1) outcomes (Logistic Regression)
"poisson" Count data (Poisson Regression)

Supported Models

The svy library supports five canonical GLM families used in survey analysis. Since the library is pre-1.0, models are categorized by stability:

Family Stability Use Case
Gaussian ✅ Core Continuous outcomes (e.g., income)
Binomial ✅ Core Binary outcomes (e.g., poverty status)
Poisson ✅ Core Count data (e.g., number of children)
Gamma 🧪 Beta Positively skewed continuous data
Inverse Gaussian 🧪 Beta Time-to-failure data
NoteRoadmap (Phase 2)

Future updates will introduce support for Negative Binomial and Beta regression, as well as additional link functions (Probit, Cloglog).

Note: “Beta” features are fully functional but may undergo API changes or require extra care with data scaling.

Next Steps

Now that you’ve covered weighting, estimation, and modeling, explore how to visualize results or export them for reporting.

Explore more tutorials
Return to Tutorials Overview →

References

World Bank. 2023. “Synthetic Data for an Imaginary Country, Sample, 2023.” World Bank, Development Data Group. https://doi.org/10.48529/MC1F-QH23.