Generalized Linear Models (GLMs) for Complex Surveys in Python

Linear, logistic, and count regression with design-adjusted standard errors

Tutorials · GLM · Regression · Python
Fit linear, logistic, and Poisson regression models to complex survey data in Python. Learn design-adjusted GLMs with proper standard errors, categorical predictors, and survey weights using the svy library.
Author

Mamadou S. Diallo, Ph.D.

Published

January 18, 2026

Modified

February 9, 2026

Keywords

GLM survey data Python, logistic regression survey weights, linear regression complex survey, survey weighted regression, binomial GLM Python, Gaussian GLM survey, Poisson regression survey, svy glm tutorial, design-adjusted standard errors, categorical predictors survey, survey regression analysis, Taylor linearization variance

Generalized Linear Models (GLMs) extend ordinary linear regression to accommodate response variables with non-normal error distributions, such as binary, categorical, or count data.

In complex survey analysis, fitting these models requires special attention. Point estimates (coefficients) are computed from weighted estimating equations, but standard variance estimators (such as the ones used in ordinary least squares) generally understate uncertainty because they assume observations are independent and identically distributed (i.i.d.), an assumption that stratified, clustered samples violate.

The svy GLM module provides wrappers to fit common regression models, including linear (Gaussian), logistic (binomial), and Poisson, while correctly estimating standard errors from the survey design information (stratification, clustering, and weighting).

Setting Up the Sample

We’ll use the World Bank (2023) synthetic sample data:

import numpy as np
import svy

# Load data and define design
hld_data = svy.load_dataset(name="hld_sample_wb_2023", limit=None)
hld_design = svy.Design(stratum=("geo1", "urbrur"), psu="ea", wgt="hhweight")
hld_sample = svy.Sample(data=hld_data, design=hld_design)

# Create a binary poverty status variable for Logistic Regression examples
hld_sample = hld_sample.wrangling.mutate(
    {
        "hhpovline": svy.col("hhsize") * 1800,
        "pov_status": svy.when(svy.col("tot_exp") < svy.col("hhpovline"))
        .then(1)
        .otherwise(0),
    }
)
Tip: Implementation in svy

Use Sample.glm.fit() to fit regression models with design-adjusted standard errors.

Linear Regression

Linear regression is used when the outcome variable is continuous. In survey analysis, this is equivalent to solving weighted least squares, but with variance estimates that account for the complex design.

Estimate a model predicting poverty status (pov_status) from household size (hhsize) and number of rooms (rooms). Since pov_status is binary, this Gaussian fit is a linear probability model; the same outcome is modeled with logistic regression in the next section:

# Fit linear regression model
lin_model = hld_sample.glm.fit(
    y="pov_status",
    x=["hhsize", "rooms"],
    family="gaussian",
)

print(lin_model)
╭─────────────────────────── GLM: Gaussian (identity) ───────────────────────────╮
 Modeling: pov_status                                                           
                                                                                
 Observations       8000  AIC           1037.0840                               
 Df Residuals        310  BIC           1058.0455                               
 Deviance      1031.0840  Scale            3.3261                               
 R-squared       0.26948  R-sq (adj)      0.26930                               
                          Iterations            2                               
 F-stat (adj)    0.00000  Prob (F-adj)     1.0000                               
                                                                                
                                                                                
  Term             Coef.   Std.Err.           t    P>|t|     [0.025     0.975]  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  _intercept_    0.02697    0.02562     1.05271   0.2933   -0.02344    0.07739  
  hhsize         0.09471    0.00518    18.28602   <0.001    0.08452    0.10490  
  rooms         -0.06363    0.00630   -10.10491   <0.001   -0.07601   -0.05124  
                                                                                
╰────────────────────────────────────────────────────────────────────────────────╯

The output includes estimated coefficients, design-based standard errors, t-statistics, and confidence intervals.
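The point estimates above solve the weighted least-squares equations. As a minimal numpy sketch (with made-up data and weights, not the svy implementation), the coefficients solve (XᵀWX)β = XᵀWy:

```python
import numpy as np

# Weighted least squares on toy data: beta solves (X' W X) beta = X' W y.
# Data and weights below are hypothetical, for illustration only; the
# design-based standard errors that svy computes are not shown here.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # intercept + one predictor
y = np.array([1.0, 2.1, 2.9, 4.2])
w = np.array([2.0, 1.0, 1.0, 2.0])  # survey weights

W = np.diag(w)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta)  # [intercept, slope]
```

Unequal weights change the point estimates relative to unweighted OLS; the complex design additionally changes the standard errors, which is what the design-adjusted output above reflects.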

Logistic Regression

Logistic regression is used when the outcome variable is binary (0/1), such as whether a household is below the poverty line. It models the log-odds of the outcome as a linear combination of the predictors.
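Inverting the logit link maps a linear predictor back to a probability. A small worked example, with purely hypothetical coefficient values:

```python
import math

# Invert the logit link: log(p / (1 - p)) = eta  =>  p = 1 / (1 + exp(-eta)).
# The intercept and slopes below are hypothetical, for illustration only.
eta = -4.0 + 0.7 * 5 - 0.7 * 2 + 2.0  # e.g., hhsize=5, rooms=2, rural
p = 1 / (1 + math.exp(-eta))
print(round(p, 3))  # → 0.525
```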

Tip: Implementation in svy

Use family="binomial" and link="logit" to fit a logistic regression model.

Model the likelihood of a household being poor (pov_status) using household size (hhsize) and number of rooms (rooms) as continuous predictors, and urban/rural status (urbrur) as a categorical predictor:

# Fit logistic regression model
logit_model = hld_sample.glm.fit(
    y="pov_status",
    x=["hhsize", "rooms", svy.Cat("urbrur", ref="Urban")],
    family="binomial",
    link="logit",
)

print(logit_model)
╭───────────────────────────── GLM: Binomial (logit) ─────────────────────────────╮
 Modeling: pov_status                                                            
                                                                                 
 Observations      8000  AIC           821.1604                                  
 Df Residuals       310  BIC           849.1092                                  
 Deviance      813.1604  Scale           1.0000                                  
 R-squared      0.42388  R-sq (adj)     0.42366                                  
                         Iterations           7                                  
 F-stat (adj)   0.00000  Prob (F-adj)    1.0000                                  
                                                                                 
                                                                                 
  Term              Coef.   Std.Err.           t    P>|t|     [0.025     0.975]  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  _intercept_    -4.05200    0.22800   -17.77188   <0.001   -4.50063   -3.60338  
  hhsize          0.73361    0.04018    18.25830   <0.001    0.65456    0.81267  
  rooms          -0.67688    0.05836   -11.59836   <0.001   -0.79171   -0.56205  
  urbrur_Rural    2.05946    0.13098    15.72320   <0.001    1.80173    2.31718  
                                                                                 
╰─────────────────────────────────────────────────────────────────────────────────╯
Note

Interpretation: The coefficients are on the log-odds scale. To interpret them as Odds Ratios (OR), exponentiate the coefficients (e^β).
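For example, the log-odds coefficients from the output above convert to odds ratios by exponentiation:

```python
import math

# Coefficients from the logistic regression output above (log-odds scale)
coefs = {"hhsize": 0.73361, "rooms": -0.67688, "urbrur_Rural": 2.05946}

# exp(beta) is the multiplicative change in the odds of being poor for a
# one-unit increase in the predictor (or relative to the reference category)
for term, b in coefs.items():
    print(f"{term}: OR = {math.exp(b):.3f}")
```

Each additional household member multiplies the odds of poverty by about 2.08, while rural households have roughly 7.8 times the odds of poverty of otherwise similar urban households.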

Understanding GLM Parameters

When fitting a GLM in svy, you must specify the structure of your data and statistical assumptions:

Specifying Predictors (x)

The x parameter accepts a list of predictor variables. You can mix continuous and categorical variables:

x=["hhsize", "rooms", svy.Cat("urbrur")]

Categorical Variables with svy.Cat()

Unlike continuous variables, categorical variables (like urbrur or region) need special handling to create dummy (indicator) variables. Wrap categorical predictors in svy.Cat():

x=[svy.Cat("sex"), svy.Cat("region")]

Setting Reference Categories

When creating dummy variables, one category is omitted to serve as the baseline or “reference” group. Coefficients for other categories are interpreted relative to this reference. By default, svy selects the first category alphabetically.

To specify a different reference category, use the ref parameter:

x=[svy.Cat("urbrur", ref="Urban")]  # Sets "Urban" as the baseline

The resulting coefficient will represent the effect of being “Rural” compared to “Urban”.
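A conceptual sketch of this dummy coding in plain Python (illustrating the idea, not svy internals):

```python
# Dummy coding with a reference category: one indicator column per level,
# with the reference level omitted. Values below are made-up sample data.
values = ["Urban", "Rural", "Rural", "Urban"]
ref = "Urban"

levels = sorted(set(values))  # default reference would be the first alphabetically
dummies = {
    f"urbrur_{lvl}": [1 if v == lvl else 0 for v in values]
    for lvl in levels
    if lvl != ref  # the reference level gets no column
}
print(dummies)  # → {'urbrur_Rural': [0, 1, 1, 0]}
```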

Distribution Family (family)

Specifies the probability distribution of your outcome variable (y):

Family        Use Case
"gaussian"    Continuous outcomes (Linear Regression)
"binomial"    Binary (0/1) outcomes (Logistic Regression)
"poisson"     Count data (Poisson Regression)
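Poisson regression models counts through a log link, so predicted means are mu = exp(Xβ). A short numpy sketch with hypothetical coefficients:

```python
import numpy as np

# Poisson GLM mean function (log link): mu = exp(X @ beta).
# The coefficients here are hypothetical, for illustration only.
beta = np.array([0.2, 0.15])            # intercept, hhsize slope
X = np.array([[1.0, 2.0], [1.0, 5.0]])  # two households, hhsize = 2 and 5

mu = np.exp(X @ beta)  # expected counts under the model
print(mu)
```

The log link guarantees the predicted counts are positive, which is why it is the canonical choice for the Poisson family.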

Supported Models

The svy library supports five canonical GLM families used in survey analysis. Since the library is pre-1.0, models are categorized by stability:

Family              Stability   Use Case
Gaussian            ✅ Core     Continuous outcomes (e.g., income)
Binomial            ✅ Core     Binary outcomes (e.g., poverty status)
Poisson             ✅ Core     Count data (e.g., number of children)
Gamma               🧪 Beta     Positively skewed continuous data
Inverse Gaussian    🧪 Beta     Time-to-failure data
Note: Roadmap (Phase 2)

Future updates will introduce support for Negative Binomial and Beta regression, as well as additional link functions (Probit, Cloglog).

Note: “Beta” features are fully functional but may undergo API changes or require extra care with data scaling.

Next Steps

Now that you’ve covered weighting, estimation, and modeling, explore how to visualize results or export them for reporting.


References

World Bank. 2023. “Synthetic Data for an Imaginary Country, Sample, 2023.” World Bank, Development Data Group. https://doi.org/10.48529/MC1F-QH23.