Generalized Linear Models (GLMs) for Complex Surveys in Python

Linear, logistic, and count regression with design-adjusted standard errors

Tutorials

GLM

Regression

Python

Fit linear, logistic, and Poisson regression models to complex survey data in Python. Learn design-adjusted GLMs with proper standard errors, categorical predictors, and survey weights using the svy library.

Author

Mamadou S. Diallo, Ph.D.

Published

January 18, 2026

Modified

February 9, 2026

Keywords

GLM survey data Python, logistic regression survey weights, linear regression complex survey, survey weighted regression, binomial GLM Python, Gaussian GLM survey, Poisson regression survey, svy glm tutorial, design-adjusted standard errors, categorical predictors survey, survey regression analysis, Taylor linearization variance

Generalized Linear Models (GLMs) extend ordinary linear regression to accommodate response variables with non-normal error distributions, such as binary, categorical, or count data.

In complex survey analysis, fitting these models requires special attention. While point estimates (coefficients) are often computed using weighted estimating equations, standard variance estimation methods (like OLS) generally underestimate uncertainty because they assume observations are independent and identically distributed (i.i.d.).

The svy GLM module provides wrappers to fit common regression models—Linear (Gaussian) and Logistic (Binomial)—while correctly estimating standard errors using the survey design information (stratification, clustering, and weighting)

Setting Up the Sample

We’ll use the World Bank (2023) synthetic sample data:

import numpy as np
import svy

# Load data and define design
hld_data = svy.load_dataset(name="hld_sample_wb_2023", limit=None)
hld_design = svy.Design(stratum=("geo1", "urbrur"), psu="ea", wgt="hhweight")
hld_sample = svy.Sample(data=hld_data, design=hld_design)

# Create a binary poverty status variable for Logistic Regression examples
hld_sample = hld_sample.wrangling.mutate(
    {
        "hhpovline": svy.col("hhsize") * 1800,
        "pov_status": svy.when(svy.col("tot_exp") < svy.col("hhpovline"))
        .then(1)
        .otherwise(0),
    }
)

Implementation in svy

Use Sample.glm.fit() to fit regression models with design-adjusted standard errors.

Linear Regression

Linear regression is used when the outcome variable is continuous. In survey analysis, this is equivalent to solving weighted least squares, but with variance estimates that account for the complex design.

Estimate a model predicting total household expenditure (tot_exp) based on household size (hhsize) and number of rooms (rooms):

# Fit linear regression model
lin_model = hld_sample.glm.fit(
    y="pov_status",
    x=["hhsize", "rooms"],
    family="Gaussian",
)

print(lin_model)

╭─────────────────────────── GLM: Gaussian (identity) ───────────────────────────╮
│ Modeling: pov_status                                                           │
│                                                                                │
│ Observations       8000  AIC           1037.0840                               │
│ Df Residuals        310  BIC           1058.0455                               │
│ Deviance      1031.0840  Scale            3.3261                               │
│ R-squared       0.26948  R-sq (adj)      0.26930                               │
│                          Iterations            2                               │
│ F-stat (adj)    0.00000  Prob (F-adj)     1.0000                               │
│                                                                                │
│                                                                                │
│  Term             Coef.   Std.Err.           t    P>|t|     [0.025     0.975]  │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  │
│  _intercept_    0.02697    0.02562     1.05271   0.2933   -0.02344    0.07739  │
│  hhsize         0.09471    0.00518    18.28602   <0.001    0.08452    0.10490  │
│  rooms         -0.06363    0.00630   -10.10491   <0.001   -0.07601   -0.05124  │
│                                                                                │
╰────────────────────────────────────────────────────────────────────────────────╯

The output includes estimated coefficients, design-based standard errors, t-statistics, and confidence intervals.

Logistic Regression

Logistic regression is used when the outcome variable is binary (0/1), such as whether a household is below the poverty line. It models the log-odds of the outcome as a linear combination of the predictors.

Implementation in svy

Use family="binomial" and link="logit" to fit a logistic regression model.

Model the likelihood of a household being poor (pov_status) using household size (hhsize) and number of rooms (rooms) as continuous predictors, and urban/rural status (urbrur) as a categorical predictor:

# Fit logistic regression model
logit_model = hld_sample.glm.fit(
    y="pov_status",
    x=["hhsize", "rooms", svy.Cat("urbrur", ref="Urban")],
    family="binomial",
    link="logit",
)

print(logit_model)

╭───────────────────────────── GLM: Binomial (logit) ─────────────────────────────╮
│ Modeling: pov_status                                                            │
│                                                                                 │
│ Observations      8000  AIC           821.1604                                  │
│ Df Residuals       310  BIC           849.1092                                  │
│ Deviance      813.1604  Scale           1.0000                                  │
│ R-squared      0.42388  R-sq (adj)     0.42366                                  │
│                         Iterations           7                                  │
│ F-stat (adj)   0.00000  Prob (F-adj)    1.0000                                  │
│                                                                                 │
│                                                                                 │
│  Term              Coef.   Std.Err.           t    P>|t|     [0.025     0.975]  │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  │
│  _intercept_    -4.05200    0.22800   -17.77188   <0.001   -4.50063   -3.60338  │
│  hhsize          0.73361    0.04018    18.25830   <0.001    0.65456    0.81267  │
│  rooms          -0.67688    0.05836   -11.59836   <0.001   -0.79171   -0.56205  │
│  urbrur_Rural    2.05946    0.13098    15.72320   <0.001    1.80173    2.31718  │
│                                                                                 │
╰─────────────────────────────────────────────────────────────────────────────────╯

Note

Interpretation: The coefficients are on the log-odds scale. To interpret them as Odds Ratios (OR), exponentiate the coefficients (e^β).

Understanding GLM Parameters

When fitting a GLM in svy, you must specify the structure of your data and statistical assumptions:

Specifying Predictors (`x`)

The x parameter accepts a list of predictor variables. You can mix continuous and categorical variables:

x=["hhsize", "rooms", svy.Cat("urbrur")]

Categorical Variables with `svy.Cat()`

Unlike continuous variables, categorical variables (like urbrur or region) need special handling to create dummy (indicator) variables. Wrap categorical predictors in svy.Cat():

x=[svy.Cat("sex"), svy.Cat("region")]

Setting Reference Categories

When creating dummy variables, one category is omitted to serve as the baseline or “reference” group. Coefficients for other categories are interpreted relative to this reference. By default, svy selects the first category alphabetically.

To specify a different reference category, use the ref parameter:

x=[svy.Cat("urbrur", ref="Urban")]  # Sets "Urban" as the baseline

The resulting coefficient will represent the effect of being “Rural” compared to “Urban”.

Distribution Family (`family`)

Specifies the probability distribution of your outcome variable (y):

Family	Use Case
`"gaussian"`	Continuous outcomes (Linear Regression)
`"binomial"`	Binary (0/1) outcomes (Logistic Regression)
`"poisson"`	Count data (Poisson Regression)

Link Function (`link`)

Connects the linear predictors (Xβ) to the expected mean of the distribution (μ):

Link	Description	Typical Use
`"identity"`	No transformation (y = Xβ)	Gaussian families
`"logit"`	Log of the odds: ln(p/(1-p)) = Xβ	Logistic Regression

Supported Models

The svy library supports five canonical GLM families used in survey analysis. Since the library is pre-1.0, models are categorized by stability:

Family	Stability	Use Case
Gaussian	✅ Core	Continuous outcomes (e.g., income)
Binomial	✅ Core	Binary outcomes (e.g., poverty status)
Poisson	✅ Core	Count data (e.g., number of children)
Gamma	🧪 Beta	Positively skewed continuous data
Inverse Gaussian	🧪 Beta	Time-to-failure data

Roadmap (Phase 2)

Future updates will introduce support for Negative Binomial and Beta regression, as well as additional link functions (Probit, Cloglog).

Note: “Beta” features are fully functional but may undergo API changes or require extra care with data scaling.

Next Steps

Now that you’ve covered weighting, estimation, and modeling, explore how to visualize results or export them for reporting.

Explore more tutorials
Return to Tutorials Overview →

References

World Bank. 2023. “Synthetic Data for an Imaginary Country, Sample, 2023.” World Bank, Development Data Group. https://doi.org/10.48529/MC1F-QH23.

Setting Up the Sample

Linear Regression

Logistic Regression

Understanding GLM Parameters

Specifying Predictors (x)

Categorical Variables with svy.Cat()

Setting Reference Categories

Distribution Family (family)

Link Function (link)

Supported Models

Next Steps

References

Specifying Predictors (`x`)

Categorical Variables with `svy.Cat()`

Distribution Family (`family`)

Link Function (`link`)