import numpy as np
import svy
# Load data and define design
hld_data = svy.load_dataset(name="hld_sample_wb_2023", limit=None)
hld_design = svy.Design(stratum=("geo1", "urbrur"), psu="ea", wgt="hhweight")
hld_sample = svy.Sample(data=hld_data, design=hld_design)
# Create a binary poverty status variable for Logistic Regression examples
hld_sample = hld_sample.wrangling.mutate(
    {
        "hhpovline": svy.col("hhsize") * 1800,
        "pov_status": svy.when(svy.col("tot_exp") < svy.col("hhpovline"))
        .then(1)
        .otherwise(0),
    }
)

Generalized Linear Models (GLMs)
Linear, logistic, and count regression with design-adjusted standard errors
Generalized Linear Models (GLMs) extend ordinary linear regression to accommodate response variables with non-normal error distributions, such as binary, categorical, or count data.
In complex survey analysis, fitting these models requires special care. Point estimates (coefficients) are computed from weighted estimating equations, but standard variance formulas (such as the classical OLS standard errors) generally understate uncertainty, because they assume observations are independent and identically distributed (i.i.d.) — an assumption that stratified, clustered samples violate.
The svy GLM module provides wrappers to fit common regression models—Linear (Gaussian) and Logistic (Binomial)—while correctly estimating standard errors using the survey design information (stratification, clustering, and weighting).
Setting Up the Sample
We’ll use the World Bank (2023) synthetic sample data, loaded and prepared in the setup code at the top of this page.
Use Sample.glm.fit() to fit regression models with design-adjusted standard errors.
Linear Regression
Linear regression is used when the outcome variable is continuous. In survey analysis, this is equivalent to solving weighted least squares, but with variance estimates that account for the complex design.
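Conceptually, the point estimates solve the weighted normal equations, β̂ = (XᵀWX)⁻¹XᵀWy, where W holds the sampling weights. A minimal numpy sketch of that estimating equation on toy data (illustrative only; the design information changes the variance of β̂, not the coefficients themselves):

```python
import numpy as np

# Toy data: outcome y, intercept plus two predictors, and sampling weights
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.integers(1, 9, n), rng.integers(1, 6, n)])
y = X @ np.array([0.5, 0.9, -0.6]) + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, n)  # illustrative sampling weights

# Weighted least squares: solve (X'WX) beta = X'Wy
W = np.diag(w)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

The recovered slopes land close to the true values (0.9 and -0.6); what a survey-aware fit adds is a variance estimate for `beta` that respects stratification and clustering.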
Estimate a linear probability model for poverty status (pov_status) based on household size (hhsize) and number of rooms (rooms):
# Fit linear regression model
lin_model = hld_sample.glm.fit(
    y="pov_status",
    x=["hhsize", "rooms"],
    family="gaussian",
)
print(lin_model)
╭────────────────────────── GLM: Gaussian (identity) ──────────────────────────╮
│ Modeling: pov_status                                                         │
│                                                                              │
│ Observations  8000         AIC           -16384.6473                         │
│ Df Residuals  310.0        BIC           -16363.6857                         │
│ Deviance      1031.0840    Scale         1040.1288                           │
│ R-squared     0.26948      R-sq (adj)    0.26930                             │
│ Iterations    2                                                              │
│ F-stat (adj)  201.56316    Prob (F-adj)  <0.001                              │
│                                                                              │
│ Term          Coef.      Std.Err.   t           P>|t|    [0.025     0.975]   │
│ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ _intercept_   0.02697    0.02562    1.05271     0.2933   -0.02344   0.07739  │
│ hhsize        0.09471    0.00518    18.28602    <0.001   0.08452    0.10490  │
│ rooms         -0.06363   0.00630    -10.10491   <0.001   -0.07601   -0.05124 │
╰──────────────────────────────────────────────────────────────────────────────╯
The output includes estimated coefficients, design-based standard errors, t-statistics, and confidence intervals.
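As a quick sanity check, each confidence bound can be reproduced by hand from the coefficient and its design-based standard error, using the t critical value for the design degrees of freedom (roughly 1.968 for 310 df). Using the hhsize row from the output above:

```python
# hhsize coefficient and design-based standard error from the output above
coef, se = 0.09471, 0.00518
t_crit = 1.968  # ~97.5th percentile of t with 310 df

low, high = coef - t_crit * se, coef + t_crit * se
print(round(low, 5), round(high, 5))  # matches the reported [0.08452, 0.10490]
```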
Logistic Regression
Logistic regression is used when the outcome variable is binary (0/1), such as whether a household is below the poverty line. It models the log-odds of the outcome as a linear combination of the predictors.
Use family="binomial" and link="logit" to fit a logistic regression model.
Model the likelihood of a household being poor (pov_status) using household size (hhsize) and number of rooms (rooms) as continuous predictors, and urban/rural status (urbrur) as a categorical predictor:
# Fit logistic regression model
logit_model = hld_sample.glm.fit(
    y="pov_status",
    x=["hhsize", "rooms", svy.Cat("urbrur", ref="Urban")],
    family="binomial",
    link="logit",
)
print(logit_model)
╭──────────────────────────── GLM: Binomial (logit) ───────────────────────────╮
│ Modeling: pov_status                                                         │
│                                                                              │
│ Observations  8000         AIC           5243.7968                           │
│ Df Residuals  310.0        BIC           5271.7456                           │
│ Deviance      5235.7968    Scale         1.0000                              │
│ R-squared     0.39150      R-sq (adj)    0.39127                             │
│ Iterations    7                                                              │
│ F-stat (adj)  192.61847    Prob (F-adj)  <0.001                              │
│                                                                              │
│ Term          Coef.      Std.Err.   t           P>|t|    [0.025     0.975]   │
│ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ _intercept_   -4.05200   0.22800    -17.77188   <0.001   -4.50063   -3.60338 │
│ hhsize        0.73361    0.04018    18.25830    <0.001   0.65456    0.81267  │
│ rooms         -0.67688   0.05836    -11.59835   <0.001   -0.79171   -0.56205 │
│ urbrur_Rural  2.05946    0.13098    15.72320    <0.001   1.80173    2.31718  │
╰──────────────────────────────────────────────────────────────────────────────╯
Interpretation: The coefficients are on the log-odds scale. To interpret them as Odds Ratios (OR), exponentiate the coefficients (e^β).
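For example, exponentiating the coefficients from the output above (a quick check with the standard library):

```python
import math

# Coefficients from the logistic regression output above
coefs = {"hhsize": 0.73361, "rooms": -0.67688, "urbrur_Rural": 2.05946}
odds_ratios = {term: math.exp(b) for term, b in coefs.items()}

# Each extra household member roughly doubles the odds of being poor;
# rural households have about 7.8 times the odds of urban ones.
print({t: round(v, 2) for t, v in odds_ratios.items()})
```

An odds ratio below 1 (here, rooms at ~0.51) means the predictor is associated with lower odds of the outcome.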
Understanding GLM Parameters
When fitting a GLM in svy, you must specify the structure of your data and statistical assumptions:
Specifying Predictors (x)
The x parameter accepts a list of predictor variables. You can mix continuous and categorical variables:
x=["hhsize", "rooms", svy.Cat("urbrur")]Categorical Variables with svy.Cat()
Unlike continuous variables, categorical variables (like urbrur or region) need special handling to create dummy (indicator) variables. Wrap categorical predictors in svy.Cat():
x=[svy.Cat("sex"), svy.Cat("region")]Setting Reference Categories
When creating dummy variables, one category is omitted to serve as the baseline or “reference” group. Coefficients for other categories are interpreted relative to this reference. By default, svy selects the first category alphabetically.
To specify a different reference category, use the ref parameter:
x=[svy.Cat("urbrur", ref="Urban")] # Sets "Urban" as the baselineThe resulting coefficient will represent the effect of being “Rural” compared to “Urban”.
Distribution Family (family)
Specifies the probability distribution of your outcome variable (y):
| Family | Use Case |
|---|---|
| "gaussian" | Continuous outcomes (Linear Regression) |
| "binomial" | Binary (0/1) outcomes (Logistic Regression) |
| "poisson" | Count data (Poisson Regression) |
Link Function (link)
Connects the linear predictors (Xβ) to the expected mean of the distribution (μ):
| Link | Description | Typical Use |
|---|---|---|
| "identity" | No transformation (μ = Xβ) | Gaussian family |
| "logit" | Log of the odds: ln(p/(1-p)) = Xβ | Logistic Regression |
Supported Models
The svy library supports five canonical GLM families used in survey analysis. Since the library is pre-1.0, models are categorized by stability:
| Family | Stability | Use Case |
|---|---|---|
| Gaussian | ✅ Core | Continuous outcomes (e.g., income) |
| Binomial | ✅ Core | Binary outcomes (e.g., poverty status) |
| Poisson | ✅ Core | Count data (e.g., number of children) |
| Gamma | 🧪 Beta | Positively skewed continuous data |
| Inverse Gaussian | 🧪 Beta | Time-to-failure data |
Future updates will introduce support for Negative Binomial and Beta regression, as well as additional link functions (Probit, Cloglog).
Note: “Beta” features are fully functional but may undergo API changes or require extra care with data scaling.
Next Steps
Now that you’ve covered weighting, estimation, and modeling, explore how to visualize results or export them for reporting.
Explore more tutorials
Return to Tutorials Overview →