from pathlib import Path
import polars as pl
import svy
import svy_sae as saeEB Unit-level modeling
The Molina–Rao model with svy-sae
What this tutorial covers
This tutorial introduces the Molina–Rao Empirical Bayes (EB) unit-level model as implemented in svy-sae.
While the BHF model (Part 1) focuses on predicting linear parameters such as area means, the Molina–Rao framework is designed for nonlinear indicators, including poverty incidence, poverty gap, and other threshold-based measures.
In this tutorial, we demonstrate:
- fitting an EB unit-level model,
- handling transformations of skewed outcomes,
- defining and estimating nonlinear indicators,
- estimating uncertainty using parametric bootstrap.
Load the data
To illustrate the APIs, We use the Spain synthetic income datasets from the R package sae.
Sample data
prov— small area identifierincome— household income- covariates — demographic and socioeconomic variables
Population data
- covariates for non-sampled units
- one row per population unit
df_sample = svy.load_dataset("spain_synth_income", limit=None)
df_pop = svy.load_dataset("spain_synth_covariates_oos", limit=None)
print(f"Synthetic income sample data : {df_sample.head()}")
print(f"Synthetic covariates out of sample data (population): {df_pop.head()}")
print(f"Number of small areas: {len(df_sample['prov'].unique())}")
print(f"Number of out of sample records: {df_pop.shape[0]}")Synthetic income sample data : shape: (5, 22)
┌─────┬─────────┬──────┬─────┬───┬────────┬────────┬──────────────┬─────────────┐
│ ┆ provlab ┆ prov ┆ ac ┆ … ┆ labor2 ┆ labor3 ┆ income ┆ weight │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ f64 ┆ f64 │
╞═════╪═════════╪══════╪═════╪═══╪════════╪════════╪══════════════╪═════════════╡
│ 1 ┆ Alava ┆ 1 ┆ 16 ┆ … ┆ 0 ┆ 1 ┆ 12021.832116 ┆ 2804.031333 │
│ 2 ┆ Alava ┆ 1 ┆ 16 ┆ … ┆ 0 ┆ 0 ┆ 8609.853817 ┆ 1347.639615 │
│ 3 ┆ Alava ┆ 1 ┆ 16 ┆ … ┆ 0 ┆ 0 ┆ 3384.197292 ┆ 2568.176149 │
│ 4 ┆ Alava ┆ 1 ┆ 16 ┆ … ┆ 0 ┆ 0 ┆ 2343.3552 ┆ 1168.233266 │
│ 5 ┆ Alava ┆ 1 ┆ 16 ┆ … ┆ 0 ┆ 0 ┆ 9023.845796 ┆ 953.438826 │
└─────┴─────────┴──────┴─────┴───┴────────┴────────┴──────────────┴─────────────┘
Synthetic covariates out of sample data (population): shape: (5, 11)
┌─────┬──────┬──────┬──────┬───┬───────┬───────┬────────┬────────┐
│ ┆ prov ┆ age2 ┆ age3 ┆ … ┆ educ1 ┆ educ3 ┆ labor1 ┆ labor2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪══════╪══════╪══════╪═══╪═══════╪═══════╪════════╪════════╡
│ 1 ┆ 42 ┆ 0 ┆ 0 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 42 ┆ 0 ┆ 0 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
│ 3 ┆ 42 ┆ 0 ┆ 0 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
│ 4 ┆ 42 ┆ 0 ┆ 0 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
│ 5 ┆ 42 ┆ 0 ┆ 0 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
└─────┴──────┴──────┴──────┴───┴───────┴───────┴────────┴────────┘
Number of small areas: 52
Number of out of sample records: 713301
Defining the poverty line
We define the poverty line relative to the median income of the sample.
income_median = df_sample["income"].median()
poverty_line = 0.6 * income_median
print(f"Poverty line: {poverty_line}")Poverty line: 6477.48423337962
Transformation of the response
Income data are typically right-skewed. We apply a Box–Cox transformation (with a positive shift) to stabilize variance and improve normality. The parameter \(\lambda\) of the Box–Cox transformation controls the shape of the transformation:
boxcox_param=None(default): automatically estimate \(\lambda\) from the data to maximize normalityboxcox_param=1: identity (no shape change)boxcox_param=0.5: square-root transformation (\(\sqrt{y}\))boxcox_param=0: log transformation (\(\ln(y)\))boxcox_param=-1: inverse transformation (\(1/y\))
You can restrict the search range using boxcox_bounds to prevent unrealistic values of \(\lambda\).
transformation = "boxcox"
transform_shift = 3600.0 # Shitf y prior to transformation e.g. positive values for log
boxcox_param = None # Estimate lambda
boxcox_bounds = (-2, 2) # Search within [-2, 2]Specify covariates and indicators
We select our predictors and the target indicator (Poverty Gap).
feature_cols = ["age2", "age3", "age4", "age5", "nat1", "educ1", "educ3", "labor1", "labor2"]
indicators = [sae.Indicator.POVERTY_GAP]Indicators
svy-sae supports multiple indicators in a single run via the indicators=[...] argument. For threshold-based measures (poverty), you must supply threshold=....
EB predictors and bootstrap MSE
We fit the model using ebp(). Because MSE is estimated via parametric bootstrap, we must specify:
n_reps: bootstrap replications (MSE precision)n_mc: Monte Carlo draws per replication (integration precision)
The population data only contain information for 5 small areas, rather than the full set of 52. Nonetheless, the model will still produce predictions for the small areas present in the population data.
When this situation occurs, svy-sae emits a warning message indicating that the provided population data do not cover all small areas observed in the sample. This warning is informational and does not prevent model fitting or prediction, but it serves to alert users that estimates will be limited to the subset of areas represented in the population data.
model = sae.UnitLevel(df_sample)
results = model.ebp(
y="income",
x=feature_cols,
area="prov",
pop_data=df_pop,
transformation="boxcox",
transform_shift=transform_shift,
boxcox_param=boxcox_param,
boxcox_bounds=boxcox_bounds,
indicators=indicators,
threshold=poverty_line,
n_reps=50,
n_mc=20,
batch_size=5,
rstate=123,
progress_bar=False
)
print(results)Bootstrap: 47 sample areas were not found in the population data. They will be excluded from the MSE refitting step.
Batch 1/10 completed.
Batch 2/10 completed.
Batch 3/10 completed.
Batch 4/10 completed.
Batch 5/10 completed.
Batch 6/10 completed.
Batch 7/10 completed.
Batch 8/10 completed.
Batch 9/10 completed.
Batch 10/10 completed.
╭──────────────────── SAE: EBP (boxcox) - Unit Level ─────────────────────╮ │ Method REML Log Likelihood -27462.4360 │ │ No. Areas 5 AIC 54948.8721 │ │ No. Obs 17199 BIC 55041.9034 │ │ ICC 0.0507 KIC 54960.8721 │ │ │ │ │ │ Term Estimate Std.Err z p-value [0.025 0.975] │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ Intercept 16.9528 0.0633 267.76 0.0000 16.8287 17.0769 │ │ age2 -0.0825 0.0375 -2.20 0.0276 -0.1559 -0.0091 │ │ age3 -0.0795 0.0343 -2.32 0.0203 -0.1467 -0.0124 │ │ age4 0.2156 0.0373 5.77 0.0000 0.1424 0.2888 │ │ age5 0.1259 0.0384 3.28 0.0010 0.0508 0.2011 │ │ nat1 -0.0783 0.0461 -1.70 0.0892 -0.1686 0.0120 │ │ educ1 -0.4580 0.0261 -17.54 0.0000 -0.5092 -0.4068 │ │ educ3 0.8341 0.0302 27.60 0.0000 0.7748 0.8933 │ │ labor1 0.4720 0.0254 18.61 0.0000 0.4223 0.5217 │ │ labor2 -0.1606 0.0509 -3.16 0.0016 -0.2603 -0.0610 │ │ │ │ │ │ Random Effects: Area (Normal) Sigma: 0.2748 Sigma²: 0.0755 │ ╰─────────────────────────────────────────────────────────────────────────╯
EB estimates
We extract the results into a Polars DataFrame.
df_results = results.to_polars(component="predictions")
target_indicator = str(sae.Indicator.POVERTY_GAP)
df_results.filter(
pl.col("indicator") == target_indicator
).select(
[
pl.col("area").alias("prov"),
pl.col("pred").alias("Poverty Gap"),
pl.col("mse"),
]
)| prov | Poverty Gap | mse |
|---|---|---|
| str | f64 | f64 |
| "34" | 0.078728 | 0.005564 |
| "40" | 0.093512 | 0.005837 |
| "42" | 0.07587 | 0.005961 |
| "44" | 0.097141 | 0.00526 |
| "5" | 0.054882 | 0.006293 |
Interpretation
predis the Empirical Bayes estimate of the indicator (here: Poverty Gap).mseis the mean squared error estimated via parametric bootstrap.
Unlike the BHF model, an analytical MSE is typically not available for these nonlinear indicators.
Performance considerations
EB estimation relies heavily on simulation. To manage runtime, you can adjust:
n_reps: controls the precision of the MSE estimaten_mc: controls the precision of the Monte Carlo integration used for point estimatesbatch_size: controls memory usage during simulation
Example of Workflow
import jax.numpy as jnp
import polars as pl
import svy
import svy_sae as sae
from svy_sae.core.enumerations import Indicator
# 1) Define a custom indicator function
# Must accept (y_simulated, threshold) and return a scalar float.
# Using jax.numpy (jnp) is recommended for performance during simulation.
def poverty_severity(y, z):
"""
Calculates FGT(2) - Squared Poverty Gap.
Formula: mean( ((z - y)/z)^2 ) for y < z, else 0.
"""
is_poor = y < z
# Calculate squared gap for the poor, 0 for others
gap_squared = jnp.where(is_poor, jnp.square(1.0 - y / z), 0.0)
return jnp.mean(gap_squared)
# 2) Load data
# - sample: Unit-level survey data with area identifiers
# - pop: Census/Population covariate data (one row per unit or aggregated)
df_sample = svy.io.read_csv("data/spain_synth_income.csv")
df_pop = svy.io.read_csv("data/spain_synth_covariates_oos.csv")
# 3) Define model parameters
# Calculate poverty line from the sample median
poverty_line = 0.6 * df_sample["income"].median()
# Mix built-in Enums with custom functions
mixed_indicators = [
Indicator.POVERTY_GAP, # Built-in (FGT1)
poverty_severity # Custom (FGT2)
]
# 4) Fit the Molina-Rao (EBP) model
# - boxcox_param=None: Auto-estimate the optimal transformation lambda
# - n_reps: Number of bootstrap replications for MSE
model = sae.UnitLevel(df_sample)
results = model.ebp(
y="income",
x=["age2", "age3", "age4", "age5", "nat1", "educ1", "educ3", "labor1", "labor2"],
area="prov",
pop_data=df_pop,
transformation="boxcox",
boxcox_param=None, # Estimate lambda
boxcox_bounds=(-2, 2),
indicators=mixed_indicators,
threshold=poverty_line,
n_reps=50, # Low reps for quick demo; use >=100 for production
n_mc=50,
rstate=42
)
# 5) Inspect results
# The custom function name ('poverty_severity') becomes the indicator label
preds = results.to_polars("predictions")
print(preds.head())Common outputs
After calling ebp(...), the results object provides:
results.stats— global model statistics (e.g., log-likelihood and transformation info)results.fixed_effects— regression coefficients (\(\beta\))results.random_effects— area-level random effects (\(u_i\))results.to_polars("predictions")— final small-area estimates for each indicator
Model diagnostics
Diagnostics for EB unit-level models will be added in a future release.
What’s next
Return to EBLUP - Unit-level modeling: The BHF model