EB Unit-level modeling

The Molina–Rao model with svy-sae

What this tutorial covers

This tutorial introduces the Molina–Rao Empirical Bayes (EB) unit-level model as implemented in svy-sae.

While the BHF model (Part 1) focuses on predicting linear parameters such as area means, the Molina–Rao framework is designed for nonlinear indicators, including poverty incidence, poverty gap, and other threshold-based measures.

In this tutorial, we demonstrate:

fitting an EB unit-level model,
handling transformations of skewed outcomes,
defining and estimating nonlinear indicators,
estimating uncertainty using parametric bootstrap.

Load the data

To illustrate the APIs, We use the Spain synthetic income datasets from the R package sae.

Sample data

prov — small area identifier
income — household income
covariates — demographic and socioeconomic variables

Population data

covariates for non-sampled units
one row per population unit

from pathlib import Path

import polars as pl
import svy
import svy_sae as sae

df_sample = svy.load_dataset("spain_synth_income", limit=None)
df_pop = svy.load_dataset("spain_synth_covariates_oos", limit=None)

print(f"Synthetic income sample data : {df_sample.head()}")
print(f"Synthetic covariates out of sample data (population): {df_pop.head()}")

print(f"Number of small areas: {len(df_sample['prov'].unique())}")
print(f"Number of out of sample records: {df_pop.shape[0]}")

Synthetic income sample data : shape: (5, 22)
┌─────┬─────────┬──────┬─────┬───┬────────┬────────┬──────────────┬─────────────┐
│     ┆ provlab ┆ prov ┆ ac  ┆ … ┆ labor2 ┆ labor3 ┆ income       ┆ weight      │
│ --- ┆ ---     ┆ ---  ┆ --- ┆   ┆ ---    ┆ ---    ┆ ---          ┆ ---         │
│ i64 ┆ str     ┆ i64  ┆ i64 ┆   ┆ i64    ┆ i64    ┆ f64          ┆ f64         │
╞═════╪═════════╪══════╪═════╪═══╪════════╪════════╪══════════════╪═════════════╡
│ 1   ┆ Alava   ┆ 1    ┆ 16  ┆ … ┆ 0      ┆ 1      ┆ 12021.832116 ┆ 2804.031333 │
│ 2   ┆ Alava   ┆ 1    ┆ 16  ┆ … ┆ 0      ┆ 0      ┆ 8609.853817  ┆ 1347.639615 │
│ 3   ┆ Alava   ┆ 1    ┆ 16  ┆ … ┆ 0      ┆ 0      ┆ 3384.197292  ┆ 2568.176149 │
│ 4   ┆ Alava   ┆ 1    ┆ 16  ┆ … ┆ 0      ┆ 0      ┆ 2343.3552    ┆ 1168.233266 │
│ 5   ┆ Alava   ┆ 1    ┆ 16  ┆ … ┆ 0      ┆ 0      ┆ 9023.845796  ┆ 953.438826  │
└─────┴─────────┴──────┴─────┴───┴────────┴────────┴──────────────┴─────────────┘
Synthetic covariates out of sample data (population): shape: (5, 11)
┌─────┬──────┬──────┬──────┬───┬───────┬───────┬────────┬────────┐
│     ┆ prov ┆ age2 ┆ age3 ┆ … ┆ educ1 ┆ educ3 ┆ labor1 ┆ labor2 │
│ --- ┆ ---  ┆ ---  ┆ ---  ┆   ┆ ---   ┆ ---   ┆ ---    ┆ ---    │
│ i64 ┆ i64  ┆ i64  ┆ i64  ┆   ┆ i64   ┆ i64   ┆ i64    ┆ i64    │
╞═════╪══════╪══════╪══════╪═══╪═══════╪═══════╪════════╪════════╡
│ 1   ┆ 42   ┆ 0    ┆ 0    ┆ … ┆ 0     ┆ 0     ┆ 0      ┆ 0      │
│ 2   ┆ 42   ┆ 0    ┆ 0    ┆ … ┆ 0     ┆ 0     ┆ 0      ┆ 0      │
│ 3   ┆ 42   ┆ 0    ┆ 0    ┆ … ┆ 0     ┆ 0     ┆ 0      ┆ 0      │
│ 4   ┆ 42   ┆ 0    ┆ 0    ┆ … ┆ 0     ┆ 0     ┆ 0      ┆ 0      │
│ 5   ┆ 42   ┆ 0    ┆ 0    ┆ … ┆ 0     ┆ 0     ┆ 0      ┆ 0      │
└─────┴──────┴──────┴──────┴───┴───────┴───────┴────────┴────────┘
Number of small areas: 52
Number of out of sample records: 713301

Defining the poverty line

We define the poverty line relative to the median income of the sample.

income_median = df_sample["income"].median()
poverty_line = 0.6 * income_median

print(f"Poverty line: {poverty_line}")

Poverty line: 6477.48423337962

Transformation of the response

Income data are typically right-skewed. We apply a Box–Cox transformation (with a positive shift) to stabilize variance and improve normality. The parameter \(\lambda\) of the Box–Cox transformation controls the shape of the transformation:

boxcox_param=None (default): automatically estimate \(\lambda\) from the data to maximize normality
boxcox_param=1: identity (no shape change)
boxcox_param=0.5: square-root transformation (\(\sqrt{y}\))
boxcox_param=0: log transformation (\(\ln(y)\))
boxcox_param=-1: inverse transformation (\(1/y\))

You can restrict the search range using boxcox_bounds to prevent unrealistic values of \(\lambda\).

transformation = "boxcox"
transform_shift = 3600.0   # Shitf y prior to transformation e.g. positive values for log

boxcox_param = None        # Estimate lambda
boxcox_bounds = (-2, 2)    # Search within [-2, 2]

Specify covariates and indicators

We select our predictors and the target indicator (Poverty Gap).

feature_cols = ["age2", "age3", "age4", "age5", "nat1", "educ1", "educ3", "labor1", "labor2"]

indicators = [sae.Indicator.POVERTY_GAP]

Indicators

svy-sae supports multiple indicators in a single run via the indicators=[...] argument. For threshold-based measures (poverty), you must supply threshold=....

EB predictors and bootstrap MSE

We fit the model using ebp(). Because MSE is estimated via parametric bootstrap, we must specify:

n_reps: bootstrap replications (MSE precision)
n_mc: Monte Carlo draws per replication (integration precision)

The population data only contain information for 5 small areas, rather than the full set of 52. Nonetheless, the model will still produce predictions for the small areas present in the population data.

When this situation occurs, svy-sae emits a warning message indicating that the provided population data do not cover all small areas observed in the sample. This warning is informational and does not prevent model fitting or prediction, but it serves to alert users that estimates will be limited to the subset of areas represented in the population data.

model = sae.UnitLevel(df_sample)

results = model.ebp(
    y="income",
    x=feature_cols,
    area="prov",
    pop_data=df_pop,
    transformation="boxcox",
    transform_shift=transform_shift,
    boxcox_param=boxcox_param,
    boxcox_bounds=boxcox_bounds,
    indicators=indicators,
    threshold=poverty_line,
    n_reps=50,
    n_mc=20,
    batch_size=5,
    rstate=123,
    progress_bar=False
)

print(results)

Bootstrap: 47 sample areas were not found in the population data. They will be excluded from the MSE refitting step.

  Batch 1/10 completed.
  Batch 2/10 completed.
  Batch 3/10 completed.
  Batch 4/10 completed.
  Batch 5/10 completed.
  Batch 6/10 completed.
  Batch 7/10 completed.
  Batch 8/10 completed.
  Batch 9/10 completed.
  Batch 10/10 completed.

╭──────────────────── SAE: EBP (boxcox) - Unit Level ─────────────────────╮
│ Method       REML  Log Likelihood  -27462.4360                          │
│ No. Areas       5  AIC              54948.8721                          │
│ No. Obs     17199  BIC              55041.9034                          │
│ ICC        0.0507  KIC              54960.8721                          │
│                                                                         │
│                                                                         │
│  Term        Estimate   Std.Err        z   p-value    [0.025    0.975]  │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  │
│  Intercept    16.9528    0.0633   267.76    0.0000   16.8287   17.0769  │
│  age2         -0.0825    0.0375    -2.20    0.0276   -0.1559   -0.0091  │
│  age3         -0.0795    0.0343    -2.32    0.0203   -0.1467   -0.0124  │
│  age4          0.2156    0.0373     5.77    0.0000    0.1424    0.2888  │
│  age5          0.1259    0.0384     3.28    0.0010    0.0508    0.2011  │
│  nat1         -0.0783    0.0461    -1.70    0.0892   -0.1686    0.0120  │
│  educ1        -0.4580    0.0261   -17.54    0.0000   -0.5092   -0.4068  │
│  educ3         0.8341    0.0302    27.60    0.0000    0.7748    0.8933  │
│  labor1        0.4720    0.0254    18.61    0.0000    0.4223    0.5217  │
│  labor2       -0.1606    0.0509    -3.16    0.0016   -0.2603   -0.0610  │
│                                                                         │
│                                                                         │
│ Random Effects: Area (Normal) Sigma: 0.2748  Sigma²: 0.0755             │
╰─────────────────────────────────────────────────────────────────────────╯

EB estimates

We extract the results into a Polars DataFrame.

df_results = results.to_polars(component="predictions")
target_indicator = str(sae.Indicator.POVERTY_GAP)

df_results.filter(
    pl.col("indicator") == target_indicator
).select(
    [
        pl.col("area").alias("prov"),
        pl.col("pred").alias("Poverty Gap"),
        pl.col("mse"),
    ]
)

shape: (5, 3)

prov	Poverty Gap	mse
str	f64	f64
"34"	0.078728	0.005564
"40"	0.093512	0.005837
"42"	0.07587	0.005961
"44"	0.097141	0.00526
"5"	0.054882	0.006293

Interpretation

pred is the Empirical Bayes estimate of the indicator (here: Poverty Gap).
mse is the mean squared error estimated via parametric bootstrap.

Unlike the BHF model, an analytical MSE is typically not available for these nonlinear indicators.

Performance considerations

EB estimation relies heavily on simulation. To manage runtime, you can adjust:

n_reps: controls the precision of the MSE estimate
n_mc: controls the precision of the Monte Carlo integration used for point estimates
batch_size: controls memory usage during simulation

Example of Workflow

import jax.numpy as jnp
import polars as pl
import svy
import svy_sae as sae

from svy_sae.core.enumerations import Indicator

# 1) Define a custom indicator function
#    Must accept (y_simulated, threshold) and return a scalar float.
#    Using jax.numpy (jnp) is recommended for performance during simulation.
def poverty_severity(y, z):
    """
    Calculates FGT(2) - Squared Poverty Gap.
    Formula: mean( ((z - y)/z)^2 ) for y < z, else 0.
    """
    is_poor = y < z
    # Calculate squared gap for the poor, 0 for others
    gap_squared = jnp.where(is_poor, jnp.square(1.0 - y / z), 0.0)
    return jnp.mean(gap_squared)

# 2) Load data
#    - sample: Unit-level survey data with area identifiers
#    - pop: Census/Population covariate data (one row per unit or aggregated)
df_sample = svy.io.read_csv("data/spain_synth_income.csv")
df_pop = svy.io.read_csv("data/spain_synth_covariates_oos.csv")

# 3) Define model parameters
#    Calculate poverty line from the sample median
poverty_line = 0.6 * df_sample["income"].median()

#    Mix built-in Enums with custom functions
mixed_indicators = [
    Indicator.POVERTY_GAP,  # Built-in (FGT1)
    poverty_severity        # Custom (FGT2)
]

# 4) Fit the Molina-Rao (EBP) model
#    - boxcox_param=None: Auto-estimate the optimal transformation lambda
#    - n_reps: Number of bootstrap replications for MSE
model = sae.UnitLevel(df_sample)

results = model.ebp(
    y="income",
    x=["age2", "age3", "age4", "age5", "nat1", "educ1", "educ3", "labor1", "labor2"],
    area="prov",
    pop_data=df_pop,
    transformation="boxcox",
    boxcox_param=None,      # Estimate lambda
    boxcox_bounds=(-2, 2),
    indicators=mixed_indicators,
    threshold=poverty_line,
    n_reps=50,              # Low reps for quick demo; use >=100 for production
    n_mc=50,
    rstate=42
)

# 5) Inspect results
#    The custom function name ('poverty_severity') becomes the indicator label
preds = results.to_polars("predictions")
print(preds.head())

Common outputs

After calling ebp(...), the results object provides:

results.stats — global model statistics (e.g., log-likelihood and transformation info)
results.fixed_effects — regression coefficients (\(\beta\))
results.random_effects — area-level random effects (\(u_i\))
results.to_polars("predictions") — final small-area estimates for each indicator

Model diagnostics

Diagnostics for EB unit-level models will be added in a future release.

What’s next

Return to EBLUP - Unit-level modeling: The BHF model