import polars as pl
import svy
import svy_sae as saeEBLUP - Unit-level modeling
The Battese–Harter–Fuller (BHF) model with svy-sae
What this tutorial covers
This tutorial introduces the Battese–Harter–Fuller (BHF) model as implemented in svy-sae.
Unlike area-level models, unit-level models fit the mixed model directly at the unit level, allowing both unit-level error variance and area-level random effects variance to be estimated from the data.
In this tutorial, we focus on predicting area-level means using EBLUPs, and we demonstrate:
- fitting the BHF model using REML and ML,
- extracting EBLUP predictions,
- computing analytical and bootstrap MSE,
- using auxiliary information available at the population level.
Imports
County crop (corn and soybeans) data
We use the classic county crop data introduced by Battese, Harter, and Fuller (1988).
Each record in the sample corresponds to a segment within a county (small area).
Sample (unit-level) data
At a minimum, the sample data contain:
County— county identifier (small area)CornHec— area (hectares) under corn (response variable)CornPix— number of pixels classified as cornSoyBeansPix— number of pixels classified as soybeans
Population (auxiliary) data
Population-level data provide auxiliary information aggregated at the county level, including:
- county identifiers,
- auxiliary totals or means,
- population size (number of segments per county).
Load the data
# Load example datasets
smp = svy.load_dataset("cornsoybean", limit=None)
pop = svy.load_dataset("cornsoybeanmeans", limit=None)
print(f"Sample data: {smp.head()}")
print(f"Population means: {pop.head()}")Sample data: shape: (5, 5)
┌────────┬─────────┬─────────────┬─────────┬─────────────┐
│ County ┆ CornHec ┆ SoyBeansHec ┆ CornPix ┆ SoyBeansPix │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ i64 ┆ i64 │
╞════════╪═════════╪═════════════╪═════════╪═════════════╡
│ 1 ┆ 165.76 ┆ 8.09 ┆ 374 ┆ 55 │
│ 2 ┆ 96.32 ┆ 106.03 ┆ 209 ┆ 218 │
│ 3 ┆ 76.08 ┆ 103.6 ┆ 253 ┆ 250 │
│ 4 ┆ 185.35 ┆ 6.47 ┆ 432 ┆ 96 │
│ 4 ┆ 116.43 ┆ 63.82 ┆ 367 ┆ 178 │
└────────┴─────────┴─────────────┴─────────┴─────────────┘
Population means: shape: (5, 6)
┌────────┬────────────┬──────────────┬──────────────┬─────────┬─────────────┐
│ County ┆ CountyName ┆ SampSegments ┆ PopnSegments ┆ CornPix ┆ SoyBeansPix │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ i64 ┆ f64 ┆ f64 │
╞════════╪════════════╪══════════════╪══════════════╪═════════╪═════════════╡
│ 1 ┆ CerroGordo ┆ 1 ┆ 545 ┆ 295.29 ┆ 189.7 │
│ 2 ┆ Hamilton ┆ 1 ┆ 566 ┆ 300.4 ┆ 196.65 │
│ 3 ┆ Worth ┆ 1 ┆ 394 ┆ 289.6 ┆ 205.28 │
│ 4 ┆ Humboldt ┆ 2 ┆ 424 ┆ 290.74 ┆ 220.22 │
│ 5 ┆ Franklin ┆ 3 ┆ 564 ┆ 318.21 ┆ 188.06 │
└────────┴────────────┴──────────────┴──────────────┴─────────┴─────────────┘
Model formulation (conceptual)
The BHF model is given by:
\[ y_{ij} = \mathbf{x}_{ij}^\top \beta + u_i + e_{ij}, \]
where:
- \(u_i \sim N(0, \sigma_u^2)\) is the area-level random effect,
- \(e_{ij} \sim N(0, \sigma_e^2)\) is the unit-level error.
A common target parameter is the area mean:
\[ \theta_i \;=\; \bar{Y}_i \;=\; \frac{1}{N_i}\sum_{j=1}^{N_i} Y_{ij}. \]
Under the BHF model, the EBLUP of the area mean can be written as:
\[ \widehat{\theta}_i^{EBLUP} \;=\; \frac{1}{N_i}\left( \sum_{j\in s_i} y_{ij} \;+\; \sum_{j\in r_i} \mathbf{x}_{ij}^\top \widehat{\beta} \;+\; (N_i-n_i)\widehat{u}_i \right), \]
where \(s_i\) denotes the sampled units in area \(i\), \(r_i\) the non-sampled units, \(n_i\) is the sample size for area \(i\), and \(N_i\) is the population size for area \(i\). The term \(\widehat{u}_i\) is the empirical (estimated) area random effect, which shrinks the direct component toward the regression fit depending on the relative sizes of \(\sigma_u^2\) and \(\sigma_e^2\).
Fit the BHF model (REML + analytical MSE)
model = sae.UnitLevel(smp)
res = model.eblup(
y="CornHec",
x=["CornPix", "SoyBeansPix"],
area="County",
pop_data=pop,
pop_size="PopnSegments",
method="REML",
mse="analytical",
intercept=True,
drop_nulls=True,
)
print(res)╭───────────── SAE: BHF (EBLUP) - MSE: Analytical - Unit Level ─────────────╮ │ Method REML Log Likelihood -161.0062 │ │ No. Areas 12 AIC 332.0125 │ │ No. Obs 37 BIC 340.0671 │ │ ICC 0.1754 KIC 337.0125 │ │ │ │ │ │ Term Estimate Std.Err z p-value [0.025 0.975] │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ Intercept 17.9468 30.9597 0.58 0.5621 -42.7342 78.6277 │ │ CornPix 0.3664 0.0649 5.64 0.0000 0.2391 0.4936 │ │ SoyBeansPix -0.0303 0.0676 -0.45 0.6534 -0.1627 0.1021 │ │ │ │ │ │ Random Effects: Area (Normal) Sigma: 7.9571 Sigma²: 63.3149 │ ╰───────────────────────────────────────────────────────────────────────────╯
Fitted statistics
print(res.stats)SAEStats(method=REML, n_areas=12, n_obs=37, sigma_u=7.9571, icc=0.1754, logLik=-161.0062)
Fixed and random effects
Fixed effects
for fe in res.fixed_effects:
print(fe)FixedEffect(term='Intercept', est=17.9468, se=30.9597, z=0.58, p=0.5621)
FixedEffect(term='CornPix', est=0.3664, se=0.0649, z=5.64, p=1.676e-08)
FixedEffect(term='SoyBeansPix', est=-0.0303, se=0.0676, z=-0.45, p=0.6534)
Random effects
print(res.random_effects)RandomEffect(term='Area', dist=Normal, sigma=7.9571)
EBLUP predictions and analytical MSE
pred = res.to_polars("predictions")
print(pred.head())shape: (5, 7)
┌──────┬────────────┬───────────┬──────────┬────────────┬────────────┬───────────┐
│ area ┆ pred ┆ mse ┆ cv ┆ lci ┆ uci ┆ indicator │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞══════╪════════════╪═══════════╪══════════╪════════════╪════════════╪═══════════╡
│ 1 ┆ 122.563395 ┆ 62.464222 ┆ 0.064484 ┆ 107.07267 ┆ 138.05412 ┆ Mean │
│ 10 ┆ 124.18042 ┆ 34.274651 ┆ 0.047145 ┆ 112.705687 ┆ 135.655153 ┆ Mean │
│ 11 ┆ 112.504598 ┆ 34.088246 ┆ 0.051896 ┆ 101.061111 ┆ 123.948086 ┆ Mean │
│ 12 ┆ 131.257976 ┆ 30.895556 ┆ 0.042347 ┆ 120.363557 ┆ 142.152395 ┆ Mean │
│ 2 ┆ 123.515319 ┆ 61.952696 ┆ 0.063725 ┆ 108.088152 ┆ 138.942486 ┆ Mean │
└──────┴────────────┴───────────┴──────────┴────────────┴────────────┴───────────┘
Maximum likelihood (ML) estimation
res_ml = model.eblup(
y="CornHec",
x=["CornPix", "SoyBeansPix"],
area="County",
pop_data=pop,
pop_size="PopnSegments",
method="ML",
mse="analytical",
intercept=True,
drop_nulls=True,
)
print(res_ml.stats)SAEStats(method=ML, n_areas=12, n_obs=37, sigma_u=6.9134, icc=0.1457, logLik=-156.4413)
Bootstrap MSE
res_boot = model.eblup(
y="CornHec",
x=["CornPix", "SoyBeansPix"],
area="County",
pop_data=pop,
method="REML",
mse="bootstrap",
n_reps=200,
batch_size=50,
rstate=42,
progress_bar=False,
)
pred_boot = res_boot.to_polars("predictions")
print(pred_boot.head()) Batch 1/4 completed.
Batch 2/4 completed.
Batch 3/4 completed.
Batch 4/4 completed.
shape: (5, 7)
┌──────┬────────────┬───────────┬──────────┬────────────┬────────────┬───────────┐
│ area ┆ pred ┆ mse ┆ cv ┆ lci ┆ uci ┆ indicator │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞══════╪════════════╪═══════════╪══════════╪════════════╪════════════╪═══════════╡
│ 1 ┆ 122.563395 ┆ 71.965985 ┆ 0.069215 ┆ 105.936173 ┆ 139.190618 ┆ Mean │
│ 10 ┆ 124.18042 ┆ 40.901878 ┆ 0.051501 ┆ 111.645323 ┆ 136.715517 ┆ Mean │
│ 11 ┆ 112.504598 ┆ 41.200252 ┆ 0.057053 ┆ 99.923864 ┆ 125.085333 ┆ Mean │
│ 12 ┆ 131.257976 ┆ 36.970289 ┆ 0.046323 ┆ 119.340549 ┆ 143.175403 ┆ Mean │
│ 2 ┆ 123.515319 ┆ 96.945001 ┆ 0.079715 ┆ 104.217031 ┆ 142.813607 ┆ Mean │
└──────┴────────────┴───────────┴──────────┴────────────┴────────────┴───────────┘
Data validation and missing values
Unit-level models are sensitive to missing data. Setting drop_nulls=True removes rows with missing values in variables used by the model.
Common outputs you will use
After calling eblup(...), the returned results object provides:
results.stats— fitted statistics,results.fixed_effects— fixed effects estimates,results.random_effects— variance components,results.to_polars("predictions")— EBLUPs and MSE.
Model diagnostics
Diagnostics for unit-level models will be added in a future release.
What’s next
Next, we will review the EB version of the unit level model: Molina-Rao model