EBLUP - Unit-level modeling

The Battese–Harter–Fuller (BHF) model with svy-sae

What this tutorial covers

This tutorial introduces the Battese–Harter–Fuller (BHF) model as implemented in svy-sae.

Unlike area-level models, unit-level models fit the mixed model directly at the unit level, allowing both unit-level error variance and area-level random effects variance to be estimated from the data.

In this tutorial, we focus on predicting area-level means using EBLUPs, and we demonstrate:

fitting the BHF model using REML and ML,
extracting EBLUP predictions,
computing analytical and bootstrap MSE,
using auxiliary information available at the population level.

Imports

import polars as pl
import svy

import svy_sae as sae

County crop (corn and soybeans) data

We use the classic county crop data introduced by Battese, Harter, and Fuller (1988).

Each record in the sample corresponds to a segment within a county (small area).

Sample (unit-level) data

At a minimum, the sample data contain:

County — county identifier (small area)
CornHec — area (hectares) under corn (response variable)
CornPix — number of pixels classified as corn
SoyBeansPix — number of pixels classified as soybeans

Population (auxiliary) data

Population-level data provide auxiliary information aggregated at the county level, including:

county identifiers,
auxiliary totals or means,
population size (number of segments per county).

Load the data

# Load example datasets
smp = svy.load_dataset("cornsoybean", limit=None)
pop = svy.load_dataset("cornsoybeanmeans", limit=None)

print(f"Sample data: {smp.head()}")
print(f"Population means: {pop.head()}")

Sample data: shape: (5, 5)
┌────────┬─────────┬─────────────┬─────────┬─────────────┐
│ County ┆ CornHec ┆ SoyBeansHec ┆ CornPix ┆ SoyBeansPix │
│ ---    ┆ ---     ┆ ---         ┆ ---     ┆ ---         │
│ i64    ┆ f64     ┆ f64         ┆ i64     ┆ i64         │
╞════════╪═════════╪═════════════╪═════════╪═════════════╡
│ 1      ┆ 165.76  ┆ 8.09        ┆ 374     ┆ 55          │
│ 2      ┆ 96.32   ┆ 106.03      ┆ 209     ┆ 218         │
│ 3      ┆ 76.08   ┆ 103.6       ┆ 253     ┆ 250         │
│ 4      ┆ 185.35  ┆ 6.47        ┆ 432     ┆ 96          │
│ 4      ┆ 116.43  ┆ 63.82       ┆ 367     ┆ 178         │
└────────┴─────────┴─────────────┴─────────┴─────────────┘
Population means: shape: (5, 6)
┌────────┬────────────┬──────────────┬──────────────┬─────────┬─────────────┐
│ County ┆ CountyName ┆ SampSegments ┆ PopnSegments ┆ CornPix ┆ SoyBeansPix │
│ ---    ┆ ---        ┆ ---          ┆ ---          ┆ ---     ┆ ---         │
│ i64    ┆ str        ┆ i64          ┆ i64          ┆ f64     ┆ f64         │
╞════════╪════════════╪══════════════╪══════════════╪═════════╪═════════════╡
│ 1      ┆ CerroGordo ┆ 1            ┆ 545          ┆ 295.29  ┆ 189.7       │
│ 2      ┆ Hamilton   ┆ 1            ┆ 566          ┆ 300.4   ┆ 196.65      │
│ 3      ┆ Worth      ┆ 1            ┆ 394          ┆ 289.6   ┆ 205.28      │
│ 4      ┆ Humboldt   ┆ 2            ┆ 424          ┆ 290.74  ┆ 220.22      │
│ 5      ┆ Franklin   ┆ 3            ┆ 564          ┆ 318.21  ┆ 188.06      │
└────────┴────────────┴──────────────┴──────────────┴─────────┴─────────────┘

Model formulation (conceptual)

The BHF model is given by:

\[ y_{ij} = \mathbf{x}_{ij}^\top \beta + u_i + e_{ij}, \]

where:

\(u_i \sim N(0, \sigma_u^2)\) is the area-level random effect,
\(e_{ij} \sim N(0, \sigma_e^2)\) is the unit-level error.

A common target parameter is the area mean:

\[ \theta_i \;=\; \bar{Y}_i \;=\; \frac{1}{N_i}\sum_{j=1}^{N_i} Y_{ij}. \]

Under the BHF model, the EBLUP of the area mean can be written as:

\[ \widehat{\theta}_i^{EBLUP} \;=\; \frac{1}{N_i}\left( \sum_{j\in s_i} y_{ij} \;+\; \sum_{j\in r_i} \mathbf{x}_{ij}^\top \widehat{\beta} \;+\; (N_i-n_i)\widehat{u}_i \right), \]

where \(s_i\) denotes the sampled units in area \(i\), \(r_i\) the non-sampled units, \(n_i\) is the sample size for area \(i\), and \(N_i\) is the population size for area \(i\). The term \(\widehat{u}_i\) is the empirical (estimated) area random effect, which shrinks the direct component toward the regression fit depending on the relative sizes of \(\sigma_u^2\) and \(\sigma_e^2\).

Fit the BHF model (REML + analytical MSE)

model = sae.UnitLevel(smp)

res = model.eblup(
    y="CornHec",
    x=["CornPix", "SoyBeansPix"],
    area="County",
    pop_data=pop,
    pop_size="PopnSegments",
    method="REML",
    mse="analytical",
    intercept=True,
    drop_nulls=True,
)

print(res)

╭───────────── SAE: BHF (EBLUP) - MSE: Analytical - Unit Level ─────────────╮
│ Method       REML  Log Likelihood  -161.0062                              │
│ No. Areas      12  AIC              332.0125                              │
│ No. Obs        37  BIC              340.0671                              │
│ ICC        0.1754  KIC              337.0125                              │
│                                                                           │
│                                                                           │
│  Term          Estimate   Std.Err       z   p-value     [0.025    0.975]  │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  │
│  Intercept      17.9468   30.9597    0.58    0.5621   -42.7342   78.6277  │
│  CornPix         0.3664    0.0649    5.64    0.0000     0.2391    0.4936  │
│  SoyBeansPix    -0.0303    0.0676   -0.45    0.6534    -0.1627    0.1021  │
│                                                                           │
│                                                                           │
│ Random Effects: Area (Normal) Sigma: 7.9571  Sigma²: 63.3149              │
╰───────────────────────────────────────────────────────────────────────────╯

Fitted statistics

print(res.stats)

SAEStats(method=REML, n_areas=12, n_obs=37, sigma_u=7.9571, icc=0.1754, logLik=-161.0062)

Fixed and random effects

Fixed effects

for fe in res.fixed_effects:
    print(fe)

FixedEffect(term='Intercept', est=17.9468, se=30.9597, z=0.58, p=0.5621)
FixedEffect(term='CornPix', est=0.3664, se=0.0649, z=5.64, p=1.676e-08)
FixedEffect(term='SoyBeansPix', est=-0.0303, se=0.0676, z=-0.45, p=0.6534)

Random effects

print(res.random_effects)

RandomEffect(term='Area', dist=Normal, sigma=7.9571)

EBLUP predictions and analytical MSE

pred = res.to_polars("predictions")
print(pred.head())

shape: (5, 7)
┌──────┬────────────┬───────────┬──────────┬────────────┬────────────┬───────────┐
│ area ┆ pred       ┆ mse       ┆ cv       ┆ lci        ┆ uci        ┆ indicator │
│ ---  ┆ ---        ┆ ---       ┆ ---      ┆ ---        ┆ ---        ┆ ---       │
│ str  ┆ f64        ┆ f64       ┆ f64      ┆ f64        ┆ f64        ┆ str       │
╞══════╪════════════╪═══════════╪══════════╪════════════╪════════════╪═══════════╡
│ 1    ┆ 122.563395 ┆ 62.464222 ┆ 0.064484 ┆ 107.07267  ┆ 138.05412  ┆ Mean      │
│ 10   ┆ 124.18042  ┆ 34.274651 ┆ 0.047145 ┆ 112.705687 ┆ 135.655153 ┆ Mean      │
│ 11   ┆ 112.504598 ┆ 34.088246 ┆ 0.051896 ┆ 101.061111 ┆ 123.948086 ┆ Mean      │
│ 12   ┆ 131.257976 ┆ 30.895556 ┆ 0.042347 ┆ 120.363557 ┆ 142.152395 ┆ Mean      │
│ 2    ┆ 123.515319 ┆ 61.952696 ┆ 0.063725 ┆ 108.088152 ┆ 138.942486 ┆ Mean      │
└──────┴────────────┴───────────┴──────────┴────────────┴────────────┴───────────┘

Maximum likelihood (ML) estimation

res_ml = model.eblup(
    y="CornHec",
    x=["CornPix", "SoyBeansPix"],
    area="County",
    pop_data=pop,
    pop_size="PopnSegments",
    method="ML",
    mse="analytical",
    intercept=True,
    drop_nulls=True,
)

print(res_ml.stats)

SAEStats(method=ML, n_areas=12, n_obs=37, sigma_u=6.9134, icc=0.1457, logLik=-156.4413)

Bootstrap MSE

res_boot = model.eblup(
    y="CornHec",
    x=["CornPix", "SoyBeansPix"],
    area="County",
    pop_data=pop,
    method="REML",
    mse="bootstrap",
    n_reps=200,
    batch_size=50,
    rstate=42,
    progress_bar=False,
)

pred_boot = res_boot.to_polars("predictions")
print(pred_boot.head())

  Batch 1/4 completed.
  Batch 2/4 completed.
  Batch 3/4 completed.
  Batch 4/4 completed.
shape: (5, 7)
┌──────┬────────────┬───────────┬──────────┬────────────┬────────────┬───────────┐
│ area ┆ pred       ┆ mse       ┆ cv       ┆ lci        ┆ uci        ┆ indicator │
│ ---  ┆ ---        ┆ ---       ┆ ---      ┆ ---        ┆ ---        ┆ ---       │
│ str  ┆ f64        ┆ f64       ┆ f64      ┆ f64        ┆ f64        ┆ str       │
╞══════╪════════════╪═══════════╪══════════╪════════════╪════════════╪═══════════╡
│ 1    ┆ 122.563395 ┆ 71.965985 ┆ 0.069215 ┆ 105.936173 ┆ 139.190618 ┆ Mean      │
│ 10   ┆ 124.18042  ┆ 40.901878 ┆ 0.051501 ┆ 111.645323 ┆ 136.715517 ┆ Mean      │
│ 11   ┆ 112.504598 ┆ 41.200252 ┆ 0.057053 ┆ 99.923864  ┆ 125.085333 ┆ Mean      │
│ 12   ┆ 131.257976 ┆ 36.970289 ┆ 0.046323 ┆ 119.340549 ┆ 143.175403 ┆ Mean      │
│ 2    ┆ 123.515319 ┆ 96.945001 ┆ 0.079715 ┆ 104.217031 ┆ 142.813607 ┆ Mean      │
└──────┴────────────┴───────────┴──────────┴────────────┴────────────┴───────────┘

Data validation and missing values

Unit-level models are sensitive to missing data. Setting drop_nulls=True removes rows with missing values in variables used by the model.

Common outputs you will use

After calling eblup(...), the returned results object provides:

results.stats — fitted statistics,
results.fixed_effects — fixed effects estimates,
results.random_effects — variance components,
results.to_polars("predictions") — EBLUPs and MSE.

Model diagnostics

COMING SOON!!!

Diagnostics for unit-level models will be added in a future release.

What’s next

Next, we will review the EB version of the unit level model: Molina-Rao model