EBLUP - Unit-level modeling

The Battese–Harter–Fuller (BHF) model with svy-sae

What this tutorial covers

This tutorial introduces the Battese–Harter–Fuller (BHF) model as implemented in svy-sae.

Unlike area-level models, unit-level models fit the mixed model directly at the unit level, allowing both unit-level error variance and area-level random effects variance to be estimated from the data.

In this tutorial, we focus on predicting area-level means using EBLUPs, and we demonstrate:

  • fitting the BHF model using REML and ML,
  • extracting EBLUP predictions,
  • computing analytical and bootstrap MSE,
  • using auxiliary information available at the population level.

Imports

import polars as pl
import svy

import svy_sae as sae

County crop (corn and soybeans) data

We use the classic county crop data introduced by Battese, Harter, and Fuller (1988).

Each record in the sample corresponds to a segment within a county (small area).

Sample (unit-level) data

At a minimum, the sample data contain:

  • County — county identifier (small area)
  • CornHec — area (hectares) under corn (response variable)
  • CornPix — number of pixels classified as corn
  • SoyBeansPix — number of pixels classified as soybeans

Population (auxiliary) data

Population-level data provide auxiliary information aggregated at the county level, including:

  • county identifiers,
  • auxiliary totals or means,
  • population size (number of segments per county).

Load the data

# Load example datasets
smp = svy.load_dataset("cornsoybean", limit=None)
pop = svy.load_dataset("cornsoybeanmeans", limit=None)

print(f"Sample data: {smp.head()}")
print(f"Population means: {pop.head()}")
Sample data: shape: (5, 5)
┌────────┬─────────┬─────────────┬─────────┬─────────────┐
│ County ┆ CornHec ┆ SoyBeansHec ┆ CornPix ┆ SoyBeansPix │
│ ---    ┆ ---     ┆ ---         ┆ ---     ┆ ---         │
│ i64    ┆ f64     ┆ f64         ┆ i64     ┆ i64         │
╞════════╪═════════╪═════════════╪═════════╪═════════════╡
│ 1      ┆ 165.76  ┆ 8.09        ┆ 374     ┆ 55          │
│ 2      ┆ 96.32   ┆ 106.03      ┆ 209     ┆ 218         │
│ 3      ┆ 76.08   ┆ 103.6       ┆ 253     ┆ 250         │
│ 4      ┆ 185.35  ┆ 6.47        ┆ 432     ┆ 96          │
│ 4      ┆ 116.43  ┆ 63.82       ┆ 367     ┆ 178         │
└────────┴─────────┴─────────────┴─────────┴─────────────┘
Population means: shape: (5, 6)
┌────────┬────────────┬──────────────┬──────────────┬─────────┬─────────────┐
│ County ┆ CountyName ┆ SampSegments ┆ PopnSegments ┆ CornPix ┆ SoyBeansPix │
│ ---    ┆ ---        ┆ ---          ┆ ---          ┆ ---     ┆ ---         │
│ i64    ┆ str        ┆ i64          ┆ i64          ┆ f64     ┆ f64         │
╞════════╪════════════╪══════════════╪══════════════╪═════════╪═════════════╡
│ 1      ┆ CerroGordo ┆ 1            ┆ 545          ┆ 295.29  ┆ 189.7       │
│ 2      ┆ Hamilton   ┆ 1            ┆ 566          ┆ 300.4   ┆ 196.65      │
│ 3      ┆ Worth      ┆ 1            ┆ 394          ┆ 289.6   ┆ 205.28      │
│ 4      ┆ Humboldt   ┆ 2            ┆ 424          ┆ 290.74  ┆ 220.22      │
│ 5      ┆ Franklin   ┆ 3            ┆ 564          ┆ 318.21  ┆ 188.06      │
└────────┴────────────┴──────────────┴──────────────┴─────────┴─────────────┘

Model formulation (conceptual)

The BHF model is given by:

\[ y_{ij} = \mathbf{x}_{ij}^\top \beta + u_i + e_{ij}, \]

where:

  • \(u_i \sim N(0, \sigma_u^2)\) is the area-level random effect,
  • \(e_{ij} \sim N(0, \sigma_e^2)\) is the unit-level error.

A common target parameter is the area mean:

\[ \theta_i \;=\; \bar{Y}_i \;=\; \frac{1}{N_i}\sum_{j=1}^{N_i} Y_{ij}. \]

Under the BHF model, the EBLUP of the area mean can be written as:

\[ \widehat{\theta}_i^{EBLUP} \;=\; \frac{1}{N_i}\left( \sum_{j\in s_i} y_{ij} \;+\; \sum_{j\in r_i} \mathbf{x}_{ij}^\top \widehat{\beta} \;+\; (N_i-n_i)\widehat{u}_i \right), \]

where \(s_i\) denotes the sampled units in area \(i\), \(r_i\) the non-sampled units, \(n_i\) is the sample size for area \(i\), and \(N_i\) is the population size for area \(i\). The term \(\widehat{u}_i\) is the empirical (estimated) area random effect, which shrinks the direct component toward the regression fit depending on the relative sizes of \(\sigma_u^2\) and \(\sigma_e^2\).

Fit the BHF model (REML + analytical MSE)

model = sae.UnitLevel(smp)

res = model.eblup(
    y="CornHec",
    x=["CornPix", "SoyBeansPix"],
    area="County",
    pop_data=pop,
    pop_size="PopnSegments",
    method="REML",
    mse="analytical",
    intercept=True,
    drop_nulls=True,
)

print(res)
╭───────────── SAE: BHF (EBLUP) - MSE: Analytical - Unit Level ─────────────╮
 Method       REML  Log Likelihood  -161.0062                              
 No. Areas      12  AIC              332.0125                              
 No. Obs        37  BIC              340.0671                              
 ICC        0.1754  KIC              337.0125                              
                                                                           
                                                                           
  Term          Estimate   Std.Err       z   p-value     [0.025    0.975]  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  Intercept      17.9468   30.9597    0.58    0.5621   -42.7342   78.6277  
  CornPix         0.3664    0.0649    5.64    0.0000     0.2391    0.4936  
  SoyBeansPix    -0.0303    0.0676   -0.45    0.6534    -0.1627    0.1021  
                                                                           
                                                                           
 Random Effects: Area (Normal) Sigma: 7.9571  Sigma²: 63.3149              
╰───────────────────────────────────────────────────────────────────────────╯

Fitted statistics

print(res.stats)
SAEStats(method=REML, n_areas=12, n_obs=37, sigma_u=7.9571, icc=0.1754, logLik=-161.0062)

Fixed and random effects

Fixed effects

for fe in res.fixed_effects:
    print(fe)
FixedEffect(term='Intercept', est=17.9468, se=30.9597, z=0.58, p=0.5621)
FixedEffect(term='CornPix', est=0.3664, se=0.0649, z=5.64, p=1.676e-08)
FixedEffect(term='SoyBeansPix', est=-0.0303, se=0.0676, z=-0.45, p=0.6534)

Random effects

print(res.random_effects)
RandomEffect(term='Area', dist=Normal, sigma=7.9571)

EBLUP predictions and analytical MSE

pred = res.to_polars("predictions")
print(pred.head())
shape: (5, 7)
┌──────┬────────────┬───────────┬──────────┬────────────┬────────────┬───────────┐
│ area ┆ pred       ┆ mse       ┆ cv       ┆ lci        ┆ uci        ┆ indicator │
│ ---  ┆ ---        ┆ ---       ┆ ---      ┆ ---        ┆ ---        ┆ ---       │
│ str  ┆ f64        ┆ f64       ┆ f64      ┆ f64        ┆ f64        ┆ str       │
╞══════╪════════════╪═══════════╪══════════╪════════════╪════════════╪═══════════╡
│ 1    ┆ 122.563395 ┆ 62.464222 ┆ 0.064484 ┆ 107.07267  ┆ 138.05412  ┆ Mean      │
│ 10   ┆ 124.18042  ┆ 34.274651 ┆ 0.047145 ┆ 112.705687 ┆ 135.655153 ┆ Mean      │
│ 11   ┆ 112.504598 ┆ 34.088246 ┆ 0.051896 ┆ 101.061111 ┆ 123.948086 ┆ Mean      │
│ 12   ┆ 131.257976 ┆ 30.895556 ┆ 0.042347 ┆ 120.363557 ┆ 142.152395 ┆ Mean      │
│ 2    ┆ 123.515319 ┆ 61.952696 ┆ 0.063725 ┆ 108.088152 ┆ 138.942486 ┆ Mean      │
└──────┴────────────┴───────────┴──────────┴────────────┴────────────┴───────────┘

Maximum likelihood (ML) estimation

res_ml = model.eblup(
    y="CornHec",
    x=["CornPix", "SoyBeansPix"],
    area="County",
    pop_data=pop,
    pop_size="PopnSegments",
    method="ML",
    mse="analytical",
    intercept=True,
    drop_nulls=True,
)

print(res_ml.stats)
SAEStats(method=ML, n_areas=12, n_obs=37, sigma_u=6.9134, icc=0.1457, logLik=-156.4413)

Bootstrap MSE

res_boot = model.eblup(
    y="CornHec",
    x=["CornPix", "SoyBeansPix"],
    area="County",
    pop_data=pop,
    method="REML",
    mse="bootstrap",
    n_reps=200,
    batch_size=50,
    rstate=42,
    progress_bar=False,
)

pred_boot = res_boot.to_polars("predictions")
print(pred_boot.head())
  Batch 1/4 completed.
  Batch 2/4 completed.
  Batch 3/4 completed.
  Batch 4/4 completed.
shape: (5, 7)
┌──────┬────────────┬───────────┬──────────┬────────────┬────────────┬───────────┐
│ area ┆ pred       ┆ mse       ┆ cv       ┆ lci        ┆ uci        ┆ indicator │
│ ---  ┆ ---        ┆ ---       ┆ ---      ┆ ---        ┆ ---        ┆ ---       │
│ str  ┆ f64        ┆ f64       ┆ f64      ┆ f64        ┆ f64        ┆ str       │
╞══════╪════════════╪═══════════╪══════════╪════════════╪════════════╪═══════════╡
│ 1    ┆ 122.563395 ┆ 71.965985 ┆ 0.069215 ┆ 105.936173 ┆ 139.190618 ┆ Mean      │
│ 10   ┆ 124.18042  ┆ 40.901878 ┆ 0.051501 ┆ 111.645323 ┆ 136.715517 ┆ Mean      │
│ 11   ┆ 112.504598 ┆ 41.200252 ┆ 0.057053 ┆ 99.923864  ┆ 125.085333 ┆ Mean      │
│ 12   ┆ 131.257976 ┆ 36.970289 ┆ 0.046323 ┆ 119.340549 ┆ 143.175403 ┆ Mean      │
│ 2    ┆ 123.515319 ┆ 96.945001 ┆ 0.079715 ┆ 104.217031 ┆ 142.813607 ┆ Mean      │
└──────┴────────────┴───────────┴──────────┴────────────┴────────────┴───────────┘

Data validation and missing values

Unit-level models are sensitive to missing data. Setting drop_nulls=True removes rows with missing values in variables used by the model.

Common outputs you will use

After calling eblup(...), the returned results object provides:

  • results.stats — fitted statistics,
  • results.fixed_effects — fixed effects estimates,
  • results.random_effects — variance components,
  • results.to_polars("predictions") — EBLUPs and MSE.

Model diagnostics

CautionCOMING SOON!!!

Diagnostics for unit-level models will be added in a future release.

What’s next

Next, we will review the EB version of the unit level model: Molina-Rao model