Fay-Herriot Model Tutorial - Area-Level SAE

Learn how to implement the Fay-Herriot model for area-level Small Area Estimation in Python using svy-sae. Includes code examples and best practices.
Keywords

Fay-Herriot model, area-level estimation, SAE Python, small area estimation tutorial

What this tutorial covers

This tutorial introduces area-level small area estimation using the Fay–Herriot (FH) model in svy-sae. We show two ways to use the API:

  1. Aggregated mode: you provide one row per area with the direct estimate and its sampling variance.
  2. Sample mode: you provide a svy.Sample object and let svy-sae compute the required aggregates.

In both cases, we fit the FH model and extract:

  • FH EBLUP predictions (pred)
  • Analytical MSE (Prasad–Rao)
  • Bootstrap MSE (parametric bootstrap)

Imports

from pathlib import Path

import polars as pl
import svy

import svy_sae as sae

Milk expenditure data

We use the Milk Expenditure dataset commonly used in SAE tutorials (43 small areas). In the examples below, the dataset is assumed to be available as a CSV file.

Load the data

milk = svy.load_dataset(name="milk", limit=None)

print(f"Milk expenditure data : {milk.head()}")
Milk expenditure data : shape: (5, 7)
┌───────────┬─────┬───────┬───────┬───────┬───────────┬──────────┐
│ SmallArea ┆ ni  ┆ yi    ┆ SD    ┆ CV    ┆ MajorArea ┆ variance │
│ ---       ┆ --- ┆ ---   ┆ ---   ┆ ---   ┆ ---       ┆ ---      │
│ str       ┆ i64 ┆ f64   ┆ f64   ┆ f64   ┆ str       ┆ f64      │
╞═══════════╪═════╪═══════╪═══════╪═══════╪═══════════╪══════════╡
│ 1         ┆ 191 ┆ 1.099 ┆ 0.163 ┆ 0.148 ┆ 1         ┆ 0.026569 │
│ 2         ┆ 633 ┆ 1.075 ┆ 0.08  ┆ 0.074 ┆ 1         ┆ 0.0064   │
│ 3         ┆ 597 ┆ 1.105 ┆ 0.083 ┆ 0.075 ┆ 1         ┆ 0.006889 │
│ 4         ┆ 221 ┆ 0.628 ┆ 0.109 ┆ 0.174 ┆ 1         ┆ 0.011881 │
│ 5         ┆ 195 ┆ 0.753 ┆ 0.119 ┆ 0.158 ┆ 1         ┆ 0.014161 │
└───────────┴─────┴───────┴───────┴───────┴───────────┴──────────┘

At a minimum, the dataset must contain the following columns, with one row per small area:

  • SmallArea : area identifier
  • yi : direct (design-based) estimate of the target parameter for the small area
  • variance: estimated sampling variance of the direct estimator (computed as $SD^2$)
  • MajorArea : categorical covariate defining the major area to which each small area belongs

1) Aggregated mode (one row per area)

In aggregated mode, your input must contain one row per small area with:

  • the direct estimate (e.g. yi)
  • its sampling variance (e.g. variance)
  • one or more covariates (e.g. MajorArea)

Fit the FH model (REML + analytical MSE)

model = sae.AreaLevel(milk)

res_pr = model.fh(
    y="yi",
    x=svy.Cat("MajorArea", ref=1),
    variance="variance",
    area="SmallArea",
    method="REML",
    mse="prasad_rao",
)

print(res_pr)
╭──────────── SAE: Fay-Herriot (MSE: Prasad Rao) - Area Level ─────────────╮
 Method       REML  Log Likelihood   12.6775                              
 No. Areas      43  AIC             -15.3549                              
 No. Obs        43  BIC              -6.5489                              
 ICC        0.5134  KIC             -10.3549                              
                                                                          
                                                                          
  Term          Estimate   Std.Err       z   p-value    [0.025    0.975]  
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  
  Intercept       0.9682    0.0694   13.96    0.0000    0.8322    1.1041  
  MajorArea_2     0.1328    0.1030    1.29    0.1974   -0.0691    0.3347  
  MajorArea_3     0.2269    0.0923    2.46    0.0140    0.0460    0.4079  
  MajorArea_4    -0.2413    0.0816   -2.96    0.0031   -0.4013   -0.0813  
                                                                          
                                                                          
 Random Effects: Area (Normal) Sigma: 0.1362  Sigma²: 0.0186              
╰──────────────────────────────────────────────────────────────────────────╯

Fitted statistics

print(res_pr.stats)
SAEStats(method=REML, n_areas=43, n_obs=43, sigma_u=0.1362, icc=0.5134, logLik=12.6775)

Inspect fitted parameters

# Fixed effects (includes Intercept + encoded categorical terms)
for fe in res_pr.fixed_effects:
    print(fe)
FixedEffect(term='Intercept', est=0.9682, se=0.0694, z=13.96, p=0)
FixedEffect(term='MajorArea_2', est=0.1328, se=0.1030, z=1.29, p=0.1974)
FixedEffect(term='MajorArea_3', est=0.2269, se=0.0923, z=2.46, p=0.01397)
FixedEffect(term='MajorArea_4', est=-0.2413, se=0.0816, z=-2.96, p=0.003112)
# Random effects summary (area-level variance component)
print(res_pr.random_effects)
RandomEffect(term='Area', dist=Normal, sigma=0.1362)

FH predictions and analytical MSE

svy-sae returns predictions as a structured result object. Use to_polars("predictions") to extract a table.

pred_pr = res_pr.to_polars("predictions")
print(pred_pr.head())
shape: (5, 7)
┌──────┬──────────┬──────────┬──────────┬──────────┬──────────┬───────────┐
│ area ┆ pred     ┆ mse      ┆ cv       ┆ lci      ┆ uci      ┆ indicator │
│ ---  ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---       │
│ str  ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ str       │
╞══════╪══════════╪══════════╪══════════╪══════════╪══════════╪═══════════╡
│ 1    ┆ 1.021971 ┆ 0.01346  ┆ 0.113524 ┆ 0.794575 ┆ 1.249367 ┆ mean      │
│ 2    ┆ 1.047602 ┆ 0.005373 ┆ 0.069969 ┆ 0.903934 ┆ 1.19127  ┆ mean      │
│ 3    ┆ 1.067951 ┆ 0.005702 ┆ 0.070707 ┆ 0.919949 ┆ 1.215954 ┆ mean      │
│ 4    ┆ 0.760817 ┆ 0.008542 ┆ 0.121477 ┆ 0.57967  ┆ 0.941963 ┆ mean      │
│ 5    ┆ 0.846157 ┆ 0.00958  ┆ 0.115671 ┆ 0.654321 ┆ 1.037993 ┆ mean      │
└──────┴──────────┴──────────┴──────────┴──────────┴──────────┴───────────┘

The predictions table typically includes:

  • area : area identifier (aligned with area=...)
  • pred : FH EBLUP estimate for the area
  • mse : estimated MSE for the area (here: Prasad–Rao)

Bootstrap MSE (same fitted model, bootstrap uncertainty)

You can request bootstrap MSE by setting mse="bootstrap" and specifying the number of replicates. Point predictions are the same (same fitted parameters), but MSE will differ.

res_boot = model.fh(
    y="yi",
    x=svy.Cat("MajorArea", ref=1),
    variance="variance",
    area="SmallArea",
    method="REML",
    mse="bootstrap",
    n_reps=200,     # increase for more stable results (e.g., 500–2000)
    rstate=12345,
)

pred_boot = res_boot.to_polars("predictions")
pred_boot.head()
Bootstrap FH (Batch size: 1) (Batch size: 1):   0%|          | 0/200 [00:00<?, ?batch/s]Bootstrap FH (Batch size: 1) (Batch size: 1):   0%|          | 1/200 [00:01<03:21,  1.01s/batch]Bootstrap FH (Batch size: 1) (Batch size: 1): 100%|██████████| 200/200 [00:01<00:00, 190.73batch/s]
shape: (5, 7)
area pred mse cv lci uci indicator
str f64 f64 f64 f64 f64 str
"1" 1.021971 0.013012 0.111618 0.798393 1.245548 "mean"
"2" 1.047602 0.004952 0.067173 0.909675 1.185529 "mean"
"3" 1.067951 0.005516 0.069547 0.922378 1.213525 "mean"
"4" 0.760817 0.007901 0.116835 0.586592 0.935041 "mean"
"5" 0.846157 0.009235 0.113572 0.657802 1.034512 "mean"

Compare analytical vs bootstrap MSE

cmp = (
    pred_pr
    .select(["area", pl.col("pred").alias("pred_fh"), pl.col("mse").alias("mse_prasad_rao")])
    .join(
        pred_boot.select(["area", pl.col("mse").alias("mse_bootstrap")]),
        on="area",
        how="inner",
    )
    .with_columns(
        (pl.col("mse_bootstrap") - pl.col("mse_prasad_rao")).alias("mse_diff"),
        (pl.col("mse_bootstrap") / pl.col("mse_prasad_rao")).alias("mse_ratio"),
    )
)

cmp.head()
shape: (5, 6)
area pred_fh mse_prasad_rao mse_bootstrap mse_diff mse_ratio
str f64 f64 f64 f64 f64
"1" 1.021971 0.01346 0.013012 -0.000448 0.966699
"2" 1.047602 0.005373 0.004952 -0.000421 0.921675
"3" 1.067951 0.005702 0.005516 -0.000186 0.967447
"4" 0.760817 0.008542 0.007901 -0.00064 0.925033
"5" 0.846157 0.00958 0.009235 -0.000345 0.964038

A quick sanity check:

  • pred_fh should match across MSE methods (same fitted parameters)
  • mse_bootstrap should be in the same ballpark as mse_prasad_rao, but not identical

2) Sample mode (start from a survey sample)

In sample mode, you start from a svy.Sample object (unit-level records). svy-sae can then: - compute direct estimates by area, and - compute the associated sampling variances (using the survey design),

before fitting the FH model.

The exact sample setup (weights, strata/psu, etc.) depends on your survey design and data. The code below shows the shape of the workflow.

Example workflow (template)

import polars as pl
import svy
import svy_sae as sas

# 1) Load unit-level survey data
unit = svy.io.read_csv("data/milk_unit.csv")

# 2) Build a survey sample object (example fields; adapt to your design)
#    - weight column
#    - strata / psu if applicable
sample = svy.Sample(
    unit,
    weight="wgt",
    strata="strata",   # optional
    psu="psu",         # optional
)

# 3) Fit FH directly from the sample
#    - y: unit-level outcome OR a direct-estimate target depending on your workflow
#    - area: the small-area identifier on unit records
#    - x: covariates (unit or area-level, depending on how you structure them)
model = sae.AreaLevel(sample)

res = model.fh(
    y="y",
    x=["x1", "x2"],
    area="SmallArea",
    method="REML",
    mse="prasad_rao",
)

pred = res.to_polars("predictions")
pred.head()

Notes for sample mode

  • Use sample mode when you want svy-sae to compute the aggregated inputs (direct estimates + sampling variances) from a design-aware svy.Sample.
  • Use aggregated mode when you already have reliable direct estimates and sampling variances (e.g., produced by another pipeline) and want a clean FH fit.

Common outputs you will use

After calling fh(...), the returned results object typically provides:

  • results.stats — model statistics (e.g., log-likelihood, AIC/BIC, σᵤ, ICC, etc.)
  • results.fixed_effects — fixed effects estimates and standard errors
  • results.random_effects — random effect component estimates
  • results.to_polars("predictions") — area predictions and MSE

Model diagnostics

CautionCOMING SOON!!!

Diagnostics for area-level models will be added in a future release.

What’s next

Next, we will review the basic unit level model: Battese–Harter–Fuller (BHF) model