Fay-Herriot Model Tutorial - Area-Level SAE

Learn how to implement the Fay-Herriot model for area-level Small Area Estimation in Python using svy-sae. Includes code examples and best practices.

Keywords

Fay-Herriot model, area-level estimation, SAE Python, small area estimation tutorial

What this tutorial covers

This tutorial introduces area-level small area estimation using the Fay–Herriot (FH) model in svy-sae. We show two ways to use the API:

Aggregated mode: you provide one row per area with the direct estimate and its sampling variance.
Sample mode: you provide a svy.Sample object and let svy-sae compute the required aggregates.

In both cases, we fit the FH model and extract:

FH EBLUP predictions (pred)
Analytical MSE (Prasad–Rao)
Bootstrap MSE (parametric bootstrap)

Imports

from pathlib import Path

import polars as pl
import svy

import svy_sae as sae

Milk expenditure data

We use the Milk Expenditure dataset commonly used in SAE tutorials (43 small areas). In the examples below, the dataset is assumed to be available as a CSV file.

Load the data

milk = svy.load_dataset(name="milk", limit=None)

print(f"Milk expenditure data : {milk.head()}")

Milk expenditure data : shape: (5, 7)
┌───────────┬─────┬───────┬───────┬───────┬───────────┬──────────┐
│ SmallArea ┆ ni  ┆ yi    ┆ SD    ┆ CV    ┆ MajorArea ┆ variance │
│ ---       ┆ --- ┆ ---   ┆ ---   ┆ ---   ┆ ---       ┆ ---      │
│ str       ┆ i64 ┆ f64   ┆ f64   ┆ f64   ┆ str       ┆ f64      │
╞═══════════╪═════╪═══════╪═══════╪═══════╪═══════════╪══════════╡
│ 1         ┆ 191 ┆ 1.099 ┆ 0.163 ┆ 0.148 ┆ 1         ┆ 0.026569 │
│ 2         ┆ 633 ┆ 1.075 ┆ 0.08  ┆ 0.074 ┆ 1         ┆ 0.0064   │
│ 3         ┆ 597 ┆ 1.105 ┆ 0.083 ┆ 0.075 ┆ 1         ┆ 0.006889 │
│ 4         ┆ 221 ┆ 0.628 ┆ 0.109 ┆ 0.174 ┆ 1         ┆ 0.011881 │
│ 5         ┆ 195 ┆ 0.753 ┆ 0.119 ┆ 0.158 ┆ 1         ┆ 0.014161 │
└───────────┴─────┴───────┴───────┴───────┴───────────┴──────────┘

At a minimum, the dataset must contain the following columns, with one row per small area:

SmallArea : area identifier
yi : direct (design-based) estimate of the target parameter for the small area
variance: estimated sampling variance of the direct estimator (computed as $SD^2$ )
MajorArea : categorical covariate defining the major area to which each small area belongs

1) Aggregated mode (one row per area)

In aggregated mode, your input must contain one row per small area with:

the direct estimate (e.g. yi)
its sampling variance (e.g. variance)
one or more covariates (e.g. MajorArea)

Fit the FH model (REML + analytical MSE)

model = sae.AreaLevel(milk)

res_pr = model.fh(
    y="yi",
    x=svy.Cat("MajorArea", ref=1),
    variance="variance",
    area="SmallArea",
    method="REML",
    mse="prasad_rao",
)

print(res_pr)

╭──────────── SAE: Fay-Herriot (MSE: Prasad Rao) - Area Level ─────────────╮
│ Method       REML  Log Likelihood   12.6775                              │
│ No. Areas      43  AIC             -15.3549                              │
│ No. Obs        43  BIC              -6.5489                              │
│ ICC        0.5134  KIC             -10.3549                              │
│                                                                          │
│                                                                          │
│  Term          Estimate   Std.Err       z   p-value    [0.025    0.975]  │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  │
│  Intercept       0.9682    0.0694   13.96    0.0000    0.8322    1.1041  │
│  MajorArea_2     0.1328    0.1030    1.29    0.1974   -0.0691    0.3347  │
│  MajorArea_3     0.2269    0.0923    2.46    0.0140    0.0460    0.4079  │
│  MajorArea_4    -0.2413    0.0816   -2.96    0.0031   -0.4013   -0.0813  │
│                                                                          │
│                                                                          │
│ Random Effects: Area (Normal) Sigma: 0.1362  Sigma²: 0.0186              │
╰──────────────────────────────────────────────────────────────────────────╯

Fitted statistics

print(res_pr.stats)

SAEStats(method=REML, n_areas=43, n_obs=43, sigma_u=0.1362, icc=0.5134, logLik=12.6775)

Inspect fitted parameters

# Fixed effects (includes Intercept + encoded categorical terms)
for fe in res_pr.fixed_effects:
    print(fe)

FixedEffect(term='Intercept', est=0.9682, se=0.0694, z=13.96, p=0)
FixedEffect(term='MajorArea_2', est=0.1328, se=0.1030, z=1.29, p=0.1974)
FixedEffect(term='MajorArea_3', est=0.2269, se=0.0923, z=2.46, p=0.01397)
FixedEffect(term='MajorArea_4', est=-0.2413, se=0.0816, z=-2.96, p=0.003112)

# Random effects summary (area-level variance component)
print(res_pr.random_effects)

RandomEffect(term='Area', dist=Normal, sigma=0.1362)

FH predictions and analytical MSE

svy-sae returns predictions as a structured result object. Use to_polars("predictions") to extract a table.

pred_pr = res_pr.to_polars("predictions")
print(pred_pr.head())

shape: (5, 7)
┌──────┬──────────┬──────────┬──────────┬──────────┬──────────┬───────────┐
│ area ┆ pred     ┆ mse      ┆ cv       ┆ lci      ┆ uci      ┆ indicator │
│ ---  ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---       │
│ str  ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ str       │
╞══════╪══════════╪══════════╪══════════╪══════════╪══════════╪═══════════╡
│ 1    ┆ 1.021971 ┆ 0.01346  ┆ 0.113524 ┆ 0.794575 ┆ 1.249367 ┆ mean      │
│ 2    ┆ 1.047602 ┆ 0.005373 ┆ 0.069969 ┆ 0.903934 ┆ 1.19127  ┆ mean      │
│ 3    ┆ 1.067951 ┆ 0.005702 ┆ 0.070707 ┆ 0.919949 ┆ 1.215954 ┆ mean      │
│ 4    ┆ 0.760817 ┆ 0.008542 ┆ 0.121477 ┆ 0.57967  ┆ 0.941963 ┆ mean      │
│ 5    ┆ 0.846157 ┆ 0.00958  ┆ 0.115671 ┆ 0.654321 ┆ 1.037993 ┆ mean      │
└──────┴──────────┴──────────┴──────────┴──────────┴──────────┴───────────┘

The predictions table typically includes:

area : area identifier (aligned with area=...)
pred : FH EBLUP estimate for the area
mse : estimated MSE for the area (here: Prasad–Rao)

Bootstrap MSE (same fitted model, bootstrap uncertainty)

You can request bootstrap MSE by setting mse="bootstrap" and specifying the number of replicates. Point predictions are the same (same fitted parameters), but MSE will differ.

res_boot = model.fh(
    y="yi",
    x=svy.Cat("MajorArea", ref=1),
    variance="variance",
    area="SmallArea",
    method="REML",
    mse="bootstrap",
    n_reps=200,     # increase for more stable results (e.g., 500–2000)
    rstate=12345,
)

pred_boot = res_boot.to_polars("predictions")
pred_boot.head()

Bootstrap FH (Batch size: 1) (Batch size: 1):   0%|          | 0/200 [00:00<?, ?batch/s]Bootstrap FH (Batch size: 1) (Batch size: 1):   0%|          | 1/200 [00:01<03:21,  1.01s/batch]Bootstrap FH (Batch size: 1) (Batch size: 1): 100%|██████████| 200/200 [00:01<00:00, 190.73batch/s]

shape: (5, 7)

area	pred	mse	cv	lci	uci	indicator
str	f64	f64	f64	f64	f64	str
"1"	1.021971	0.013012	0.111618	0.798393	1.245548	"mean"
"2"	1.047602	0.004952	0.067173	0.909675	1.185529	"mean"
"3"	1.067951	0.005516	0.069547	0.922378	1.213525	"mean"
"4"	0.760817	0.007901	0.116835	0.586592	0.935041	"mean"
"5"	0.846157	0.009235	0.113572	0.657802	1.034512	"mean"

Compare analytical vs bootstrap MSE

cmp = (
    pred_pr
    .select(["area", pl.col("pred").alias("pred_fh"), pl.col("mse").alias("mse_prasad_rao")])
    .join(
        pred_boot.select(["area", pl.col("mse").alias("mse_bootstrap")]),
        on="area",
        how="inner",
    )
    .with_columns(
        (pl.col("mse_bootstrap") - pl.col("mse_prasad_rao")).alias("mse_diff"),
        (pl.col("mse_bootstrap") / pl.col("mse_prasad_rao")).alias("mse_ratio"),
    )
)

cmp.head()

shape: (5, 6)

area	pred_fh	mse_prasad_rao	mse_bootstrap	mse_diff	mse_ratio
str	f64	f64	f64	f64	f64
"1"	1.021971	0.01346	0.013012	-0.000448	0.966699
"2"	1.047602	0.005373	0.004952	-0.000421	0.921675
"3"	1.067951	0.005702	0.005516	-0.000186	0.967447
"4"	0.760817	0.008542	0.007901	-0.00064	0.925033
"5"	0.846157	0.00958	0.009235	-0.000345	0.964038

A quick sanity check:

pred_fh should match across MSE methods (same fitted parameters)
mse_bootstrap should be in the same ballpark as mse_prasad_rao, but not identical

2) Sample mode (start from a survey sample)

In sample mode, you start from a svy.Sample object (unit-level records). svy-sae can then: - compute direct estimates by area, and - compute the associated sampling variances (using the survey design),

before fitting the FH model.

The exact sample setup (weights, strata/psu, etc.) depends on your survey design and data. The code below shows the shape of the workflow.

Example workflow (template)

import polars as pl
import svy
import svy_sae as sas

# 1) Load unit-level survey data
unit = svy.io.read_csv("data/milk_unit.csv")

# 2) Build a survey sample object (example fields; adapt to your design)
#    - weight column
#    - strata / psu if applicable
sample = svy.Sample(
    unit,
    weight="wgt",
    strata="strata",   # optional
    psu="psu",         # optional
)

# 3) Fit FH directly from the sample
#    - y: unit-level outcome OR a direct-estimate target depending on your workflow
#    - area: the small-area identifier on unit records
#    - x: covariates (unit or area-level, depending on how you structure them)
model = sae.AreaLevel(sample)

res = model.fh(
    y="y",
    x=["x1", "x2"],
    area="SmallArea",
    method="REML",
    mse="prasad_rao",
)

pred = res.to_polars("predictions")
pred.head()

Notes for sample mode

Use sample mode when you want svy-sae to compute the aggregated inputs (direct estimates + sampling variances) from a design-aware svy.Sample.
Use aggregated mode when you already have reliable direct estimates and sampling variances (e.g., produced by another pipeline) and want a clean FH fit.

Common outputs you will use

After calling fh(...), the returned results object typically provides:

results.stats — model statistics (e.g., log-likelihood, AIC/BIC, σᵤ, ICC, etc.)
results.fixed_effects — fixed effects estimates and standard errors
results.random_effects — random effect component estimates
results.to_polars("predictions") — area predictions and MSE

Model diagnostics

COMING SOON!!!

Diagnostics for area-level models will be added in a future release.

What’s next

Next, we will review the basic unit level model: Battese–Harter–Fuller (BHF) model