from pathlib import Path
import polars as pl
import svy
import svy_sae as saeFay-Herriot Model Tutorial - Area-Level SAE
Fay-Herriot model, area-level estimation, SAE Python, small area estimation tutorial
What this tutorial covers
This tutorial introduces area-level small area estimation using the Fay–Herriot (FH) model in svy-sae. We show two ways to use the API:
- Aggregated mode: you provide one row per area with the direct estimate and its sampling variance.
- Sample mode: you provide a
svy.Sampleobject and letsvy-saecompute the required aggregates.
In both cases, we fit the FH model and extract:
- FH EBLUP predictions (
pred) - Analytical MSE (Prasad–Rao)
- Bootstrap MSE (parametric bootstrap)
Imports
Milk expenditure data
We use the Milk Expenditure dataset commonly used in SAE tutorials (43 small areas). In the examples below, the dataset is assumed to be available as a CSV file.
Load the data
milk = svy.load_dataset(name="milk", limit=None)
print(f"Milk expenditure data : {milk.head()}")Milk expenditure data : shape: (5, 7)
┌───────────┬─────┬───────┬───────┬───────┬───────────┬──────────┐
│ SmallArea ┆ ni ┆ yi ┆ SD ┆ CV ┆ MajorArea ┆ variance │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ f64 ┆ f64 ┆ str ┆ f64 │
╞═══════════╪═════╪═══════╪═══════╪═══════╪═══════════╪══════════╡
│ 1 ┆ 191 ┆ 1.099 ┆ 0.163 ┆ 0.148 ┆ 1 ┆ 0.026569 │
│ 2 ┆ 633 ┆ 1.075 ┆ 0.08 ┆ 0.074 ┆ 1 ┆ 0.0064 │
│ 3 ┆ 597 ┆ 1.105 ┆ 0.083 ┆ 0.075 ┆ 1 ┆ 0.006889 │
│ 4 ┆ 221 ┆ 0.628 ┆ 0.109 ┆ 0.174 ┆ 1 ┆ 0.011881 │
│ 5 ┆ 195 ┆ 0.753 ┆ 0.119 ┆ 0.158 ┆ 1 ┆ 0.014161 │
└───────────┴─────┴───────┴───────┴───────┴───────────┴──────────┘
At a minimum, the dataset must contain the following columns, with one row per small area:
SmallArea: area identifieryi: direct (design-based) estimate of the target parameter for the small areavariance: estimated sampling variance of the direct estimator (computed as$SD^2$)MajorArea: categorical covariate defining the major area to which each small area belongs
1) Aggregated mode (one row per area)
In aggregated mode, your input must contain one row per small area with:
- the direct estimate (e.g.
yi) - its sampling variance (e.g.
variance) - one or more covariates (e.g.
MajorArea)
Fit the FH model (REML + analytical MSE)
model = sae.AreaLevel(milk)
res_pr = model.fh(
y="yi",
x=svy.Cat("MajorArea", ref=1),
variance="variance",
area="SmallArea",
method="REML",
mse="prasad_rao",
)
print(res_pr)╭──────────── SAE: Fay-Herriot (MSE: Prasad Rao) - Area Level ─────────────╮ │ Method REML Log Likelihood 12.6775 │ │ No. Areas 43 AIC -15.3549 │ │ No. Obs 43 BIC -6.5489 │ │ ICC 0.5134 KIC -10.3549 │ │ │ │ │ │ Term Estimate Std.Err z p-value [0.025 0.975] │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ Intercept 0.9682 0.0694 13.96 0.0000 0.8322 1.1041 │ │ MajorArea_2 0.1328 0.1030 1.29 0.1974 -0.0691 0.3347 │ │ MajorArea_3 0.2269 0.0923 2.46 0.0140 0.0460 0.4079 │ │ MajorArea_4 -0.2413 0.0816 -2.96 0.0031 -0.4013 -0.0813 │ │ │ │ │ │ Random Effects: Area (Normal) Sigma: 0.1362 Sigma²: 0.0186 │ ╰──────────────────────────────────────────────────────────────────────────╯
Fitted statistics
print(res_pr.stats)SAEStats(method=REML, n_areas=43, n_obs=43, sigma_u=0.1362, icc=0.5134, logLik=12.6775)
Inspect fitted parameters
# Fixed effects (includes Intercept + encoded categorical terms)
for fe in res_pr.fixed_effects:
print(fe)FixedEffect(term='Intercept', est=0.9682, se=0.0694, z=13.96, p=0)
FixedEffect(term='MajorArea_2', est=0.1328, se=0.1030, z=1.29, p=0.1974)
FixedEffect(term='MajorArea_3', est=0.2269, se=0.0923, z=2.46, p=0.01397)
FixedEffect(term='MajorArea_4', est=-0.2413, se=0.0816, z=-2.96, p=0.003112)
# Random effects summary (area-level variance component)
print(res_pr.random_effects)RandomEffect(term='Area', dist=Normal, sigma=0.1362)
FH predictions and analytical MSE
svy-sae returns predictions as a structured result object. Use to_polars("predictions") to extract a table.
pred_pr = res_pr.to_polars("predictions")
print(pred_pr.head())shape: (5, 7)
┌──────┬──────────┬──────────┬──────────┬──────────┬──────────┬───────────┐
│ area ┆ pred ┆ mse ┆ cv ┆ lci ┆ uci ┆ indicator │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞══════╪══════════╪══════════╪══════════╪══════════╪══════════╪═══════════╡
│ 1 ┆ 1.021971 ┆ 0.01346 ┆ 0.113524 ┆ 0.794575 ┆ 1.249367 ┆ mean │
│ 2 ┆ 1.047602 ┆ 0.005373 ┆ 0.069969 ┆ 0.903934 ┆ 1.19127 ┆ mean │
│ 3 ┆ 1.067951 ┆ 0.005702 ┆ 0.070707 ┆ 0.919949 ┆ 1.215954 ┆ mean │
│ 4 ┆ 0.760817 ┆ 0.008542 ┆ 0.121477 ┆ 0.57967 ┆ 0.941963 ┆ mean │
│ 5 ┆ 0.846157 ┆ 0.00958 ┆ 0.115671 ┆ 0.654321 ┆ 1.037993 ┆ mean │
└──────┴──────────┴──────────┴──────────┴──────────┴──────────┴───────────┘
The predictions table typically includes:
area: area identifier (aligned witharea=...)pred: FH EBLUP estimate for the areamse: estimated MSE for the area (here: Prasad–Rao)
Bootstrap MSE (same fitted model, bootstrap uncertainty)
You can request bootstrap MSE by setting mse="bootstrap" and specifying the number of replicates. Point predictions are the same (same fitted parameters), but MSE will differ.
res_boot = model.fh(
y="yi",
x=svy.Cat("MajorArea", ref=1),
variance="variance",
area="SmallArea",
method="REML",
mse="bootstrap",
n_reps=200, # increase for more stable results (e.g., 500–2000)
rstate=12345,
)
pred_boot = res_boot.to_polars("predictions")
pred_boot.head()Bootstrap FH (Batch size: 1) (Batch size: 1): 0%| | 0/200 [00:00<?, ?batch/s]Bootstrap FH (Batch size: 1) (Batch size: 1): 0%| | 1/200 [00:01<03:21, 1.01s/batch]Bootstrap FH (Batch size: 1) (Batch size: 1): 100%|██████████| 200/200 [00:01<00:00, 190.73batch/s]
| area | pred | mse | cv | lci | uci | indicator |
|---|---|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 | f64 | str |
| "1" | 1.021971 | 0.013012 | 0.111618 | 0.798393 | 1.245548 | "mean" |
| "2" | 1.047602 | 0.004952 | 0.067173 | 0.909675 | 1.185529 | "mean" |
| "3" | 1.067951 | 0.005516 | 0.069547 | 0.922378 | 1.213525 | "mean" |
| "4" | 0.760817 | 0.007901 | 0.116835 | 0.586592 | 0.935041 | "mean" |
| "5" | 0.846157 | 0.009235 | 0.113572 | 0.657802 | 1.034512 | "mean" |
Compare analytical vs bootstrap MSE
cmp = (
pred_pr
.select(["area", pl.col("pred").alias("pred_fh"), pl.col("mse").alias("mse_prasad_rao")])
.join(
pred_boot.select(["area", pl.col("mse").alias("mse_bootstrap")]),
on="area",
how="inner",
)
.with_columns(
(pl.col("mse_bootstrap") - pl.col("mse_prasad_rao")).alias("mse_diff"),
(pl.col("mse_bootstrap") / pl.col("mse_prasad_rao")).alias("mse_ratio"),
)
)
cmp.head()| area | pred_fh | mse_prasad_rao | mse_bootstrap | mse_diff | mse_ratio |
|---|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 | f64 |
| "1" | 1.021971 | 0.01346 | 0.013012 | -0.000448 | 0.966699 |
| "2" | 1.047602 | 0.005373 | 0.004952 | -0.000421 | 0.921675 |
| "3" | 1.067951 | 0.005702 | 0.005516 | -0.000186 | 0.967447 |
| "4" | 0.760817 | 0.008542 | 0.007901 | -0.00064 | 0.925033 |
| "5" | 0.846157 | 0.00958 | 0.009235 | -0.000345 | 0.964038 |
A quick sanity check:
pred_fhshould match across MSE methods (same fitted parameters)mse_bootstrapshould be in the same ballpark asmse_prasad_rao, but not identical
2) Sample mode (start from a survey sample)
In sample mode, you start from a svy.Sample object (unit-level records). svy-sae can then: - compute direct estimates by area, and - compute the associated sampling variances (using the survey design),
before fitting the FH model.
The exact sample setup (weights, strata/psu, etc.) depends on your survey design and data. The code below shows the shape of the workflow.
Example workflow (template)
import polars as pl
import svy
import svy_sae as sas
# 1) Load unit-level survey data
unit = svy.io.read_csv("data/milk_unit.csv")
# 2) Build a survey sample object (example fields; adapt to your design)
# - weight column
# - strata / psu if applicable
sample = svy.Sample(
unit,
weight="wgt",
strata="strata", # optional
psu="psu", # optional
)
# 3) Fit FH directly from the sample
# - y: unit-level outcome OR a direct-estimate target depending on your workflow
# - area: the small-area identifier on unit records
# - x: covariates (unit or area-level, depending on how you structure them)
model = sae.AreaLevel(sample)
res = model.fh(
y="y",
x=["x1", "x2"],
area="SmallArea",
method="REML",
mse="prasad_rao",
)
pred = res.to_polars("predictions")
pred.head()Notes for sample mode
- Use sample mode when you want
svy-saeto compute the aggregated inputs (direct estimates + sampling variances) from a design-awaresvy.Sample. - Use aggregated mode when you already have reliable direct estimates and sampling variances (e.g., produced by another pipeline) and want a clean FH fit.
Common outputs you will use
After calling fh(...), the returned results object typically provides:
results.stats— model statistics (e.g., log-likelihood, AIC/BIC, σᵤ, ICC, etc.)results.fixed_effects— fixed effects estimates and standard errorsresults.random_effects— random effect component estimatesresults.to_polars("predictions")— area predictions and MSE
Model diagnostics
Diagnostics for area-level models will be added in a future release.
What’s next
Next, we will review the basic unit level model: Battese–Harter–Fuller (BHF) model