A Quick Tour

A guided overview of the svy-sae library

This page gives you a fast, practical overview of svy-sae: what it does, how the API is structured, and what a typical workflow looks like.

The key idea is to choose the modeling path based on the type of data you have:

Area-level models (Fay–Herriot) → AreaLevel.fh()
- Aggregated workflow: one row per area with direct estimates and variances
- Microdata workflow: start from survey records and let svy-sae derive the aggregates
Microdata + population auxiliaries → start with a unit-level model:
- UnitLevel.eblup() for linear targets (means/totals),
- UnitLevel.ebp() for nonlinear indicators (poverty, distributional measures).

Beta software

svy-sae is currently beta. Core APIs are stabilizing, and additional diagnostics/benchmarking features are actively being added.

Imports and API style

Both of the following styles work well. Pick one and use it consistently.

Option A: Direct class imports

from svy_sae.models import AreaLevel, UnitLevel

Option B: Namespace import

import svy_sae as sae

In the examples below, we use the namespace import:

import polars as pl
import svy
import svy_sae as sae

The results object at a glance

Most model-fitting calls return a SAEResults object that prints nicely and provides structured access to:

results.stats — fitted statistics (method, σᵤ, AIC/BIC, log-likelihood, etc.)
results.fixed_effects — fixed effects estimates (regression coefficients)
results.random_effects — variance component / random effects summary
results.to_polars("predictions") — predictions + uncertainty (MSE, CV, intervals)

print(results)
print(results.stats)

pred = results.to_polars("predictions")
pred.head()

1) Area-level models: Fay–Herriot (FH)

Use AreaLevel when you do not have access to unit-level records, but you do have one row per area with:

a direct estimate for each area,
its sampling variance,
one or more area-level covariates.

Example: FH from aggregated inputs

import polars as pl
import svy_sae as sae

df_area = pl.DataFrame(
    {
        "district_id": ["D1", "D2", "D3", "D4"],
        "direct_est": [0.12, 0.08, 0.25, 0.15],     # e.g., poverty rate
        "variance": [0.001, 0.004, 0.002, 0.001],   # sampling variance of the direct estimator
        "night_lights": [5.2, 8.1, 2.5, 4.8],
        "avg_income": [1200, 2500, 800, 1500],
    }
)

model = sae.AreaLevel(df_area)

# Analytical MSE (Prasad–Rao)
res_pr = model.fh(
    y="direct_est",
    x=["night_lights", "avg_income"],
    variance="variance",
    area="district_id",
    method="REML",
    mse="prasad_rao",
)

pred_pr = res_pr.to_polars("predictions")
pred_pr

Bootstrap MSE

You can request bootstrap MSE by setting mse="bootstrap" and specifying the number of replicates. Point predictions are typically the same (same fitted parameters), but MSE will differ.

res_boot = model.fh(
    y="direct_est",
    x=["night_lights", "avg_income"],
    variance="variance",
    area="district_id",
    method="REML",
    mse="bootstrap",
    n_reps=200,
    rstate=123,
)

pred_boot = res_boot.to_polars("predictions")
pred_boot

2) Unit-level models: BHF (EBLUP)

Use UnitLevel when you have:

survey microdata (unit records),
population auxiliary data for the units (or a census-like file).

eblup() is best when you need linear parameters such as area means or totals. It is typically faster than EBP because it does not require Monte Carlo integration for nonlinear indicators.

import polars as pl
import svy_sae as sae

sample_df = pl.read_csv("household_survey.csv")
pop_df = pl.read_csv("census_records.csv")

model = sae.UnitLevel(sample_df)

res = model.eblup(
    y="income",
    x=["education", "urban"],
    area="district_id",
    pop_data=pop_df,
    method="REML",
    mse="analytical",
    intercept=True,
    drop_nulls=True,
)

pred = res.to_polars("predictions")
pred.head()

Population coverage warning

If the population data only contain information for a subset of small areas, svy-sae will still predict for the areas that are present.

The model logs a warning message indicating that the population data do not cover all small areas observed in the sample. This warning is informational: it helps you verify coverage before interpreting results.

3) Unit-level models: Molina–Rao (EBP)

Use ebp() when you need nonlinear indicators (poverty headcount, poverty gap, etc.). This workflow typically uses:

a transformation (often Box–Cox for skewed income),
Monte Carlo integration for the indicator,
parametric bootstrap for MSE.

Example: EBP poverty indicator (bootstrap MSE)

import polars as pl
import svy_sae as sae
from svy_sae.core.enumerations import Indicator

sample_df = pl.read_csv("incomedata.csv")
pop_df = pl.read_csv("Xoutsamp.csv")

feature_cols = [
    "age2", "age3", "age4", "age5",
    "nat1",
    "educ1", "educ3",
    "labor1", "labor2",
]

# Poverty line (example): 60% of sample median income
income_median = sample_df["income"].median()
poverty_line = 0.6 * income_median

model = sae.UnitLevel(sample_df)

res = model.ebp(
    y="income",
    x=feature_cols,
    area="prov",
    pop_data=pop_df,
    indicators=[Indicator.POVERTY_GAP],
    threshold=poverty_line,
    transformation="boxcox",
    transform_shift=3600.0,
    boxcox_param=None,
    boxcox_bounds=(-2, 2),
    n_reps=200,   # bootstrap for MSE
    n_mc=200,     # Monte Carlo for indicator integration
    batch_size=10,
    rstate=123,
)

pred = res.to_polars("predictions")

# Filter results for a specific indicator
pred.filter(pl.col("indicator") == str(Indicator.POVERTY_GAP))

Where to go next

If you start from direct estimates + variances: go to Area-level models (FH).
If you start from microdata + population auxiliaries:
- use BHF/EBLUP for means/totals,
- use Molina–Rao/EBP for poverty and other nonlinear indicators.