svy: Design-Based Survey Analysis in Python

Survey Methods

Python

Data Science

svy is an open-source Python library for complex survey data design and analysis—bringing stratified, clustered, and weighted estimators to the modern data-science ecosystem.

Author

Mamadou S. Diallo, Ph.D.

Published

November 1, 2025

Modified

January 10, 2026

Keywords

survey analysis in Python, complex survey data, survey weights, design-based inference, Taylor linearization, R survey package, survey statistics

Work in Progress

svy is under active development. Some features described here are still being implemented or refined. Feedback and contributions are welcome as the library continues to evolve.

Bringing Design-Based Survey Analysis to Python with `svy`

For years, survey analysis in Python has lagged behind R’s mature ecosystem. R’s survey package set the gold standard for design-based inference, and its tidyverse-friendly wrapper, srvyr, made it easier for survey statisticians to integrate complex sampling methods into modern data workflows.

To fill that gap, I initially developed samplics, one of the first Python packages for survey statistics. While it introduced foundational features such as stratification, clustering, and weighting, it became clear that a more comprehensive, scalable, and modern framework was needed.

That next step is svy — developed under the Samplics umbrella — a fully reimagined open-source library that brings design-based survey statistics into the modern Python ecosystem. With svy, researchers can design, wrangle, analyze, and report survey data using statistically principled methods consistent with official-statistics standards.

svy is more than a statistical engine—it’s a comprehensive framework for survey research. It supports not only complex survey estimators, but also metadata management, data-pipeline integration, and reproducible reporting. By combining methodological rigor with modern computational tools, svy brings survey methodology fully into the Python era.

The Problem: Survey Analysis Is Stuck in Silos

Most survey researchers work with complex sample designs—stratification, clustering, multistage selection, and replicate weights—often in the context of national health, education, or labor-force surveys. For decades, these analyses have depended on specialized or legacy tools such as SAS, Stata, or R’s survey package.

Meanwhile, Python has become the dominant language for data engineering, machine learning, and reproducible analytics, yet it has long lacked a first-class framework for design-based survey inference. The result is a fragmented workflow: survey statisticians on one side, data scientists on the other—converting files, duplicating effort, and losing methodological transparency along the way.

svy changes that. It brings the full design-based framework for survey analysis into the same Python ecosystem that already powers modern data pipelines and machine-learning systems.

The Innovation: Introducing `svy`

svy is a modern, extensible Python library for survey statistics, built from the ground up with clarity, performance, and interoperability in mind.

At its core, svy provides:

Data I/O: read and write support for SAS, SPSS, and Stata files
Sample selection: equal and unequal probability sampling techniques
Weight adjustment: nonresponse adjustment, post-stratification, raking, and calibration (GREG)
Design representation: objects that capture stratification, clustering, and weighting
Estimation: design-based estimators for means, totals, ratios, proportions, and regression models
Variance estimation: Taylor linearization and replicate-weight methods (BRR, Jackknife, Bootstrap)
Performance engine: built on NumPy, Polars, msgspec, and Rust, offering speed and strict typing
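To make the weight-adjustment item above concrete, here is a minimal sketch of raking (iterative proportional fitting) in plain Python. It is not svy's implementation or API: the function names `rake` and `weighted_total`, the toy respondents, and the population totals are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def weighted_total(rows, weights, var, category):
    """Sum of weights over respondents whose `var` equals `category`."""
    return sum(w for row, w in zip(rows, weights) if row[var] == category)


def rake(rows, weights, margins, max_iter=100, tol=1e-10):
    """Scale weights until weighted totals match each known margin.

    rows    : list of dicts, one per respondent
    weights : starting (design) weights, same length as rows
    margins : {variable: {category: known population total}}
    Assumes every category in `margins` appears at least once in `rows`.
    """
    w = list(weights)
    for _ in range(max_iter):
        max_change = 0.0
        for var, targets in margins.items():
            # Current weighted total in each category of this variable
            totals = defaultdict(float)
            for row, wi in zip(rows, w):
                totals[row[var]] += wi
            # Rescale each weight so the category totals hit the targets
            for i, row in enumerate(rows):
                factor = targets[row[var]] / totals[row[var]]
                w[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        # Stop once a full pass leaves the weights essentially unchanged
        if max_change < tol:
            break
    return w


# Hypothetical toy sample: five respondents with equal design weights
rows = [
    {"sex": "F", "age": "young"},
    {"sex": "F", "age": "old"},
    {"sex": "M", "age": "young"},
    {"sex": "M", "age": "old"},
    {"sex": "M", "age": "old"},
]
base_weights = [10.0] * 5

# Hypothetical known population totals (each margin sums to 60)
margins = {
    "sex": {"F": 32.0, "M": 28.0},
    "age": {"young": 24.0, "old": 36.0},
}

raked = rake(rows, base_weights, margins)
for var, targets in margins.items():
    for category, target in targets.items():
        total = weighted_total(rows, raked, var, category)
        print(var, category, round(total, 4), "target:", target)
```

Each pass adjusts one margin at a time, so the loop repeats until all margins agree with their targets at once. Production tools typically add safeguards this sketch omits, such as weight trimming and checks for empty categories.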

svy makes it possible to perform survey analysis in Python without giving up methodological precision.

Why Python?

Survey analysis shouldn’t live apart from data science. Python has become the lingua franca of modern analytics—powering everything from data engineering to machine learning—yet survey methodology has often remained isolated from that ecosystem.

By building on Python, svy bridges this gap and integrates naturally with the tools researchers already use. For instance, researchers can use svy for:

Machine learning and modeling — combine design weights with scikit-learn, PyTorch, or other ML frameworks to build predictive models that respect complex survey design.
Data engineering — manage and process large-scale survey microdata from Parquet, Arrow, or BigQuery using Polars for speed and scalability.
Visualization and reporting — produce reproducible Quarto reports, Plotly dashboards, or interactive Marimo notebooks for analysis and dissemination.

This unified approach allows survey statisticians, data scientists, and policymakers to work in the same environment—sharing code, data, and results—while preserving the methodological rigor of design-based inference.

Example: Reproducing the MEPS Workshop in Python

In a recent AHRQ MEPS workshop, researchers demonstrated design-based analysis using R. We recreated the same workflow in Python with svy—and obtained identical estimates.

👉 Explore the live notebook: MEPS 2025 Workshop (Python)

from svy import Sample, Design

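# fyc23 is assumed to hold the MEPS 2023 full-year consolidated data,
# already loaded as a data frame (the loading step is not shown here).
meps = Sample(
    data=fyc23,
    # Design variables: variance stratum, variance PSU, and person-level weight
    design=Design(stratum="VARSTR", psu="VARPSU", wgt="PERWT"),
)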

# Estimate the average 2023 total expenditure
result = meps.estimation.mean("TOTEXP23")
print(result)
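For readers curious about what a design-based mean involves, the sketch below computes a weighted mean and its Taylor-linearized standard error for a stratified, clustered sample in plain Python. It is an illustration of the general technique under the common with-replacement PSU approximation (no finite-population correction), not svy's internal code. The function name `linearized_mean` and the toy values are hypothetical and are not MEPS data.

```python
from collections import defaultdict
from math import sqrt


def linearized_mean(y, w, stratum, psu):
    """Weighted mean and Taylor-linearized standard error.

    Treats PSUs as sampled with replacement within strata and
    requires at least two PSUs per stratum.
    """
    w_sum = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / w_sum

    # Linearized score of the ratio estimator, summed to PSU level.
    # PSU labels are keyed by stratum, so labels may repeat across strata.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, j in zip(y, w, stratum, psu):
        psu_totals[h][j] += wi * (yi - mean) / w_sum

    variance = 0.0
    for h, totals in psu_totals.items():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError(f"stratum {h!r} has fewer than 2 PSUs")
        z_bar = sum(totals.values()) / n_h
        # Between-PSU variability within the stratum
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in totals.values())
    return mean, sqrt(variance)


# Hypothetical toy data: 2 strata, 2 PSUs per stratum, 2 units per PSU
y = [10, 12, 20, 22, 30, 34, 40, 44]
w = [1, 1, 1, 1, 2, 2, 2, 2]
stratum = [1, 1, 1, 1, 2, 2, 2, 2]
psu = [1, 1, 2, 2, 1, 1, 2, 2]

mean, se = linearized_mean(y, w, stratum, psu)
print(f"mean = {mean:.4f}, se = {se:.4f}")
```

The key point is that the variance comes from differences between PSU-level totals within each stratum, not from differences between individual records, which is why ignoring the design usually understates uncertainty for clustered samples.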

For more examples, visit the svy documentation.

Open Source and Interoperable

svy is part of the growing svyLab ecosystem, an expanding collection of mostly open-source projects that bring together survey methodology, data engineering, and modern analytics. The table below shows the current state of the ecosystem:

Component	Description
svy	Core Python library for complex survey design and analysis
svy-io	Rust-powered I/O layer for reading SPSS, Stata, and SAS files using the ReadStat C library
svyLab	Cloud platform for running, visualizing, and sharing survey analyses
svy-sae (coming soon)	Small Area Estimation (SAE)
svy-agent (coming soon)	AI assistant that helps design, select, weight, analyze, and report surveys interactively

These components are interoperable, free, and designed to support real-world survey workflows—from national household surveys to program evaluations.

Installation and Getting Started

Install the pre-release version of svy directly from PyPI:

pip install "svy[report]"

Then explore the documentation: 👉 svylab.com/docs/svy

Within minutes, you can:

Define a complex survey design
Estimate weighted means, totals, or regressions
Adjust weights and estimate variances with Taylor linearization or replicate weights

All within Python — no additional license required.

The Vision: Lowering the Barriers to Survey Research

The goal of svy goes beyond code.

We aim to democratize access to survey methodology by reducing both the cost and complexity barriers that have long limited researchers from applying design-based principles correctly and efficiently.

A key part of that vision is enabling seamless integration between survey methods and modern data ecosystems. svy makes it easier to combine traditional survey frameworks with big data sources and other auxiliary information to strengthen estimation and inference.

Design-based inference remains the foundation of credible survey statistics, underpinning many aspects of official statistics and evidence-based policymaking. svy brings that tradition forward, making it accessible, transparent, and ready for the data landscape of the future.

FAQ

What is svy in Python? svy is an open-source Python library for survey design and analysis. It supports data wrangling, sample selection, sample weighting, and design-based estimation—covering weights, stratification, clustering, and replicate-weight variance estimation. It also includes modules for categorical data analysis, general linear models, and more.

Is svy similar to R’s survey package? Yes. It implements comparable statistical estimators but is built with modern Python tools for speed, extensibility, and interoperability. Compared to R’s survey package, svy is broader in scope: it includes functionality such as sample selection that is not part of survey itself and is typically spread across other packages in the R ecosystem.

Can I use svy for official household surveys? Absolutely. It’s designed for household, health, labor-force, demographic, and business surveys—any dataset with a complex sample design.

Learn More

Documentation: svylab.com/docs/svy
MEPS Workshop: MEPS 2025 Tutorial
