svy: Python Package for Complex Survey Design and Analysis

Design-based inference for stratified, cluster, and multi-stage surveys

Documentation

Survey Analysis

Python

Statistics

Professional Python toolkit for designing complex surveys, calculating sample sizes, weighting survey data, and performing statistical analysis with stratified, cluster, and multi-stage sampling designs.

Author

Mamadou S. Diallo, Ph.D.

Published

January 18, 2026

Modified

January 18, 2026

Keywords

complex survey analysis Python, survey design Python package, stratified sampling Python, cluster sampling analysis, PPS sampling Python, survey weighting calibration, variance estimation surveys, design-based inference, Taylor linearization Python, survey bootstrap replication, multi-stage sampling design, probability sampling Python, survey statistics Python, NHANES BRFSS DHS analysis, official statistics Python

📌 Validation

Validation of Design-Based Survey Estimators in Python: A Comparison of svy and R survey

Wondering if you can trust svy for production work? This validation note demonstrates numerically identical results to R’s survey package across Taylor linearization, replication methods, and complex designs.

Read the full validation →

Development Status

svy is under active development with rapid improvements.

Core functionality for survey design, weighting, and variance estimation is stable and production-ready. APIs and documentation continue to mature based on user feedback.

📧 Feedback welcome: info@svylab.com
🐛 Report issues: GitHub Issues

What is svy?

svy is a comprehensive Python package for complex survey design and analysis. When surveys use complex sampling methods such as stratification, clustering, or unequal probability selection, analysis routines that assume simple random sampling produce incorrect standard errors and confidence intervals. svy implements design-based inference, so that population estimates and their variance estimates reflect the actual sample design.

Why Complex Survey Analysis Matters

Large-scale surveys (health surveys, labor force studies, demographic surveys) use complex sampling for practical and statistical reasons:

Stratified sampling increases precision by grouping similar units
Cluster sampling reduces field costs through geographic grouping
Multi-stage designs enable nationwide coverage efficiently
Probability proportional to size (PPS) improves efficiency for heterogeneous populations

Standard analysis methods assume simple random sampling. Applied to data from a complex design, they produce:

❌ Incorrect standard errors (usually underestimated)
❌ Invalid confidence intervals
❌ Wrong hypothesis test results
❌ Biased population inferences

svy corrects these problems by accounting for the actual survey design in all calculations.
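
To give a sense of how large the error can be, the sketch below applies Kish's approximation for the design effect of a cluster sample, deff = 1 + (m - 1) * rho, where m is the average cluster size and rho is the intraclass correlation. It does not use svy. The function names and the numbers (2000 respondents, 20 per cluster, rho = 0.05, a proportion of 0.30) are illustrative assumptions, not values from a real survey.

```python
import math


def kish_design_effect(avg_cluster_size: float, icc: float) -> float:
    """Kish's approximate design effect for cluster sampling."""
    return 1.0 + (avg_cluster_size - 1.0) * icc


def srs_se_proportion(p: float, n: int) -> float:
    """Standard error of a proportion under simple random sampling."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical survey: 2000 respondents, 20 per cluster, ICC of 0.05.
deff = kish_design_effect(avg_cluster_size=20, icc=0.05)
naive_se = srs_se_proportion(p=0.30, n=2000)
design_se = naive_se * math.sqrt(deff)  # SE inflated by sqrt(deff)
effective_n = 2000 / deff               # SRS sample with the same precision

print(f"deff={deff:.2f}  naive SE={naive_se:.4f}  design SE={design_se:.4f}")
print(f"effective sample size: {effective_n:.0f}")
```

Under these assumed values the simple-random-sampling formula understates the standard error by a factor of roughly 1.4, the square root of the design effect.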

Core Capabilities

1. Survey Design & Planning

Design surveys with statistical rigor:

Sample size calculation - Determine required sample sizes for target precision levels
Power analysis - Calculate detection probability for effects of interest
Optimal allocation - Distribute samples across strata to minimize variance or cost
Cost-variance tradeoffs - Balance statistical precision against field expenses
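
As a rough illustration of the arithmetic behind the first and third items, the sketch below computes the sample size needed to estimate a proportion within a given margin of error, inflated by an assumed design effect, and then spreads that sample across strata with Neyman allocation (n_h proportional to N_h * S_h). It uses only the Python standard library, not svy's own planning functions; the function names, stratum sizes, and standard deviations are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_proportion(p, margin, confidence=0.95, deff=1.0):
    """Sample size for a proportion: z^2 * p * (1 - p) / margin^2, times deff."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    n_srs = z**2 * p * (1.0 - p) / margin**2
    return math.ceil(n_srs * deff)


def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Allocate n across strata proportionally to N_h * S_h (unrounded)."""
    products = [N_h * S_h for N_h, S_h in zip(stratum_sizes, stratum_sds)]
    total = sum(products)
    return [n * prod / total for prod in products]


# Hypothetical target: +/- 5 points at 95% confidence, design effect 1.5.
n_required = sample_size_proportion(p=0.5, margin=0.05, deff=1.5)

# Hypothetical strata: population sizes and standard deviations.
allocation = neyman_allocation(
    n_required,
    stratum_sizes=[5000, 3000, 2000],
    stratum_sds=[10.0, 20.0, 40.0],
)
print(n_required, [round(a, 1) for a in allocation])
```

The allocation is returned unrounded; in practice the stratum sample sizes would be rounded, usually with a minimum per stratum so that variances remain estimable.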

2. Sample Selection

Draw probability samples using proven methods:

Simple Random Sampling (SRS) - Equal probability, with or without replacement
Systematic Sampling (SYS) - Ordered selection with random start point
Probability Proportional to Size (PPS) - Selection probability tied to auxiliary variable
Stratified sampling - Independent selection within predefined groups
Multi-stage sampling - Hierarchical selection (PSUs → SSUs → respondents)
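
To show what a PPS draw involves, here is a minimal sketch of systematic PPS selection: cumulate the size measures, take a random start within the sampling interval, and pick the unit whose cumulative range contains each selection point. This is a generic textbook procedure written with the standard library, not svy's selection API; the function name and the size measures are made up for illustration.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def pps_systematic(sizes, n, seed=None):
    """Return the indices of n units drawn by systematic PPS sampling."""
    rng = random.Random(seed)
    cum = list(accumulate(sizes))        # cumulative size measures
    interval = cum[-1] / n               # sampling interval
    start = rng.random() * interval      # random start in [0, interval)
    points = [start + i * interval for i in range(n)]
    # Unit i covers [cum[i-1], cum[i]); clamp to guard against rounding.
    return [min(bisect_right(cum, pt), len(sizes) - 1) for pt in points]


# Hypothetical frame of 7 PSUs with size measures (e.g. household counts).
sizes = [120, 80, 300, 50, 150, 200, 100]
n = 3
selected = pps_systematic(sizes, n, seed=42)

# First-order inclusion probabilities and the matching design weights.
total = sum(sizes)
incl_prob = [n * s / total for s in sizes]
design_weights = {i: 1.0 / incl_prob[i] for i in selected}
print(selected, design_weights)
```

Because no size measure in this example exceeds the sampling interval, every unit can be selected at most once. A unit larger than the interval would be selected with certainty and is normally set aside as a self-representing unit before the draw.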

3. Survey Weighting

Create and calibrate survey weights:

Design weights - Base weights from known selection probabilities
Nonresponse adjustment - Compensate for unit and item nonresponse
Post-stratification - Align sample margins to population control totals
Raking (iterative proportional fitting) - Calibrate to multiple population margins simultaneously
GREG calibration - Generalized regression estimation for efficient estimation
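
The raking step can be sketched in a few lines: adjust the weights to match one margin, then the next, and repeat until all margins agree with their control totals. The code below is a plain-Python illustration of iterative proportional fitting, not svy's weighting API. The function names, the six-unit sample, and the control totals are hypothetical, and the sketch assumes every category with a control total appears in the sample.

```python
def weighted_totals(weights, labels):
    """Sum of weights within each category of one margin."""
    totals = {}
    for w, lab in zip(weights, labels):
        totals[lab] = totals.get(lab, 0.0) + w
    return totals


def rake(weights, margins, targets, max_iter=100, tol=1e-10):
    """Iterative proportional fitting of weights to several margins.

    margins: {margin name: list of category labels, one per unit}
    targets: {margin name: {category: population control total}}
    """
    w = list(weights)
    for _ in range(max_iter):
        # Adjust to each margin in turn.
        for name, labels in margins.items():
            current = weighted_totals(w, labels)
            w = [wi * targets[name][lab] / current[lab]
                 for wi, lab in zip(w, labels)]
        # Stop once every margin is within the relative tolerance.
        worst = max(
            abs(weighted_totals(w, margins[name])[cat] - tot) / tot
            for name, cats in targets.items()
            for cat, tot in cats.items()
        )
        if worst < tol:
            break
    return w


# Hypothetical sample of six respondents, each with a base weight of 10.
margins = {
    "sex": ["m", "m", "f", "f", "m", "f"],
    "age": ["young", "old", "young", "old", "young", "old"],
}
targets = {
    "sex": {"m": 55.0, "f": 45.0},
    "age": {"young": 40.0, "old": 60.0},
}
raked = rake([10.0] * 6, margins, targets)
print([round(w, 2) for w in raked])
```

The control totals for the two margins must sum to the same population total (100 here); otherwise the margins cannot be satisfied simultaneously and the loop will not converge.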

4. Statistical Estimation

Produce design-consistent population estimates:

Descriptive statistics - Means, totals, proportions, quantiles with design-based variance
Domain (subpopulation) estimation - Analyze subgroups correctly
Ratio estimation - Ratios and their standard errors
Regression analysis - Linear, logistic, Poisson, and other GLMs with survey weights
Categorical data analysis - Design-adjusted chi-square tests and cross-tabulations
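
For orientation, the point estimates behind the first two items reduce to weighted sums. The sketch below computes a weighted total, a weighted (ratio-type) mean, and a domain mean in plain Python. It is not svy's estimation API; the function names and the four-record data set are hypothetical.

```python
def weighted_total(y, w):
    """Estimated population total: sum of w_i * y_i."""
    return sum(wi * yi for yi, wi in zip(y, w))


def weighted_mean(y, w):
    """Estimated population mean: weighted total divided by sum of weights."""
    return weighted_total(y, w) / sum(w)


def domain_mean(y, w, in_domain):
    """Weighted mean restricted to units flagged as domain members."""
    num = sum(wi * yi for yi, wi, d in zip(y, w, in_domain) if d)
    den = sum(wi for wi, d in zip(w, in_domain) if d)
    return num / den


# Hypothetical records: outcome, survey weight, and a domain indicator.
y = [1.0, 2.0, 3.0, 4.0]
w = [10.0, 20.0, 30.0, 40.0]
in_domain = [True, False, True, False]

print(weighted_total(y, w))          # 300.0
print(weighted_mean(y, w))           # 3.0
print(domain_mean(y, w, in_domain))  # 2.5
```

These functions give point estimates only. The variance of a domain estimate should be computed from the full sample using the domain indicator, not from a file subset to the domain, because subsetting discards information about the design.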

5. Variance Estimation

Calculate standard errors that reflect the survey design:

Taylor linearization - Analytic variance for smooth statistics
Bootstrap - General resampling-based variance estimation
Balanced Repeated Replication (BRR) - Efficient replication for paired PSU designs
Jackknife - Delete-one-group-at-a-time replication
Domain variance - Correct standard errors for subpopulation analyses
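
As an illustration of the first method, the sketch below estimates the standard error of a weighted mean by Taylor linearization under a stratified design with PSUs treated as sampled with replacement. Each observation contributes a linearized value z_i = w_i (y_i - mean) / sum(w); these are summed within PSUs, and the variance is the sum over strata of n_h / (n_h - 1) times the squared deviations of the PSU totals from their stratum mean. This is a generic textbook formula in plain Python, not svy's internal implementation; the function name and data are hypothetical.

```python
import math
from collections import defaultdict


def taylor_se_of_mean(y, w, stratum, psu):
    """Weighted mean and its linearization SE (with-replacement PSUs)."""
    sum_w = sum(w)
    mean = sum(wi * yi for yi, wi in zip(y, w)) / sum_w

    # PSU-level totals of the linearized variable, grouped by stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, j in zip(y, w, stratum, psu):
        psu_totals[h][j] += wi * (yi - mean) / sum_w

    variance = 0.0
    for h, totals in psu_totals.items():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError(f"stratum {h!r} has fewer than 2 PSUs")
        z_bar = sum(totals.values()) / n_h
        variance += n_h / (n_h - 1) * sum(
            (z - z_bar) ** 2 for z in totals.values()
        )
    return mean, math.sqrt(variance)


# Hypothetical data: two strata, two PSUs each, two respondents per PSU.
y = [12.0, 15.0, 9.0, 11.0, 20.0, 22.0, 18.0, 25.0]
w = [100.0, 100.0, 120.0, 120.0, 80.0, 80.0, 90.0, 90.0]
stratum = ["A", "A", "A", "A", "B", "B", "B", "B"]
psu = [1, 1, 2, 2, 3, 3, 4, 4]

mean, se = taylor_se_of_mean(y, w, stratum, psu)
print(f"mean={mean:.3f}  SE={se:.3f}")
```

With one stratum, equal weights, and one observation per PSU, the formula reduces to the familiar s^2 / n variance of a sample mean, which is a convenient sanity check.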

Quick Start Example

import svy

# Load survey data with design variables
smp_data = svy.io.read_csv("survey_data.csv")

# Specify survey design
smp_design = svy.Design(
    stratum=("region_id", "urban_rural"),  # Stratification variables
    psu="psu_id",                          # Primary sampling units
    wgt="weight",                          # Survey weights
)

# Create survey sample object
sample = svy.Sample(
    data=smp_data,
    design=smp_design
)

# Population mean with design-based standard error
mean_income = sample.estimation.mean("income")
print(mean_income)
# Output includes: estimate, SE, confidence interval, design effect

# Regression accounting for complex design
linear_model = sample.glm.fit(
    y="income",
    x=["age", svy.Cat("education")],
    family="gaussian"
)
print(linear_model)
# Output: coefficients, design-based SEs, t-tests

Who Uses svy?

svy serves diverse survey research communities:

📊 Survey methodologists developing sampling strategies
📈 Biostatisticians analyzing health survey data (NHANES, BRFSS, DHS)
🎓 Social scientists studying populations through sample surveys
🏛️ Government statisticians producing official statistics
🔬 Epidemiologists estimating disease prevalence and risk factors
🏢 Market researchers analyzing customer surveys and panels
👨‍🏫 Educators teaching survey sampling and analysis methods

Documentation Structure

📖 Getting Started

Five-minute quickstart: installation, first analysis, key concepts.

🎓 Tutorials

Hands-on walkthroughs with real data:

Designing surveys and selecting samples
Computing and adjusting survey weights
Producing population estimates with proper variance
Fitting regression models to survey data
Analyzing cross-tabulations and categorical data

📚 User Guides

Coming soon - In-depth conceptual explanations and best practices

🔧 API Reference

Coming soon - Complete technical documentation of all classes and methods

Community & Support

Get help and connect with other users:

💬 Questions & Discussion: GitHub Discussions
🐛 Bug Reports & Features: GitHub Issues
📧 Direct Contact: info@svylab.com
💼 Professional Network: LinkedIn (@svylab)
🐙 Source Code: GitHub samplics-org/svy

Community signal

Help make svy the standard for survey analysis in Python

If rigorous, design-based survey inference in Python matters to you, starring the repository helps signal demand and prioritize validation and stability work.

→ Star svy on GitHub

Academic Citation

If you use svy in published research, please cite:

@software{svy2025,
  title = {svy: Python Package for Complex Survey Analysis},
  author = {Diallo, Mamadou S.},
  year = {2025},
  url = {https://github.com/samplics-org/svy},
  doi = {10.5281/zenodo.XXXXXXX},
  version = {0.2.0}
}

License

svy is open source software released under the MIT License. See LICENSE for full terms.

Ready to analyze complex survey data correctly?
Get Started →