Getting Started with svy

From installation to population estimates in under 10 minutes

Quick start guide for the svy Python package. Install svy, load survey data, define complex sampling designs, and produce your first population estimates with proper variance estimation in minutes.

Author

Mamadou S. Diallo, Ph.D.

Published

January 18, 2026

Modified

February 9, 2026

Keywords

svy getting started, svy installation Python, survey analysis quickstart, complex survey Python tutorial, svy Sample object, svy Design object, survey weighting Python, NHANES analysis Python, survey estimation tutorial, design-based inference Python, Polars survey data, survey regression Python

Get up and running with svy in under 10 minutes. This guide walks you through installation, your first analysis, and key concepts.

Installation

Install svy from PyPI using pip:

pip install svy

For enhanced output formatting and tables:

pip install "svy[report]"

For the fastest installation experience, use uv:

uv add "svy[report]"

System Requirements:

Python 3.11, 3.12, or 3.13
Works on Linux, macOS, and Windows
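If you are unsure which interpreter your environment uses, a quick check before installing can save a failed install. The sketch below uses only the standard library; the helper name is_supported is made up for this guide and is not part of svy.

```python
import sys

# Supported (major, minor) pairs, copied from the requirements listed above.
SUPPORTED = {(3, 11), (3, 12), (3, 13)}


def is_supported(version=None):
    """Return True if the (major, minor) pair is one of the listed versions.

    Hypothetical helper for illustration; svy does not ship this function.
    """
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED


# Report on the interpreter that is running this script.
print("Python", sys.version.split()[0], "supported:", is_supported())
```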

Your First Analysis

Let’s analyze a simple survey with stratified sampling.

Step 1: Load Data

import svy

# Load survey data
data = svy.io.read_csv("survey_data.csv")

# Preview
print(data.head())

Expected columns:

income - Survey outcome variable
age, education - Covariates
stratum - Stratification variable
psu - Primary sampling unit (cluster)
weight - Survey weight
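Before defining a design, it can help to confirm that the file really contains these columns. The following is a minimal standard-library sketch, not an svy feature: the helper missing_columns and the two-row CSV text are invented for illustration.

```python
import csv
import io

# Column names this guide expects in survey_data.csv.
REQUIRED = {"income", "age", "education", "stratum", "psu", "weight"}

# Tiny synthetic file standing in for survey_data.csv (values are made up).
SAMPLE_CSV = """income,age,education,stratum,psu,weight
42000,34,College,1,101,150.5
31000,52,High School,1,102,98.0
"""


def missing_columns(text, required=REQUIRED):
    """Return the required column names absent from the CSV header.

    Hypothetical helper for illustration; not part of svy.
    """
    header = next(csv.reader(io.StringIO(text)))
    return sorted(set(required) - set(header))


print(missing_columns(SAMPLE_CSV))  # empty list when every column is present
```

For a file on disk, you would pass the file's text (or adapt the helper to read only the first line) instead of the in-memory string.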

Step 2: Define Survey Design

Tell svy about your sampling design:

# Specify design variables
design = svy.Design(
    stratum="stratum",    # Stratification
    psu="psu",            # Clustering
    wgt="weight"          # Survey weights
)

# Create survey sample object
sample = svy.Sample(data=data, design=design)

print(sample)

Step 3: Calculate Population Estimates

Produce design-consistent estimates:

# Population mean with proper standard error
mean_income = sample.estimation.mean("income")
print(mean_income)

# Output includes:
# - Estimate
# - Standard error
# - 95% confidence interval
# - Design effect (DEFF)
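To see what a "proper standard error" involves, here is a pure-Python sketch of a weighted mean with a Taylor-linearized standard error for a stratified, clustered design. It is a textbook with-replacement estimator written for this guide, not svy's implementation, and it may differ from svy in details such as finite population corrections or single-PSU handling. The function name and the data are made up.

```python
import math
from collections import defaultdict


def weighted_mean_se(y, w, stratum, psu):
    """Weighted mean and linearized SE, treating PSUs as sampled with replacement.

    Illustrative sketch only; not svy's code.
    """
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w

    # Sum the linearized values w_i * (y_i - mean) / sum(w) within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, j in zip(y, w, stratum, psu):
        psu_totals[h][j] += wi * (yi - mean) / total_w

    # Between-PSU variance within each stratum, summed over strata.
    variance = 0.0
    for totals in psu_totals.values():
        z = list(totals.values())
        n_h = len(z)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        z_bar = sum(z) / n_h
        variance += n_h / (n_h - 1) * sum((zj - z_bar) ** 2 for zj in z)
    return mean, math.sqrt(variance)


# Made-up data: 2 strata, 2 PSUs per stratum, 2 respondents per PSU.
y = [10, 12, 20, 22, 30, 28, 40, 44]
w = [1, 1, 1, 1, 2, 2, 2, 2]
stratum = [1, 1, 1, 1, 2, 2, 2, 2]
psu = [1, 1, 2, 2, 3, 3, 4, 4]
mean, se = weighted_mean_se(y, w, stratum, psu)
print(f"mean={mean:.3f}, se={se:.3f}")
```

When every respondent is its own PSU in a single stratum with equal weights, this formula reduces to the familiar s / sqrt(n), which is a handy sanity check.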

Step 4: Analyze Subpopulations

Estimate means within domains (subgroups):

# Mean income by education level
by_education = sample.estimation.mean("income", by="education")
print(by_education)
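Conceptually, a domain mean is a weighted mean computed within each level of the grouping variable. The sketch below shows only the point estimates, using a made-up helper and made-up data. Standard errors for domains are generally computed from the full design rather than from a filtered subset, which is one reason to let the library handle them.

```python
from collections import defaultdict


def weighted_mean_by(y, w, group):
    """Weighted mean of y within each level of group (point estimates only).

    Hypothetical helper for illustration; not part of svy.
    """
    num = defaultdict(float)
    den = defaultdict(float)
    for yi, wi, g in zip(y, w, group):
        num[g] += wi * yi
        den[g] += wi
    return {g: num[g] / den[g] for g in num}


# Made-up incomes, weights, and education levels.
income = [30000, 50000, 40000, 80000]
weight = [2.0, 2.0, 1.0, 3.0]
education = ["High School", "High School", "College", "College"]
print(weighted_mean_by(income, weight, education))
```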

Step 5: Fit Regression Models

Survey-weighted regression:

# Linear regression accounting for design
model = sample.glm.fit(
    y="income",
    x=["age", svy.Cat("education")],
    family="gaussian"
)

print(model)
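For intuition about what the weights do in a Gaussian model, the following sketch computes weighted least squares coefficients for a single numeric covariate. It is a simplified illustration with made-up names and data: it gives point estimates only and does not reproduce the design-based standard errors or the categorical handling that the fitted model above provides.

```python
def weighted_least_squares(x, y, w):
    """Intercept and slope minimizing sum(w * (y - a - b * x) ** 2).

    Illustrative one-covariate sketch; not svy's fitting routine.
    """
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up ages, incomes (in thousands), and survey weights.
age = [25, 35, 45, 55]
income_k = [30.0, 42.0, 50.0, 61.0]
weight = [1.5, 2.0, 1.0, 0.5]
intercept, slope = weighted_least_squares(age, income_k, weight)
print(f"intercept={intercept:.2f}, slope={slope:.3f}")
```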

That’s it! You’ve completed your first survey analysis with proper design-based inference.

Core Concepts

Survey Design

The Design object specifies how your sample was selected:

# Simple random sample
design_srs = svy.Design(wgt="weight")

# Stratified sample
design_strat = svy.Design(
    stratum="region",
    wgt="weight"
)

# Clustered sample
design_cluster = svy.Design(
    psu="school_id",
    wgt="weight"
)

# Complex multi-stage design
design_complex = svy.Design(
    stratum=("region", "urban_rural"),  # Multiple stratification variables
    psu="psu_id",                       # Primary units
    wgt="final_weight"                  # Combined weight
)
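One way to think about several stratification variables is as a single stratum label formed by crossing them. The sketch below illustrates that idea in plain Python. Whether svy combines the tuple in exactly this way is an assumption made here, and cross_strata is a hypothetical helper, not a library function.

```python
def cross_strata(*columns, sep="__"):
    """Combine several stratification columns into one label per row.

    Hypothetical helper; assumes strata are the crossing of all variables.
    """
    return [sep.join(str(v) for v in row) for row in zip(*columns)]


# Made-up values for four respondents.
region = ["North", "North", "South", "South"]
urban_rural = ["urban", "rural", "urban", "urban"]

strata = cross_strata(region, urban_rural)
print(strata)
print(len(set(strata)), "distinct strata")
```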

Sample Object

The Sample combines your data with design information:

sample = svy.Sample(data=data, design=design)

# Access components
sample.data        # Survey data (Polars DataFrame)
sample.design      # Design specification
sample.estimation  # Estimation methods
sample.glm         # Regression models
sample.wrangling   # Data transformation utilities

Data Wrangling

Clean and transform survey data efficiently:

from svy import CaseStyle, LetterCase
from svy.core.expr import col

# Standardize column names
sample = sample.wrangling.clean_names(
    case_style=CaseStyle.SNAKE,
    letter_case=LetterCase.LOWER
)

# Recode categorical variables
sample = sample.wrangling.recode(
    "education",
    {"High School": ["HS", "high_school"],
     "College": ["BA", "BS", "college"]}
)

# Bin continuous variables into categories
sample = sample.wrangling.categorize(
    "age",
    bins=[0, 18, 35, 65, 100],
    labels=["0-17", "18-34", "35-64", "65+"]
)

# Cap extreme values (top/bottom coding)
sample = sample.wrangling.bottom_and_top_code(
    {"income": (0, 200000)}  # Floor at 0, cap at 200,000
)

# Create new variables
sample = sample.wrangling.mutate({
    "income_thousands": col("income") / 1000,
    "age_squared": col("age") ** 2
})

Common tasks:

clean_names() - Standardize column names (snake_case, camelCase, etc.)
recode() - Map values to new categories
categorize() - Bin continuous variables into labeled ranges
top_code() / bottom_code() / bottom_and_top_code() - Cap extreme values
mutate() - Create or transform variables with expressions
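To make binning and top/bottom coding concrete, here is a plain-Python sketch of both operations on single values. The helpers bin_label and clamp are invented for this guide. The interval convention (left-closed, so 18 falls in "18-34") is an assumption inferred from the labels, not confirmed svy behavior.

```python
from bisect import bisect_right

BINS = [0, 18, 35, 65, 100]
LABELS = ["0-17", "18-34", "35-64", "65+"]


def bin_label(value, bins=BINS, labels=LABELS):
    """Label of the left-closed interval [bins[i], bins[i+1]) containing value.

    Assumed convention for illustration; svy's edge handling may differ.
    """
    if value < bins[0] or value >= bins[-1]:
        return None  # outside the covered range
    return labels[bisect_right(bins, value) - 1]


def clamp(value, lower, upper):
    """Bottom-code at lower and top-code at upper."""
    return max(lower, min(value, upper))


print([bin_label(a) for a in (5, 18, 40, 70)])
print([clamp(v, 0, 200000) for v in (-500, 85000, 350000)])
```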

See: Wrangling Tutorial for comprehensive examples.

Estimation Methods

All estimation methods produce design-consistent results:

# Means
sample.estimation.mean("income")

# Totals
sample.estimation.total("population")

# Proportions
sample.estimation.prop("employed")

# Median
sample.estimation.median("income")

# Ratios
sample.estimation.ratio(y="expenditure", x="hh_size") # x is the denominator
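The point estimates behind totals, proportions, and ratios are simple weighted sums. The sketch below spells them out with made-up helpers and data; it leaves out the median and all variance estimation, which is where the design information matters most.

```python
def weighted_total(y, w):
    """Estimated population total: sum of weight times value."""
    return sum(wi * yi for wi, yi in zip(w, y))


def weighted_ratio(y, x, w):
    """Ratio of two estimated totals, with x in the denominator."""
    return weighted_total(y, w) / weighted_total(x, w)


def weighted_prop(flag, w):
    """Weighted share of units where flag is true."""
    return weighted_total([1 if f else 0 for f in flag], w) / sum(w)


# Made-up household data and weights.
expenditure = [100, 200, 300]
hh_size = [2, 4, 3]
employed = [True, False, True]
weight = [10, 20, 30]

print(weighted_total(expenditure, weight))
print(weighted_ratio(expenditure, hh_size, weight))
print(weighted_prop(employed, weight))
```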

Working with Real Survey Data

Loading Common Formats

svy integrates with svy-io for reading survey data files:

# SPSS files
data = svy.io.read_spss("survey.sav")

# Stata files
data = svy.io.read_stata("survey.dta")

# SAS files
data = svy.io.read_sas("survey.sas7bdat")

# CSV with metadata
data = svy.io.read_csv("survey.csv", metadata="survey_metadata.json")
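If your project handles several file types, you may want to choose the reader from the file extension. The sketch below only maps extensions to the reader names shown above and returns the name as a string, so it runs without svy installed. The helper reader_name is hypothetical.

```python
from pathlib import Path

# Extension-to-reader mapping, using the function names shown in this guide.
READERS = {
    ".csv": "read_csv",
    ".sav": "read_spss",
    ".dta": "read_stata",
    ".sas7bdat": "read_sas",
}


def reader_name(path):
    """Return the name of the svy.io reader matching the file extension.

    Hypothetical helper for illustration; not part of svy.
    """
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"no reader listed for {suffix!r} files") from None


for name in ("survey.sav", "survey.dta", "survey.sas7bdat", "survey.csv"):
    print(name, "->", reader_name(name))
```

With svy installed, something like getattr(svy.io, reader_name(path))(path) would presumably dispatch to the matching reader, though this guide does not verify that pattern.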

Handling Missing Data

svy respects survey-specific missing values. To exclude missing values of the outcome from an estimate, pass drop_nulls=True:

sample.estimation.mean("income", drop_nulls=True)
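As a rough picture of what dropping nulls means for a weighted mean, the sketch below removes records with a missing outcome from both the numerator and the denominator. This is an assumption about the general idea, not a description of how svy implements drop_nulls, and the helper name and data are made up.

```python
def mean_dropping_nulls(y, w):
    """Weighted mean over records whose outcome is not None.

    Illustrative sketch; svy's handling of missing values may differ.
    """
    pairs = [(yi, wi) for yi, wi in zip(y, w) if yi is not None]
    if not pairs:
        raise ValueError("no non-missing values")
    return sum(wi * yi for yi, wi in pairs) / sum(wi for _, wi in pairs)


# Made-up incomes with one missing value; the missing record's weight is ignored.
income = [100, None, 300]
weight = [1, 5, 3]
print(mean_dropping_nulls(income, weight))
```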

Complete Example: NHANES Analysis

Analyze real public health survey data:

import svy

# Load NHANES data
nhanes = svy.io.read_csv("nhanes_demo.csv")

# Define complex design
design = svy.Design(
    stratum="sdmvstra",      # Masked variance stratum
    psu="sdmvpsu",           # Masked variance PSU
    wgt="wtmec2yr"           # 2-year MEC exam weights
)

# Create sample
sample = svy.Sample(data=nhanes, design=design)

# Population mean BMI
mean_bmi = sample.estimation.mean("bmxbmi")
print(mean_bmi)

# BMI by gender
by_gender = sample.estimation.mean("bmxbmi", by="riagendr")
print(by_gender)

# Regression: BMI ~ age + gender
model = sample.glm.fit(
    y="bmxbmi",
    x=["ridageyr", svy.Cat("riagendr")],
    family="gaussian"
)
print(model)

Key Resources

📘 Documentation: svylab.com/docs/svy
💬 Community: GitHub Discussions
🐛 Issues: GitHub Issues
📧 Support: info@svylab.com

Dependencies

svy builds on Python’s scientific computing ecosystem:

Core Dependencies (installed automatically):

NumPy - Numerical computing
Polars - Fast DataFrames
PyArrow - Columnar data
SciPy - Scientific algorithms
msgspec - Fast serialization

Optional Dependencies:

rich - Enhanced console output
great-tables - Publication tables
svy-sae - Small area estimation

Install with optional dependencies:

pip install "svy[report]"  # Adds rich + great-tables
pip install "svy[sae]"     # Adds svy-sae
pip install "svy[all]"     # Everything
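To check which optional packages are already present in your environment, you can ask the import system directly. This sketch uses only the standard library; the import name great_tables corresponds to the great-tables package, and svy-sae is left out because its import name is not stated in this guide.

```python
from importlib.util import find_spec

# PyPI package name -> importable module name.
OPTIONAL_MODULES = {"rich": "rich", "great-tables": "great_tables"}


def installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return find_spec(module_name) is not None


for package, module in OPTIONAL_MODULES.items():
    print(f"{package}: {'installed' if installed(module) else 'missing'}")
```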

Contributing

svy is open source and welcomes contributions!

Ways to contribute:

🐛 Report bugs or request features
📖 Improve documentation
🧪 Add examples or tutorials
💻 Submit code improvements

See our GitHub repository to get started.

Support the Project

Help svy grow:

⭐ Star the GitHub repo
📣 Share with colleagues and on social media
📝 Cite in your publications (see Citation)
💼 Connect on LinkedIn

License

svy is released under the MIT License - free for academic and commercial use.

Ready to Explore?

Continue with hands-on tutorials covering real survey analysis workflows.

View Tutorials →