import polars as pl
import svy
from svy import Sample, CaseStyle, LetterCase
Data Wrangling
Clean names, recode, bin, clamp, and create new variables
svy provides functionality for wrangling the dataset contained within a Sample. This functionality is a light wrapper around Polars—designed not to replace Polars, but to make common survey data cleaning and organization tasks faster and more consistent. It focuses on practical, frequently used transformations rather than offering the full breadth of Polars’ capabilities.
This tutorial shows the most common wrangling helpers in svy. Examples are small and focused. Advanced options live in the reference documentation.
Clean & Rename
clean_names()
Standardize column names for easier downstream work.
df = pl.DataFrame({"First Name": ["Ana"], "Income ($)": [5000]})
s = Sample(df).wrangling.clean_names(
    case_style=CaseStyle.SNAKE,
    letter_case=LetterCase.LOWER
)
s.data.columns
['svy_row_index', 'first_name', 'income']
You can also remove characters with a regex or pick other styles (camel/pascal).
df2 = pl.DataFrame({"A#B": [1], "C&D": [2]})
s2 = Sample(df2).wrangling.clean_names(
    remove=r"[^a-zA-Z0-9]",
    letter_case=LetterCase.UPPER
)
s2.data.columns
['svy_row_index', 'AB', 'CD']
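To make the two transformations above concrete, here is a minimal pure-Python sketch of the same kind of name normalization. It is an illustration only, not svy's implementation: the helpers `to_snake` and `strip_and_upper` are hypothetical names, and svy's exact rules for edge cases (repeated separators, leading digits, and so on) may differ.

```python
import re


def to_snake(name: str) -> str:
    """Hypothetical helper: approximate snake_case + lowercase cleaning."""
    # Collapse each run of non-alphanumeric characters into one underscore
    collapsed = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    # Drop underscores left at either end, then lowercase
    return collapsed.strip("_").lower()


def strip_and_upper(name: str, remove: str = r"[^a-zA-Z0-9]") -> str:
    """Hypothetical helper: delete characters matching `remove`, then uppercase."""
    return re.sub(remove, "", name).upper()


print([to_snake(c) for c in ["First Name", "Income ($)"]])
print([strip_and_upper(c) for c in ["A#B", "C&D"]])
```

Both calls produce the cleaned names shown in the examples above; the `svy_row_index` column is added by svy itself and is not part of this sketch.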
rename_columns()
Rename variables directly (raises if a source name is missing).
df = pl.DataFrame({"old": [1, 2, 3]})
s = Sample(df).wrangling.rename_columns(renames={"old": "new"})
s.data.columns
['svy_row_index', 'new']
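The "raises if a source name is missing" behaviour can be pictured with a small pure-Python sketch. This is an assumption about the general pattern, not svy's code: `rename_strict` is a hypothetical helper, and the exception type svy actually raises is not stated in this tutorial (a `KeyError` is used here only for illustration).

```python
def rename_strict(columns, renames):
    """Hypothetical helper: rename columns, refusing unknown source names."""
    # Validate every source name before touching anything
    missing = [old for old in renames if old not in columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    # Names without a mapping are kept as they are
    return [renames.get(c, c) for c in columns]


print(rename_strict(["old"], {"old": "new"}))

try:
    rename_strict(["old"], {"oops": "new"})
except KeyError as err:
    print("rename failed:", err)
```

Validating up front means a typo in a source name fails loudly instead of silently leaving the column unrenamed.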
Recode Categories
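Map old values to new labels; values that are not targeted pass through unchanged. recode() works on one or several columns. By default the result goes to a new column named svy_<col>_recoded; pass replace=True to overwrite the source column, or into= to choose the output name.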
df = pl.DataFrame({"item": ["apple", "soap", "carrot", "tv"]})
s = Sample(df)
# Map to "Food"; non-targets remain
s1 = s.wrangling.recode("item", {"Food": ["apple", "carrot"]})
print(s1.show_data(columns=["item", "svy_item_recoded"]))
# Custom output name
s2 = s.wrangling.recode(
    cols="item",
    recodes={"Food": ["apple", "carrot"]},
    replace=False,
    into="item_grp"
)
print(s2.show_data(columns=["item", "item_grp"]))
shape: (4, 2)
┌────────┬──────────────────┐
│ item   ┆ svy_item_recoded │
│ ---    ┆ ---              │
│ str    ┆ str              │
╞════════╪══════════════════╡
│ apple  ┆ Food             │
│ soap   ┆ soap             │
│ carrot ┆ Food             │
│ tv     ┆ tv               │
└────────┴──────────────────┘
shape: (4, 2)
┌────────┬──────────┐
│ item   ┆ item_grp │
│ ---    ┆ ---      │
│ str    ┆ str      │
╞════════╪══════════╡
│ apple  ┆ Food     │
│ soap   ┆ soap     │
│ carrot ┆ Food     │
│ tv     ┆ tv       │
└────────┴──────────┘
Multiple columns:
df = pl.DataFrame({"a": [0, 1, 2], "b": [1, 1, 9]})
s = Sample(df).wrangling.recode(cols=["a", "b"], recodes={"ONE": [1]})
print(s.show_data(columns=["svy_a_recoded", "svy_b_recoded"]))
shape: (3, 2)
┌───────────────┬───────────────┐
│ svy_a_recoded ┆ svy_b_recoded │
│ ---           ┆ ---           │
│ object        ┆ object        │
╞═══════════════╪═══════════════╡
│ 0             ┆ ONE           │
│ ONE           ┆ ONE           │
│ 2             ┆ 9             │
└───────────────┴───────────────┘
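The recode mapping is written as new label → list of old values. A minimal pure-Python sketch of those semantics follows; `recode_values` is a hypothetical helper used only for illustration, not svy's implementation.

```python
def recode_values(values, recodes):
    """Hypothetical helper: apply a {new_label: [old_values]} mapping."""
    # Invert the mapping so each old value points at its new label
    lookup = {old: new for new, olds in recodes.items() for old in olds}
    # Values without an entry pass through unchanged
    return [lookup.get(v, v) for v in values]


print(recode_values(["apple", "soap", "carrot", "tv"], {"Food": ["apple", "carrot"]}))
print(recode_values([0, 1, 2], {"ONE": [1]}))
```

The second call returns a list that mixes integers and strings, which is consistent with the object dtype shown in the multi-column output above.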
Categorize Continuous Variables
Turn continuous values into labeled bins. Out-of-range values become None.
df = pl.DataFrame({"x": [1, 10, 20, 30, 40]})
# Right-closed: (0,10], (10,20], (20,30]
s = Sample(df).wrangling.categorize(col="x", bins=[0, 10, 20, 30])
s.show_data(columns=["x", "svy_x_categorized"])
# Left-closed: [0,10), [10,20), [20,30)
s2 = Sample(df).wrangling.categorize(col="x", bins=[0, 10, 20, 30], right=False)
print(s2.show_data(columns=["x", "svy_x_categorized"]))
shape: (5, 2)
┌─────┬───────────────────┐
│ x   ┆ svy_x_categorized │
│ --- ┆ ---               │
│ i64 ┆ str               │
╞═════╪═══════════════════╡
│ 1   ┆ [0, 10)           │
│ 10  ┆ [10, 20)          │
│ 20  ┆ [20, 30)          │
│ 30  ┆ null              │
│ 40  ┆ null              │
└─────┴───────────────────┘
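The difference between right-closed and left-closed bins is easiest to see at the bin edges. The sketch below reproduces the interval logic with the standard-library `bisect` module; `bin_label` is a hypothetical helper and the label format is an assumption modelled on the output above, so treat it as an illustration rather than svy's implementation.

```python
from bisect import bisect_left, bisect_right


def bin_label(value, bins, right=True):
    """Hypothetical helper: return the interval label for value, or None."""
    # right=True  -> intervals (lo, hi]: a value equal to an edge goes to the lower bin
    # right=False -> intervals [lo, hi): a value equal to an edge goes to the upper bin
    i = bisect_left(bins, value) if right else bisect_right(bins, value)
    if i == 0 or i == len(bins):
        return None  # outside the outermost edges
    lo, hi = bins[i - 1], bins[i]
    return f"({lo}, {hi}]" if right else f"[{lo}, {hi})"


xs = [1, 10, 20, 30, 40]
edges = [0, 10, 20, 30]
print([bin_label(x, edges) for x in xs])
print([bin_label(x, edges, right=False) for x in xs])
```

With right=False the value 30 falls outside the last bin [20, 30), matching the null in the table above; with right=True it falls inside (20, 30].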
Provide your own labels and/or write the result to a column name of your choice with into=:
df = pl.DataFrame({"v": [1, 2, 6, 11]})
s = Sample(df).wrangling.categorize(
    col="v",
    bins=[0, 5, 10, 20],
    labels=["(0,5]", "(5,10]", "(10,20]"],
    into="v_band"
)
print(s.show_data(columns=["v", "v_band"]))
shape: (4, 2)
┌─────┬─────────┐
│ v   ┆ v_band  │
│ --- ┆ ---     │
│ i64 ┆ str     │
╞═════╪═════════╡
│ 1   ┆ (0,5]   │
│ 2   ┆ (0,5]   │
│ 6   ┆ (5,10]  │
│ 11  ┆ (10,20] │
└─────┴─────────┘
Cap Extremes (Top/Bottom Coding)
Clamp high or low values to chosen bounds. Auto-names outputs unless you set replace=True or into=.
df = pl.DataFrame({"income": [100, 300, 600, 1000]})
s = Sample(df)
# Clamp to [250, 800] in a new column
c1 = s.wrangling.bottom_and_top_code(bottom_and_top_codes={"income": (250, 800)}, replace=False)
c1.show_data(columns=["income", "svy_income_bottom_and_top_coded"])
# Replace in place
c2 = s.wrangling.bottom_and_top_code(bottom_and_top_codes={"income": (250, 800)}, replace=True)
print(c2.show_data(columns=["income"]))
shape: (4, 1)
┌────────┐
│ income │
│ ---    │
│ i64    │
╞════════╡
│ 250    │
│ 300    │
│ 600    │
│ 800    │
└────────┘
You can also do only top or only bottom coding:
print(Sample(pl.DataFrame({"x": [1, 5, 10]})).wrangling.top_code(top_codes={"x": 5}).show_data())
print(Sample(pl.DataFrame({"y": [1, 3, 7]})).wrangling.bottom_code(bottom_codes={"y": 3}).show_data())
shape: (3, 3)
┌───────────────┬─────┬─────────────────┐
│ svy_row_index ┆ x   ┆ svy_x_top_coded │
│ ---           ┆ --- ┆ ---             │
│ u32           ┆ i64 ┆ i64             │
╞═══════════════╪═════╪═════════════════╡
│ 0             ┆ 1   ┆ 1               │
│ 1             ┆ 5   ┆ 5               │
│ 2             ┆ 10  ┆ 5               │
└───────────────┴─────┴─────────────────┘
shape: (3, 3)
┌───────────────┬─────┬────────────────────┐
│ svy_row_index ┆ y   ┆ svy_y_bottom_coded │
│ ---           ┆ --- ┆ ---                │
│ u32           ┆ i64 ┆ i64                │
╞═══════════════╪═════╪════════════════════╡
│ 0             ┆ 1   ┆ 3                  │
│ 1             ┆ 3   ┆ 3                  │
│ 2             ┆ 7   ┆ 7                  │
└───────────────┴─────┴────────────────────┘
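All three capping methods boil down to the same clamp operation with one or both bounds set. A minimal pure-Python sketch, assuming a hypothetical helper `clamp` (this is not svy's implementation):

```python
def clamp(value, lower=None, upper=None):
    """Hypothetical helper: cap value at the given bounds; None means no bound."""
    if lower is not None and value < lower:
        return lower  # bottom coding
    if upper is not None and value > upper:
        return upper  # top coding
    return value


incomes = [100, 300, 600, 1000]
print([clamp(v, lower=250, upper=800) for v in incomes])  # both bounds
print([clamp(v, upper=5) for v in [1, 5, 10]])            # top coding only
print([clamp(v, lower=3) for v in [1, 3, 7]])             # bottom coding only
```

The three printed lists match the income, svy_x_top_coded, and svy_y_bottom_coded columns in the outputs above.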
Mutate (Create or Transform Columns)
Create new variables from expressions, arrays, Series, or callables. Dependencies created in the same call are supported.
from svy.core.expr import col, when
import numpy as np
df = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
s = Sample(df)
out = s.wrangling.mutate(specs={
    "c": col("a") * 2,                                # svy expr
    "d": np.array([1, 1, 1]),                         # numpy
    "e": col("c") + col("d"),                         # depends on c and d
    "flag": when(col("e") > 3).then(1).otherwise(0),  # conditional
})
print(out.show_data(columns=["a", "b", "c", "d", "e", "flag"]))
shape: (3, 6)
┌─────┬─────┬─────┬─────┬─────┬──────┐
│ a   ┆ b   ┆ c   ┆ d   ┆ e   ┆ flag │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ ---  │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i32  │
╞═════╪═════╪═════╪═════╪═════╪══════╡
│ 1   ┆ 10  ┆ 2   ┆ 1   ┆ 3   ┆ 0    │
│ 2   ┆ 20  ┆ 4   ┆ 1   ┆ 5   ┆ 1    │
│ 3   ┆ 30  ┆ 6   ┆ 1   ┆ 7   ┆ 1    │
└─────┴─────┴─────┴─────┴─────┴──────┘
Common gotchas:
- Referencing a missing column raises a clear error
- Length mismatches for arrays/Series raise a compile error
- Circular dependencies (e.g., x depends on y and y on x) are blocked
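One common way to support same-call dependencies while blocking circular ones is a topological sort of the new columns. This tutorial does not say how svy resolves dependencies internally, so the following is only a sketch of the general idea: `evaluation_order` and its `reads` mapping are hypothetical, and it relies on the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter


def evaluation_order(reads, existing):
    """Hypothetical helper: order new columns so dependencies come first.

    reads: maps each new column to the column names its spec refers to.
    existing: names already present in the data.
    """
    graph = {}
    for name, refs in reads.items():
        unknown = [r for r in refs if r not in reads and r not in existing]
        if unknown:
            raise KeyError(f"{name!r} refers to missing columns: {unknown}")
        # Only references to other new columns constrain the order
        graph[name] = {r for r in refs if r in reads}
    # static_order() raises CycleError when the references are circular
    return list(TopologicalSorter(graph).static_order())


order = evaluation_order(
    {"c": {"a"}, "d": set(), "e": {"c", "d"}, "flag": {"e"}},
    existing={"a", "b"},
)
print(order)

try:
    evaluation_order({"x": {"y"}, "y": {"x"}}, existing=set())
except CycleError as err:
    print("circular dependency:", err.args[1])
```

In the first call, c and d are placed before e, and e before flag, which mirrors the dependency chain in the mutate() example above.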
Quick Reference
| Task | Method(s) | Notes |
|---|---|---|
| Standardize names | clean_names() | styles (snake/camel/pascal), letter case, regex removal |
| Rename variables | rename_columns() | one or many; raises on missing source |
| Recode values | recode() | multi-column; pass-through non-targets; replace/into |
| Bin numeric | categorize() | bins, labels, right=; out-of-range → None |
| Cap extremes | top_code(), bottom_code(), bottom_and_top_code() | replace/into; order checks |
| Create/transform | mutate() | svy/Polars exprs, arrays/Series, callables; same-call deps |
Next Steps
Now that your data is clean and organized, learn how to plan your survey.
Mastered the basics?
Continue to Survey Planning →