import polars as pl
from svy import Sample, CaseStyle, LetterCase
import svy
Wrangling: basics
Clean names, recode, bin, clamp, and create new variables
svy provides helpers for wrangling the dataset held in a Sample. They are a light wrapper around Polars, designed not to replace it but to make common survey data cleaning and organization tasks faster and more consistent. The helpers cover practical, frequently used transformations rather than the full breadth of Polars' capabilities.
This tutorial shows the most common wrangling helpers in svy. Examples are small and focused. Advanced options live in the reference page.
Clean & rename
clean_names()
Standardize column names for easier downstream work.
df = pl.DataFrame({"First Name": ["Ana"], "Income ($)": [5000]})
s = Sample(df).wrangling.clean_names(case_style=CaseStyle.SNAKE, letter_case=LetterCase.LOWER)
s.data.columns  # -> ["svy_row_index", "first_name", "income"]
You can also remove characters with a regex or pick other styles (camel/pascal).
df2 = pl.DataFrame({"A#B": [1], "C&D": [2]})
s2 = Sample(df2).wrangling.clean_names(remove=r"[^a-zA-Z0-9]", letter_case=LetterCase.UPPER)
s2.data.columns  # -> ["svy_row_index", "AB", "CD"]
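To make the cleaning rules concrete, here is a minimal plain-Python sketch that reproduces the two results above. It is not svy's implementation: the function `clean_name` and its `upper` flag are hypothetical names used only for illustration, and the sketch ignores the `svy_row_index` column that Sample adds.

```python
import re


def clean_name(name, remove=None, upper=False):
    """Illustrative name cleaner (hypothetical helper, not part of svy)."""
    if remove is not None:
        # Drop every character matched by the caller-supplied regex.
        name = re.sub(remove, "", name)
    # Collapse each run of non-alphanumeric characters into one underscore,
    # then trim underscores left at either end.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")
    return name.upper() if upper else name.lower()


cleaned = [clean_name(c) for c in ["First Name", "Income ($)"]]
stripped = [clean_name(c, remove=r"[^a-zA-Z0-9]", upper=True) for c in ["A#B", "C&D"]]
print(cleaned)   # ['first_name', 'income']
print(stripped)  # ['AB', 'CD']
```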
rename_columns()
Rename variables directly (raises if a source name is missing).
df = pl.DataFrame({"old": [1,2,3]})
s = Sample(df).wrangling.rename_columns({"old": "new"})
s.data.columns  # -> ["svy_row_index", "new"]
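The "raises if a source name is missing" behavior can be pictured with a small plain-Python sketch. This is an assumption-laden illustration, not svy code: the standalone function below is hypothetical, and the exception type svy actually raises is not stated in this tutorial, so `KeyError` is a placeholder.

```python
def rename_columns(columns, mapping):
    """Rename entries of a column list (hypothetical stand-in, not svy)."""
    # Validate first so nothing is renamed when any source name is absent.
    missing = [old for old in mapping if old not in columns]
    if missing:
        # KeyError is an assumed exception type for this sketch.
        raise KeyError(f"columns not found: {missing}")
    return [mapping.get(name, name) for name in columns]


renamed = rename_columns(["svy_row_index", "old"], {"old": "new"})
print(renamed)  # ['svy_row_index', 'new']

try:
    rename_columns(["old"], {"oops": "new"})
    rename_failed = False
except KeyError:
    rename_failed = True
```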
Recode categories
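Map old values to new labels; everything else passes through unchanged. recode() works on one or multiple columns. By default the result goes to an auto-named column (svy_<name>_recoded); set replace=True to overwrite the source column, or use into= to write to a column name you choose.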
df = pl.DataFrame({"item": ["apple","soap","carrot","tv"]})
s = Sample(df)
# Map to "Food"; non-targets remain
s1 = s.wrangling.recode("item", {"Food": ["apple","carrot"]})
s1.show_data(columns=["item","svy_item_recoded"])
# Custom output name
s2 = s.wrangling.recode("item", {"Food": ["apple","carrot"]}, replace=False, into="item_grp")
s2.show_data(columns=["item","item_grp"])

| item | item_grp |
|---|---|
| str | str |
| "apple" | "Food" |
| "soap" | "soap" |
| "carrot" | "Food" |
| "tv" | "tv" |
Multiple columns:
df = pl.DataFrame({"a": [0,1,2], "b": [1,1,9]})
s = Sample(df).wrangling.recode(["a","b"], {"ONE":[1]})
s.show_data(columns=["svy_a_recoded","svy_b_recoded"])

| svy_a_recoded | svy_b_recoded |
|---|---|
| object | object |
| 0 | ONE |
| ONE | ONE |
| 2 | 9 |
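The pass-through rule is easiest to see in plain Python. The sketch below is not how svy implements recode(); it only mirrors the mapping shape used above (new label -> list of old values), and the standalone `recode` function is a hypothetical helper.

```python
def recode(values, mapping):
    """Map old values to new labels; unmatched values pass through unchanged."""
    # Invert {new_label: [old values]} into a flat {old_value: new_label} lookup.
    lookup = {old: new for new, olds in mapping.items() for old in olds}
    return [lookup.get(v, v) for v in values]


items = recode(["apple", "soap", "carrot", "tv"], {"Food": ["apple", "carrot"]})
col_a = recode([0, 1, 2], {"ONE": [1]})
col_b = recode([1, 1, 9], {"ONE": [1]})
print(items)  # ['Food', 'soap', 'Food', 'tv']
print(col_a)  # [0, 'ONE', 2]
print(col_b)  # ['ONE', 'ONE', 9]
```

Because non-targets keep their original values, recoding a numeric column with string labels yields a mixed-type result, which is consistent with the object dtype shown in the multi-column output.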
Categorize continuous variables
Turn continuous values into labeled bins. Out-of-range values become None.
df = pl.DataFrame({"x": [1,10,20,30,40]})
# Right-closed: (0,10], (10,20], (20,30]
s = Sample(df).wrangling.categorize("x", bins=[0,10,20,30])
s.show_data(columns=["x","svy_x_categorized"])
# Left-closed: [0,10), [10,20), [20,30)
s2 = Sample(df).wrangling.categorize("x", bins=[0,10,20,30], right=False)
s2.show_data(columns=["x","svy_x_categorized"])

| x | svy_x_categorized |
|---|---|
| i64 | str |
| 1 | "[0, 10)" |
| 10 | "[10, 20)" |
| 20 | "[20, 30)" |
| 30 | null |
| 40 | null |
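The difference between right-closed and left-closed bins, and the None result for out-of-range values, can be checked with a short standard-library sketch. It illustrates the interval logic only and is not svy's implementation; the standalone `categorize` function and its label format are assumptions chosen to match the output above.

```python
from bisect import bisect_left, bisect_right


def categorize(value, bins, right=True):
    """Return an interval label for value, or None if it falls outside bins."""
    if right:
        # Index of the first edge >= value, so value lies in (bins[i-1], bins[i]].
        i = bisect_left(bins, value)
    else:
        # Index of the first edge > value, so value lies in [bins[i-1], bins[i]).
        i = bisect_right(bins, value)
    if i == 0 or i == len(bins):
        return None  # below the first edge or above the last one
    lo, hi = bins[i - 1], bins[i]
    return f"({lo}, {hi}]" if right else f"[{lo}, {hi})"


xs = [1, 10, 20, 30, 40]
edges = [0, 10, 20, 30]
right_closed = [categorize(x, edges) for x in xs]
left_closed = [categorize(x, edges, right=False) for x in xs]
print(right_closed)  # ['(0, 10]', '(0, 10]', '(10, 20]', '(20, 30]', None]
print(left_closed)   # ['[0, 10)', '[10, 20)', '[20, 30)', None, None]
```

Note how the boundary values 10, 20, and 30 move to the next bin when right=False, which is why 30 falls out of range in the left-closed table.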
Provide your own labels and/or write the result to a column name of your choice:
df = pl.DataFrame({"v": [1,2,6,11]})
s = Sample(df).wrangling.categorize(
"v", bins=[0,5,10,20], labels=["(0,5]","(5,10]","(10,20]"], into="v_band"
)
s.show_data(columns=["v","v_band"])

| v | v_band |
|---|---|
| i64 | str |
| 1 | "(0,5]" |
| 2 | "(0,5]" |
| 6 | "(5,10]" |
| 11 | "(10,20]" |
Cap extremes (top/bottom coding)
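Clamp high or low values to chosen bounds. By default the result goes to an auto-named column (for example, svy_income_bottom_and_top_coded); set replace=True to overwrite the source column, or use into= to choose the output name.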
df = pl.DataFrame({"income": [100, 300, 600, 1000]})
s = Sample(df)
# Clamp to [250, 800] in a new column
c1 = s.wrangling.bottom_and_top_code({"income": (250, 800)}, replace=False)
c1.show_data(columns=["income","svy_income_bottom_and_top_coded"])
# Replace in place
c2 = s.wrangling.bottom_and_top_code({"income": (250, 800)}, replace=True)
c2.show_data(columns=["income"])

| income |
|---|
| i64 |
| 250 |
| 300 |
| 600 |
| 800 |
You can also do only top or only bottom coding:
Sample(pl.DataFrame({"x":[1,5,10]})).wrangling.top_code({"x":5}).show_data()
Sample(pl.DataFrame({"y":[1,3,7]})).wrangling.bottom_code({"y":3}).show_data()

| svy_row_index | y | svy_y_bottom_coded |
|---|---|---|
| u32 | i64 | i64 |
| 0 | 1 | 3 |
| 1 | 3 | 3 |
| 2 | 7 | 7 |
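Top and bottom coding amount to clamping each value to a bound. The plain-Python sketch below reproduces the numbers in this section; it is not svy's implementation, the standalone function is hypothetical, and the bound-order check is an assumption based on the "order checks" note in the cheatsheet.

```python
def bottom_and_top_code(values, lower=None, upper=None):
    """Clamp values to [lower, upper]; a bound of None leaves that side open."""
    if lower is not None and upper is not None and lower > upper:
        # Assumed behavior: reject bounds given in the wrong order.
        raise ValueError("lower bound must not exceed upper bound")
    coded = []
    for v in values:
        if lower is not None and v < lower:
            v = lower  # bottom coding
        elif upper is not None and v > upper:
            v = upper  # top coding
        coded.append(v)
    return coded


both = bottom_and_top_code([100, 300, 600, 1000], lower=250, upper=800)
top_only = bottom_and_top_code([1, 5, 10], upper=5)
bottom_only = bottom_and_top_code([1, 3, 7], lower=3)
print(both)         # [250, 300, 600, 800]
print(top_only)     # [1, 5, 5]
print(bottom_only)  # [3, 3, 7]
```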
Mutate (create or transform columns)
Create new variables from expressions, arrays, Series, or callables. Dependencies created in the same call are supported.
from svy.core.expr import col, when
import numpy as np
df = pl.DataFrame({"a":[1,2,3], "b":[10,20,30]})
s = Sample(df)
out = s.wrangling.mutate({
"c": col("a") * 2, # svy expr
"d": np.array([1,1,1]), # numpy
"e": col("c") + col("d"), # depends on c
"flag": when(col("e") > 3).then(1).otherwise(0), # conditional
})
out.show_data(columns=["a","b","c","d","e","flag"])

| a | b | c | d | e | flag |
|---|---|---|---|---|---|
| i64 | i64 | i64 | i64 | i64 | i32 |
| 1 | 10 | 2 | 1 | 3 | 0 |
| 2 | 20 | 4 | 1 | 5 | 1 |
| 3 | 30 | 6 | 1 | 7 | 1 |
Common gotchas
- Referencing a missing column raises a clear error.
- Length mismatches for arrays/Series raise a compile error.
- Circular dependencies (e.g., x depends on y and y on x) are blocked.
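One way to picture same-call dependencies and the three gotchas is a dependency-ordered evaluation. The sketch below is a guess at the general technique, not a description of svy's internals: it uses plain dicts of lists instead of a Sample, represents each computed column as a hypothetical `(dependencies, function)` pair, and picks exception types (`KeyError`, `ValueError`) that are assumptions.

```python
from graphlib import CycleError, TopologicalSorter


def mutate(data, specs):
    """Add columns to data (dict of lists) in dependency order.

    Each spec is either a list of values or a (deps, fn) pair, where fn is
    called row by row with the values of the columns named in deps.
    """
    n_rows = len(next(iter(data.values())))
    graph = {}
    for name, spec in specs.items():
        deps = spec[0] if isinstance(spec, tuple) else ()
        for dep in deps:
            if dep not in data and dep not in specs:
                raise KeyError(f"column {dep!r} not found (needed by {name!r})")
        # Only columns created in this call constrain the evaluation order.
        graph[name] = {dep for dep in deps if dep in specs}
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc

    out = dict(data)
    for name in order:
        spec = specs[name]
        if isinstance(spec, tuple):
            deps, fn = spec
            values = [fn(*row) for row in zip(*(out[dep] for dep in deps))]
        else:
            values = list(spec)
        if len(values) != n_rows:
            raise ValueError(f"{name!r} has {len(values)} values, expected {n_rows}")
        out[name] = values
    return out


data = {"a": [1, 2, 3], "b": [10, 20, 30]}
result = mutate(data, {
    "c": (("a",), lambda a: a * 2),
    "d": [1, 1, 1],
    "e": (("c", "d"), lambda c, d: c + d),       # depends on c and d
    "flag": (("e",), lambda e: 1 if e > 3 else 0),
})
print(result["e"], result["flag"])  # [3, 5, 7] [0, 1, 1]
```

Sorting the columns topologically before evaluating them is what lets `e` refer to `c` and `d` from the same call, and it is also where a cycle such as x depending on y and y on x is detected before any value is computed.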
Cheatsheet
| Task | Method(s) | Notes |
|---|---|---|
| Standardize names | clean_names() | styles (snake/camel/pascal), letter case, regex removal |
| Rename variables | rename_columns() | one or many; raises on missing source |
| Recode values | recode() | multi-column; pass-through non-targets; replace/into |
| Bin numeric | categorize() | bins, labels, right=; out-of-range → None |
| Cap extremes | top_code(), bottom_code(), bottom_and_top_code() | replace/into; order checks |
| Create/transform | mutate() | svy/Polars exprs, arrays/Series, callables; same-call deps |
Next Steps
Next, we discuss planning surveys: Planning.