Wrangling — basics

Clean names, recode, bin, clamp, and create new variables

svy provides helpers for wrangling the dataset held in a Sample. They are a light wrapper around Polars, designed not to replace it but to make common survey data cleaning and organization tasks faster and more consistent. The focus is on practical, frequently used transformations rather than the full breadth of Polars’ capabilities.

This tutorial shows the most common wrangling helpers in svy. Examples are small and focused. Advanced options live in the reference page.

import polars as pl
from svy import Sample, CaseStyle, LetterCase
import svy

Clean & rename

clean_names()

Standardize column names for easier downstream work.

df = pl.DataFrame({"First Name": ["Ana"], "Income ($)": [5000]})
s = Sample(df).wrangling.clean_names(case_style=CaseStyle.SNAKE, letter_case=LetterCase.LOWER)
s.data.columns  # -> ["svy_row_index","first_name","income"]
['svy_row_index', 'first_name', 'income']

You can also remove characters with a regex or pick other styles (camel/pascal).

df2 = pl.DataFrame({"A#B": [1], "C&D": [2]})
s2 = Sample(df2).wrangling.clean_names(remove=r"[^a-zA-Z0-9]", letter_case=LetterCase.UPPER)
s2.data.columns  # -> ["svy_row_index","AB","CD"]
['svy_row_index', 'AB', 'CD']
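To make the idea of name standardization concrete, here is a minimal plain-Python sketch of snake-casing. It is not svy's implementation; the helper name to_snake and its exact rules (collapse every run of non-alphanumeric characters into one underscore, trim, lower-case) are assumptions chosen for illustration.

```python
import re


def to_snake(name: str) -> str:
    """Hypothetical helper: lower-case snake_case version of a column name."""
    # Collapse every run of non-alphanumeric characters into one underscore
    collapsed = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    # Drop underscores left at either end, then lower-case the result
    return collapsed.strip("_").lower()


names = ["First Name", "Income ($)"]
cleaned = [to_snake(n) for n in names]
print(cleaned)  # ['first_name', 'income']
```

Consistent names like these are easier to type and less likely to need quoting or escaping in later expressions.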

rename_columns()

Rename variables directly (raises if a source name is missing).

df = pl.DataFrame({"old": [1,2,3]})
s = Sample(df).wrangling.rename_columns({"old": "new"})
s.data.columns  # -> ["svy_row_index","new"]
['svy_row_index', 'new']
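The "raises if a source name is missing" behavior can be pictured with a small plain-Python sketch. The function rename_strict is hypothetical, and the exception type it raises is an assumption; svy may raise a different error class.

```python
def rename_strict(columns, mapping):
    """Hypothetical helper: rename columns, failing on unknown source names."""
    missing = [old for old in mapping if old not in columns]
    if missing:
        # Assumption: KeyError is used here only for illustration
        raise KeyError(f"columns not found: {missing}")
    # Names without an entry in the mapping are kept as they are
    return [mapping.get(name, name) for name in columns]


renamed = rename_strict(["old", "other"], {"old": "new"})
print(renamed)  # ['new', 'other']

try:
    rename_strict(["old"], {"typo": "new"})
except KeyError as err:
    error_message = str(err)
    print("rejected:", error_message)
```

Failing early on a misspelled source name avoids a rename that silently does nothing.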

Recode categories

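Map old values to new labels; everything else passes through unchanged. recode() works on one or several columns. By default the result goes to a new column named svy_<column>_recoded; pass replace=True to overwrite the source column, or into= to choose the output name.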

df = pl.DataFrame({"item": ["apple","soap","carrot","tv"]})
s = Sample(df)

# Map to "Food"; non-targets remain
s1 = s.wrangling.recode("item", {"Food": ["apple","carrot"]})
s1.show_data(columns=["item","svy_item_recoded"])

# Custom output name
s2 = s.wrangling.recode("item", {"Food": ["apple","carrot"]}, replace=False, into="item_grp")
s2.show_data(columns=["item","item_grp"])
shape: (4, 2)
item item_grp
str str
"apple" "Food"
"soap" "soap"
"carrot" "Food"
"tv" "tv"

Multiple columns:

df = pl.DataFrame({"a": [0,1,2], "b": [1,1,9]})
s  = Sample(df).wrangling.recode(["a","b"], {"ONE":[1]})
s.show_data(columns=["svy_a_recoded","svy_b_recoded"])
shape: (3, 2)
svy_a_recoded svy_b_recoded
object object
0 ONE
ONE ONE
2 9
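The pass-through rule is easy to see in a plain-Python sketch. This is an illustration of the mapping logic only, not svy's code; recode_values is a hypothetical name.

```python
def recode_values(values, mapping):
    """Hypothetical helper: apply {"new label": [old values]} to a list."""
    # Invert the mapping into {old value: new label} for direct lookup
    lookup = {old: new for new, olds in mapping.items() for old in olds}
    # Values that are not targeted by the mapping pass through unchanged
    return [lookup.get(v, v) for v in values]


items = recode_values(["apple", "soap", "carrot", "tv"], {"Food": ["apple", "carrot"]})
print(items)  # ['Food', 'soap', 'Food', 'tv']

# Recoding a numeric column to a string label leaves a mix of types
mixed = recode_values([0, 1, 2], {"ONE": [1]})
print(mixed)  # [0, 'ONE', 2]
```

The second call mixes integers and strings in one column, which is presumably why the multi-column output above reports an object dtype. Recoding every value, or casting the column first, keeps the result uniformly typed.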

Categorize continuous variables

Turn continuous values into labeled bins. Out-of-range values become None.

df = pl.DataFrame({"x": [1,10,20,30,40]})

# Right-closed: (0,10], (10,20], (20,30]
s = Sample(df).wrangling.categorize("x", bins=[0,10,20,30])
s.show_data(columns=["x","svy_x_categorized"])

# Left-closed: [0,10), [10,20), [20,30)
s2 = Sample(df).wrangling.categorize("x", bins=[0,10,20,30], right=False)
s2.show_data(columns=["x","svy_x_categorized"])
shape: (5, 2)
x svy_x_categorized
i64 str
1 "[0, 10)"
10 "[10, 20)"
20 "[20, 30)"
30 null
40 null
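The difference between right-closed and left-closed bins shows up only at the edges. The sketch below reproduces the edge handling in plain Python using the standard bisect module; bin_label is a hypothetical helper, and the label format for the right-closed case is an assumption modeled on the left-closed output above.

```python
from bisect import bisect_left, bisect_right


def bin_label(x, bins, right=True):
    """Hypothetical helper: interval label for x, or None if out of range."""
    lo, hi = bins[0], bins[-1]
    if right:
        # Right-closed intervals (a, b]: the lowest edge itself is excluded
        if not (lo < x <= hi):
            return None
        i = bisect_left(bins, x)  # index of the upper edge of x's interval
        return f"({bins[i - 1]}, {bins[i]}]"
    # Left-closed intervals [a, b): the highest edge itself is excluded
    if not (lo <= x < hi):
        return None
    i = bisect_right(bins, x)  # index of the upper edge of x's interval
    return f"[{bins[i - 1]}, {bins[i]})"


xs = [1, 10, 20, 30, 40]
edges = [0, 10, 20, 30]
right_closed = [bin_label(x, edges) for x in xs]
left_closed = [bin_label(x, edges, right=False) for x in xs]
print(right_closed)  # ['(0, 10]', '(0, 10]', '(10, 20]', '(20, 30]', None]
print(left_closed)   # ['[0, 10)', '[10, 20)', '[20, 30)', None, None]
```

Note how 10 falls in the first bin when intervals are right-closed and in the second when they are left-closed, and how 30 is in range only in the right-closed case.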

Provide your own labels and/or choose the output column name with into=:

df = pl.DataFrame({"v": [1,2,6,11]})
s  = Sample(df).wrangling.categorize(
    "v", bins=[0,5,10,20], labels=["(0,5]","(5,10]","(10,20]"], into="v_band"
)
s.show_data(columns=["v","v_band"])
shape: (4, 2)
v v_band
i64 str
1 "(0,5]"
2 "(0,5]"
6 "(5,10]"
11 "(10,20]"

Cap extremes (top/bottom coding)

Clamp high or low values to chosen bounds. Output columns are named automatically (for example, svy_income_bottom_and_top_coded) unless you set replace=True or into=.

df = pl.DataFrame({"income": [100, 300, 600, 1000]})
s  = Sample(df)

# Clamp to [250, 800] in a new column
c1 = s.wrangling.bottom_and_top_code({"income": (250, 800)}, replace=False)
c1.show_data(columns=["income","svy_income_bottom_and_top_coded"])

# Replace in place
c2 = s.wrangling.bottom_and_top_code({"income": (250, 800)}, replace=True)
c2.show_data(columns=["income"])
shape: (4, 1)
income
i64
250
300
600
800

You can also do only top or only bottom coding:

Sample(pl.DataFrame({"x":[1,5,10]})).wrangling.top_code({"x":5}).show_data()
Sample(pl.DataFrame({"y":[1,3,7]})).wrangling.bottom_code({"y":3}).show_data()
shape: (3, 3)
svy_row_index y svy_y_bottom_coded
u32 i64 i64
0 1 3
1 3 3
2 7 7
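Top and bottom coding amount to clamping each value to a lower bound, an upper bound, or both. A minimal plain-Python sketch, assuming a hypothetical helper named clamp and assuming that bounds given in the wrong order should be rejected:

```python
def clamp(values, lower=None, upper=None):
    """Hypothetical helper: bottom-code to lower and/or top-code to upper."""
    if lower is not None and upper is not None and lower > upper:
        # Assumption: reversed bounds are treated as an error
        raise ValueError("lower bound must not exceed upper bound")
    out = []
    for v in values:
        if lower is not None:
            v = max(v, lower)  # bottom coding
        if upper is not None:
            v = min(v, upper)  # top coding
        out.append(v)
    return out


both = clamp([100, 300, 600, 1000], lower=250, upper=800)
top_only = clamp([1, 5, 10], upper=5)
bottom_only = clamp([1, 3, 7], lower=3)
print(both)         # [250, 300, 600, 800]
print(top_only)     # [1, 5, 5]
print(bottom_only)  # [3, 3, 7]
```

Values already inside the bounds are left untouched, so only the extremes change.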

Mutate (create or transform columns)

Create new variables from expressions, arrays, Series, or callables. Dependencies created in the same call are supported.

from svy.core.expr import col, when
import numpy as np

df = pl.DataFrame({"a":[1,2,3], "b":[10,20,30]})
s  = Sample(df)

out = s.wrangling.mutate({
    "c": col("a") * 2,           # svy expr
    "d": np.array([1,1,1]),      # numpy
    "e": col("c") + col("d"),    # depends on c
    "flag": when(col("e") > 3).then(1).otherwise(0),  # conditional
})
out.show_data(columns=["a","b","c","d","e","flag"])
shape: (3, 6)
a b c d e flag
i64 i64 i64 i64 i64 i32
1 10 2 1 3 0
2 20 4 1 5 1
3 30 6 1 7 1

Common gotchas

- Referencing a missing column raises a clear error.
- Length mismatches for arrays/Series raise a compile error.
- Circular dependencies (e.g., x depends on y and y on x) are blocked.
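Same-call dependencies and the ban on circular ones can both be understood as a topological ordering problem. The sketch below uses the standard library's graphlib to show the idea; it is not how svy is known to resolve dependencies, and evaluation_order is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter


def evaluation_order(deps):
    """Hypothetical helper: order new columns so dependencies come first.

    deps maps each new column to the set of new columns it reads.
    """
    return list(TopologicalSorter(deps).static_order())


# Mirrors the mutate() example: e reads c and d, flag reads e
order = evaluation_order({"c": set(), "d": set(), "e": {"c", "d"}, "flag": {"e"}})
print(order)

# A circular definition cannot be ordered and is rejected
try:
    evaluation_order({"x": {"y"}, "y": {"x"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
print("cycle detected:", cycle_detected)
```

Any valid order places c and d before e, and e before flag, which is why the dictionary passed to mutate() does not have to list columns in dependency order under this model.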

Cheatsheet

Task               Method(s)                                         Notes
Standardize names  clean_names()                                     styles (snake/camel/pascal), letter case, regex removal
Rename variables   rename_columns()                                  one or many; raises on missing source
Recode values      recode()                                          multi-column; pass-through non-targets; replace/into
Bin numeric        categorize()                                      bins, labels, right=; out-of-range → None
Cap extremes       top_code(), bottom_code(), bottom_and_top_code()  replace/into; order checks
Create/transform   mutate()                                          svy/Polars exprs, arrays/Series, callables; same-call deps

Next Steps

Next, we discuss planning surveys: Planning.