Data Wrangling

Clean names, recode, bin, clamp, and create new variables

Clean, transform, and prepare survey data with svy. Learn to standardize names, recode categories, bin variables, and create new columns.

svy provides functionality for wrangling the dataset contained within a Sample. This functionality is a light wrapper around Polars. It is not meant to replace Polars; it makes common survey data cleaning and organization tasks faster and more consistent, and it covers practical, frequently used transformations rather than the full breadth of Polars' capabilities.

This tutorial shows the most common wrangling helpers in svy. Examples are small and focused. Advanced options live in the reference documentation.

import polars as pl
import svy
from svy import Sample, CaseStyle, LetterCase

Clean & Rename

clean_names()

Standardize column names for easier downstream work.

df = pl.DataFrame({"First Name": ["Ana"], "Income ($)": [5000]})
s = Sample(df).wrangling.clean_names(
    case_style=CaseStyle.SNAKE,
    letter_case=LetterCase.LOWER
)
s.data.columns
['svy_row_index', 'first_name', 'income']

You can also remove characters with a regex or pick other styles (camel/pascal).

df2 = pl.DataFrame({"A#B": [1], "C&D": [2]})
s2 = Sample(df2).wrangling.clean_names(
    remove=r"[^a-zA-Z0-9]",
    letter_case=LetterCase.UPPER
)
s2.data.columns
['svy_row_index', 'AB', 'CD']
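
To make the two cleaning modes above concrete, here is a minimal plain-Python sketch of the same idea. It is not svy's implementation: the helper names `to_snake_lower` and `strip_and_upper` are hypothetical, and the exact rules svy applies (for example, how it treats repeated separators) are assumptions.

```python
import re

def to_snake_lower(name: str) -> str:
    """Hypothetical helper: approximate snake_case + lowercase cleaning."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    # Drop underscores left at either end, then lowercase.
    return cleaned.strip("_").lower()

def strip_and_upper(name: str, remove: str) -> str:
    """Hypothetical helper: delete characters matching `remove`, then uppercase."""
    return re.sub(remove, "", name).upper()

print([to_snake_lower(n) for n in ["First Name", "Income ($)"]])
# ['first_name', 'income']
print([strip_and_upper(n, r"[^a-zA-Z0-9]") for n in ["A#B", "C&D"]])
# ['AB', 'CD']
```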

rename_columns()

Rename variables directly (raises if a source name is missing).

df = pl.DataFrame({"old": [1, 2, 3]})
s = Sample(df).wrangling.rename_columns(renames={"old": "new"})
s.data.columns
['svy_row_index', 'new']
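
The "raises if a source name is missing" behavior can be pictured with a small plain-Python sketch. The function `rename_strict` is hypothetical, and `KeyError` is an assumed exception type; the page does not say which exception svy raises.

```python
def rename_strict(columns: list[str], renames: dict[str, str]) -> list[str]:
    """Hypothetical helper: rename columns, refusing unknown source names."""
    # Check every source name before changing anything.
    missing = [old for old in renames if old not in columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    # Columns without an entry in `renames` keep their name.
    return [renames.get(c, c) for c in columns]

print(rename_strict(["svy_row_index", "old"], {"old": "new"}))
# ['svy_row_index', 'new']
```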

Recode Categories

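Map old values to new labels; values that are not listed pass through unchanged. recode() works on one or more columns. By default the result is written to a new column named svy_<col>_recoded; set replace=True to overwrite the source column, or use into= to choose the output name.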

df = pl.DataFrame({"item": ["apple", "soap", "carrot", "tv"]})
s = Sample(df)

# Map to "Food"; non-targets remain
s1 = s.wrangling.recode("item", {"Food": ["apple", "carrot"]})
print(s1.show_data(columns=["item", "svy_item_recoded"]))

# Custom output name
s2 = s.wrangling.recode(
    cols="item",
    recodes={"Food": ["apple", "carrot"]},
    replace=False,
    into="item_grp"
)
print(s2.show_data(columns=["item", "item_grp"]))
shape: (4, 2)
┌────────┬──────────────────┐
│ item   ┆ svy_item_recoded │
│ ---    ┆ ---              │
│ str    ┆ str              │
╞════════╪══════════════════╡
│ apple  ┆ Food             │
│ soap   ┆ soap             │
│ carrot ┆ Food             │
│ tv     ┆ tv               │
└────────┴──────────────────┘
shape: (4, 2)
┌────────┬──────────┐
│ item   ┆ item_grp │
│ ---    ┆ ---      │
│ str    ┆ str      │
╞════════╪══════════╡
│ apple  ┆ Food     │
│ soap   ┆ soap     │
│ carrot ┆ Food     │
│ tv     ┆ tv       │
└────────┴──────────┘

Multiple columns:

df = pl.DataFrame({"a": [0, 1, 2], "b": [1, 1, 9]})
s = Sample(df).wrangling.recode(cols=["a", "b"], recodes={"ONE": [1]})
print(s.show_data(columns=["svy_a_recoded", "svy_b_recoded"]))
shape: (3, 2)
┌───────────────┬───────────────┐
│ svy_a_recoded ┆ svy_b_recoded │
│ ---           ┆ ---           │
│ object        ┆ object        │
╞═══════════════╪═══════════════╡
│ 0             ┆ ONE           │
│ ONE           ┆ ONE           │
│ 2             ┆ 9             │
└───────────────┴───────────────┘
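
The recodes argument maps each new label to the list of old values it replaces. A plain-Python sketch of that lookup, with pass-through for everything else, might look like the following. `recode_values` is a hypothetical helper that illustrates the mapping only, not how svy implements it.

```python
def recode_values(values, recodes):
    """Hypothetical helper: apply {new_label: [old_values]} with pass-through."""
    # Invert the mapping so each old value points at its new label.
    lookup = {old: new for new, olds in recodes.items() for old in olds}
    # Values without an entry are returned unchanged.
    return [lookup.get(v, v) for v in values]

print(recode_values(["apple", "soap", "carrot", "tv"], {"Food": ["apple", "carrot"]}))
# ['Food', 'soap', 'Food', 'tv']
print(recode_values([1, 1, 9], {"ONE": [1]}))
# ['ONE', 'ONE', 9]
```

The second call shows why the recoded columns above have a mixed type: recoded entries become strings while untouched entries stay integers.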

Categorize Continuous Variables

Turn continuous values into labeled bins. Values that fall outside the bins become null (None in Python).

df = pl.DataFrame({"x": [1, 10, 20, 30, 40]})

# Right-closed: (0,10], (10,20], (20,30]
s = Sample(df).wrangling.categorize(col="x", bins=[0, 10, 20, 30])
s.show_data(columns=["x", "svy_x_categorized"])

# Left-closed: [0,10), [10,20), [20,30)
s2 = Sample(df).wrangling.categorize(col="x", bins=[0, 10, 20, 30], right=False)
print(s2.show_data(columns=["x", "svy_x_categorized"]))
shape: (5, 2)
┌─────┬───────────────────┐
│ x   ┆ svy_x_categorized │
│ --- ┆ ---               │
│ i64 ┆ str               │
╞═════╪═══════════════════╡
│ 1   ┆ [0, 10)           │
│ 10  ┆ [10, 20)          │
│ 20  ┆ [20, 30)          │
│ 30  ┆ null              │
│ 40  ┆ null              │
└─────┴───────────────────┘
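
The difference between right-closed and left-closed bins only matters at the edges. The sketch below reproduces the interval logic in plain Python with the standard bisect module. The `categorize` function here is a hypothetical stand-in, not svy's code, and its label format is copied from the printed output above as an assumption.

```python
from bisect import bisect_left, bisect_right

def categorize(values, bins, right=True):
    """Hypothetical helper: label each value with the bin it falls into."""
    out = []
    for x in values:
        # bisect_left puts a value equal to an edge in the bin to its left
        # (right-closed); bisect_right puts it in the bin to its right.
        idx = (bisect_left(bins, x) if right else bisect_right(bins, x)) - 1
        if 0 <= idx < len(bins) - 1:
            lo, hi = bins[idx], bins[idx + 1]
            out.append(f"({lo}, {hi}]" if right else f"[{lo}, {hi})")
        else:
            out.append(None)  # outside every bin
    return out

xs = [1, 10, 20, 30, 40]
print(categorize(xs, [0, 10, 20, 30]))
# ['(0, 10]', '(0, 10]', '(10, 20]', '(20, 30]', None]
print(categorize(xs, [0, 10, 20, 30], right=False))
# ['[0, 10)', '[10, 20)', '[20, 30)', None, None]
```

Note how 30 lands in the last bin when bins are right-closed but falls out of range when they are left-closed.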

Provide your own labels and/or choose the output column name with into=:

df = pl.DataFrame({"v": [1, 2, 6, 11]})
s = Sample(df).wrangling.categorize(
    col="v",
    bins=[0, 5, 10, 20],
    labels=["(0,5]", "(5,10]", "(10,20]"],
    into="v_band"
)
print(s.show_data(columns=["v", "v_band"]))
shape: (4, 2)
┌─────┬─────────┐
│ v   ┆ v_band  │
│ --- ┆ ---     │
│ i64 ┆ str     │
╞═════╪═════════╡
│ 1   ┆ (0,5]   │
│ 2   ┆ (0,5]   │
│ 6   ┆ (5,10]  │
│ 11  ┆ (10,20] │
└─────┴─────────┘

Cap Extremes (Top/Bottom Coding)

Clamp high or low values to chosen bounds. Output columns are auto-named (for example, svy_income_bottom_and_top_coded) unless you set replace=True or into=.

df = pl.DataFrame({"income": [100, 300, 600, 1000]})
s = Sample(df)

# Clamp to [250, 800] in a new column
c1 = s.wrangling.bottom_and_top_code(bottom_and_top_codes={"income": (250, 800)}, replace=False)
c1.show_data(columns=["income", "svy_income_bottom_and_top_coded"])

# Replace in place
c2 = s.wrangling.bottom_and_top_code(bottom_and_top_codes={"income": (250, 800)}, replace=True)
print(c2.show_data(columns=["income"]))
shape: (4, 1)
┌────────┐
│ income │
│ ---    │
│ i64    │
╞════════╡
│ 250    │
│ 300    │
│ 600    │
│ 800    │
└────────┘

You can also do only top or only bottom coding:

print(Sample(pl.DataFrame({"x": [1, 5, 10]})).wrangling.top_code(top_codes={"x": 5}).show_data())
print(Sample(pl.DataFrame({"y": [1, 3, 7]})).wrangling.bottom_code(bottom_codes={"y": 3}).show_data())
shape: (3, 3)
┌───────────────┬─────┬─────────────────┐
│ svy_row_index ┆ x   ┆ svy_x_top_coded │
│ ---           ┆ --- ┆ ---             │
│ u32           ┆ i64 ┆ i64             │
╞═══════════════╪═════╪═════════════════╡
│ 0             ┆ 1   ┆ 1               │
│ 1             ┆ 5   ┆ 5               │
│ 2             ┆ 10  ┆ 5               │
└───────────────┴─────┴─────────────────┘
shape: (3, 3)
┌───────────────┬─────┬────────────────────┐
│ svy_row_index ┆ y   ┆ svy_y_bottom_coded │
│ ---           ┆ --- ┆ ---                │
│ u32           ┆ i64 ┆ i64                │
╞═══════════════╪═════╪════════════════════╡
│ 0             ┆ 1   ┆ 3                  │
│ 1             ┆ 3   ┆ 3                  │
│ 2             ┆ 7   ┆ 7                  │
└───────────────┴─────┴────────────────────┘
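
Top and bottom coding amount to a clamp with an optional bound on each side. The following plain-Python sketch shows that arithmetic. The function `clamp_values` is hypothetical, and raising `ValueError` when the bounds are reversed is an assumption about what the "order checks" in the reference table mean.

```python
def clamp_values(values, bottom=None, top=None):
    """Hypothetical helper: bottom-code, top-code, or both."""
    # Assumed order check: the lower bound must not exceed the upper bound.
    if bottom is not None and top is not None and bottom > top:
        raise ValueError("bottom must not exceed top")
    out = []
    for v in values:
        if bottom is not None:
            v = max(v, bottom)  # raise values below the floor
        if top is not None:
            v = min(v, top)     # lower values above the ceiling
        out.append(v)
    return out

print(clamp_values([100, 300, 600, 1000], bottom=250, top=800))
# [250, 300, 600, 800]
print(clamp_values([1, 5, 10], top=5))
# [1, 5, 5]
print(clamp_values([1, 3, 7], bottom=3))
# [3, 3, 7]
```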

Mutate (Create or Transform Columns)

Create new variables from expressions, arrays, Series, or callables. A new column may reference another column created in the same call.

from svy.core.expr import col, when
import numpy as np

df = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
s = Sample(df)

out = s.wrangling.mutate(specs={
    "c": col("a") * 2,           # svy expr
    "d": np.array([1, 1, 1]),    # numpy
    "e": col("c") + col("d"),    # depends on c
    "flag": when(col("e") > 3).then(1).otherwise(0),  # conditional
})
print(out.show_data(columns=["a", "b", "c", "d", "e", "flag"]))
shape: (3, 6)
┌─────┬─────┬─────┬─────┬─────┬──────┐
│ a   ┆ b   ┆ c   ┆ d   ┆ e   ┆ flag │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ ---  │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i32  │
╞═════╪═════╪═════╪═════╪═════╪══════╡
│ 1   ┆ 10  ┆ 2   ┆ 1   ┆ 3   ┆ 0    │
│ 2   ┆ 20  ┆ 4   ┆ 1   ┆ 5   ┆ 1    │
│ 3   ┆ 30  ┆ 6   ┆ 1   ┆ 7   ┆ 1    │
└─────┴─────┴─────┴─────┴─────┴──────┘

Common gotchas:

- Referencing a missing column raises a clear error
- Length mismatches for arrays/Series raise a compile error
- Circular dependencies (e.g., x depends on y and y on x) are blocked
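
Supporting same-call dependencies while blocking circular ones is a topological-ordering problem. The sketch below shows one way to resolve an evaluation order with the standard library's graphlib. It is an illustration under stated assumptions, not svy's implementation: `evaluation_order` is a hypothetical helper, the dependency sets are written out by hand rather than extracted from expressions, and the exception types are assumed.

```python
from graphlib import CycleError, TopologicalSorter

def evaluation_order(existing, deps):
    """Hypothetical helper.

    existing: names of columns already in the data.
    deps: new column name -> set of column names its expression reads.
    """
    known = set(existing) | set(deps)
    for name, needs in deps.items():
        missing = set(needs) - known
        if missing:
            raise KeyError(f"{name!r} references unknown column(s): {sorted(missing)}")
    # Only dependencies on other new columns constrain the order.
    graph = {name: {d for d in needs if d in deps} for name, needs in deps.items()}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc

# Dependencies of the mutate() example above, written out by hand.
order = evaluation_order(
    existing={"a", "b"},
    deps={"c": {"a"}, "d": set(), "e": {"c", "d"}, "flag": {"e"}},
)
print(order)  # "c" and "d" come before "e", and "e" before "flag"
```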

Quick Reference

| Task | Method(s) | Notes |
| --- | --- | --- |
| Standardize names | clean_names() | styles (snake/camel/pascal), letter case, regex removal |
| Rename variables | rename_columns() | one or many; raises on missing source |
| Recode values | recode() | multi-column; pass-through non-targets; replace/into |
| Bin numeric | categorize() | bins, labels, right=; out-of-range → None |
| Cap extremes | top_code(), bottom_code(), bottom_and_top_code() | replace/into; order checks |
| Create/transform | mutate() | svy/Polars exprs, arrays/Series, callables; same-call deps |

Next Steps

Now that your data is clean and organized, learn how to plan your survey.
