Survey Data Wrangling in Python with svy

A complete guide to cleaning, transforming, and preparing survey data

Tutorials
Data Wrangling
Data Preparation
Python
Clean, transform, and prepare survey data with svy. Learn to standardize column names, recode categorical variables, bin continuous data, handle missing values, filter rows, and create new variables.
Author

Mamadou S. Diallo, Ph.D.

Published

January 18, 2026

Modified

February 16, 2026

Keywords

survey data wrangling Python, clean survey data svy, recode variables Python, bin continuous variables, survey data preparation, Polars data cleaning, column name standardization, top code bottom code, categorical variable recoding, mutate survey data, svy wrangling tutorial, data transformation Python

Why Wrangling in svy?

svy provides a wrangling interface for transforming the dataset within a Sample object. This approach has key advantages over extracting data to Polars and re-importing:

  • Design integrity: Column renames automatically propagate to design metadata (strata, PSU, weights)
  • Chained operations: Build transformation pipelines with method chaining
  • Immutability: Transformations return new Sample objects by default, preserving the original
  • Survey-aware: Protected columns (design variables) cannot be accidentally deleted

The wrangling module wraps common Polars operations—it’s not meant to replace Polars, but to make frequent survey data tasks faster and safer.

import numpy as np
import polars as pl

import svy
from svy import Sample, CaseStyle, LetterCase

from rich import print as rprint

Column Operations

Clean Names

Standardize column names for easier downstream work. This is especially useful when importing data from SPSS, Stata, or Excel files with inconsistent naming.

df = pl.DataFrame({"First Name": ["Ana"], "Income ($)": [5000], "AGE_years": [25]})
s = Sample(df).wrangling.clean_names(
    case_style=CaseStyle.SNAKE,
    letter_case=LetterCase.LOWER
)
print(s.data.columns)
['svy_row_index', 'first_name', 'income', 'age_years']

Remove special characters with regex:

df = pl.DataFrame({"A#B": [1], "C&D": [2], "E@F": [3]})
s = Sample(df).wrangling.clean_names(
    remove=r"[^a-zA-Z0-9]",
    letter_case=LetterCase.UPPER
)
print(s.data.columns)
['svy_row_index', 'AB', 'CD', 'EF']

Available options:

Parameter Options Description
case_style SNAKE, CAMEL, PASCAL Target naming convention
letter_case LOWER, UPPER, TITLE Letter case transformation
remove regex string Characters to strip
minimal bool Only trim whitespace

Rename Columns

Rename specific columns directly:

df = pl.DataFrame({"old_name": [1, 2, 3], "another": [4, 5, 6]})
s = Sample(df).wrangling.rename_columns({"old_name": "new_name"})
print(s.data.columns)
['svy_row_index', 'new_name', 'another']
Note

When you rename a design column (stratum, PSU, weight), the Design object is automatically updated to reference the new name.

Select and Drop Columns

Keep only specific columns:

df = pl.DataFrame({"id": [1, 2], "name": ["A", "B"], "age": [20, 30], "temp": [0, 0]})
s = Sample(df)

# Keep specific columns
s_kept = s.wrangling.keep_columns(["id", "name", "age"])
print(s_kept.data.columns)

# Alias: select()
s_selected = s.wrangling.select(["id", "age"])
print(s_selected.data.columns)
['id', 'name', 'age', 'svy_row_index']
['id', 'age', 'svy_row_index']

Remove unwanted columns:

# Start fresh for remove examples
s = Sample(df)

# Remove columns
s_dropped = s.wrangling.remove_columns(["temp"])
print(s_dropped.data.columns)

# Alias: drop()
s_dropped2 = s.wrangling.drop("age")
print(s_dropped2.data.columns)
['svy_row_index', 'id', 'name', 'age']
['svy_row_index', 'id', 'name']
Warning

Design-referenced columns (strata, PSU, weights, etc.) are protected by default. Use force=True to remove them, which automatically cleans the design metadata. Internal svy columns (like svy_row_index) are always preserved and cannot be removed.

# This would raise an error if 'weight' is a design column
sample.wrangling.drop("weight")  # Error!

# Use force=True to override
sample.wrangling.drop("weight", force=True)  # Works, design cleaned

Add Row Index

Add a sequential row identifier:

df = pl.DataFrame({"name": ["Alice", "Bob", "Carol"]})
s = Sample(df).wrangling.with_row_index("row_id", offset=1)
print(s.data)
shape: (3, 3)
┌────────┬───────────────┬───────┐
│ row_id ┆ svy_row_index ┆ name  │
│ ---    ┆ ---           ┆ ---   │
│ u32    ┆ u32           ┆ str   │
╞════════╪═══════════════╪═══════╡
│ 1      ┆ 0             ┆ Alice │
│ 2      ┆ 1             ┆ Bob   │
│ 3      ┆ 2             ┆ Carol │
└────────┴───────────────┴───────┘

Value Transformations

Recode Categories

Map old values to new labels. Values not in the mapping pass through unchanged.

df = pl.DataFrame({"item": ["apple", "soap", "carrot", "tv", "banana"]})
s = Sample(df)

# Map multiple values to a single category
s1 = s.wrangling.recode("item", {"Food": ["apple", "carrot", "banana"]})
print(s1.show_data(columns=["item", "svy_item_recoded"]))
shape: (5, 2)
┌────────┬──────────────────┐
│ item   ┆ svy_item_recoded │
│ ---    ┆ ---              │
│ str    ┆ str              │
╞════════╪══════════════════╡
│ apple  ┆ Food             │
│ soap   ┆ soap             │
│ carrot ┆ Food             │
│ tv     ┆ tv               │
│ banana ┆ Food             │
└────────┴──────────────────┘

Custom output column name:

s2 = s.wrangling.recode(
    cols="item",
    recodes={"Food": ["apple", "carrot", "banana"], "Non-Food": ["soap", "tv"]},
    into="item_category"
)
print(s2.show_data(columns=["item", "item_category"]))
shape: (5, 2)
┌────────┬───────────────┐
│ item   ┆ item_category │
│ ---    ┆ ---           │
│ str    ┆ str           │
╞════════╪═══════════════╡
│ apple  ┆ Food          │
│ soap   ┆ Non-Food      │
│ carrot ┆ Food          │
│ tv     ┆ Non-Food      │
│ banana ┆ Food          │
└────────┴───────────────┘

Recode multiple columns at once:

df = pl.DataFrame({"q1": [1, 2, 3, 4, 5], "q2": [1, 1, 2, 2, 3]})
s = Sample(df).wrangling.recode(
    cols=["q1", "q2"],
    recodes={"Low": [1, 2], "Medium": [3], "High": [4, 5]}
)
print(s.show_data(columns=["q1", "svy_q1_recoded", "q2", "svy_q2_recoded"]))
shape: (5, 4)
┌─────┬────────────────┬─────┬────────────────┐
│ q1  ┆ svy_q1_recoded ┆ q2  ┆ svy_q2_recoded │
│ --- ┆ ---            ┆ --- ┆ ---            │
│ i64 ┆ str            ┆ i64 ┆ str            │
╞═════╪════════════════╪═════╪════════════════╡
│ 1   ┆ Low            ┆ 1   ┆ Low            │
│ 2   ┆ Low            ┆ 1   ┆ Low            │
│ 3   ┆ Medium         ┆ 2   ┆ Low            │
│ 4   ┆ High           ┆ 2   ┆ Low            │
│ 5   ┆ High           ┆ 3   ┆ Medium         │
└─────┴────────────────┴─────┴────────────────┘

Categorize Continuous Variables

Bin numeric values into labeled categories. Out-of-range values become null.

df = pl.DataFrame({"age": [5, 17, 25, 45, 70, 85]})
s = Sample(df)

# Create age groups
s_binned = s.wrangling.categorize(
    col="age",
    bins=[0, 18, 35, 65, 100],
    labels=["0-17", "18-34", "35-64", "65+"],
    into="age_group"
)
print(s_binned.show_data(columns=["age", "age_group"]))
shape: (5, 2)
┌─────┬───────────┐
│ age ┆ age_group │
│ --- ┆ ---       │
│ i64 ┆ str       │
╞═════╪═══════════╡
│ 5   ┆ 0-17      │
│ 17  ┆ 0-17      │
│ 25  ┆ 18-34     │
│ 45  ┆ 35-64     │
│ 70  ┆ 65+       │
└─────┴───────────┘

Control bin closure (right-closed vs. left-closed):

df = pl.DataFrame({"x": [0, 10, 20, 30]})

# Right-closed (default): (0,10], (10,20], (20,30]
s_right = Sample(df).wrangling.categorize(col="x", bins=[0, 10, 20, 30], right=True)

# Left-closed: [0,10), [10,20), [20,30)
s_left = Sample(df).wrangling.categorize(col="x", bins=[0, 10, 20, 30], right=False)

print("Right-closed:", s_right.data["svy_x_categorized"].to_list())
print("Left-closed:", s_left.data["svy_x_categorized"].to_list())
Right-closed: [None, '(0, 10]', '(10, 20]', '(20, 30]']
Left-closed: ['[0, 10)', '[10, 20)', '[20, 30)', None]

Cap Extremes (Top/Bottom Coding)

Clamp outliers to specified bounds—common in income and expenditure surveys.

df = pl.DataFrame({"income": [500, 5000, 50000, 500000, 5000000]})
s = Sample(df)

# Top code: cap at 100,000
s_top = s.wrangling.top_code({"income": 100000})
print(s_top.show_data(columns=["income", "svy_income_top_coded"]))
shape: (5, 2)
┌─────────┬──────────────────────┐
│ income  ┆ svy_income_top_coded │
│ ---     ┆ ---                  │
│ i64     ┆ i64                  │
╞═════════╪══════════════════════╡
│ 500     ┆ 500                  │
│ 5000    ┆ 5000                 │
│ 50000   ┆ 50000                │
│ 500000  ┆ 100000               │
│ 5000000 ┆ 100000               │
└─────────┴──────────────────────┘
# Bottom code: floor at 1,000
s_bottom = s.wrangling.bottom_code({"income": 1000})
print(s_bottom.show_data(columns=["income", "svy_income_bottom_coded"]))
shape: (5, 2)
┌─────────┬─────────────────────────┐
│ income  ┆ svy_income_bottom_coded │
│ ---     ┆ ---                     │
│ i64     ┆ i64                     │
╞═════════╪═════════════════════════╡
│ 500     ┆ 1000                    │
│ 5000    ┆ 5000                    │
│ 50000   ┆ 50000                   │
│ 500000  ┆ 500000                  │
│ 5000000 ┆ 5000000                 │
└─────────┴─────────────────────────┘
# Both: clamp to [1000, 100000]
s_clamped = s.wrangling.bottom_and_top_code({"income": (1000, 100000)})
print(s_clamped.show_data(columns=["income", "svy_income_bottom_and_top_coded"]))
shape: (5, 2)
┌─────────┬─────────────────────────────────┐
│ income  ┆ svy_income_bottom_and_top_code… │
│ ---     ┆ ---                             │
│ i64     ┆ i64                             │
╞═════════╪═════════════════════════════════╡
│ 500     ┆ 1000                            │
│ 5000    ┆ 5000                            │
│ 50000   ┆ 50000                           │
│ 500000  ┆ 100000                          │
│ 5000000 ┆ 100000                          │
└─────────┴─────────────────────────────────┘

Replace the original column:

s_replaced = s.wrangling.bottom_and_top_code({"income": (1000, 100000)}, replace=True)
print(s_replaced.data["income"].to_list())
[1000, 5000, 50000, 100000, 100000]

Cast Data Types

Convert columns to different data types:

df = pl.DataFrame({"id": ["1", "2", "3"], "value": [1, 2, 3]})
s = Sample(df)

# Single column
s1 = s.wrangling.cast("id", pl.Int64)
print(s1.data.schema)

# Multiple columns with different types
s2 = s.wrangling.cast({"id": pl.Int64, "value": pl.Float64})
print(s2.data.schema)
Schema({'svy_row_index': UInt32, 'id': Int64, 'value': Int64})
Schema({'svy_row_index': UInt32, 'id': Int64, 'value': Float64})

Handling Missing Values

Fill Nulls

Fill missing values with a constant or strategy:

df = pl.DataFrame({
    "income": [1000, None, 3000, None, 5000],
    "age": [25, 30, None, 40, None]
})
s = Sample(df)

# Fill with constant
s1 = s.wrangling.fill_null("income", value=0)
print(s1.data["income"].to_list())

# Fill with median
s2 = s.wrangling.fill_null("income", strategy="mean")
print(s2.data["income"].to_list())
[1000, 0, 3000, 0, 5000]
[1000, 0, 3000, 0, 5000]

Available strategies:

Strategy Description
"forward" Fill with previous non-null value
"backward" Fill with next non-null value
"mean" Fill with column mean
"min" Fill with column minimum
"max" Fill with column maximum
"zero" Fill with 0
"one" Fill with 1

Row Operations

Filter Records

Filter rows based on conditions:

df = pl.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "age": [25, 45, 30, 55, 35],
    "income": [30000, 50000, 40000, 60000, 45000]
})
s = Sample(df)

# Filter with dict (equality/membership)
s_north = s.wrangling.filter_records({"region": "North"})
print(s_north.data)
shape: (2, 4)
┌───────────────┬────────┬─────┬────────┐
│ svy_row_index ┆ region ┆ age ┆ income │
│ ---           ┆ ---    ┆ --- ┆ ---    │
│ u32           ┆ str    ┆ i64 ┆ i64    │
╞═══════════════╪════════╪═════╪════════╡
│ 0             ┆ North  ┆ 25  ┆ 30000  │
│ 2             ┆ North  ┆ 30  ┆ 40000  │
└───────────────┴────────┴─────┴────────┘

Filter with multiple values:

s_north_south = s.wrangling.filter_records({"region": ["North", "South"]})
print(s_north_south.data)
shape: (2, 4)
┌───────────────┬────────┬─────┬────────┐
│ svy_row_index ┆ region ┆ age ┆ income │
│ ---           ┆ ---    ┆ --- ┆ ---    │
│ u32           ┆ str    ┆ i64 ┆ i64    │
╞═══════════════╪════════╪═════╪════════╡
│ 0             ┆ North  ┆ 25  ┆ 30000  │
│ 2             ┆ North  ┆ 30  ┆ 40000  │
└───────────────┴────────┴─────┴────────┘

Filter with expressions:

# Complex conditions
s_filtered = s.wrangling.filter_records([
    svy.col("age") >= 30,
    svy.col("income") > 40000
])
print(s_filtered.data)
shape: (0, 4)
┌───────────────┬────────┬─────┬────────┐
│ svy_row_index ┆ region ┆ age ┆ income │
│ ---           ┆ ---    ┆ --- ┆ ---    │
│ u32           ┆ str    ┆ i64 ┆ i64    │
╞═══════════════╪════════╪═════╪════════╡
└───────────────┴────────┴─────┴────────┘

Negate the filter (exclude matching rows):

s_not_north = s.wrangling.filter_records({"region": "North"}, negate=True)
print(s_not_north.data)
shape: (0, 4)
┌───────────────┬────────┬─────┬────────┐
│ svy_row_index ┆ region ┆ age ┆ income │
│ ---           ┆ ---    ┆ --- ┆ ---    │
│ u32           ┆ str    ┆ i64 ┆ i64    │
╞═══════════════╪════════╪═════╪════════╡
└───────────────┴────────┴─────┴────────┘

Sort Rows

Sort by one or more columns:

df = pl.DataFrame({
    "name": ["Carol", "Alice", "Bob", "Diana"],
    "age": [30, 25, 35, 25]
})
s = Sample(df)

# Sort ascending
s_sorted = s.wrangling.sort("age")
print(s_sorted.data)

# Sort descending
s_sorted_desc = s.wrangling.sort("age", descending=True)
print(s_sorted_desc.data)

# Sort by multiple columns
s_multi = s.wrangling.sort(["age", "name"], descending=[False, False])
print(s_multi.data)
shape: (4, 3)
┌───────────────┬───────┬─────┐
│ svy_row_index ┆ name  ┆ age │
│ ---           ┆ ---   ┆ --- │
│ u32           ┆ str   ┆ i64 │
╞═══════════════╪═══════╪═════╡
│ 1             ┆ Alice ┆ 25  │
│ 3             ┆ Diana ┆ 25  │
│ 0             ┆ Carol ┆ 30  │
│ 2             ┆ Bob   ┆ 35  │
└───────────────┴───────┴─────┘
shape: (4, 3)
┌───────────────┬───────┬─────┐
│ svy_row_index ┆ name  ┆ age │
│ ---           ┆ ---   ┆ --- │
│ u32           ┆ str   ┆ i64 │
╞═══════════════╪═══════╪═════╡
│ 2             ┆ Bob   ┆ 35  │
│ 0             ┆ Carol ┆ 30  │
│ 3             ┆ Diana ┆ 25  │
│ 1             ┆ Alice ┆ 25  │
└───────────────┴───────┴─────┘
shape: (4, 3)
┌───────────────┬───────┬─────┐
│ svy_row_index ┆ name  ┆ age │
│ ---           ┆ ---   ┆ --- │
│ u32           ┆ str   ┆ i64 │
╞═══════════════╪═══════╪═════╡
│ 1             ┆ Alice ┆ 25  │
│ 3             ┆ Diana ┆ 25  │
│ 0             ┆ Carol ┆ 30  │
│ 2             ┆ Bob   ┆ 35  │
└───────────────┴───────┴─────┘

Remove Duplicates

Keep unique rows:

df = pl.DataFrame({
    "household_id": [1, 1, 2, 2, 3],
    "member": ["A", "B", "A", "B", "A"],
    "value": [100, 200, 150, 250, 300]
})
s = Sample(df)

# Unique by specific column(s)
s_unique = s.wrangling.distinct("household_id")
print(s_unique.data)

# Keep last occurrence instead of first
s_last = s.wrangling.distinct("household_id", keep="last")
print(s_last.data)
shape: (3, 4)
┌───────────────┬──────────────┬────────┬───────┐
│ svy_row_index ┆ household_id ┆ member ┆ value │
│ ---           ┆ ---          ┆ ---    ┆ ---   │
│ u32           ┆ i64          ┆ str    ┆ i64   │
╞═══════════════╪══════════════╪════════╪═══════╡
│ 0             ┆ 1            ┆ A      ┆ 100   │
│ 2             ┆ 2            ┆ A      ┆ 150   │
│ 4             ┆ 3            ┆ A      ┆ 300   │
└───────────────┴──────────────┴────────┴───────┘
shape: (3, 4)
┌───────────────┬──────────────┬────────┬───────┐
│ svy_row_index ┆ household_id ┆ member ┆ value │
│ ---           ┆ ---          ┆ ---    ┆ ---   │
│ u32           ┆ i64          ┆ str    ┆ i64   │
╞═══════════════╪══════════════╪════════╪═══════╡
│ 0             ┆ 1            ┆ A      ┆ 100   │
│ 2             ┆ 2            ┆ A      ┆ 150   │
│ 4             ┆ 3            ┆ A      ┆ 300   │
└───────────────┴──────────────┴────────┴───────┘

Creating Variables with Mutate

The mutate() method is the most flexible way to create or transform columns.

Scalar Values

Create constant columns with scalar values:

df = pl.DataFrame({"x": [1, 2, 3]})
s = Sample(df)

s_mutated = s.wrangling.mutate({
    "_one": 1,              # integer
    "pi": 3.14159,          # float
    "group": "A",           # string
    "flag": True,           # boolean
    "empty": None,          # null
})
print(s_mutated.data)
shape: (3, 7)
┌───────────────┬─────┬──────┬───────┬───────┬──────┬─────────┐
│ svy_row_index ┆ x   ┆ flag ┆ empty ┆ group ┆ _one ┆ pi      │
│ ---           ┆ --- ┆ ---  ┆ ---   ┆ ---   ┆ ---  ┆ ---     │
│ u32           ┆ i64 ┆ bool ┆ null  ┆ str   ┆ i32  ┆ f64     │
╞═══════════════╪═════╪══════╪═══════╪═══════╪══════╪═════════╡
│ 0             ┆ 1   ┆ true ┆ null  ┆ A     ┆ 1    ┆ 3.14159 │
│ 1             ┆ 2   ┆ true ┆ null  ┆ A     ┆ 1    ┆ 3.14159 │
│ 2             ┆ 3   ┆ true ┆ null  ┆ A     ┆ 1    ┆ 3.14159 │
└───────────────┴─────┴──────┴───────┴───────┴──────┴─────────┘

Expressions

Transform columns using expressions:

df = pl.DataFrame({"income": [30000, 50000, 75000], "household_size": [2, 4, 3]})
s = Sample(df)

s_transformed = s.wrangling.mutate({
    "income_k": svy.col("income") / 1000,
    "per_capita": svy.col("income") / svy.col("household_size"),
    "log_income": svy.col("income").log(),
})
print(s_transformed.data)
shape: (3, 6)
┌───────────────┬────────┬────────────────┬────────────┬──────────┬────────────┐
│ svy_row_index ┆ income ┆ household_size ┆ per_capita ┆ income_k ┆ log_income │
│ ---           ┆ ---    ┆ ---            ┆ ---        ┆ ---      ┆ ---        │
│ u32           ┆ i64    ┆ i64            ┆ f64        ┆ f64      ┆ f64        │
╞═══════════════╪════════╪════════════════╪════════════╪══════════╪════════════╡
│ 0             ┆ 30000  ┆ 2              ┆ 15000.0    ┆ 30.0     ┆ 10.308953  │
│ 1             ┆ 50000  ┆ 4              ┆ 12500.0    ┆ 50.0     ┆ 10.819778  │
│ 2             ┆ 75000  ┆ 3              ┆ 25000.0    ┆ 75.0     ┆ 11.225243  │
└───────────────┴────────┴────────────────┴────────────┴──────────┴────────────┘

Conditional Logic

Use when().then().otherwise() for if-else logic:

df = pl.DataFrame({"age": [15, 25, 45, 70], "income": [0, 30000, 60000, 25000]})
s = Sample(df)

s_conditional = s.wrangling.mutate({
    "age_group": (
        svy.when(svy.col("age") < 18).then("child")
        .when(svy.col("age") < 65).then("adult")
        .otherwise("senior")
    ),
    "income_bracket": (
        svy.when(svy.col("income") < 20000).then("low")
        .when(svy.col("income") < 50000).then("middle")
        .otherwise("high")
    ),
})
print(s_conditional.data)
shape: (4, 5)
┌───────────────┬─────┬────────┬───────────┬────────────────┐
│ svy_row_index ┆ age ┆ income ┆ age_group ┆ income_bracket │
│ ---           ┆ --- ┆ ---    ┆ ---       ┆ ---            │
│ u32           ┆ i64 ┆ i64    ┆ str       ┆ str            │
╞═══════════════╪═════╪════════╪═══════════╪════════════════╡
│ 0             ┆ 15  ┆ 0      ┆ child     ┆ low            │
│ 1             ┆ 25  ┆ 30000  ┆ adult     ┆ middle         │
│ 2             ┆ 45  ┆ 60000  ┆ adult     ┆ high           │
│ 3             ┆ 70  ┆ 25000  ┆ senior    ┆ middle         │
└───────────────┴─────┴────────┴───────────┴────────────────┘

Dependencies Within Same Call

Columns created in the same mutate() call can reference each other:

df = pl.DataFrame({"x": [1, 2, 3, 4, 5]})
s = Sample(df)

s_chained = s.wrangling.mutate({
    "x_squared": svy.col("x") ** 2,
    "x_cubed": svy.col("x") ** 3,
    "sum_powers": svy.col("x_squared") + svy.col("x_cubed"),  # References columns created above
})
print(s_chained.data)
shape: (5, 5)
┌───────────────┬─────┬───────────┬─────────┬────────────┐
│ svy_row_index ┆ x   ┆ x_squared ┆ x_cubed ┆ sum_powers │
│ ---           ┆ --- ┆ ---       ┆ ---     ┆ ---        │
│ u32           ┆ i64 ┆ i64       ┆ i64     ┆ i64        │
╞═══════════════╪═════╪═══════════╪═════════╪════════════╡
│ 0             ┆ 1   ┆ 1         ┆ 1       ┆ 2          │
│ 1             ┆ 2   ┆ 4         ┆ 8       ┆ 12         │
│ 2             ┆ 3   ┆ 9         ┆ 27      ┆ 36         │
│ 3             ┆ 4   ┆ 16        ┆ 64      ┆ 80         │
│ 4             ┆ 5   ┆ 25        ┆ 125     ┆ 150        │
└───────────────┴─────┴───────────┴─────────┴────────────┘

Arrays and Series

Pass arrays directly (must match row count):

df = pl.DataFrame({"id": [1, 2, 3]})
s = Sample(df)

s_arrays = s.wrangling.mutate({
    "from_list": [10, 20, 30],
    "from_numpy": np.array([100, 200, 300]),
    "from_series": pl.Series([1.1, 2.2, 3.3]),
})
print(s_arrays.data)
shape: (3, 5)
┌───────────────┬─────┬───────────┬─────────────┬────────────┐
│ svy_row_index ┆ id  ┆ from_list ┆ from_series ┆ from_numpy │
│ ---           ┆ --- ┆ ---       ┆ ---         ┆ ---        │
│ u32           ┆ i64 ┆ i64       ┆ f64         ┆ i64        │
╞═══════════════╪═════╪═══════════╪═════════════╪════════════╡
│ 0             ┆ 1   ┆ 10        ┆ 1.1         ┆ 100        │
│ 1             ┆ 2   ┆ 20        ┆ 2.2         ┆ 200        │
│ 2             ┆ 3   ┆ 30        ┆ 3.3         ┆ 300        │
└───────────────┴─────┴───────────┴─────────────┴────────────┘

Expression Methods Reference

The svy.col() expression supports a rich set of methods for transformations.

Arithmetic

df = pl.DataFrame({"x": [10, 20, 30], "y": [3, 4, 5]})
s = Sample(df)

s_arith = s.wrangling.mutate({
    "add": svy.col("x") + svy.col("y"),
    "sub": svy.col("x") - svy.col("y"),
    "mul": svy.col("x") * svy.col("y"),
    "div": svy.col("x") / svy.col("y"),
    "floordiv": svy.col("x") // svy.col("y"),
    "mod": svy.col("x") % svy.col("y"),
    "pow": svy.col("y") ** 2,
})
print(s_arith.data)
shape: (3, 10)
┌───────────────┬─────┬─────┬──────────┬───┬─────┬─────┬─────┬─────┐
│ svy_row_index ┆ x   ┆ y   ┆ floordiv ┆ … ┆ mul ┆ mod ┆ pow ┆ sub │
│ ---           ┆ --- ┆ --- ┆ ---      ┆   ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32           ┆ i64 ┆ i64 ┆ i64      ┆   ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════════╪═════╪═════╪══════════╪═══╪═════╪═════╪═════╪═════╡
│ 0             ┆ 10  ┆ 3   ┆ 3        ┆ … ┆ 30  ┆ 1   ┆ 9   ┆ 7   │
│ 1             ┆ 20  ┆ 4   ┆ 5        ┆ … ┆ 80  ┆ 0   ┆ 16  ┆ 16  │
│ 2             ┆ 30  ┆ 5   ┆ 6        ┆ … ┆ 150 ┆ 0   ┆ 25  ┆ 25  │
└───────────────┴─────┴─────┴──────────┴───┴─────┴─────┴─────┴─────┘

Math Functions

df = pl.DataFrame({"x": [1.0, 4.0, 9.0, 16.0]})
s = Sample(df)

s_math = s.wrangling.mutate({
    "sqrt": svy.col("x").sqrt(),
    "log": svy.col("x").log(),
    "log10": svy.col("x").log10(),
    "exp": svy.col("x").exp(),
    "abs": (-svy.col("x")).abs(),
    "round": (svy.col("x") / 3).round(2),
    "floor": (svy.col("x") / 3).floor(),
    "ceil": (svy.col("x") / 3).ceil(),
})
print(s_math.data)
shape: (4, 10)
┌───────────────┬──────┬───────┬──────────┬───┬──────┬──────┬──────────┬──────┐
│ svy_row_index ┆ x    ┆ round ┆ log      ┆ … ┆ ceil ┆ sqrt ┆ log10    ┆ abs  │
│ ---           ┆ ---  ┆ ---   ┆ ---      ┆   ┆ ---  ┆ ---  ┆ ---      ┆ ---  │
│ u32           ┆ f64  ┆ f64   ┆ f64      ┆   ┆ f64  ┆ f64  ┆ f64      ┆ f64  │
╞═══════════════╪══════╪═══════╪══════════╪═══╪══════╪══════╪══════════╪══════╡
│ 0             ┆ 1.0  ┆ 0.33  ┆ 0.0      ┆ … ┆ 1.0  ┆ 1.0  ┆ 0.0      ┆ 1.0  │
│ 1             ┆ 4.0  ┆ 1.33  ┆ 1.386294 ┆ … ┆ 2.0  ┆ 2.0  ┆ 0.60206  ┆ 4.0  │
│ 2             ┆ 9.0  ┆ 3.0   ┆ 2.197225 ┆ … ┆ 3.0  ┆ 3.0  ┆ 0.954243 ┆ 9.0  │
│ 3             ┆ 16.0 ┆ 5.33  ┆ 2.772589 ┆ … ┆ 6.0  ┆ 4.0  ┆ 1.20412  ┆ 16.0 │
└───────────────┴──────┴───────┴──────────┴───┴──────┴──────┴──────────┴──────┘

Clipping Values

df = pl.DataFrame({"x": [1, 5, 10, 15, 20]})
s = Sample(df)

s_clipped = s.wrangling.mutate({
    "clipped": svy.col("x").clip(5, 15),
})
print(s_clipped.data)
shape: (5, 3)
┌───────────────┬─────┬─────────┐
│ svy_row_index ┆ x   ┆ clipped │
│ ---           ┆ --- ┆ ---     │
│ u32           ┆ i64 ┆ i64     │
╞═══════════════╪═════╪═════════╡
│ 0             ┆ 1   ┆ 5       │
│ 1             ┆ 5   ┆ 5       │
│ 2             ┆ 10  ┆ 10      │
│ 3             ┆ 15  ┆ 15      │
│ 4             ┆ 20  ┆ 15      │
└───────────────┴─────┴─────────┘

Comparisons and Boolean Logic

df = pl.DataFrame({"x": [1, 2, 3, 4, 5], "y": [5, 4, 3, 2, 1]})
s = Sample(df)

s_bool = s.wrangling.mutate({
    "x_gt_2": svy.col("x") > 2,
    "x_eq_y": svy.col("x") == svy.col("y"),
    "both": (svy.col("x") > 2) & (svy.col("y") > 2),
    "either": (svy.col("x") > 4) | (svy.col("y") > 4),
    "not_gt_2": ~(svy.col("x") > 2),
})
print(s_bool.data)
shape: (5, 8)
┌───────────────┬─────┬─────┬────────┬───────┬────────┬────────┬──────────┐
│ svy_row_index ┆ x   ┆ y   ┆ either ┆ both  ┆ x_gt_2 ┆ x_eq_y ┆ not_gt_2 │
│ ---           ┆ --- ┆ --- ┆ ---    ┆ ---   ┆ ---    ┆ ---    ┆ ---      │
│ u32           ┆ i64 ┆ i64 ┆ bool   ┆ bool  ┆ bool   ┆ bool   ┆ bool     │
╞═══════════════╪═════╪═════╪════════╪═══════╪════════╪════════╪══════════╡
│ 0             ┆ 1   ┆ 5   ┆ true   ┆ false ┆ false  ┆ false  ┆ true     │
│ 1             ┆ 2   ┆ 4   ┆ false  ┆ false ┆ false  ┆ false  ┆ true     │
│ 2             ┆ 3   ┆ 3   ┆ false  ┆ true  ┆ true   ┆ true   ┆ false    │
│ 3             ┆ 4   ┆ 2   ┆ false  ┆ false ┆ true   ┆ false  ┆ false    │
│ 4             ┆ 5   ┆ 1   ┆ true   ┆ false ┆ true   ┆ false  ┆ false    │
└───────────────┴─────┴─────┴────────┴───────┴────────┴────────┴──────────┘

Membership Testing

df = pl.DataFrame({"category": ["A", "B", "C", "D", "E"], "value": [1, 5, 10, 15, 20]})
s = Sample(df)

s_membership = s.wrangling.mutate({
    "in_list": svy.col("category").is_in(["A", "C", "E"]),
    "not_in_list": svy.col("category").is_not_in(["A", "C", "E"]),
    "in_range": svy.col("value").between(5, 15),
})
print(s_membership.data)
shape: (5, 6)
┌───────────────┬──────────┬───────┬─────────────┬──────────┬─────────┐
│ svy_row_index ┆ category ┆ value ┆ not_in_list ┆ in_range ┆ in_list │
│ ---           ┆ ---      ┆ ---   ┆ ---         ┆ ---      ┆ ---     │
│ u32           ┆ str      ┆ i64   ┆ bool        ┆ bool     ┆ bool    │
╞═══════════════╪══════════╪═══════╪═════════════╪══════════╪═════════╡
│ 0             ┆ A        ┆ 1     ┆ false       ┆ false    ┆ true    │
│ 1             ┆ B        ┆ 5     ┆ true        ┆ true     ┆ false   │
│ 2             ┆ C        ┆ 10    ┆ false       ┆ true     ┆ true    │
│ 3             ┆ D        ┆ 15    ┆ true        ┆ true     ┆ false   │
│ 4             ┆ E        ┆ 20    ┆ false       ┆ false    ┆ true    │
└───────────────┴──────────┴───────┴─────────────┴──────────┴─────────┘

Null Handling

df = pl.DataFrame({"x": [1, None, 3, None, 5]})
s = Sample(df)

s_nulls = s.wrangling.mutate({
    "is_null": svy.col("x").is_null(),
    "is_not_null": svy.col("x").is_not_null(),
    "filled_zero": svy.col("x").fill_null(0),
    "filled_expr": svy.col("x").fill_null(svy.col("x").mean()),
})
print(s_nulls.data)
shape: (5, 6)
┌───────────────┬──────┬─────────────┬─────────────┬─────────────┬─────────┐
│ svy_row_index ┆ x    ┆ is_not_null ┆ filled_expr ┆ filled_zero ┆ is_null │
│ ---           ┆ ---  ┆ ---         ┆ ---         ┆ ---         ┆ ---     │
│ u32           ┆ i64  ┆ bool        ┆ f64         ┆ i64         ┆ bool    │
╞═══════════════╪══════╪═════════════╪═════════════╪═════════════╪═════════╡
│ 0             ┆ 1    ┆ true        ┆ 1.0         ┆ 1           ┆ false   │
│ 1             ┆ null ┆ false       ┆ 3.0         ┆ 0           ┆ true    │
│ 2             ┆ 3    ┆ true        ┆ 3.0         ┆ 3           ┆ false   │
│ 3             ┆ null ┆ false       ┆ 3.0         ┆ 0           ┆ true    │
│ 4             ┆ 5    ┆ true        ┆ 5.0         ┆ 5           ┆ false   │
└───────────────┴──────┴─────────────┴─────────────┴─────────────┴─────────┘

String Operations

df = pl.DataFrame({"name": ["  Alice  ", "BOB", "carol", "Diana Smith"]})
s = Sample(df)

s_strings = s.wrangling.mutate({
    "stripped": svy.col("name").strip(),
    "lower": svy.col("name").lower(),
    "upper": svy.col("name").upper(),
    "title": svy.col("name").title(),
    "length": svy.col("name").str_len(),
    "contains_a": svy.col("name").lower().contains("a"),
    "starts_d": svy.col("name").startswith("D"),
})
print(s_strings.data)
shape: (4, 9)
┌────────────┬────────────┬──────────┬────────────┬───┬───────────┬───────────┬───────────┬────────┐
│ svy_row_in ┆ name       ┆ starts_d ┆ contains_a ┆ … ┆ lower     ┆ stripped  ┆ title     ┆ length │
│ dex        ┆ ---        ┆ ---      ┆ ---        ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---    │
│ ---        ┆ str        ┆ bool     ┆ bool       ┆   ┆ str       ┆ str       ┆ str       ┆ u32    │
│ u32        ┆            ┆          ┆            ┆   ┆           ┆           ┆           ┆        │
╞════════════╪════════════╪══════════╪════════════╪═══╪═══════════╪═══════════╪═══════════╪════════╡
│ 0          ┆   Alice    ┆ false    ┆ true       ┆ … ┆   alice   ┆ Alice     ┆   Alice   ┆ 9      │
│ 1          ┆ BOB        ┆ false    ┆ false      ┆ … ┆ bob       ┆ BOB       ┆ Bob       ┆ 3      │
│ 2          ┆ carol      ┆ false    ┆ true       ┆ … ┆ carol     ┆ carol     ┆ Carol     ┆ 5      │
│ 3          ┆ Diana      ┆ true     ┆ true       ┆ … ┆ diana     ┆ Diana     ┆ Diana     ┆ 11     │
│            ┆ Smith      ┆          ┆            ┆   ┆ smith     ┆ Smith     ┆ Smith     ┆        │
└────────────┴────────────┴──────────┴────────────┴───┴───────────┴───────────┴───────────┴────────┘

String replacement:

df = pl.DataFrame({"text": ["hello world", "world world", "goodbye"]})
s = Sample(df)

s_replace = s.wrangling.mutate({
    "replace_first": svy.col("text").replace("world", "python"),
    "replace_all": svy.col("text").replace_all("world", "python"),
})
print(s_replace.data)
shape: (3, 4)
┌───────────────┬─────────────┬───────────────┬───────────────┐
│ svy_row_index ┆ text        ┆ replace_first ┆ replace_all   │
│ ---           ┆ ---         ┆ ---           ┆ ---           │
│ u32           ┆ str         ┆ str           ┆ str           │
╞═══════════════╪═════════════╪═══════════════╪═══════════════╡
│ 0             ┆ hello world ┆ hello python  ┆ hello python  │
│ 1             ┆ world world ┆ python world  ┆ python python │
│ 2             ┆ goodbye     ┆ goodbye       ┆ goodbye       │
└───────────────┴─────────────┴───────────────┴───────────────┘

Type Casting

df = pl.DataFrame({"x": [1, 2, 3], "y": ["10", "20", "30"]})
s = Sample(df)

s_cast = s.wrangling.mutate({
    "x_float": svy.col("x").to_float(),
    "x_str": svy.col("x").to_str(),
    "y_int": svy.col("y").cast(pl.Int64),
})
print(s_cast.data)
shape: (3, 6)
┌───────────────┬─────┬─────┬───────┬─────────┬───────┐
│ svy_row_index ┆ x   ┆ y   ┆ y_int ┆ x_float ┆ x_str │
│ ---           ┆ --- ┆ --- ┆ ---   ┆ ---     ┆ ---   │
│ u32           ┆ i64 ┆ str ┆ i64   ┆ f64     ┆ str   │
╞═══════════════╪═════╪═════╪═══════╪═════════╪═══════╡
│ 0             ┆ 1   ┆ 10  ┆ 10    ┆ 1.0     ┆ 1     │
│ 1             ┆ 2   ┆ 20  ┆ 20    ┆ 2.0     ┆ 2     │
│ 2             ┆ 3   ┆ 30  ┆ 30    ┆ 3.0     ┆ 3     │
└───────────────┴─────┴─────┴───────┴─────────┴───────┘

Aggregations in Window Context

Use .over() to compute aggregations within groups:

df = pl.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "sales": [100, 150, 200, 250, 300]
})
s = Sample(df)

s_window = s.wrangling.mutate({
    "region_total": svy.col("sales").sum().over("region"),
    "region_mean": svy.col("sales").mean().over("region"),
    "region_max": svy.col("sales").max().over("region"),
    "pct_of_region": svy.col("sales") / svy.col("sales").sum().over("region") * 100,
})
print(s_window.data)
shape: (5, 7)
┌───────────────┬────────┬───────┬─────────────┬────────────┬───────────────┬──────────────┐
│ svy_row_index ┆ region ┆ sales ┆ region_mean ┆ region_max ┆ pct_of_region ┆ region_total │
│ ---           ┆ ---    ┆ ---   ┆ ---         ┆ ---        ┆ ---           ┆ ---          │
│ u32           ┆ str    ┆ i64   ┆ f64         ┆ i64        ┆ f64           ┆ i64          │
╞═══════════════╪════════╪═══════╪═════════════╪════════════╪═══════════════╪══════════════╡
│ 0             ┆ North  ┆ 100   ┆ 125.0       ┆ 150        ┆ 40.0          ┆ 250          │
│ 1             ┆ North  ┆ 150   ┆ 125.0       ┆ 150        ┆ 60.0          ┆ 250          │
│ 2             ┆ South  ┆ 200   ┆ 250.0       ┆ 300        ┆ 26.666667     ┆ 750          │
│ 3             ┆ South  ┆ 250   ┆ 250.0       ┆ 300        ┆ 33.333333     ┆ 750          │
│ 4             ┆ South  ┆ 300   ┆ 250.0       ┆ 300        ┆ 40.0          ┆ 750          │
└───────────────┴────────┴───────┴─────────────┴────────────┴───────────────┴──────────────┘

Cumulative Functions

df = pl.DataFrame({"x": [1, 2, 3, 4, 5]})
s = Sample(df)

s_cumulative = s.wrangling.mutate({
    "cum_sum": svy.col("x").cum_sum(),
    "cum_max": svy.col("x").cum_max(),
    "cum_min": svy.col("x").cum_min(),
})
print(s_cumulative.data)
shape: (5, 5)
┌───────────────┬─────┬─────────┬─────────┬─────────┐
│ svy_row_index ┆ x   ┆ cum_min ┆ cum_sum ┆ cum_max │
│ ---           ┆ --- ┆ ---     ┆ ---     ┆ ---     │
│ u32           ┆ i64 ┆ i64     ┆ i64     ┆ i64     │
╞═══════════════╪═════╪═════════╪═════════╪═════════╡
│ 0             ┆ 1   ┆ 1       ┆ 1       ┆ 1       │
│ 1             ┆ 2   ┆ 1       ┆ 3       ┆ 2       │
│ 2             ┆ 3   ┆ 1       ┆ 6       ┆ 3       │
│ 3             ┆ 4   ┆ 1       ┆ 10      ┆ 4       │
│ 4             ┆ 5   ┆ 1       ┆ 15      ┆ 5       │
└───────────────┴─────┴─────────┴─────────┴─────────┘

Shift and Diff

df = pl.DataFrame({"value": [100, 110, 105, 120, 115]})
s = Sample(df)

s_shift = s.wrangling.mutate({
    "prev_value": svy.col("value").shift(1),
    "next_value": svy.col("value").shift(-1),
    "change": svy.col("value").diff(),
    "pct_change": svy.col("value").pct_change(),
})
print(s_shift.data)
shape: (5, 6)
┌───────────────┬───────┬────────────┬────────────┬────────┬────────────┐
│ svy_row_index ┆ value ┆ next_value ┆ pct_change ┆ change ┆ prev_value │
│ ---           ┆ ---   ┆ ---        ┆ ---        ┆ ---    ┆ ---        │
│ u32           ┆ i64   ┆ i64        ┆ f64        ┆ i64    ┆ i64        │
╞═══════════════╪═══════╪════════════╪════════════╪════════╪════════════╡
│ 0             ┆ 100   ┆ 110        ┆ null       ┆ null   ┆ null       │
│ 1             ┆ 110   ┆ 105        ┆ 0.1        ┆ 10     ┆ 100        │
│ 2             ┆ 105   ┆ 120        ┆ -0.045455  ┆ -5     ┆ 110        │
│ 3             ┆ 120   ┆ 115        ┆ 0.142857   ┆ 15     ┆ 105        │
│ 4             ┆ 115   ┆ null       ┆ -0.041667  ┆ -5     ┆ 120        │
└───────────────┴───────┴────────────┴────────────┴────────┴────────────┘

Ranking

df = pl.DataFrame({"score": [85, 92, 78, 92, 88]})
s = Sample(df)

s_rank = s.wrangling.mutate({
    "rank": svy.col("score").rank(),
    "rank_desc": svy.col("score").rank(descending=True),
})
print(s_rank.data)
shape: (5, 4)
┌───────────────┬───────┬──────┬───────────┐
│ svy_row_index ┆ score ┆ rank ┆ rank_desc │
│ ---           ┆ ---   ┆ ---  ┆ ---       │
│ u32           ┆ i64   ┆ u32  ┆ u32       │
╞═══════════════╪═══════╪══════╪═══════════╡
│ 0             ┆ 85    ┆ 2    ┆ 4         │
│ 1             ┆ 92    ┆ 4    ┆ 1         │
│ 2             ┆ 78    ┆ 1    ┆ 5         │
│ 3             ┆ 92    ┆ 5    ┆ 2         │
│ 4             ┆ 88    ┆ 3    ┆ 3         │
└───────────────┴───────┴──────┴───────────┘

Helper Functions

Coalesce

Return the first non-null value across multiple columns:

df = pl.DataFrame({
    "primary_phone": ["555-1234", None, None],
    "secondary_phone": [None, "555-5678", None],
    "emergency_phone": ["555-9999", "555-9999", "555-9999"]
})
s = Sample(df)

s_coalesce = s.wrangling.mutate({
    "contact_phone": svy.coalesce(
        svy.col("primary_phone"),
        svy.col("secondary_phone"),
        svy.col("emergency_phone")
    ),
})
print(s_coalesce.data)
shape: (3, 5)
┌───────────────┬───────────────┬─────────────────┬─────────────────┬───────────────┐
│ svy_row_index ┆ primary_phone ┆ secondary_phone ┆ emergency_phone ┆ contact_phone │
│ ---           ┆ ---           ┆ ---             ┆ ---             ┆ ---           │
│ u32           ┆ str           ┆ str             ┆ str             ┆ str           │
╞═══════════════╪═══════════════╪═════════════════╪═════════════════╪═══════════════╡
│ 0             ┆ 555-1234      ┆ null            ┆ 555-9999        ┆ 555-1234      │
│ 1             ┆ null          ┆ 555-5678        ┆ 555-9999        ┆ 555-5678      │
│ 2             ┆ null          ┆ null            ┆ 555-9999        ┆ 555-9999      │
└───────────────┴───────────────┴─────────────────┴─────────────────┴───────────────┘

Concatenate Strings

Join string columns together:

df = pl.DataFrame({
    "first_name": ["John", "Jane"],
    "last_name": ["Doe", "Smith"]
})
s = Sample(df)

s_concat = s.wrangling.mutate({
    "full_name": svy.concat_str(
        svy.col("first_name"),
        svy.lit(" "),
        svy.col("last_name")
    ),
})
print(s_concat.data)
shape: (2, 4)
┌───────────────┬────────────┬───────────┬────────────┐
│ svy_row_index ┆ first_name ┆ last_name ┆ full_name  │
│ ---           ┆ ---        ┆ ---       ┆ ---        │
│ u32           ┆ str        ┆ str       ┆ str        │
╞═══════════════╪════════════╪═══════════╪════════════╡
│ 0             ┆ John       ┆ Doe       ┆ John Doe   │
│ 1             ┆ Jane       ┆ Smith     ┆ Jane Smith │
└───────────────┴────────────┴───────────┴────────────┘

Horizontal Operations

Perform row-wise operations across multiple columns:

df = pl.DataFrame({
    "q1": [4, 3, 5, 2],
    "q2": [3, 4, 4, 3],
    "q3": [5, 3, 5, 4]
})
s = Sample(df)

s_horizontal = s.wrangling.mutate({
    "total_score": svy.sum_horizontal(svy.col("q1"), svy.col("q2"), svy.col("q3")),
    "min_score": svy.min_horizontal(svy.col("q1"), svy.col("q2"), svy.col("q3")),
    "max_score": svy.max_horizontal(svy.col("q1"), svy.col("q2"), svy.col("q3")),
})
print(s_horizontal.data)
shape: (4, 7)
┌───────────────┬─────┬─────┬─────┬───────────┬───────────┬─────────────┐
│ svy_row_index ┆ q1  ┆ q2  ┆ q3  ┆ min_score ┆ max_score ┆ total_score │
│ ---           ┆ --- ┆ --- ┆ --- ┆ ---       ┆ ---       ┆ ---         │
│ u32           ┆ i64 ┆ i64 ┆ i64 ┆ i64       ┆ i64       ┆ i64         │
╞═══════════════╪═════╪═════╪═════╪═══════════╪═══════════╪═════════════╡
│ 0             ┆ 4   ┆ 3   ┆ 5   ┆ 3         ┆ 5         ┆ 12          │
│ 1             ┆ 3   ┆ 4   ┆ 3   ┆ 3         ┆ 4         ┆ 10          │
│ 2             ┆ 5   ┆ 4   ┆ 5   ┆ 4         ┆ 5         ┆ 14          │
│ 3             ┆ 2   ┆ 3   ┆ 4   ┆ 2         ┆ 4         ┆ 9           │
└───────────────┴─────┴─────┴─────┴───────────┴───────────┴─────────────┘

Boolean horizontal operations:

df = pl.DataFrame({
    "passed_test1": [True, True, False, True],
    "passed_test2": [True, False, False, True],
    "passed_test3": [True, True, False, False]
})
s = Sample(df)

s_bool_horiz = s.wrangling.mutate({
    "passed_all": svy.all_horizontal(
        svy.col("passed_test1"),
        svy.col("passed_test2"),
        svy.col("passed_test3")
    ),
    "passed_any": svy.any_horizontal(
        svy.col("passed_test1"),
        svy.col("passed_test2"),
        svy.col("passed_test3")
    ),
})
print(s_bool_horiz.data)
shape: (4, 6)
┌───────────────┬──────────────┬──────────────┬──────────────┬────────────┬────────────┐
│ svy_row_index ┆ passed_test1 ┆ passed_test2 ┆ passed_test3 ┆ passed_any ┆ passed_all │
│ ---           ┆ ---          ┆ ---          ┆ ---          ┆ ---        ┆ ---        │
│ u32           ┆ bool         ┆ bool         ┆ bool         ┆ bool       ┆ bool       │
╞═══════════════╪══════════════╪══════════════╪══════════════╪════════════╪════════════╡
│ 0             ┆ true         ┆ true         ┆ true         ┆ true       ┆ true       │
│ 1             ┆ true         ┆ false        ┆ true         ┆ true       ┆ false      │
│ 2             ┆ false        ┆ false        ┆ false        ┆ false      ┆ false      │
│ 3             ┆ true         ┆ true         ┆ false        ┆ true       ┆ false      │
└───────────────┴──────────────┴──────────────┴──────────────┴────────────┴────────────┘

Method Chaining

All wrangling methods return the Sample, enabling fluent pipelines:

df = pl.DataFrame({
    "First Name": ["alice", "bob", "carol"],
    "Age_Years": [25, None, 35],
    "Income ($)": [30000, 50000, 1000000]
})

result = (
    Sample(df)
    .wrangling.clean_names(case_style=CaseStyle.SNAKE, letter_case=LetterCase.LOWER)
    .wrangling.fill_null("age_years", strategy="mean")
    .wrangling.top_code({"income": 200000}, replace=True)
    .wrangling.mutate({
        "income_k": svy.col("income") / 1000,
        "is_adult": svy.col("age_years") >= 18,
    })
)

print(result.data)
shape: (3, 6)
┌───────────────┬────────────┬───────────┬────────┬──────────┬──────────┐
│ svy_row_index ┆ first_name ┆ age_years ┆ income ┆ is_adult ┆ income_k │
│ ---           ┆ ---        ┆ ---       ┆ ---    ┆ ---      ┆ ---      │
│ u32           ┆ str        ┆ i64       ┆ i64    ┆ bool     ┆ f64      │
╞═══════════════╪════════════╪═══════════╪════════╪══════════╪══════════╡
│ 0             ┆ alice      ┆ 25        ┆ 30000  ┆ true     ┆ 30.0     │
│ 1             ┆ bob        ┆ 30        ┆ 50000  ┆ true     ┆ 50.0     │
│ 2             ┆ carol      ┆ 35        ┆ 200000 ┆ true     ┆ 200.0    │
└───────────────┴────────────┴───────────┴────────┴──────────┴──────────┘

Working with Survey Design

When your Sample has a design, wrangling operations maintain design integrity:

df = pl.DataFrame({
    "id": [1, 2, 3, 4],
    "stratum_var": ["A", "A", "B", "B"],
    "psu_var": [1, 1, 2, 2],
    "wgt_var": [1.0, 1.2, 0.9, 1.1],
    "income": [30000, 50000, 40000, 60000]
})

design = svy.Design(stratum="stratum_var", psu="psu_var", wgt="wgt_var")
sample = svy.Sample(df, design=design)

# Rename a design column - design is automatically updated
sample_renamed = sample.wrangling.rename_columns({"stratum_var": "region"})
print(f"Original stratum: {sample.design.stratum}")
print(f"After rename: {sample_renamed.design.stratum}")
Original stratum: region
After rename: region

Variable Labels

Apply descriptive labels to variables (useful for documentation and export):

df = pl.DataFrame({"q1": [1, 2, 1, 2], "q2": [1, 1, 2, 2]})
s = Sample(df)

s_labeled = s.wrangling.apply_labels(
    cols=["q1", "q2"],
    labels=["Satisfaction with service", "Would recommend"],
    categories={1: "Yes", 2: "No"}
)

# Access labels
rprint(s_labeled.labels)
{'q1': {1: 'Yes', 2: 'No'}, 'q2': {1: 'Yes', 2: 'No'}}

Quick Reference

Wrangling Methods

Task Method Key Parameters
Column Names
Standardize names clean_names() case_style, letter_case, remove
Rename columns rename_columns() renames dict
Keep columns keep_columns() / select() columns, force
Drop columns remove_columns() / drop() columns, force
Add row index with_row_index() name, offset
Values
Recode categories recode() cols, recodes, into, replace
Bin continuous categorize() col, bins, labels, right
Top code top_code() top_codes, replace, into
Bottom code bottom_code() bottom_codes, replace, into
Clamp both bottom_and_top_code() bottom_and_top_codes
Change dtype cast() cols, dtype, strict
Fill nulls fill_null() cols, value, strategy
Rows
Filter rows filter_records() where, negate
Sort rows sort() by, descending, nulls_last
Deduplicate distinct() cols, keep, maintain_order
Create/Transform
New columns mutate() specs, inplace
Metadata
Add labels apply_labels() cols, labels, categories

Expression Methods

Category Methods
Arithmetic +, -, *, /, //, %, **, .pow()
Math .abs(), .sqrt(), .log(), .log10(), .exp(), .round(), .floor(), .ceil(), .clip(), .sign()
Comparison <, <=, >, >=, ==, !=
Boolean &, \|, ~, ^
Membership .is_in(), .is_not_in(), .between()
Nulls .is_null(), .is_not_null(), .fill_null(), .fill_nan(), .drop_nulls()
Strings .lower(), .upper(), .strip(), .contains(), .startswith(), .endswith(), .replace(), .replace_all(), .str_len(), .split()
Casting .cast(), .to_float(), .to_int(), .to_str(), .to_bool()
Aggregation .sum(), .mean(), .median(), .min(), .max(), .std(), .var(), .count(), .n_unique(), .first(), .last()
Window .over(), .cum_sum(), .cum_max(), .cum_min(), .shift(), .diff(), .pct_change(), .rank(), .rolling_mean()
Naming .alias()

Helper Functions

Function Description
svy.col("name") Reference a column
svy.cols("a", "b") Reference multiple columns
svy.lit(value) Create a literal value
svy.when(cond).then(val).otherwise(default) Conditional logic
svy.coalesce(expr1, expr2, ...) First non-null value
svy.concat_str(expr1, expr2, ...) Concatenate strings
svy.sum_horizontal(...) Row-wise sum
svy.min_horizontal(...) Row-wise minimum
svy.max_horizontal(...) Row-wise maximum
svy.all_horizontal(...) Row-wise AND
svy.any_horizontal(...) Row-wise OR

Next Steps

Now that your data is clean and organized, learn how to plan your survey.

Master the basics?
Continue to Survey Planning →