import numpy as np
import polars as pl
import svy
from svy import Sample, CaseStyle, LetterCase
from rich import print as rprintSurvey Data Wrangling in Python with svy
A complete guide to cleaning, transforming, and preparing survey data
survey data wrangling Python, clean survey data svy, recode variables Python, bin continuous variables, survey data preparation, Polars data cleaning, column name standardization, top code bottom code, categorical variable recoding, mutate survey data, svy wrangling tutorial, data transformation Python
Why Wrangling in svy?
svy provides a wrangling interface for transforming the dataset within a Sample object. This approach has key advantages over extracting data to Polars and re-importing:
- Design integrity: Column renames automatically propagate to design metadata (strata, PSU, weights)
- Chained operations: Build transformation pipelines with method chaining
- Immutability: Transformations return new Sample objects by default, preserving the original
- Survey-aware: Protected columns (design variables) cannot be accidentally deleted
The wrangling module wraps common Polars operations—it’s not meant to replace Polars, but to make frequent survey data tasks faster and safer.
Column Operations
Clean Names
Standardize column names for easier downstream work. This is especially useful when importing data from SPSS, Stata, or Excel files with inconsistent naming.
df = pl.DataFrame({"First Name": ["Ana"], "Income ($)": [5000], "AGE_years": [25]})
s = Sample(df).wrangling.clean_names(
case_style=CaseStyle.SNAKE,
letter_case=LetterCase.LOWER
)
print(s.data.columns)['svy_row_index', 'first_name', 'income', 'age_years']
Remove special characters with regex:
df = pl.DataFrame({"A#B": [1], "C&D": [2], "E@F": [3]})
s = Sample(df).wrangling.clean_names(
remove=r"[^a-zA-Z0-9]",
letter_case=LetterCase.UPPER
)
print(s.data.columns)['svy_row_index', 'AB', 'CD', 'EF']
Available options:
| Parameter | Options | Description |
|---|---|---|
case_style |
SNAKE, CAMEL, PASCAL |
Target naming convention |
letter_case |
LOWER, UPPER, TITLE |
Letter case transformation |
remove |
regex string | Characters to strip |
minimal |
bool | Only trim whitespace |
Rename Columns
Rename specific columns directly:
df = pl.DataFrame({"old_name": [1, 2, 3], "another": [4, 5, 6]})
s = Sample(df).wrangling.rename_columns({"old_name": "new_name"})
print(s.data.columns)['svy_row_index', 'new_name', 'another']
When you rename a design column (stratum, PSU, weight), the Design object is automatically updated to reference the new name.
Select and Drop Columns
Keep only specific columns:
df = pl.DataFrame({"id": [1, 2], "name": ["A", "B"], "age": [20, 30], "temp": [0, 0]})
s = Sample(df)
# Keep specific columns
s_kept = s.wrangling.keep_columns(["id", "name", "age"])
print(s_kept.data.columns)
# Alias: select()
s_selected = s.wrangling.select(["id", "age"])
print(s_selected.data.columns)['id', 'name', 'age', 'svy_row_index']
['id', 'age', 'svy_row_index']
Remove unwanted columns:
# Start fresh for remove examples
s = Sample(df)
# Remove columns
s_dropped = s.wrangling.remove_columns(["temp"])
print(s_dropped.data.columns)
# Alias: drop()
s_dropped2 = s.wrangling.drop("age")
print(s_dropped2.data.columns)['svy_row_index', 'id', 'name', 'age']
['svy_row_index', 'id', 'name']
Design-referenced columns (strata, PSU, weights, etc.) are protected by default. Use force=True to remove them, which automatically cleans the design metadata. Internal svy columns (like svy_row_index) are always preserved and cannot be removed.
# This would raise an error if 'weight' is a design column
sample.wrangling.drop("weight") # Error!
# Use force=True to override
sample.wrangling.drop("weight", force=True) # Works, design cleanedAdd Row Index
Add a sequential row identifier:
df = pl.DataFrame({"name": ["Alice", "Bob", "Carol"]})
s = Sample(df).wrangling.with_row_index("row_id", offset=1)
print(s.data)shape: (3, 3)
┌────────┬───────────────┬───────┐
│ row_id ┆ svy_row_index ┆ name │
│ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ str │
╞════════╪═══════════════╪═══════╡
│ 1 ┆ 0 ┆ Alice │
│ 2 ┆ 1 ┆ Bob │
│ 3 ┆ 2 ┆ Carol │
└────────┴───────────────┴───────┘
Value Transformations
Recode Categories
Map old values to new labels. Values not in the mapping pass through unchanged.
df = pl.DataFrame({"item": ["apple", "soap", "carrot", "tv", "banana"]})
s = Sample(df)
# Map multiple values to a single category
s1 = s.wrangling.recode("item", {"Food": ["apple", "carrot", "banana"]})
print(s1.show_data(columns=["item", "svy_item_recoded"]))shape: (5, 2)
┌────────┬──────────────────┐
│ item ┆ svy_item_recoded │
│ --- ┆ --- │
│ str ┆ str │
╞════════╪══════════════════╡
│ apple ┆ Food │
│ soap ┆ soap │
│ carrot ┆ Food │
│ tv ┆ tv │
│ banana ┆ Food │
└────────┴──────────────────┘
Custom output column name:
s2 = s.wrangling.recode(
cols="item",
recodes={"Food": ["apple", "carrot", "banana"], "Non-Food": ["soap", "tv"]},
into="item_category"
)
print(s2.show_data(columns=["item", "item_category"]))shape: (5, 2)
┌────────┬───────────────┐
│ item ┆ item_category │
│ --- ┆ --- │
│ str ┆ str │
╞════════╪═══════════════╡
│ apple ┆ Food │
│ soap ┆ Non-Food │
│ carrot ┆ Food │
│ tv ┆ Non-Food │
│ banana ┆ Food │
└────────┴───────────────┘
Recode multiple columns at once:
df = pl.DataFrame({"q1": [1, 2, 3, 4, 5], "q2": [1, 1, 2, 2, 3]})
s = Sample(df).wrangling.recode(
cols=["q1", "q2"],
recodes={"Low": [1, 2], "Medium": [3], "High": [4, 5]}
)
print(s.show_data(columns=["q1", "svy_q1_recoded", "q2", "svy_q2_recoded"]))shape: (5, 4)
┌─────┬────────────────┬─────┬────────────────┐
│ q1 ┆ svy_q1_recoded ┆ q2 ┆ svy_q2_recoded │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞═════╪════════════════╪═════╪════════════════╡
│ 1 ┆ Low ┆ 1 ┆ Low │
│ 2 ┆ Low ┆ 1 ┆ Low │
│ 3 ┆ Medium ┆ 2 ┆ Low │
│ 4 ┆ High ┆ 2 ┆ Low │
│ 5 ┆ High ┆ 3 ┆ Medium │
└─────┴────────────────┴─────┴────────────────┘
Categorize Continuous Variables
Bin numeric values into labeled categories. Out-of-range values become null.
df = pl.DataFrame({"age": [5, 17, 25, 45, 70, 85]})
s = Sample(df)
# Create age groups
s_binned = s.wrangling.categorize(
col="age",
bins=[0, 18, 35, 65, 100],
labels=["0-17", "18-34", "35-64", "65+"],
into="age_group"
)
print(s_binned.show_data(columns=["age", "age_group"]))shape: (5, 2)
┌─────┬───────────┐
│ age ┆ age_group │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════════╡
│ 5 ┆ 0-17 │
│ 17 ┆ 0-17 │
│ 25 ┆ 18-34 │
│ 45 ┆ 35-64 │
│ 70 ┆ 65+ │
└─────┴───────────┘
Control bin closure (right-closed vs. left-closed):
df = pl.DataFrame({"x": [0, 10, 20, 30]})
# Right-closed (default): (0,10], (10,20], (20,30]
s_right = Sample(df).wrangling.categorize(col="x", bins=[0, 10, 20, 30], right=True)
# Left-closed: [0,10), [10,20), [20,30)
s_left = Sample(df).wrangling.categorize(col="x", bins=[0, 10, 20, 30], right=False)
print("Right-closed:", s_right.data["svy_x_categorized"].to_list())
print("Left-closed:", s_left.data["svy_x_categorized"].to_list())Right-closed: [None, '(0, 10]', '(10, 20]', '(20, 30]']
Left-closed: ['[0, 10)', '[10, 20)', '[20, 30)', None]
Cap Extremes (Top/Bottom Coding)
Clamp outliers to specified bounds—common in income and expenditure surveys.
df = pl.DataFrame({"income": [500, 5000, 50000, 500000, 5000000]})
s = Sample(df)
# Top code: cap at 100,000
s_top = s.wrangling.top_code({"income": 100000})
print(s_top.show_data(columns=["income", "svy_income_top_coded"]))shape: (5, 2)
┌─────────┬──────────────────────┐
│ income ┆ svy_income_top_coded │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════════╪══════════════════════╡
│ 500 ┆ 500 │
│ 5000 ┆ 5000 │
│ 50000 ┆ 50000 │
│ 500000 ┆ 100000 │
│ 5000000 ┆ 100000 │
└─────────┴──────────────────────┘
# Bottom code: floor at 1,000
s_bottom = s.wrangling.bottom_code({"income": 1000})
print(s_bottom.show_data(columns=["income", "svy_income_bottom_coded"]))shape: (5, 2)
┌─────────┬─────────────────────────┐
│ income ┆ svy_income_bottom_coded │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════════╪═════════════════════════╡
│ 500 ┆ 1000 │
│ 5000 ┆ 5000 │
│ 50000 ┆ 50000 │
│ 500000 ┆ 500000 │
│ 5000000 ┆ 5000000 │
└─────────┴─────────────────────────┘
# Both: clamp to [1000, 100000]
s_clamped = s.wrangling.bottom_and_top_code({"income": (1000, 100000)})
print(s_clamped.show_data(columns=["income", "svy_income_bottom_and_top_coded"]))shape: (5, 2)
┌─────────┬─────────────────────────────────┐
│ income ┆ svy_income_bottom_and_top_code… │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════════╪═════════════════════════════════╡
│ 500 ┆ 1000 │
│ 5000 ┆ 5000 │
│ 50000 ┆ 50000 │
│ 500000 ┆ 100000 │
│ 5000000 ┆ 100000 │
└─────────┴─────────────────────────────────┘
Replace the original column:
s_replaced = s.wrangling.bottom_and_top_code({"income": (1000, 100000)}, replace=True)
print(s_replaced.data["income"].to_list())[1000, 5000, 50000, 100000, 100000]
Cast Data Types
Convert columns to different data types:
df = pl.DataFrame({"id": ["1", "2", "3"], "value": [1, 2, 3]})
s = Sample(df)
# Single column
s1 = s.wrangling.cast("id", pl.Int64)
print(s1.data.schema)
# Multiple columns with different types
s2 = s.wrangling.cast({"id": pl.Int64, "value": pl.Float64})
print(s2.data.schema)Schema({'svy_row_index': UInt32, 'id': Int64, 'value': Int64})
Schema({'svy_row_index': UInt32, 'id': Int64, 'value': Float64})
Handling Missing Values
Fill Nulls
Fill missing values with a constant or strategy:
df = pl.DataFrame({
"income": [1000, None, 3000, None, 5000],
"age": [25, 30, None, 40, None]
})
s = Sample(df)
# Fill with constant
s1 = s.wrangling.fill_null("income", value=0)
print(s1.data["income"].to_list())
# Fill with median
s2 = s.wrangling.fill_null("income", strategy="mean")
print(s2.data["income"].to_list())[1000, 0, 3000, 0, 5000]
[1000, 0, 3000, 0, 5000]
Available strategies:
| Strategy | Description |
|---|---|
"forward" |
Fill with previous non-null value |
"backward" |
Fill with next non-null value |
"mean" |
Fill with column mean |
"min" |
Fill with column minimum |
"max" |
Fill with column maximum |
"zero" |
Fill with 0 |
"one" |
Fill with 1 |
Row Operations
Filter Records
Filter rows based on conditions:
df = pl.DataFrame({
"region": ["North", "South", "North", "East", "South"],
"age": [25, 45, 30, 55, 35],
"income": [30000, 50000, 40000, 60000, 45000]
})
s = Sample(df)
# Filter with dict (equality/membership)
s_north = s.wrangling.filter_records({"region": "North"})
print(s_north.data)shape: (2, 4)
┌───────────────┬────────┬─────┬────────┐
│ svy_row_index ┆ region ┆ age ┆ income │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ i64 ┆ i64 │
╞═══════════════╪════════╪═════╪════════╡
│ 0 ┆ North ┆ 25 ┆ 30000 │
│ 2 ┆ North ┆ 30 ┆ 40000 │
└───────────────┴────────┴─────┴────────┘
Filter with multiple values:
s_north_south = s.wrangling.filter_records({"region": ["North", "South"]})
print(s_north_south.data)shape: (2, 4)
┌───────────────┬────────┬─────┬────────┐
│ svy_row_index ┆ region ┆ age ┆ income │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ i64 ┆ i64 │
╞═══════════════╪════════╪═════╪════════╡
│ 0 ┆ North ┆ 25 ┆ 30000 │
│ 2 ┆ North ┆ 30 ┆ 40000 │
└───────────────┴────────┴─────┴────────┘
Filter with expressions:
# Complex conditions
s_filtered = s.wrangling.filter_records([
svy.col("age") >= 30,
svy.col("income") > 40000
])
print(s_filtered.data)shape: (0, 4)
┌───────────────┬────────┬─────┬────────┐
│ svy_row_index ┆ region ┆ age ┆ income │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ i64 ┆ i64 │
╞═══════════════╪════════╪═════╪════════╡
└───────────────┴────────┴─────┴────────┘
Negate the filter (exclude matching rows):
s_not_north = s.wrangling.filter_records({"region": "North"}, negate=True)
print(s_not_north.data)shape: (0, 4)
┌───────────────┬────────┬─────┬────────┐
│ svy_row_index ┆ region ┆ age ┆ income │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ i64 ┆ i64 │
╞═══════════════╪════════╪═════╪════════╡
└───────────────┴────────┴─────┴────────┘
Sort Rows
Sort by one or more columns:
df = pl.DataFrame({
"name": ["Carol", "Alice", "Bob", "Diana"],
"age": [30, 25, 35, 25]
})
s = Sample(df)
# Sort ascending
s_sorted = s.wrangling.sort("age")
print(s_sorted.data)
# Sort descending
s_sorted_desc = s.wrangling.sort("age", descending=True)
print(s_sorted_desc.data)
# Sort by multiple columns
s_multi = s.wrangling.sort(["age", "name"], descending=[False, False])
print(s_multi.data)shape: (4, 3)
┌───────────────┬───────┬─────┐
│ svy_row_index ┆ name ┆ age │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ i64 │
╞═══════════════╪═══════╪═════╡
│ 1 ┆ Alice ┆ 25 │
│ 3 ┆ Diana ┆ 25 │
│ 0 ┆ Carol ┆ 30 │
│ 2 ┆ Bob ┆ 35 │
└───────────────┴───────┴─────┘
shape: (4, 3)
┌───────────────┬───────┬─────┐
│ svy_row_index ┆ name ┆ age │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ i64 │
╞═══════════════╪═══════╪═════╡
│ 2 ┆ Bob ┆ 35 │
│ 0 ┆ Carol ┆ 30 │
│ 3 ┆ Diana ┆ 25 │
│ 1 ┆ Alice ┆ 25 │
└───────────────┴───────┴─────┘
shape: (4, 3)
┌───────────────┬───────┬─────┐
│ svy_row_index ┆ name ┆ age │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ i64 │
╞═══════════════╪═══════╪═════╡
│ 1 ┆ Alice ┆ 25 │
│ 3 ┆ Diana ┆ 25 │
│ 0 ┆ Carol ┆ 30 │
│ 2 ┆ Bob ┆ 35 │
└───────────────┴───────┴─────┘
Remove Duplicates
Keep unique rows:
df = pl.DataFrame({
"household_id": [1, 1, 2, 2, 3],
"member": ["A", "B", "A", "B", "A"],
"value": [100, 200, 150, 250, 300]
})
s = Sample(df)
# Unique by specific column(s)
s_unique = s.wrangling.distinct("household_id")
print(s_unique.data)
# Keep last occurrence instead of first
s_last = s.wrangling.distinct("household_id", keep="last")
print(s_last.data)shape: (3, 4)
┌───────────────┬──────────────┬────────┬───────┐
│ svy_row_index ┆ household_id ┆ member ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str ┆ i64 │
╞═══════════════╪══════════════╪════════╪═══════╡
│ 0 ┆ 1 ┆ A ┆ 100 │
│ 2 ┆ 2 ┆ A ┆ 150 │
│ 4 ┆ 3 ┆ A ┆ 300 │
└───────────────┴──────────────┴────────┴───────┘
shape: (3, 4)
┌───────────────┬──────────────┬────────┬───────┐
│ svy_row_index ┆ household_id ┆ member ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str ┆ i64 │
╞═══════════════╪══════════════╪════════╪═══════╡
│ 0 ┆ 1 ┆ A ┆ 100 │
│ 2 ┆ 2 ┆ A ┆ 150 │
│ 4 ┆ 3 ┆ A ┆ 300 │
└───────────────┴──────────────┴────────┴───────┘
Creating Variables with Mutate
The mutate() method is the most flexible way to create or transform columns.
Scalar Values
Create constant columns with scalar values:
df = pl.DataFrame({"x": [1, 2, 3]})
s = Sample(df)
s_mutated = s.wrangling.mutate({
"_one": 1, # integer
"pi": 3.14159, # float
"group": "A", # string
"flag": True, # boolean
"empty": None, # null
})
print(s_mutated.data)shape: (3, 7)
┌───────────────┬─────┬──────┬───────┬───────┬──────┬─────────┐
│ svy_row_index ┆ x ┆ flag ┆ empty ┆ group ┆ _one ┆ pi │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ bool ┆ null ┆ str ┆ i32 ┆ f64 │
╞═══════════════╪═════╪══════╪═══════╪═══════╪══════╪═════════╡
│ 0 ┆ 1 ┆ true ┆ null ┆ A ┆ 1 ┆ 3.14159 │
│ 1 ┆ 2 ┆ true ┆ null ┆ A ┆ 1 ┆ 3.14159 │
│ 2 ┆ 3 ┆ true ┆ null ┆ A ┆ 1 ┆ 3.14159 │
└───────────────┴─────┴──────┴───────┴───────┴──────┴─────────┘
Expressions
Transform columns using expressions:
df = pl.DataFrame({"income": [30000, 50000, 75000], "household_size": [2, 4, 3]})
s = Sample(df)
s_transformed = s.wrangling.mutate({
"income_k": svy.col("income") / 1000,
"per_capita": svy.col("income") / svy.col("household_size"),
"log_income": svy.col("income").log(),
})
print(s_transformed.data)shape: (3, 6)
┌───────────────┬────────┬────────────────┬────────────┬──────────┬────────────┐
│ svy_row_index ┆ income ┆ household_size ┆ per_capita ┆ income_k ┆ log_income │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════╪════════╪════════════════╪════════════╪══════════╪════════════╡
│ 0 ┆ 30000 ┆ 2 ┆ 15000.0 ┆ 30.0 ┆ 10.308953 │
│ 1 ┆ 50000 ┆ 4 ┆ 12500.0 ┆ 50.0 ┆ 10.819778 │
│ 2 ┆ 75000 ┆ 3 ┆ 25000.0 ┆ 75.0 ┆ 11.225243 │
└───────────────┴────────┴────────────────┴────────────┴──────────┴────────────┘
Conditional Logic
Use when().then().otherwise() for if-else logic:
df = pl.DataFrame({"age": [15, 25, 45, 70], "income": [0, 30000, 60000, 25000]})
s = Sample(df)
s_conditional = s.wrangling.mutate({
"age_group": (
svy.when(svy.col("age") < 18).then("child")
.when(svy.col("age") < 65).then("adult")
.otherwise("senior")
),
"income_bracket": (
svy.when(svy.col("income") < 20000).then("low")
.when(svy.col("income") < 50000).then("middle")
.otherwise("high")
),
})
print(s_conditional.data)shape: (4, 5)
┌───────────────┬─────┬────────┬───────────┬────────────────┐
│ svy_row_index ┆ age ┆ income ┆ age_group ┆ income_bracket │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ str ┆ str │
╞═══════════════╪═════╪════════╪═══════════╪════════════════╡
│ 0 ┆ 15 ┆ 0 ┆ child ┆ low │
│ 1 ┆ 25 ┆ 30000 ┆ adult ┆ middle │
│ 2 ┆ 45 ┆ 60000 ┆ adult ┆ high │
│ 3 ┆ 70 ┆ 25000 ┆ senior ┆ middle │
└───────────────┴─────┴────────┴───────────┴────────────────┘
Dependencies Within Same Call
Columns created in the same mutate() call can reference each other:
df = pl.DataFrame({"x": [1, 2, 3, 4, 5]})
s = Sample(df)
s_chained = s.wrangling.mutate({
"x_squared": svy.col("x") ** 2,
"x_cubed": svy.col("x") ** 3,
"sum_powers": svy.col("x_squared") + svy.col("x_cubed"), # References columns created above
})
print(s_chained.data)shape: (5, 5)
┌───────────────┬─────┬───────────┬─────────┬────────────┐
│ svy_row_index ┆ x ┆ x_squared ┆ x_cubed ┆ sum_powers │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════════╪═════╪═══════════╪═════════╪════════════╡
│ 0 ┆ 1 ┆ 1 ┆ 1 ┆ 2 │
│ 1 ┆ 2 ┆ 4 ┆ 8 ┆ 12 │
│ 2 ┆ 3 ┆ 9 ┆ 27 ┆ 36 │
│ 3 ┆ 4 ┆ 16 ┆ 64 ┆ 80 │
│ 4 ┆ 5 ┆ 25 ┆ 125 ┆ 150 │
└───────────────┴─────┴───────────┴─────────┴────────────┘
Arrays and Series
Pass arrays directly (must match row count):
df = pl.DataFrame({"id": [1, 2, 3]})
s = Sample(df)
s_arrays = s.wrangling.mutate({
"from_list": [10, 20, 30],
"from_numpy": np.array([100, 200, 300]),
"from_series": pl.Series([1.1, 2.2, 3.3]),
})
print(s_arrays.data)shape: (3, 5)
┌───────────────┬─────┬───────────┬─────────────┬────────────┐
│ svy_row_index ┆ id ┆ from_list ┆ from_series ┆ from_numpy │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ f64 ┆ i64 │
╞═══════════════╪═════╪═══════════╪═════════════╪════════════╡
│ 0 ┆ 1 ┆ 10 ┆ 1.1 ┆ 100 │
│ 1 ┆ 2 ┆ 20 ┆ 2.2 ┆ 200 │
│ 2 ┆ 3 ┆ 30 ┆ 3.3 ┆ 300 │
└───────────────┴─────┴───────────┴─────────────┴────────────┘
Expression Methods Reference
The svy.col() expression supports a rich set of methods for transformations.
Arithmetic
df = pl.DataFrame({"x": [10, 20, 30], "y": [3, 4, 5]})
s = Sample(df)
s_arith = s.wrangling.mutate({
"add": svy.col("x") + svy.col("y"),
"sub": svy.col("x") - svy.col("y"),
"mul": svy.col("x") * svy.col("y"),
"div": svy.col("x") / svy.col("y"),
"floordiv": svy.col("x") // svy.col("y"),
"mod": svy.col("x") % svy.col("y"),
"pow": svy.col("y") ** 2,
})
print(s_arith.data)shape: (3, 10)
┌───────────────┬─────┬─────┬──────────┬───┬─────┬─────┬─────┬─────┐
│ svy_row_index ┆ x ┆ y ┆ floordiv ┆ … ┆ mul ┆ mod ┆ pow ┆ sub │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════════╪═════╪═════╪══════════╪═══╪═════╪═════╪═════╪═════╡
│ 0 ┆ 10 ┆ 3 ┆ 3 ┆ … ┆ 30 ┆ 1 ┆ 9 ┆ 7 │
│ 1 ┆ 20 ┆ 4 ┆ 5 ┆ … ┆ 80 ┆ 0 ┆ 16 ┆ 16 │
│ 2 ┆ 30 ┆ 5 ┆ 6 ┆ … ┆ 150 ┆ 0 ┆ 25 ┆ 25 │
└───────────────┴─────┴─────┴──────────┴───┴─────┴─────┴─────┴─────┘
Math Functions
df = pl.DataFrame({"x": [1.0, 4.0, 9.0, 16.0]})
s = Sample(df)
s_math = s.wrangling.mutate({
"sqrt": svy.col("x").sqrt(),
"log": svy.col("x").log(),
"log10": svy.col("x").log10(),
"exp": svy.col("x").exp(),
"abs": (-svy.col("x")).abs(),
"round": (svy.col("x") / 3).round(2),
"floor": (svy.col("x") / 3).floor(),
"ceil": (svy.col("x") / 3).ceil(),
})
print(s_math.data)shape: (4, 10)
┌───────────────┬──────┬───────┬──────────┬───┬──────┬──────┬──────────┬──────┐
│ svy_row_index ┆ x ┆ round ┆ log ┆ … ┆ ceil ┆ sqrt ┆ log10 ┆ abs │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════╪══════╪═══════╪══════════╪═══╪══════╪══════╪══════════╪══════╡
│ 0 ┆ 1.0 ┆ 0.33 ┆ 0.0 ┆ … ┆ 1.0 ┆ 1.0 ┆ 0.0 ┆ 1.0 │
│ 1 ┆ 4.0 ┆ 1.33 ┆ 1.386294 ┆ … ┆ 2.0 ┆ 2.0 ┆ 0.60206 ┆ 4.0 │
│ 2 ┆ 9.0 ┆ 3.0 ┆ 2.197225 ┆ … ┆ 3.0 ┆ 3.0 ┆ 0.954243 ┆ 9.0 │
│ 3 ┆ 16.0 ┆ 5.33 ┆ 2.772589 ┆ … ┆ 6.0 ┆ 4.0 ┆ 1.20412 ┆ 16.0 │
└───────────────┴──────┴───────┴──────────┴───┴──────┴──────┴──────────┴──────┘
Clipping Values
df = pl.DataFrame({"x": [1, 5, 10, 15, 20]})
s = Sample(df)
s_clipped = s.wrangling.mutate({
"clipped": svy.col("x").clip(5, 15),
})
print(s_clipped.data)shape: (5, 3)
┌───────────────┬─────┬─────────┐
│ svy_row_index ┆ x ┆ clipped │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 │
╞═══════════════╪═════╪═════════╡
│ 0 ┆ 1 ┆ 5 │
│ 1 ┆ 5 ┆ 5 │
│ 2 ┆ 10 ┆ 10 │
│ 3 ┆ 15 ┆ 15 │
│ 4 ┆ 20 ┆ 15 │
└───────────────┴─────┴─────────┘
Comparisons and Boolean Logic
df = pl.DataFrame({"x": [1, 2, 3, 4, 5], "y": [5, 4, 3, 2, 1]})
s = Sample(df)
s_bool = s.wrangling.mutate({
"x_gt_2": svy.col("x") > 2,
"x_eq_y": svy.col("x") == svy.col("y"),
"both": (svy.col("x") > 2) & (svy.col("y") > 2),
"either": (svy.col("x") > 4) | (svy.col("y") > 4),
"not_gt_2": ~(svy.col("x") > 2),
})
print(s_bool.data)shape: (5, 8)
┌───────────────┬─────┬─────┬────────┬───────┬────────┬────────┬──────────┐
│ svy_row_index ┆ x ┆ y ┆ either ┆ both ┆ x_gt_2 ┆ x_eq_y ┆ not_gt_2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ bool ┆ bool ┆ bool ┆ bool ┆ bool │
╞═══════════════╪═════╪═════╪════════╪═══════╪════════╪════════╪══════════╡
│ 0 ┆ 1 ┆ 5 ┆ true ┆ false ┆ false ┆ false ┆ true │
│ 1 ┆ 2 ┆ 4 ┆ false ┆ false ┆ false ┆ false ┆ true │
│ 2 ┆ 3 ┆ 3 ┆ false ┆ true ┆ true ┆ true ┆ false │
│ 3 ┆ 4 ┆ 2 ┆ false ┆ false ┆ true ┆ false ┆ false │
│ 4 ┆ 5 ┆ 1 ┆ true ┆ false ┆ true ┆ false ┆ false │
└───────────────┴─────┴─────┴────────┴───────┴────────┴────────┴──────────┘
Membership Testing
df = pl.DataFrame({"category": ["A", "B", "C", "D", "E"], "value": [1, 5, 10, 15, 20]})
s = Sample(df)
s_membership = s.wrangling.mutate({
"in_list": svy.col("category").is_in(["A", "C", "E"]),
"not_in_list": svy.col("category").is_not_in(["A", "C", "E"]),
"in_range": svy.col("value").between(5, 15),
})
print(s_membership.data)shape: (5, 6)
┌───────────────┬──────────┬───────┬─────────────┬──────────┬─────────┐
│ svy_row_index ┆ category ┆ value ┆ not_in_list ┆ in_range ┆ in_list │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ i64 ┆ bool ┆ bool ┆ bool │
╞═══════════════╪══════════╪═══════╪═════════════╪══════════╪═════════╡
│ 0 ┆ A ┆ 1 ┆ false ┆ false ┆ true │
│ 1 ┆ B ┆ 5 ┆ true ┆ true ┆ false │
│ 2 ┆ C ┆ 10 ┆ false ┆ true ┆ true │
│ 3 ┆ D ┆ 15 ┆ true ┆ true ┆ false │
│ 4 ┆ E ┆ 20 ┆ false ┆ false ┆ true │
└───────────────┴──────────┴───────┴─────────────┴──────────┴─────────┘
Null Handling
df = pl.DataFrame({"x": [1, None, 3, None, 5]})
s = Sample(df)
s_nulls = s.wrangling.mutate({
"is_null": svy.col("x").is_null(),
"is_not_null": svy.col("x").is_not_null(),
"filled_zero": svy.col("x").fill_null(0),
"filled_expr": svy.col("x").fill_null(svy.col("x").mean()),
})
print(s_nulls.data)shape: (5, 6)
┌───────────────┬──────┬─────────────┬─────────────┬─────────────┬─────────┐
│ svy_row_index ┆ x ┆ is_not_null ┆ filled_expr ┆ filled_zero ┆ is_null │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ bool ┆ f64 ┆ i64 ┆ bool │
╞═══════════════╪══════╪═════════════╪═════════════╪═════════════╪═════════╡
│ 0 ┆ 1 ┆ true ┆ 1.0 ┆ 1 ┆ false │
│ 1 ┆ null ┆ false ┆ 3.0 ┆ 0 ┆ true │
│ 2 ┆ 3 ┆ true ┆ 3.0 ┆ 3 ┆ false │
│ 3 ┆ null ┆ false ┆ 3.0 ┆ 0 ┆ true │
│ 4 ┆ 5 ┆ true ┆ 5.0 ┆ 5 ┆ false │
└───────────────┴──────┴─────────────┴─────────────┴─────────────┴─────────┘
String Operations
df = pl.DataFrame({"name": [" Alice ", "BOB", "carol", "Diana Smith"]})
s = Sample(df)
s_strings = s.wrangling.mutate({
"stripped": svy.col("name").strip(),
"lower": svy.col("name").lower(),
"upper": svy.col("name").upper(),
"title": svy.col("name").title(),
"length": svy.col("name").str_len(),
"contains_a": svy.col("name").lower().contains("a"),
"starts_d": svy.col("name").startswith("D"),
})
print(s_strings.data)shape: (4, 9)
┌────────────┬────────────┬──────────┬────────────┬───┬───────────┬───────────┬───────────┬────────┐
│ svy_row_in ┆ name ┆ starts_d ┆ contains_a ┆ … ┆ lower ┆ stripped ┆ title ┆ length │
│ dex ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ --- ┆ str ┆ bool ┆ bool ┆ ┆ str ┆ str ┆ str ┆ u32 │
│ u32 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
╞════════════╪════════════╪══════════╪════════════╪═══╪═══════════╪═══════════╪═══════════╪════════╡
│ 0 ┆ Alice ┆ false ┆ true ┆ … ┆ alice ┆ Alice ┆ Alice ┆ 9 │
│ 1 ┆ BOB ┆ false ┆ false ┆ … ┆ bob ┆ BOB ┆ Bob ┆ 3 │
│ 2 ┆ carol ┆ false ┆ true ┆ … ┆ carol ┆ carol ┆ Carol ┆ 5 │
│ 3 ┆ Diana ┆ true ┆ true ┆ … ┆ diana ┆ Diana ┆ Diana ┆ 11 │
│ ┆ Smith ┆ ┆ ┆ ┆ smith ┆ Smith ┆ Smith ┆ │
└────────────┴────────────┴──────────┴────────────┴───┴───────────┴───────────┴───────────┴────────┘
String replacement:
df = pl.DataFrame({"text": ["hello world", "world world", "goodbye"]})
s = Sample(df)
s_replace = s.wrangling.mutate({
"replace_first": svy.col("text").replace("world", "python"),
"replace_all": svy.col("text").replace_all("world", "python"),
})
print(s_replace.data)shape: (3, 4)
┌───────────────┬─────────────┬───────────────┬───────────────┐
│ svy_row_index ┆ text ┆ replace_first ┆ replace_all │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str │
╞═══════════════╪═════════════╪═══════════════╪═══════════════╡
│ 0 ┆ hello world ┆ hello python ┆ hello python │
│ 1 ┆ world world ┆ python world ┆ python python │
│ 2 ┆ goodbye ┆ goodbye ┆ goodbye │
└───────────────┴─────────────┴───────────────┴───────────────┘
Type Casting
df = pl.DataFrame({"x": [1, 2, 3], "y": ["10", "20", "30"]})
s = Sample(df)
s_cast = s.wrangling.mutate({
"x_float": svy.col("x").to_float(),
"x_str": svy.col("x").to_str(),
"y_int": svy.col("y").cast(pl.Int64),
})
print(s_cast.data)shape: (3, 6)
┌───────────────┬─────┬─────┬───────┬─────────┬───────┐
│ svy_row_index ┆ x ┆ y ┆ y_int ┆ x_float ┆ x_str │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str ┆ i64 ┆ f64 ┆ str │
╞═══════════════╪═════╪═════╪═══════╪═════════╪═══════╡
│ 0 ┆ 1 ┆ 10 ┆ 10 ┆ 1.0 ┆ 1 │
│ 1 ┆ 2 ┆ 20 ┆ 20 ┆ 2.0 ┆ 2 │
│ 2 ┆ 3 ┆ 30 ┆ 30 ┆ 3.0 ┆ 3 │
└───────────────┴─────┴─────┴───────┴─────────┴───────┘
Aggregations in Window Context
Use .over() to compute aggregations within groups:
df = pl.DataFrame({
"region": ["North", "North", "South", "South", "South"],
"sales": [100, 150, 200, 250, 300]
})
s = Sample(df)
s_window = s.wrangling.mutate({
"region_total": svy.col("sales").sum().over("region"),
"region_mean": svy.col("sales").mean().over("region"),
"region_max": svy.col("sales").max().over("region"),
"pct_of_region": svy.col("sales") / svy.col("sales").sum().over("region") * 100,
})
print(s_window.data)shape: (5, 7)
┌───────────────┬────────┬───────┬─────────────┬────────────┬───────────────┬──────────────┐
│ svy_row_index ┆ region ┆ sales ┆ region_mean ┆ region_max ┆ pct_of_region ┆ region_total │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ i64 ┆ f64 ┆ i64 ┆ f64 ┆ i64 │
╞═══════════════╪════════╪═══════╪═════════════╪════════════╪═══════════════╪══════════════╡
│ 0 ┆ North ┆ 100 ┆ 125.0 ┆ 150 ┆ 40.0 ┆ 250 │
│ 1 ┆ North ┆ 150 ┆ 125.0 ┆ 150 ┆ 60.0 ┆ 250 │
│ 2 ┆ South ┆ 200 ┆ 250.0 ┆ 300 ┆ 26.666667 ┆ 750 │
│ 3 ┆ South ┆ 250 ┆ 250.0 ┆ 300 ┆ 33.333333 ┆ 750 │
│ 4 ┆ South ┆ 300 ┆ 250.0 ┆ 300 ┆ 40.0 ┆ 750 │
└───────────────┴────────┴───────┴─────────────┴────────────┴───────────────┴──────────────┘
Cumulative Functions
df = pl.DataFrame({"x": [1, 2, 3, 4, 5]})
s = Sample(df)
s_cumulative = s.wrangling.mutate({
"cum_sum": svy.col("x").cum_sum(),
"cum_max": svy.col("x").cum_max(),
"cum_min": svy.col("x").cum_min(),
})
print(s_cumulative.data)shape: (5, 5)
┌───────────────┬─────┬─────────┬─────────┬─────────┐
│ svy_row_index ┆ x ┆ cum_min ┆ cum_sum ┆ cum_max │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════════╪═════╪═════════╪═════════╪═════════╡
│ 0 ┆ 1 ┆ 1 ┆ 1 ┆ 1 │
│ 1 ┆ 2 ┆ 1 ┆ 3 ┆ 2 │
│ 2 ┆ 3 ┆ 1 ┆ 6 ┆ 3 │
│ 3 ┆ 4 ┆ 1 ┆ 10 ┆ 4 │
│ 4 ┆ 5 ┆ 1 ┆ 15 ┆ 5 │
└───────────────┴─────┴─────────┴─────────┴─────────┘
Shift and Diff
df = pl.DataFrame({"value": [100, 110, 105, 120, 115]})
s = Sample(df)
s_shift = s.wrangling.mutate({
"prev_value": svy.col("value").shift(1),
"next_value": svy.col("value").shift(-1),
"change": svy.col("value").diff(),
"pct_change": svy.col("value").pct_change(),
})
print(s_shift.data)shape: (5, 6)
┌───────────────┬───────┬────────────┬────────────┬────────┬────────────┐
│ svy_row_index ┆ value ┆ next_value ┆ pct_change ┆ change ┆ prev_value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═══════════════╪═══════╪════════════╪════════════╪════════╪════════════╡
│ 0 ┆ 100 ┆ 110 ┆ null ┆ null ┆ null │
│ 1 ┆ 110 ┆ 105 ┆ 0.1 ┆ 10 ┆ 100 │
│ 2 ┆ 105 ┆ 120 ┆ -0.045455 ┆ -5 ┆ 110 │
│ 3 ┆ 120 ┆ 115 ┆ 0.142857 ┆ 15 ┆ 105 │
│ 4 ┆ 115 ┆ null ┆ -0.041667 ┆ -5 ┆ 120 │
└───────────────┴───────┴────────────┴────────────┴────────┴────────────┘
Ranking
df = pl.DataFrame({"score": [85, 92, 78, 92, 88]})
s = Sample(df)
s_rank = s.wrangling.mutate({
"rank": svy.col("score").rank(),
"rank_desc": svy.col("score").rank(descending=True),
})
print(s_rank.data)shape: (5, 4)
┌───────────────┬───────┬──────┬───────────┐
│ svy_row_index ┆ score ┆ rank ┆ rank_desc │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ u32 ┆ u32 │
╞═══════════════╪═══════╪══════╪═══════════╡
│ 0 ┆ 85 ┆ 2 ┆ 4 │
│ 1 ┆ 92 ┆ 4 ┆ 1 │
│ 2 ┆ 78 ┆ 1 ┆ 5 │
│ 3 ┆ 92 ┆ 5 ┆ 2 │
│ 4 ┆ 88 ┆ 3 ┆ 3 │
└───────────────┴───────┴──────┴───────────┘
Helper Functions
Coalesce
Return the first non-null value across multiple columns:
df = pl.DataFrame({
"primary_phone": ["555-1234", None, None],
"secondary_phone": [None, "555-5678", None],
"emergency_phone": ["555-9999", "555-9999", "555-9999"]
})
s = Sample(df)
s_coalesce = s.wrangling.mutate({
"contact_phone": svy.coalesce(
svy.col("primary_phone"),
svy.col("secondary_phone"),
svy.col("emergency_phone")
),
})
print(s_coalesce.data)shape: (3, 5)
┌───────────────┬───────────────┬─────────────────┬─────────────────┬───────────────┐
│ svy_row_index ┆ primary_phone ┆ secondary_phone ┆ emergency_phone ┆ contact_phone │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str ┆ str │
╞═══════════════╪═══════════════╪═════════════════╪═════════════════╪═══════════════╡
│ 0 ┆ 555-1234 ┆ null ┆ 555-9999 ┆ 555-1234 │
│ 1 ┆ null ┆ 555-5678 ┆ 555-9999 ┆ 555-5678 │
│ 2 ┆ null ┆ null ┆ 555-9999 ┆ 555-9999 │
└───────────────┴───────────────┴─────────────────┴─────────────────┴───────────────┘
Concatenate Strings
Join string columns together:
df = pl.DataFrame({
"first_name": ["John", "Jane"],
"last_name": ["Doe", "Smith"]
})
s = Sample(df)
s_concat = s.wrangling.mutate({
"full_name": svy.concat_str(
svy.col("first_name"),
svy.lit(" "),
svy.col("last_name")
),
})
print(s_concat.data)shape: (2, 4)
┌───────────────┬────────────┬───────────┬────────────┐
│ svy_row_index ┆ first_name ┆ last_name ┆ full_name │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str │
╞═══════════════╪════════════╪═══════════╪════════════╡
│ 0 ┆ John ┆ Doe ┆ John Doe │
│ 1 ┆ Jane ┆ Smith ┆ Jane Smith │
└───────────────┴────────────┴───────────┴────────────┘
Horizontal Operations
Perform row-wise operations across multiple columns:
df = pl.DataFrame({
"q1": [4, 3, 5, 2],
"q2": [3, 4, 4, 3],
"q3": [5, 3, 5, 4]
})
s = Sample(df)
s_horizontal = s.wrangling.mutate({
"total_score": svy.sum_horizontal(svy.col("q1"), svy.col("q2"), svy.col("q3")),
"min_score": svy.min_horizontal(svy.col("q1"), svy.col("q2"), svy.col("q3")),
"max_score": svy.max_horizontal(svy.col("q1"), svy.col("q2"), svy.col("q3")),
})
print(s_horizontal.data)shape: (4, 7)
┌───────────────┬─────┬─────┬─────┬───────────┬───────────┬─────────────┐
│ svy_row_index ┆ q1 ┆ q2 ┆ q3 ┆ min_score ┆ max_score ┆ total_score │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════════╪═════╪═════╪═════╪═══════════╪═══════════╪═════════════╡
│ 0 ┆ 4 ┆ 3 ┆ 5 ┆ 3 ┆ 5 ┆ 12 │
│ 1 ┆ 3 ┆ 4 ┆ 3 ┆ 3 ┆ 4 ┆ 10 │
│ 2 ┆ 5 ┆ 4 ┆ 5 ┆ 4 ┆ 5 ┆ 14 │
│ 3 ┆ 2 ┆ 3 ┆ 4 ┆ 2 ┆ 4 ┆ 9 │
└───────────────┴─────┴─────┴─────┴───────────┴───────────┴─────────────┘
Boolean horizontal operations:
df = pl.DataFrame({
"passed_test1": [True, True, False, True],
"passed_test2": [True, False, False, True],
"passed_test3": [True, True, False, False]
})
s = Sample(df)
s_bool_horiz = s.wrangling.mutate({
"passed_all": svy.all_horizontal(
svy.col("passed_test1"),
svy.col("passed_test2"),
svy.col("passed_test3")
),
"passed_any": svy.any_horizontal(
svy.col("passed_test1"),
svy.col("passed_test2"),
svy.col("passed_test3")
),
})
print(s_bool_horiz.data)shape: (4, 6)
┌───────────────┬──────────────┬──────────────┬──────────────┬────────────┬────────────┐
│ svy_row_index ┆ passed_test1 ┆ passed_test2 ┆ passed_test3 ┆ passed_any ┆ passed_all │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ bool ┆ bool ┆ bool ┆ bool ┆ bool │
╞═══════════════╪══════════════╪══════════════╪══════════════╪════════════╪════════════╡
│ 0 ┆ true ┆ true ┆ true ┆ true ┆ true │
│ 1 ┆ true ┆ false ┆ true ┆ true ┆ false │
│ 2 ┆ false ┆ false ┆ false ┆ false ┆ false │
│ 3 ┆ true ┆ true ┆ false ┆ true ┆ false │
└───────────────┴──────────────┴──────────────┴──────────────┴────────────┴────────────┘
Method Chaining
All wrangling methods return the Sample, enabling fluent pipelines:
df = pl.DataFrame({
"First Name": ["alice", "bob", "carol"],
"Age_Years": [25, None, 35],
"Income ($)": [30000, 50000, 1000000]
})
result = (
Sample(df)
.wrangling.clean_names(case_style=CaseStyle.SNAKE, letter_case=LetterCase.LOWER)
.wrangling.fill_null("age_years", strategy="mean")
.wrangling.top_code({"income": 200000}, replace=True)
.wrangling.mutate({
"income_k": svy.col("income") / 1000,
"is_adult": svy.col("age_years") >= 18,
})
)
print(result.data)shape: (3, 6)
┌───────────────┬────────────┬───────────┬────────┬──────────┬──────────┐
│ svy_row_index ┆ first_name ┆ age_years ┆ income ┆ is_adult ┆ income_k │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ i64 ┆ i64 ┆ bool ┆ f64 │
╞═══════════════╪════════════╪═══════════╪════════╪══════════╪══════════╡
│ 0 ┆ alice ┆ 25 ┆ 30000 ┆ true ┆ 30.0 │
│ 1 ┆ bob ┆ 30 ┆ 50000 ┆ true ┆ 50.0 │
│ 2 ┆ carol ┆ 35 ┆ 200000 ┆ true ┆ 200.0 │
└───────────────┴────────────┴───────────┴────────┴──────────┴──────────┘
Working with Survey Design
When your Sample has a design, wrangling operations maintain design integrity:
df = pl.DataFrame({
"id": [1, 2, 3, 4],
"stratum_var": ["A", "A", "B", "B"],
"psu_var": [1, 1, 2, 2],
"wgt_var": [1.0, 1.2, 0.9, 1.1],
"income": [30000, 50000, 40000, 60000]
})
design = svy.Design(stratum="stratum_var", psu="psu_var", wgt="wgt_var")
sample = svy.Sample(df, design=design)
# Rename a design column - design is automatically updated
sample_renamed = sample.wrangling.rename_columns({"stratum_var": "region"})
print(f"Original stratum: {sample.design.stratum}")
print(f"After rename: {sample_renamed.design.stratum}")Original stratum: region
After rename: region
Variable Labels
Apply descriptive labels to variables (useful for documentation and export):
df = pl.DataFrame({"q1": [1, 2, 1, 2], "q2": [1, 1, 2, 2]})
s = Sample(df)
s_labeled = s.wrangling.apply_labels(
cols=["q1", "q2"],
labels=["Satisfaction with service", "Would recommend"],
categories={1: "Yes", 2: "No"}
)
# Access labels
rprint(s_labeled.labels){'q1': {1: 'Yes', 2: 'No'}, 'q2': {1: 'Yes', 2: 'No'}}
Quick Reference
Wrangling Methods
| Task | Method | Key Parameters |
|---|---|---|
| Column Names | ||
| Standardize names | clean_names() |
case_style, letter_case, remove |
| Rename columns | rename_columns() |
renames dict |
| Keep columns | keep_columns() / select() |
columns, force |
| Drop columns | remove_columns() / drop() |
columns, force |
| Add row index | with_row_index() |
name, offset |
| Values | ||
| Recode categories | recode() |
cols, recodes, into, replace |
| Bin continuous | categorize() |
col, bins, labels, right |
| Top code | top_code() |
top_codes, replace, into |
| Bottom code | bottom_code() |
bottom_codes, replace, into |
| Clamp both | bottom_and_top_code() |
bottom_and_top_codes |
| Change dtype | cast() |
cols, dtype, strict |
| Fill nulls | fill_null() |
cols, value, strategy |
| Rows | ||
| Filter rows | filter_records() |
where, negate |
| Sort rows | sort() |
by, descending, nulls_last |
| Deduplicate | distinct() |
cols, keep, maintain_order |
| Create/Transform | ||
| New columns | mutate() |
specs, inplace |
| Metadata | ||
| Add labels | apply_labels() |
cols, labels, categories |
Expression Methods
| Category | Methods |
|---|---|
| Arithmetic | +, -, *, /, //, %, **, .pow() |
| Math | .abs(), .sqrt(), .log(), .log10(), .exp(), .round(), .floor(), .ceil(), .clip(), .sign() |
| Comparison | <, <=, >, >=, ==, != |
| Boolean | &, \|, ~, ^ |
| Membership | .is_in(), .is_not_in(), .between() |
| Nulls | .is_null(), .is_not_null(), .fill_null(), .fill_nan(), .drop_nulls() |
| Strings | .lower(), .upper(), .strip(), .contains(), .startswith(), .endswith(), .replace(), .replace_all(), .str_len(), .split() |
| Casting | .cast(), .to_float(), .to_int(), .to_str(), .to_bool() |
| Aggregation | .sum(), .mean(), .median(), .min(), .max(), .std(), .var(), .count(), .n_unique(), .first(), .last() |
| Window | .over(), .cum_sum(), .cum_max(), .cum_min(), .shift(), .diff(), .pct_change(), .rank(), .rolling_mean() |
| Naming | .alias() |
Helper Functions
| Function | Description |
|---|---|
svy.col("name") |
Reference a column |
svy.cols("a", "b") |
Reference multiple columns |
svy.lit(value) |
Create a literal value |
svy.when(cond).then(val).otherwise(default) |
Conditional logic |
svy.coalesce(expr1, expr2, ...) |
First non-null value |
svy.concat_str(expr1, expr2, ...) |
Concatenate strings |
svy.sum_horizontal(...) |
Row-wise sum |
svy.min_horizontal(...) |
Row-wise minimum |
svy.max_horizontal(...) |
Row-wise maximum |
svy.all_horizontal(...) |
Row-wise AND |
svy.any_horizontal(...) |
Row-wise OR |
Next Steps
Now that your data is clean and organized, learn how to plan your survey.
Master the basics?
Continue to Survey Planning →