Quick Tour: The Sample Object

Learn the svy Sample object - your central interface for survey data exploration, filtering, and summaries before analysis.

A 5-minute introduction to Sample, the core object you’ll use throughout these tutorials.

What is Sample?

Sample wraps your survey data (a Polars DataFrame) with design information, providing a unified interface for data exploration, wrangling, weighting, and estimation.

Think of Sample as:

  • Your survey dataset + design metadata
  • A gateway to all svy functionality
  • Immutable by default (transformations return new Sample objects)

import polars as pl
import svy

# Create sample data
df = pl.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "region": ["North", "South", "North", "East", "South"],
    "age": [22, 47, 35, 61, 29],
    "income": [45000, 62000, 51000, 78000, 43000],
    "weight": [1.0, 1.2, 0.9, 1.1, 0.8],
})

# Define survey design
design = svy.Design(wgt="weight", stratum="region")

# Create Sample object
sample = svy.Sample(df, design=design)

print(sample)
╭─────────────────────────── Sample ────────────────────────────╮
 Survey Data:                                                  
   Number of rows: 5                                           
   Number of columns: 7                                        
   Number of strata: 3                                         
   Number of PSUs: None                                        
                                                               
 Survey Design:                                                
                                                               
    Field               Value                                  
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                          
    Row index           svy_row_index                          
    Stratum             region                                 
    PSU                 None                                   
    SSU                 None                                   
    Weight              weight                                 
    With replacement    False                                  
    Prob                None                                   
    Hit                 None                                   
    MOS                 None                                   
    Population size     None                                   
    Replicate weights   None                                   
                                                               
╰───────────────────────────────────────────────────────────────╯

Quick Data Inspection

Preview Data

# First 3 rows
sample.show_data(how="head", n=3)

# Specific columns only
sample.show_data(columns=["id", "region", "age"], how="head", n=3)

# Last 2 rows, sorted by age in descending order
sample.show_data(how="tail", n=2, sort_by="age", descending=True)

# Random sample (reproducible with seed)
sample.show_data(how="sample", n=3, rstate=42)
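shape: (3, 6)
┌───────────────┬─────┬────────┬─────┬────────┬────────┐
│ svy_row_index ┆ id  ┆ region ┆ age ┆ income ┆ weight │
│ ---           ┆ --- ┆ ---    ┆ --- ┆ ---    ┆ ---    │
│ u32           ┆ i64 ┆ str    ┆ i64 ┆ i64    ┆ f64    │
╞═══════════════╪═════╪════════╪═════╪════════╪════════╡
│ 2             ┆ 3   ┆ North  ┆ 35  ┆ 51000  ┆ 0.9    │
│ 1             ┆ 2   ┆ South  ┆ 47  ┆ 62000  ┆ 1.2    │
│ 4             ┆ 5   ┆ South  ┆ 29  ┆ 43000  ┆ 0.8    │
└───────────────┴─────┴────────┴─────┴────────┴────────┘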
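
The idea behind a seeded random preview can be sketched in plain Python. This is not how svy draws its sample: the `preview_sample` helper and the list-of-dicts data are hypothetical, and the rows selected here will generally differ from the svy output above. It only illustrates that a fixed seed makes the draw repeatable.

```python
import random

# Toy rows mirroring the tutorial data (plain dicts, not a svy Sample).
rows = [
    {"id": 1, "region": "North", "age": 22},
    {"id": 2, "region": "South", "age": 47},
    {"id": 3, "region": "North", "age": 35},
    {"id": 4, "region": "East", "age": 61},
    {"id": 5, "region": "South", "age": 29},
]


def preview_sample(rows, n, seed):
    """Hypothetical helper: draw n rows without replacement, reproducibly."""
    rng = random.Random(seed)  # private generator, global state is untouched
    return rng.sample(rows, n)


first = preview_sample(rows, n=3, seed=42)
second = preview_sample(rows, n=3, seed=42)

print(first == second)  # True: the same seed yields the same rows
```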

Filter Records

# Filter by values (dictionary syntax)
sample.show_records(
    where={"region": ["North", "East"]},
    columns=["id", "region", "age"]
)

# Filter with expressions
from svy.core.expr import col

sample.show_records(
    where=[col("age") > 30, col("region") == "South"],
    order_by="income",
    descending=True
)
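shape: (1, 6)
┌───────────────┬─────┬────────┬─────┬────────┬────────┐
│ svy_row_index ┆ id  ┆ region ┆ age ┆ income ┆ weight │
│ ---           ┆ --- ┆ ---    ┆ --- ┆ ---    ┆ ---    │
│ u32           ┆ i64 ┆ str    ┆ i64 ┆ i64    ┆ f64    │
╞═══════════════╪═════╪════════╪═════╪════════╪════════╡
│ 1             ┆ 2   ┆ South  ┆ 47  ┆ 62000  ┆ 1.2    │
└───────────────┴─────┴────────┴─────┴────────┴────────┘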
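
The two filter styles can be read as follows, sketched here in plain Python rather than with svy itself. The sketch assumes that a list value in the dictionary form means "matches any of these values", and that a list of expressions is combined with a logical AND, which is consistent with the single South row returned above. The helpers `matches_dict` and `matches_all` are hypothetical names, not part of svy.

```python
# Toy rows mirroring the tutorial data (plain dicts, not a svy Sample).
rows = [
    {"id": 1, "region": "North", "age": 22, "income": 45000},
    {"id": 2, "region": "South", "age": 47, "income": 62000},
    {"id": 3, "region": "North", "age": 35, "income": 51000},
    {"id": 4, "region": "East", "age": 61, "income": 78000},
    {"id": 5, "region": "South", "age": 29, "income": 43000},
]


def matches_dict(row, where):
    """Dictionary form (assumed): every key must match one of its allowed values."""
    for key, allowed in where.items():
        values = allowed if isinstance(allowed, (list, tuple, set)) else [allowed]
        if row[key] not in values:
            return False
    return True


def matches_all(row, predicates):
    """Expression form (assumed): every predicate must hold (logical AND)."""
    return all(pred(row) for pred in predicates)


north_east_ids = [r["id"] for r in rows if matches_dict(r, {"region": ["North", "East"]})]

older_south = [
    r for r in rows
    if matches_all(r, [lambda r: r["age"] > 30, lambda r: r["region"] == "South"])
]
older_south.sort(key=lambda r: r["income"], reverse=True)  # order by income, descending

print(north_east_ids)                  # [1, 3, 4]
print([r["id"] for r in older_south])  # [2]
```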

Sample Properties

Access key information about your sample:

print(f"Number of records: {sample.n_records}\n")
print(f"Number of columns: {sample.n_columns}\n")
print(f"Number of strata: {sample.n_strata}\n")
print(f"Number of psus: {sample.n_psus}\n")

print(f"Strata: {sample.strata}")

# Access underlying data (defensive copy)
df_copy = sample.data
print(df_copy.head())

# Access design
design_copy = sample.design
print(design_copy)
Number of records: 5

Number of columns: 7

Number of strata: 3

Number of psus: 0

Strata: shape: (3, 1)
┌────────┐
│ region │
│ ---    │
│ str    │
╞════════╡
│ East   │
│ North  │
│ South  │
└────────┘
shape: (5, 6)
┌───────────────┬─────┬────────┬─────┬────────┬────────┐
│ svy_row_index ┆ id  ┆ region ┆ age ┆ income ┆ weight │
│ ---           ┆ --- ┆ ---    ┆ --- ┆ ---    ┆ ---    │
│ u32           ┆ i64 ┆ str    ┆ i64 ┆ i64    ┆ f64    │
╞═══════════════╪═════╪════════╪═════╪════════╪════════╡
│ 0             ┆ 1   ┆ North  ┆ 22  ┆ 45000  ┆ 1.0    │
│ 1             ┆ 2   ┆ South  ┆ 47  ┆ 62000  ┆ 1.2    │
│ 2             ┆ 3   ┆ North  ┆ 35  ┆ 51000  ┆ 0.9    │
│ 3             ┆ 4   ┆ East   ┆ 61  ┆ 78000  ┆ 1.1    │
│ 4             ┆ 5   ┆ South  ┆ 29  ┆ 43000  ┆ 0.8    │
└───────────────┴─────┴────────┴─────┴────────┴────────┘
╭───────────── Design ──────────────╮
 Field               Value         
 ───────────────────────────────── 
 Row index           svy_row_index 
 Stratum             region        
 PSU                 None          
 SSU                 None          
 Weight              weight        
 With replacement    False         
 Prob                None          
 Hit                 None          
 MOS                 None          
 Population size     None          
 Replicate weights   None          
╰───────────────────────────────────╯

Note: sample.data and sample.design return defensive copies, so you can inspect or modify what they return without affecting the original Sample.
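
A defensive copy means the accessor hands back a duplicate rather than the internal object. The sketch below shows the general pattern in plain Python. The `Holder` class is a hypothetical stand-in, not svy's implementation, and svy may copy its data differently.

```python
import copy


class Holder:
    """Hypothetical stand-in for an object that exposes defensive copies."""

    def __init__(self, records):
        # Copy on the way in so later changes by the caller do not leak inside.
        self._records = copy.deepcopy(records)

    @property
    def data(self):
        # Copy on the way out so callers cannot mutate internal state.
        return copy.deepcopy(self._records)


holder = Holder([{"id": 1, "age": 22}])

snapshot = holder.data
snapshot[0]["age"] = 99  # modifies only the copy

print(holder.data[0]["age"])  # 22: the internal record is unchanged
```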

Data Summaries

Unweighted Summary

# Quick statistical summary
summary = sample.describe()
print(summary)
╭────────────────────────────────────── Describe ──────────────────────────────────────╮
 Columns: 5                                                                           
 Weighted: False                                                                      
 drop_nulls: True                                                                     
 percentiles: (0.05, 0.25, 0.5, 0.75, 0.95)                                           
 generated_at: 2026-01-08T21:53:13+00:00                                              
                                                                                      
 Numeric                                                                              
                                                                                      
   name    type           mis   mea       std   min    p25   p50    p75   max     sum 
   ────────────────────────────────────────────────────────────────────────────────── 
   id      Discrete         0     3   1.58114     1      2     3      4     5      15 
   age     Discrete         0   38.   15.4337    22     29    35     47    61     194 
   inco…   Discrete         0   558   14446.5   430   4500   510   6200   780   27900 
   weig…   Continuo…        0     1   0.15811   0.8    0.9     1    1.1   1.2       5 
                                                                                      
 String                                                                               
                                                                                      
   name     n   miss   unique   shortest   longest                                    
   ───────────────────────────────────────────────                                    
   region   5      0        3          4         5                                    
╰──────────────────────────────────────────────────────────────────────────────────────╯

Output includes:

  • Count, missing values
  • Mean, standard deviation
  • Min, quartiles, max, sum
  • For string columns: number of unique values, and shortest and longest string lengths

Weighted Summary

# Uses design weights if available
weighted_summary = sample.describe(weighted=True)
print(weighted_summary)
╭────────────────────────────────────── Describe ──────────────────────────────────────╮
 Columns: 5                                                                           
 Weighted: True (weight_col=weight)                                                   
 drop_nulls: True                                                                     
 percentiles: (0.05, 0.25, 0.5, 0.75, 0.95)                                           
 generated_at: 2026-01-08T21:53:13+00:00                                              
                                                                                      
 Numeric                                                                              
                                                                                      
   name    type           mis   mea       std   min    p25   p50    p75   max     sum 
   ────────────────────────────────────────────────────────────────────────────────── 
   id      Discrete         0   2.9   1.58114     1      2     3      4     5    14.5 
   age     Discrete         0   40.   15.4337    22     29    35     47    61   200.2 
   inco…   Discrete         0   571   14446.5   430   4500   510   6200   780   28550 
   weig…   Continuo…        0   1.0   0.15811   0.8    0.9     1    1.1   1.2     5.1 
                                                                                      
 String                                                                               
                                                                                      
   name     n   miss   unique   shortest   longest                                    
   ───────────────────────────────────────────────                                    
   region   5      0        3          4         5                                    
╰──────────────────────────────────────────────────────────────────────────────────────╯

Weighted summaries apply the design weights, so means and sums estimate population values rather than describing only the sampled rows.
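
The weighted figures for age can be checked by hand. The sketch below assumes the usual definitions, a weighted sum of Σ w·x and a weighted mean of Σ w·x / Σ w. The source does not spell out svy's formulas, but these definitions reproduce the weighted sum of 200.2 shown for age in the table above.

```python
import math

# Values copied from the tutorial's DataFrame.
ages = [22, 47, 35, 61, 29]
weights = [1.0, 1.2, 0.9, 1.1, 0.8]

# Unweighted mean: every row counts equally.
unweighted_mean = sum(ages) / len(ages)

# Weighted sum and mean: each row counts in proportion to its weight.
weighted_sum = sum(w * x for w, x in zip(weights, ages))
weighted_mean = weighted_sum / sum(weights)

print(round(unweighted_mean, 2))  # 38.8
print(round(weighted_sum, 2))     # 200.2
print(round(weighted_mean, 2))    # 40.04
```

The weighted mean is higher than the unweighted one because the two oldest respondents (ages 47 and 61) carry the largest weights.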

Sample is Immutable

Transformations return new Sample objects:

# Keep a reference to the original sample
original = sample

# Wrangling creates new sample
cleaned = sample.wrangling.clean_names()

# Original still exists
print(f"Original columns: {original.data.columns}")
print(f"Cleaned columns: {cleaned.data.columns}")

# Chain operations
result = (sample
    .wrangling.clean_names()
    .wrangling.recode("region", {"North": ["North", "East"]})
)
Original columns: ['svy_row_index', 'id', 'region', 'age', 'income', 'weight']
Cleaned columns: ['svy_row_index', 'id', 'region', 'age', 'income', 'weight']

This design prevents accidental data corruption and makes workflows easier to debug.
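
The same pattern can be illustrated with a frozen dataclass in plain Python. This is a sketch of the general idea only: the `Survey` class and its `drop_column` method are hypothetical, and svy's Sample may enforce immutability differently.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Survey:
    """Hypothetical immutable container, not svy's Sample."""

    columns: tuple
    weight_col: str

    def drop_column(self, name):
        # Build a new object instead of modifying this one.
        return replace(self, columns=tuple(c for c in self.columns if c != name))


original = Survey(columns=("id", "region", "age"), weight_col="weight")
trimmed = original.drop_column("age")

print(original.columns)  # ('id', 'region', 'age')
print(trimmed.columns)   # ('id', 'region')

try:
    original.columns = ()  # direct assignment is rejected on a frozen dataclass
except FrozenInstanceError:
    print("assignment blocked")
```

Because every step returns a fresh object, an intermediate result can be kept and inspected later without worrying that a downstream step altered it.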

Key Takeaways

Sample wraps data + design - Single object for all operations

Inspect easily - show_data(), show_records(), describe()

Immutable - Transformations return new objects

Gateway to functionality - Access .wrangling, .estimation, .glm

Defensive copies - .data and .design are safe to inspect

Next Steps

Now that you understand the Sample object, learn how to clean and transform your data: clean names, recode values, bin variables, and create new columns.

Mastered the basics?
Continue to Data Wrangling →