# SPSS File I/O

Read and write .sav, .zsav, and .por files with full metadata support.

The svy_io library provides comprehensive support for reading and writing SPSS files (.sav, .zsav, and .por formats) through a clean, Pythonic API backed by the ReadStat C library.

## Installation

```shell
pip install svy-io
```

## Quick Start

### Reading SPSS Files

```python
from svy_io.spss import read_sav, read_por, read_spss

# Read a .sav file
df, meta = read_sav("survey.sav")

# Read a compressed .zsav file (handled automatically)
df_z, meta_z = read_sav("compressed.zsav")

# Read a portable .por file
df_por, meta_por = read_por("transport.por")

# Auto-detect the format from the file extension
df_auto, meta_auto = read_spss("data.sav")

# df is a Polars DataFrame
print(df.head())

# meta contains file metadata
print(meta.keys())
# dict_keys(['file_label', 'vars', 'value_labels',
#            'user_missing', 'n_rows'])
```

### Writing SPSS Files
```python
from svy_io.spss import write_sav
import polars as pl

df = pl.DataFrame({
    "subject_id": [1, 2, 3, 4, 5],
    "age": [25, 30, 35, 40, 45],
    "treatment": ["A", "B", "A", "B", "A"],
    "response": [85.5, 92.0, 78.3, 88.9, 95.2]
})

# Write with variable labels
write_sav(
    df,
    "clinical_trial.sav",
    var_labels={
        "subject_id": "Subject ID",
        "age": "Age in years",
        "treatment": "Treatment group",
        "response": "Response score"
    }
)
```

## API Reference
### read_sav()

Read an SPSS .sav or .zsav file into a Polars DataFrame. Compressed files are handled automatically.

**Signature:**

```python
def read_sav(
    data_path: str | os.PathLike | io.BufferedIOBase,
    *,
    encoding: str | None = None,
    user_na: bool = False,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0,
    coerce_temporals: bool = True,
    zap_empty_str: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]
```

**Parameters:**
| Parameter | Type | Default | Description |
|---|---|---|---|
| `data_path` | `str \| PathLike \| BufferedIOBase` | required | Path to the SPSS file, or a file-like object |
| `encoding` | `str \| None` | `None` | Character encoding (e.g., `"latin1"`, `"utf-8"`) |
| `user_na` | `bool` | `False` | Preserve user-defined missing values as data |
| `cols_skip` | `list[str] \| None` | `None` | Column names to skip during import |
| `n_max` | `int \| None` | `None` | Maximum number of rows to read |
| `rows_skip` | `int` | `0` | Number of rows to skip from the beginning |
| `coerce_temporals` | `bool` | `True` | Convert SPSS date/datetime values to Python types |
| `zap_empty_str` | `bool` | `False` | Convert empty strings to `None` |
**Returns:**

A tuple `(df, meta)` where:

- `df` (`pl.DataFrame`): The data, with normalized column names (lowercase, underscores)
- `meta` (`dict`): A metadata dictionary containing:

| Key | Type | Description |
|---|---|---|
| `file_label` | `str \| None` | Dataset-level label/description |
| `vars` | `list` | Variable metadata (name, label, format, user_missing) |
| `value_labels` | `list` | Value label sets (categorical mappings) |
| `user_missing` | `list` | User-defined missing value specifications |
| `n_rows` | `int` | Number of rows read |
| `labelled_columns` | `dict` | `LabelledSPSS` objects (only when `user_na=True`) |
**Example:**

```python
# Basic read
df, meta = read_sav("survey.sav")

# Read with an explicit encoding
df, meta = read_sav("survey.sav", encoding="latin1")

# Read with options
df, meta = read_sav(
    "survey.sav",
    cols_skip=["temp_var", "id"],
    n_max=1000,
    rows_skip=10,
    coerce_temporals=True
)

# Preserve user-defined missing values
df, meta = read_sav("data.sav", user_na=True)
labelled_cols = meta.get('labelled_columns', {})

# Access metadata
print(f"Dataset: {meta['file_label']}")
for var in meta['vars']:
    print(f"  {var['name']}: {var.get('label', 'No label')}")
```

### read_por()
Read an SPSS Portable (.por) file into a Polars DataFrame.
**Signature:**

```python
def read_por(
    data_path: str | os.PathLike | io.BufferedIOBase,
    *,
    user_na: bool = False,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0,
    coerce_temporals: bool = False,
    zap_empty_str: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]
```

**Parameters:**
| Parameter | Type | Default | Description |
|---|---|---|---|
| `data_path` | `str \| PathLike \| BufferedIOBase` | required | Path to the POR file, or a file-like object |
| `user_na` | `bool` | `False` | Preserve user-defined missing values as data |
| `cols_skip` | `list[str] \| None` | `None` | Column names to skip during import |
| `n_max` | `int \| None` | `None` | Maximum number of rows to read |
| `rows_skip` | `int` | `0` | Number of rows to skip from the beginning |
| `coerce_temporals` | `bool` | `False` | Convert SPSS date/datetime formats |
| `zap_empty_str` | `bool` | `False` | Convert empty strings to `None` |
**Returns:**

- `df` (`pl.DataFrame`): The data
- `meta` (`dict`): Metadata dictionary (same structure as `read_sav`)

**Example:**

```python
# Read a portable file
df, meta = read_por("legacy.por")

# Read with temporal coercion
df, meta = read_por("data.por", coerce_temporals=True)
```

### read_spss()
Auto-dispatches to `read_sav()` or `read_por()` based on the file extension.

**Signature:**

```python
def read_spss(
    data_path: str | os.PathLike,
    *,
    encoding: str | None = None,
    user_na: bool = False,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0,
    coerce_temporals: bool = False,
    zap_empty_str: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]
```

**Note:** `data_path` must be a filesystem path (not a file-like object).
**Example:**

```python
# Automatically detects .sav or .por format
df, meta = read_spss("mydata.sav")
df, meta = read_spss("olddata.por")
df, meta = read_spss("compressed.zsav")
```

### write_sav()
Write a Polars DataFrame to an SPSS .sav file, with optional compression, variable labels, value labels, and user-defined missing values.

**Signature:**

```python
def write_sav(
    df: pl.DataFrame,
    path: str | Path,
    *,
    compress: str = "byte",
    adjust_tz: bool = True,
    var_labels: dict[str, str] | None = None,
    user_missing: list[dict[str, Any]] | None = None,
    value_labels: list[dict[str, Any]] | None = None
) -> pl.DataFrame
```

**Parameters:**
| Parameter | Type | Default | Description |
|---|---|---|---|
| `df` | `pl.DataFrame` | required | DataFrame to write |
| `path` | `str \| Path` | required | Output file path |
| `compress` | `str` | `"byte"` | Compression: `"byte"`, `"none"`, or `"zsav"` |
| `adjust_tz` | `bool` | `True` | Adjust timezone for datetime columns |
| `var_labels` | `dict[str, str] \| None` | `None` | Variable labels, `{"col": "description"}` |
| `user_missing` | `list[dict] \| None` | `None` | User-defined missing specifications |
| `value_labels` | `list[dict] \| None` | `None` | Value label definitions |
**User Missing Format:**

```python
user_missing = [
    {"col": "income", "values": [-99, -98]},             # Specific values
    {"col": "age", "range": (0, 10)},                    # Range of values
    {"col": "score", "values": [999], "range": (-1, 0)}  # Both
]
```

**Value Labels Format:**

```python
value_labels = [
    {"col": "gender", "labels": {"1": "Male", "2": "Female", "3": "Other"}},
    {"col": "treatment", "labels": {"1": "Control", "2": "Treatment A"}}
]
```

**Returns:**

- `df` (`pl.DataFrame`): The original input DataFrame (unchanged)

**Raises:**

- `ValueError`: Invalid column names (duplicates, reserved words, invalid characters)
- `RuntimeError`: If the underlying writer encounters an error
**Example:**

```python
import polars as pl
from svy_io.spss import write_sav

df = pl.DataFrame({
    "subject_id": [1, 2, 3, 4, 5],
    "age": [25, 30, 35, 40, 45],
    "gender": [1, 2, 1, 2, 3],
    "income": [50000, 75000, -99, 60000, 85000]
})

# Complete example with all features
write_sav(
    df,
    "complete.sav",
    compress="byte",
    var_labels={
        "subject_id": "Unique subject identifier",
        "age": "Age at enrollment (years)",
        "gender": "Self-reported gender",
        "income": "Household income (USD)"
    },
    value_labels=[
        {"col": "gender", "labels": {"1": "Male", "2": "Female", "3": "Other"}}
    ],
    user_missing=[
        {"col": "income", "values": [-99, -98]}
    ]
)
```

## Metadata Helper Functions
### get_column_labels()

Extract variable labels from metadata.

```python
from svy_io.spss import read_sav, get_column_labels

df, meta = read_sav("survey.sav")
labels = get_column_labels(meta)
# {'age': 'Age in years', 'income': 'Annual income', ...}
```
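For intuition, this helper can be thought of as a small projection over the documented `meta["vars"]` structure. The sketch below is a hypothetical pure-Python equivalent for illustration, not the library's actual implementation:

```python
# Hypothetical equivalent of get_column_labels, built only from the
# documented meta["vars"] structure (a list of dicts with "name"/"label").
def column_labels(meta):
    return {v["name"]: v["label"] for v in meta["vars"] if v.get("label")}

meta = {"vars": [{"name": "age", "label": "Age in years"},
                 {"name": "temp_id"}]}  # unlabelled variables are skipped
print(column_labels(meta))  # {'age': 'Age in years'}
```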
### get_value_labels_for_column()

Get value labels for a specific column.

```python
from svy_io.spss import read_sav, get_value_labels_for_column

df, meta = read_sav("survey.sav")
gender_labels = get_value_labels_for_column(meta, "gender")
# {'1': 'Male', '2': 'Female', '3': 'Other'}
```

### get_user_missing_for_column()
Get user-defined missing value specifications for a column.

```python
from svy_io.spss import read_sav, get_user_missing_for_column

df, meta = read_sav("survey.sav")
income_missing = get_user_missing_for_column(meta, "income")
# {'values': [-99, -98], 'range': None}
```

## Working with User-Defined Missing Values

SPSS supports user-defined missing values, which are distinct from system missing (null). They let researchers distinguish between different types of missing data (e.g., "refused to answer" vs. "not applicable").
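Conceptually, the default `user_na=False` behaviour recodes any value that matches a column's user-missing specification to null. A minimal sketch of that check, assuming the `values`/`range` spec shape documented above:

```python
def is_user_missing(value, na_values=(), na_range=None):
    """True if value matches a user-defined missing specification."""
    if value in na_values:
        return True
    if na_range is not None and na_range[0] <= value <= na_range[1]:
        return True
    return False

print(is_user_missing(-99, na_values=[-99, -98]))  # True
print(is_user_missing(5, na_range=(0, 10)))        # True
print(is_user_missing(42, na_values=[-99, -98]))   # False
```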
### Reading Files with User-Defined Missing

```python
# Default: convert user-defined missing to None
df, meta = read_sav("survey.sav", user_na=False)

# Preserve user-defined missing as data
df, meta = read_sav("survey.sav", user_na=True)
labelled_cols = meta.get('labelled_columns', {})
if 'income' in labelled_cols:
    labelled_income = labelled_cols['income']
    print(labelled_income.na_values)  # [-99, -98]
    print(labelled_income.na_range)   # None or (low, high)
```

### Writing Files with User-Defined Missing
```python
write_sav(
    df,
    "survey.sav",
    user_missing=[
        {"col": "q1", "values": [-99]},            # Specific value
        {"col": "q2", "values": [-99, -98, -97]},  # Multiple values
        {"col": "age", "range": (100, 999)}        # Range
    ]
)
```

## Column Name Normalization
Column names read from SPSS files are automatically normalized to Python-friendly names:

- Whitespace is stripped
- Names are lowercased
- Dots, spaces, and dashes are replaced with underscores
- Runs of underscores are collapsed to a single underscore

```python
# Original SPSS names: "Income Level", "AGE.YEARS", "Q-1"
df, meta = read_sav("survey.sav")
print(df.columns)
# ['income_level', 'age_years', 'q_1']
```
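The rules above can be sketched as a small standalone normalizer (an illustration of the documented behaviour, not the library's own code):

```python
import re

def normalize_name(name: str) -> str:
    """Strip whitespace, lowercase, replace dots/spaces/dashes with
    underscores, and collapse repeated underscores."""
    name = name.strip().lower()
    name = re.sub(r"[.\s-]+", "_", name)
    return re.sub(r"_+", "_", name)

print(normalize_name("Income Level"))  # income_level
print(normalize_name("AGE.YEARS"))     # age_years
print(normalize_name("Q-1"))           # q_1
```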
### Variable Name Validation

When writing SPSS files, variable names must follow SPSS rules:

- Start with a letter
- Contain only letters, numbers, and underscores
- At most 64 bytes (UTF-8 encoded)
- Not a reserved word (ALL, AND, BY, EQ, GE, GT, LE, LT, NE, NOT, OR, TO, WITH)
- Unique, compared case-insensitively
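These rules can be expressed as a standalone checker; the sketch below is for illustration only (`write_sav` performs its own validation internally):

```python
import re

RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE", "LT",
            "NE", "NOT", "OR", "TO", "WITH"}

def name_violations(names):
    """Return a list of SPSS naming-rule violations for the given names."""
    errors, seen = [], set()
    for name in names:
        if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name):
            errors.append(f"{name!r}: must start with a letter and contain "
                          "only letters, numbers, and underscores")
        elif len(name.encode("utf-8")) > 64:
            errors.append(f"{name!r}: longer than 64 bytes")
        elif name.upper() in RESERVED:
            errors.append(f"{name!r}: reserved word")
        elif name.lower() in seen:
            errors.append(f"{name!r}: duplicate (case-insensitive)")
        seen.add(name.lower())
    return errors

print(name_violations(["age", "income_2024"]))         # []
print(name_violations(["1age", "ALL", "Age", "age"]))  # three violations
```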
```python
# Valid names
df_valid = pl.DataFrame({
    "age": [25, 30],
    "income_2024": [50000, 60000],
    "response_A": [1, 2]
})
write_sav(df_valid, "valid.sav")  # Works

# Invalid names raise ValueError
df_invalid = pl.DataFrame({
    "1age": [25, 30],           # Can't start with a number
    "income$": [50000, 60000],  # Invalid character
    "ALL": [1, 2]               # Reserved word
})
write_sav(df_invalid, "invalid.sav")  # Raises ValueError
```

## Compression Options
| Option | Description | File Size | Compatibility |
|---|---|---|---|
| `"none"` | No compression | Largest | All versions |
| `"byte"` | Byte compression (default) | Medium | All versions |
| `"zsav"` | ZLIB compression | Smallest | SPSS 21+ |

```python
write_sav(df, "uncompressed.sav", compress="none")
write_sav(df, "compressed.sav", compress="byte")
write_sav(df, "compressed.zsav", compress="zsav")
```

## Temporal Data Handling
SPSS date and datetime formats are converted automatically:

```python
from datetime import datetime, date
import polars as pl
from svy_io.spss import write_sav, read_sav

df = pl.DataFrame({
    "id": [1, 2, 3],
    "birth_date": [date(1990, 1, 1), date(1985, 5, 15), date(1992, 12, 31)],
    "visit_datetime": [
        datetime(2024, 1, 15, 10, 30),
        datetime(2024, 2, 20, 14, 45),
        datetime(2024, 3, 10, 9, 0)
    ]
})

# Write with automatic timezone adjustment
write_sav(df, "dates.sav", adjust_tz=True)

# Read with automatic temporal coercion
df2, meta = read_sav("dates.sav", coerce_temporals=True)
```
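For background, SPSS represents dates and datetimes internally as seconds counted from the Gregorian calendar epoch, 1582-10-14. A sketch of the conversion that `coerce_temporals` corresponds to (illustrative only; the library handles this for you):

```python
from datetime import datetime, timedelta

# SPSS stores temporal values as seconds since this epoch.
SPSS_EPOCH = datetime(1582, 10, 14)

def spss_seconds_to_datetime(seconds: float) -> datetime:
    return SPSS_EPOCH + timedelta(seconds=seconds)

def datetime_to_spss_seconds(dt: datetime) -> float:
    return (dt - SPSS_EPOCH).total_seconds()

# Round-trip check
dt = datetime(2024, 1, 15, 10, 30)
assert spss_seconds_to_datetime(datetime_to_spss_seconds(dt)) == dt
```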
## File-Like Object Support

Both `read_sav()` and `read_por()` accept file-like objects:

```python
from io import BytesIO
from svy_io.spss import read_sav

# Read from bytes in memory
with open("survey.sav", "rb") as f:
    data = f.read()
bio = BytesIO(data)
df, meta = read_sav(bio)

# Useful for cloud storage, HTTP streams, etc.
import requests
response = requests.get("https://example.com/data.sav")
df, meta = read_sav(BytesIO(response.content))
```

## Performance Tips
```python
# 1. Skip unnecessary columns
df, meta = read_sav("large_survey.sav", cols_skip=["verbatim_comments"])

# 2. Limit rows for exploration
df, meta = read_sav("large_survey.sav", n_max=10000)

# 3. Disable temporal coercion if not needed
df, meta = read_sav("large_survey.sav", coerce_temporals=False)

# 4. Use compression when writing large files
write_sav(df, "output.zsav", compress="zsav")
```

## Common Patterns
### Converting SPSS to Other Formats

```python
from svy_io.spss import read_sav

df, meta = read_sav("survey.sav")

# Export to CSV
df.write_csv("survey.csv")

# Export to Parquet (preserves types better)
df.write_parquet("survey.parquet")

# Export to Excel with labels as headers
labels = {v['name']: v.get('label', v['name']) for v in meta['vars']}
df.rename(labels).write_excel("survey.xlsx")
```

### Preserving Metadata Across Transformations
```python
import polars as pl
from svy_io.spss import read_sav, write_sav, get_column_labels

# Read with metadata
df, meta = read_sav("input.sav")

# Transform the data
df_clean = df.filter(pl.col("age") > 18)

# Write with the original labels preserved
write_sav(df_clean, "output.sav", var_labels=get_column_labels(meta))
```

## Differences from Stata I/O
| Feature | SPSS | Stata |
|---|---|---|
| File formats | `.sav`, `.zsav`, `.por` | `.dta` |
| Compression | `"byte"`, `"none"`, `"zsav"` | Version-based |
| User-defined missing | Complex (values + ranges) | Tagged missing (`.a`, `.b`) |
| Variable names | Case-insensitive, 64 bytes | Case-sensitive, 32 chars |
| Column normalization | Auto-normalized | Preserved |
| Default temporal coercion | `True` for `.sav` | `False` |
## See Also

- Stata File I/O — Read and write Stata `.dta` files
- SAS File I/O — Read and write SAS `.sas7bdat` files
- ReadStat — Upstream C library
- Polars Documentation — DataFrame operations