SPSS File I/O

Read and write SPSS .sav, .zsav, and .por files in Python with a fast, Pythonic API and full support for variable labels, value labels, and user-defined missing values.

The svy_io library provides comprehensive support for reading and writing SPSS files (.sav, .zsav, and .por formats) through a clean, Pythonic API backed by the ReadStat C library.

Installation

pip install svy-io

Quick Start

Reading SPSS Files

from svy_io.spss import read_sav, read_por, read_spss

# Read a .sav file
df, meta = read_sav("survey.sav")

# Read a compressed .zsav file (automatically handled)
df_z, meta_z = read_sav("compressed.zsav")

# Read a portable .por file
df_por, meta_por = read_por("transport.por")

# Auto-detect format based on extension
df_auto, meta_auto = read_spss("data.sav")

# df is a Polars DataFrame
print(df.head())

# meta contains file metadata
print(meta.keys())
# dict_keys(['file_label', 'vars', 'value_labels',
#            'user_missing', 'n_rows'])

Writing SPSS Files

from svy_io.spss import write_sav
import polars as pl

df = pl.DataFrame({
    "subject_id": [1, 2, 3, 4, 5],
    "age": [25, 30, 35, 40, 45],
    "treatment": ["A", "B", "A", "B", "A"],
    "response": [85.5, 92.0, 78.3, 88.9, 95.2]
})

# Write with variable labels
write_sav(
    df,
    "clinical_trial.sav",
    var_labels={
        "subject_id": "Subject ID",
        "age": "Age in years",
        "treatment": "Treatment group",
        "response": "Response score"
    }
)

API Reference

read_sav()

Read an SPSS .sav or .zsav file into a Polars DataFrame. Automatically handles compressed files.

Signature:

def read_sav(
    data_path: str | os.PathLike | io.BufferedIOBase,
    *,
    encoding: str | None = None,
    user_na: bool = False,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0,
    coerce_temporals: bool = True,
    zap_empty_str: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| data_path | str \| PathLike \| BufferedIOBase | required | Path to the SPSS file or a file-like object |
| encoding | str \| None | None | Character encoding (e.g., "latin1", "utf-8") |
| user_na | bool | False | Preserve user-defined missing values as data |
| cols_skip | list[str] \| None | None | Column names to skip during import |
| n_max | int \| None | None | Maximum number of rows to read |
| rows_skip | int | 0 | Number of rows to skip from the beginning |
| coerce_temporals | bool | True | Convert SPSS date/datetime values to Python types |
| zap_empty_str | bool | False | Convert empty strings to None |

Returns:

A tuple of (df, meta) where:

  • df (pl.DataFrame): The data with normalized column names (lowercase, underscores)
  • meta (dict): Metadata dictionary containing:
| Key | Type | Description |
|---|---|---|
| file_label | str \| None | Dataset-level label/description |
| vars | list | Variable metadata (name, label, format, user_missing) |
| value_labels | list | Value label sets (categorical mappings) |
| user_missing | list | User-defined missing value specifications |
| n_rows | int | Number of rows read |
| labelled_columns | dict | LabelledSPSS objects (present only when user_na=True) |

Example:

# Basic read
df, meta = read_sav("survey.sav")

# Read with encoding
df, meta = read_sav("survey.sav", encoding="latin1")

# Read with options
df, meta = read_sav(
    "survey.sav",
    cols_skip=["temp_var", "id"],
    n_max=1000,
    rows_skip=10,
    coerce_temporals=True
)

# Preserve user-defined missing values
df, meta = read_sav("data.sav", user_na=True)
labelled_cols = meta.get('labelled_columns', {})

# Access metadata
print(f"Dataset: {meta['file_label']}")
for var in meta['vars']:
    print(f"  {var['name']}: {var.get('label', 'No label')}")

read_por()

Read an SPSS Portable (.por) file into a Polars DataFrame.

Signature:

def read_por(
    data_path: str | os.PathLike | io.BufferedIOBase,
    *,
    user_na: bool = False,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0,
    coerce_temporals: bool = False,
    zap_empty_str: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| data_path | str \| PathLike \| BufferedIOBase | required | Path to the POR file or a file-like object |
| user_na | bool | False | Preserve user-defined missing values as data |
| cols_skip | list[str] \| None | None | Column names to skip during import |
| n_max | int \| None | None | Maximum number of rows to read |
| rows_skip | int | 0 | Number of rows to skip from the beginning |
| coerce_temporals | bool | False | Convert SPSS date/datetime formats |
| zap_empty_str | bool | False | Convert empty strings to None |

Returns:

  • df (pl.DataFrame): The data
  • meta (dict): Metadata dictionary (same structure as read_sav)

Example:

# Read portable file
df, meta = read_por("legacy.por")

# Read with temporal coercion
df, meta = read_por("data.por", coerce_temporals=True)

read_spss()

Auto-dispatch to read_sav() or read_por() based on file extension.

Signature:

def read_spss(
    data_path: str | os.PathLike,
    *,
    encoding: str | None = None,
    user_na: bool = False,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0,
    coerce_temporals: bool = False,
    zap_empty_str: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]

Note: data_path must be a filesystem path (not a file-like object).

Example:

# Automatically detects .sav or .por format
df, meta = read_spss("mydata.sav")
df, meta = read_spss("olddata.por")
df, meta = read_spss("compressed.zsav")

write_sav()

Write a Polars DataFrame to an SPSS .sav file with optional compression, variable labels, value labels, and user-defined missing values.

Signature:

def write_sav(
    df: pl.DataFrame,
    path: str | Path,
    *,
    compress: str = "byte",
    adjust_tz: bool = True,
    var_labels: dict[str, str] | None = None,
    user_missing: list[dict[str, Any]] | None = None,
    value_labels: list[dict[str, Any]] | None = None
) -> pl.DataFrame

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| df | pl.DataFrame | required | DataFrame to write |
| path | str \| Path | required | Output file path |
| compress | str | "byte" | Compression: "byte", "none", or "zsav" |
| adjust_tz | bool | True | Adjust timezone for datetime columns |
| var_labels | dict[str, str] \| None | None | Variable labels, {"col": "description"} |
| user_missing | list[dict] \| None | None | User-defined missing specifications |
| value_labels | list[dict] \| None | None | Value label definitions |

User Missing Format:

user_missing = [
    {"col": "income", "values": [-99, -98]},       # Specific values
    {"col": "age", "range": (0, 10)},              # Range of values
    {"col": "score", "values": [999], "range": (-1, 0)}  # Both
]

Value Labels Format:

value_labels = [
    {"col": "gender", "labels": {"1": "Male", "2": "Female", "3": "Other"}},
    {"col": "treatment", "labels": {"1": "Control", "2": "Treatment A"}}
]

Returns:

  • df (pl.DataFrame): The original input DataFrame (unchanged)
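
Because write_sav() returns its input unchanged, it can sit inside a transformation chain as a side-effecting step (the file name here is illustrative):

# Save a snapshot mid-pipeline, then keep transforming
df_adults = write_sav(df, "snapshot.sav").filter(pl.col("age") >= 18)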

Raises:

  • ValueError: Invalid column names (duplicates, reserved words, invalid characters)
  • RuntimeError: If the underlying writer encounters an error

Example:

import polars as pl
from svy_io.spss import write_sav

df = pl.DataFrame({
    "subject_id": [1, 2, 3, 4, 5],
    "age": [25, 30, 35, 40, 45],
    "gender": [1, 2, 1, 2, 3],
    "income": [50000, 75000, -99, 60000, 85000]
})

# Complete example with all features
write_sav(
    df,
    "complete.sav",
    compress="byte",
    var_labels={
        "subject_id": "Unique subject identifier",
        "age": "Age at enrollment (years)",
        "gender": "Self-reported gender",
        "income": "Household income (USD)"
    },
    value_labels=[
        {"col": "gender", "labels": {"1": "Male", "2": "Female", "3": "Other"}}
    ],
    user_missing=[
        {"col": "income", "values": [-99, -98]}
    ]
)

Metadata Helper Functions

get_column_labels()

Extract variable labels from metadata.

from svy_io.spss import read_sav, get_column_labels

df, meta = read_sav("survey.sav")
labels = get_column_labels(meta)
# {'age': 'Age in years', 'income': 'Annual income', ...}

get_value_labels_for_column()

Get value labels for a specific column.

from svy_io.spss import read_sav, get_value_labels_for_column

df, meta = read_sav("survey.sav")
gender_labels = get_value_labels_for_column(meta, "gender")
# {'1': 'Male', '2': 'Female', '3': 'Other'}

get_user_missing_for_column()

Get user-defined missing value specifications for a column.

from svy_io.spss import read_sav, get_user_missing_for_column

df, meta = read_sav("survey.sav")
income_missing = get_user_missing_for_column(meta, "income")
# {'values': [-99, -98], 'range': None}

Working with User-Defined Missing Values

SPSS supports user-defined missing values, which are distinct from system missing (null). These allow researchers to distinguish between different types of missing data (e.g., “refused to answer” vs. “not applicable”).

Reading Files with User-Defined Missing

# Default: Convert user-defined missing to None
df, meta = read_sav("survey.sav", user_na=False)

# Preserve user-defined missing as data
df, meta = read_sav("survey.sav", user_na=True)
labelled_cols = meta.get('labelled_columns', {})
if 'income' in labelled_cols:
    labelled_income = labelled_cols['income']
    print(labelled_income.na_values)  # [-99, -98]
    print(labelled_income.na_range)   # None or (low, high)
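
Keeping the codes as data lets you tabulate them before discarding. A short sketch building on the snippet above (assumes income is numeric and labelled_income.na_values holds discrete codes rather than a range):

import polars as pl

# Tabulate the user-missing codes before recoding them away
missing_counts = df.group_by("income").len().filter(
    pl.col("income").is_in(labelled_income.na_values)
)

# Recode user-missing values to null for analysis
df = df.with_columns(
    pl.when(pl.col("income").is_in(labelled_income.na_values))
    .then(None)
    .otherwise(pl.col("income"))
    .alias("income")
)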

Writing Files with User-Defined Missing

write_sav(
    df,
    "survey.sav",
    user_missing=[
        {"col": "q1", "values": [-99]},              # Specific values
        {"col": "q2", "values": [-99, -98, -97]},    # Multiple values
        {"col": "age", "range": (100, 999)}          # Range
    ]
)

Column Name Normalization

Column names read from SPSS files are automatically normalized to Python-friendly forms:

  • Whitespace stripped
  • Converted to lowercase
  • Dots, spaces, and dashes replaced with underscores
  • Multiple underscores collapsed to single underscore
# Original SPSS names: "Income Level", "AGE.YEARS", "Q-1"
df, meta = read_sav("survey.sav")
print(df.columns)
# ['income_level', 'age_years', 'q_1']
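
The rules above amount to a small string transform. A rough Python equivalent, for illustration only (svy_io's internal implementation may differ in edge cases):

import re

def normalize_name(name: str) -> str:
    """Approximate the documented column-name normalization."""
    name = name.strip().lower()          # strip whitespace, lowercase
    name = re.sub(r"[.\s-]", "_", name)  # dots, spaces, dashes -> underscores
    return re.sub(r"_+", "_", name)      # collapse repeated underscores

print(normalize_name("Income Level"))  # income_level
print(normalize_name("AGE.YEARS"))     # age_years
print(normalize_name("Q-1"))           # q_1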

Variable Name Validation

When writing SPSS files, variable names must follow SPSS rules:

  • Start with a letter
  • Contain only letters, numbers, and underscores
  • Maximum 64 bytes (UTF-8 encoded)
  • Not a reserved word (ALL, AND, BY, EQ, GE, GT, LE, LT, NE, NOT, OR, TO, WITH)
  • Case-insensitive uniqueness
# Valid names
df_valid = pl.DataFrame({
    "age": [25, 30],
    "income_2024": [50000, 60000],
    "response_A": [1, 2]
})
write_sav(df_valid, "valid.sav")  # Works

# Invalid names raise ValueError
df_invalid = pl.DataFrame({
    "1age": [25, 30],           # Can't start with a number
    "income$": [50000, 60000],  # Invalid character
    "ALL": [1, 2]               # Reserved word
})
write_sav(df_invalid, "invalid.sav")  # Raises ValueError
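
To catch naming problems before writing, you can pre-validate columns against the rules above. A hedged sketch (invalid_spss_names is illustrative, not a svy_io API; the reserved-word list mirrors the one documented here):

import re

SPSS_RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE",
                 "LT", "NE", "NOT", "OR", "TO", "WITH"}

def invalid_spss_names(columns: list[str]) -> list[str]:
    """Return the column names that violate the documented SPSS rules."""
    bad, seen = [], set()
    for name in columns:
        valid = (
            re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name) is not None
            and len(name.encode("utf-8")) <= 64
            and name.upper() not in SPSS_RESERVED
            and name.lower() not in seen  # case-insensitive uniqueness
        )
        if not valid:
            bad.append(name)
        seen.add(name.lower())
    return bad

print(invalid_spss_names(["age", "1age", "income$", "ALL", "AGE"]))
# ['1age', 'income$', 'ALL', 'AGE']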

Compression Options

| Option | Description | File size | Compatibility |
|---|---|---|---|
| "none" | No compression | Largest | All versions |
| "byte" | Byte compression (default) | Medium | All versions |
| "zsav" | ZLIB compression | Smallest | SPSS 21+ |

write_sav(df, "uncompressed.sav", compress="none")
write_sav(df, "compressed.sav", compress="byte")
write_sav(df, "compressed.zsav", compress="zsav")

Temporal Data Handling

SPSS date and datetime formats are automatically converted:

from datetime import datetime, date
import polars as pl
from svy_io.spss import write_sav, read_sav

df = pl.DataFrame({
    "id": [1, 2, 3],
    "birth_date": [date(1990, 1, 1), date(1985, 5, 15), date(1992, 12, 31)],
    "visit_datetime": [
        datetime(2024, 1, 15, 10, 30),
        datetime(2024, 2, 20, 14, 45),
        datetime(2024, 3, 10, 9, 0)
    ]
})

# Write with automatic timezone adjustment
write_sav(df, "dates.sav", adjust_tz=True)

# Read with automatic temporal coercion
df2, meta = read_sav("dates.sav", coerce_temporals=True)

File-Like Object Support

Both read_sav() and read_por() support file-like objects:

from io import BytesIO
from svy_io.spss import read_sav

# Read from bytes in memory
with open("survey.sav", "rb") as f:
    data = f.read()

bio = BytesIO(data)
df, meta = read_sav(bio)

# Useful for cloud storage, HTTP responses, etc.
import requests
response = requests.get("https://example.com/data.sav")
df, meta = read_sav(BytesIO(response.content))

Performance Tips

# 1. Skip unnecessary columns
df, meta = read_sav("large_survey.sav", cols_skip=["verbatim_comments"])

# 2. Limit rows for exploration
df, meta = read_sav("large_survey.sav", n_max=10000)

# 3. Disable temporal coercion if not needed
df, meta = read_sav("large_survey.sav", coerce_temporals=False)

# 4. Use compression when writing large files
write_sav(df, "output.zsav", compress="zsav")

Common Patterns

Converting SPSS to Other Formats

from svy_io.spss import read_sav

df, meta = read_sav("survey.sav")

# Export to CSV
df.write_csv("survey.csv")

# Export to Parquet (preserves types better)
df.write_parquet("survey.parquet")

# Export to Excel with labels as headers
labels = {v['name']: v.get('label', v['name']) for v in meta['vars']}
df.rename(labels).write_excel("survey.xlsx")

Preserving Metadata Across Transformations

import polars as pl
from svy_io.spss import read_sav, write_sav, get_column_labels

# Read with metadata
df, meta = read_sav("input.sav")

# Transform data
df_clean = df.filter(pl.col("age") > 18)

# Write with preserved labels
write_sav(df_clean, "output.sav", var_labels=get_column_labels(meta))
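
The same idea can extend to value labels and user-missing definitions. A sketch under the assumption that the read-side metadata lists share the {"col": ...} shape that write_sav() accepts, dropping entries whose columns did not survive the transformation:

# Carry value labels and user-missing specs through the pipeline
surviving = set(df_clean.columns)
write_sav(
    df_clean,
    "output.sav",
    var_labels=get_column_labels(meta),
    value_labels=[vl for vl in meta["value_labels"] if vl.get("col") in surviving],
    user_missing=[um for um in meta["user_missing"] if um.get("col") in surviving],
)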

Differences from Stata I/O

| Feature | SPSS | Stata |
|---|---|---|
| File formats | .sav, .zsav, .por | .dta |
| Compression | "byte", "none", "zsav" | Version-based |
| User-defined missing | Complex (values + ranges) | Tagged missing (.a, .b) |
| Variable names | Case-insensitive, max 64 bytes | Case-sensitive, max 32 chars |
| Column normalization | Auto-normalized | Preserved |
| Default temporal coercion | True for .sav | False |

See Also