SPSS File I/O

Read and write SPSS .sav, .zsav, and .por files in Python with a fast, Pythonic API and full support for variable labels, value labels, and user-defined missing values.

The svy_io library provides comprehensive support for reading and writing SPSS files (.sav, .zsav, and .por formats) through a clean, Pythonic API backed by the ReadStat C library.

Installation

pip install svy-io

Quick Start

Reading SPSS Files

from svy_io.spss import read_sav, read_por, read_spss

# Read a .sav file
df, meta = read_sav("survey.sav")

# Read a compressed .zsav file (automatically handled)
df_z, meta_z = read_sav("compressed.zsav")

# Read a portable .por file
df_por, meta_por = read_por("transport.por")

# Auto-detect format based on extension
df_auto, meta_auto = read_spss("data.sav")

# df is a Polars DataFrame
print(df.head())

# meta contains file metadata
print(meta.keys())
# dict_keys(['file_label', 'vars', 'value_labels',
#            'user_missing', 'n_rows'])

Writing SPSS Files

from svy_io.spss import write_sav
import polars as pl

df = pl.DataFrame({
    "subject_id": [1, 2, 3, 4, 5],
    "age": [25, 30, 35, 40, 45],
    "treatment": ["A", "B", "A", "B", "A"],
    "response": [85.5, 92.0, 78.3, 88.9, 95.2]
})

# Write with variable labels
write_sav(
    df,
    "clinical_trial.sav",
    var_labels={
        "subject_id": "Subject ID",
        "age": "Age in years",
        "treatment": "Treatment group",
        "response": "Response score"
    }
)

API Reference

read_sav()

Read an SPSS .sav or .zsav file into a Polars DataFrame. Automatically handles compressed files.

Signature:

def read_sav(
    data_path: str | os.PathLike | io.BufferedIOBase,
    *,
    encoding: str | None = None,
    user_na: bool = False,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0,
    coerce_temporals: bool = True,
    zap_empty_str: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| data_path | str \| PathLike \| BufferedIOBase | required | Path to the SPSS file or a file-like object |
| encoding | str \| None | None | Character encoding (e.g., "latin1", "utf-8") |
| user_na | bool | False | Preserve user-defined missing values as data |
| cols_skip | list[str] \| None | None | Column names to skip during import |
| n_max | int \| None | None | Maximum number of rows to read |
| rows_skip | int | 0 | Number of rows to skip from the beginning |
| coerce_temporals | bool | True | Convert SPSS date/datetime values to Python types |
| zap_empty_str | bool | False | Convert empty strings to None |

Returns:

A tuple of (df, meta) where:

  • df (pl.DataFrame): The data with normalized column names (lowercase, underscores)
  • meta (dict): Metadata dictionary containing:
| Key | Type | Description |
|---|---|---|
| file_label | str \| None | Dataset-level label/description |
| vars | list | Variable metadata (name, label, format, user_missing) |
| value_labels | list | Value label sets (categorical mappings) |
| user_missing | list | User-defined missing value specifications |
| n_rows | int | Number of rows read |
| labelled_columns | dict | LabelledSPSS objects (present only when user_na=True) |

Example:

# Basic read
df, meta = read_sav("survey.sav")

# Read with encoding
df, meta = read_sav("survey.sav", encoding="latin1")

# Read with options
df, meta = read_sav(
    "survey.sav",
    cols_skip=["temp_var", "id"],
    n_max=1000,
    rows_skip=10,
    coerce_temporals=True
)

# Preserve user-defined missing values
df, meta = read_sav("data.sav", user_na=True)
labelled_cols = meta.get('labelled_columns', {})

# Access metadata
print(f"Dataset: {meta['file_label']}")
for var in meta['vars']:
    print(f"  {var['name']}: {var.get('label', 'No label')}")

read_por()

Read an SPSS Portable (.por) file into a Polars DataFrame.

Signature:

def read_por(
    data_path: str | os.PathLike | io.BufferedIOBase,
    *,
    user_na: bool = False,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0,
    coerce_temporals: bool = False,
    zap_empty_str: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| data_path | str \| PathLike \| BufferedIOBase | required | Path to the POR file or a file-like object |
| user_na | bool | False | Preserve user-defined missing values as data |
| cols_skip | list[str] \| None | None | Column names to skip during import |
| n_max | int \| None | None | Maximum number of rows to read |
| rows_skip | int | 0 | Number of rows to skip from the beginning |
| coerce_temporals | bool | False | Convert SPSS date/datetime formats |
| zap_empty_str | bool | False | Convert empty strings to None |

Returns:

  • df (pl.DataFrame): The data
  • meta (dict): Metadata dictionary (same structure as read_sav)

Example:

# Read portable file
df, meta = read_por("legacy.por")

# Read with temporal coercion
df, meta = read_por("data.por", coerce_temporals=True)

read_spss()

Auto-dispatch to read_sav() or read_por() based on file extension.

Signature:

def read_spss(
    data_path: str | os.PathLike,
    *,
    encoding: str | None = None,
    user_na: bool = False,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0,
    coerce_temporals: bool = False,
    zap_empty_str: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]

Note: data_path must be a filesystem path (not a file-like object).

Example:

# Automatically detects .sav or .por format
df, meta = read_spss("mydata.sav")
df, meta = read_spss("olddata.por")
df, meta = read_spss("compressed.zsav")

write_sav()

Write a Polars DataFrame to an SPSS .sav file with optional compression, variable labels, value labels, and user-defined missing values.

Signature:

def write_sav(
    df: pl.DataFrame,
    path: str | Path,
    *,
    compress: str = "byte",
    adjust_tz: bool = True,
    var_labels: dict[str, str] | None = None,
    user_missing: list[dict[str, Any]] | None = None,
    value_labels: list[dict[str, Any]] | None = None
) -> pl.DataFrame

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| df | pl.DataFrame | required | DataFrame to write |
| path | str \| Path | required | Output file path |
| compress | str | "byte" | Compression: "byte", "none", or "zsav" |
| adjust_tz | bool | True | Adjust timezone for datetime columns |
| var_labels | dict[str, str] \| None | None | Variable labels, {"col": "description"} |
| user_missing | list[dict] \| None | None | User-defined missing specifications |
| value_labels | list[dict] \| None | None | Value label definitions |

User Missing Format:

user_missing = [
    {"col": "income", "values": [-99, -98]},       # Specific values
    {"col": "age", "range": (0, 10)},              # Range of values
    {"col": "score", "values": [999], "range": (-1, 0)}  # Both
]

Value Labels Format:

value_labels = [
    {"col": "gender", "labels": {"1": "Male", "2": "Female", "3": "Other"}},
    {"col": "treatment", "labels": {"1": "Control", "2": "Treatment A"}}
]

Returns:

  • df (pl.DataFrame): The original input DataFrame (unchanged)
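
Because write_sav() returns its input unchanged, it can sit inside a transformation chain as a side-effecting step (the file name here is illustrative):

# Save a snapshot mid-pipeline, then keep transforming
df_adults = write_sav(df, "snapshot.sav").filter(pl.col("age") >= 18)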

Raises:

  • ValueError: Invalid column names (duplicates, reserved words, invalid characters)
  • RuntimeError: If the underlying writer encounters an error

Example:

import polars as pl
from svy_io.spss import write_sav

df = pl.DataFrame({
    "subject_id": [1, 2, 3, 4, 5],
    "age": [25, 30, 35, 40, 45],
    "gender": [1, 2, 1, 2, 3],
    "income": [50000, 75000, -99, 60000, 85000]
})

# Complete example with all features
write_sav(
    df,
    "complete.sav",
    compress="byte",
    var_labels={
        "subject_id": "Unique subject identifier",
        "age": "Age at enrollment (years)",
        "gender": "Self-reported gender",
        "income": "Household income (USD)"
    },
    value_labels=[
        {"col": "gender", "labels": {"1": "Male", "2": "Female", "3": "Other"}}
    ],
    user_missing=[
        {"col": "income", "values": [-99, -98]}
    ]
)

Metadata Helper Functions

get_column_labels()

Extract variable labels from metadata.

from svy_io.spss import read_sav, get_column_labels

df, meta = read_sav("survey.sav")
labels = get_column_labels(meta)
# {'age': 'Age in years', 'income': 'Annual income', ...}

get_value_labels_for_column()

Get value labels for a specific column.

from svy_io.spss import read_sav, get_value_labels_for_column

df, meta = read_sav("survey.sav")
gender_labels = get_value_labels_for_column(meta, "gender")
# {'1': 'Male', '2': 'Female', '3': 'Other'}

get_user_missing_for_column()

Get user-defined missing value specifications for a column.

from svy_io.spss import read_sav, get_user_missing_for_column

df, meta = read_sav("survey.sav")
income_missing = get_user_missing_for_column(meta, "income")
# {'values': [-99, -98], 'range': None}

Working with User-Defined Missing Values

SPSS supports user-defined missing values, which are distinct from system missing (null). These allow researchers to distinguish between different types of missing data (e.g., “refused to answer” vs. “not applicable”).

Reading Files with User-Defined Missing

# Default: Convert user-defined missing to None
df, meta = read_sav("survey.sav", user_na=False)

# Preserve user-defined missing as data
df, meta = read_sav("survey.sav", user_na=True)
labelled_cols = meta.get('labelled_columns', {})
if 'income' in labelled_cols:
    labelled_income = labelled_cols['income']
    print(labelled_income.na_values)  # [-99, -98]
    print(labelled_income.na_range)   # None or (low, high)
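
Keeping the codes as data lets you tabulate them before discarding. A short sketch building on the snippet above (assumes income is numeric and labelled_income.na_values holds discrete codes rather than a range):

import polars as pl

# Tabulate the user-missing codes before recoding them away
missing_counts = df.group_by("income").len().filter(
    pl.col("income").is_in(labelled_income.na_values)
)

# Recode user-missing values to null for analysis
df = df.with_columns(
    pl.when(pl.col("income").is_in(labelled_income.na_values))
    .then(None)
    .otherwise(pl.col("income"))
    .alias("income")
)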

Writing Files with User-Defined Missing

write_sav(
    df,
    "survey.sav",
    user_missing=[
        {"col": "q1", "values": [-99]},              # Specific values
        {"col": "q2", "values": [-99, -98, -97]},    # Multiple values
        {"col": "age", "range": (100, 999)}          # Range
    ]
)

Column Name Normalization

Column names read from SPSS files are automatically normalized to Python-friendly forms:

  • Whitespace stripped
  • Converted to lowercase
  • Dots, spaces, and dashes replaced with underscores
  • Multiple underscores collapsed to single underscore
# Original SPSS names: "Income Level", "AGE.YEARS", "Q-1"
df, meta = read_sav("survey.sav")
print(df.columns)
# ['income_level', 'age_years', 'q_1']
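
The rules above amount to a small string transform. A rough Python equivalent, for illustration only (svy_io's internal implementation may differ in edge cases):

import re

def normalize_name(name: str) -> str:
    """Approximate the documented column-name normalization."""
    name = name.strip().lower()          # strip whitespace, lowercase
    name = re.sub(r"[.\s-]", "_", name)  # dots, spaces, dashes -> underscores
    return re.sub(r"_+", "_", name)      # collapse repeated underscores

print(normalize_name("Income Level"))  # income_level
print(normalize_name("AGE.YEARS"))     # age_years
print(normalize_name("Q-1"))           # q_1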

Variable Name Validation

When writing SPSS files, variable names must follow SPSS rules:

  • Start with a letter
  • Contain only letters, numbers, and underscores
  • Maximum 64 bytes (UTF-8 encoded)
  • Not a reserved word (ALL, AND, BY, EQ, GE, GT, LE, LT, NE, NOT, OR, TO, WITH)
  • Case-insensitive uniqueness
# Valid names
df_valid = pl.DataFrame({
    "age": [25, 30],
    "income_2024": [50000, 60000],
    "response_A": [1, 2]
})
write_sav(df_valid, "valid.sav")  # Works

# Invalid names raise ValueError
df_invalid = pl.DataFrame({
    "1age": [25, 30],           # Can't start with a number
    "income$": [50000, 60000],  # Invalid character
    "ALL": [1, 2]               # Reserved word
})
write_sav(df_invalid, "invalid.sav")  # Raises ValueError
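
To catch naming problems before writing, you can pre-validate columns against the rules above. A hedged sketch (invalid_spss_names is illustrative, not a svy_io API; the reserved-word list mirrors the one documented here):

import re

SPSS_RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE",
                 "LT", "NE", "NOT", "OR", "TO", "WITH"}

def invalid_spss_names(columns: list[str]) -> list[str]:
    """Return the column names that violate the documented SPSS rules."""
    bad, seen = [], set()
    for name in columns:
        valid = (
            re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name) is not None
            and len(name.encode("utf-8")) <= 64
            and name.upper() not in SPSS_RESERVED
            and name.lower() not in seen  # case-insensitive uniqueness
        )
        if not valid:
            bad.append(name)
        seen.add(name.lower())
    return bad

print(invalid_spss_names(["age", "1age", "income$", "ALL", "AGE"]))
# ['1age', 'income$', 'ALL', 'AGE']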

Compression Options

| Option | Description | File size | Compatibility |
|---|---|---|---|
| "none" | No compression | Largest | All versions |
| "byte" | Byte compression (default) | Medium | All versions |
| "zsav" | ZLIB compression | Smallest | SPSS 21+ |

write_sav(df, "uncompressed.sav", compress="none")
write_sav(df, "compressed.sav", compress="byte")
write_sav(df, "compressed.zsav", compress="zsav")

Temporal Data Handling

SPSS date and datetime formats are automatically converted:

from datetime import datetime, date
import polars as pl
from svy_io.spss import write_sav, read_sav

df = pl.DataFrame({
    "id": [1, 2, 3],
    "birth_date": [date(1990, 1, 1), date(1985, 5, 15), date(1992, 12, 31)],
    "visit_datetime": [
        datetime(2024, 1, 15, 10, 30),
        datetime(2024, 2, 20, 14, 45),
        datetime(2024, 3, 10, 9, 0)
    ]
})

# Write with automatic timezone adjustment
write_sav(df, "dates.sav", adjust_tz=True)

# Read with automatic temporal coercion
df2, meta = read_sav("dates.sav", coerce_temporals=True)

File-Like Object Support

Both read_sav() and read_por() support file-like objects:

from io import BytesIO
from svy_io.spss import read_sav

# Read from bytes in memory
with open("survey.sav", "rb") as f:
    data = f.read()

bio = BytesIO(data)
df, meta = read_sav(bio)

# Useful for cloud storage, HTTP responses, etc.
import requests
response = requests.get("https://example.com/data.sav")
df, meta = read_sav(BytesIO(response.content))

Performance Tips

# 1. Skip unnecessary columns
df, meta = read_sav("large_survey.sav", cols_skip=["verbatim_comments"])

# 2. Limit rows for exploration
df, meta = read_sav("large_survey.sav", n_max=10000)

# 3. Disable temporal coercion if not needed
df, meta = read_sav("large_survey.sav", coerce_temporals=False)

# 4. Use compression when writing large files
write_sav(df, "output.zsav", compress="zsav")

Common Patterns

Converting SPSS to Other Formats

from svy_io.spss import read_sav

df, meta = read_sav("survey.sav")

# Export to CSV
df.write_csv("survey.csv")

# Export to Parquet (preserves types better)
df.write_parquet("survey.parquet")

# Export to Excel with labels as headers
labels = {v['name']: v.get('label', v['name']) for v in meta['vars']}
df.rename(labels).write_excel("survey.xlsx")

Preserving Metadata Across Transformations

import polars as pl
from svy_io.spss import read_sav, write_sav, get_column_labels

# Read with metadata
df, meta = read_sav("input.sav")

# Transform data
df_clean = df.filter(pl.col("age") > 18)

# Write with preserved labels
write_sav(df_clean, "output.sav", var_labels=get_column_labels(meta))
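
The same idea can extend to value labels and user-missing definitions. A sketch under the assumption that the read-side metadata lists share the {"col": ...} shape that write_sav() accepts, dropping entries whose columns did not survive the transformation:

# Carry value labels and user-missing specs through the pipeline
surviving = set(df_clean.columns)
write_sav(
    df_clean,
    "output.sav",
    var_labels=get_column_labels(meta),
    value_labels=[vl for vl in meta["value_labels"] if vl.get("col") in surviving],
    user_missing=[um for um in meta["user_missing"] if um.get("col") in surviving],
)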

Differences from Stata I/O

| Feature | SPSS | Stata |
|---|---|---|
| File formats | .sav, .zsav, .por | .dta |
| Compression | "byte", "none", "zsav" | Version-based |
| User-defined missing | Complex (values + ranges) | Tagged missing (.a, .b) |
| Variable names | Case-insensitive, max 64 bytes | Case-sensitive, max 32 chars |
| Column normalization | Auto-normalized | Preserved |
| Default temporal coercion | True for .sav | False |

See Also