Stata File I/O

Read and write .dta files with full metadata support

The svy_io library provides comprehensive support for reading and writing Stata .dta files through a clean, Pythonic API backed by the ReadStat C library.

Installation

pip install svy-io

Quick Start

Reading Stata Files

from svy_io.stata import read_dta

# Read a Stata file
df, meta = read_dta("data.dta")

# df is a Polars DataFrame
print(df.head())

# meta contains file metadata
print(meta.keys())
# dict_keys(['file_label', 'vars', 'value_labels',
#            'user_missing', 'n_rows', 'tagged_missings', 'notes'])

Writing Stata Files

from svy_io.stata import write_dta
import polars as pl

df = pl.DataFrame({
    "age": [25, 30, 35, 40],
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "score": [85.5, 92.0, 78.3, 88.9]
})

write_dta(df, "output.dta", version=15)

API Reference

read_dta()

Read a Stata .dta file into a Polars DataFrame.

Signature:

def read_dta(
    data_path: str,
    *,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0,
    coerce_temporals: bool = False,
    zap_empty_str: bool = False,
    factorize: bool = False,
    levels: str = "default",
    ordered: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| data_path | str | required | Path to the Stata .dta file |
| cols_skip | list[str] or None | None | Column names to skip during import |
| n_max | int or None | None | Maximum number of rows to read |
| rows_skip | int | 0 | Number of rows to skip from the beginning |
| coerce_temporals | bool | False | Convert Stata date/datetime formats to Python date/datetime |
| zap_empty_str | bool | False | Convert empty strings to None |
| factorize | bool | False | Convert value-labeled variables to Polars categoricals |
| levels | str | "default" | How to handle factor levels ("default", "labels", "values") |
| ordered | bool | False | Whether categorical variables should be ordered |

Returns:

A tuple of (df, meta) where:

  • df (pl.DataFrame): The data
  • meta (dict): Metadata dictionary containing:
| Key | Type | Description |
|---|---|---|
| file_label | str or None | Dataset-level label/description |
| vars | list | Variable metadata (name, label, format, etc.) |
| value_labels | list | Value label sets (categorical mappings) |
| user_missing | list | User-defined missing value specifications |
| n_rows | int | Number of rows read |
| tagged_missings | list | Tagged missing value information |
| notes | list | Dataset notes/comments |

Example:

# Read with options
df, meta = read_dta(
    "survey_data.dta",
    cols_skip=["temp_var", "id"],
    n_max=1000,
    rows_skip=10,
    coerce_temporals=True
)

# Access metadata
print(f"Dataset: {meta['file_label']}")
for var in meta['vars']:
    print(f"  {var['name']}: {var.get('label', 'No label')}")

write_dta()

Write a Polars DataFrame to a Stata .dta file.

Signature:

def write_dta(
    df: pl.DataFrame,
    path: str | os.PathLike | io.BufferedIOBase,
    *,
    version: int = 15,
    file_label: str | None = None,
    var_labels: dict[str, str] | None = None,
    value_labels: dict[str, dict] | None = None,
    strl_threshold: int = 2045,
    adjust_tz: bool = True,
    na_policy: str = "nan"
) -> pl.DataFrame

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| df | pl.DataFrame | required | DataFrame to write |
| path | str, PathLike, or BufferedIOBase | required | Output file path or file-like object |
| version | int | 15 | Stata version (8–15, or internal codes 113–119) |
| file_label | str or None | None | Dataset description (max 80 characters) |
| var_labels | dict[str, str] or None | None | Variable labels {"var_name": "description"} |
| value_labels | dict[str, dict] or None | None | Value labels (not yet implemented) |
| strl_threshold | int | 2045 | Maximum string length before error (max 2045) |
| adjust_tz | bool | True | Adjust timezone for datetime columns |
| na_policy | str | "nan" | How to handle infinity: "nan", "error", or "keep" |

Returns:

  • df (pl.DataFrame): The input DataFrame (unmodified), for method chaining

Raises:

  • ValueError: If strings exceed 2045 bytes, the file_label exceeds 80 characters, or another validation check fails
  • RuntimeError: If the underlying ReadStat library encounters an error

Example:

write_dta(
    df,
    "output.dta",
    version=14,
    file_label="Survey Data 2024",
    var_labels={
        "age": "Age in years",
        "income": "Annual income (USD)",
        "region": "Geographic region"
    },
    na_policy="nan"
)
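
Because path also accepts a file-like object and write_dta returns the
input DataFrame unchanged, you can write to an in-memory buffer; a minimal
sketch:

import io

buf = io.BytesIO()            # io.BytesIO is a BufferedIOBase subclass
result = write_dta(df, buf)   # returns df, so the call can sit in a chain
stata_bytes = buf.getvalue()  # raw .dta bytes (e.g., for upload or tests)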

Common Usage Patterns

Reading and Processing

# Read data
df, meta = read_dta("input.dta")

# Process with Polars
df = (df
    .filter(pl.col("age") > 25)
    .with_columns([
        (pl.col("income") * 1.1).alias("adjusted_income"),
        pl.col("date").str.to_date().alias("date_parsed")
    ])
    .select(["id", "age", "adjusted_income", "date_parsed"])
)

print(df)

Preserving Metadata

def transform_with_metadata(input_path, output_path, transform_fn):
    """Read, transform, and write while preserving metadata."""
    # Read with metadata
    df, meta = read_dta(input_path)

    # Apply transformation
    df = transform_fn(df)

    # Extract metadata for columns that still exist
    var_labels = {
        v['name']: v.get('label')
        for v in meta.get('vars', [])
        if v['name'] in df.columns
    }

    # Write with preserved metadata
    write_dta(
        df,
        output_path,
        file_label=meta.get('file_label'),
        var_labels=var_labels
    )

# Use it
transform_with_metadata(
    "input.dta",
    "output.dta",
    lambda df: df.filter(pl.col("year") == 2024)
)

Creating Files from Scratch

import polars as pl
from datetime import date

# Create data
df = pl.DataFrame({
    "id": range(1, 101),
    "name": [f"Person_{i}" for i in range(1, 101)],
    "treatment": ["A", "B"] * 50,
    "outcome": pl.Series(range(100), dtype=pl.Float64) * 1.5,
    "date": [date(2024, 1, 1)] * 100
})

# Write with full metadata
write_dta(
    df,
    "experiment.dta",
    version=15,
    file_label="RCT Study - Treatment Effects 2024",
    var_labels={
        "id": "Participant identifier",
        "name": "Participant name",
        "treatment": "Treatment group assignment (A=control, B=treatment)",
        "outcome": "Primary outcome measure (standardized score)",
        "date": "Date of measurement"
    }
)

Working with Value Labels

# Read data with value labels
df, meta = read_dta("survey.dta")

# Inspect value labels
for vl in meta['value_labels']:
    print(f"\n{vl['set_name']}:")
    for value, label in vl['mapping'].items():
        print(f"  {value} = {label}")

# Example output:
# education:
#   1 = Less than high school
#   2 = High school
#   3 = Some college
#   4 = Bachelor's degree
#   5 = Graduate degree

# Apply value labels to create readable data
if meta.get('value_labels'):
    for vl in meta['value_labels']:
        # Find which variable uses this label set
        var_name = next(
            (v['name'] for v in meta['vars']
             if v.get('label_set') == vl['set_name']),
            None
        )
        if var_name and var_name in df.columns:
            # Map numeric codes to labels
            mapping = {int(k): v for k, v in vl['mapping'].items()}
            # replace_strict allows the output dtype to differ from the
            # numeric input; unmapped codes become null
            df = df.with_columns(
                pl.col(var_name)
                .replace_strict(mapping, default=None, return_dtype=pl.Utf8)
                .alias(f"{var_name}_label")
            )

Handling Long Strings

Stata limits fixed-width string fields to 2045 bytes, and longer strL strings cannot currently be written (see Known Limitations). Here’s how to handle this:

# Check string lengths
max_lengths = {
    col: (df[col].str.len_bytes().max() or 0)  # all-null columns count as 0
    for col in df.columns
    if df[col].dtype == pl.Utf8
}

print("String column lengths:")
for col, max_len in max_lengths.items():
    status = "✓" if max_len <= 2045 else "✗ TOO LONG"
    print(f"  {col}: {max_len} bytes {status}")

# Option 1: Truncate
df = df.with_columns([
    pl.col(col).str.slice(0, 2045).alias(col)
    for col, length in max_lengths.items()
    if length > 2045
])

# Option 2: Use alternative format
if any(length > 2045 for length in max_lengths.values()):
    print("Strings too long for Stata, using Parquet instead")
    df.write_parquet("data.parquet")

# Option 3: Split into multiple columns (drop the original so the
# over-long column is not written)
df = df.with_columns([
    pl.col("long_text").str.slice(0, 2045).alias("text_part1"),
    pl.col("long_text").str.slice(2045, 2045).alias("text_part2"),
]).drop("long_text")

Roundtrip Workflow

# Read original
df_original, meta = read_dta("original.dta")

# Process
df_processed = (df_original
    .filter(pl.col("valid") == 1)
    .with_columns([
        (pl.col("value") * 1.05).alias("adjusted_value")
    ])
)

# Write back with original metadata
var_labels = {v['name']: v.get('label') for v in meta['vars']}
var_labels['adjusted_value'] = "Value adjusted by 5%"

write_dta(
    df_processed,
    "processed.dta",
    version=15,
    file_label=meta.get('file_label'),
    var_labels=var_labels
)

Stata Version Reference

Version Mapping

| Version | Internal Code | Year | Notes |
|---|---|---|---|
| 8–9 | 113 | 2003–2005 | Basic support |
| 10–11 | 114 | 2007–2009 | |
| 12 | 115 | 2011 | |
| 13 | 117 | 2013 | strL introduced* |
| 14 | 118 | 2015 | Unicode support |
| 15 | 119 | 2017 | Latest |

*strL (long strings >2045 bytes) can be read, but writing is currently unavailable due to a ReadStat library bug.

Specifying Version

# All equivalent ways to write Stata 14 format
write_dta(df, "out.dta", version=14)   # Recommended
write_dta(df, "out.dta", version=118)  # Internal code

Data Type Conversions

Reading (Stata → Python)

| Stata Type | Polars Type | Notes |
|---|---|---|
| byte | Float64 | Numeric with missings |
| int | Float64 | Numeric with missings |
| long | Float64 | Numeric with missings |
| float | Float64 | |
| double | Float64 | |
| str# | String | Fixed-width strings |
| strL | String | Long strings (read only)* |

*strL strings can be read but not currently written due to a ReadStat bug.
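
Because every numeric Stata type arrives as Float64, integer-valued columns
may need an explicit cast after reading; a minimal sketch (the column name
is illustrative):

from svy_io.stata import read_dta
import polars as pl

df, meta = read_dta("data.dta")
print(df.schema)  # all numeric columns report Float64

# Cast back to integers where appropriate (only safe when the column
# has no missing values and no fractional parts)
df = df.with_columns(pl.col("age").cast(pl.Int64))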

Writing (Python → Stata)

| Polars Type | Stata Type | Notes |
|---|---|---|
| Int8, Int16, Int32, Int64 | double | All integers → double |
| UInt8, UInt16, UInt32, UInt64 | double | Unsigned → double |
| Float32, Float64 | double | |
| Boolean | double | True=1, False=0 |
| String | str# | Max 2045 bytes |
| Date | double | With %td format |
| Datetime | double | With %tc format |
| Categorical | double | Not yet labeled* |
*Categorical → labeled integers not yet implemented.
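
A short demonstration of the write-side conversions (column names are
illustrative): booleans are stored as 0/1 doubles, and dates become
doubles carrying a %td display format.

import polars as pl
from datetime import date
from svy_io.stata import write_dta

df = pl.DataFrame({
    "enrolled": [True, False, True],                  # Boolean -> double (1/0)
    "visit": [date(2024, 1, 1), date(2024, 2, 1),
              date(2024, 3, 1)],                      # Date -> double with %td
    "count": [1, 2, 3],                               # Int64 -> double
})
write_dta(df, "types_demo.dta")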

Error Handling

Common Errors and Solutions

from svy_io.stata import write_dta
import polars as pl

df = pl.DataFrame({"text": ["A" * 3000]})

try:
    write_dta(df, "out.dta")
except ValueError as e:
    if "longer than 2045 bytes" in str(e):
        print("Error: String too long")
        print("\nSolutions:")
        print("1. Truncate: df.with_columns(pl.col('text').str.slice(0, 2045))")
        print("2. Use Parquet: df.write_parquet('out.parquet')")
        print("3. Split column: see documentation")
    elif "file_label must be 80" in str(e):
        print("Error: File label too long (max 80 characters)")
    else:
        raise

Validation Helper

def validate_for_stata(df: pl.DataFrame) -> list[str]:
    """Check if DataFrame can be written to Stata."""
    issues = []

    # Check string lengths
    for col in df.columns:
        if df[col].dtype == pl.Utf8:
            max_len = df[col].str.len_bytes().max()
            if max_len is not None and max_len > 2045:
                issues.append(f"Column '{col}' has strings up to {max_len} bytes (max: 2045)")

    # Check column names
    for col in df.columns:
        if len(col) > 32:
            issues.append(f"Column name '{col}' too long (max: 32 characters)")

    return issues

# Use it
issues = validate_for_stata(df)
if issues:
    print("Cannot write to Stata:")
    for issue in issues:
        print(f"  - {issue}")
else:
    write_dta(df, "output.dta")

Feature Support

| Feature | Read | Write | Notes |
|---|---|---|---|
| Basic data types | ✓ | ✓ | All numeric, string, boolean |
| Variable labels | ✓ | ✓ | Full roundtrip |
| File labels | ✓ | ✓ | Full roundtrip |
| Value labels | ✓ | ✗ | Read only (write pending) |
| Strings ≤2045 bytes | ✓ | ✓ | Full support |
| Strings >2045 bytes | ✓ | ✗ | ReadStat bug blocks write |
| Tagged missing | ✓ | ✗ | Read only (write pending) |
| Date/Datetime | ✓ | ✓ | With optional coercion |
| Notes | ✓ | ✗ | Read only |
| UTF-8 | ✓ | ✓ | Full Unicode support |
| All versions (8–15) | ✓ | ✓ | Complete support |

Performance Tips

# 1. Skip unnecessary columns
df, meta = read_dta("large_file.dta", cols_skip=["temp1", "temp2", "unused"])

# 2. Limit rows when exploring
df_sample, _ = read_dta("large_file.dta", n_max=1000)

# 3. Use Polars lazy evaluation for large files
df_lazy = pl.scan_csv("intermediate.csv")
result = (df_lazy
    .filter(pl.col("year") == 2024)
    .group_by("region")
    .agg(pl.col("value").mean())
    .collect()
)
write_dta(result, "summary.dta")
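
For files too large to process in one pass, the documented rows_skip and
n_max parameters can also drive a simple chunked read; a sketch (the chunk
size is arbitrary, and each call reopens the file):

from svy_io.stata import read_dta
import polars as pl

chunk_size = 100_000
offset = 0
parts = []
while True:
    chunk, _ = read_dta("large_file.dta", rows_skip=offset, n_max=chunk_size)
    if chunk.height == 0:
        break
    parts.append(chunk.filter(pl.col("year") == 2024))  # keep only what you need
    offset += chunk.height
    if chunk.height < chunk_size:  # a short chunk means end of file
        break
result = pl.concat(parts)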

Known Limitations

  1. strL Strings (>2045 bytes) — Cannot write due to ReadStat v1.1.9 bug. The library raises a clear error with workarounds.

  2. Value Labels on Write — Not yet implemented. You can read files with value labels but cannot create new ones when writing.

  3. Categorical Variables — Polars Categorical types are not automatically converted to Stata labeled integers. A manual workaround sketch follows this list.

  4. Tagged Missing Values on Write — Cannot write tagged missing values (e.g., .a, .b). These can be read from existing files.

  5. Variable Name Length — Stata has a 32-character limit on variable names.

  6. File Label Length — Maximum 80 characters for dataset-level labels.
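
For limitation 3, one manual workaround is to write the underlying category
codes and keep the code-to-label mapping alongside the file; a sketch
(svy_io does not do this automatically):

import polars as pl
from svy_io.stata import write_dta

df = pl.DataFrame({"region": ["north", "south", "north"]}).with_columns(
    pl.col("region").cast(pl.Categorical)
)

# Record the code -> label mapping before discarding the categorical dtype
codes = df["region"].to_physical()   # UInt32 category codes
labels = df["region"].cast(pl.Utf8)  # the label text
mapping = dict(zip(codes.to_list(), labels.to_list()))

# Write the integer codes (stored as double, per the conversion table)
write_dta(df.with_columns(pl.col("region").to_physical()), "regions.dta")
print(mapping)  # e.g. {0: 'north', 1: 'south'}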

See Also