Stata File I/O

Read and write .dta files with full metadata support

The svy_io library provides comprehensive support for reading and writing Stata .dta files through a clean, Pythonic API backed by the ReadStat C library.

Installation

pip install svy-io

Quick Start

Reading Stata Files

from svy_io.stata import read_dta

# Read a Stata file
df, meta = read_dta("data.dta")

# df is a Polars DataFrame
print(df.head())

# meta contains file metadata
print(meta.keys())
# dict_keys(['file_label', 'vars', 'value_labels',
#            'user_missing', 'n_rows', 'tagged_missings', 'notes'])

Writing Stata Files

from svy_io.stata import write_dta
import polars as pl

df = pl.DataFrame({
    "age": [25, 30, 35, 40],
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "score": [85.5, 92.0, 78.3, 88.9]
})

write_dta(df, "output.dta", version=15)

API Reference

read_dta()

Read a Stata .dta file into a Polars DataFrame.

Signature:

def read_dta(
    data_path: str,
    *,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0,
    coerce_temporals: bool = False,
    zap_empty_str: bool = False,
    factorize: bool = False,
    levels: str = "default",
    ordered: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| data_path | str | required | Path to the Stata .dta file |
| cols_skip | list[str] or None | None | Column names to skip during import |
| n_max | int or None | None | Maximum number of rows to read |
| rows_skip | int | 0 | Number of rows to skip from the beginning |
| coerce_temporals | bool | False | Convert Stata date/datetime formats to Python date/datetime |
| zap_empty_str | bool | False | Convert empty strings to None |
| factorize | bool | False | Convert value-labeled variables to Polars categoricals |
| levels | str | "default" | How to handle factor levels ("default", "labels", "values") |
| ordered | bool | False | Whether categorical variables should be ordered |

Returns:

A tuple of (df, meta) where:

  • df (pl.DataFrame): The data
  • meta (dict): Metadata dictionary containing:
| Key | Type | Description |
|---|---|---|
| file_label | str or None | Dataset-level label/description |
| vars | list | Variable metadata (name, label, format, etc.) |
| value_labels | list | Value label sets (categorical mappings) |
| user_missing | list | User-defined missing value specifications |
| n_rows | int | Number of rows read |
| tagged_missings | list | Tagged missing value information |
| notes | list | Dataset notes/comments |

Example:

# Read with options
df, meta = read_dta(
    "survey_data.dta",
    cols_skip=["temp_var", "id"],
    n_max=1000,
    rows_skip=10,
    coerce_temporals=True
)

# Access metadata
print(f"Dataset: {meta['file_label']}")
for var in meta['vars']:
    print(f"  {var['name']}: {var.get('label', 'No label')}")

write_dta()

Write a Polars DataFrame to a Stata .dta file.

Signature:

def write_dta(
    df: pl.DataFrame,
    path: str | os.PathLike | io.BufferedIOBase,
    *,
    version: int = 15,
    file_label: str | None = None,
    var_labels: dict[str, str] | None = None,
    value_labels: dict[str, dict] | None = None,
    strl_threshold: int = 2045,
    adjust_tz: bool = True,
    na_policy: str = "nan"
) -> pl.DataFrame

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| df | pl.DataFrame | required | DataFrame to write |
| path | str, PathLike, or BufferedIOBase | required | Output file path or file-like object |
| version | int | 15 | Stata version (8–15, or internal codes 113–119) |
| file_label | str or None | None | Dataset description (max 80 characters) |
| var_labels | dict[str, str] or None | None | Variable labels {"var_name": "description"} |
| value_labels | dict[str, dict] or None | None | Value labels (not yet implemented) |
| strl_threshold | int | 2045 | Maximum string length before error (max 2045) |
| adjust_tz | bool | True | Adjust timezone for datetime columns |
| na_policy | str | "nan" | How to handle infinity: "nan", "error", or "keep" |

Returns:

  • df (pl.DataFrame): The input DataFrame (unmodified), for method chaining

Raises:

  • ValueError: If strings exceed 2045 bytes, the file_label exceeds 80 characters, or another validation check fails
  • RuntimeError: If the underlying ReadStat library encounters an error

Example:

write_dta(
    df,
    "output.dta",
    version=14,
    file_label="Survey Data 2024",
    var_labels={
        "age": "Age in years",
        "income": "Annual income (USD)",
        "region": "Geographic region"
    },
    na_policy="nan"
)
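
Because path also accepts a file-like object and write_dta returns the
input DataFrame unchanged, you can write to an in-memory buffer; a minimal
sketch:

import io

buf = io.BytesIO()            # io.BytesIO is a BufferedIOBase subclass
result = write_dta(df, buf)   # returns df, so the call can sit in a chain
stata_bytes = buf.getvalue()  # raw .dta bytes (e.g., for upload or tests)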

Common Usage Patterns

Reading and Processing

# Read data
df, meta = read_dta("input.dta")

# Process with Polars
df = (df
    .filter(pl.col("age") > 25)
    .with_columns([
        (pl.col("income") * 1.1).alias("adjusted_income"),
        pl.col("date").str.to_date().alias("date_parsed")
    ])
    .select(["id", "age", "adjusted_income", "date_parsed"])
)

print(df)

Preserving Metadata

def transform_with_metadata(input_path, output_path, transform_fn):
    """Read, transform, and write while preserving metadata."""
    # Read with metadata
    df, meta = read_dta(input_path)

    # Apply transformation
    df = transform_fn(df)

    # Extract metadata for columns that still exist
    var_labels = {
        v['name']: v.get('label')
        for v in meta.get('vars', [])
        if v['name'] in df.columns
    }

    # Write with preserved metadata
    write_dta(
        df,
        output_path,
        file_label=meta.get('file_label'),
        var_labels=var_labels
    )

# Use it
transform_with_metadata(
    "input.dta",
    "output.dta",
    lambda df: df.filter(pl.col("year") == 2024)
)

Creating Files from Scratch

import polars as pl
from datetime import date

# Create data
df = pl.DataFrame({
    "id": range(1, 101),
    "name": [f"Person_{i}" for i in range(1, 101)],
    "treatment": ["A", "B"] * 50,
    "outcome": pl.Series(range(100), dtype=pl.Float64) * 1.5,
    "date": [date(2024, 1, 1)] * 100
})

# Write with full metadata
write_dta(
    df,
    "experiment.dta",
    version=15,
    file_label="RCT Study - Treatment Effects 2024",
    var_labels={
        "id": "Participant identifier",
        "name": "Participant name",
        "treatment": "Treatment group assignment (A=control, B=treatment)",
        "outcome": "Primary outcome measure (standardized score)",
        "date": "Date of measurement"
    }
)

Working with Value Labels

# Read data with value labels
df, meta = read_dta("survey.dta")

# Inspect value labels
for vl in meta['value_labels']:
    print(f"\n{vl['set_name']}:")
    for value, label in vl['mapping'].items():
        print(f"  {value} = {label}")

# Example output:
# education:
#   1 = Less than high school
#   2 = High school
#   3 = Some college
#   4 = Bachelor's degree
#   5 = Graduate degree

# Apply value labels to create readable data
if meta.get('value_labels'):
    for vl in meta['value_labels']:
        # Find which variable uses this label set
        var_name = next(
            (v['name'] for v in meta['vars']
             if v.get('label_set') == vl['set_name']),
            None
        )
        if var_name and var_name in df.columns:
            # Map numeric codes to labels
            mapping = {int(k): v for k, v in vl['mapping'].items()}
            # replace_strict allows the output dtype to differ from the
            # numeric input; unmapped codes become null
            df = df.with_columns(
                pl.col(var_name)
                .replace_strict(mapping, default=None, return_dtype=pl.Utf8)
                .alias(f"{var_name}_label")
            )

Handling Long Strings

Stata limits fixed-width string fields to 2045 bytes, and longer strL strings cannot currently be written (see Known Limitations). Here’s how to handle this:

# Check string lengths
max_lengths = {
    col: (df[col].str.len_bytes().max() or 0)  # all-null columns count as 0
    for col in df.columns
    if df[col].dtype == pl.Utf8
}

print("String column lengths:")
for col, max_len in max_lengths.items():
    status = "✓" if max_len <= 2045 else "✗ TOO LONG"
    print(f"  {col}: {max_len} bytes {status}")

# Option 1: Truncate
df = df.with_columns([
    pl.col(col).str.slice(0, 2045).alias(col)
    for col, length in max_lengths.items()
    if length > 2045
])

# Option 2: Use alternative format
if any(length > 2045 for length in max_lengths.values()):
    print("Strings too long for Stata, using Parquet instead")
    df.write_parquet("data.parquet")

# Option 3: Split into multiple columns (drop the original so the
# over-long column is not written)
df = df.with_columns([
    pl.col("long_text").str.slice(0, 2045).alias("text_part1"),
    pl.col("long_text").str.slice(2045, 2045).alias("text_part2"),
]).drop("long_text")

Roundtrip Workflow

# Read original
df_original, meta = read_dta("original.dta")

# Process
df_processed = (df_original
    .filter(pl.col("valid") == 1)
    .with_columns([
        (pl.col("value") * 1.05).alias("adjusted_value")
    ])
)

# Write back with original metadata
var_labels = {v['name']: v.get('label') for v in meta['vars']}
var_labels['adjusted_value'] = "Value adjusted by 5%"

write_dta(
    df_processed,
    "processed.dta",
    version=15,
    file_label=meta.get('file_label'),
    var_labels=var_labels
)

Stata Version Reference

Version Mapping

| Version | Internal Code | Year | Notes |
|---|---|---|---|
| 8–9 | 113 | 2003–2005 | Basic support |
| 10–11 | 114 | 2007–2009 | |
| 12 | 115 | 2011 | |
| 13 | 117 | 2013 | strL introduced* |
| 14 | 118 | 2015 | Unicode support |
| 15 | 119 | 2017 | Latest |

*strL (long strings >2045 bytes) can be read, but writing is currently unavailable due to a ReadStat library bug.

Specifying Version

# All equivalent ways to write Stata 14 format
write_dta(df, "out.dta", version=14)   # Recommended
write_dta(df, "out.dta", version=118)  # Internal code

Data Type Conversions

Reading (Stata → Python)

| Stata Type | Polars Type | Notes |
|---|---|---|
| byte | Float64 | Numeric with missings |
| int | Float64 | Numeric with missings |
| long | Float64 | Numeric with missings |
| float | Float64 | |
| double | Float64 | |
| str# | String | Fixed-width strings |
| strL | String | Long strings (read only)* |

*strL strings can be read but not currently written due to a ReadStat bug.
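
Because every numeric Stata type arrives as Float64, integer-valued columns
may need an explicit cast after reading; a minimal sketch (the column name
is illustrative):

from svy_io.stata import read_dta
import polars as pl

df, meta = read_dta("data.dta")
print(df.schema)  # all numeric columns report Float64

# Cast back to integers where appropriate (only safe when the column
# has no missing values and no fractional parts)
df = df.with_columns(pl.col("age").cast(pl.Int64))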

Writing (Python → Stata)

| Polars Type | Stata Type | Notes |
|---|---|---|
| Int8, Int16, Int32, Int64 | double | All integers → double |
| UInt8, UInt16, UInt32, UInt64 | double | Unsigned → double |
| Float32, Float64 | double | |
| Boolean | double | True=1, False=0 |
| String | str# | Max 2045 bytes |
| Date | double | With %td format |
| Datetime | double | With %tc format |
| Categorical | double | Not yet labeled* |
*Categorical → labeled integers not yet implemented.
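
A short demonstration of the write-side conversions (column names are
illustrative): booleans are stored as 0/1 doubles, and dates become
doubles carrying a %td display format.

import polars as pl
from datetime import date
from svy_io.stata import write_dta

df = pl.DataFrame({
    "enrolled": [True, False, True],                  # Boolean -> double (1/0)
    "visit": [date(2024, 1, 1), date(2024, 2, 1),
              date(2024, 3, 1)],                      # Date -> double with %td
    "count": [1, 2, 3],                               # Int64 -> double
})
write_dta(df, "types_demo.dta")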

Error Handling

Common Errors and Solutions

from svy_io.stata import write_dta
import polars as pl

df = pl.DataFrame({"text": ["A" * 3000]})

try:
    write_dta(df, "out.dta")
except ValueError as e:
    if "longer than 2045 bytes" in str(e):
        print("Error: String too long")
        print("\nSolutions:")
        print("1. Truncate: df.with_columns(pl.col('text').str.slice(0, 2045))")
        print("2. Use Parquet: df.write_parquet('out.parquet')")
        print("3. Split column: see documentation")
    elif "file_label must be 80" in str(e):
        print("Error: File label too long (max 80 characters)")
    else:
        raise

Validation Helper

def validate_for_stata(df: pl.DataFrame) -> list[str]:
    """Check if DataFrame can be written to Stata."""
    issues = []

    # Check string lengths
    for col in df.columns:
        if df[col].dtype == pl.Utf8:
            max_len = df[col].str.len_bytes().max()
            if max_len is not None and max_len > 2045:
                issues.append(f"Column '{col}' has strings up to {max_len} bytes (max: 2045)")

    # Check column names
    for col in df.columns:
        if len(col) > 32:
            issues.append(f"Column name '{col}' too long (max: 32 characters)")

    return issues

# Use it
issues = validate_for_stata(df)
if issues:
    print("Cannot write to Stata:")
    for issue in issues:
        print(f"  - {issue}")
else:
    write_dta(df, "output.dta")

Feature Support

| Feature | Read | Write | Notes |
|---|---|---|---|
| Basic data types | ✓ | ✓ | All numeric, string, boolean |
| Variable labels | ✓ | ✓ | Full roundtrip |
| File labels | ✓ | ✓ | Full roundtrip |
| Value labels | ✓ | ✗ | Read only (write pending) |
| Strings ≤2045 bytes | ✓ | ✓ | Full support |
| Strings >2045 bytes | ✓ | ✗ | ReadStat bug blocks write |
| Tagged missing | ✓ | ✗ | Read only (write pending) |
| Date/Datetime | ✓ | ✓ | With optional coercion |
| Notes | ✓ | ✗ | Read only |
| UTF-8 | ✓ | ✓ | Full Unicode support |
| All versions (8–15) | ✓ | ✓ | Complete support |

Performance Tips

# 1. Skip unnecessary columns
df, meta = read_dta("large_file.dta", cols_skip=["temp1", "temp2", "unused"])

# 2. Limit rows when exploring
df_sample, _ = read_dta("large_file.dta", n_max=1000)

# 3. Use Polars lazy evaluation for large files
df_lazy = pl.scan_csv("intermediate.csv")
result = (df_lazy
    .filter(pl.col("year") == 2024)
    .group_by("region")
    .agg(pl.col("value").mean())
    .collect()
)
write_dta(result, "summary.dta")
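
For files too large to process in one pass, the documented rows_skip and
n_max parameters can also drive a simple chunked read; a sketch (the chunk
size is arbitrary, and each call reopens the file):

from svy_io.stata import read_dta
import polars as pl

chunk_size = 100_000
offset = 0
parts = []
while True:
    chunk, _ = read_dta("large_file.dta", rows_skip=offset, n_max=chunk_size)
    if chunk.height == 0:
        break
    parts.append(chunk.filter(pl.col("year") == 2024))  # keep only what you need
    offset += chunk.height
    if chunk.height < chunk_size:  # a short chunk means end of file
        break
result = pl.concat(parts)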

Known Limitations

  1. strL Strings (>2045 bytes) — Cannot write due to ReadStat v1.1.9 bug. The library raises a clear error with workarounds.

  2. Value Labels on Write — Not yet implemented. You can read files with value labels but cannot create new ones when writing.

  3. Categorical Variables — Polars Categorical types are not automatically converted to Stata labeled integers. A manual workaround sketch follows this list.

  4. Tagged Missing Values on Write — Cannot write tagged missing values (e.g., .a, .b). These can be read from existing files.

  5. Variable Name Length — Stata has a 32-character limit on variable names.

  6. File Label Length — Maximum 80 characters for dataset-level labels.
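
For limitation 3, one manual workaround is to write the underlying category
codes and keep the code-to-label mapping alongside the file; a sketch
(svy_io does not do this automatically):

import polars as pl
from svy_io.stata import write_dta

df = pl.DataFrame({"region": ["north", "south", "north"]}).with_columns(
    pl.col("region").cast(pl.Categorical)
)

# Record the code -> label mapping before discarding the categorical dtype
codes = df["region"].to_physical()   # UInt32 category codes
labels = df["region"].cast(pl.Utf8)  # the label text
mapping = dict(zip(codes.to_list(), labels.to_list()))

# Write the integer codes (stored as double, per the conversion table)
write_dta(df.with_columns(pl.col("region").to_physical()), "regions.dta")
print(mapping)  # e.g. {0: 'north', 1: 'south'}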

See Also