Stata File I/O

Read and write .dta files with full metadata support.

The svy_io library provides comprehensive support for reading and writing Stata .dta files through a clean, Pythonic API backed by the ReadStat C library.

Installation

```shell
pip install svy-io
```

Quick Start

Reading Stata Files

```python
from svy_io.stata import read_dta

# Read a Stata file
df, meta = read_dta("data.dta")

# df is a Polars DataFrame
print(df.head())

# meta contains file metadata
print(meta.keys())
# dict_keys(['file_label', 'vars', 'value_labels',
#            'user_missing', 'n_rows', 'tagged_missings', 'notes'])
```

Writing Stata Files
```python
from svy_io.stata import write_dta
import polars as pl

df = pl.DataFrame({
    "age": [25, 30, 35, 40],
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "score": [85.5, 92.0, 78.3, 88.9]
})

write_dta(df, "output.dta", version=15)
```

API Reference
read_dta()
Read a Stata .dta file into a Polars DataFrame.
Signature:
```python
def read_dta(
    data_path: str,
    *,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0,
    coerce_temporals: bool = False,
    zap_empty_str: bool = False,
    factorize: bool = False,
    levels: str = "default",
    ordered: bool = False,
) -> tuple[pl.DataFrame, dict[str, Any]]
```

Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `data_path` | `str` | required | Path to the Stata .dta file |
| `cols_skip` | `list[str] \| None` | `None` | Column names to skip during import |
| `n_max` | `int \| None` | `None` | Maximum number of rows to read |
| `rows_skip` | `int` | `0` | Number of rows to skip from the beginning |
| `coerce_temporals` | `bool` | `False` | Convert Stata date/datetime formats to Python date/datetime |
| `zap_empty_str` | `bool` | `False` | Convert empty strings to None |
| `factorize` | `bool` | `False` | Convert value-labeled variables to Polars categoricals |
| `levels` | `str` | `"default"` | How to handle factor levels (`"default"`, `"labels"`, `"values"`) |
| `ordered` | `bool` | `False` | Whether categorical variables should be ordered |
Returns:
A tuple of (df, meta) where:
- `df` (`pl.DataFrame`): The data
- `meta` (`dict`): Metadata dictionary containing:
| Key | Type | Description |
|---|---|---|
| `file_label` | `str \| None` | Dataset-level label/description |
| `vars` | `list` | Variable metadata (name, label, format, etc.) |
| `value_labels` | `list` | Value label sets (categorical mappings) |
| `user_missing` | `list` | User-defined missing value specifications |
| `n_rows` | `int` | Number of rows read |
| `tagged_missings` | `list` | Tagged missing value information |
| `notes` | `list` | Dataset notes/comments |
Example:
```python
# Read with options
df, meta = read_dta(
    "survey_data.dta",
    cols_skip=["temp_var", "id"],
    n_max=1000,
    rows_skip=10,
    coerce_temporals=True,
)

# Access metadata
print(f"Dataset: {meta['file_label']}")
for var in meta['vars']:
    print(f"  {var['name']}: {var.get('label', 'No label')}")
```

write_dta()
Write a Polars DataFrame to a Stata .dta file.
Signature:
```python
def write_dta(
    df: pl.DataFrame,
    path: str | os.PathLike | io.BufferedIOBase,
    *,
    version: int = 15,
    file_label: str | None = None,
    var_labels: dict[str, str] | None = None,
    value_labels: dict[str, dict] | None = None,
    strl_threshold: int = 2045,
    adjust_tz: bool = True,
    na_policy: str = "nan",
) -> pl.DataFrame
```

Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `df` | `pl.DataFrame` | required | DataFrame to write |
| `path` | `str \| PathLike \| BufferedIOBase` | required | Output file path or file-like object |
| `version` | `int` | `15` | Stata version (8–15, or internal codes 113–119) |
| `file_label` | `str \| None` | `None` | Dataset description (max 80 characters) |
| `var_labels` | `dict[str, str] \| None` | `None` | Variable labels: `{"var_name": "description"}` |
| `value_labels` | `dict[str, dict] \| None` | `None` | Value labels (not yet implemented) |
| `strl_threshold` | `int` | `2045` | Maximum string length before error (max 2045) |
| `adjust_tz` | `bool` | `True` | Adjust timezone for datetime columns |
| `na_policy` | `str` | `"nan"` | How to handle infinity: `"nan"`, `"error"`, or `"keep"` |
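Stata has no encoding for positive or negative infinity, which is what `na_policy` governs. The sketch below mimics the three policies with the standard library only; it is an illustration of the behavior described in the table, not svy_io's actual implementation:

```python
import math

def apply_nan_policy(values, policy="nan"):
    """Illustrative stand-in for na_policy: map +/-inf to NaN ("nan"),
    raise on infinity ("error"), or pass values through ("keep")."""
    out = []
    for v in values:
        if isinstance(v, float) and math.isinf(v):
            if policy == "error":
                raise ValueError("infinite value not representable in Stata")
            out.append(math.nan if policy == "nan" else v)
        else:
            out.append(v)
    return out

print(apply_nan_policy([1.0, math.inf, -math.inf]))  # [1.0, nan, nan]
```

With `policy="keep"` the infinities pass through unchanged, which only makes sense if a downstream consumer can handle them.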
Returns:
- `df` (`pl.DataFrame`): The input DataFrame (unmodified), for method chaining
Raises:
- `ValueError`: If strings exceed 2045 bytes, `file_label` exceeds 80 characters, or another validation check fails
- `RuntimeError`: If the underlying ReadStat library encounters an error
Example:
```python
write_dta(
    df,
    "output.dta",
    version=14,
    file_label="Survey Data 2024",
    var_labels={
        "age": "Age in years",
        "income": "Annual income (USD)",
        "region": "Geographic region",
    },
    na_policy="nan",
)
```

Common Usage Patterns
Reading and Processing
```python
# Read data
df, meta = read_dta("input.dta")

# Process with Polars
df = (
    df
    .filter(pl.col("age") > 25)
    .with_columns([
        (pl.col("income") * 1.1).alias("adjusted_income"),
        pl.col("date").str.to_date().alias("date_parsed"),
    ])
    .select(["id", "age", "adjusted_income", "date_parsed"])
)
print(df)
```

Preserving Metadata
```python
def transform_with_metadata(input_path, output_path, transform_fn):
    """Read, transform, and write while preserving metadata."""
    # Read with metadata
    df, meta = read_dta(input_path)

    # Apply transformation
    df = transform_fn(df)

    # Extract metadata for columns that still exist
    var_labels = {
        v['name']: v.get('label')
        for v in meta.get('vars', [])
        if v['name'] in df.columns
    }

    # Write with preserved metadata
    write_dta(
        df,
        output_path,
        file_label=meta.get('file_label'),
        var_labels=var_labels,
    )

# Use it
transform_with_metadata(
    "input.dta",
    "output.dta",
    lambda df: df.filter(pl.col("year") == 2024),
)
```

Creating Files from Scratch
```python
import polars as pl
from datetime import date

# Create data
df = pl.DataFrame({
    "id": range(1, 101),
    "name": [f"Person_{i}" for i in range(1, 101)],
    "treatment": ["A", "B"] * 50,
    "outcome": pl.Series(range(100), dtype=pl.Float64) * 1.5,
    "date": [date(2024, 1, 1)] * 100,
})

# Write with full metadata
write_dta(
    df,
    "experiment.dta",
    version=15,
    file_label="RCT Study - Treatment Effects 2024",
    var_labels={
        "id": "Participant identifier",
        "name": "Participant name",
        "treatment": "Treatment group assignment (A=control, B=treatment)",
        "outcome": "Primary outcome measure (standardized score)",
        "date": "Date of measurement",
    },
)
```

Working with Value Labels
```python
# Read data with value labels
df, meta = read_dta("survey.dta")

# Inspect value labels
for vl in meta['value_labels']:
    print(f"\n{vl['set_name']}:")
    for value, label in vl['mapping'].items():
        print(f"  {value} = {label}")

# Example output:
# education:
#   1 = Less than high school
#   2 = High school
#   3 = Some college
#   4 = Bachelor's degree
#   5 = Graduate degree

# Apply value labels to create readable data
if meta.get('value_labels'):
    for vl in meta['value_labels']:
        # Find which variable uses this label set
        var_name = next(
            (v['name'] for v in meta['vars']
             if v.get('label_set') == vl['set_name']),
            None
        )
        if var_name and var_name in df.columns:
            # Map numeric codes to labels; replace_strict allows the
            # dtype to change from integer codes to string labels
            mapping = {int(k): v for k, v in vl['mapping'].items()}
            df = df.with_columns(
                pl.col(var_name)
                .replace_strict(mapping, default=None)
                .alias(f"{var_name}_label")
            )
```

Handling Long Strings
Stata limits string fields to 2045 bytes. Here’s how to handle this:
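A caveat that applies to the truncation options below: Polars `str.slice` counts characters, not bytes, so multibyte text can still exceed the byte limit after slicing. A byte-safe alternative (illustrative, stdlib-only helper, not part of svy_io):

```python
def truncate_utf8(s: str, max_bytes: int = 2045) -> str:
    """Truncate to at most max_bytes of UTF-8 without splitting
    a multibyte character."""
    encoded = s.encode("utf-8")[:max_bytes]
    # errors="ignore" drops any partial trailing character
    return encoded.decode("utf-8", errors="ignore")

# 2000 two-byte characters = 4000 bytes; the result fits in 2045 bytes
print(len(truncate_utf8("é" * 2000).encode("utf-8")))  # 2044
```

Applied per value (for example via `map_elements`), this guarantees the 2045-byte limit is respected for any input.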
```python
# Check string lengths
max_lengths = {
    col: df[col].str.len_bytes().max()
    for col in df.columns
    if df[col].dtype == pl.Utf8
}

print("String column lengths:")
for col, max_len in max_lengths.items():
    status = "✓" if max_len <= 2045 else "✗ TOO LONG"
    print(f"  {col}: {max_len} bytes {status}")

# Option 1: Truncate
# (str.slice counts characters; very long multibyte text may still
# exceed the byte limit)
df = df.with_columns([
    pl.col(col).str.slice(0, 2045).alias(col)
    for col, length in max_lengths.items()
    if length > 2045
])

# Option 2: Use an alternative format
if any(length > 2045 for length in max_lengths.values()):
    print("Strings too long for Stata, using Parquet instead")
    df.write_parquet("data.parquet")

# Option 3: Split into multiple columns
df = df.with_columns([
    pl.col("long_text").str.slice(0, 2045).alias("text_part1"),
    pl.col("long_text").str.slice(2045, 2045).alias("text_part2"),
])
```

Roundtrip Workflow
```python
# Read original
df_original, meta = read_dta("original.dta")

# Process
df_processed = (
    df_original
    .filter(pl.col("valid") == 1)
    .with_columns([
        (pl.col("value") * 1.05).alias("adjusted_value"),
    ])
)

# Write back with original metadata
var_labels = {v['name']: v.get('label') for v in meta['vars']}
var_labels['adjusted_value'] = "Value adjusted by 5%"

write_dta(
    df_processed,
    "processed.dta",
    version=15,
    file_label=meta.get('file_label'),
    var_labels=var_labels,
)
```

Stata Version Reference
Version Mapping
| Version | Internal Code | Year | Notes |
|---|---|---|---|
| 8 | 113 | 2003 | Basic support |
| 9–10 | 113–114 | 2005–2007 | |
| 11 | 114 | 2009 | |
| 12 | 115 | 2011 | |
| 13 | 117 | 2013 | strL introduced* |
| 14 | 118 | 2015 | Unicode support |
| 15 | 119 | 2017 | Latest |
*strL (long strings >2045 bytes) is currently unavailable due to a ReadStat library bug.
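Per the table, the user-facing version number and the internal format code are interchangeable when calling `write_dta`. A small lookup makes the normalization explicit (illustrative helper, not part of svy_io; the mapping follows the table above):

```python
# Stata release -> internal .dta format code, per the version table
VERSION_TO_FORMAT = {
    8: 113, 9: 113,
    10: 114, 11: 114,
    12: 115, 13: 117,
    14: 118, 15: 119,
}

def to_internal_code(version: int) -> int:
    """Normalize a user-facing version (8-15) or an internal
    code (113-119) to the internal format code."""
    if version >= 100:  # already an internal code
        return version
    return VERSION_TO_FORMAT[version]

print(to_internal_code(14))   # 118
print(to_internal_code(118))  # 118
```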
Specifying Version
```python
# Two equivalent ways to write Stata 14 format
write_dta(df, "out.dta", version=14)   # Recommended
write_dta(df, "out.dta", version=118)  # Internal code
```

Data Type Conversions
Reading (Stata → Python)
| Stata Type | Polars Type | Notes |
|---|---|---|
| byte | Float64 | Numeric with missings |
| int | Float64 | Numeric with missings |
| long | Float64 | Numeric with missings |
| float | Float64 | |
| double | Float64 | |
| str# | String | Fixed-width strings |
| strL | String | Long strings (read only)* |
*strL strings can be read but not currently written due to a ReadStat bug.
Writing (Python → Stata)
| Polars Type | Stata Type | Notes |
|---|---|---|
| Int8, Int16, Int32, Int64 | double | All integers → double |
| UInt8, UInt16, UInt32, UInt64 | double | Unsigned → double |
| Float32, Float64 | double | |
| Boolean | double | True=1, False=0 |
| String | str# | Max 2045 bytes |
| Date | double | With %td format |
| Datetime | double | With %tc format |
| Categorical | double | Not yet labeled* |
*Categorical → labeled integers not yet implemented.
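Dates and datetimes map onto doubles because Stata's %td format counts days since 1960-01-01 and %tc counts milliseconds since 1960-01-01 00:00:00 (ignoring leap seconds). The arithmetic can be sanity-checked with the standard library; these helpers are illustrative, not svy_io internals:

```python
from datetime import date, datetime

STATA_EPOCH_DATE = date(1960, 1, 1)
STATA_EPOCH_DT = datetime(1960, 1, 1)

def to_stata_td(d: date) -> int:
    """%td value: days since 1960-01-01."""
    return (d - STATA_EPOCH_DATE).days

def to_stata_tc(dt: datetime) -> float:
    """%tc value: milliseconds since 1960-01-01 00:00:00."""
    return (dt - STATA_EPOCH_DT).total_seconds() * 1000.0

print(to_stata_td(date(1960, 1, 2)))  # 1
print(to_stata_td(date(2024, 1, 1)))  # 23376
```

This is also the offset to keep in mind when comparing raw unformatted values in Stata against Unix-epoch timestamps.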
Error Handling
Common Errors and Solutions
```python
from svy_io.stata import write_dta
import polars as pl

df = pl.DataFrame({"text": ["A" * 3000]})

try:
    write_dta(df, "out.dta")
except ValueError as e:
    if "longer than 2045 bytes" in str(e):
        print("Error: String too long")
        print("\nSolutions:")
        print("1. Truncate: df.with_columns(pl.col('text').str.slice(0, 2045))")
        print("2. Use Parquet: df.write_parquet('out.parquet')")
        print("3. Split column: see documentation")
    elif "file_label must be 80" in str(e):
        print("Error: File label too long (max 80 characters)")
    else:
        raise
```

Validation Helper
```python
def validate_for_stata(df: pl.DataFrame) -> list[str]:
    """Check if DataFrame can be written to Stata."""
    issues = []

    # Check string lengths
    for col in df.columns:
        if df[col].dtype == pl.Utf8:
            max_len = df[col].str.len_bytes().max()
            if max_len is not None and max_len > 2045:
                issues.append(
                    f"Column '{col}' has strings up to {max_len} bytes (max: 2045)"
                )

    # Check column names
    for col in df.columns:
        if len(col) > 32:
            issues.append(f"Column name '{col}' too long (max: 32 characters)")

    return issues

# Use it
issues = validate_for_stata(df)
if issues:
    print("Cannot write to Stata:")
    for issue in issues:
        print(f"  - {issue}")
else:
    write_dta(df, "output.dta")
```

Feature Support
| Feature | Read | Write | Notes |
|---|---|---|---|
| Basic data types | ✅ | ✅ | All numeric, string, boolean |
| Variable labels | ✅ | ✅ | Full roundtrip |
| File labels | ✅ | ✅ | Full roundtrip |
| Value labels | ✅ | ❌ | Read only (write pending) |
| Strings ≤2045 bytes | ✅ | ✅ | Full support |
| Strings >2045 bytes | ✅ | ❌ | ReadStat bug blocks write |
| Tagged missing | ✅ | ❌ | Read only (write pending) |
| Date/Datetime | ✅ | ✅ | With optional coercion |
| Notes | ✅ | ❌ | Read only |
| UTF-8 | ✅ | ✅ | Full Unicode support |
| All versions (8–15) | ✅ | ✅ | Complete support |
Performance Tips
```python
# 1. Skip unnecessary columns
df, meta = read_dta("large_file.dta", cols_skip=["temp1", "temp2", "unused"])

# 2. Limit rows when exploring
df_sample, _ = read_dta("large_file.dta", n_max=1000)

# 3. Use Polars lazy evaluation for large files
df_lazy = pl.scan_csv("intermediate.csv")
result = (
    df_lazy
    .filter(pl.col("year") == 2024)
    .group_by("region")
    .agg(pl.col("value").mean())
    .collect()
)
write_dta(result, "summary.dta")
```

Known Limitations
- strL Strings (>2045 bytes) — Cannot be written due to a ReadStat v1.1.9 bug. The library raises a clear error with workarounds.
- Value Labels on Write — Not yet implemented. You can read files with value labels but cannot create new ones when writing.
- Categorical Variables — Polars Categorical types are not automatically converted to Stata labeled integers.
- Tagged Missing Values on Write — Cannot write tagged missing values (e.g., `.a`, `.b`). These can be read from existing files.
- Variable Name Length — Stata limits variable names to 32 characters.
- File Label Length — Dataset-level labels are limited to 80 characters.
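When column names exceed the 32-character limit, they must be shortened before writing. One way to do that while keeping the shortened names unique (illustrative stdlib helper, not part of svy_io):

```python
def stata_safe_names(names, max_len=32):
    """Truncate names to Stata's 32-character limit, appending a
    numeric suffix when truncation would create duplicates."""
    seen, out = set(), []
    for name in names:
        base = name[:max_len]
        candidate, i = base, 1
        while candidate in seen:
            suffix = f"_{i}"
            candidate = base[:max_len - len(suffix)] + suffix
            i += 1
        seen.add(candidate)
        out.append(candidate)
    return out

print(stata_safe_names(["x" * 40, "x" * 40, "ok"]))
```

Pair this with `df.rename(dict(zip(df.columns, stata_safe_names(df.columns))))` before calling `write_dta`, and record the original names in `var_labels` so no information is lost.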
See Also
- SPSS File I/O — Read and write SPSS `.sav` files
- SAS File I/O — Read and write SAS `.sas7bdat` files
- Polars Documentation — DataFrame operations
- ReadStat — Upstream C library