# SAS File I/O
Read and write .sas7bdat and .xpt files with full metadata support
The svy_io library provides comprehensive support for reading and writing SAS files (.sas7bdat and .xpt formats) through a clean, Pythonic API backed by the ReadStat C library.
## Installation
```bash
pip install svy-io
```

## Quick Start
### Reading SAS Files

```python
from svy_io.sas import read_sas, read_xpt

# Read a SAS7BDAT file
df, meta = read_sas("data.sas7bdat")

# Read a SAS Transport (XPT) file
df_xpt, meta_xpt = read_xpt("transport.xpt")

# df is a Polars DataFrame
print(df.head())

# meta contains file metadata
print(meta.keys())
# dict_keys(['file_label', 'vars', 'value_labels',
#            'user_missing', 'n_rows', 'tagged_missings'])
```
### Writing SAS Transport Files

```python
from svy_io.sas import write_xpt
import polars as pl
df = pl.DataFrame({
    "age": [25, 30, 35, 40],
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "score": [85.5, 92.0, 78.3, 88.9]
})

write_xpt(df, "output.xpt", version=8, label="Study Data 2024")
```

## API Reference
### read_sas()
Read a SAS7BDAT dataset file into a Polars DataFrame. Supports reading from zip archives.
**Signature:**

```python
def read_sas(
    data_path: str,
    *,
    catalog_path: str | None = None,
    encoding: str | None = None,
    catalog_encoding: str | None = None,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0,
    coerce_temporals: bool = False,
    zap_empty_str: bool = False,
    factorize: bool = False,
    levels: str = "default",
    ordered: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]
```

**Parameters:**
| Parameter | Type | Default | Description |
|---|---|---|---|
| `data_path` | `str` | required | Path to SAS7BDAT file or zip archive |
| `catalog_path` | `str \| None` | `None` | Path to SAS7BCAT catalog for value labels |
| `encoding` | `str \| None` | `None` | Character encoding (e.g., `"latin1"`, `"utf-8"`) |
| `catalog_encoding` | `str \| None` | `None` | Encoding for the catalog file |
| `cols_skip` | `list[str] \| None` | `None` | Column names to skip |
| `n_max` | `int \| None` | `None` | Maximum rows to read |
| `rows_skip` | `int` | `0` | Rows to skip from the beginning |
| `coerce_temporals` | `bool` | `False` | Convert SAS dates to Python types |
| `zap_empty_str` | `bool` | `False` | Convert empty strings to `None` |
| `factorize` | `bool` | `False` | Convert labeled variables to categoricals |
| `levels` | `str` | `"default"` | Factor levels: `"default"`, `"labels"`, `"values"`, `"both"` |
| `ordered` | `bool` | `False` | Whether categoricals should be ordered |
**Returns:**

A tuple of `(df, meta)` where:

- `df` (`pl.DataFrame`): the data
- `meta` (`dict`): metadata dictionary containing:
| Key | Type | Description |
|---|---|---|
| `file_label` | `str \| None` | Dataset-level label |
| `vars` | `list` | Variable metadata (name, label, format) |
| `value_labels` | `list` | Value label sets from catalog |
| `user_missing` | `list` | User-defined missing specifications |
| `n_rows` | `int` | Number of rows read |
| `tagged_missings` | `list` | Tagged missing value info (`.A`–`.Z`) |
**Example:**

```python
# Basic read
df, meta = read_sas("survey.sas7bdat")
# Read with catalog for value labels
df, meta = read_sas("survey.sas7bdat", catalog_path="formats.sas7bcat")
# Read from zip archive (auto-extracts)
df, meta = read_sas("data.zip")
# Read with options
df, meta = read_sas(
    "survey.sas7bdat",
    catalog_path="formats.sas7bcat",
    cols_skip=["temp_var"],
    n_max=1000,
    coerce_temporals=True,
    factorize=True
)
```

### read_xpt()
Read a SAS Transport (XPT) file into a Polars DataFrame.
**Signature:**

```python
def read_xpt(
    data_path: str | os.PathLike,
    *,
    n_max: int | None = None,
    coerce_temporals: bool = True,
    zap_empty_str: bool = False,
    factorize: bool = False,
    levels: str = "default",
    ordered: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]
```

**Parameters:**
| Parameter | Type | Default | Description |
|---|---|---|---|
| `data_path` | `str \| PathLike` | required | Path to the XPT file |
| `n_max` | `int \| None` | `None` | Maximum rows to read |
| `coerce_temporals` | `bool` | `True` | Convert SAS dates to Python types |
| `zap_empty_str` | `bool` | `False` | Convert empty strings to `None` |
| `factorize` | `bool` | `False` | Convert labeled variables to categoricals |
| `levels` | `str` | `"default"` | Factor levels handling |
| `ordered` | `bool` | `False` | Whether categoricals should be ordered |
**Example:**

```python
# Read transport file
df, meta = read_xpt("data.xpt")
# Read with temporal coercion (recommended)
df, meta = read_xpt("data.xpt", coerce_temporals=True)
```

### write_xpt()
Write a Polars DataFrame to a SAS Transport (XPT) file.
**Signature:**

```python
def write_xpt(
    df: pl.DataFrame,
    path: str | Path,
    *,
    version: int = 8,
    name: str | None = None,
    label: str | None = None,
    adjust_tz: bool = True
) -> None
```

**Parameters:**
| Parameter | Type | Default | Description |
|---|---|---|---|
| `df` | `pl.DataFrame` | required | DataFrame to write |
| `path` | `str \| Path` | required | Output file path |
| `version` | `int` | `8` | XPT version: 5 or 8 |
| `name` | `str \| None` | `None` | Dataset name (max 8 chars in v5, 32 in v8) |
| `label` | `str \| None` | `None` | Dataset description (max 40 chars) |
| `adjust_tz` | `bool` | `True` | Adjust timezone for datetime columns |
**Raises:**
- `ValueError`: If `name`/`label` exceed length limits
- `RuntimeError`: If the writer encounters an error
**Example:**

```python
# Write version 8 (recommended)
write_xpt(df, "clinical_trial.xpt", version=8, label="Phase II Results")
# Write version 5 (legacy compatibility)
write_xpt(df, "legacy.xpt", version=5, name="LEGACY")
```
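Because `write_xpt()` raises `ValueError` when the limits above are exceeded, an over-long label fails fast. A minimal sketch (file name illustrative):

```python
import polars as pl
from svy_io.sas import write_xpt

df = pl.DataFrame({"x": [1.0, 2.0]})

# Dataset labels are capped at 40 characters, so this should raise
try:
    write_xpt(df, "bad.xpt", label="X" * 41)
except ValueError as err:
    print(f"Rejected: {err}")
```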
### read_sas_arrow()

Read a SAS7BDAT file into a PyArrow Table with preserved metadata.
**Signature:**

```python
def read_sas_arrow(
    data_path: str,
    *,
    catalog_path: str | None = None,
    encoding: str | None = None,
    catalog_encoding: str | None = None,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0
) -> tuple[pa.Table, dict[str, Any]]
```

**Example:**

```python
from svy_io.sas import read_sas_arrow
table, meta = read_sas_arrow("data.sas7bdat")
# Arrow metadata preserved in field metadata
for field in table.schema:
    print(f"{field.name}: {field.metadata}")
# Convert to Polars if needed
import polars as pl
df = pl.from_arrow(table)
```

## Metadata Utility Functions
### get_column_labels()
Extract variable labels from metadata.

```python
from svy_io.sas import read_sas, get_column_labels
df, meta = read_sas("data.sas7bdat")
labels = get_column_labels(meta)
# {'age': 'Age in years', 'income': 'Annual income', ...}
```

### get_value_labels_for_column()
Get value label mappings for a specific column.

```python
from svy_io.sas import read_sas, get_value_labels_for_column
df, meta = read_sas("survey.sas7bdat", catalog_path="formats.sas7bcat")
gender_labels = get_value_labels_for_column(meta, "gender")
# {'1': 'Male', '2': 'Female', '3': 'Other'}
```

### get_tagged_na_info()
Extract information about tagged missing values (`.A`–`.Z`).

```python
from svy_io.sas import read_sas, get_tagged_na_info
df, meta = read_sas("data.sas7bdat")
tagged_info = get_tagged_na_info(meta)
# {'age': ['A', 'B'], 'income': ['Z']}
```

## Working with Value Labels
SAS stores value labels (formats) in separate `.sas7bcat` catalog files:

```python
# Read with catalog for value labels
df, meta = read_sas("survey.sas7bdat", catalog_path="formats.sas7bcat")
# Inspect value labels
for vl in meta['value_labels']:
    print(f"\n{vl['set_name']}:")
    for value, label in vl['mapping'].items():
        print(f"  {value} = {label}")
# Auto-convert to categoricals
df, meta = read_sas(
    "survey.sas7bdat",
    catalog_path="formats.sas7bcat",
    factorize=True
)
print(df["gender"].dtype)  # Categorical
```

## Temporal Data Handling
SAS stores dates as days since 1960-01-01 and datetimes as seconds since 1960-01-01:

```python
# Without coercion (raw numeric values)
df, meta = read_sas("data.sas7bdat", coerce_temporals=False)
print(df["date_var"]) # 21915.0
# With coercion (Python dates)
df, meta = read_sas("data.sas7bdat", coerce_temporals=True)
print(df["date_var"])  # 2020-01-01
```

Format codes determine the conversion:

- `DATE`, `MMDDYY`, `YYMMDD` → `pl.Date`
- `DATETIME`, `DATETIME20` → `pl.Datetime`
- `TIME` → `pl.Duration`
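The coercion is plain epoch arithmetic, which you can verify by hand against the raw value shown above:

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)

# A SAS date is a day count from the 1960 epoch
print(SAS_EPOCH + timedelta(days=21915))     # 2020-01-01

# The reverse: a date's SAS value is its offset from the epoch
print((date(2020, 1, 1) - SAS_EPOCH).days)   # 21915
```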
## XPT Version Comparison
| Feature | XPT v5 | XPT v8 |
|---|---|---|
| Variable name length | 8 chars | 32 chars |
| String length | 200 chars | Unlimited |
| Character encoding | ASCII | UTF-8 |
| SAS compatibility | SAS 6+ | SAS 8+ |
| FDA acceptance | ✅ | ✅ |
**Recommendation:** Use version 8 unless you need compatibility with very old SAS installations.
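If you do need version 5, a pre-flight check against the limits in the table can catch violations before writing. The helper below is a sketch, not part of `svy_io`:

```python
import polars as pl

def check_xpt_v5(df: pl.DataFrame) -> list[str]:
    """Flag columns that would violate the XPT v5 limits above."""
    problems = []
    for col in df.columns:
        # v5 caps variable names at 8 characters
        if len(col) > 8:
            problems.append(f"{col}: name exceeds 8 characters")
        # v5 caps string values at 200 characters
        if df.schema[col] == pl.Utf8:
            longest = df[col].str.len_chars().max()
            if longest is not None and longest > 200:
                problems.append(f"{col}: strings exceed 200 characters")
    return problems
```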
## Data Type Conversions
### Reading (SAS → Python)
| SAS Type | Polars Type | Notes |
|---|---|---|
| Numeric | Float64 | All SAS numeric → Float64 |
| Character | Utf8 | String data |
| Date | Float64 or Date | Use coerce_temporals=True |
| Datetime | Float64 or Datetime | Use coerce_temporals=True |
### Writing (Python → XPT)
| Polars Type | XPT Type | Notes |
|---|---|---|
| Int8–Int64, UInt8–UInt64 | Numeric (double) | All integers → double |
| Float32, Float64 | Numeric (double) | |
| Boolean | Numeric (double) | True=1, False=0 |
| Utf8 | Character | |
| Date, Datetime | Numeric (double) | Days/seconds since 1960 |
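One practical consequence of these mappings: integer and boolean columns do not survive a round trip as their original types. A quick sketch (file name illustrative):

```python
import polars as pl
from svy_io.sas import read_xpt, write_xpt

df = pl.DataFrame({
    "n": pl.Series([1, 2, 3], dtype=pl.Int32),
    "flag": [True, False, True],
})

write_xpt(df, "roundtrip.xpt", version=8)
back, _meta = read_xpt("roundtrip.xpt")

# Both columns come back as Float64: XPT has a single numeric type
print(back.schema)  # e.g. {'n': Float64, 'flag': Float64}
```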
## FDA Submissions and CDISC
### Creating CDISC-Compliant XPT Files

```python
import polars as pl
from svy_io.sas import write_xpt

df_adsl = pl.DataFrame({
    "STUDYID": ["STUDY001"] * 100,
    "USUBJID": [f"001-{i:03d}" for i in range(1, 101)],
    "SUBJID": [f"{i:03d}" for i in range(1, 101)],
    "AGE": pl.Series([25 + i % 50 for i in range(100)], dtype=pl.Float64),
    "SEX": ["M", "F"] * 50,
    "ARM": ["PLACEBO", "TREATMENT"] * 50
})
write_xpt(
    df_adsl,
    "adsl.xpt",
    version=8,
    name="ADSL",
    label="Subject-Level Analysis Dataset"
)
```

### CDISC Validation Helper

```python
def validate_cdisc_xpt(df: pl.DataFrame, dataset_name: str) -> list[str]:
    """Validate DataFrame meets CDISC/FDA requirements."""
    issues = []

    # Required SDTM variables
    required = ["STUDYID", "USUBJID"]
    missing = [v for v in required if v not in df.columns]
    if missing:
        issues.append(f"Missing required: {', '.join(missing)}")

    # Variable name length (max 8 for CDISC)
    long_names = [c for c in df.columns if len(c) > 8]
    if long_names:
        issues.append(f"Names > 8 chars: {', '.join(long_names)}")

    # Dataset name length
    if len(dataset_name) > 8:
        issues.append(f"Dataset name '{dataset_name}' > 8 chars")

    return issues
issues = validate_cdisc_xpt(df_adsl, "ADSL")
if not issues:
    print("✅ CDISC-compliant")
```

## Common Patterns
### Converting SAS to Other Formats

```python
df, meta = read_sas("data.sas7bdat", catalog_path="formats.sas7bcat")
# To CSV
df.write_csv("data.csv")
# To Parquet (preserves types)
df.write_parquet("data.parquet")
# To Stata
from svy_io.stata import write_dta
write_dta(df, "data.dta")
```

### Preserving Metadata

```python
import polars as pl
from svy_io.sas import read_sas, write_xpt, get_column_labels
df, meta = read_sas("input.sas7bdat", catalog_path="formats.sas7bcat")
# Transform
df_filtered = df.filter(pl.col("valid") == 1)
# Write with preserved label
write_xpt(
    df_filtered,
    "output.xpt",
    label=meta.get('file_label', 'Processed Dataset')
)
```

### Reading from Zip Archives

```python
# Auto-extracts and reads SAS files from zip
df, meta = read_sas("data.zip")
# If zip contains both .sas7bdat and .sas7bcat, catalog is used automatically
```

## Performance Tips

```python
import polars as pl
from svy_io.sas import read_sas

# 1. Skip unnecessary columns
df, meta = read_sas("wide.sas7bdat", cols_skip=["temp1", "temp2"])
# 2. Limit rows for exploration
df, meta = read_sas("large.sas7bdat", n_max=1000)
# 3. Read metadata only
_, meta = read_sas("data.sas7bdat", n_max=0)
# 4. Use lazy evaluation
df, meta = read_sas("input.sas7bdat")
result = (
    df.lazy()
    .filter(pl.col("year") == 2024)
    .group_by("region")
    .agg(pl.col("value").sum())
    .collect()
)
```

## Feature Support
| Feature | Read SAS7BDAT | Read XPT | Write XPT |
|---|---|---|---|
| Basic data types | ✅ | ✅ | ✅ |
| Variable labels | ✅ | ✅ | ❌ |
| File labels | ✅ | ✅ | ✅ |
| Value labels (formats) | ✅* | ✅* | ❌ |
| Date/Datetime | ✅ | ✅ | ✅ |
| Tagged missing (.A-.Z) | ✅ | ✅ | ❌ |
| Zip archives | ✅ | ❌ | ❌ |
| XPT v5 and v8 | N/A | ✅ | ✅ |
*Requires .sas7bcat catalog file for SAS7BDAT
## Known Limitations
- **Compressed SAS7BDAT** — ReadStat doesn't support compressed files. Decompress in SAS first.
- **SAS7BDAT Write** — Not implemented. Use XPT for SAS-compatible output.
- **Value Labels on Write** — Cannot create labeled integers when writing XPT.
- **Tagged Missing on Write** — Cannot write `.A`–`.Z` missing values.
- **Long Strings in XPT v5** — Maximum 200 characters. Use version 8 for longer strings.
- **Encoding Detection** — No automatic detection. Try common encodings: `latin1`, `utf-8`, `cp1252`.
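A common workaround for the value-label limitation is to bake the labels into the data before writing. The sketch below assumes the coded column holds string codes matching the mapping keys, as in the `get_value_labels_for_column()` example above:

```python
import polars as pl
from svy_io.sas import get_value_labels_for_column, read_sas, write_xpt

df, meta = read_sas("survey.sas7bdat", catalog_path="formats.sas7bcat")

# Swap codes for label text, since XPT output cannot carry the mapping.
# Assumes "gender" is coded as strings ("1", "2", ...) matching the keys.
mapping = get_value_labels_for_column(meta, "gender")
df = df.with_columns(pl.col("gender").cast(pl.Utf8).replace(mapping))

write_xpt(df, "survey_labeled.xpt", version=8)
```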
## See Also
- Stata File I/O — Read and write Stata `.dta` files
- SPSS File I/O — Read and write SPSS `.sav` files
- ReadStat — Upstream C library
- CDISC Standards — Clinical data interchange
- Polars Documentation — DataFrame operations