SAS File I/O

Read and write .sas7bdat and .xpt files with full metadata support


The svy_io library provides comprehensive support for reading and writing SAS files (.sas7bdat and .xpt formats) through a clean, Pythonic API backed by the ReadStat C library.

Installation

pip install svy-io

Quick Start

Reading SAS Files

from svy_io.sas import read_sas, read_xpt

# Read a SAS7BDAT file
df, meta = read_sas("data.sas7bdat")

# Read a SAS Transport (XPT) file
df_xpt, meta_xpt = read_xpt("transport.xpt")

# df is a Polars DataFrame
print(df.head())

# meta contains file metadata
print(meta.keys())
# dict_keys(['file_label', 'vars', 'value_labels',
#            'user_missing', 'n_rows', 'tagged_missings'])

Writing SAS Transport Files

from svy_io.sas import write_xpt
import polars as pl

df = pl.DataFrame({
    "age": [25, 30, 35, 40],
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "score": [85.5, 92.0, 78.3, 88.9]
})

write_xpt(df, "output.xpt", version=8, label="Study Data 2024")

API Reference

read_sas()

Read a SAS7BDAT dataset file into a Polars DataFrame. Supports reading from zip archives.

Signature:

def read_sas(
    data_path: str,
    *,
    catalog_path: str | None = None,
    encoding: str | None = None,
    catalog_encoding: str | None = None,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0,
    coerce_temporals: bool = False,
    zap_empty_str: bool = False,
    factorize: bool = False,
    levels: str = "default",
    ordered: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| data_path | str | required | Path to SAS7BDAT file or zip archive |
| catalog_path | str \| None | None | Path to SAS7BCAT catalog for value labels |
| encoding | str \| None | None | Character encoding (e.g., "latin1", "utf-8") |
| catalog_encoding | str \| None | None | Encoding for the catalog file |
| cols_skip | list[str] \| None | None | Column names to skip |
| n_max | int \| None | None | Maximum rows to read |
| rows_skip | int | 0 | Rows to skip from the beginning |
| coerce_temporals | bool | False | Convert SAS dates to Python types |
| zap_empty_str | bool | False | Convert empty strings to None |
| factorize | bool | False | Convert labeled variables to categoricals |
| levels | str | "default" | Factor levels: "default", "labels", "values", "both" |
| ordered | bool | False | Whether categoricals should be ordered |

Returns:

A tuple of (df, meta) where:

  • df (pl.DataFrame): The data
  • meta (dict): Metadata dictionary containing:
| Key | Type | Description |
|---|---|---|
| file_label | str \| None | Dataset-level label |
| vars | list | Variable metadata (name, label, format) |
| value_labels | list | Value label sets from the catalog |
| user_missing | list | User-defined missing specifications |
| n_rows | int | Number of rows read |
| tagged_missings | list | Tagged missing value info (.A-.Z) |

Example:

# Basic read
df, meta = read_sas("survey.sas7bdat")

# Read with catalog for value labels
df, meta = read_sas("survey.sas7bdat", catalog_path="formats.sas7bcat")

# Read from zip archive (auto-extracts)
df, meta = read_sas("data.zip")

# Read with options
df, meta = read_sas(
    "survey.sas7bdat",
    catalog_path="formats.sas7bcat",
    cols_skip=["temp_var"],
    n_max=1000,
    coerce_temporals=True,
    factorize=True
)
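Once read, the metadata dictionary is plain Python and can be post-processed directly. A minimal sketch using a hypothetical meta dict shaped like the keys documented above (the variable names and values here are invented for illustration):

```python
# Hypothetical metadata, shaped like the `meta` dict documented above;
# the actual field values are invented for illustration.
meta = {
    "file_label": "Household survey",
    "n_rows": 2,
    "vars": [
        {"name": "age", "label": "Age in years", "format": "BEST12."},
        {"name": "income", "label": "Annual income", "format": "DOLLAR12.2"},
    ],
}

# Build a name -> label lookup from the variable metadata
labels = {v["name"]: v["label"] for v in meta["vars"]}
print(labels["age"])  # Age in years
```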

read_xpt()

Read a SAS Transport (XPT) file into a Polars DataFrame.

Signature:

def read_xpt(
    data_path: str | os.PathLike,
    *,
    n_max: int | None = None,
    coerce_temporals: bool = True,
    zap_empty_str: bool = False,
    factorize: bool = False,
    levels: str = "default",
    ordered: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| data_path | str \| PathLike | required | Path to the XPT file |
| n_max | int \| None | None | Maximum rows to read |
| coerce_temporals | bool | True | Convert SAS dates to Python types |
| zap_empty_str | bool | False | Convert empty strings to None |
| factorize | bool | False | Convert labeled variables to categoricals |
| levels | str | "default" | Factor levels handling |
| ordered | bool | False | Whether categoricals should be ordered |

Example:

# Read transport file
df, meta = read_xpt("data.xpt")

# Temporal coercion is on by default for XPT; disable it for raw numerics
df_raw, meta_raw = read_xpt("data.xpt", coerce_temporals=False)

write_xpt()

Write a Polars DataFrame to a SAS Transport (XPT) file.

Signature:

def write_xpt(
    df: pl.DataFrame,
    path: str | Path,
    *,
    version: int = 8,
    name: str | None = None,
    label: str | None = None,
    adjust_tz: bool = True
) -> None

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| df | pl.DataFrame | required | DataFrame to write |
| path | str \| Path | required | Output file path |
| version | int | 8 | XPT version: 5 or 8 |
| name | str \| None | None | Dataset name (max 8 chars for v5, 32 for v8) |
| label | str \| None | None | Dataset description (max 40 chars) |
| adjust_tz | bool | True | Adjust timezone for datetime columns |

Raises:

  • ValueError: If name/label exceed length limits
  • RuntimeError: If the writer encounters an error

Example:

# Write version 8 (recommended)
write_xpt(df, "clinical_trial.xpt", version=8, label="Phase II Results")

# Write version 5 (legacy compatibility)
write_xpt(df, "legacy.xpt", version=5, name="LEGACY")

read_sas_arrow()

Read a SAS7BDAT file into a PyArrow Table with preserved metadata.

Signature:

def read_sas_arrow(
    data_path: str,
    *,
    catalog_path: str | None = None,
    encoding: str | None = None,
    catalog_encoding: str | None = None,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0
) -> tuple[pa.Table, dict[str, Any]]

Example:

from svy_io.sas import read_sas_arrow

table, meta = read_sas_arrow("data.sas7bdat")

# Arrow metadata preserved in field metadata
for field in table.schema:
    print(f"{field.name}: {field.metadata}")

# Convert to Polars if needed
import polars as pl
df = pl.from_arrow(table)

Metadata Utility Functions

get_column_labels()

Extract variable labels from metadata.

from svy_io.sas import read_sas, get_column_labels

df, meta = read_sas("data.sas7bdat")
labels = get_column_labels(meta)
# {'age': 'Age in years', 'income': 'Annual income', ...}

get_value_labels_for_column()

Get value label mappings for a specific column.

from svy_io.sas import read_sas, get_value_labels_for_column

df, meta = read_sas("survey.sas7bdat", catalog_path="formats.sas7bcat")
gender_labels = get_value_labels_for_column(meta, "gender")
# {'1': 'Male', '2': 'Female', '3': 'Other'}
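The returned mapping is a plain dict, so recoding raw values is ordinary Python. A sketch using the mapping shown above, hard-coded here so the example stands on its own:

```python
# The mapping shown above, hard-coded for illustration
gender_labels = {"1": "Male", "2": "Female", "3": "Other"}

raw = ["1", "2", "2", "3", "1"]
# Fall back to the raw code when a value has no label
labeled = [gender_labels.get(v, v) for v in raw]
print(labeled)  # ['Male', 'Female', 'Female', 'Other', 'Male']
```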

get_tagged_na_info()

Extract information about tagged missing values (.A-.Z).

from svy_io.sas import read_sas, get_tagged_na_info

df, meta = read_sas("data.sas7bdat")
tagged_info = get_tagged_na_info(meta)
# {'age': ['A', 'B'], 'income': ['Z']}

Working with Value Labels

SAS stores value labels (formats) in separate .sas7bcat catalog files:

# Read with catalog for value labels
df, meta = read_sas("survey.sas7bdat", catalog_path="formats.sas7bcat")

# Inspect value labels
for vl in meta['value_labels']:
    print(f"\n{vl['set_name']}:")
    for value, label in vl['mapping'].items():
        print(f"  {value} = {label}")

# Auto-convert to categoricals
df, meta = read_sas(
    "survey.sas7bdat",
    catalog_path="formats.sas7bcat",
    factorize=True
)
print(df["gender"].dtype)  # Categorical

Temporal Data Handling

SAS stores dates as days since 1960-01-01 and datetimes as seconds since 1960-01-01:

# Without coercion (raw numeric values)
df, meta = read_sas("data.sas7bdat", coerce_temporals=False)
print(df["date_var"])  # 21915.0

# With coercion (Python dates)
df, meta = read_sas("data.sas7bdat", coerce_temporals=True)
print(df["date_var"])  # 2020-01-01

Format codes determine conversion:

  • DATE, MMDDYY, YYMMDD → pl.Date
  • DATETIME, DATETIME20 → pl.Datetime
  • TIME → pl.Duration
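The epoch arithmetic itself is simple. A stdlib-only sketch of the conversion (this mirrors what coerce_temporals does conceptually; it is not svy-io's implementation):

```python
from datetime import date, datetime, timedelta

SAS_EPOCH_DATE = date(1960, 1, 1)
SAS_EPOCH_DATETIME = datetime(1960, 1, 1)

def sas_days_to_date(days: float) -> date:
    """Convert a SAS date value (days since 1960-01-01) to a Python date."""
    return SAS_EPOCH_DATE + timedelta(days=days)

def sas_seconds_to_datetime(seconds: float) -> datetime:
    """Convert a SAS datetime value (seconds since 1960-01-01) to a datetime."""
    return SAS_EPOCH_DATETIME + timedelta(seconds=seconds)

print(sas_days_to_date(21915.0))  # 2020-01-01
```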

XPT Version Comparison

| Feature | XPT v5 | XPT v8 |
|---|---|---|
| Variable name length | 8 chars | 32 chars |
| String length | 200 chars | Unlimited |
| Character encoding | ASCII | UTF-8 |
| SAS compatibility | SAS 6+ | SAS 8+ |
| FDA acceptance | ✅ | ✅ |

Recommendation: Use version 8 unless you need compatibility with very old SAS installations.

Data Type Conversions

Reading (SAS → Python)

| SAS Type | Polars Type | Notes |
|---|---|---|
| Numeric | Float64 | All SAS numerics become Float64 |
| Character | Utf8 | String data |
| Date | Float64 or Date | Use coerce_temporals=True |
| Datetime | Float64 or Datetime | Use coerce_temporals=True |

Writing (Python → XPT)

| Polars Type | XPT Type | Notes |
|---|---|---|
| Int8–Int64, UInt8–UInt64 | Numeric (double) | All integers become double |
| Float32, Float64 | Numeric (double) | |
| Boolean | Numeric (double) | True=1, False=0 |
| Utf8 | Character | |
| Date, Datetime | Numeric (double) | Days/seconds since 1960-01-01 |
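The date/datetime encoding used on write is the reverse of the read-side epoch arithmetic. A stdlib-only sketch (illustrative, not svy-io's internals):

```python
from datetime import date, datetime

SAS_EPOCH_DATE = date(1960, 1, 1)
SAS_EPOCH_DATETIME = datetime(1960, 1, 1)

def date_to_sas_days(d: date) -> float:
    """Encode a Python date as a SAS date value (days since 1960-01-01)."""
    return float((d - SAS_EPOCH_DATE).days)

def datetime_to_sas_seconds(dt: datetime) -> float:
    """Encode a naive datetime as a SAS datetime value (seconds since 1960-01-01)."""
    return (dt - SAS_EPOCH_DATETIME).total_seconds()

print(date_to_sas_days(date(2020, 1, 1)))  # 21915.0
```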

FDA Submissions and CDISC

Creating CDISC-Compliant XPT Files

df_adsl = pl.DataFrame({
    "STUDYID": ["STUDY001"] * 100,
    "USUBJID": [f"001-{i:03d}" for i in range(1, 101)],
    "SUBJID": [f"{i:03d}" for i in range(1, 101)],
    "AGE": pl.Series([25 + i % 50 for i in range(100)], dtype=pl.Float64),
    "SEX": ["M", "F"] * 50,
    "ARM": ["PLACEBO", "TREATMENT"] * 50
})

write_xpt(
    df_adsl,
    "adsl.xpt",
    version=8,
    name="ADSL",
    label="Subject-Level Analysis Dataset"
)

CDISC Validation Helper

def validate_cdisc_xpt(df: pl.DataFrame, dataset_name: str) -> list[str]:
    """Validate DataFrame meets CDISC/FDA requirements."""
    issues = []

    # Required SDTM variables
    required = ["STUDYID", "USUBJID"]
    missing = [v for v in required if v not in df.columns]
    if missing:
        issues.append(f"Missing required: {', '.join(missing)}")

    # Variable name length (max 8 for CDISC)
    long_names = [c for c in df.columns if len(c) > 8]
    if long_names:
        issues.append(f"Names > 8 chars: {', '.join(long_names)}")

    # Dataset name length
    if len(dataset_name) > 8:
        issues.append(f"Dataset name '{dataset_name}' > 8 chars")

    return issues

issues = validate_cdisc_xpt(df_adsl, "ADSL")
if not issues:
    print("✅ CDISC-compliant")

Common Patterns

Converting SAS to Other Formats

df, meta = read_sas("data.sas7bdat", catalog_path="formats.sas7bcat")

# To CSV
df.write_csv("data.csv")

# To Parquet (preserves types)
df.write_parquet("data.parquet")

# To Stata
from svy_io.stata import write_dta
write_dta(df, "data.dta")

Preserving Metadata

from svy_io.sas import read_sas, write_xpt, get_column_labels

df, meta = read_sas("input.sas7bdat", catalog_path="formats.sas7bcat")

# Transform
df_filtered = df.filter(pl.col("valid") == 1)

# Write with preserved label
write_xpt(
    df_filtered,
    "output.xpt",
    label=meta.get('file_label', 'Processed Dataset')
)

Reading from Zip Archives

# Auto-extracts and reads SAS files from zip
df, meta = read_sas("data.zip")

# If zip contains both .sas7bdat and .sas7bcat, catalog is used automatically
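Under the hood, this presumably amounts to extracting the relevant archive members before handing them to ReadStat. A stdlib sketch of that idea (extract_sas_members is illustrative, not svy-io's actual implementation):

```python
import pathlib
import tempfile
import zipfile

def extract_sas_members(zip_path: str) -> list[pathlib.Path]:
    """Extract .sas7bdat/.sas7bcat members of a zip into a temp directory."""
    out = pathlib.Path(tempfile.mkdtemp())
    with zipfile.ZipFile(zip_path) as zf:
        members = [m for m in zf.namelist()
                   if m.endswith((".sas7bdat", ".sas7bcat"))]
        zf.extractall(out, members=members)
    return [out / m for m in members]

# Tiny demo archive with one SAS member and one unrelated file
demo_zip = pathlib.Path(tempfile.mkdtemp()) / "data.zip"
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("survey.sas7bdat", b"dummy bytes")
    zf.writestr("notes.txt", b"ignored")

extracted = extract_sas_members(str(demo_zip))
print([p.name for p in extracted])  # ['survey.sas7bdat']
```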

Performance Tips

# 1. Skip unnecessary columns
df, meta = read_sas("wide.sas7bdat", cols_skip=["temp1", "temp2"])

# 2. Limit rows for exploration
df, meta = read_sas("large.sas7bdat", n_max=1000)

# 3. Read metadata only
_, meta = read_sas("data.sas7bdat", n_max=0)

# 4. Use lazy evaluation
df, meta = read_sas("input.sas7bdat")
result = (df.lazy()
    .filter(pl.col("year") == 2024)
    .group_by("region")
    .agg(pl.col("value").sum())
    .collect()
)

Feature Support

| Feature | Read SAS7BDAT | Read XPT | Write XPT |
|---|---|---|---|
| Basic data types | ✅ | ✅ | ✅ |
| Variable labels | ✅ | ✅ | ❌ |
| File labels | ✅ | ✅ | ✅ |
| Value labels (formats) | ✅* | ✅* | ❌ |
| Date/Datetime | ✅ | ✅ | ✅ |
| Tagged missing (.A-.Z) | ✅ | ✅ | ❌ |
| Zip archives | ✅ | ❌ | ❌ |
| XPT v5 and v8 | N/A | ✅ | ✅ |

*Requires a .sas7bcat catalog file for SAS7BDAT

Known Limitations

  1. Compressed SAS7BDAT — ReadStat doesn’t support compressed files. Decompress in SAS first.

  2. SAS7BDAT Write — Not implemented. Use XPT for SAS-compatible output.

  3. Value Labels on Write — Cannot create labeled integers when writing XPT.

  4. Tagged Missing on Write — Cannot write .A-.Z missing values.

  5. Long Strings in XPT v5 — Maximum 200 characters. Use version 8 for longer strings.

  6. Encoding Detection — No automatic detection. Try common encodings: latin1, utf-8, cp1252.
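For limitation 6, a common tactic is to try a short list of likely encodings in order. A stdlib sketch of the idea as applied to raw bytes (decode_with_fallback is illustrative, not a svy-io API; the same try-in-order approach works when passing encoding= to read_sas):

```python
def decode_with_fallback(raw: bytes,
                         encodings=("utf-8", "latin1", "cp1252")) -> str:
    """Try each encoding in turn; as a last resort, replace undecodable bytes."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode(encodings[-1], errors="replace")

print(decode_with_fallback("café".encode("latin1")))  # café
```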

See Also