# SAS File I/O
Read and write .sas7bdat and .xpt files with full metadata support
The svy_io library provides comprehensive support for reading and writing SAS files (.sas7bdat and .xpt formats) through a clean, Pythonic API backed by the ReadStat C library.
## Installation
```bash
pip install svy-io
```

## Quick Start
### Reading SAS Files

```python
from svy_io.sas import read_sas, read_xpt

# Read a SAS7BDAT file
df, meta = read_sas("data.sas7bdat")

# Read a SAS Transport (XPT) file
df_xpt, meta_xpt = read_xpt("transport.xpt")

# df is a Polars DataFrame
print(df.head())

# meta contains file metadata
print(meta.keys())
# dict_keys(['file_label', 'vars', 'value_labels',
#            'user_missing', 'n_rows', 'tagged_missings'])
```
### Writing SAS Transport Files

```python
from svy_io.sas import write_xpt
import polars as pl
df = pl.DataFrame({
    "age": [25, 30, 35, 40],
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "score": [85.5, 92.0, 78.3, 88.9]
})

write_xpt(df, "output.xpt", version=8, label="Study Data 2024")
```

## API Reference
### read_sas()
Read a SAS7BDAT dataset file into a Polars DataFrame. Supports reading from zip archives.
**Signature:**

```python
def read_sas(
    data_path: str,
    *,
    catalog_path: str | None = None,
    encoding: str | None = None,
    catalog_encoding: str | None = None,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0,
    coerce_temporals: bool = False,
    zap_empty_str: bool = False,
    factorize: bool = False,
    levels: str = "default",
    ordered: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]
```

**Parameters:**
| Parameter | Type | Default | Description |
|---|---|---|---|
| `data_path` | `str` | required | Path to SAS7BDAT file or zip archive |
| `catalog_path` | `str \| None` | `None` | Path to SAS7BCAT catalog for value labels |
| `encoding` | `str \| None` | `None` | Character encoding (e.g., `"latin1"`, `"utf-8"`) |
| `catalog_encoding` | `str \| None` | `None` | Encoding for the catalog file |
| `cols_skip` | `list[str] \| None` | `None` | Column names to skip |
| `n_max` | `int \| None` | `None` | Maximum rows to read |
| `rows_skip` | `int` | `0` | Rows to skip from the beginning |
| `coerce_temporals` | `bool` | `False` | Convert SAS dates to Python types |
| `zap_empty_str` | `bool` | `False` | Convert empty strings to `None` |
| `factorize` | `bool` | `False` | Convert labeled variables to categoricals |
| `levels` | `str` | `"default"` | Factor levels: `"default"`, `"labels"`, `"values"`, `"both"` |
| `ordered` | `bool` | `False` | Whether categoricals should be ordered |
**Returns:**

A tuple of `(df, meta)` where:

- `df` (`pl.DataFrame`): the data
- `meta` (`dict`): metadata dictionary containing:
| Key | Type | Description |
|---|---|---|
| `file_label` | `str \| None` | Dataset-level label |
| `vars` | `list` | Variable metadata (name, label, format) |
| `value_labels` | `list` | Value label sets from catalog |
| `user_missing` | `list` | User-defined missing specifications |
| `n_rows` | `int` | Number of rows read |
| `tagged_missings` | `list` | Tagged missing value info (`.A`–`.Z`) |
**Example:**

```python
# Basic read
df, meta = read_sas("survey.sas7bdat")
# Read with catalog for value labels
df, meta = read_sas("survey.sas7bdat", catalog_path="formats.sas7bcat")
# Read from zip archive (auto-extracts)
df, meta = read_sas("data.zip")
# Read with options
df, meta = read_sas(
    "survey.sas7bdat",
    catalog_path="formats.sas7bcat",
    cols_skip=["temp_var"],
    n_max=1000,
    coerce_temporals=True,
    factorize=True
)
```

### read_xpt()
Read a SAS Transport (XPT) file into a Polars DataFrame.
**Signature:**

```python
def read_xpt(
    data_path: str | os.PathLike,
    *,
    n_max: int | None = None,
    coerce_temporals: bool = True,
    zap_empty_str: bool = False,
    factorize: bool = False,
    levels: str = "default",
    ordered: bool = False
) -> tuple[pl.DataFrame, dict[str, Any]]
```

**Parameters:**
| Parameter | Type | Default | Description |
|---|---|---|---|
| `data_path` | `str \| PathLike` | required | Path to the XPT file |
| `n_max` | `int \| None` | `None` | Maximum rows to read |
| `coerce_temporals` | `bool` | `True` | Convert SAS dates to Python types |
| `zap_empty_str` | `bool` | `False` | Convert empty strings to `None` |
| `factorize` | `bool` | `False` | Convert labeled variables to categoricals |
| `levels` | `str` | `"default"` | Factor levels handling |
| `ordered` | `bool` | `False` | Whether categoricals should be ordered |
**Example:**

```python
# Read transport file
df, meta = read_xpt("data.xpt")
# Read with temporal coercion (recommended)
df, meta = read_xpt("data.xpt", coerce_temporals=True)
```

### write_xpt()
Write a Polars DataFrame to a SAS Transport (XPT) file.
**Signature:**

```python
def write_xpt(
    df: pl.DataFrame,
    path: str | Path,
    *,
    version: int = 8,
    name: str | None = None,
    label: str | None = None,
    adjust_tz: bool = True
) -> None
```

**Parameters:**
| Parameter | Type | Default | Description |
|---|---|---|---|
| `df` | `pl.DataFrame` | required | DataFrame to write |
| `path` | `str \| Path` | required | Output file path |
| `version` | `int` | `8` | XPT version: 5 or 8 |
| `name` | `str \| None` | `None` | Dataset name (max 8 chars in v5, 32 in v8) |
| `label` | `str \| None` | `None` | Dataset description (max 40 chars) |
| `adjust_tz` | `bool` | `True` | Adjust timezone for datetime columns |
**Raises:**
- `ValueError`: If `name`/`label` exceed length limits
- `RuntimeError`: If the writer encounters an error
**Example:**

```python
# Write version 8 (recommended)
write_xpt(df, "clinical_trial.xpt", version=8, label="Phase II Results")
# Write version 5 (legacy compatibility)
write_xpt(df, "legacy.xpt", version=5, name="LEGACY")
```
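Because `write_xpt()` raises `ValueError` when the limits above are exceeded, an over-long label fails fast. A minimal sketch (file name illustrative):

```python
import polars as pl
from svy_io.sas import write_xpt

df = pl.DataFrame({"x": [1.0, 2.0]})

# Dataset labels are capped at 40 characters, so this should raise
try:
    write_xpt(df, "bad.xpt", label="X" * 41)
except ValueError as err:
    print(f"Rejected: {err}")
```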
### read_sas_arrow()

Read a SAS7BDAT file into a PyArrow Table with preserved metadata.
**Signature:**

```python
def read_sas_arrow(
    data_path: str,
    *,
    catalog_path: str | None = None,
    encoding: str | None = None,
    catalog_encoding: str | None = None,
    cols_skip: list[str] | None = None,
    n_max: int | None = None,
    rows_skip: int = 0
) -> tuple[pa.Table, dict[str, Any]]
```

**Example:**

```python
from svy_io.sas import read_sas_arrow
table, meta = read_sas_arrow("data.sas7bdat")
# Arrow metadata preserved in field metadata
for field in table.schema:
    print(f"{field.name}: {field.metadata}")
# Convert to Polars if needed
import polars as pl
df = pl.from_arrow(table)
```

## Metadata Utility Functions
### get_column_labels()
Extract variable labels from metadata.

```python
from svy_io.sas import read_sas, get_column_labels
df, meta = read_sas("data.sas7bdat")
labels = get_column_labels(meta)
# {'age': 'Age in years', 'income': 'Annual income', ...}
```

### get_value_labels_for_column()
Get value label mappings for a specific column.

```python
from svy_io.sas import read_sas, get_value_labels_for_column
df, meta = read_sas("survey.sas7bdat", catalog_path="formats.sas7bcat")
gender_labels = get_value_labels_for_column(meta, "gender")
# {'1': 'Male', '2': 'Female', '3': 'Other'}
```

### get_tagged_na_info()
Extract information about tagged missing values (`.A`–`.Z`).

```python
from svy_io.sas import read_sas, get_tagged_na_info
df, meta = read_sas("data.sas7bdat")
tagged_info = get_tagged_na_info(meta)
# {'age': ['A', 'B'], 'income': ['Z']}
```

## Working with Value Labels
SAS stores value labels (formats) in separate `.sas7bcat` catalog files:

```python
# Read with catalog for value labels
df, meta = read_sas("survey.sas7bdat", catalog_path="formats.sas7bcat")
# Inspect value labels
for vl in meta['value_labels']:
    print(f"\n{vl['set_name']}:")
    for value, label in vl['mapping'].items():
        print(f"  {value} = {label}")
# Auto-convert to categoricals
df, meta = read_sas(
    "survey.sas7bdat",
    catalog_path="formats.sas7bcat",
    factorize=True
)
print(df["gender"].dtype)  # Categorical
```

## Temporal Data Handling
SAS stores dates as days since 1960-01-01 and datetimes as seconds since 1960-01-01:

```python
# Without coercion (raw numeric values)
df, meta = read_sas("data.sas7bdat", coerce_temporals=False)
print(df["date_var"]) # 21915.0
# With coercion (Python dates)
df, meta = read_sas("data.sas7bdat", coerce_temporals=True)
print(df["date_var"])  # 2020-01-01
```

Format codes determine the conversion:

- `DATE`, `MMDDYY`, `YYMMDD` → `pl.Date`
- `DATETIME`, `DATETIME20` → `pl.Datetime`
- `TIME` → `pl.Duration`
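The coercion is plain epoch arithmetic, which you can verify by hand against the raw value shown above:

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)

# A SAS date is a day count from the 1960 epoch
print(SAS_EPOCH + timedelta(days=21915))     # 2020-01-01

# The reverse: a date's SAS value is its offset from the epoch
print((date(2020, 1, 1) - SAS_EPOCH).days)   # 21915
```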
## XPT Version Comparison
| Feature | XPT v5 | XPT v8 |
|---|---|---|
| Variable name length | 8 chars | 32 chars |
| String length | 200 chars | Unlimited |
| Character encoding | ASCII | UTF-8 |
| SAS compatibility | SAS 6+ | SAS 8+ |
| FDA acceptance | ✅ | ✅ |
**Recommendation:** Use version 8 unless you need compatibility with very old SAS installations.
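If you do need version 5, a pre-flight check against the limits in the table can catch violations before writing. The helper below is a sketch, not part of `svy_io`:

```python
import polars as pl

def check_xpt_v5(df: pl.DataFrame) -> list[str]:
    """Flag columns that would violate the XPT v5 limits above."""
    problems = []
    for col in df.columns:
        # v5 caps variable names at 8 characters
        if len(col) > 8:
            problems.append(f"{col}: name exceeds 8 characters")
        # v5 caps string values at 200 characters
        if df.schema[col] == pl.Utf8:
            longest = df[col].str.len_chars().max()
            if longest is not None and longest > 200:
                problems.append(f"{col}: strings exceed 200 characters")
    return problems
```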
## Data Type Conversions
### Reading (SAS → Python)
| SAS Type | Polars Type | Notes |
|---|---|---|
| Numeric | Float64 | All SAS numeric → Float64 |
| Character | Utf8 | String data |
| Date | Float64 or Date | Use coerce_temporals=True |
| Datetime | Float64 or Datetime | Use coerce_temporals=True |
### Writing (Python → XPT)
| Polars Type | XPT Type | Notes |
|---|---|---|
| Int8–Int64, UInt8–UInt64 | Numeric (double) | All integers → double |
| Float32, Float64 | Numeric (double) | |
| Boolean | Numeric (double) | True=1, False=0 |
| Utf8 | Character | |
| Date, Datetime | Numeric (double) | Days/seconds since 1960 |
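One practical consequence of these mappings: integer and boolean columns do not survive a round trip as their original types. A quick sketch (file name illustrative):

```python
import polars as pl
from svy_io.sas import read_xpt, write_xpt

df = pl.DataFrame({
    "n": pl.Series([1, 2, 3], dtype=pl.Int32),
    "flag": [True, False, True],
})

write_xpt(df, "roundtrip.xpt", version=8)
back, _meta = read_xpt("roundtrip.xpt")

# Both columns come back as Float64: XPT has a single numeric type
print(back.schema)  # e.g. {'n': Float64, 'flag': Float64}
```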
## FDA Submissions and CDISC
### Creating CDISC-Compliant XPT Files

```python
import polars as pl
from svy_io.sas import write_xpt

df_adsl = pl.DataFrame({
    "STUDYID": ["STUDY001"] * 100,
    "USUBJID": [f"001-{i:03d}" for i in range(1, 101)],
    "SUBJID": [f"{i:03d}" for i in range(1, 101)],
    "AGE": pl.Series([25 + i % 50 for i in range(100)], dtype=pl.Float64),
    "SEX": ["M", "F"] * 50,
    "ARM": ["PLACEBO", "TREATMENT"] * 50
})
write_xpt(
    df_adsl,
    "adsl.xpt",
    version=8,
    name="ADSL",
    label="Subject-Level Analysis Dataset"
)
```

### CDISC Validation Helper

```python
def validate_cdisc_xpt(df: pl.DataFrame, dataset_name: str) -> list[str]:
    """Validate DataFrame meets CDISC/FDA requirements."""
    issues = []

    # Required SDTM variables
    required = ["STUDYID", "USUBJID"]
    missing = [v for v in required if v not in df.columns]
    if missing:
        issues.append(f"Missing required: {', '.join(missing)}")

    # Variable name length (max 8 for CDISC)
    long_names = [c for c in df.columns if len(c) > 8]
    if long_names:
        issues.append(f"Names > 8 chars: {', '.join(long_names)}")

    # Dataset name length
    if len(dataset_name) > 8:
        issues.append(f"Dataset name '{dataset_name}' > 8 chars")

    return issues
issues = validate_cdisc_xpt(df_adsl, "ADSL")
if not issues:
    print("✅ CDISC-compliant")
```

## Common Patterns
### Converting SAS to Other Formats

```python
df, meta = read_sas("data.sas7bdat", catalog_path="formats.sas7bcat")
# To CSV
df.write_csv("data.csv")
# To Parquet (preserves types)
df.write_parquet("data.parquet")
# To Stata
from svy_io.stata import write_dta
write_dta(df, "data.dta")
```

### Preserving Metadata

```python
import polars as pl
from svy_io.sas import read_sas, write_xpt, get_column_labels
df, meta = read_sas("input.sas7bdat", catalog_path="formats.sas7bcat")
# Transform
df_filtered = df.filter(pl.col("valid") == 1)
# Write with preserved label
write_xpt(
    df_filtered,
    "output.xpt",
    label=meta.get('file_label', 'Processed Dataset')
)
```

### Reading from Zip Archives

```python
# Auto-extracts and reads SAS files from zip
df, meta = read_sas("data.zip")
# If zip contains both .sas7bdat and .sas7bcat, catalog is used automatically
```

## Performance Tips

```python
import polars as pl
from svy_io.sas import read_sas

# 1. Skip unnecessary columns
df, meta = read_sas("wide.sas7bdat", cols_skip=["temp1", "temp2"])
# 2. Limit rows for exploration
df, meta = read_sas("large.sas7bdat", n_max=1000)
# 3. Read metadata only
_, meta = read_sas("data.sas7bdat", n_max=0)
# 4. Use lazy evaluation
df, meta = read_sas("input.sas7bdat")
result = (
    df.lazy()
    .filter(pl.col("year") == 2024)
    .group_by("region")
    .agg(pl.col("value").sum())
    .collect()
)
```

## Feature Support
| Feature | Read SAS7BDAT | Read XPT | Write XPT |
|---|---|---|---|
| Basic data types | ✅ | ✅ | ✅ |
| Variable labels | ✅ | ✅ | ❌ |
| File labels | ✅ | ✅ | ✅ |
| Value labels (formats) | ✅* | ✅* | ❌ |
| Date/Datetime | ✅ | ✅ | ✅ |
| Tagged missing (.A-.Z) | ✅ | ✅ | ❌ |
| Zip archives | ✅ | ❌ | ❌ |
| XPT v5 and v8 | N/A | ✅ | ✅ |
*Requires .sas7bcat catalog file for SAS7BDAT
## Known Limitations
- **Compressed SAS7BDAT** — ReadStat doesn't support compressed files. Decompress in SAS first.
- **SAS7BDAT Write** — Not implemented. Use XPT for SAS-compatible output.
- **Value Labels on Write** — Cannot create labeled integers when writing XPT.
- **Tagged Missing on Write** — Cannot write `.A`–`.Z` missing values.
- **Long Strings in XPT v5** — Maximum 200 characters. Use version 8 for longer strings.
- **Encoding Detection** — No automatic detection. Try common encodings: `latin1`, `utf-8`, `cp1252`.
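A common workaround for the value-label limitation is to bake the labels into the data before writing. The sketch below assumes the coded column holds string codes matching the mapping keys, as in the `get_value_labels_for_column()` example above:

```python
import polars as pl
from svy_io.sas import get_value_labels_for_column, read_sas, write_xpt

df, meta = read_sas("survey.sas7bdat", catalog_path="formats.sas7bcat")

# Swap codes for label text, since XPT output cannot carry the mapping.
# Assumes "gender" is coded as strings ("1", "2", ...) matching the keys.
mapping = get_value_labels_for_column(meta, "gender")
df = df.with_columns(pl.col("gender").cast(pl.Utf8).replace(mapping))

write_xpt(df, "survey_labeled.xpt", version=8)
```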
## See Also
- Stata File I/O — Read and write Stata `.dta` files
- SPSS File I/O — Read and write SPSS `.sav` files
- ReadStat — Upstream C library
- CDISC Standards — Clinical data interchange
- Polars Documentation — DataFrame operations