
Serialization

BenchResult supports two persistence formats: Parquet (lossless binary, recommended) and CSV with a JSON sidecar (portable, human-inspectable). In addition, BenchPipeline.from_metadata replays the full generating pipeline from saved metadata. Both file formats round-trip X, y, and the complete provenance metadata dict.

Parquet

Installation

Parquet support requires the optional [io] extra:

pip install synthbench[io]

This installs pyarrow>=14.0. The core synthbench package does not import pyarrow at module level, so the library remains importable without it.

Writing

from synthbench import BenchPipeline, LinearDGP, MeasurementNoiseCorruptor

pipeline = BenchPipeline(
    LinearDGP(task_type="regression"),
    corruptors=[MeasurementNoiseCorruptor(noise_level=0.2)],
)
result = pipeline.run(n_samples=500, n_features=10, random_state=42)

result.to_parquet("data.parquet")

Reading

from synthbench import BenchResult

restored = BenchResult.from_parquet("data.parquet")
print(restored.X.shape)          # (500, 10)
print(restored.y.shape)          # (500,)
print(restored.metadata["dgp_class"])   # "LinearDGP"

from_parquet reconstructs X as a float64 NumPy array, y as a 1-D float64 array, and metadata as the original Python dict.

What is stored

  • X is written as columns named feature_0, feature_1, ... (dtype float64).
  • y is written as a sentinel column named __y__ (dtype float64).
  • The full metadata dict is embedded in the Parquet file's schema metadata under the bytes key b"synthbench_metadata".

Why bytes keys?

Parquet schema metadata uses bytes keys per the Parquet spec. The value is UTF-8 encoded JSON produced via json.dumps(metadata).encode("utf-8").
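The encode/decode round trip can be sketched with the standard library alone; the metadata values below are illustrative, not what synthbench actually stores:

```python
import json

# Parquet schema metadata maps bytes keys to bytes values, so the dict is
# serialized to UTF-8 JSON before being embedded in the file's schema.
metadata = {"dgp_class": "LinearDGP", "random_state": 42}
schema_metadata = {b"synthbench_metadata": json.dumps(metadata).encode("utf-8")}

# Reading reverses the encoding to recover the original Python dict.
restored = json.loads(schema_metadata[b"synthbench_metadata"].decode("utf-8"))
```

Because the value is plain JSON, the metadata survives any Parquet reader that preserves schema metadata, not just synthbench's own.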

Limitations

  • All feature columns are coerced to float64. Boolean or integer features are preserved numerically but lose their original dtype.
  • Importing pyarrow is deferred to the method body; import synthbench does not require pyarrow to be installed. Calling to_parquet or from_parquet without the [io] extra installed raises ImportError with an install instruction.
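The deferred-import pattern described above can be sketched as follows; the helper name and error message are illustrative, not synthbench's actual internals:

```python
def _require_pyarrow():
    # Deferred import: pyarrow is only loaded inside the method body,
    # so importing the library itself never requires the [io] extra.
    try:
        import pyarrow
        import pyarrow.parquet  # noqa: F401
        return pyarrow
    except ImportError as exc:
        raise ImportError(
            "Parquet support requires pyarrow; install it with "
            "'pip install synthbench[io]'."
        ) from exc
```

to_parquet and from_parquet would call such a guard first, so the failure mode is a clear ImportError with an install instruction rather than a missing-module traceback.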

CSV + Sidecar JSON

Writing

to_csv writes two files: the data file and a companion sidecar containing the metadata.

from synthbench import BenchPipeline, LinearDGP

pipeline = BenchPipeline(LinearDGP(task_type="classification"))
result = pipeline.run(n_samples=300, n_features=8, random_state=7)

result.to_csv("data.csv")
# Produces: data.csv  (feature columns + __y__ column)
#           data.meta.json  (full metadata dict as JSON)

The CSV file contains feature_0, feature_1, ... columns followed by the __y__ sentinel column, written using numpy.savetxt. The sidecar is written to {stem}.meta.json via json.dump.
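Because the layout is plain CSV, any consumer can recover X and y without synthbench. The sketch below recreates the documented layout by hand; a header row and these exact savetxt arguments are assumptions for illustration:

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Stand-in for to_csv's output: feature columns plus the __y__ sentinel
# in the CSV, and a .meta.json sidecar next to it.
X = np.arange(6, dtype=np.float64).reshape(3, 2)
y = np.array([0.0, 1.0, 0.0])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data.csv"
    np.savetxt(csv_path, np.column_stack([X, y]), delimiter=",",
               header="feature_0,feature_1,__y__", comments="")
    (Path(tmp) / "data.meta.json").write_text(json.dumps({"dgp_class": "LinearDGP"}))

    # Any CSV-capable reader splits the last column off as y.
    data = np.loadtxt(csv_path, delimiter=",", skiprows=1)
    X2, y2 = data[:, :-1], data[:, -1]
```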

Reading

from synthbench import BenchResult

restored = BenchResult.from_csv("data.csv")
print(restored.X.shape)   # (300, 8)
print(restored.metadata["synthbench_version"])

from_csv resolves the sidecar path automatically using pathlib.Path.with_suffix combined with the .meta.json naming convention.

Sidecar required

from_csv raises FileNotFoundError if the sidecar {stem}.meta.json is missing. Keep both files together when sharing datasets.
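The resolution rule is a one-liner with pathlib; the function name here is illustrative:

```python
from pathlib import Path

def sidecar_path(csv_path):
    # data.csv -> data.meta.json, per the documented naming convention.
    # with_suffix replaces the final ".csv" suffix with ".meta.json".
    return Path(csv_path).with_suffix(".meta.json")
```

For example, sidecar_path("runs/data.csv") resolves to runs/data.meta.json, so the sidecar is always looked up next to the data file.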

When to use CSV vs Parquet

| Use case                            | Recommended format       |
| ----------------------------------- | ------------------------ |
| Maximum fidelity, binary            | Parquet                  |
| Human-inspectable data              | CSV                      |
| Deployment constraint (no pyarrow)  | CSV                      |
| Large files                         | Parquet (smaller, faster) |
| Sharing with non-Python consumers   | CSV                      |

Pipeline replay via from_metadata

Motivation

The metadata dict stored alongside X and y contains everything needed to reconstruct the generating pipeline: the DGP class and kwargs, the corruptor chain (feature and label), and the master random_state. BenchPipeline.from_metadata reads that metadata and returns a ready-to-run pipeline.

Round-trip example

from synthbench import BenchPipeline, BenchResult, LinearDGP, MeasurementNoiseCorruptor

# Original run
pipeline = BenchPipeline(
    LinearDGP(task_type="classification"),
    corruptors=[MeasurementNoiseCorruptor(noise_level=0.1)],
)
result = pipeline.run(n_samples=200, n_features=10, random_state=42)

# Persist
result.to_parquet("data.parquet")

# Later, in a different session
restored = BenchResult.from_parquet("data.parquet")
replayed_pipeline = BenchPipeline.from_metadata(restored.metadata)
meta_dgp = restored.metadata["dgp_params"]
result2 = replayed_pipeline.run(
    n_samples=meta_dgp["n_samples"],
    n_features=meta_dgp["n_features"],
    random_state=meta_dgp["random_state"],
)

# Bit-identical
import numpy as np
assert np.array_equal(result2.X, restored.X)
assert np.array_equal(result2.y, restored.y)

What enables replay?

The dgp_params dict includes a dgp_key field (e.g. "linear") that uniquely identifies the generating DGP class in the internal registry. Combined with the serialized corruptor_params and label_corruptor_params entries, this is sufficient to reconstruct the exact pipeline.
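A minimal sketch of registry-based reconstruction; the registry dict, helper, and stripped-key set below are illustrative, not synthbench's actual internals:

```python
# Stand-in DGP class and registry for illustration only.
class LinearDGP:
    def __init__(self, **kwargs):
        self.kwargs = kwargs

_DGP_REGISTRY = {"linear": LinearDGP}

def dgp_from_metadata(metadata):
    params = metadata["dgp_params"]
    cls = _DGP_REGISTRY[params["dgp_key"]]
    # Rebuild the DGP from its recorded constructor kwargs, dropping the
    # run-level entries that are passed to .run() instead.
    kwargs = {k: v for k, v in params.items()
              if k not in {"dgp_key", "n_samples", "n_features", "random_state"}}
    return cls(**kwargs)

dgp = dgp_from_metadata({"dgp_params": {
    "dgp_key": "linear", "task_type": "regression",
    "n_samples": 500, "n_features": 10, "random_state": 42,
}})
```

The same lookup-then-construct step, repeated for each entry in corruptor_params and label_corruptor_params, yields the full pipeline.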

Constraints on bit-identical replay

  • You must pass the same n_samples, n_features, and random_state to the replayed .run(). These are stored in metadata["dgp_params"].
  • The same versions of synthbench, numpy, and scikit-learn must be used. Version provenance is stored in the metadata for auditing.
  • RandomNeuralDGP replay additionally requires the [neural] extra (torch). The DGP registry is populated lazily; from_metadata triggers the import on demand.
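A defensive pre-replay check can catch version drift early. Only the synthbench_version key appears in the examples above, so this sketch checks that one alone; the function name is illustrative:

```python
def check_replay_versions(metadata, installed_version):
    # Fail fast if the recorded library version differs from the one
    # installed, since bit-identical replay is only guaranteed under
    # matching versions.
    recorded = metadata.get("synthbench_version")
    if recorded is not None and recorded != installed_version:
        raise RuntimeError(
            f"metadata was recorded with synthbench {recorded}, but "
            f"{installed_version} is installed; bit-identical replay "
            "is not guaranteed."
        )

check_replay_versions({"synthbench_version": "1.2.0"}, "1.2.0")  # passes silently
```

The same pattern extends to whichever numpy and scikit-learn provenance keys the metadata records.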

Metadata enrichment fields

Every BenchResult metadata dict produced by BenchPipeline.run() (after v1.1) additionally includes:

| Key                      | Type           | Notes                                                                         |
| ------------------------ | -------------- | ----------------------------------------------------------------------------- |
| bayes_error              | float or None  | Empirical 1-NN LOO error rate for classification; None for regression or NaN-containing X. |
| bayes_error_method       | str or None    | "empirical_knn" for classification; None otherwise.                            |
| bayes_error_n_subsample  | int or absent  | Present only when the sample count exceeded the 2000-sample LOO cap.           |
| effective_rank           | float or None  | Roy & Vetterli (2007) entropy-based rank of post-corruption X; None when X contains NaN. |
| dgp_params.dgp_key       | str            | Registry key used by from_metadata to reconstruct the DGP.                     |

These fields are persisted alongside the rest of the metadata in both Parquet and CSV sidecar formats.
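For intuition about the effective_rank field, here is a minimal sketch of the Roy & Vetterli (2007) definition: the exponential of the Shannon entropy of the normalized singular value distribution. synthbench's exact implementation may differ in edge-case handling:

```python
import numpy as np

def effective_rank(X):
    # Normalize the singular values into a probability distribution p,
    # then return exp(H(p)) where H is Shannon entropy (natural log).
    s = np.linalg.svd(X, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(np.exp(-(p * np.log(p)).sum()))

# The identity matrix of size n has effective rank n; a rank-1 matrix has 1.
print(effective_rank(np.eye(4)))                          # 4.0
print(effective_rank(np.outer(np.ones(3), np.ones(5))))   # 1.0
```

Unlike the integer matrix rank, this measure degrades smoothly as corruption flattens the spectrum of X, which is what makes it useful as a provenance statistic.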