Serialization
BenchResult supports two persistence formats, Parquet (lossless binary, recommended) and CSV with a JSON sidecar (portable, human-inspectable), plus full pipeline replay from saved metadata via BenchPipeline.from_metadata. Both formats round-trip X, y, and the complete provenance metadata dict.
Parquet
Installation
Parquet support requires the optional `[io]` extra (installed with `pip install "synthbench[io]"`). This installs `pyarrow>=14.0`. The core synthbench package does not import pyarrow at module level, so the library remains importable without it.
Writing
```python
from synthbench import BenchPipeline, LinearDGP, MeasurementNoiseCorruptor

pipeline = BenchPipeline(
    LinearDGP(task_type="regression"),
    corruptors=[MeasurementNoiseCorruptor(noise_level=0.2)],
)
result = pipeline.run(n_samples=500, n_features=10, random_state=42)
result.to_parquet("data.parquet")
```
Reading
```python
from synthbench import BenchResult

restored = BenchResult.from_parquet("data.parquet")
print(restored.X.shape)                # (500, 10)
print(restored.y.shape)                # (500,)
print(restored.metadata["dgp_class"])  # "LinearDGP"
```
from_parquet reconstructs X as a float64 NumPy array, y as a 1-D float64 array, and metadata as the original Python dict.
What is stored
- `X` is written as columns named `feature_0`, `feature_1`, ... (dtype `float64`).
- `y` is written as a sentinel column named `__y__` (dtype `float64`).
- The full metadata dict is embedded in the Parquet file's schema metadata under the bytes key `b"synthbench_metadata"`.
Why bytes keys?
Parquet schema metadata uses bytes keys per the Parquet spec. The value
is UTF-8 encoded JSON produced via json.dumps(metadata).encode("utf-8").
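The encode/decode round-trip described above can be sketched with the standard library alone; the key name `b"synthbench_metadata"` comes from the text, while the sample dict values are illustrative:

```python
import json

# A metadata dict as it might appear on a BenchResult (values illustrative).
metadata = {"dgp_class": "LinearDGP", "random_state": 42}

# Parquet schema metadata maps bytes keys to bytes values, so the dict
# is serialized as UTF-8 encoded JSON under the documented bytes key.
key = b"synthbench_metadata"
schema_metadata = {key: json.dumps(metadata).encode("utf-8")}

# Reading reverses the encoding and recovers the original Python dict.
restored = json.loads(schema_metadata[key].decode("utf-8"))
assert restored == metadata
```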
Limitations
- All feature columns are coerced to `float64`. Boolean or integer features are preserved numerically but lose their original dtype.
- Importing `pyarrow` is deferred to the method body; `import synthbench` does not require `pyarrow` to be installed. Calling `to_parquet` or `from_parquet` without the `[io]` extra installed raises `ImportError` with an install instruction.
CSV + Sidecar JSON
Writing
to_csv writes two files: the data file and a companion sidecar containing the metadata.
```python
from synthbench import BenchPipeline, LinearDGP

pipeline = BenchPipeline(LinearDGP(task_type="classification"))
result = pipeline.run(n_samples=300, n_features=8, random_state=7)
result.to_csv("data.csv")
# Produces: data.csv       (feature columns + __y__ column)
#           data.meta.json (full metadata dict as JSON)
```
The CSV file contains feature_0, feature_1, ... columns followed by the __y__ sentinel column, written using numpy.savetxt. The sidecar is written to {stem}.meta.json via json.dump.
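The two-file write described above can be approximated with plain numpy and the standard library. The column layout and the `.meta.json` naming follow the text; the function itself is a sketch, not synthbench's implementation:

```python
import json
from pathlib import Path

import numpy as np

def write_csv_with_sidecar(path, X, y, metadata):
    # Data file: feature columns followed by the __y__ sentinel column.
    path = Path(path)
    header = ",".join([f"feature_{i}" for i in range(X.shape[1])] + ["__y__"])
    np.savetxt(path, np.column_stack([X, y]), delimiter=",",
               header=header, comments="")
    # Sidecar: the full metadata dict as JSON, next to the data file.
    sidecar = path.with_suffix(".meta.json")
    sidecar.write_text(json.dumps(metadata))
    return sidecar
```

`comments=""` stops `savetxt` from prefixing the header row with `#`, so the first line is a plain CSV header.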
Reading
```python
from synthbench import BenchResult

restored = BenchResult.from_csv("data.csv")
print(restored.X.shape)  # (300, 8)
print(restored.metadata["synthbench_version"])
```
from_csv resolves the sidecar path automatically using pathlib.Path.with_suffix combined with the .meta.json naming convention.
Sidecar required
from_csv raises FileNotFoundError if the sidecar
{stem}.meta.json is missing. Keep both files together
when sharing datasets.
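The sidecar resolution and its failure mode can be sketched in the same style; `read_csv_with_sidecar` is an illustrative stand-in for `from_csv`, not the library's code:

```python
import json
from pathlib import Path

import numpy as np

def read_csv_with_sidecar(path):
    path = Path(path)
    # with_suffix swaps ".csv" for ".meta.json": data.csv -> data.meta.json
    sidecar = path.with_suffix(".meta.json")
    if not sidecar.exists():
        raise FileNotFoundError(f"Missing metadata sidecar: {sidecar}")
    # Skip the header row; the last column is the __y__ sentinel.
    data = np.genfromtxt(path, delimiter=",", skip_header=1)
    X, y = data[:, :-1], data[:, -1]
    metadata = json.loads(sidecar.read_text())
    return X, y, metadata
```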
When to use CSV vs Parquet
| Use case | Recommended format |
|---|---|
| Maximum fidelity, binary | Parquet |
| Human-inspectable data | CSV |
| Deployment constraint (no pyarrow) | CSV |
| Large files | Parquet (smaller, faster) |
| Sharing with non-Python consumers | CSV |
Pipeline replay via from_metadata
Motivation
The metadata dict stored alongside X and y contains everything needed to reconstruct the generating pipeline: the DGP class and kwargs, the corruptor chain (feature and label), and the master random_state. BenchPipeline.from_metadata reads that metadata and returns a ready-to-run pipeline.
Round-trip example
```python
import numpy as np
from synthbench import BenchPipeline, BenchResult, LinearDGP, MeasurementNoiseCorruptor

# Original run
pipeline = BenchPipeline(
    LinearDGP(task_type="classification"),
    corruptors=[MeasurementNoiseCorruptor(noise_level=0.1)],
)
result = pipeline.run(n_samples=200, n_features=10, random_state=42)

# Persist
result.to_parquet("data.parquet")

# Later, in a different session
restored = BenchResult.from_parquet("data.parquet")
replayed_pipeline = BenchPipeline.from_metadata(restored.metadata)

meta_dgp = restored.metadata["dgp_params"]
result2 = replayed_pipeline.run(
    n_samples=meta_dgp["n_samples"],
    n_features=meta_dgp["n_features"],
    random_state=meta_dgp["random_state"],
)

# Bit-identical
assert np.array_equal(result2.X, restored.X)
assert np.array_equal(result2.y, restored.y)
```
What enables replay?
The dgp_params dict includes a dgp_key field (e.g. "linear")
that uniquely identifies the generating DGP class in the internal
registry. Combined with the serialized corruptor_params and
label_corruptor_params entries, this is sufficient to reconstruct
the exact pipeline.
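The registry mechanism can be illustrated with a toy sketch. Only the `dgp_key` field and the `"linear"` key come from the text; the class body, the helper name `dgp_from_metadata`, and the set of excluded run-level keys are assumptions for illustration:

```python
class LinearDGP:
    # Stand-in for the real DGP class; it simply records its kwargs.
    def __init__(self, **kwargs):
        self.kwargs = kwargs

# A registry maps a stable string key to the generating class, so the
# metadata only needs to persist the key, not a fragile import path.
DGP_REGISTRY = {"linear": LinearDGP}

def dgp_from_metadata(dgp_params):
    cls = DGP_REGISTRY[dgp_params["dgp_key"]]
    # Run-level arguments go to .run(), not the constructor (assumption).
    run_keys = {"dgp_key", "n_samples", "n_features", "random_state"}
    kwargs = {k: v for k, v in dgp_params.items() if k not in run_keys}
    return cls(**kwargs)
```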
Constraints on bit-identical replay
- You must pass the same `n_samples`, `n_features`, and `random_state` to the replayed `.run()`. These are stored in `metadata["dgp_params"]`.
- The same versions of synthbench, numpy, and scikit-learn must be used. Version provenance is stored in the metadata for auditing.
- `RandomNeuralDGP` replay additionally requires the `[neural]` extra (torch). The DGP registry is populated lazily; `from_metadata` triggers the import on demand.
Metadata enrichment fields
Every BenchResult metadata dict produced by BenchPipeline.run() (v1.1 and later) additionally includes:
| Key | Type | Notes |
|---|---|---|
| `bayes_error` | `float` or `None` | Empirical 1-NN LOO error rate for classification; `None` for regression or NaN-containing X. |
| `bayes_error_method` | `str` or `None` | `"empirical_knn"` for classification; `None` otherwise. |
| `bayes_error_n_subsample` | `int` or absent | Present only when the sample count exceeded the 2000-sample LOO cap. |
| `effective_rank` | `float` or `None` | Roy & Vetterli (2007) entropy-based rank of post-corruption X. `None` when X contains NaN. |
| `dgp_params.dgp_key` | `str` | Registry key used by `from_metadata` to reconstruct the DGP. |
These fields are persisted alongside the rest of the metadata in both Parquet and CSV sidecar formats.
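As context for the `effective_rank` field: the Roy & Vetterli (2007) effective rank is the exponential of the Shannon entropy of the normalized singular value distribution. A minimal sketch of that definition (not synthbench's implementation):

```python
import numpy as np

def effective_rank(X):
    # Singular values of X, normalized into a probability distribution.
    s = np.linalg.svd(X, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # convention: 0 * log(0) = 0
    # exp(Shannon entropy): ranges from 1 (rank-1) up to min(n, d).
    return float(np.exp(-np.sum(p * np.log(p))))
```

For a matrix with equal singular values the effective rank equals the full rank; for example `effective_rank(np.eye(5))` is 5.0, while any rank-1 matrix scores approximately 1.0.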