Pipeline & Result API Reference
Orchestrates a DGP and an ordered chain of corruptors.
BenchPipeline is the primary user-facing entry point for generating corrupted benchmark datasets. It:
- Accepts a BaseDGP instance and an optional list of BaseCorruptor instances.
- Enforces the canonical corruptor application order: collinearity → categorical → measurement_noise → outlier → missing_data.
- Manages seed derivation using an internal seed derivation function so that the same random_state always produces identical results.
- Assembles complete provenance metadata.
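The canonical ordering described above can be illustrated with a small, self-contained sketch. It is not the library's implementation: it works on plain string labels taken from the list above, because how a BaseCorruptor exposes its kind is not stated here, and the helper name sort_canonically is hypothetical.

```python
# Canonical application order, copied from the description above.
CANONICAL_ORDER = [
    "collinearity",
    "categorical",
    "measurement_noise",
    "outlier",
    "missing_data",
]


def sort_canonically(kinds):
    """Return the given corruptor kind labels in canonical order.

    ASSUMPTION: corruptors are identified by string labels. The real
    pipeline may key on classes or attributes instead.
    Raises KeyError for a label outside CANONICAL_ORDER.
    """
    rank = {kind: position for position, kind in enumerate(CANONICAL_ORDER)}
    # sorted() is stable, so repeated kinds keep their relative order.
    return sorted(kinds, key=rank.__getitem__)
```

Under this sketch, a user-supplied order such as missing_data, collinearity, outlier would be applied as collinearity, outlier, missing_data.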
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dgp | | A concrete BaseDGP instance. | required |
| corruptors | | Optional list of BaseCorruptor instances. If None or empty, the pipeline runs the DGP only. | None |
Examples:
>>> from synthbench import BenchPipeline, LinearDGP, MeasurementNoiseCorruptor
>>> pipeline = BenchPipeline(LinearDGP(), [MeasurementNoiseCorruptor()])
>>> result = pipeline.run(n_samples=200, random_state=42)
from_metadata(metadata)
classmethod
Reconstruct a BenchPipeline from saved metadata (SER-04).
Callers should then invoke .run(n_samples, n_features, random_state)
using the values from metadata["dgp_params"] for bit-identical replay.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata | dict | A metadata dict produced by a prior BenchPipeline.run call (must include | required |
Returns:
| Type | Description |
|---|---|
| BenchPipeline | A new pipeline reconstructed from the saved |
Raises:
| Type | Description |
|---|---|
| KeyError | If |
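The replay workflow described for from_metadata could be wrapped in a small helper like the one below. This is a sketch under stated assumptions: the helper name replay is hypothetical, and the three key names read from metadata["dgp_params"] are assumed to match the run arguments, which is not confirmed here.

```python
def replay(pipeline_cls, metadata):
    """Rebuild a pipeline from saved metadata and rerun it.

    pipeline_cls is any class offering from_metadata() and run(),
    for example BenchPipeline.
    """
    params = metadata["dgp_params"]  # key named in the description above
    pipeline = pipeline_cls.from_metadata(metadata)
    # ASSUMPTION: dgp_params stores the original run arguments under
    # exactly these three names.
    return pipeline.run(
        params["n_samples"],
        params["n_features"],
        params["random_state"],
    )
```

With a previously saved result, a call might look like replay(BenchPipeline, saved_result.metadata). A missing key surfaces as a KeyError from the dictionary lookups.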
run(n_samples, n_features=10, random_state=0)
Generate a corrupted benchmark dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n_samples | int | Number of rows in the output feature matrix. | required |
| n_features | int | Number of columns in the output feature matrix. | 10 |
| random_state | int | Master integer seed. The same value always produces bit-identical results. | 0 |
Returns:
| Type | Description |
|---|---|
| BenchResult | Contains the (corrupted) feature matrix X, target y, and rich metadata including signal/effective importances and provenance. |
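To make the idea of a master seed concrete, here is one way per-stage seeds could be derived deterministically from random_state. The pipeline's internal seed derivation function is not documented here, so this hash-based scheme and the name derive_seed are purely illustrative stand-ins.

```python
import hashlib


def derive_seed(master_seed, stage):
    """Derive a stable 32-bit child seed for one pipeline stage.

    ASSUMPTION: hashing "<master_seed>:<stage>" with SHA-256 is an
    illustrative scheme, not the one the pipeline uses internally.
    """
    digest = hashlib.sha256(f"{master_seed}:{stage}".encode("utf-8")).digest()
    # Keep the first 4 bytes so the result fits in an unsigned 32-bit int.
    return int.from_bytes(digest[:4], "big")
```

The same master seed and stage label always yield the same child seed, while different stages get independent-looking seeds. That is the property required for bit-identical reruns.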
BenchResult is the canonical output type for all DGP and pipeline operations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Feature matrix of shape | required |
| y | ndarray | Target array of shape | required |
| metadata | dict | Free-form dict describing how the dataset was generated. See | dict() |
from_csv(path)
classmethod
Read X, y from CSV and metadata from {stem}.meta.json sidecar.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to the | required |
Returns:
| Type | Description |
|---|---|
| BenchResult | |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the sidecar |
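Since from_csv depends on a {stem}.meta.json sidecar, it can help to compute that path and check for it before loading. The sketch below uses only pathlib. It assumes the sidecar sits in the same directory as the CSV file, which the description does not state explicitly, and both helper names are hypothetical.

```python
from pathlib import Path


def sidecar_path(csv_path):
    """Return the expected {stem}.meta.json path for a CSV file.

    ASSUMPTION: the sidecar lives next to the CSV file.
    """
    csv_path = Path(csv_path)
    # "bench.csv" has stem "bench", giving "bench.meta.json".
    return csv_path.with_name(f"{csv_path.stem}.meta.json")


def has_sidecar(csv_path):
    """Report whether the expected metadata sidecar exists on disk."""
    return sidecar_path(csv_path).is_file()
```

Checking has_sidecar(path) before calling BenchResult.from_csv(path) lets a caller report a missing sidecar with a clearer message than the raw FileNotFoundError.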
from_json(s)
classmethod
Deserialise a BenchResult from a JSON string.
X and y are reconstructed as np.ndarray.
from_parquet(path)
classmethod
Read X, y, and metadata from a Parquet file written by to_parquet.
Requires the [io] optional extra: pip install synthbench[io].
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to the | required |
Returns:
| Type | Description |
|---|---|
| BenchResult | |
to_csv(path)
Write X, y to CSV and metadata to a {stem}.meta.json sidecar.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Destination path for the | required |
to_json()
Serialise the result to a JSON string.
X and y are converted to nested Python lists via
.tolist(). metadata is serialised as-is and must therefore
already contain only JSON-native values.
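Because metadata is serialised as-is, a quick pre-flight check can catch values that would fail inside to_json. The helper below is a hypothetical sketch built on the standard json module. It treats anything json.dumps accepts as acceptable, which is slightly more permissive than strictly JSON-native (tuples, for instance, are written as lists).

```python
import json


def is_json_serialisable(metadata):
    """Return True if the standard json module can serialise metadata.

    NOTE: this is an illustrative check, not part of the library.
    """
    try:
        json.dumps(metadata)
    except (TypeError, ValueError):
        # TypeError: unsupported type (e.g. set, ndarray, custom object).
        # ValueError: circular reference.
        return False
    return True
```

A dict holding a NumPy array or a set would fail this check. Converting such values (for example with .tolist()) before building the BenchResult avoids the error.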
to_parquet(path)
Write X, y, and metadata to a Parquet file.
Metadata is embedded in the Parquet schema header under the
b"synthbench_metadata" key.
Requires the [io] optional extra: pip install synthbench[io].
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Destination path for the | required |
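For inspecting a file without loading the arrays, the embedded metadata could in principle be read straight from the schema header. The sketch below decodes the b"synthbench_metadata" entry from a bytes-to-bytes mapping. It assumes the value is UTF-8 encoded JSON, which is not confirmed here, and the helper name decode_embedded_metadata is hypothetical.

```python
import json

METADATA_KEY = b"synthbench_metadata"  # key named in the to_parquet description


def decode_embedded_metadata(schema_metadata):
    """Decode the metadata dict from a Parquet schema key/value header.

    schema_metadata is a mapping of bytes to bytes. With pyarrow
    (assumed backend), pyarrow.parquet.read_schema(path).metadata
    returns such a mapping, or None when no header is present.
    ASSUMPTION: the stored value is UTF-8 encoded JSON.
    """
    if not schema_metadata or METADATA_KEY not in schema_metadata:
        raise KeyError("no synthbench_metadata entry in schema header")
    return json.loads(schema_metadata[METADATA_KEY].decode("utf-8"))
```

For normal use, BenchResult.from_parquet remains the supported way to read the file. This helper is only a convenience for quick inspection.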