Pipeline & Result API Reference

Orchestrates a data-generating process (DGP) and an ordered chain of corruptors.

BenchPipeline is the primary user-facing entry point for generating corrupted benchmark datasets. It:

Accepts a BaseDGP instance and an optional list of BaseCorruptor instances.
Enforces the canonical corruptor application order: collinearity → categorical → measurement_noise → outlier → missing_data.
Derives all seeds from random_state through an internal seed derivation function, so the same random_state always produces identical results.
Assembles complete provenance metadata.
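
The ordering and seeding behaviour listed above can be illustrated with a small, self-contained sketch. It does not use synthbench internals: `CANONICAL_ORDER`, `sort_corruptors`, and `derive_seed` are hypothetical names, and the hash-based derivation is only an assumption about one way to obtain stable child seeds from a master `random_state`.

```python
import hashlib

# Canonical application order, as documented above.
CANONICAL_ORDER = [
    "collinearity",
    "categorical",
    "measurement_noise",
    "outlier",
    "missing_data",
]


def sort_corruptors(kinds):
    """Sort corruptor kinds into the canonical order (hypothetical helper)."""
    rank = {kind: i for i, kind in enumerate(CANONICAL_ORDER)}
    return sorted(kinds, key=lambda kind: rank[kind])


def derive_seed(random_state, stage):
    """Derive a stable 32-bit child seed from the master seed and a stage name.

    Assumption: hashing is one possible scheme, not necessarily synthbench's.
    """
    digest = hashlib.sha256(f"{random_state}:{stage}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


# The caller's order does not matter; the canonical order is applied.
ordered = sort_corruptors(["missing_data", "collinearity", "outlier"])
seeds = {stage: derive_seed(42, stage) for stage in ordered}
```

Because each child seed depends only on the master seed and the stage name, rerunning with the same `random_state` reproduces every seed in this sketch.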

Parameters:

Name	Type	Description	Default
`dgp`	`BaseDGP`	A concrete BaseDGP instance.	required
`corruptors`	`list[BaseCorruptor] | None`	Optional list of BaseCorruptor instances. If None or empty, the pipeline runs the DGP only.	`None`

Examples:

>>> from synthbench import BenchPipeline, LinearDGP, MeasurementNoiseCorruptor
>>> pipeline = BenchPipeline(LinearDGP(), [MeasurementNoiseCorruptor()])
>>> result = pipeline.run(n_samples=200, random_state=42)

`from_metadata(metadata)` `classmethod`

Reconstruct a BenchPipeline from saved metadata (SER-04).

Callers should then invoke .run(n_samples, n_features, random_state) using the values from metadata["dgp_params"] for bit-identical replay.

Parameters:

Name	Type	Description	Default
`metadata`	`dict`	A metadata dict produced by a prior BenchPipeline.run call (must include `dgp_params["dgp_key"]` — see SER-05).	required

Returns:

Type	Description
`BenchPipeline`	A new pipeline reconstructed from the saved `dgp_params`, `corruptor_params`, and `label_corruptor_params`.

Raises:

Type	Description
`KeyError`	If `metadata["dgp_params"]["dgp_key"]` is missing or not found in the registry.
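
As a rough illustration of the replay flow, the sketch below looks up a DGP in a registry and recovers the `run` arguments from saved metadata. `DGP_REGISTRY` and `replay_args` are hypothetical stand-ins, and the assumption that `dgp_params` stores `n_samples`, `n_features`, and `random_state` under exactly those keys is inferred from the description above rather than confirmed.

```python
# Hypothetical registry mapping a dgp_key to a DGP factory.
DGP_REGISTRY = {"linear": lambda: "LinearDGP stand-in"}


def replay_args(metadata):
    """Return (dgp_factory, run_kwargs) recovered from saved metadata."""
    dgp_params = metadata["dgp_params"]
    # A missing "dgp_key" and an unregistered key both raise KeyError.
    factory = DGP_REGISTRY[dgp_params["dgp_key"]]
    run_kwargs = {
        name: dgp_params[name]
        for name in ("n_samples", "n_features", "random_state")
    }
    return factory, run_kwargs


saved = {
    "dgp_params": {
        "dgp_key": "linear",
        "n_samples": 200,
        "n_features": 10,
        "random_state": 42,
    }
}
factory, run_kwargs = replay_args(saved)
```

With the real library, the equivalent step would be `BenchPipeline.from_metadata(metadata)` followed by `.run(...)` with the recovered values.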

`run(n_samples, n_features=10, random_state=0)`

Generate a corrupted benchmark dataset.

Parameters:

Name	Type	Description	Default
`n_samples`	`int`	Number of rows in the output feature matrix.	required
`n_features`	`int`	Number of columns in the output feature matrix.	`10`
`random_state`	`int`	Master integer seed. The same value always produces bit-identical results.	`0`

Returns:

Type	Description
`BenchResult`	Contains the (corrupted) feature matrix X, target y, and rich metadata including signal/effective importances and provenance.

BenchResult is the canonical output type for all DGP and pipeline operations.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Feature matrix of shape `(n_samples, n_features)`.	required
`y`	`ndarray`	Target array of shape `(n_samples,)`.	required
`metadata`	`dict`	Free-form dict describing how the dataset was generated. See `METADATA_SCHEMA_KEYS` for the agreed key set. All values must be JSON-native so that `to_json` works without a custom encoder.	`dict()`
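
One way to check the JSON-native requirement before building a result is to attempt a plain `json.dumps` call. The helper name `is_json_native` is hypothetical and not part of synthbench. Passing `allow_nan=False` is a deliberately strict choice for this sketch, since NaN and infinity are not valid JSON.

```python
import json


def is_json_native(metadata):
    """Return True if metadata serialises without a custom encoder."""
    try:
        # allow_nan=False rejects NaN/inf, which are not valid JSON.
        json.dumps(metadata, allow_nan=False)
    except (TypeError, ValueError):
        return False
    return True


good = {"random_state": 42, "corruptors": ["outlier"], "noise_scale": 0.1}
bad = {"seen": {1, 2, 3}}  # a set is not JSON-native

good_ok = is_json_native(good)
bad_ok = is_json_native(bad)
```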

`from_csv(path)` `classmethod`

Read X and y from a CSV file and metadata from its {stem}.meta.json sidecar.

Parameters:

Name	Type	Description	Default
`path`	`str | Path`	Path to the `.csv` file.	required

Returns:

Type	Description
`BenchResult`	The result loaded from the CSV file and its `.meta.json` sidecar.

Raises:

Type	Description
`FileNotFoundError`	If the sidecar `.meta.json` file is missing. The `.csv` alone is not a valid `BenchResult`.
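
The sidecar convention can be sketched with `pathlib` alone. The helpers `sidecar_path` and `require_sidecar` are hypothetical and only illustrate how a `{stem}.meta.json` path might be resolved next to the CSV and how a missing sidecar could surface as `FileNotFoundError`.

```python
import tempfile
from pathlib import Path


def sidecar_path(csv_path):
    """Return the {stem}.meta.json path that sits next to a CSV file."""
    csv_path = Path(csv_path)
    return csv_path.with_name(csv_path.stem + ".meta.json")


def require_sidecar(csv_path):
    """Raise FileNotFoundError when the metadata sidecar is absent."""
    meta = sidecar_path(csv_path)
    if not meta.exists():
        raise FileNotFoundError(f"missing metadata sidecar: {meta}")
    return meta


with tempfile.TemporaryDirectory() as tmp:
    csv_file = Path(tmp) / "run1.csv"
    csv_file.write_text("x0,x1,y\n0.5,1.0,0.0\n")
    try:
        require_sidecar(csv_file)  # the CSV alone is not enough
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True
    sidecar_path(csv_file).write_text("{}")
    found_name = require_sidecar(csv_file).name
```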

`from_json(s)` `classmethod`

Deserialise a BenchResult from a JSON string.

X and y are reconstructed as np.ndarray.

`from_parquet(path)` `classmethod`

Read X, y, and metadata from a Parquet file written by to_parquet.

Requires the [io] optional extra: pip install synthbench[io].

Parameters:

Name	Type	Description	Default
`path`	`str | Path`	Path to the `.parquet` file.	required

Returns:

Type	Description
`BenchResult`	The result loaded from the Parquet file, including its embedded metadata.

`to_csv(path)`

Write X, y to CSV and metadata to a {stem}.meta.json sidecar.

Parameters:

Name	Type	Description	Default
`path`	`str | Path`	Destination path for the `.csv` file. A companion `{stem}.meta.json` file is written in the same directory.	required

`to_json()`

Serialise the result to a JSON string.

X and y are converted to nested Python lists via .tolist(). metadata is serialised as-is and must therefore already contain only JSON-native values.
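
The round trip can be approximated with the standard library. Plain nested lists stand in for the output of `.tolist()`, and the top-level key names `"X"`, `"y"`, and `"metadata"` are an assumption about the payload layout, not a documented schema.

```python
import json

X = [[0.5, 1.0], [1.5, 2.0]]  # what X.tolist() would yield for a 2x2 array
y = [0.0, 1.0]
metadata = {"random_state": 42}  # already JSON-native, so no encoder is needed

# Serialise, then parse back into plain Python containers.
payload = json.dumps({"X": X, "y": y, "metadata": metadata})
restored = json.loads(payload)
```

A loader such as `from_json` would then convert the restored lists back to arrays, for example with `np.asarray`.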

`to_parquet(path)`

Write X, y, and metadata to a Parquet file.

Metadata is embedded in the Parquet schema header under the b"synthbench_metadata" key.

Requires the [io] optional extra: pip install synthbench[io].

Parameters:

Name	Type	Description	Default
`path`	`str | Path`	Destination path for the `.parquet` file.	required
