
Pipeline & Result API Reference

BenchPipeline orchestrates a data-generating process (DGP) and an ordered chain of corruptors.

BenchPipeline is the primary user-facing entry point for generating corrupted benchmark datasets. It:

  • Accepts a BaseDGP instance and an optional list of BaseCorruptor instances.
  • Enforces the canonical corruptor application order: collinearity → categorical → measurement_noise → outlier → missing_data.
  • Derives all internal seeds from random_state through an internal seed derivation function, so that the same random_state always produces identical results.
  • Assembles complete provenance metadata.
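
The canonical ordering in the list above can be illustrated with a small standalone sketch. It does not use the library: FakeCorruptor and its kind attribute are hypothetical stand-ins, and the sketch assumes the pipeline reorders the corruptors it is given rather than rejecting an out-of-order list, which the reference does not state.

```python
# Canonical application order, copied from the list above.
CANONICAL_ORDER = [
    "collinearity",
    "categorical",
    "measurement_noise",
    "outlier",
    "missing_data",
]
RANK = {kind: position for position, kind in enumerate(CANONICAL_ORDER)}


class FakeCorruptor:
    """Hypothetical stand-in; `kind` is an assumed attribute, not the library's API."""

    def __init__(self, kind):
        self.kind = kind


def in_canonical_order(corruptors):
    # sorted() is stable, so corruptors of the same kind keep their relative order.
    return sorted(corruptors, key=lambda c: RANK[c.kind])


user_order = [
    FakeCorruptor("missing_data"),
    FakeCorruptor("collinearity"),
    FakeCorruptor("outlier"),
]
ordered_kinds = [c.kind for c in in_canonical_order(user_order)]
print(ordered_kinds)
```

Sorting by rank keeps the result independent of the order in which the caller listed the corruptors, which is one way to make runs comparable across users.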

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dgp | | A concrete BaseDGP instance. | required |
| corruptors | | Optional list of BaseCorruptor instances. If None or empty, the pipeline runs the DGP only. | None |

Examples:

>>> from synthbench import BenchPipeline, LinearDGP, MeasurementNoiseCorruptor
>>> pipeline = BenchPipeline(LinearDGP(), [MeasurementNoiseCorruptor()])
>>> result = pipeline.run(n_samples=200, random_state=42)

from_metadata(metadata) classmethod

Reconstruct a BenchPipeline from saved metadata (SER-04).

Callers should then invoke .run(n_samples, n_features, random_state) using the values from metadata["dgp_params"] for bit-identical replay.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| metadata | dict | A metadata dict produced by a prior BenchPipeline.run call (must include dgp_params["dgp_key"]; see SER-05). | required |

Returns:

| Type | Description |
|---|---|
| BenchPipeline | A new pipeline reconstructed from the saved dgp_params, corruptor_params, and label_corruptor_params. |

Raises:

| Type | Description |
|---|---|
| KeyError | If metadata["dgp_params"]["dgp_key"] is missing or not found in the registry. |
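
The documented KeyError behaviour can be sketched as a plain dictionary lookup. This is an illustration only: DGP_REGISTRY, LinearStandIn, and resolve_dgp_class are hypothetical names, and the library's real registry may be structured differently.

```python
class LinearStandIn:
    """Hypothetical placeholder for a registered DGP class."""


# Hypothetical registry mapping dgp_key strings to DGP classes.
DGP_REGISTRY = {"linear": LinearStandIn}


def resolve_dgp_class(metadata):
    # Each lookup raises KeyError by itself when a key is absent,
    # which matches the failure mode documented above.
    dgp_key = metadata["dgp_params"]["dgp_key"]
    return DGP_REGISTRY[dgp_key]


def raises_key_error(metadata):
    try:
        resolve_dgp_class(metadata)
    except KeyError:
        return True
    return False


found = resolve_dgp_class({"dgp_params": {"dgp_key": "linear"}})
missing_key = raises_key_error({"dgp_params": {}})
unknown_key = raises_key_error({"dgp_params": {"dgp_key": "not_registered"}})
```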

run(n_samples, n_features=10, random_state=0)

Generate a corrupted benchmark dataset.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n_samples | int | Number of rows in the output feature matrix. | required |
| n_features | int | Number of columns in the output feature matrix. | 10 |
| random_state | int | Master integer seed. The same value always produces bit-identical results. | 0 |

Returns:

| Type | Description |
|---|---|
| BenchResult | Contains the (corrupted) feature matrix X, target y, and rich metadata including signal/effective importances and provenance. |
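
The reference does not describe how the master seed is turned into per-stage seeds. As a minimal sketch of one common approach, the example below hashes the master seed together with a stage label; derive_seed and draw are hypothetical helpers, not the library's internal function, and the standard-library random module stands in for whatever generator the library uses.

```python
import hashlib
import random


def derive_seed(master_seed, stage):
    # Hash the master seed together with a stage label so that each stage
    # gets its own reproducible child seed.
    digest = hashlib.sha256(f"{master_seed}:{stage}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def draw(master_seed, stage, n=3):
    # A fresh generator per call: the output depends only on (master_seed, stage).
    rng = random.Random(derive_seed(master_seed, stage))
    return [rng.random() for _ in range(n)]


first = draw(42, "measurement_noise")
second = draw(42, "measurement_noise")
other_stage = draw(42, "outlier")
```

With a scheme like this, adding or removing one stage does not shift the random stream consumed by the others, which is what makes replay from a single random_state practical.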

BenchResult is the canonical output type for all DGP and pipeline operations.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Feature matrix of shape (n_samples, n_features). | required |
| y | ndarray | Target array of shape (n_samples,). | required |
| metadata | dict | Free-form dict describing how the dataset was generated. See METADATA_SCHEMA_KEYS for the agreed key set. All values must be JSON-native so that to_json works without a custom encoder. | dict() |

from_csv(path) classmethod

Read X and y from a CSV file and metadata from its {stem}.meta.json sidecar.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to the .csv file. | required |

Returns:

| Type | Description |
|---|---|
| BenchResult | |

Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If the sidecar .meta.json file is missing. The .csv alone is not a valid BenchResult. |
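
The sidecar rule can be sketched with pathlib alone. The helpers sidecar_path and require_sidecar are hypothetical, and the sketch assumes {stem} means the CSV file name without its .csv suffix, with the sidecar located in the same directory.

```python
import tempfile
from pathlib import Path


def sidecar_path(csv_path):
    # "bench.csv" -> "bench.meta.json", in the same directory.
    csv_path = Path(csv_path)
    return csv_path.with_name(csv_path.stem + ".meta.json")


def require_sidecar(csv_path):
    meta = sidecar_path(csv_path)
    if not meta.is_file():
        raise FileNotFoundError(f"missing metadata sidecar: {meta}")
    return meta


with tempfile.TemporaryDirectory() as tmp:
    csv_file = Path(tmp) / "bench.csv"
    csv_file.write_text("x0,y\n1.0,2.0\n")

    # The CSV exists, but without its sidecar the check fails.
    try:
        require_sidecar(csv_file)
        csv_alone_ok = True
    except FileNotFoundError:
        csv_alone_ok = False

    # Once the sidecar is present, the check passes.
    sidecar_path(csv_file).write_text("{}")
    with_sidecar_ok = require_sidecar(csv_file).name == "bench.meta.json"
```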

from_json(s) classmethod

Deserialise a BenchResult from a JSON string.

X and y are reconstructed as np.ndarray.

from_parquet(path) classmethod

Read X, y, and metadata from a Parquet file written by to_parquet.

Requires the [io] optional extra: pip install synthbench[io].

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to the .parquet file. | required |

Returns:

| Type | Description |
|---|---|
| BenchResult | |

to_csv(path)

Write X, y to CSV and metadata to a {stem}.meta.json sidecar.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Destination path for the .csv file. A companion {stem}.meta.json file is written in the same directory. | required |
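
As a rough illustration of the two-file layout, the sketch below writes a CSV and its sidecar using only the standard library. The function write_csv_with_sidecar and the column names x0, x1, ..., y are assumptions made for this example; the reference does not specify the CSV header the library writes.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_csv_with_sidecar(path, X, y, metadata):
    path = Path(path)
    n_features = len(X[0]) if X else 0
    # Assumed column naming; the real header may differ.
    header = [f"x{i}" for i in range(n_features)] + ["y"]
    with path.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        for row, target in zip(X, y):
            writer.writerow(list(row) + [target])
    # The metadata goes to {stem}.meta.json next to the CSV.
    sidecar = path.with_name(path.stem + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar


with tempfile.TemporaryDirectory() as tmp:
    sidecar = write_csv_with_sidecar(
        Path(tmp) / "bench.csv",
        X=[[1.0, 2.0], [3.0, 4.0]],
        y=[0.5, 1.5],
        metadata={"random_state": 42},
    )
    written_files = sorted(p.name for p in Path(tmp).iterdir())
    restored_meta = json.loads(sidecar.read_text())
```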

to_json()

Serialise the result to a JSON string.

X and y are converted to nested Python lists via .tolist(). metadata is serialised as-is and must therefore already contain only JSON-native values.
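
The JSON-native requirement can be seen with the standard json module. In this sketch, plain nested lists stand in for the output of .tolist(), and the top-level keys "X", "y", and "metadata" are assumed for illustration; to_json_sketch is a hypothetical helper, not the library method.

```python
import json

# Nested lists, as .tolist() would produce for a 2x2 array and a length-2 array.
X_rows = [[0.1, 0.2], [0.3, 0.4]]
y_vals = [1.0, 0.0]


def to_json_sketch(X_rows, y_vals, metadata):
    # No custom encoder: metadata must already be JSON-native.
    return json.dumps({"X": X_rows, "y": y_vals, "metadata": metadata})


payload = to_json_sketch(X_rows, y_vals, {"random_state": 42, "corruptors": ["outlier"]})
restored = json.loads(payload)

# A set is not JSON-native, so serialisation fails with TypeError.
try:
    to_json_sketch(X_rows, y_vals, {"tags": {"a", "b"}})
    rejected = False
except TypeError:
    rejected = True
```

Converting such values (for example, a set to a sorted list) before storing them in metadata keeps serialisation free of custom encoders.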

to_parquet(path)

Write X, y, and metadata to a Parquet file.

Metadata is embedded in the Parquet schema header under the b"synthbench_metadata" key.

Requires the [io] optional extra: pip install synthbench[io].

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Destination path for the .parquet file. | required |