
BenchPipeline

BenchPipeline is the primary user-facing entry point for generating benchmark datasets. It composes a DGP with an optional chain of corruptors, manages deterministic seed derivation, and assembles complete provenance metadata.

What it does

  1. Accepts a DGP instance and an optional list of corruptor instances.
  2. Enforces the canonical corruptor application order.
  3. Derives per-component seeds from the master random_state using a hash-based derivation so that each component gets an independent but deterministic seed (see the sketch after this list).
  4. Runs the DGP, then applies each corruptor in order, updating effective_feature_importances at each step.
  5. Returns a BenchResult with the complete feature matrix, target, and metadata.
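
A minimal sketch of the kind of hash-based seed derivation step 3 describes. The helper name and hashing scheme below are illustrative assumptions, not the library's actual derive_seeds:

import hashlib

def derive_seed(master_seed: int, component_name: str) -> int:
    # Mix the master seed with a stable component name through SHA-256,
    # then keep 32 bits so the result fits a NumPy-compatible seed.
    digest = hashlib.sha256(f"{master_seed}:{component_name}".encode()).digest()
    return int.from_bytes(digest[:4], "little")

print(derive_seed(42, "dgp"))          # deterministic across runs
print(derive_seed(42, "corruptor_0"))  # independent of the DGP's seed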

Full Example

from synthbench import (
    BenchPipeline,
    LinearDGP,
    MeasurementNoiseCorruptor,
    MissingDataCorruptor,
)

dgp = LinearDGP(complexity="high", task_type="regression", random_state=0)
pipeline = BenchPipeline(
    dgp,
    corruptors=[
        MeasurementNoiseCorruptor(noise_level=0.3, severity=0.4),
        MissingDataCorruptor(proportion=0.05, severity=0.2),
    ],
)
result = pipeline.run(n_samples=1000, n_features=20, random_state=42)

# Feature matrix and target
print(result.X.shape)   # (1000, 20)
print(result.y.shape)   # (1000,)

# Metadata fields
print(list(result.metadata.keys()))

BenchResult Fields

Field     Type        Description
X         np.ndarray  Feature matrix of shape (n_samples, n_features)
y         np.ndarray  Target array of shape (n_samples,)
metadata  dict        Provenance and importance information (see below)
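
For orientation, a hypothetical dataclass sketch of this shape; the actual BenchResult class may carry additional attributes or methods:

from dataclasses import dataclass, field
import numpy as np

@dataclass
class BenchResult:
    X: np.ndarray        # feature matrix, (n_samples, n_features)
    y: np.ndarray        # target array, (n_samples,)
    metadata: dict = field(default_factory=dict)  # provenance (see below)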

Metadata Keys

result.metadata["dgp_class"]                    # "LinearDGP"
result.metadata["dgp_params"]                   # dict of DGP constructor params (includes dgp_key)
result.metadata["signal_feature_importances"]   # importances before corruption
result.metadata["effective_feature_importances"] # importances after corruption
result.metadata["corruptor_order"]              # list of corruptor class names applied
result.metadata["corruptor_params"]             # list of per-corruptor param dicts
result.metadata["bayes_error"]                  # 1-NN LOO error rate (classification); None for regression
result.metadata["bayes_error_method"]           # "empirical_knn" for classification; None otherwise
result.metadata["effective_rank"]               # Roy & Vetterli (2007) rank of post-corruption X
result.metadata["synthbench_version"]           # package version string
result.metadata["numpy_version"]                # numpy version string
result.metadata["python_version"]               # python version string

Signal vs. Effective Importances

signal_feature_importances captures the ground-truth importance structure from the DGP and doesn't change after corruption. effective_feature_importances starts as a copy of the signal importances and is updated by each corruptor as information is degraded.

signal = result.metadata["signal_feature_importances"]
effective = result.metadata["effective_feature_importances"]

print(sum(signal.values()))    # 1.0
print(sum(effective.values())) # <= 1.0 after corruption
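
How each corruptor updates the effective importances is corruptor-specific. A purely hypothetical illustration of the kind of down-weighting involved, not the library's actual update rule:

def degrade(importances, severity):
    # Hypothetical rule: scale every importance by (1 - severity).
    # Real corruptors may target specific features or use other rules.
    return {name: value * (1.0 - severity) for name, value in importances.items()}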

Reproducibility

The same random_state always produces bit-identical results. BenchPipeline uses hash-based seed derivation (derive_seeds) to assign independent seeds to the DGP and each corruptor. Changing random_state gives a completely different dataset.

import numpy as np

r1 = pipeline.run(n_samples=100, n_features=5, random_state=0)
r2 = pipeline.run(n_samples=100, n_features=5, random_state=0)
assert np.array_equal(r1.X, r2.X)  # True, always identical
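
Conversely, per the guarantee above, a different random_state yields a different dataset:

r3 = pipeline.run(n_samples=100, n_features=5, random_state=1)
assert not np.array_equal(r1.X, r3.X)  # different seed, different data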

Usage Patterns

Common patterns for working with BenchPipeline in research workflows.

Minimal pipeline (no corruption)

from synthbench import BenchPipeline, LinearDGP

pipeline = BenchPipeline(LinearDGP(task_type="classification"))
result = pipeline.run(n_samples=500, n_features=10, random_state=0)
print(result.X.shape)   # (500, 10)
print(result.y.shape)   # (500,)

Fixing the DGP, varying random_state

Different random_state values produce statistically independent datasets from the same DGP and corruptor configuration, which is useful for generating multiple train/test splits.

from synthbench import BenchPipeline, LinearDGP, MeasurementNoiseCorruptor

pipeline = BenchPipeline(
    LinearDGP(task_type="regression"),
    corruptors=[MeasurementNoiseCorruptor(severity="medium")],
)
datasets = [
    pipeline.run(n_samples=300, n_features=8, random_state=seed)
    for seed in range(10)
]
# datasets[i] and datasets[j] are independently seeded for all i != j

Comparing DGP complexity tiers

from synthbench import BenchPipeline, LinearDGP

results = {}
for complexity in ("low", "medium", "high"):
    pipeline = BenchPipeline(LinearDGP(task_type="classification", complexity=complexity))
    results[complexity] = pipeline.run(n_samples=500, n_features=10, random_state=0)

for c, r in results.items():
    print(c, r.metadata["bayes_error"])

Accessing metadata enrichment fields

Every BenchResult produced by BenchPipeline.run() includes enrichment fields computed after corruption:

from synthbench import BenchPipeline, LinearDGP

result = BenchPipeline(LinearDGP(task_type="classification")).run(
    n_samples=500, n_features=10, random_state=0
)
print(result.metadata["bayes_error"])          # 1-NN LOO error rate
print(result.metadata["bayes_error_method"])   # "empirical_knn"
print(result.metadata["effective_rank"])       # Roy & Vetterli 2007

Regression tasks

bayes_error and bayes_error_method are None for regression tasks. effective_rank is always computed (for both classification and regression) unless the post-corruption feature matrix contains NaN values.
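
For example, using the same API as above:

from synthbench import BenchPipeline, LinearDGP

result = BenchPipeline(LinearDGP(task_type="regression")).run(
    n_samples=500, n_features=10, random_state=0
)
print(result.metadata["bayes_error"])         # None for regression
print(result.metadata["bayes_error_method"])  # None
print(result.metadata["effective_rank"])      # still computed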

Pipeline replay via from_metadata

See the Serialization reference for the full replay workflow using BenchPipeline.from_metadata().
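
As a quick orientation, a hedged sketch of replay; treating from_metadata as accepting a prior result's metadata dict is an assumption here, so defer to the Serialization reference for the exact signature:

# Assumption: from_metadata accepts the metadata dict from a prior run.
replayed = BenchPipeline.from_metadata(result.metadata)
rerun = replayed.run(n_samples=500, n_features=10, random_state=0)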