# BenchPipeline

`BenchPipeline` is the primary user-facing entry point for generating benchmark datasets. It composes a DGP with an optional chain of corruptors, manages deterministic seed derivation, and assembles complete provenance metadata.
## What it does
- Accepts a DGP instance and an optional list of corruptor instances.
- Enforces the canonical corruptor application order.
- Derives per-component seeds from the master `random_state` using a hash-based derivation, so that each component gets an independent but deterministic seed (a sketch follows this list).
- Runs the DGP, then applies each corruptor in order, updating `effective_feature_importances` at each step.
- Returns a `BenchResult` with the complete feature matrix, target, and metadata.
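The derivation helper is exposed as `derive_seeds` (see Reproducibility below). As a rough sketch of the idea only, not synthbench's actual implementation, a hash-based derivation might look like this:

```python
# Hypothetical sketch of hash-based seed derivation; the real
# derive_seeds may differ in hash choice, labeling, and seed width.
import hashlib

def derive_seeds(master_seed: int, n_components: int) -> list[int]:
    """Derive one independent, deterministic 32-bit seed per component."""
    seeds = []
    for i in range(n_components):
        digest = hashlib.sha256(f"{master_seed}:{i}".encode()).digest()
        seeds.append(int.from_bytes(digest[:4], "little"))
    return seeds

# Component 0 would seed the DGP; components 1..n the corruptors, in order.
print(derive_seeds(42, 3))  # three distinct, reproducible seeds
```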
## Full Example

```python
import synthbench
from synthbench import (
BenchPipeline,
LinearDGP,
MeasurementNoiseCorruptor,
MissingDataCorruptor,
)
dgp = LinearDGP(complexity="high", task_type="regression", random_state=0)
pipeline = BenchPipeline(
dgp,
corruptors=[
MeasurementNoiseCorruptor(noise_level=0.3, severity=0.4),
MissingDataCorruptor(proportion=0.05, severity=0.2),
],
)
result = pipeline.run(n_samples=1000, n_features=20, random_state=42)
# Feature matrix and target
print(result.X.shape) # (1000, 20)
print(result.y.shape) # (1000,)
# Metadata fields
print(list(result.metadata.keys()))
```

## BenchResult Fields
| Field | Type | Description |
|---|---|---|
| `X` | `np.ndarray` | Feature matrix of shape `(n_samples, n_features)` |
| `y` | `np.ndarray` | Target array of shape `(n_samples,)` |
| `metadata` | `dict` | Provenance and importance information (see below) |
## Metadata Keys

```python
result.metadata["dgp_class"] # "LinearDGP"
result.metadata["dgp_params"] # dict of DGP constructor params (includes dgp_key)
result.metadata["signal_feature_importances"] # importances before corruption
result.metadata["effective_feature_importances"] # importances after corruption
result.metadata["corruptor_order"] # list of corruptor class names applied
result.metadata["corruptor_params"] # list of per-corruptor param dicts
result.metadata["bayes_error"] # 1-NN LOO error rate (classification); None for regression
result.metadata["bayes_error_method"] # "empirical_knn" for classification; None otherwise
result.metadata["effective_rank"] # Roy & Vetterli (2007) rank of post-corruption X
result.metadata["synthbench_version"] # package version string
result.metadata["numpy_version"] # numpy version string
result.metadata["python_version"] # python version string
## Signal vs. Effective Importances

`signal_feature_importances` captures the ground-truth importance structure from the DGP and does not change after corruption. `effective_feature_importances` starts as a copy of the signal importances and is updated by each corruptor as information is degraded.

```python
signal = result.metadata["signal_feature_importances"]
effective = result.metadata["effective_feature_importances"]
print(sum(signal.values())) # 1.0
print(sum(effective.values())) # <= 1.0 after corruption
```
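The exact update rule is corruptor-specific. Purely as an illustration, not any corruptor's actual rule, an update might attenuate each importance in proportion to severity:

```python
# Illustrative attenuation only; each corruptor defines its own update
# to effective_feature_importances.
signal = {"x0": 0.5, "x1": 0.3, "x2": 0.2}
severity = 0.4
effective = {name: w * (1 - severity) for name, w in signal.items()}
print(sum(signal.values()))     # 1.0
print(sum(effective.values()))  # ~0.6, i.e. <= 1.0 after corruption
```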
## Reproducibility

The same `random_state` always produces bit-identical results. `BenchPipeline` uses hash-based seed derivation (`derive_seeds`) to assign independent seeds to the DGP and each corruptor. Changing `random_state` gives a completely different dataset.
```python
import numpy as np

r1 = pipeline.run(n_samples=100, n_features=5, random_state=0)
r2 = pipeline.run(n_samples=100, n_features=5, random_state=0)
assert np.array_equal(r1.X, r2.X)  # always passes: same seed, identical data
```
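Conversely, a different `random_state` yields a different dataset:

```python
# Continues the snippet above: a new seed produces new data.
r3 = pipeline.run(n_samples=100, n_features=5, random_state=1)
assert not np.array_equal(r1.X, r3.X)
```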
## Usage Patterns

Common patterns for working with `BenchPipeline` in research workflows.
### Minimal pipeline (no corruption)

```python
from synthbench import BenchPipeline, LinearDGP
pipeline = BenchPipeline(LinearDGP(task_type="classification"))
result = pipeline.run(n_samples=500, n_features=10, random_state=0)
print(result.X.shape) # (500, 10)
print(result.y.shape) # (500,)
```

### Fixing the DGP, varying random_state

Different `random_state` values produce statistically independent datasets from the same DGP and corruptor configuration, which is useful for generating multiple train/test splits (a simple pairing example follows the snippet below).

```python
from synthbench import BenchPipeline, LinearDGP, MeasurementNoiseCorruptor
pipeline = BenchPipeline(
LinearDGP(task_type="regression"),
corruptors=[MeasurementNoiseCorruptor(severity="medium")],
)
datasets = [
pipeline.run(n_samples=300, n_features=8, random_state=seed)
for seed in range(10)
]
# datasets[i] and datasets[j] are independently seeded for all i != j
```
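For example, consecutive seeds can serve as a simple (purely illustrative) train/test pairing:

```python
# Continues the snippet above: pair independently seeded datasets.
train, test = datasets[0], datasets[1]
print(train.X.shape, test.X.shape)  # (300, 8) (300, 8)
```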
### Comparing DGP complexity tiers

```python
from synthbench import BenchPipeline, LinearDGP
results = {}
for complexity in ("low", "medium", "high"):
pipeline = BenchPipeline(LinearDGP(task_type="classification", complexity=complexity))
results[complexity] = pipeline.run(n_samples=500, n_features=10, random_state=0)
for c, r in results.items():
print(c, r.metadata["bayes_error"])
```

### Accessing metadata enrichment fields

Every `BenchResult` produced by `BenchPipeline.run()` includes enrichment fields computed after corruption:

```python
from synthbench import BenchPipeline, LinearDGP
result = BenchPipeline(LinearDGP(task_type="classification")).run(
n_samples=500, n_features=10, random_state=0
)
print(result.metadata["bayes_error"]) # 1-NN LOO error rate
print(result.metadata["bayes_error_method"]) # "empirical_knn"
print(result.metadata["effective_rank"]) # Roy & Vetterli 2007
### Regression tasks

`bayes_error` and `bayes_error_method` are `None` for regression tasks. `effective_rank` is always computed (for both classification and regression) unless the post-corruption feature matrix contains NaN values.
### Pipeline replay via `from_metadata`

See the Serialization reference for the full replay workflow using `BenchPipeline.from_metadata()`.
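A minimal sketch of the replay pattern, reusing `result` from the Full Example above (consult the Serialization reference for the authoritative workflow and caveats):

```python
# Rebuild an equivalent pipeline from recorded metadata, then re-run it
# with the same arguments to regenerate a dataset with this configuration.
replayed = BenchPipeline.from_metadata(result.metadata)
r_replay = replayed.run(n_samples=1000, n_features=20, random_state=42)
```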