synthbench
synthbench is a Python library for generating reproducible, metadata-rich synthetic datasets for benchmarking. It provides eight data-generating process (DGP) families, five corruptors for introducing structural messiness, and a pipeline that composes them with deterministic seed management.
Installation
For RandomNeuralDGP (requires PyTorch):
Quickstart
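from synthbench import BenchPipeline, LinearDGP
# Wrap a single DGP in a pipeline; no corruptors are attached in this example.
pipeline = BenchPipeline(LinearDGP(complexity="medium", task_type="regression"))
# Re-running with the same random_state reproduces the same dataset.
result = pipeline.run(n_samples=500, n_features=10, random_state=42)
print(result.X.shape)  # (500, 10)
print(result.y.shape)  # (500,)
print(list(result.metadata.keys()))  # names of the recorded metadata fields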
Data-Generating Processes
| DGP | Description |
|---|---|
| LinearDGP | Linear combination of features with controlled coefficient structure |
| PolynomialDGP | Polynomial interactions with controlled degree |
| TreeDGP | Decision-tree-style splits with controlled depth |
| FriedmanDGP | Classic Friedman benchmark functions (1, 2, or 3) |
| AdditiveDGP | GAM-style sum of univariate functions |
| SparseDGP | Sparse signal with controlled number of informative features |
| GeometricDGP | Geometric manifold structures (moons, circles, spirals) |
| RandomNeuralDGP | Random neural network output as a nonlinear signal |
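
To make the idea of a DGP concrete, here is a minimal standard-library sketch of the Friedman #1 benchmark function that FriedmanDGP refers to. It is not synthbench's implementation: the function name `friedman1` and its parameters are hypothetical, chosen only for illustration. The formula itself is the standard one, with features drawn uniformly from the unit interval and only the first five influencing the target.

```python
import math
import random


def friedman1(n_samples, n_features=10, noise=0.0, random_state=0):
    """Generate Friedman #1 data (hypothetical helper, not the synthbench API).

    Only the first five features influence y; any remaining features
    are uninformative.
    """
    if n_features < 5:
        raise ValueError("Friedman #1 needs at least 5 features")
    rng = random.Random(random_state)  # local generator, no global state
    X, y = [], []
    for _ in range(n_samples):
        row = [rng.random() for _ in range(n_features)]  # uniform on [0, 1)
        signal = (
            10.0 * math.sin(math.pi * row[0] * row[1])
            + 20.0 * (row[2] - 0.5) ** 2
            + 10.0 * row[3]
            + 5.0 * row[4]
        )
        X.append(row)
        y.append(signal + rng.gauss(0.0, noise))  # additive Gaussian noise
    return X, y


X, y = friedman1(n_samples=500, n_features=10, noise=1.0, random_state=42)
X_again, y_again = friedman1(n_samples=500, n_features=10, noise=1.0, random_state=42)
print(len(X), len(X[0]))  # 500 10
print(y == y_again)       # True
```

Because the generator is seeded locally, two calls with the same `random_state` return identical data, which is the property a benchmarking DGP needs.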
Reference
- Corruptors: transform X to introduce structural messiness such as noise, outliers, and missing data
- Pipeline: BenchPipeline composes a DGP with corruptors reproducibly
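
The following sketch illustrates one way a pipeline can compose a generator with corruptors under deterministic seed management. It is an assumption-laden illustration, not how BenchPipeline works internally: `derive_seed`, `add_outliers`, `add_missing`, and `run` are all hypothetical names. The idea shown is that each stage gets its own seed derived from a single master seed, so adding or reordering stages does not change the random stream of the others.

```python
import hashlib
import random


def derive_seed(master_seed, stage_name):
    """Derive a stable per-stage seed from the master seed and a stage label."""
    digest = hashlib.sha256(f"{master_seed}:{stage_name}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def add_outliers(X, rate, scale, seed):
    """Hypothetical corruptor: multiply each cell by `scale` with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for row in X:
        new_row = []
        for value in row:
            hit = rng.random() < rate  # draw once per cell, in a fixed order
            new_row.append(value * scale if hit else value)
        out.append(new_row)
    return out


def add_missing(X, rate, seed):
    """Hypothetical corruptor: replace each cell with None with probability `rate`."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else value for value in row] for row in X]


def run(random_state, n_samples=6, n_features=4):
    """Generate data, then apply the corruptors, each with its own derived seed."""
    data_rng = random.Random(derive_seed(random_state, "dgp"))
    X = [[data_rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    X = add_outliers(X, rate=0.1, scale=10.0, seed=derive_seed(random_state, "outliers"))
    X = add_missing(X, rate=0.2, seed=derive_seed(random_state, "missing"))
    return X


first = run(random_state=42)
second = run(random_state=42)
print(first == second)  # True
```

Hashing the master seed together with a stage label is only one possible design; the point is that every stage draws from an independent, reproducible stream rather than sharing global random state.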