synthbench

synthbench is a Python library for generating reproducible, metadata-rich synthetic datasets for benchmarking. It provides eight data-generating process (DGP) families, five corruptors for introducing structural messiness, and a pipeline that composes them with deterministic seed management.

Installation

pip install git+https://github.com/JanTeichertKluge/synth-bench.git

For RandomNeuralDGP (requires PyTorch):

pip install "synthbench[neural] @ git+https://github.com/JanTeichertKluge/synth-bench.git"

Quickstart

import synthbench
from synthbench import BenchPipeline, LinearDGP

pipeline = BenchPipeline(LinearDGP(complexity="medium", task_type="regression"))
result = pipeline.run(n_samples=500, n_features=10, random_state=42)

print(result.X.shape)   # (500, 10)
print(result.y.shape)   # (500,)
print(list(result.metadata.keys()))

Data-Generating Processes

| DGP | Description |
| --- | --- |
| LinearDGP | Linear combination of features with controlled coefficient structure |
| PolynomialDGP | Polynomial interactions with controlled degree |
| TreeDGP | Decision-tree-style splits with controlled depth |
| FriedmanDGP | Classic Friedman benchmark functions (1, 2, or 3) |
| AdditiveDGP | GAM-style sum of univariate functions |
| SparseDGP | Sparse signal with controlled number of informative features |
| GeometricDGP | Geometric manifold structures (moons, circles, spirals) |
| RandomNeuralDGP | Random neural network output as a nonlinear signal |
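To make one of these families concrete, here is a minimal NumPy sketch of the classic Friedman #1 function, the kind of signal FriedmanDGP refers to. It does not use synthbench and is not the library's implementation: the function name `friedman1` and its parameters are assumptions chosen for illustration. Only the formula itself is the standard benchmark definition.

```python
import numpy as np


def friedman1(n_samples, n_features=10, noise=1.0, random_state=None):
    """Illustrative Friedman #1 generator (hypothetical helper, not synthbench API).

    Features are uniform on [0, 1]. Only the first five columns carry signal;
    any remaining columns are pure distractors.
    """
    if n_features < 5:
        raise ValueError("Friedman #1 needs at least 5 features")
    rng = np.random.default_rng(random_state)
    X = rng.uniform(0.0, 1.0, size=(n_samples, n_features))
    signal = (
        10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
        + 20.0 * (X[:, 2] - 0.5) ** 2
        + 10.0 * X[:, 3]
        + 5.0 * X[:, 4]
    )
    # Additive Gaussian noise with standard deviation `noise`.
    y = signal + noise * rng.standard_normal(n_samples)
    return X, y


X, y = friedman1(500, n_features=10, random_state=42)
print(X.shape, y.shape)
```

Because all randomness flows through a single seeded generator, calling the function twice with the same `random_state` returns identical arrays.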

Reference

  • Corruptors: transform X to introduce noise, outliers, missing data, and more
  • Pipeline: BenchPipeline composes a DGP with corruptors reproducibly
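One way to picture how a DGP and corruptors can be composed reproducibly is the sketch below, written in plain NumPy. It is an assumption about the general technique, not a description of how BenchPipeline works internally: `linear_dgp`, `add_noise`, `inject_missing`, and `run_pipeline` are hypothetical names. The idea shown is to derive one independent child seed per stage from a single `random_state` with `numpy.random.SeedSequence.spawn`.

```python
import numpy as np


def linear_dgp(rng, n_samples, n_features):
    """Hypothetical linear DGP: y is a noiseless linear combination of X."""
    X = rng.standard_normal((n_samples, n_features))
    coef = rng.uniform(-1.0, 1.0, size=n_features)
    return X, X @ coef, {"coef": coef}


def add_noise(rng, X, scale=0.1):
    """Hypothetical corruptor: additive Gaussian noise on the features."""
    return X + scale * rng.standard_normal(X.shape)


def inject_missing(rng, X, rate=0.05):
    """Hypothetical corruptor: set a random fraction of entries to NaN."""
    X = X.copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X


def run_pipeline(n_samples, n_features, random_state,
                 corruptors=(add_noise, inject_missing)):
    # One root seed; each stage gets its own independent child stream.
    root = np.random.SeedSequence(random_state)
    children = root.spawn(1 + len(corruptors))
    X, y, metadata = linear_dgp(np.random.default_rng(children[0]),
                                n_samples, n_features)
    for child, corrupt in zip(children[1:], corruptors):
        X = corrupt(np.random.default_rng(child), X)
    metadata["random_state"] = random_state
    return X, y, metadata


X, y, metadata = run_pipeline(n_samples=200, n_features=5, random_state=42)
print(X.shape, y.shape, sorted(metadata))
```

Giving each stage its own child seed means that adding or removing a corruptor in this sketch does not change the data the DGP itself produces for a given `random_state`.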