LinearDGP

LinearDGP generates datasets where the target is a linear combination of input features. The complexity parameter controls the coefficient structure: how many features carry signal and how uniformly their coefficients are distributed.
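
As a rough illustration of what "a linear combination of input features" means, the NumPy sketch below builds a target from a sparse coefficient vector plus noise. It is not synthbench's implementation: the helper name `make_linear_data`, the `n_informative` and `noise_std` arguments, the Gaussian features, and the uniform coefficient range are all assumptions made for this example.

```python
import numpy as np

def make_linear_data(n_samples, n_features, n_informative, noise_std=0.1, seed=0):
    """Hypothetical helper (not part of synthbench): y = X @ coef + noise."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    # Only the first n_informative features carry signal; the rest stay at 0.
    coef = np.zeros(n_features)
    coef[:n_informative] = rng.uniform(0.5, 1.5, size=n_informative)
    y = X @ coef + noise_std * rng.standard_normal(n_samples)
    return X, y, coef

X, y, coef = make_linear_data(n_samples=500, n_features=10, n_informative=4)
print(X.shape, y.shape)         # (500, 10) (500,)
print(np.count_nonzero(coef))   # 4
```

In this sketch, raising `n_informative` and widening the coefficient range plays the role that the complexity setting is described as playing in LinearDGP.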

Quick Start

import synthbench
from synthbench import BenchPipeline, LinearDGP

dgp = LinearDGP(complexity="medium", task_type="regression", random_state=0)
pipeline = BenchPipeline(dgp)
result = pipeline.run(n_samples=500, n_features=10, random_state=42)

print(result.X.shape)   # (500, 10)
print(result.y.shape)   # (500,)
print(list(result.metadata.keys()))

# Signal importances sum to 1.0
importances = result.metadata["signal_feature_importances"]
print(sum(importances.values()))  # 1.0, up to floating-point rounding

Parameters

Parameter      Default         Description
complexity     "medium"        Controls coefficient structure: "low" uses fewer features, "high" uses more with varied weights
task_type      "regression"    "regression" for continuous target, "classification" for binary labels
random_state   0               Integer seed for reproducibility
class_weight   0.5             (Classification only) Fraction of samples in the positive class
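
The documentation does not say how binary labels are derived from the linear score. One plausible way to hit a target positive-class fraction is to threshold the score at a quantile, sketched below. The helper `binarize_scores` and its `positive_fraction` argument are hypothetical and only mirror the meaning of `class_weight`; LinearDGP may work differently.

```python
import numpy as np

def binarize_scores(scores, positive_fraction=0.5):
    """Hypothetical helper: label roughly the top `positive_fraction` of scores as 1."""
    # Scores above the (1 - positive_fraction) quantile become the positive class.
    threshold = np.quantile(scores, 1.0 - positive_fraction)
    return (scores > threshold).astype(int)

rng = np.random.default_rng(0)
scores = rng.standard_normal(1000)   # stand-in for a linear score X @ coef
labels = binarize_scores(scores, positive_fraction=0.3)
print(labels.mean())                 # close to 0.3
```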

Notes

  • Low complexity: Few informative features with near-equal coefficients.
  • Medium complexity: Moderate number of informative features.
  • High complexity: Many features with varying coefficient magnitudes.
  • Signal feature importances reflect the squared coefficient magnitudes, normalized to sum to 1.0.
  • Noise features receive exactly 0.0 importance in signal_feature_importances.
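
The normalization described in the notes above can be sketched in a few lines of NumPy. This follows the stated rule (squared coefficients divided by their sum) and is not taken from the library source; the function name `normalized_importances` and the example coefficients are made up for illustration.

```python
import numpy as np

def normalized_importances(coef):
    """Squared coefficient magnitudes, scaled so they sum to 1."""
    squared = np.asarray(coef, dtype=float) ** 2
    total = squared.sum()
    if total == 0:
        raise ValueError("at least one coefficient must be nonzero")
    return squared / total

# Illustrative coefficients: three signal features, two noise features.
coef = np.array([2.0, 1.0, 1.0, 0.0, 0.0])
imp = normalized_importances(coef)
print(imp)         # proportions 4/6, 1/6, 1/6, 0, 0
print(imp.sum())   # 1.0 up to floating-point rounding
```

Because a noise feature has a coefficient of exactly zero, its squared magnitude and therefore its importance are exactly 0.0, which matches the last note.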