LinearDGP
LinearDGP generates datasets where the target is a linear combination of input features. The complexity parameter controls the coefficient structure: how many features carry signal and how uniformly their coefficients are distributed.
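To make "linear combination of input features" concrete, here is a minimal standard-library sketch of a linear data-generating process. It is not synthbench's implementation: the function name `make_linear_data`, the `n_informative` and `noise_std` parameters, and the Gaussian feature sampling are all assumptions chosen for illustration.

```python
import random


def make_linear_data(n_samples, n_features, n_informative, noise_std=0.1, seed=0):
    """Hypothetical sketch: y = X . w + noise, with only the first
    n_informative coefficients nonzero. Not the synthbench implementation."""
    rng = random.Random(seed)
    # Informative features get a nonzero coefficient; the rest carry no signal.
    coefs = [rng.uniform(0.5, 1.5) if j < n_informative else 0.0
             for j in range(n_features)]
    # Standard-normal features, one row per sample.
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    # Target is the weighted sum of features plus Gaussian noise.
    y = [sum(w * x for w, x in zip(coefs, row)) + rng.gauss(0.0, noise_std)
         for row in X]
    return X, y, coefs


X, y, coefs = make_linear_data(n_samples=500, n_features=10, n_informative=4)
print(len(X), len(X[0]), len(y))           # 500 10 500
print(sum(1 for w in coefs if w != 0.0))   # 4
```

In this sketch, raising `n_informative` or widening the coefficient range plays the role that the complexity setting is described as playing above.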
Quick Start
```python
import synthbench
from synthbench import BenchPipeline, LinearDGP

dgp = LinearDGP(complexity="medium", task_type="regression", random_state=0)
pipeline = BenchPipeline(dgp)
result = pipeline.run(n_samples=500, n_features=10, random_state=42)

print(result.X.shape)  # (500, 10)
print(result.y.shape)  # (500,)
print(list(result.metadata.keys()))

# Signal importances sum to 1.0
importances = result.metadata["signal_feature_importances"]
print(sum(importances.values()))  # 1.0
```
Parameters
| Parameter | Default | Description |
|---|---|---|
| `complexity` | `"medium"` | Controls coefficient structure: `"low"` uses fewer features, `"high"` uses more with varied weights |
| `task_type` | `"regression"` | `"regression"` for continuous target, `"classification"` for binary labels |
| `random_state` | `0` | Integer seed for reproducibility |
| `class_weight` | `0.5` | (Classification only) Fraction of samples in the positive class |
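The documentation does not say how `class_weight` is enforced. One plausible approach, shown here purely as an assumption, is to threshold a continuous linear score at a quantile so that the requested fraction of samples lands in the positive class. The helper `labels_from_scores` is hypothetical and not part of synthbench.

```python
import random


def labels_from_scores(scores, class_weight=0.5):
    """Hypothetical helper: label the top `class_weight` fraction of scores as 1."""
    n_positive = round(class_weight * len(scores))
    # Sample indices ordered from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = [0] * len(scores)
    for i in order[:n_positive]:
        labels[i] = 1
    return labels


rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(200)]  # stand-in for a linear score
labels = labels_from_scores(scores, class_weight=0.25)
print(sum(labels) / len(labels))  # 0.25
```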
Notes
- Low complexity: Few informative features with near-equal coefficients.
- Medium complexity: Moderate number of informative features.
- High complexity: Many features with varying coefficient magnitudes.
- Signal feature importances reflect the squared coefficient magnitudes, normalized to sum to 1.0.
- Noise features receive exactly `0.0` importance in `signal_feature_importances`.
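The importance rule in the notes above (squared coefficient magnitudes, normalized to sum to 1.0) can be sketched in a few lines. This is an illustration of the stated formula only: the function name `signal_importances` and the `feature_<index>` key naming are assumptions, since the actual key format of the metadata dictionary is not documented here.

```python
def signal_importances(coefs):
    """Sketch: squared-coefficient importances normalized to sum to 1.0."""
    squared = [w * w for w in coefs]
    total = sum(squared)
    if total == 0.0:
        raise ValueError("at least one coefficient must be nonzero")
    # Key names are hypothetical; zero coefficients yield exactly 0.0.
    return {f"feature_{j}": s / total for j, s in enumerate(squared)}


# Two informative features (coefficients 3 and 4) plus two noise features.
importances = signal_importances([3.0, 4.0, 0.0, 0.0])
print(importances)
# {'feature_0': 0.36, 'feature_1': 0.64, 'feature_2': 0.0, 'feature_3': 0.0}
```

Because the coefficients are squared, a feature with a slightly larger coefficient receives a disproportionately larger share: here a 4:3 coefficient ratio becomes a 16:9 importance ratio.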