# SparseDGP

SparseDGP generates datasets with a controlled number of truly informative features; the remaining features are pure noise. This makes it useful for benchmarking feature selection and sparse methods.
## Quick Start

```python
from synthbench import BenchPipeline, SparseDGP

dgp = SparseDGP(n_informative=3, complexity="medium", task_type="regression", random_state=0)
pipeline = BenchPipeline(dgp)
result = pipeline.run(n_samples=500, n_features=10, random_state=42)

print(result.X.shape)  # (500, 10)
print(result.y.shape)  # (500,)
print(list(result.metadata.keys()))

# Signal importances sum to 1.0; noise features are exactly 0.0
importances = result.metadata["signal_feature_importances"]
print(sum(importances.values()))  # 1.0

# Confirm noise features have zero importance
zero_importance = [k for k, v in importances.items() if v == 0.0]
print(f"Noise features: {len(zero_importance)}")  # 7
```
## Parameters

| Parameter | Default | Description |
|---|---|---|
| `n_informative` | `3` | Number of features that carry signal; the remaining features are pure noise |
| `complexity` | `"medium"` | Controls the complexity of the relationship between the informative features and the target |
| `task_type` | `"regression"` | `"regression"` for a continuous target, `"classification"` for binary labels |
| `random_state` | `0` | Integer seed for reproducibility |
| `class_weight` | `0.5` | (Classification only) Fraction of samples in the positive class |
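For classification tasks, `class_weight` sets the target fraction of positive labels. The library's exact mechanism is not documented here; one common way to achieve a target class balance is to threshold a latent score at a quantile. The NumPy sketch below is a hypothetical illustration of that idea, not SparseDGP's actual internals:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_informative = 500, 10, 3
class_weight = 0.5  # desired fraction of positive labels

X = rng.normal(size=(n_samples, n_features))
# Only the first n_informative columns drive the latent score; the rest are noise.
score = X[:, :n_informative] @ rng.normal(size=n_informative)

# Thresholding at the (1 - class_weight) quantile puts roughly a
# class_weight fraction of samples in the positive class.
threshold = np.quantile(score, 1.0 - class_weight)
y = (score > threshold).astype(int)

print(y.mean())  # ~0.5
```

Because the threshold is an empirical quantile of the score itself, the realized positive fraction tracks `class_weight` closely even for modest sample sizes.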
## Notes

- Noise features receive exactly `0.0` importance (not near-zero) in `signal_feature_importances`. This is a hard contract, not a floating-point approximation.
- `n_informative` must be less than `n_features` (the value passed to `.run()`).
- Ideal for benchmarking LASSO, elastic net, sparse random forests, and other variable-selection methods.
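To make the benchmarking use case concrete, the sketch below builds a SparseDGP-style dataset directly in NumPy (a stand-in, not the library's implementation: 3 informative columns out of 10, the rest pure noise) and checks that a simple correlation screen recovers the informative columns, which is what a good variable-selection method should also do:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features, n_informative = 500, 10, 3

X = rng.normal(size=(n_samples, n_features))
# True coefficients: nonzero only for the informative columns.
coefs = np.zeros(n_features)
coefs[:n_informative] = rng.uniform(1.0, 2.0, size=n_informative)
y = X @ coefs + 0.1 * rng.normal(size=n_samples)

# Rank features by absolute correlation with the target; under this
# strong-signal setup the informative columns should dominate.
abs_corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
selected = set(np.argsort(abs_corr)[-n_informative:])
print(selected)  # {0, 1, 2}
```

The same check works against a real `result` from `pipeline.run()`: run a selector on `result.X` and `result.y`, then compare the selected set against the zero-importance keys in `result.metadata["signal_feature_importances"]`.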