SparseDGP

SparseDGP generates datasets with a controlled number of truly informative features; the remaining features are pure noise. This makes it useful for benchmarking feature selection and sparse methods.

Quick Start

from synthbench import BenchPipeline, SparseDGP

dgp = SparseDGP(n_informative=3, complexity="medium", task_type="regression", random_state=0)
pipeline = BenchPipeline(dgp)
result = pipeline.run(n_samples=500, n_features=10, random_state=42)

print(result.X.shape)   # (500, 10)
print(result.y.shape)   # (500,)
print(list(result.metadata.keys()))

# Signal importances sum to 1.0; noise features are exactly 0.0
importances = result.metadata["signal_feature_importances"]
print(sum(importances.values()))  # 1.0

# Confirm noise features have zero importance
zero_importance = [k for k, v in importances.items() if v == 0.0]
print(f"Noise features: {len(zero_importance)}")  # 7

Parameters

Parameter       Default         Description
n_informative   3               Number of features that carry signal; the remaining features are pure noise
complexity      "medium"        Controls the relationship between the informative features and the target
task_type       "regression"    "regression" for a continuous target, "classification" for binary labels
random_state    0               Integer seed for reproducibility
class_weight    0.5             (Classification only) Fraction of samples in the positive class
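
To build intuition for what these parameters control, here is a minimal numpy sketch of the general idea behind a sparse data-generating process: only the first n_informative columns influence the target, and the rest are pure noise. This is a conceptual illustration, not SparseDGP's actual internals; the weight range and noise scale below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_informative = 500, 10, 3

# All features are drawn the same way; only the weights distinguish signal from noise.
X = rng.standard_normal((n_samples, n_features))
w = np.zeros(n_features)
w[:n_informative] = rng.uniform(0.5, 2.0, size=n_informative)  # signal weights (illustrative)

# A simple linear relationship; higher "complexity" settings would replace this
# with a nonlinear function of the informative columns.
y = X @ w + 0.1 * rng.standard_normal(n_samples)

# Importances mirror the documented contract: normalized over signal features,
# exactly 0.0 for noise features.
importances = np.abs(w) / np.abs(w).sum()
print(importances.sum())            # ~1.0 (up to float rounding)
print((importances == 0.0).sum())   # 7
```

Because the noise weights are exactly zero by construction, the zero-importance guarantee falls out directly rather than being a thresholded approximation.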

Notes

  • Noise features receive exactly 0.0 importance (not near-zero) in signal_feature_importances. This is a hard contract, not a floating-point approximation.
  • n_informative must be less than n_features (the value passed to .run()).
  • Ideal for benchmarking LASSO, elastic net, sparse random forests, and other variable-selection methods.
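
A benchmark built on this kind of dataset typically scores a selector by how well it recovers the known signal features. The sketch below uses a self-contained numpy stand-in for SparseDGP output (the weights and noise scale are illustrative assumptions) and a toy correlation-ranking selector in place of LASSO, to show the scoring pattern:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features, n_informative = 500, 10, 3

# Stand-in sparse dataset: first 3 features carry signal, the rest are noise.
X = rng.standard_normal((n_samples, n_features))
w = np.zeros(n_features)
w[:n_informative] = [1.5, 1.0, 0.8]
y = X @ w + 0.1 * rng.standard_normal(n_samples)

# Ground truth comes from the metadata contract: noise importance is exactly 0.0.
true_support = set(range(n_informative))

# Toy selector: rank features by |correlation with y|, keep the top n_informative.
scores = np.abs(np.corrcoef(X, y, rowvar=False)[-1, :-1])
selected = set(int(i) for i in np.argsort(scores)[-n_informative:])

# Support-recovery recall: fraction of true signal features the selector found.
recall = len(selected & true_support) / len(true_support)
print(f"support recovery recall: {recall:.2f}")
```

The same scoring loop applies unchanged to LASSO or elastic net: take the nonzero coefficients as the selected support and compare against the features with nonzero signal_feature_importances.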