SparseDGP

SparseDGP generates datasets with a controlled number of truly informative features; the remaining features are pure noise. This makes it useful for benchmarking feature selection and sparse methods.

Quick Start

from synthbench import BenchPipeline, SparseDGP

dgp = SparseDGP(n_informative=3, complexity="medium", task_type="regression", random_state=0)
pipeline = BenchPipeline(dgp)
result = pipeline.run(n_samples=500, n_features=10, random_state=42)

print(result.X.shape)   # (500, 10)
print(result.y.shape)   # (500,)
print(list(result.metadata.keys()))

# Signal importances sum to 1.0; noise features are exactly 0.0
importances = result.metadata["signal_feature_importances"]
print(sum(importances.values()))  # 1.0

# Confirm noise features have zero importance
zero_importance = [k for k, v in importances.items() if v == 0.0]
print(f"Noise features: {len(zero_importance)}")  # 7

Parameters

Parameter       Default         Description
n_informative   3               Number of features that carry signal; the remaining features are pure noise
complexity      "medium"        Controls the relationship between the informative features and the target
task_type       "regression"    "regression" for a continuous target, "classification" for binary labels
random_state    0               Integer seed for reproducibility
class_weight    0.5             (Classification only) Fraction of samples in the positive class
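
To build intuition for what these parameters control, here is a minimal numpy sketch of the general idea behind a sparse data-generating process: only the first n_informative columns influence the target, and the rest are pure noise. This is a conceptual illustration, not SparseDGP's actual internals; the weight range and noise scale below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_informative = 500, 10, 3

# All features are drawn the same way; only the weights distinguish signal from noise.
X = rng.standard_normal((n_samples, n_features))
w = np.zeros(n_features)
w[:n_informative] = rng.uniform(0.5, 2.0, size=n_informative)  # signal weights (illustrative)

# A simple linear relationship; higher "complexity" settings would replace this
# with a nonlinear function of the informative columns.
y = X @ w + 0.1 * rng.standard_normal(n_samples)

# Importances mirror the documented contract: normalized over signal features,
# exactly 0.0 for noise features.
importances = np.abs(w) / np.abs(w).sum()
print(importances.sum())            # ~1.0 (up to float rounding)
print((importances == 0.0).sum())   # 7
```

Because the noise weights are exactly zero by construction, the zero-importance guarantee falls out directly rather than being a thresholded approximation.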

Notes

  • Noise features receive exactly 0.0 importance (not near-zero) in signal_feature_importances. This is a hard contract, not a floating-point approximation.
  • n_informative must be less than n_features (the value passed to .run()).
  • Ideal for benchmarking LASSO, elastic net, sparse random forests, and other variable-selection methods.
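
A benchmark built on this kind of dataset typically scores a selector by how well it recovers the known signal features. The sketch below uses a self-contained numpy stand-in for SparseDGP output (the weights and noise scale are illustrative assumptions) and a toy correlation-ranking selector in place of LASSO, to show the scoring pattern:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features, n_informative = 500, 10, 3

# Stand-in sparse dataset: first 3 features carry signal, the rest are noise.
X = rng.standard_normal((n_samples, n_features))
w = np.zeros(n_features)
w[:n_informative] = [1.5, 1.0, 0.8]
y = X @ w + 0.1 * rng.standard_normal(n_samples)

# Ground truth comes from the metadata contract: noise importance is exactly 0.0.
true_support = set(range(n_informative))

# Toy selector: rank features by |correlation with y|, keep the top n_informative.
scores = np.abs(np.corrcoef(X, y, rowvar=False)[-1, :-1])
selected = set(int(i) for i in np.argsort(scores)[-n_informative:])

# Support-recovery recall: fraction of true signal features the selector found.
recall = len(selected & true_support) / len(true_support)
print(f"support recovery recall: {recall:.2f}")
```

The same scoring loop applies unchanged to LASSO or elastic net: take the nonzero coefficients as the selected support and compare against the features with nonzero signal_feature_importances.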