# TreeDGP
TreeDGP generates datasets using a decision-tree-style splitting structure. The complexity parameter controls tree depth and the number of splits, producing increasingly complex piecewise-constant relationships between features and the target.
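To make "piecewise-constant" concrete, here is a hypothetical two-split tree written out by hand. This is purely illustrative and not TreeDGP's internal representation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 10))

# Hypothetical tree: split on feature 0 at 0.5, then on feature 3 at 0.7.
# Every sample falling into the same leaf gets the same constant target,
# so y is piecewise constant in X.
y = np.where(
    X[:, 0] < 0.5,
    -1.0,                               # leaf A
    np.where(X[:, 3] < 0.7, 0.5, 2.0),  # leaf B / leaf C
)
```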
## Quick Start
```python
from synthbench import BenchPipeline, TreeDGP

dgp = TreeDGP(complexity="medium", task_type="regression", random_state=0)
pipeline = BenchPipeline(dgp)
result = pipeline.run(n_samples=500, n_features=10, random_state=42)

print(result.X.shape)  # (500, 10)
print(result.y.shape)  # (500,)
print(list(result.metadata.keys()))

# Signal importances sum to 1.0
importances = result.metadata["signal_feature_importances"]
print(sum(importances.values()))  # 1.0
```
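As the Notes below state, the generator is deterministic in its seeds. A quick check, assuming `pipeline.run` is likewise deterministic for a fixed `random_state`:

```python
import numpy as np

# Identical seeds should reproduce the dataset exactly.
result2 = pipeline.run(n_samples=500, n_features=10, random_state=42)
assert np.array_equal(result.X, result2.X)
assert np.array_equal(result.y, result2.y)
```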
## Parameters
| Parameter | Default | Description |
|---|---|---|
| `complexity` | `"medium"` | Controls tree depth and split count: `"low"` = shallow, `"high"` = deep with many splits |
| `task_type` | `"regression"` | `"regression"` for a continuous target, `"classification"` for binary labels |
| `random_state` | `0` | Integer seed for reproducibility |
| `class_weight` | `0.5` | (Classification only) Fraction of samples in the positive class |
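For classification tasks, `class_weight` sets the expected fraction of positive samples. A minimal sketch, assuming labels come back encoded as 0/1:

```python
from synthbench import BenchPipeline, TreeDGP

dgp = TreeDGP(
    complexity="low",
    task_type="classification",
    class_weight=0.3,  # aim for ~30% positive samples
    random_state=0,
)
result = BenchPipeline(dgp).run(n_samples=1000, n_features=5, random_state=1)

# With 0/1 labels, the mean of y approximates the positive-class fraction.
print(result.y.mean())  # ~0.3
```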
## Notes
- Feature importances use depth-weighted split counts: each split at depth `d` contributes `1/d`, normalized to sum to 1.0 (see the sketch after this list). This reflects structural importance rather than data-driven gain.
- Noise features (not used in any split) receive exactly `0.0` importance.
- The same `random_state` always produces an identical tree structure and dataset.
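The depth-weighting rule above is simple enough to sketch directly. A minimal illustration, assuming depths are 1-indexed at the root (the library's internals may differ):

```python
from collections import defaultdict

def depth_weighted_importances(splits):
    """Compute normalized importances from (feature, depth) split records.

    Each split at depth d contributes 1/d to its feature's score;
    scores are then normalized so they sum to 1.0.
    """
    scores = defaultdict(float)
    for feature, depth in splits:
        scores[feature] += 1.0 / depth
    total = sum(scores.values())
    return {feature: score / total for feature, score in scores.items()}

# Feature 0 splits once at the root (depth 1); feature 2 splits twice at depth 2.
print(depth_weighted_importances([(0, 1), (2, 2), (2, 2)]))
# {0: 0.5, 2: 0.5}
```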