TreeDGP

TreeDGP generates datasets using a decision-tree-style splitting structure. The complexity parameter controls tree depth and the number of splits, producing increasingly complex piecewise-constant relationships between features and the target.
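The kind of piecewise-constant relationship this describes can be illustrated with a minimal standalone sketch (plain Python, not the library's internals — the split thresholds and leaf values here are invented for illustration): a fixed sequence of threshold splits routes each sample to a leaf, and every leaf maps to a single constant target value.

```python
import random

def piecewise_constant_target(x):
    """Toy two-level split tree: each of the four regions maps to a constant."""
    if x[0] < 0.5:
        return -1.0 if x[1] < 0.3 else 0.5
    else:
        return 1.0 if x[1] < 0.7 else 2.0

random.seed(0)
X = [[random.random(), random.random()] for _ in range(500)]
y = [piecewise_constant_target(x) for x in X]

# The target only ever takes the four leaf values, never anything in between.
print(sorted(set(y)))
```

A deeper tree (more splits) yields more regions and therefore a more complex target surface, which is what the complexity parameter is tuning.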

Quick Start

from synthbench import BenchPipeline, TreeDGP

dgp = TreeDGP(complexity="medium", task_type="regression", random_state=0)
pipeline = BenchPipeline(dgp)
result = pipeline.run(n_samples=500, n_features=10, random_state=42)

print(result.X.shape)   # (500, 10)
print(result.y.shape)   # (500,)
print(list(result.metadata.keys()))

# Signal importances sum to 1.0
importances = result.metadata["signal_feature_importances"]
print(sum(importances.values()))  # 1.0

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| complexity | "medium" | Controls tree depth and split count: "low" = shallow, "high" = deep with many splits |
| task_type | "regression" | "regression" for a continuous target, "classification" for binary labels |
| random_state | 0 | Integer seed for reproducibility |
| class_weight | 0.5 | (Classification only) Fraction of samples in the positive class |
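To make the class_weight semantics concrete, here is a small standalone sketch of assigning binary labels so that a given fraction of samples is positive. Whether TreeDGP fixes the exact count or samples each label stochastically is not specified above; this sketch (function name and exact-count interpretation are assumptions) uses the exact-count reading.

```python
import random

def draw_labels(n, positive_fraction, seed=0):
    """Assign exactly round(n * positive_fraction) positive labels, shuffled."""
    n_pos = round(n * positive_fraction)
    labels = [1] * n_pos + [0] * (n - n_pos)
    random.Random(seed).shuffle(labels)
    return labels

y = draw_labels(500, 0.5)
print(sum(y) / len(y))  # 0.5
```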

Notes

  • Feature importances use depth-weighted split counts: each split at depth d contributes 1/d, normalized to sum to 1.0. This reflects structural importance rather than data-driven gain.
  • Noise features (not used in any split) receive exactly 0.0 importance.
  • The same random_state always produces an identical tree structure and dataset.
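The depth-weighted importance rule in the first note can be sketched directly (illustrative only; the representation of splits as (feature, depth) pairs, and taking the root as depth 1, are assumptions):

```python
def depth_weighted_importances(splits, n_features):
    """splits: (feature_index, depth) pairs, with the root split at depth 1.
    Each split adds 1/depth to its feature's raw score; scores are then
    normalized so the importances sum to 1.0."""
    raw = [0.0] * n_features
    for feature, depth in splits:
        raw[feature] += 1.0 / depth
    total = sum(raw)
    return [w / total for w in raw]

# Feature 0 splits once at the root (weight 1/1); feature 2 splits twice at
# depth 2 (weight 1/2 each); features 1 and 3 are never used -> exactly 0.0.
imps = depth_weighted_importances([(0, 1), (2, 2), (2, 2)], n_features=4)
print(imps)  # [0.5, 0.0, 0.5, 0.0]
```

Note that because the weights are purely structural (split positions, not reductions in error), two features can tie in importance even if one explains far more variance in the data.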