TreeDGP

TreeDGP generates datasets using a decision-tree-style splitting structure. The complexity parameter controls tree depth and the number of splits, producing increasingly complex piecewise-constant relationships between features and the target.
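The kind of piecewise-constant relationship this describes can be illustrated with a minimal standalone sketch (plain Python, not the library's internals — the split thresholds and leaf values here are invented for illustration): a fixed sequence of threshold splits routes each sample to a leaf, and every leaf maps to a single constant target value.

```python
import random

def piecewise_constant_target(x):
    """Toy two-level split tree: each of the four regions maps to a constant."""
    if x[0] < 0.5:
        return -1.0 if x[1] < 0.3 else 0.5
    else:
        return 1.0 if x[1] < 0.7 else 2.0

random.seed(0)
X = [[random.random(), random.random()] for _ in range(500)]
y = [piecewise_constant_target(x) for x in X]

# The target only ever takes the four leaf values, never anything in between.
print(sorted(set(y)))
```

A deeper tree (more splits) yields more regions and therefore a more complex target surface, which is what the complexity parameter is tuning.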

Quick Start

from synthbench import BenchPipeline, TreeDGP

dgp = TreeDGP(complexity="medium", task_type="regression", random_state=0)
pipeline = BenchPipeline(dgp)
result = pipeline.run(n_samples=500, n_features=10, random_state=42)

print(result.X.shape)   # (500, 10)
print(result.y.shape)   # (500,)
print(list(result.metadata.keys()))

# Signal importances sum to 1.0
importances = result.metadata["signal_feature_importances"]
print(sum(importances.values()))  # 1.0

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| complexity | "medium" | Controls tree depth and split count: "low" = shallow, "high" = deep with many splits |
| task_type | "regression" | "regression" for a continuous target, "classification" for binary labels |
| random_state | 0 | Integer seed for reproducibility |
| class_weight | 0.5 | (Classification only) Fraction of samples in the positive class |
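To make the class_weight semantics concrete, here is a small standalone sketch of assigning binary labels so that a given fraction of samples is positive. Whether TreeDGP fixes the exact count or samples each label stochastically is not specified above; this sketch (function name and exact-count interpretation are assumptions) uses the exact-count reading.

```python
import random

def draw_labels(n, positive_fraction, seed=0):
    """Assign exactly round(n * positive_fraction) positive labels, shuffled."""
    n_pos = round(n * positive_fraction)
    labels = [1] * n_pos + [0] * (n - n_pos)
    random.Random(seed).shuffle(labels)
    return labels

y = draw_labels(500, 0.5)
print(sum(y) / len(y))  # 0.5
```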

Notes

  • Feature importances use depth-weighted split counts: each split at depth d contributes 1/d, normalized to sum to 1.0. This reflects structural importance rather than data-driven gain.
  • Noise features (not used in any split) receive exactly 0.0 importance.
  • The same random_state always produces an identical tree structure and dataset.
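The depth-weighted importance rule in the first note can be sketched directly (illustrative only; the representation of splits as (feature, depth) pairs, and taking the root as depth 1, are assumptions):

```python
def depth_weighted_importances(splits, n_features):
    """splits: (feature_index, depth) pairs, with the root split at depth 1.
    Each split adds 1/depth to its feature's raw score; scores are then
    normalized so the importances sum to 1.0."""
    raw = [0.0] * n_features
    for feature, depth in splits:
        raw[feature] += 1.0 / depth
    total = sum(raw)
    return [w / total for w in raw]

# Feature 0 splits once at the root (weight 1/1); feature 2 splits twice at
# depth 2 (weight 1/2 each); features 1 and 3 are never used -> exactly 0.0.
imps = depth_weighted_importances([(0, 1), (2, 2), (2, 2)], n_features=4)
print(imps)  # [0.5, 0.0, 0.5, 0.0]
```

Note that because the weights are purely structural (split positions, not reductions in error), two features can tie in importance even if one explains far more variance in the data.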