DGPs API Reference
Bases: ABC
Abstract base class for all data-generating processes.
Concrete subclasses must:
- Declare key="some_key" in their class signature to be auto-registered.
- Implement generate.
Example::

    class LinearDGP(BaseDGP, key="linear"):
        def generate(self, n_samples, n_features, **kwargs):
            ...
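One common way to get registration from a `key=` argument in the class signature is `__init_subclass__`. The sketch below illustrates that pattern only; the names `DGP_REGISTRY`, `SketchBaseDGP`, and `SketchLinearDGP` are hypothetical and the library's actual registry may be organised differently.

```python
from abc import ABC, abstractmethod

# Hypothetical registry: the library's real registry name and location
# are not documented in this reference.
DGP_REGISTRY = {}


class SketchBaseDGP(ABC):
    """Illustrative stand-in for BaseDGP, not the library's implementation."""

    def __init_subclass__(cls, key=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if key is not None:
            # Subclasses declared with key="..." are recorded under that key.
            DGP_REGISTRY[key] = cls

    @abstractmethod
    def generate(self, n_samples, n_features, **kwargs):
        ...

    def get_params(self):
        # Concrete DGPs are expected to override this.
        return {}


class SketchLinearDGP(SketchBaseDGP, key="linear"):
    def __init__(self, random_state=0):
        self.random_state = random_state

    def generate(self, n_samples, n_features, **kwargs):
        raise NotImplementedError("sketch only")

    def get_params(self):
        # Return constructor parameters as a plain dict.
        return {"random_state": self.random_state}
```

With this pattern, `DGP_REGISTRY["linear"]` resolves to the subclass as soon as the class body has been executed, and the abstract base cannot be instantiated directly.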
generate(n_samples, n_features, **kwargs)
abstractmethod
Generate a synthetic dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_samples` | `int` | Number of rows in the output X. | required |
| `n_features` | `int` | Number of columns in the output X. | required |
| `**kwargs` | `object` | DGP-specific parameters (e.g. task_type, random_state, complexity). | `{}` |

Returns:

| Type | Description |
|---|---|
| `BenchResult` | Container with X, y, and rich metadata. |
get_params()
Return the DGP's current configuration as a plain dict.
Concrete DGPs should override this to return their init parameters. Used by BenchPipeline to record component provenance in metadata.
Bases: BaseDGP
Linear data-generating process with controlled sparsity and noise.
Generates a dataset where the target is a linear combination of a subset of features plus Gaussian noise. Complexity controls both the sparsity of the coefficient vector and the noise level:
- 'low': sparse coefficients (≈ p/5 informative), low noise (σ=0.1)
- 'medium': half informative (≈ p/2), moderate noise (σ=0.5)
- 'high': all features informative, high noise (σ=1.0)
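The complexity mapping can be read as a small lookup table driving a sparse linear model. The following is a minimal standard-library sketch under stated assumptions: the function name `sketch_linear`, the rounding of "≈ p/5" and "≈ p/2", the uniform coefficient distribution, and the N(0,1) features are all illustrative choices, not the library's documented behaviour.

```python
import random

# Fractions and noise levels come from the complexity description;
# how "≈ p/5" is rounded is an assumption of this sketch.
COMPLEXITY = {
    "low": {"informative_fraction": 1 / 5, "noise_std": 0.1},
    "medium": {"informative_fraction": 1 / 2, "noise_std": 0.5},
    "high": {"informative_fraction": 1.0, "noise_std": 1.0},
}


def sketch_linear(n_samples, n_features, complexity="medium", random_state=0):
    rng = random.Random(random_state)
    cfg = COMPLEXITY[complexity]
    n_informative = max(1, round(cfg["informative_fraction"] * n_features))

    # Pick which features carry signal; all others keep a zero coefficient.
    informative = sorted(rng.sample(range(n_features), n_informative))
    coef = [0.0] * n_features
    for j in informative:
        coef[j] = rng.uniform(-1.0, 1.0)  # assumed coefficient distribution

    # Features are assumed to be standard normal.
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    # Target: linear combination of the features plus Gaussian noise.
    y = [sum(c * x for c, x in zip(coef, row)) + rng.gauss(0.0, cfg["noise_std"])
         for row in X]
    return X, y, coef
```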
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `complexity` | `str` | One of | `'medium'` |
| `task_type` | `str` | | `'regression'` |
| `random_state` | `int` | Master seed for reproducibility. | `0` |
| `class_weight` | `float` | Prior probability of the positive class (classification only). Must be in (0, 1). Default 0.5 gives balanced classes. | `0.5` |
generate(n_samples, n_features, **kwargs)
Generate a synthetic linear dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_samples` | `int` | Number of rows. | required |
| `n_features` | `int` | Number of columns in X. | required |

Returns:

| Type | Description |
|---|---|
| `BenchResult` | |
get_params()
Return constructor parameters as a plain dict.
Bases: BaseDGP
Polynomial data-generating process with controllable degree.
Generates a dataset where the target is a polynomial expansion of a subset of features plus Gaussian noise. Polynomial terms are built manually (not via sklearn PolynomialFeatures).
Complexity mappings (locked):
- 'low': degree 2, no interactions, n_informative = max(2, p//3)
- 'medium': degree 3, selected cross-terms, n_informative = max(2, p//2)
- 'high': degree 4, full cross-terms, n_informative = max(2, p)
Ground-truth importances: structural equal-weight per raw informative feature (each informative feature gets 1/n_informative). Noise features receive 0.0.
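The `n_informative` rule and the equal-weight importances can be written down in a few lines. This is a sketch: the helper names are hypothetical, and which feature indices end up informative is left to the caller because this reference does not say how they are chosen.

```python
def n_informative_for(complexity, p):
    """Number of informative raw features for p total features."""
    # Divisors follow the complexity mappings: p//3, p//2, and p.
    divisor = {"low": 3, "medium": 2, "high": 1}[complexity]
    return max(2, p // divisor)


def equal_weight_importances(n_features, informative_idx):
    """Each informative feature gets 1/n_informative; noise features get 0.0."""
    weight = 1.0 / len(informative_idx)
    importances = [0.0] * n_features
    for j in informative_idx:
        importances[j] = weight
    return importances
```

For example, with five features of which indices 0 and 3 are informative, the importance vector is `[0.5, 0.0, 0.0, 0.5, 0.0]`.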
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `complexity` | `str` | One of | `'medium'` |
| `task_type` | `str` | | `'regression'` |
| `random_state` | `int` | Master seed for reproducibility. | `0` |
| `class_weight` | `float` | Prior probability of the positive class (classification only). Must be in (0, 1). Default 0.5 gives balanced classes. | `0.5` |
generate(n_samples, n_features, **kwargs)
Generate a synthetic polynomial dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_samples` | `int` | Number of rows. | required |
| `n_features` | `int` | Number of columns in X. | required |

Returns:

| Type | Description |
|---|---|
| `BenchResult` | |
get_params()
Return constructor parameters as a plain dict.
Bases: BaseDGP
Piecewise-constant data-generating process via a hand-built random tree.
The tree is constructed entirely from an RNG — no data is fitted and no sklearn tree is used. Complexity controls tree depth and leaf noise:
- 'low': max_depth=2, leaf_noise_std=0.0 (clean splits)
- 'medium': max_depth=3, leaf_noise_std=0.3 (small noise)
- 'high': max_depth=5, leaf_noise_std=1.0 (noisy leaves)
Ground-truth feature importances are depth-weighted split counts:
importance[f] ∝ Σ (1 / depth) for each split that uses feature f.
Features never used in any split receive exactly 0.0 importance.
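The depth-weighted rule can be illustrated with a short helper. Two details are assumptions of this sketch rather than documented facts: the root split is taken to be at depth 1 (so that 1 / depth is defined), and the proportionality is resolved by normalising the importances to sum to 1. The function name is hypothetical.

```python
def depth_weighted_importances(splits, n_features):
    """Compute importances from (feature_index, depth) pairs.

    Assumes depth starts at 1 for the root split.
    """
    raw = [0.0] * n_features
    for feature, depth in splits:
        # Shallow splits contribute more than deep ones.
        raw[feature] += 1.0 / depth

    total = sum(raw)
    if total == 0.0:
        return raw  # no splits at all: every importance stays 0.0
    # Assumed normalisation so that the importances sum to 1.
    return [value / total for value in raw]
```

A tree that splits on feature 0 at depths 1 and 2 and on feature 2 at depth 2 yields raw scores 1.5 and 0.5, that is, importances 0.75 and 0.25, while unused features stay at exactly 0.0.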
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `complexity` | `str` | One of | `'medium'` |
| `task_type` | `str` | | `'regression'` |
| `random_state` | `int` | Master seed for reproducibility. | `0` |
| `class_weight` | `float` | Prior probability of the positive class (classification only). Must be in (0, 1). Default 0.5 gives balanced classes. | `0.5` |
generate(n_samples, n_features, **kwargs)
Generate a piecewise-constant synthetic dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_samples` | `int` | Number of rows. | required |
| `n_features` | `int` | Number of columns in X. | required |

Returns:

| Type | Description |
|---|---|
| `BenchResult` | |
get_params()
Return constructor parameters as a plain dict.
Bases: BaseDGP
Classic Friedman benchmark data-generating process (functions 1, 2, 3).
Implements the three Friedman benchmark functions widely used in statistics and machine learning research. Extra features beyond the formula inputs are N(0,1) noise padding with exactly 0.0 importance.
Function formulas
1 (needs ≥ 5 features, x_i ~ U[0,1]):
y = 10*sin(pi*x0*x1) + 20*(x2-0.5)^2 + 10*x3 + 5*x4 + noise
2 (needs ≥ 4 features, specific ranges):
x0 ~ U[0,100], x1 ~ U[40*pi, 560*pi], x2 ~ U[0,1], x3 ~ U[1,11]
y = sqrt(x0^2 + (x1*x2 - 1/(x1*x3))^2) + noise
3 (needs ≥ 4 features, same ranges as #2):
y = arctan((x1*x2 - 1/(x1*x3)) / x0) + noise
Complexity controls the additive Gaussian noise level:
- 'low': noise_std = 0.0 (deterministic signal)
- 'medium': noise_std = 1.0
- 'high': noise_std = 3.0
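As a concrete reading of function 1, the sketch below evaluates the formula on U[0,1] inputs, pads the remaining columns with N(0,1) noise, and adds Gaussian noise according to the complexity level. The names `friedman1` and `sketch_friedman1` are hypothetical, and the exact wording of the error message is an assumption.

```python
import math
import random

# Noise levels as listed for the three complexity settings.
NOISE_STD = {"low": 0.0, "medium": 1.0, "high": 3.0}


def friedman1(row):
    """Noise-free Friedman function 1 on the first five entries of row."""
    x0, x1, x2, x3, x4 = row[:5]
    return (10 * math.sin(math.pi * x0 * x1)
            + 20 * (x2 - 0.5) ** 2
            + 10 * x3
            + 5 * x4)


def sketch_friedman1(n_samples, n_features, complexity="medium", random_state=0):
    if n_features < 5:
        raise ValueError("Friedman function 1 needs at least 5 features")
    rng = random.Random(random_state)
    X, y = [], []
    for _ in range(n_samples):
        signal = [rng.random() for _ in range(5)]  # x_i ~ U[0, 1)
        padding = [rng.gauss(0.0, 1.0) for _ in range(n_features - 5)]
        row = signal + padding
        X.append(row)
        y.append(friedman1(row) + rng.gauss(0.0, NOISE_STD[complexity]))
    return X, y
```

With `complexity="low"` the noise standard deviation is 0.0, so each target equals the formula value for its row.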
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `function` | `int` | Friedman function index: 1, 2, or 3. | `1` |
| `complexity` | `str` | One of | `'medium'` |
| `task_type` | `str` | | `'regression'` |
| `random_state` | `int` | Master seed for reproducibility. | `0` |
| `class_weight` | `float` | Prior probability of the positive class (classification only). Must be in (0, 1). Default 0.5 gives balanced classes. | `0.5` |
generate(n_samples, n_features, **kwargs)
Generate a Friedman benchmark dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_samples` | `int` | Number of rows. | required |
| `n_features` | `int` | Number of columns in X. Must be ≥ the minimum required by the chosen function (5 for function 1, 4 for functions 2 and 3). | required |

Returns:

| Type | Description |
|---|---|
| `BenchResult` | |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If |
get_params()
Return constructor parameters as a plain dict.
Bases: BaseDGP
GAM-style additive data-generating process.
Generates a dataset where the target is a weighted sum of univariate nonlinear functions applied to individual features:
y = sum_i(w_i * f_i(X[:,i])) + noise
Ground-truth importances are variance-based: Var(w_i * f_i(X[:,i])) normalised over all informative components. Noise features get 0.0.
Complexity mappings (locked):
- 'low': smooth functions (`sin(pi*x)`, `sqrt(|x|)`, `x^2`, `x`)
- 'medium': mix of smooth + moderate functions
- 'high': wiggly functions (`sin(4*pi*x)`, `sqrt|x|*sin(pi*x)`, `sign`)
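The variance-based importances can be sketched as follows. This version estimates Var(w_i * f_i(X[:,i])) empirically from the generated sample; whether the library uses an empirical or analytic variance is not stated here, so treat that as an assumption. The function name and the `components` structure are hypothetical.

```python
import statistics


def variance_importances(X, components, n_features):
    """Compute normalised per-feature variance of each additive component.

    components maps a feature index to a (weight, function) pair.
    Features without a component are noise and keep 0.0.
    """
    raw = [0.0] * n_features
    for j, (weight, func) in components.items():
        contribution = [weight * func(row[j]) for row in X]
        # Population variance of this component over the sample.
        raw[j] = statistics.pvariance(contribution)

    total = sum(raw)
    if total == 0.0:
        return raw
    return [value / total for value in raw]
```

For instance, two identity components with weights 1 and 2 on features that take the same values contribute variances in the ratio 1:4, giving importances 0.2 and 0.8.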
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `complexity` | `str` | One of | `'medium'` |
| `task_type` | `str` | | `'regression'` |
| `random_state` | `int` | Master seed for reproducibility. | `0` |
| `class_weight` | `float` | Prior probability of the positive class (classification only). Must be in (0, 1). Default 0.5 gives balanced classes. | `0.5` |
generate(n_samples, n_features, **kwargs)
Generate a synthetic additive (GAM-style) dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_samples` | `int` | Number of rows. | required |
| `n_features` | `int` | Number of columns in X. | required |

Returns:

| Type | Description |
|---|---|
| `BenchResult` | |
get_params()
Return constructor parameters as a plain dict.
Bases: BaseDGP
K-sparse linear data-generating process.
Exactly k features contribute to the target; the remaining n_features - k features are pure noise. Complexity controls signal strength (coefficient magnitude), not sparsity.
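A minimal sketch of the k-sparse idea follows. Everything beyond "exactly k non-zero coefficients" is an assumption: the `signal_strength` argument stands in for the undocumented complexity-to-magnitude mapping, the ±1 sign pattern and unit noise are illustrative, and the check that k does not exceed `n_features` is inferred from the "n_features - k" wording.

```python
import random


def sketch_k_sparse(n_samples, n_features, k=5, signal_strength=1.0,
                    random_state=0):
    # Assumed validation: k informative features must fit in n_features.
    if k > n_features:
        raise ValueError("k must not exceed n_features")
    rng = random.Random(random_state)

    # Exactly k features receive a non-zero coefficient.
    informative = sorted(rng.sample(range(n_features), k))
    coef = [0.0] * n_features
    for j in informative:
        # Magnitude is set by signal_strength; only the sign is random.
        coef[j] = signal_strength * rng.choice([-1.0, 1.0])

    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [sum(c * x for c, x in zip(coef, row)) + rng.gauss(0.0, 1.0)
         for row in X]
    return X, y, coef
```

Changing `signal_strength` rescales the non-zero coefficients without altering how many there are, which mirrors the statement that complexity affects magnitude rather than sparsity.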
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `k` | `int` | Number of informative (signal) features. Must be ≤ | `5` |
| `complexity` | `str` | One of | `'medium'` |
| `task_type` | `str` | | `'regression'` |
| `random_state` | `int` | Master seed for reproducibility. | `0` |
| `class_weight` | `float` | Prior probability of the positive class (classification only). Must be in (0, 1). Default 0.5 gives balanced classes. | `0.5` |
generate(n_samples, n_features, **kwargs)
Generate a k-sparse synthetic dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_samples` | `int` | Number of rows. | required |
| `n_features` | `int` | Number of columns in X. Must be ≥ | required |

Returns:

| Type | Description |
|---|---|
| `BenchResult` | |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If |
get_params()
Return constructor parameters as a plain dict.
Bases: BaseDGP
Geometric shapes classification data-generating process.
Generates binary classification datasets with structured geometric decision boundaries. Three shape families are supported:
- 'moons': two interleaving half-circles
- 'circles': concentric inner/outer circles
- 'spirals': two Archimedean spirals rotated by pi
This DGP is classification only. Calling generate with task_type='regression' raises ValueError.
The first two features encode the geometric structure; any additional features are pure N(0,1) noise padding. Signal importances are always feature_0: 0.5, feature_1: 0.5; noise features receive 0.0.
Complexity controls the additive Gaussian noise applied to geometric coordinates:
- 'low': noise_std=0.05 (clean boundary)
- 'medium': noise_std=0.15
- 'high': noise_std=0.30 (blurry boundary)
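To make the 'spirals' family concrete, the sketch below draws two Archimedean spirals (radius linear in angle), the second being the first rotated by pi, and perturbs the coordinates with the complexity-dependent noise. The angular range, the radius scaling, and the function name are assumptions; noise-padding columns beyond the first two features are omitted for brevity.

```python
import math
import random

# Noise levels as listed for the three complexity settings.
NOISE_STD = {"low": 0.05, "medium": 0.15, "high": 0.30}


def sketch_spirals(n_samples, complexity="medium", class_weight=0.5,
                   random_state=0):
    rng = random.Random(random_state)
    noise = NOISE_STD[complexity]
    # class_weight is read as the approximate share of class 0.
    n_class0 = round(n_samples * class_weight)
    max_theta = 3 * math.pi  # assumed angular extent of each spiral

    X, y = [], []
    for i in range(n_samples):
        label = 0 if i < n_class0 else 1
        theta = rng.uniform(0.0, max_theta)
        radius = theta / max_theta  # Archimedean: radius grows linearly
        # Class 1 follows the same spiral rotated by pi.
        angle = theta + (math.pi if label == 1 else 0.0)
        X.append([radius * math.cos(angle) + rng.gauss(0.0, noise),
                  radius * math.sin(angle) + rng.gauss(0.0, noise)])
        y.append(label)
    return X, y
```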
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `shape` | `str` | One of | `'moons'` |
| `complexity` | `str` | One of | `'medium'` |
| `task_type` | `str` | Must be | `'classification'` |
| `random_state` | `int` | Master seed for reproducibility. | `0` |
| `class_weight` | `float` | Proportion of samples assigned to class 0 (approximately). Must be in (0, 1). Default 0.5 gives balanced classes. | `0.5` |
generate(n_samples, n_features, **kwargs)
Generate a geometric shapes classification dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_samples` | `int` | Number of rows. | required |
| `n_features` | `int` | Number of columns in X. Must be >= 2. | required |

Returns:

| Type | Description |
|---|---|
| `BenchResult` | |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If |
get_params()
Return constructor parameters as a plain dict.
Bases: BaseDGP
Neural network data-generating process using a randomly-initialized MLP.
Generates synthetic datasets by passing Gaussian features through a small, randomly-initialized multilayer perceptron. Ground-truth feature importances are derived from the L2 column norms of the first hidden layer weight matrix, reflecting which input features have the strongest total influence on the network output.
The network is never trained: the random architecture itself defines the data-generating process. This makes the DGP fully reproducible from random_state and independent of any optimization procedure.
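The idea of reading importances off an untrained network can be sketched with the standard library alone. Several choices here are assumptions, not documented behaviour: weights are stored as `W[h][j]` (hidden unit h, input j) so that columns index input features, there are no bias terms, the initialisation scale is 1/sqrt(fan-in), and the column norms are normalised to sum to 1. All function names are hypothetical.

```python
import math
import random


def init_layer(rng, n_in, n_out):
    # W[h][j] is the weight from input j to unit h (assumed layout).
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]


def sketch_network(n_features, n_hidden_layers=2, hidden_size=32,
                   random_state=0):
    """Randomly initialised MLP; it is never trained."""
    rng = random.Random(random_state)
    sizes = [n_features] + [hidden_size] * n_hidden_layers
    layers = [init_layer(rng, sizes[i], sizes[i + 1])
              for i in range(n_hidden_layers)]
    w_out = [rng.gauss(0.0, 1.0 / math.sqrt(hidden_size))
             for _ in range(hidden_size)]
    return layers, w_out


def forward(layers, w_out, x):
    hidden = x
    for W in layers:
        # tanh-activated hidden layers, as in the parameter description.
        hidden = [math.tanh(sum(w * v for w, v in zip(row, hidden)))
                  for row in W]
    return sum(w * v for w, v in zip(w_out, hidden))


def first_layer_importances(W1):
    """L2 norm of each input feature's column in W1, normalised to sum to 1."""
    n_features = len(W1[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in W1))
             for j in range(n_features)]
    total = sum(norms)
    return [value / total for value in norms]
```

Because every weight is drawn from a generator seeded with `random_state`, rebuilding the network with the same seed reproduces the same weights and therefore the same importances.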
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_hidden_layers` | `int` | Number of hidden (tanh-activated) layers. Default 2. | `2` |
| `hidden_size` | `int` | Width of each hidden layer. Default 32. | `32` |
| `task_type` | `str` | | `'regression'` |
| `random_state` | `int` | Master seed for reproducibility. Controls both feature generation and network weight initialization. | `0` |
| `class_weight` | `float` | Prior probability of the positive class (classification only). Must be in (0, 1). Default 0.5 gives balanced classes. | `0.5` |
generate(n_samples, n_features, **kwargs)
Generate a synthetic neural dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_samples` | `int` | Number of rows in the output X. | required |
| `n_features` | `int` | Number of columns in the output X. | required |

Returns:

| Type | Description |
|---|---|
| `BenchResult` | |
get_params()
Return constructor parameters as a plain dict.