DGPs API Reference

Bases: ABC

Abstract base class for all data-generating processes.

Concrete subclasses must:

  • Declare key="some_key" in their class signature to be auto-registered.
  • Implement generate.

Example::

class LinearDGP(BaseDGP, key="linear"):
    def generate(self, n_samples, n_features, **kwargs):
        ...

generate(n_samples, n_features, **kwargs) abstractmethod

Generate a synthetic dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows in the output X. | required |
| `n_features` | int | Number of columns in the output X. | required |
| `**kwargs` | object | DGP-specific parameters (e.g. task_type, random_state, complexity). | {} |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | Container with X, y, and rich metadata. |

get_params()

Return the DGP's current configuration as a plain dict.

Concrete DGPs should override this to return their init parameters. Used by BenchPipeline to record component provenance in metadata.
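
The registration and get_params contract can be sketched in a few lines. Everything below is illustrative: BaseDGPSketch and LinearSketch are hypothetical stand-ins, and the real BaseDGP's registry bookkeeping may differ.

```python
from abc import ABC, abstractmethod

class BaseDGPSketch(ABC):
    """Minimal stand-in for BaseDGP's key-based auto-registration."""

    _registry = {}

    def __init_subclass__(cls, key=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if key is not None:
            cls._registry[key] = cls  # auto-register under the declared key

    @abstractmethod
    def generate(self, n_samples, n_features, **kwargs): ...

    def get_params(self):
        # Concrete DGPs override this with their constructor parameters.
        return {}

class LinearSketch(BaseDGPSketch, key="linear"):
    def __init__(self, complexity="medium", random_state=0):
        self.complexity = complexity
        self.random_state = random_state

    def generate(self, n_samples, n_features, **kwargs):
        raise NotImplementedError  # body elided in this sketch

    def get_params(self):
        return {"complexity": self.complexity, "random_state": self.random_state}
```

A pipeline can then look up a DGP class by key (`BaseDGPSketch._registry["linear"]`) and record the `get_params()` output as provenance.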

Bases: BaseDGP

Linear data-generating process with controlled sparsity and noise.

Generates a dataset where the target is a linear combination of a subset of features plus Gaussian noise. Complexity controls both the sparsity of the coefficient vector and the noise level:

  • 'low': sparse coefficients (≈ p/5 informative), low noise (σ=0.1)
  • 'medium': half informative (≈ p/2), moderate noise (σ=0.5)
  • 'high': all features informative, high noise (σ=1.0)
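
The recipe above can be sketched independently of the library. This is a hypothetical NumPy illustration (the function name, coefficient scale, choice of informative columns, and |coef|-based importance rule are assumptions, not LinearDGP's actual code):

```python
import numpy as np

def make_linear_data(n_samples, n_features, n_informative, noise_std, seed=0):
    """Sparse linear target plus Gaussian noise, as described above."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    coef = np.zeros(n_features)
    coef[:n_informative] = rng.standard_normal(n_informative)  # sparse signal
    y = X @ coef + noise_std * rng.standard_normal(n_samples)
    # One plausible ground-truth rule: |coef| normalised to sum to 1.0;
    # noise features keep exactly 0.0 importance.
    importances = np.abs(coef) / np.abs(coef).sum()
    return X, y, importances
```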

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `complexity` | str | One of 'low', 'medium', 'high'. | 'medium' |
| `task_type` | str | 'regression' or 'classification'. | 'regression' |
| `random_state` | int | Master seed for reproducibility. | 0 |
| `class_weight` | float | Prior probability of the positive class (classification only). Must be in (0, 1). 0.5 gives balanced classes. | 0.5 |

generate(n_samples, n_features, **kwargs)

Generate a synthetic linear dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows. | required |
| `n_features` | int | Number of columns in X. | required |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | X shape (n_samples, n_features), y shape (n_samples,), metadata with signal_feature_importances summing to 1.0. |

get_params()

Return constructor parameters as a plain dict.

Bases: BaseDGP

Polynomial data-generating process with controllable degree.

Generates a dataset where the target is a polynomial expansion of a subset of features plus Gaussian noise. Polynomial terms are built manually (not via sklearn PolynomialFeatures).

Complexity mappings (locked):

  • 'low': degree 2, no interactions, n_informative = max(2, p//3)
  • 'medium': degree 3, selected cross-terms, n_informative = max(2, p//2)
  • 'high': degree 4, full cross-terms, n_informative = max(2, p)

Ground-truth importances: structural equal-weight per raw informative feature (each informative feature gets 1/n_informative). Noise features receive 0.0.
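
The two rules above (the locked n_informative mapping and the structural equal-weight importances) are simple enough to state as code. This is an illustrative sketch with assumed helper names, not the DGP's internals:

```python
import numpy as np

def poly_n_informative(complexity, n_features):
    # Locked complexity -> n_informative mapping from the list above.
    return {
        "low": max(2, n_features // 3),
        "medium": max(2, n_features // 2),
        "high": max(2, n_features),
    }[complexity]

def equal_weight_importances(n_features, n_informative):
    # Structural equal weight: each raw informative feature gets
    # 1/n_informative; noise features get exactly 0.0.
    imp = np.zeros(n_features)
    imp[:n_informative] = 1.0 / n_informative
    return imp
```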

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `complexity` | str | One of 'low', 'medium', 'high'. | 'medium' |
| `task_type` | str | 'regression' or 'classification'. | 'regression' |
| `random_state` | int | Master seed for reproducibility. | 0 |
| `class_weight` | float | Prior probability of the positive class (classification only). Must be in (0, 1). 0.5 gives balanced classes. | 0.5 |

generate(n_samples, n_features, **kwargs)

Generate a synthetic polynomial dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows. | required |
| `n_features` | int | Number of columns in X. | required |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | X shape (n_samples, n_features), y shape (n_samples,), metadata with signal_feature_importances summing to 1.0. |

get_params()

Return constructor parameters as a plain dict.

Bases: BaseDGP

Piecewise-constant data-generating process via a hand-built random tree.

The tree is constructed entirely from an RNG — no data is fitted and no sklearn tree is used. Complexity controls tree depth and leaf noise:

  • 'low': max_depth=2, leaf_noise_std=0.0 (clean splits)
  • 'medium': max_depth=3, leaf_noise_std=0.3 (small noise)
  • 'high': max_depth=5, leaf_noise_std=1.0 (noisy leaves)

Ground-truth feature importances are depth-weighted split counts: importance[f] ∝ Σ (1 / depth) for each split that uses feature f. Features never used in any split receive exactly 0.0 importance.
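
The depth-weighted rule can be sketched as follows. The helper and its (feature_index, depth) split representation are assumptions for illustration, not the DGP's actual data structures:

```python
import numpy as np

def depth_weighted_importances(splits, n_features):
    """splits: (feature_index, depth) pairs, one per internal node.

    importance[f] is proportional to the sum of 1/depth over splits on f;
    features never used in any split get exactly 0.0.
    """
    imp = np.zeros(n_features)
    for f, depth in splits:
        imp[f] += 1.0 / depth
    total = imp.sum()
    return imp / total if total > 0 else imp
```

For example, a root split on feature 0 (depth 1) and two depth-2 splits on feature 1 give raw scores of 1.0 each, hence importances 0.5 and 0.5.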

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `complexity` | str | One of 'low', 'medium', 'high'. | 'medium' |
| `task_type` | str | 'regression' or 'classification'. | 'regression' |
| `random_state` | int | Master seed for reproducibility. | 0 |
| `class_weight` | float | Prior probability of the positive class (classification only). Must be in (0, 1). 0.5 gives balanced classes. | 0.5 |

generate(n_samples, n_features, **kwargs)

Generate a piecewise-constant synthetic dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows. | required |
| `n_features` | int | Number of columns in X. | required |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | X shape (n_samples, n_features), y shape (n_samples,), metadata with signal_feature_importances summing to 1.0. |

get_params()

Return constructor parameters as a plain dict.

Bases: BaseDGP

Classic Friedman benchmark data-generating process (functions 1, 2, 3).

Implements the three Friedman benchmark functions widely used in statistics and machine learning research. Extra features beyond the formula inputs are N(0,1) noise padding with exactly 0.0 importance.

Function formulas

1 (needs ≥ 5 features, x_i ~ U[0,1]):

y = 10*sin(pi*x0*x1) + 20*(x2-0.5)^2 + 10*x3 + 5*x4 + noise

2 (needs ≥ 4 features, specific ranges):

x0 ~ U[0,100], x1 ~ U[40*pi, 560*pi], x2 ~ U[0,1], x3 ~ U[1,11]
y = sqrt(x0^2 + (x1*x2 - 1/(x1*x3))^2) + noise

3 (needs ≥ 4 features, same ranges as #2):

y = arctan((x1*x2 - 1/(x1*x3)) / x0) + noise

Complexity controls the additive Gaussian noise level:

  • 'low': noise_std = 0.0 (deterministic signal)
  • 'medium': noise_std = 1.0
  • 'high': noise_std = 3.0
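
Friedman function 1, including the noise-padding behaviour described above, can be written directly from the formulas. The function name and exact padding logic here are illustrative:

```python
import numpy as np

def friedman1(n_samples, n_features=5, noise_std=1.0, seed=0):
    """Friedman #1: y = 10*sin(pi*x0*x1) + 20*(x2-0.5)^2 + 10*x3 + 5*x4 + noise."""
    if n_features < 5:
        raise ValueError("Friedman function 1 needs at least 5 features")
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, n_features))
    if n_features > 5:
        # Columns beyond the formula inputs are N(0,1) noise padding.
        X[:, 5:] = rng.standard_normal((n_samples, n_features - 5))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3]
         + 5 * X[:, 4]
         + noise_std * rng.standard_normal(n_samples))
    return X, y
```

With noise_std = 0.0 ('low' complexity) the signal is deterministic and bounded in [0, 30].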

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `function` | int | Friedman function index: 1, 2, or 3. | 1 |
| `complexity` | str | One of 'low', 'medium', 'high'. | 'medium' |
| `task_type` | str | 'regression' or 'classification'. | 'regression' |
| `random_state` | int | Master seed for reproducibility. | 0 |
| `class_weight` | float | Prior probability of the positive class (classification only). Must be in (0, 1). 0.5 gives balanced classes. | 0.5 |

generate(n_samples, n_features, **kwargs)

Generate a Friedman benchmark dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows. | required |
| `n_features` | int | Number of columns in X. Must be ≥ the minimum required by the chosen function (5 for function 1, 4 for functions 2 and 3). | required |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | X shape (n_samples, n_features), y shape (n_samples,), metadata with signal_feature_importances summing to 1.0. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If n_features is less than the minimum required by the function. |

get_params()

Return constructor parameters as a plain dict.

Bases: BaseDGP

GAM-style additive data-generating process.

Generates a dataset where the target is a weighted sum of univariate nonlinear functions applied to individual features:

y = sum_i(w_i * f_i(X[:,i])) + noise

Ground-truth importances are variance-based: Var(w_i * f_i(X[:,i])) normalised over all informative components. Noise features get 0.0.

Complexity mappings (locked):

  • 'low': smooth functions (sin(pi*x), sqrt(|x|), x^2, x)
  • 'medium': mix of smooth + moderate functions
  • 'high': wiggly functions (sin(4*pi*x), sqrt(|x|)*sin(pi*x), sign(x))
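
The variance-based importance rule above can be sketched as an illustrative helper (assumed inputs, not the DGP's code): funcs holds the univariate f_i and weights the w_i for the informative columns.

```python
import numpy as np

def variance_importances(X, funcs, weights):
    # Var(w_i * f_i(X[:, i])) per informative component, normalised over
    # all components so the importances sum to 1.0.
    contrib = np.array([
        np.var(w * f(X[:, i]))
        for i, (f, w) in enumerate(zip(funcs, weights))
    ])
    return contrib / contrib.sum()
```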

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `complexity` | str | One of 'low', 'medium', 'high'. | 'medium' |
| `task_type` | str | 'regression' or 'classification'. | 'regression' |
| `random_state` | int | Master seed for reproducibility. | 0 |
| `class_weight` | float | Prior probability of the positive class (classification only). Must be in (0, 1). 0.5 gives balanced classes. | 0.5 |

generate(n_samples, n_features, **kwargs)

Generate a synthetic additive (GAM-style) dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows. | required |
| `n_features` | int | Number of columns in X. | required |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | X shape (n_samples, n_features), y shape (n_samples,), metadata with variance-based signal_feature_importances summing to 1.0. |

get_params()

Return constructor parameters as a plain dict.

Bases: BaseDGP

K-sparse linear data-generating process.

Exactly k features contribute to the target; the remaining n_features - k features are pure noise. Complexity controls signal strength (coefficient magnitude), not sparsity.
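
A NumPy sketch of the k-sparse construction follows (hypothetical names and scales; the real DGP's coefficient distribution and noise model may differ):

```python
import numpy as np

def make_k_sparse_data(n_samples, n_features, k, coef_scale=1.0, seed=0):
    """Exactly k informative features; the rest are pure noise."""
    if k > n_features:
        raise ValueError("k must be <= n_features")
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    coef = np.zeros(n_features)
    signal_idx = rng.choice(n_features, size=k, replace=False)
    # complexity would control coef_scale here (strong vs weak signal)
    coef[signal_idx] = coef_scale * rng.standard_normal(k)
    y = X @ coef + rng.standard_normal(n_samples)
    importances = np.abs(coef) / np.abs(coef).sum()  # exactly k non-zeros
    return X, y, importances
```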

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `k` | int | Number of informative (signal) features. Must be ≤ n_features when generate is called. | 5 |
| `complexity` | str | One of 'low', 'medium', 'high'. Controls the scale of the true coefficients (low = large scale = strong signal). | 'medium' |
| `task_type` | str | 'regression' or 'classification'. | 'regression' |
| `random_state` | int | Master seed for reproducibility. | 0 |
| `class_weight` | float | Prior probability of the positive class (classification only). Must be in (0, 1). 0.5 gives balanced classes. | 0.5 |

generate(n_samples, n_features, **kwargs)

Generate a k-sparse synthetic dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows. | required |
| `n_features` | int | Number of columns in X. Must be ≥ self.k. | required |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | X shape (n_samples, n_features), y shape (n_samples,), metadata with exactly k non-zero importances summing to 1.0. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If self.k > n_features. |

get_params()

Return constructor parameters as a plain dict.

Bases: BaseDGP

Geometric shapes classification data-generating process.

Generates binary classification datasets with structured geometric decision boundaries. Three shape families are supported:

  • 'moons': two interleaving half-circles
  • 'circles': concentric inner/outer circles
  • 'spirals': two Archimedean spirals rotated by pi

This DGP is classification only. Calling generate with task_type='regression' raises ValueError.

The first two features encode the geometric structure; any additional features are pure N(0,1) noise padding. Signal importances are always feature_0: 0.5, feature_1: 0.5; noise features receive 0.0.

Complexity controls the additive Gaussian noise applied to geometric coordinates:

  • 'low': noise_std=0.05 (clean boundary)
  • 'medium': noise_std=0.15
  • 'high': noise_std=0.30 (blurry boundary)
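
The 'moons' geometry, with coordinate noise as described above, can be sketched as follows (an illustrative construction, not the DGP's code):

```python
import numpy as np

def make_moons_xy(n_samples, noise_std=0.15, seed=0):
    """Two interleaving half-circles plus Gaussian coordinate noise."""
    rng = np.random.default_rng(seed)
    n0 = n_samples // 2
    n1 = n_samples - n0
    t0 = rng.uniform(0.0, np.pi, n0)  # angles along the upper half-circle
    t1 = rng.uniform(0.0, np.pi, n1)  # angles along the shifted lower one
    X = np.vstack([
        np.column_stack([np.cos(t0), np.sin(t0)]),
        np.column_stack([1.0 - np.cos(t1), 0.5 - np.sin(t1)]),
    ])
    X += noise_std * rng.standard_normal(X.shape)  # complexity sets noise_std
    y = np.concatenate([np.zeros(n0), np.ones(n1)])
    return X, y
```

Extra columns beyond these two would be N(0,1) padding with 0.0 importance, matching the description above.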

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `shape` | str | One of 'moons', 'circles', 'spirals'. | 'moons' |
| `complexity` | str | One of 'low', 'medium', 'high'. | 'medium' |
| `task_type` | str | Must be 'classification'; any other value causes generate to raise ValueError. | 'classification' |
| `random_state` | int | Master seed for reproducibility. | 0 |
| `class_weight` | float | Proportion of samples assigned to class 0 (approximately). Must be in (0, 1). 0.5 gives balanced classes. | 0.5 |

generate(n_samples, n_features, **kwargs)

Generate a geometric shapes classification dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows. | required |
| `n_features` | int | Number of columns in X. Must be ≥ 2. | required |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | X shape (n_samples, n_features), y in {0.0, 1.0}, metadata with signal_feature_importances summing to 1.0. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If task_type != 'classification' or n_features < 2. |

get_params()

Return constructor parameters as a plain dict.

Bases: BaseDGP

Neural network data-generating process using a randomly-initialized MLP.

Generates synthetic datasets by passing Gaussian features through a small, randomly-initialized multilayer perceptron. Ground-truth feature importances are derived from the L2 column norms of the first hidden layer weight matrix, reflecting which input features have the strongest total influence on the network output.

The network is never trained — the random architecture itself defines the data-generating process. This makes the DGP fully reproducible from random_state and independent of any optimization procedure.
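
The idea can be sketched in NumPy (the actual DGP uses PyTorch; the function name and weight scaling here are assumptions):

```python
import numpy as np

def random_mlp_data(n_samples, n_features, n_hidden_layers=2, hidden_size=32, seed=0):
    """Push Gaussian features through an untrained tanh MLP."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    sizes = [n_features] + [hidden_size] * n_hidden_layers + [1]
    weights = [rng.standard_normal((a, b)) / np.sqrt(a)
               for a, b in zip(sizes[:-1], sizes[1:])]
    h = X
    for W in weights[:-1]:
        h = np.tanh(h @ W)          # tanh-activated hidden layers
    y = (h @ weights[-1]).ravel()   # linear output layer
    # Importances: L2 norm of each input feature's outgoing weights in the
    # first layer (rows here, since matrices are stored (in, out)),
    # normalised to sum to 1.0.
    norms = np.linalg.norm(weights[0], axis=1)
    importances = norms / norms.sum()
    return X, y, importances
```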

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_hidden_layers` | int | Number of hidden (tanh-activated) layers. | 2 |
| `hidden_size` | int | Width of each hidden layer. | 32 |
| `task_type` | str | 'regression' or 'classification'. | 'regression' |
| `random_state` | int | Master seed for reproducibility. Controls both feature generation and network weight initialization. | 0 |
| `class_weight` | float | Prior probability of the positive class (classification only). Must be in (0, 1). 0.5 gives balanced classes. | 0.5 |

generate(n_samples, n_features, **kwargs)

Generate a synthetic neural dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows in the output X. | required |
| `n_features` | int | Number of columns in the output X. | required |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | X shape (n_samples, n_features), y shape (n_samples,), metadata with signal_feature_importances summing to 1.0 and a key recording the installed PyTorch version. |

get_params()

Return constructor parameters as a plain dict.