DGPs API Reference

Bases: ABC

Abstract base class for all data-generating processes.

Concrete subclasses must:

  • Declare key="some_key" in their class signature to be auto-registered.
  • Implement generate.

Example::

class LinearDGP(BaseDGP, key="linear"):
    def generate(self, n_samples, n_features, **kwargs):
        ...

generate(n_samples, n_features, **kwargs) abstractmethod

Generate a synthetic dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows in the output X. | required |
| `n_features` | int | Number of columns in the output X. | required |
| `**kwargs` | object | DGP-specific parameters (e.g. task_type, random_state, complexity). | {} |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | Container with X, y, and rich metadata. |

get_params()

Return the DGP's current configuration as a plain dict.

Concrete DGPs should override this to return their init parameters. Used by BenchPipeline to record component provenance in metadata.
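
The registration and get_params contract can be sketched in a few lines. Everything below is illustrative: BaseDGPSketch and LinearSketch are hypothetical stand-ins, and the real BaseDGP's registry bookkeeping may differ.

```python
from abc import ABC, abstractmethod

class BaseDGPSketch(ABC):
    """Minimal stand-in for BaseDGP's key-based auto-registration."""

    _registry = {}

    def __init_subclass__(cls, key=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if key is not None:
            cls._registry[key] = cls  # auto-register under the declared key

    @abstractmethod
    def generate(self, n_samples, n_features, **kwargs): ...

    def get_params(self):
        # Concrete DGPs override this with their constructor parameters.
        return {}

class LinearSketch(BaseDGPSketch, key="linear"):
    def __init__(self, complexity="medium", random_state=0):
        self.complexity = complexity
        self.random_state = random_state

    def generate(self, n_samples, n_features, **kwargs):
        raise NotImplementedError  # body elided in this sketch

    def get_params(self):
        return {"complexity": self.complexity, "random_state": self.random_state}
```

A pipeline can then look up a DGP class by key (`BaseDGPSketch._registry["linear"]`) and record the `get_params()` output as provenance.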

Bases: BaseDGP

Linear data-generating process with controlled sparsity and noise.

Generates a dataset where the target is a linear combination of a subset of features plus Gaussian noise. Complexity controls both the sparsity of the coefficient vector and the noise level:

  • 'low': sparse coefficients (≈ p/5 informative), low noise (σ=0.1)
  • 'medium': half informative (≈ p/2), moderate noise (σ=0.5)
  • 'high': all features informative, high noise (σ=1.0)
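
The recipe above can be sketched independently of the library. This is a hypothetical NumPy illustration (the function name, coefficient scale, choice of informative columns, and |coef|-based importance rule are assumptions, not LinearDGP's actual code):

```python
import numpy as np

def make_linear_data(n_samples, n_features, n_informative, noise_std, seed=0):
    """Sparse linear target plus Gaussian noise, as described above."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    coef = np.zeros(n_features)
    coef[:n_informative] = rng.standard_normal(n_informative)  # sparse signal
    y = X @ coef + noise_std * rng.standard_normal(n_samples)
    # One plausible ground-truth rule: |coef| normalised to sum to 1.0;
    # noise features keep exactly 0.0 importance.
    importances = np.abs(coef) / np.abs(coef).sum()
    return X, y, importances
```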

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `complexity` | str | One of 'low', 'medium', 'high'. | 'medium' |
| `task_type` | str | 'regression' or 'classification'. | 'regression' |
| `random_state` | int | Master seed for reproducibility. | 0 |
| `class_weight` | float | Prior probability of the positive class (classification only). Must be in (0, 1). 0.5 gives balanced classes. | 0.5 |

generate(n_samples, n_features, **kwargs)

Generate a synthetic linear dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows. | required |
| `n_features` | int | Number of columns in X. | required |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | X shape (n_samples, n_features), y shape (n_samples,), metadata with signal_feature_importances summing to 1.0. |

get_params()

Return constructor parameters as a plain dict.

Bases: BaseDGP

Polynomial data-generating process with controllable degree.

Generates a dataset where the target is a polynomial expansion of a subset of features plus Gaussian noise. Polynomial terms are built manually (not via sklearn PolynomialFeatures).

Complexity mappings (locked):

  • 'low': degree 2, no interactions, n_informative = max(2, p//3)
  • 'medium': degree 3, selected cross-terms, n_informative = max(2, p//2)
  • 'high': degree 4, full cross-terms, n_informative = max(2, p)

Ground-truth importances: structural equal-weight per raw informative feature (each informative feature gets 1/n_informative). Noise features receive 0.0.
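
The two rules above (the locked n_informative mapping and the structural equal-weight importances) are simple enough to state as code. This is an illustrative sketch with assumed helper names, not the DGP's internals:

```python
import numpy as np

def poly_n_informative(complexity, n_features):
    # Locked complexity -> n_informative mapping from the list above.
    return {
        "low": max(2, n_features // 3),
        "medium": max(2, n_features // 2),
        "high": max(2, n_features),
    }[complexity]

def equal_weight_importances(n_features, n_informative):
    # Structural equal weight: each raw informative feature gets
    # 1/n_informative; noise features get exactly 0.0.
    imp = np.zeros(n_features)
    imp[:n_informative] = 1.0 / n_informative
    return imp
```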

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `complexity` | str | One of 'low', 'medium', 'high'. | 'medium' |
| `task_type` | str | 'regression' or 'classification'. | 'regression' |
| `random_state` | int | Master seed for reproducibility. | 0 |
| `class_weight` | float | Prior probability of the positive class (classification only). Must be in (0, 1). 0.5 gives balanced classes. | 0.5 |

generate(n_samples, n_features, **kwargs)

Generate a synthetic polynomial dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows. | required |
| `n_features` | int | Number of columns in X. | required |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | X shape (n_samples, n_features), y shape (n_samples,), metadata with signal_feature_importances summing to 1.0. |

get_params()

Return constructor parameters as a plain dict.

Bases: BaseDGP

Piecewise-constant data-generating process via a hand-built random tree.

The tree is constructed entirely from an RNG — no data is fitted and no sklearn tree is used. Complexity controls tree depth and leaf noise:

  • 'low': max_depth=2, leaf_noise_std=0.0 (clean splits)
  • 'medium': max_depth=3, leaf_noise_std=0.3 (small noise)
  • 'high': max_depth=5, leaf_noise_std=1.0 (noisy leaves)

Ground-truth feature importances are depth-weighted split counts: importance[f] ∝ Σ (1 / depth) for each split that uses feature f. Features never used in any split receive exactly 0.0 importance.
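
The depth-weighted rule can be sketched as follows. The helper and its (feature_index, depth) split representation are assumptions for illustration, not the DGP's actual data structures:

```python
import numpy as np

def depth_weighted_importances(splits, n_features):
    """splits: (feature_index, depth) pairs, one per internal node.

    importance[f] is proportional to the sum of 1/depth over splits on f;
    features never used in any split get exactly 0.0.
    """
    imp = np.zeros(n_features)
    for f, depth in splits:
        imp[f] += 1.0 / depth
    total = imp.sum()
    return imp / total if total > 0 else imp
```

For example, a root split on feature 0 (depth 1) and two depth-2 splits on feature 1 give raw scores of 1.0 each, hence importances 0.5 and 0.5.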

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `complexity` | str | One of 'low', 'medium', 'high'. | 'medium' |
| `task_type` | str | 'regression' or 'classification'. | 'regression' |
| `random_state` | int | Master seed for reproducibility. | 0 |
| `class_weight` | float | Prior probability of the positive class (classification only). Must be in (0, 1). 0.5 gives balanced classes. | 0.5 |

generate(n_samples, n_features, **kwargs)

Generate a piecewise-constant synthetic dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows. | required |
| `n_features` | int | Number of columns in X. | required |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | X shape (n_samples, n_features), y shape (n_samples,), metadata with signal_feature_importances summing to 1.0. |

get_params()

Return constructor parameters as a plain dict.

Bases: BaseDGP

Classic Friedman benchmark data-generating process (functions 1, 2, 3).

Implements the three Friedman benchmark functions widely used in statistics and machine learning research. Extra features beyond the formula inputs are N(0,1) noise padding with exactly 0.0 importance.

Function formulas

1 (needs ≥ 5 features, x_i ~ U[0,1]):

y = 10*sin(pi*x0*x1) + 20*(x2-0.5)^2 + 10*x3 + 5*x4 + noise

2 (needs ≥ 4 features, specific ranges):

x0 ~ U[0,100], x1 ~ U[40*pi, 560*pi], x2 ~ U[0,1], x3 ~ U[1,11]
y = sqrt(x0^2 + (x1*x2 - 1/(x1*x3))^2) + noise

3 (needs ≥ 4 features, same ranges as #2):

y = arctan((x1*x2 - 1/(x1*x3)) / x0) + noise

Complexity controls the additive Gaussian noise level:

  • 'low': noise_std = 0.0 (deterministic signal)
  • 'medium': noise_std = 1.0
  • 'high': noise_std = 3.0
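
Friedman function 1, including the noise-padding behaviour described above, can be written directly from the formulas. The function name and exact padding logic here are illustrative:

```python
import numpy as np

def friedman1(n_samples, n_features=5, noise_std=1.0, seed=0):
    """Friedman #1: y = 10*sin(pi*x0*x1) + 20*(x2-0.5)^2 + 10*x3 + 5*x4 + noise."""
    if n_features < 5:
        raise ValueError("Friedman function 1 needs at least 5 features")
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, n_features))
    if n_features > 5:
        # Columns beyond the formula inputs are N(0,1) noise padding.
        X[:, 5:] = rng.standard_normal((n_samples, n_features - 5))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3]
         + 5 * X[:, 4]
         + noise_std * rng.standard_normal(n_samples))
    return X, y
```

With noise_std = 0.0 ('low' complexity) the signal is deterministic and bounded in [0, 30].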

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `function` | int | Friedman function index: 1, 2, or 3. | 1 |
| `complexity` | str | One of 'low', 'medium', 'high'. | 'medium' |
| `task_type` | str | 'regression' or 'classification'. | 'regression' |
| `random_state` | int | Master seed for reproducibility. | 0 |
| `class_weight` | float | Prior probability of the positive class (classification only). Must be in (0, 1). 0.5 gives balanced classes. | 0.5 |

generate(n_samples, n_features, **kwargs)

Generate a Friedman benchmark dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows. | required |
| `n_features` | int | Number of columns in X. Must be ≥ the minimum required by the chosen function (5 for function 1, 4 for functions 2 and 3). | required |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | X shape (n_samples, n_features), y shape (n_samples,), metadata with signal_feature_importances summing to 1.0. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If n_features is less than the minimum required by the function. |

get_params()

Return constructor parameters as a plain dict.

Bases: BaseDGP

GAM-style additive data-generating process.

Generates a dataset where the target is a weighted sum of univariate nonlinear functions applied to individual features:

y = sum_i(w_i * f_i(X[:,i])) + noise

Ground-truth importances are variance-based: Var(w_i * f_i(X[:,i])) normalised over all informative components. Noise features get 0.0.

Complexity mappings (locked):

  • 'low': smooth functions (sin(pi*x), sqrt(|x|), x^2, x)
  • 'medium': mix of smooth + moderate functions
  • 'high': wiggly functions (sin(4*pi*x), sqrt(|x|)*sin(pi*x), sign(x))
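
The variance-based importance rule above can be sketched as an illustrative helper (assumed inputs, not the DGP's code): funcs holds the univariate f_i and weights the w_i for the informative columns.

```python
import numpy as np

def variance_importances(X, funcs, weights):
    # Var(w_i * f_i(X[:, i])) per informative component, normalised over
    # all components so the importances sum to 1.0.
    contrib = np.array([
        np.var(w * f(X[:, i]))
        for i, (f, w) in enumerate(zip(funcs, weights))
    ])
    return contrib / contrib.sum()
```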

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `complexity` | str | One of 'low', 'medium', 'high'. | 'medium' |
| `task_type` | str | 'regression' or 'classification'. | 'regression' |
| `random_state` | int | Master seed for reproducibility. | 0 |
| `class_weight` | float | Prior probability of the positive class (classification only). Must be in (0, 1). 0.5 gives balanced classes. | 0.5 |

generate(n_samples, n_features, **kwargs)

Generate a synthetic additive (GAM-style) dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows. | required |
| `n_features` | int | Number of columns in X. | required |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | X shape (n_samples, n_features), y shape (n_samples,), metadata with variance-based signal_feature_importances summing to 1.0. |

get_params()

Return constructor parameters as a plain dict.

Bases: BaseDGP

K-sparse linear data-generating process.

Exactly k features contribute to the target; the remaining n_features - k features are pure noise. Complexity controls signal strength (coefficient magnitude), not sparsity.
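
A NumPy sketch of the k-sparse construction follows (hypothetical names and scales; the real DGP's coefficient distribution and noise model may differ):

```python
import numpy as np

def make_k_sparse_data(n_samples, n_features, k, coef_scale=1.0, seed=0):
    """Exactly k informative features; the rest are pure noise."""
    if k > n_features:
        raise ValueError("k must be <= n_features")
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    coef = np.zeros(n_features)
    signal_idx = rng.choice(n_features, size=k, replace=False)
    # complexity would control coef_scale here (strong vs weak signal)
    coef[signal_idx] = coef_scale * rng.standard_normal(k)
    y = X @ coef + rng.standard_normal(n_samples)
    importances = np.abs(coef) / np.abs(coef).sum()  # exactly k non-zeros
    return X, y, importances
```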

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `k` | int | Number of informative (signal) features. Must be ≤ n_features when generate is called. | 5 |
| `complexity` | str | One of 'low', 'medium', 'high'. Controls the scale of the true coefficients (low = large scale = strong signal). | 'medium' |
| `task_type` | str | 'regression' or 'classification'. | 'regression' |
| `random_state` | int | Master seed for reproducibility. | 0 |
| `class_weight` | float | Prior probability of the positive class (classification only). Must be in (0, 1). 0.5 gives balanced classes. | 0.5 |

generate(n_samples, n_features, **kwargs)

Generate a k-sparse synthetic dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows. | required |
| `n_features` | int | Number of columns in X. Must be ≥ self.k. | required |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | X shape (n_samples, n_features), y shape (n_samples,), metadata with exactly k non-zero importances summing to 1.0. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If self.k > n_features. |

get_params()

Return constructor parameters as a plain dict.

Bases: BaseDGP

Geometric shapes classification data-generating process.

Generates binary classification datasets with structured geometric decision boundaries. Three shape families are supported:

  • 'moons': two interleaving half-circles
  • 'circles': concentric inner/outer circles
  • 'spirals': two Archimedean spirals rotated by pi

This DGP is classification only. Calling generate with task_type='regression' raises ValueError.

The first two features encode the geometric structure; any additional features are pure N(0,1) noise padding. Signal importances are always feature_0: 0.5, feature_1: 0.5; noise features receive 0.0.

Complexity controls the additive Gaussian noise applied to geometric coordinates:

  • 'low': noise_std=0.05 (clean boundary)
  • 'medium': noise_std=0.15
  • 'high': noise_std=0.30 (blurry boundary)
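
The 'moons' geometry, with coordinate noise as described above, can be sketched as follows (an illustrative construction, not the DGP's code):

```python
import numpy as np

def make_moons_xy(n_samples, noise_std=0.15, seed=0):
    """Two interleaving half-circles plus Gaussian coordinate noise."""
    rng = np.random.default_rng(seed)
    n0 = n_samples // 2
    n1 = n_samples - n0
    t0 = rng.uniform(0.0, np.pi, n0)  # angles along the upper half-circle
    t1 = rng.uniform(0.0, np.pi, n1)  # angles along the shifted lower one
    X = np.vstack([
        np.column_stack([np.cos(t0), np.sin(t0)]),
        np.column_stack([1.0 - np.cos(t1), 0.5 - np.sin(t1)]),
    ])
    X += noise_std * rng.standard_normal(X.shape)  # complexity sets noise_std
    y = np.concatenate([np.zeros(n0), np.ones(n1)])
    return X, y
```

Extra columns beyond these two would be N(0,1) padding with 0.0 importance, matching the description above.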

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `shape` | str | One of 'moons', 'circles', 'spirals'. | 'moons' |
| `complexity` | str | One of 'low', 'medium', 'high'. | 'medium' |
| `task_type` | str | Must be 'classification'; any other value causes generate to raise ValueError. | 'classification' |
| `random_state` | int | Master seed for reproducibility. | 0 |
| `class_weight` | float | Proportion of samples assigned to class 0 (approximately). Must be in (0, 1). 0.5 gives balanced classes. | 0.5 |

generate(n_samples, n_features, **kwargs)

Generate a geometric shapes classification dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows. | required |
| `n_features` | int | Number of columns in X. Must be ≥ 2. | required |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | X shape (n_samples, n_features), y in {0.0, 1.0}, metadata with signal_feature_importances summing to 1.0. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If task_type != 'classification' or n_features < 2. |

get_params()

Return constructor parameters as a plain dict.

Bases: BaseDGP

Neural network data-generating process using a randomly-initialized MLP.

Generates synthetic datasets by passing Gaussian features through a small, randomly-initialized multilayer perceptron. Ground-truth feature importances are derived from the L2 column norms of the first hidden layer weight matrix, reflecting which input features have the strongest total influence on the network output.

The network is never trained — the random architecture itself defines the data-generating process. This makes the DGP fully reproducible from random_state and independent of any optimization procedure.
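
The idea can be sketched in NumPy (the actual DGP uses PyTorch; the function name and weight scaling here are assumptions):

```python
import numpy as np

def random_mlp_data(n_samples, n_features, n_hidden_layers=2, hidden_size=32, seed=0):
    """Push Gaussian features through an untrained tanh MLP."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    sizes = [n_features] + [hidden_size] * n_hidden_layers + [1]
    weights = [rng.standard_normal((a, b)) / np.sqrt(a)
               for a, b in zip(sizes[:-1], sizes[1:])]
    h = X
    for W in weights[:-1]:
        h = np.tanh(h @ W)          # tanh-activated hidden layers
    y = (h @ weights[-1]).ravel()   # linear output layer
    # Importances: L2 norm of each input feature's outgoing weights in the
    # first layer (rows here, since matrices are stored (in, out)),
    # normalised to sum to 1.0.
    norms = np.linalg.norm(weights[0], axis=1)
    importances = norms / norms.sum()
    return X, y, importances
```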

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_hidden_layers` | int | Number of hidden (tanh-activated) layers. | 2 |
| `hidden_size` | int | Width of each hidden layer. | 32 |
| `task_type` | str | 'regression' or 'classification'. | 'regression' |
| `random_state` | int | Master seed for reproducibility. Controls both feature generation and network weight initialization. | 0 |
| `class_weight` | float | Prior probability of the positive class (classification only). Must be in (0, 1). 0.5 gives balanced classes. | 0.5 |

generate(n_samples, n_features, **kwargs)

Generate a synthetic neural dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | int | Number of rows in the output X. | required |
| `n_features` | int | Number of columns in the output X. | required |

Returns:

| Type | Description |
| --- | --- |
| BenchResult | X shape (n_samples, n_features), y shape (n_samples,), metadata with signal_feature_importances summing to 1.0 and a key recording the installed PyTorch version. |

get_params()

Return constructor parameters as a plain dict.