Corruptors API Reference

Bases: ABC

Abstract base class for all corruptors.

Concrete subclasses must:

- Declare key="some_key" in their class signature to be auto-registered.
- Implement corrupt.

Corruptors transform X only; y is never passed in or mutated.

Example::

class CollinearityCorruptor(BaseCorruptor, key="collinearity"):
    def corrupt(self, X, metadata, random_state):
        ...

corrupt(X, metadata, random_state) abstractmethod

Apply a structural transformation to X.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| X | ndarray | Feature matrix of shape (n_samples, n_features). Must not be modified in-place; return a new array. | required |
| metadata | dict | The BenchResult metadata dict produced by the DGP. Corruptors may add keys (e.g. effective_feature_importances) but must not remove or overwrite existing keys set by the DGP. | required |
| random_state | int | Integer seed. Each corruptor derives its own RNG from this value so that results are fully reproducible. | required |

Returns:

| Name | Type | Description |
|------|------|-------------|
| X_corrupted | ndarray | Transformed feature matrix, same shape as X. |
| updated_metadata | dict | Metadata dict with any corruptor-specific fields added. |

get_params()

Return the corruptor's current configuration as a plain dict.

Concrete corruptors should override this to return their init parameters. Used by BenchPipeline to record component provenance.
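The registration and contract rules above can be illustrated with a small sketch. It assumes the key= class keyword is handled through `__init_subclass__`; the page does not say how the library actually does it. `CORRUPTOR_REGISTRY`, the stand-in `BaseCorruptor`, and `ShiftCorruptor` are all hypothetical names defined here only so the example runs on its own.

```python
from abc import ABC, abstractmethod

import numpy as np

# Hypothetical registry; the real library's registry is not documented on this page.
CORRUPTOR_REGISTRY = {}


class BaseCorruptor(ABC):
    """Stand-in for the documented base class (illustrative only)."""

    def __init_subclass__(cls, key=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if key is not None:
            CORRUPTOR_REGISTRY[key] = cls  # auto-register under the declared key

    @abstractmethod
    def corrupt(self, X, metadata, random_state):
        ...

    def get_params(self):
        return {}


class ShiftCorruptor(BaseCorruptor, key="shift"):
    """Hypothetical corruptor: adds one random offset per column."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def corrupt(self, X, metadata, random_state):
        rng = np.random.default_rng(random_state)  # RNG derived from the integer seed
        X_out = X + self.scale * rng.normal(size=(1, X.shape[1]))  # new array; X untouched
        updated = dict(metadata)  # shallow copy, so the caller's dict is not modified
        updated["shift_scale"] = self.scale  # add a key, never overwrite DGP keys
        return X_out, updated

    def get_params(self):
        return {"scale": self.scale}


X = np.ones((4, 3))
meta = {"dgp_params": {"task_type": "regression"}}
corruptor = CORRUPTOR_REGISTRY["shift"](scale=0.5)
X_out, updated = corruptor.corrupt(X, meta, random_state=0)
```

Deriving the RNG inside corrupt from the integer seed means two calls with the same random_state return identical arrays.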

Bases: ABC

Abstract base class for all label-space corruptors.

Concrete subclasses must:

- Declare key="some_key" in their class signature to be auto-registered.
- Implement corrupt_labels.

Label corruptors transform y only; X is passed as a read-only reference and must never be mutated.

Example::

class LabelNoiseCorruptor(BaseLabelCorruptor, key="label_noise"):
    def corrupt_labels(self, X, y, metadata, random_state):
        ...

corrupt_labels(X, y, metadata, random_state) abstractmethod

Apply a label-space transformation to y.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| X | ndarray | Feature matrix of shape (n_samples, n_features). Read-only reference; must not be modified in-place. | required |
| y | ndarray | Target array of shape (n_samples,). Must not be modified in-place; return a new array. | required |
| metadata | dict | The BenchResult metadata dict. Label corruptors may add keys but must not remove or overwrite existing keys. | required |
| random_state | int | Integer seed. Each corruptor derives its own RNG from this value so that results are fully reproducible. | required |

Returns:

| Name | Type | Description |
|------|------|-------------|
| y_corrupted | ndarray | Transformed target array, same shape as y. |
| updated_metadata | dict | Metadata dict with any label-corruptor-specific fields added. |

get_params()

Return the corruptor's current configuration as a plain dict.

Concrete corruptors should override this to return their init parameters. Used by BenchPipeline to record component provenance.
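One way to check the label-corruptor contract in a test suite is sketched below. `check_label_corruptor_contract` and the duck-typed `SignFlipLabels` class are hypothetical helpers written for this example. They are not part of the library and rely only on the corrupt_labels signature documented above.

```python
import numpy as np


def check_label_corruptor_contract(corruptor, X, y, metadata, random_state=0):
    """Hypothetical helper: assert that corrupt_labels honours the documented rules."""
    X_before, y_before = X.copy(), y.copy()
    meta_before = dict(metadata)

    y_out, meta_out = corruptor.corrupt_labels(X, y, metadata, random_state)

    assert np.array_equal(X, X_before), "X was mutated"
    assert np.array_equal(y, y_before), "y was modified in-place"
    assert y_out.shape == y.shape, "y_corrupted must keep the shape of y"
    assert dict(metadata) == meta_before, "the input metadata dict was modified"
    # Existing keys must survive and still point at the same objects.
    assert all(meta_out.get(k) is v for k, v in meta_before.items())
    return y_out, meta_out


class SignFlipLabels:
    """Toy label corruptor used only to exercise the checker."""

    def corrupt_labels(self, X, y, metadata, random_state):
        rng = np.random.default_rng(random_state)
        flip = rng.random(y.shape[0]) < 0.5
        y_out = np.where(flip, -y, y)  # new array; y is left untouched
        updated = dict(metadata)
        updated["sign_flip"] = {"n_flipped": int(flip.sum())}
        return y_out, updated


X = np.zeros((6, 2))
y = np.arange(1.0, 7.0)
meta = {"dgp_params": {"task_type": "regression"}}
y_out, meta_out = check_label_corruptor_contract(SignFlipLabels(), X, y, meta)
```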

Bases: BaseCorruptor

Add Gaussian measurement noise to selected feature columns.

For each targeted column c, adds independent zero-mean Gaussian noise with standard deviation noise_level to every sample. The effective importance of each noisy column is scaled by the share of its post-noise variance that comes from the original signal::

scale = Var(X[:, c]) / (Var(X[:, c]) + noise_level**2)

All importances are then re-normalised to sum to 1.
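The scaling and re-normalisation step can be worked through numerically. This is a minimal sketch, assuming noise_level is a standard deviation (as the noise_level**2 term suggests) and assuming a starting importance vector; the arrays and variable names are illustrative, not the library's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([1.0, 2.0, 0.5])
importances = np.array([0.5, 0.3, 0.2])  # assumed importances from the DGP
noise_level = 0.5                        # the documented "medium" preset
columns = [0, 2]                         # columns to corrupt

X_noisy = X.copy()
scaled = importances.copy()
for c in columns:
    # Add independent Gaussian noise to every sample of column c.
    X_noisy[:, c] += rng.normal(0.0, noise_level, size=X.shape[0])
    # scale = Var(X[:, c]) / (Var(X[:, c]) + noise_level**2)
    var_c = X[:, c].var()
    scaled[c] *= var_c / (var_c + noise_level**2)

# Re-normalise so the importances sum to 1 again.
effective = scaled / scaled.sum()
```

Because only the noisy columns are scaled down, the untouched column ends up with a larger share after re-normalisation.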

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| severity | str | Preset noise level: "low" (0.1), "medium" (0.5), "high" (1.5). | 'medium' |
| noise_level | float \| None | Override the severity default. Must be positive. | None |
| columns | list[int] \| None | Indices of columns to corrupt. None targets all columns. | None |

Bases: BaseCorruptor

Introduce missing values (NaN) using MCAR, MAR, or MNAR mechanisms.

For MCAR (Missing Completely At Random), a random subset of int(proportion * n_samples) rows is set to NaN per targeted column.

For MAR (Missing At Random), missingness probability is driven by a pivot column via a logistic function. The pivot column itself is never corrupted unless it appears in columns.

For MNAR (Missing Not At Random), each column's missingness probability is driven by the column's own pre-corruption values via a logistic function — higher values are more likely to be missing.

The effective importance of each corrupted column is scaled by the realised missing proportion and all importances are re-normalised.
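The three mechanisms can be compared with a small mask-building sketch. The MCAR branch follows the description above. The logistic calibration used for MAR and MNAR (standardise the driving column, then shift by the logit of the target proportion) is an assumption made for illustration, since the exact parameterisation is not given here. `missing_mask` is a hypothetical helper.

```python
import numpy as np


def missing_mask(X, col, proportion, mechanism, rng, pivot_col=0):
    """Return a boolean mask of rows to set to NaN in column `col` (illustrative)."""
    n = X.shape[0]
    if mechanism == "mcar":
        # Exactly int(proportion * n_samples) rows, chosen uniformly at random.
        mask = np.zeros(n, dtype=bool)
        mask[rng.choice(n, size=int(proportion * n), replace=False)] = True
        return mask
    if mechanism == "mar":
        driver = X[:, pivot_col]  # missingness driven by the pivot column
    elif mechanism == "mnar":
        driver = X[:, col]        # missingness driven by the column's own values
    else:
        raise ValueError(f"unknown mechanism: {mechanism!r}")
    z = (driver - driver.mean()) / (driver.std() + 1e-12)
    # Assumed calibration: a row with an average driver value has missingness
    # probability equal to `proportion`; higher driver values raise it.
    offset = np.log(proportion / (1.0 - proportion))
    prob = 1.0 / (1.0 + np.exp(-(z + offset)))
    return rng.random(n) < prob


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
X_out = X.copy()  # never modify X in-place
mask = missing_mask(X, col=2, proportion=0.15, mechanism="mnar", rng=rng)
X_out[mask, 2] = np.nan
realised = mask.mean()  # realised missing proportion for this column
```

With the logistic mechanisms the realised proportion varies from run to run, which is why the importance update above is based on the realised rate, not the requested one.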

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| severity | str | Preset missing rate: "low" (5 %), "medium" (15 %), "high" (30 %). | 'medium' |
| proportion | float \| None | Fraction of values to replace with NaN per column. Overrides severity default. | None |
| columns | list[int] \| None | Column indices to corrupt. None targets all columns. | None |
| mechanism | str | Missing data mechanism: "mcar", "mar", or "mnar". Defaults to "mcar" for full backward compatibility. | 'mcar' |
| pivot_col | int \| None | Column index used as the pivot for MAR missingness. Ignored for MCAR and MNAR. Defaults to column 0 when None. | None |

Bases: BaseCorruptor

Inject outliers into selected feature columns.

For each targeted column c, a random subset of rows (of size int(proportion * n_samples)) is shifted to extreme values::

x_new = x_orig + rng.uniform(mag_low, mag_high) * std(col) * sign

where sign is ±1 drawn uniformly. The effective importance of each corrupted column is scaled by (1 - proportion) and all importances are then re-normalised.
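The injection rule and the importance update can be sketched with NumPy. This example assumes the magnitude and sign are drawn independently for each selected row, which the formula above does not state explicitly. The importance vector and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
importances = np.array([0.6, 0.4])   # assumed importances from the DGP
proportion = 0.05                    # documented "medium" preset
mag_low, mag_high = 10.0, 20.0       # documented "medium" preset
columns = [0]                        # columns to corrupt

X_out = X.copy()
scaled = importances.copy()
n_out = int(proportion * X.shape[0])
for c in columns:
    rows = rng.choice(X.shape[0], size=n_out, replace=False)
    sign = rng.choice([-1.0, 1.0], size=n_out)          # +1 or -1, uniformly
    mag = rng.uniform(mag_low, mag_high, size=n_out)    # multiplier on the column std
    # x_new = x_orig + magnitude * std(col) * sign
    X_out[rows, c] = X[rows, c] + mag * X[:, c].std() * sign
    scaled[c] *= 1.0 - proportion

effective = scaled / scaled.sum()  # re-normalise to sum to 1
```

If every column is targeted with the same proportion, the (1 - proportion) factors cancel during re-normalisation and the importances are unchanged.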

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| severity | str | Preset: "low" (2 %, 5–10×), "medium" (5 %, 10–20×), "high" (10 %, 20–50×). | 'medium' |
| proportion | float \| None | Fraction of samples to corrupt per column. Overrides severity. | None |
| magnitude_range | tuple[float, float] \| None | (mag_low, mag_high) multiplier range on the column std. Overrides severity default. | None |
| columns | list[int] \| None | Column indices to corrupt. None targets all columns. | None |

Bases: BaseCorruptor

Converts continuous features to integer-encoded bins.

For each targeted column, quantile-based bin edges are computed and np.digitize is used to assign each sample to a bin index 0, 1, ..., n_bins-1. The column values are replaced with these integer indices cast to float.

Feature importances are discounted by the factor (1 - 1/n_bins) for each targeted column, then re-normalised to sum to 1.
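Quantile binning with np.digitize can be sketched for a single column. The example assumes the n_bins - 1 interior quantiles are used as cut points, which yields indices 0 to n_bins - 1. The library's exact edge handling may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)  # one continuous feature column
n_bins = 5                 # documented "medium" preset

# Interior quantile edges: n_bins - 1 cut points split the data into n_bins bins.
edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])

# np.digitize returns 0 below the first edge and n_bins - 1 at or above the last.
binned = np.digitize(x, edges).astype(float)

# Importance discount applied to a binned column before re-normalisation.
discount = 1.0 - 1.0 / n_bins
```

With n_bins = 2 ("high" severity) the discount is 0.5, so a binarised column keeps half of its importance before re-normalisation.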

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| severity | str | Controls default n_bins when n_bins is not provided: "low" -> 10, "medium" -> 5, "high" -> 2. | 'medium' |
| n_bins | int \| None | If provided, overrides the severity-derived number of bins. | None |
| columns | list[int] \| None | Indices of columns to target. None targets all columns. | None |

corrupt(X, metadata, random_state)

Bin targeted columns and update importances.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| X | ndarray | Feature matrix of shape (n_samples, n_features). | required |
| metadata | dict | Metadata dict from the DGP. Must contain either effective_feature_importances or signal_feature_importances. | required |
| random_state | int | Integer seed. Binning is fully deterministic given X, so random_state does not affect output; it is accepted for interface consistency. | required |

Returns:

| Name | Type | Description |
|------|------|-------------|
| X_corrupted | ndarray | Same shape as X; targeted columns contain integer-valued floats. |
| updated_metadata | dict | Metadata with updated effective_feature_importances. |

get_params()

Return constructor parameters as a plain dict.

Bases: BaseCorruptor

Appends proxy columns that are noisy copies of targeted features.

For each targeted column c, a proxy column is generated as::

proxy = X[:, c] * scale + N(0, noise_std, n_samples)

The proxy is appended at the end of X, expanding the feature matrix from (n_samples, n_features) to (n_samples, n_features + n_targeted).

Feature importances are split between the original and proxy columns using the coefficient of determination (r^2) of the proxy relative to the original signal.
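Proxy generation and the r^2 measurement can be sketched as follows. The example computes r^2 as the squared Pearson correlation between source and proxy, which is one common reading of "coefficient of determination" and is an assumption here. The rule for splitting importance between the two columns is not given above, so it is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
scale, noise_std = 1.0, 0.3  # default scale, documented "medium" noise preset
columns = [0]                # columns to build proxies for

# proxy = X[:, c] * scale + N(0, noise_std, n_samples)
proxies = [
    X[:, c] * scale + rng.normal(0.0, noise_std, size=X.shape[0])
    for c in columns
]

# Append proxies at the end: (n_samples, n_features + n_targeted).
X_out = np.column_stack([X] + proxies)

# Squared correlation between the source column and its proxy.
r2 = np.corrcoef(X[:, 0], X_out[:, 2])[0, 1] ** 2
```

For a unit-variance source with scale 1.0 and noise_std 0.3, r^2 should land near 1 / (1 + 0.3**2), roughly 0.92, so the proxy carries most of the original signal.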

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| severity | str | Controls default noise_std when noise_std is not provided: "low" -> 0.05, "medium" -> 0.3, "high" -> 0.8. | 'medium' |
| noise_std | float \| None | If provided, overrides the severity-derived noise standard deviation. | None |
| scale | float | Multiplicative factor applied to the source column when generating the proxy. Defaults to 1.0. | 1.0 |
| columns | list[int] \| None | Indices of columns to target. None targets all columns. | None |

corrupt(X, metadata, random_state)

Append proxy columns and update importances.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| X | ndarray | Feature matrix of shape (n_samples, n_features). | required |
| metadata | dict | Metadata dict from the DGP. Must contain either effective_feature_importances or signal_feature_importances. | required |
| random_state | int | Integer seed for reproducibility. | required |

Returns:

| Name | Type | Description |
|------|------|-------------|
| X_corrupted | ndarray | Shape (n_samples, n_features + n_targeted). |
| updated_metadata | dict | Metadata with updated effective_feature_importances and proxy_source_map. |

get_params()

Return constructor parameters as a plain dict.

Bases: BaseLabelCorruptor

Injects label noise into classification or regression targets.

For binary classification: flips floor(noise_rate * n) labels uniformly at random (binary flip: 0 -> 1, 1 -> 0).

For regression: adds N(0, noise_std) Gaussian noise to all targets.
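Both branches can be sketched in a standalone function. `add_label_noise` is a hypothetical helper, not the library's implementation. It assumes binary labels are encoded as 0/1, and it takes task_type as a plain argument where the library reads it from metadata.

```python
import math

import numpy as np


def add_label_noise(y, task_type, noise_rate=0.05, noise_std=0.1, random_state=0):
    """Illustrative label noise: flip binary labels or add Gaussian noise."""
    rng = np.random.default_rng(random_state)
    if task_type == "classification":
        if np.unique(y).size > 2:
            raise ValueError("label flipping is only defined for binary targets")
        y_out = y.copy()  # never modify y in-place
        n_flip = math.floor(noise_rate * y.shape[0])
        idx = rng.choice(y.shape[0], size=n_flip, replace=False)
        y_out[idx] = 1 - y_out[idx]  # 0 -> 1, 1 -> 0
        return y_out
    # Regression: Gaussian noise with standard deviation noise_std on every target.
    return y + rng.normal(0.0, noise_std, size=y.shape[0])


y_cls = np.array([0, 1] * 50)
y_flipped = add_label_noise(y_cls, "classification", noise_rate=0.25)

y_reg = np.linspace(0.0, 1.0, 100)
y_noisy = add_label_noise(y_reg, "regression", noise_std=0.1)
```

Sampling indices without replacement makes the number of flipped labels exactly floor(noise_rate * n), instead of a random count that only matches the rate on average.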

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| noise_rate | float | Fraction of classification labels to flip (binary only). Ignored for regression. | 0.05 |
| noise_std | float | Standard deviation of Gaussian noise added to regression targets. Ignored for classification. | 0.1 |

Examples:

>>> corruptor = LabelNoiseCorruptor(noise_rate=0.1)
>>> y_out, meta = corruptor.corrupt_labels(X, y, metadata, random_state=42)

corrupt_labels(X, y, metadata, random_state)

Apply label noise to y.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| X | ndarray | Feature matrix (read-only reference, not mutated). | required |
| y | ndarray | Target array of shape (n_samples,). | required |
| metadata | dict | BenchResult metadata dict; must contain metadata["dgp_params"]["task_type"]. | required |
| random_state | int | Integer seed for reproducible corruption. | required |

Returns:

| Name | Type | Description |
|------|------|-------------|
| y_corrupted | ndarray | Corrupted target array. |
| updated_metadata | dict | Shallow copy of metadata with "label_noise" key added. |

Raises:

| Type | Description |
|------|-------------|
| ValueError | If task_type is "classification" and y contains more than 2 unique values. |

get_params()

Return corruptor parameters as a plain dict.