Corruptors API Reference

`BaseCorruptor`

Bases: ABC

Abstract base class for all corruptors.

Concrete subclasses must:

- Declare key="some_key" in their class signature to be auto-registered.
- Implement corrupt.

Corruptors transform X only; y is never passed in or mutated.

Example::

    class CollinearityCorruptor(BaseCorruptor, key="collinearity"):
        def corrupt(self, X, metadata, random_state):
            ...

`corrupt(X, metadata, random_state)` `abstractmethod`

Apply a structural transformation to X.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Feature matrix of shape (n_samples, n_features). Must not be modified in-place; return a new array.	required
`metadata`	`dict`	The BenchResult metadata dict produced by the DGP. Corruptors may add keys (e.g. effective_feature_importances) but must not remove or overwrite existing keys set by the DGP.	required
`random_state`	`int`	Integer seed. Each corruptor derives its own RNG from this value so that results are fully reproducible.	required

Returns:

Name	Type	Description
`X_corrupted`	`ndarray`	Transformed feature matrix, same shape as X.
`updated_metadata`	`dict`	Metadata dict with any corruptor-specific fields added.

`get_params()`

Return the corruptor's current configuration as a plain dict.

Concrete corruptors should override this to return their init parameters. Used by BenchPipeline to record component provenance.

`BaseLabelCorruptor`

Bases: ABC

Abstract base class for all label-space corruptors.

Concrete subclasses must:

- Declare key="some_key" in their class signature to be auto-registered.
- Implement corrupt_labels.

Label corruptors transform y only; X is passed as a read-only reference and must never be mutated.

Example::

    class LabelNoiseCorruptor(BaseLabelCorruptor, key="label_noise"):
        def corrupt_labels(self, X, y, metadata, random_state):
            ...

`corrupt_labels(X, y, metadata, random_state)` `abstractmethod`

Apply a label-space transformation to y.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Feature matrix of shape (n_samples, n_features). Read-only reference; must not be modified in-place.	required
`y`	`ndarray`	Target array of shape (n_samples,). Must not be modified in-place; return a new array.	required
`metadata`	`dict`	The BenchResult metadata dict. Label corruptors may add keys but must not remove or overwrite existing keys.	required
`random_state`	`int`	Integer seed. Each corruptor derives its own RNG from this value so that results are fully reproducible.	required

Returns:

Name	Type	Description
`y_corrupted`	`ndarray`	Transformed target array, same shape as y.
`updated_metadata`	`dict`	Metadata dict with any label-corruptor-specific fields added.

`get_params()`

Return the corruptor's current configuration as a plain dict.

Concrete corruptors should override this to return their init parameters. Used by BenchPipeline to record component provenance.

Bases: BaseCorruptor

Add Gaussian measurement noise to selected feature columns.

For each targeted column c, independent zero-mean Gaussian noise with standard deviation noise_level is added to every sample. The effective importance of each noisy column is scaled by the signal-to-noise ratio::

    scale = Var(X[:, c]) / (Var(X[:, c]) + noise_level**2)

All importances are then re-normalised to sum to 1.
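As a rough illustration of the noise injection and importance scaling above, the following sketch uses NumPy directly. The starting importances, the seed, and the choice of columns are made up for the example, and `noise_level` is treated as a standard deviation (an assumption that matches the `noise_level**2` term in the formula).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
importances = np.array([0.5, 0.3, 0.2])   # illustrative starting importances
noise_level = 0.5                         # the "medium" preset
columns = [0, 2]                          # columns to corrupt

X_noisy = X.copy()                        # never modify X in place
new_imp = importances.copy()
for c in columns:
    var_c = X[:, c].var()                 # variance before noise is added
    X_noisy[:, c] += rng.normal(0.0, noise_level, size=X.shape[0])
    new_imp[c] *= var_c / (var_c + noise_level**2)   # signal-to-noise scaling

new_imp /= new_imp.sum()                  # re-normalise to sum to 1
```

Because only the noisy columns are scaled down before re-normalisation, the relative importance of the untouched column 1 increases.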

Parameters:

Name	Type	Description	Default
`severity`	`str`	Preset noise level — `"low"` (0.1), `"medium"` (0.5), `"high"` (1.5).	`'medium'`
`noise_level`	`float | None`	Override the severity default. Must be positive.	`None`
`columns`	`list[int] | None`	Indices of columns to corrupt. `None` targets all columns.	`None`

Bases: BaseCorruptor

Introduce missing values (NaN) using MCAR, MAR, or MNAR mechanisms.

For MCAR (Missing Completely At Random), int(proportion * n_samples) randomly chosen rows are set to NaN in each targeted column.

For MAR (Missing At Random), missingness probability is driven by a pivot column via a logistic function. The pivot column itself is never corrupted unless it appears in columns.

For MNAR (Missing Not At Random), each column's missingness probability is driven by the column's own pre-corruption values via a logistic function — higher values are more likely to be missing.

The effective importance of each corrupted column is scaled by the realised missing proportion and all importances are re-normalised.
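The three mechanisms can be contrasted with a small NumPy sketch. The MCAR part follows the description directly; for MAR and MNAR, the standardisation of the driving column and the rescaling of probabilities to hit the target rate on average are assumptions made for illustration, since the exact logistic parameters are not given here.

```python
import numpy as np


def logistic(z):
    """Standard logistic function, mapping scores to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def standardise(v):
    """Centre and scale a column so the logistic input is unit-free."""
    return (v - v.mean()) / v.std()


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
n = X.shape[0]
proportion = 0.15   # the "medium" preset
target = 1          # column receiving NaNs
pivot_col = 0       # column driving MAR missingness

# MCAR: a fixed number of rows, chosen uniformly at random
X_mcar = X.copy()
rows = rng.choice(n, size=int(proportion * n), replace=False)
X_mcar[rows, target] = np.nan

# MAR: probability depends on the pivot column, not on the target itself
p_mar = logistic(standardise(X[:, pivot_col]))
p_mar *= proportion / p_mar.mean()   # assumed calibration to the target rate
X_mar = X.copy()
X_mar[rng.random(n) < p_mar, target] = np.nan

# MNAR: probability depends on the target column's own pre-corruption values
p_mnar = logistic(standardise(X[:, target]))
p_mnar *= proportion / p_mnar.mean()
X_mnar = X.copy()
X_mnar[rng.random(n) < p_mnar, target] = np.nan
```

In the MAR case the pivot column stays fully observed, while in the MNAR case larger values of the target column are the ones more likely to go missing.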

Parameters:

Name	Type	Description	Default
`severity`	`str`	Preset missing rate — `"low"` (5 %), `"medium"` (15 %), `"high"` (30 %).	`'medium'`
`proportion`	`float | None`	Fraction of values to replace with NaN per column. Overrides severity default.	`None`
`columns`	`list[int] | None`	Column indices to corrupt. `None` targets all columns.	`None`
`mechanism`	`str`	Missing data mechanism — `"mcar"`, `"mar"`, or `"mnar"`. Defaults to `"mcar"` for full backward compatibility.	`'mcar'`
`pivot_col`	`int | None`	Column index used as the pivot for MAR missingness. Ignored for MCAR and MNAR. Defaults to column 0 when `None`.	`None`

Bases: BaseCorruptor

Inject outliers into selected feature columns.

For each targeted column c, a random subset of int(proportion * n_samples) rows is replaced with extreme values::

    x_new = x_orig + rng.uniform(mag_low, mag_high) * std(col) * sign

where sign is ±1 drawn uniformly. The effective importance of each corrupted column is scaled by (1 - proportion) and all importances are then re-normalised.
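The outlier formula and the importance update can be sketched as follows. The data, seed, and starting importances are illustrative, and drawing a separate magnitude and sign for each selected row is an assumption, since the formula does not say whether the draw is per row or per column.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
proportion, (mag_low, mag_high) = 0.05, (10.0, 20.0)   # the "medium" preset
c = 0                                                  # column to corrupt

n = X.shape[0]
k = int(proportion * n)                                # number of rows to hit
rows = rng.choice(n, size=k, replace=False)
magnitude = rng.uniform(mag_low, mag_high, size=k)     # assumed: one draw per row
sign = rng.choice([-1.0, 1.0], size=k)                 # +1 or -1, uniformly

X_out = X.copy()                                       # leave X untouched
X_out[rows, c] = X[rows, c] + magnitude * X[:, c].std() * sign

importances = np.array([0.6, 0.4])                     # illustrative values
new_imp = importances.copy()
new_imp[c] *= 1.0 - proportion                         # discount corrupted column
new_imp /= new_imp.sum()                               # re-normalise to sum to 1
```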

Parameters:

Name	Type	Description	Default
`severity`	`str`	Preset — `"low"` (2 %, 5–10×), `"medium"` (5 %, 10–20×), `"high"` (10 %, 20–50×).	`'medium'`
`proportion`	`float | None`	Fraction of samples to corrupt per column. Overrides severity.	`None`
`magnitude_range`	`tuple[float, float] | None`	`(mag_low, mag_high)` multiplier range on the column std. Overrides severity default.	`None`
`columns`	`list[int] | None`	Column indices to corrupt. `None` targets all columns.	`None`

Bases: BaseCorruptor

Converts continuous features to integer-encoded bins.

For each targeted column, quantile-based bin edges are computed and np.digitize is used to assign each sample to a bin index 0, 1, ..., n_bins-1. The column values are replaced with these integer indices cast to float.

Feature importances are discounted by the factor (1 - 1/n_bins) for each targeted column, then re-normalised to sum to 1.
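A minimal sketch of quantile binning with `np.digitize` is shown below. Using only the interior quantiles as bin edges is an assumption chosen so that the indices fall in 0, 1, ..., n_bins-1; the library may compute its edges differently. The starting importances are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
n_bins = 5                                 # the "medium" preset
c = 0                                      # column to bin

# Interior quantile edges: n_bins - 1 cut points give n_bins bins
edges = np.quantile(X[:, c], np.linspace(0, 1, n_bins + 1)[1:-1])

X_binned = X.copy()                        # leave X untouched
X_binned[:, c] = np.digitize(X[:, c], edges).astype(float)

importances = np.array([0.7, 0.3])         # illustrative values
new_imp = importances.copy()
new_imp[c] *= 1.0 - 1.0 / n_bins           # discount the binned column
new_imp /= new_imp.sum()                   # re-normalise to sum to 1
```

With fewer bins the discount factor shrinks (0.5 for two bins versus 0.9 for ten), which matches the intuition that coarser binning destroys more information.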

Parameters:

Name	Type	Description	Default
`severity`	`str`	Controls default `n_bins` when `n_bins` is not provided: `"low"` -> 10, `"medium"` -> 5, `"high"` -> 2.	`'medium'`
`n_bins`	`int | None`	If provided, overrides the severity-derived number of bins.	`None`
`columns`	`list[int] | None`	Indices of columns to target. `None` targets all columns.	`None`

`corrupt(X, metadata, random_state)`

Bin targeted columns and update importances.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Feature matrix of shape `(n_samples, n_features)`.	required
`metadata`	`dict`	Metadata dict from the DGP. Must contain either `effective_feature_importances` or `signal_feature_importances`.	required
`random_state`	`int`	Integer seed. Binning is fully deterministic given X, so `random_state` does not affect output; it is accepted for interface consistency.	required

Returns:

Name	Type	Description
`X_corrupted`	`ndarray`	Same shape as X; targeted columns contain integer-valued floats.
`updated_metadata`	`dict`	Metadata with updated `effective_feature_importances`.

`get_params()`

Return constructor parameters as a plain dict.

Bases: BaseCorruptor

Appends proxy columns that are noisy copies of targeted features.

For each targeted column c, a proxy column is generated as::

    proxy = X[:, c] * scale + N(0, noise_std, n_samples)

The proxy is appended at the end of X, expanding the feature matrix from (n_samples, n_features) to (n_samples, n_features + n_targeted).

Feature importances are split between the original and proxy columns using the coefficient of determination (r^2) of the proxy relative to the original signal.
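The proxy construction can be sketched in NumPy as below. The proxy formula and the appended column follow the description; the specific split rule (source keeps 1/(1 + r^2) of its importance, proxy receives r^2/(1 + r^2)) is purely an assumption for illustration, since the exact rule is not stated here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
scale, noise_std = 1.0, 0.3                # defaults for the "medium" preset
c = 0                                      # column to proxy

proxy = X[:, c] * scale + rng.normal(0.0, noise_std, size=X.shape[0])
X_aug = np.column_stack([X, proxy])        # proxy appended as the last column

# r^2 of the proxy relative to the original column
r2 = np.corrcoef(X[:, c], proxy)[0, 1] ** 2

importances = np.array([0.6, 0.4])         # illustrative values
# Assumed split rule (hypothetical): share the source importance by r^2
source_share = importances[c] / (1.0 + r2)
proxy_share = importances[c] * r2 / (1.0 + r2)
new_imp = np.array([source_share, importances[1], proxy_share])
```

Under this assumed rule a nearly perfect proxy (r^2 close to 1) takes close to half of the source column's importance, while a very noisy proxy takes almost none.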

Parameters:

Name	Type	Description	Default
`severity`	`str`	Controls default `noise_std` when `noise_std` is not provided: `"low"` -> 0.05, `"medium"` -> 0.3, `"high"` -> 0.8.	`'medium'`
`noise_std`	`float | None`	If provided, overrides the severity-derived noise standard deviation.	`None`
`scale`	`float`	Multiplicative factor applied to the source column when generating the proxy. Defaults to 1.0.	`1.0`
`columns`	`list[int] | None`	Indices of columns to target. `None` targets all columns.	`None`

`corrupt(X, metadata, random_state)`

Append proxy columns and update importances.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Feature matrix of shape `(n_samples, n_features)`.	required
`metadata`	`dict`	Metadata dict from the DGP. Must contain either `effective_feature_importances` or `signal_feature_importances`.	required
`random_state`	`int`	Integer seed for reproducibility.	required

Returns:

Name	Type	Description
`X_corrupted`	`ndarray`	Shape `(n_samples, n_features + n_targeted)`.
`updated_metadata`	`dict`	Metadata with updated `effective_feature_importances` and `proxy_source_map`.

`get_params()`

Return constructor parameters as a plain dict.

`LabelNoiseCorruptor`

Bases: BaseLabelCorruptor

Injects label noise into classification or regression targets.

For binary classification: flips floor(noise_rate * n) labels uniformly at random (binary flip: 0 -> 1, 1 -> 0).

For regression: adds N(0, noise_std) Gaussian noise to all targets.
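Both noise modes can be sketched in a few lines of NumPy. The data and seed are made up for the example, and this is only an illustration of the two rules above, not the class's actual code.

```python
import math

import numpy as np

rng = np.random.default_rng(42)

# Binary classification: flip exactly floor(noise_rate * n) labels
y_cls = rng.integers(0, 2, size=100)
noise_rate = 0.1
n_flip = math.floor(noise_rate * len(y_cls))
idx = rng.choice(len(y_cls), size=n_flip, replace=False)   # distinct positions
y_cls_noisy = y_cls.copy()                                 # leave y untouched
y_cls_noisy[idx] = 1 - y_cls_noisy[idx]                    # 0 -> 1, 1 -> 0

# Regression: add Gaussian noise with standard deviation noise_std to every target
y_reg = rng.normal(size=100)
noise_std = 0.1
y_reg_noisy = y_reg + rng.normal(0.0, noise_std, size=y_reg.shape)
```

Sampling positions without replacement guarantees that the number of changed labels equals `n_flip`, since no label is flipped twice.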

Parameters:

Name	Type	Description	Default
`noise_rate`	`float`	Fraction of classification labels to flip (binary only). Ignored for regression.	`0.05`
`noise_std`	`float`	Standard deviation of Gaussian noise added to regression targets. Ignored for classification.	`0.1`

Examples:

>>> corruptor = LabelNoiseCorruptor(noise_rate=0.1)
>>> y_out, meta = corruptor.corrupt_labels(X, y, metadata, random_state=42)

`corrupt_labels(X, y, metadata, random_state)`

Apply label noise to y.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Feature matrix (read-only reference, not mutated).	required
`y`	`ndarray`	Target array of shape (n_samples,).	required
`metadata`	`dict`	BenchResult metadata dict; must contain `metadata["dgp_params"]["task_type"]`.	required
`random_state`	`int`	Integer seed for reproducible corruption.	required

Returns:

Name	Type	Description
`y_corrupted`	`ndarray`	Corrupted target array.
`updated_metadata`	`dict`	Shallow copy of metadata with `"label_noise"` key added.

Raises:

Type	Description
`ValueError`	If task_type is `"classification"` and y contains more than 2 unique values.

`get_params()`

Return corruptor parameters as a plain dict.
