Corruptors API Reference
Bases: ABC
Abstract base class for all corruptors.
Concrete subclasses must:
- Declare key="some_key" in their class signature to be auto-registered.
- Implement corrupt.
Corruptors transform X only; y is never passed in or mutated.
Example::
    class CollinearityCorruptor(BaseCorruptor, key="collinearity"):
        def corrupt(self, X, metadata, random_state):
            ...
corrupt(X, metadata, random_state)
abstractmethod
Apply a structural transformation to X.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Feature matrix of shape (n_samples, n_features). Must not be modified in-place; return a new array. | required |
| metadata | dict | The BenchResult metadata dict produced by the DGP. Corruptors may add keys (e.g. effective_feature_importances) but must not remove or overwrite existing keys set by the DGP. | required |
| random_state | int | Integer seed. Each corruptor derives its own RNG from this value so that results are fully reproducible. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| X_corrupted | ndarray | Transformed feature matrix, same shape as X. |
| updated_metadata | dict | Metadata dict with any corruptor-specific fields added. |
get_params()
Return the corruptor's current configuration as a plain dict.
Concrete corruptors should override this to return their init parameters. Used by BenchPipeline to record component provenance.
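To make the `corrupt` and `get_params` contract concrete, here is a minimal sketch of a corruptor-like class. It deliberately does not subclass the real `BaseCorruptor` (which is not importable here); `ShuffleColumnSketch`, the `shuffled_column` key, and the `n_informative` metadata entry are all hypothetical. It shows the three rules stated above: derive an RNG from the integer seed, return a new array instead of mutating `X`, and only add metadata keys.

```python
import numpy as np


class ShuffleColumnSketch:
    """Hypothetical corruptor-like class following the documented contract."""

    def __init__(self, column=0):
        self.column = column

    def corrupt(self, X, metadata, random_state):
        rng = np.random.default_rng(random_state)   # RNG derived from the integer seed
        X_out = X.copy()                             # never modify X in place
        X_out[:, self.column] = rng.permutation(X_out[:, self.column])
        updated = dict(metadata)                     # shallow copy; existing keys are kept
        updated["shuffled_column"] = self.column     # hypothetical corruptor-specific key
        return X_out, updated

    def get_params(self):
        # Return the init parameters as a plain dict.
        return {"column": self.column}


X = np.arange(12, dtype=float).reshape(6, 2)
meta = {"n_informative": 2}                          # hypothetical DGP metadata
X_new, meta_new = ShuffleColumnSketch(column=1).corrupt(X, meta, random_state=0)
```

Calling `corrupt` twice with the same `random_state` yields the same output, which is the reproducibility property the parameter description asks for.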
Bases: ABC
Abstract base class for all label-space corruptors.
Concrete subclasses must:
- Declare key="some_key" in their class signature to be auto-registered.
- Implement corrupt_labels.
Label corruptors transform y only; X is passed as a read-only reference and must never be mutated.
Example::
    class LabelNoiseCorruptor(BaseLabelCorruptor, key="label_noise"):
        def corrupt_labels(self, X, y, metadata, random_state):
            ...
corrupt_labels(X, y, metadata, random_state)
abstractmethod
Apply a label-space transformation to y.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Feature matrix of shape (n_samples, n_features). Read-only reference; must not be modified in-place. | required |
| y | ndarray | Target array of shape (n_samples,). Must not be modified in-place; return a new array. | required |
| metadata | dict | The BenchResult metadata dict. Label corruptors may add keys but must not remove or overwrite existing keys. | required |
| random_state | int | Integer seed. Each corruptor derives its own RNG from this value so that results are fully reproducible. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| y_corrupted | ndarray | Transformed target array, same shape as y. |
| updated_metadata | dict | Metadata dict with any label-corruptor-specific fields added. |
get_params()
Return the corruptor's current configuration as a plain dict.
Concrete corruptors should override this to return their init parameters. Used by BenchPipeline to record component provenance.
Bases: BaseCorruptor
Add Gaussian measurement noise to selected feature columns.
For each targeted column c, adds independent zero-mean Gaussian noise,
N(0, noise_level), to every sample. The formula below uses noise_level**2
as the noise variance, so noise_level acts as a standard deviation. The
effective importance of each noisy column is scaled by a signal-to-noise factor::
    scale = Var(X[:, c]) / (Var(X[:, c]) + noise_level**2)
All importances are then re-normalised to sum to 1.
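The scaling and re-normalisation steps can be worked through in a few lines of NumPy. This is an illustrative sketch, not the library's implementation: the starting importances, the chosen columns, and all variable names are assumptions, and `noise_level` is treated as a standard deviation as the scale formula implies.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([1.0, 2.0, 0.5])
importances = np.array([0.5, 0.3, 0.2])    # illustrative starting importances
noise_level = 1.0                           # treated as a standard deviation
columns = [0, 2]                            # columns that receive noise

X_noisy = X.copy()                          # leave the input untouched
effective = importances.copy()
for c in columns:
    signal_var = X[:, c].var()
    X_noisy[:, c] += rng.normal(0.0, noise_level, size=X.shape[0])
    # Down-weight the column by signal variance over total variance.
    effective[c] *= signal_var / (signal_var + noise_level**2)

effective /= effective.sum()                # re-normalise to sum to 1
```

Because the noisy columns lose weight before re-normalisation, the untouched column 1 ends up with a larger share of the total importance than it started with.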
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| severity | str | Preset noise level: … | 'medium' |
| noise_level | float \| None | Override the severity default. Must be positive. | None |
| columns | list[int] \| None | Indices of columns to corrupt. | None |
Bases: BaseCorruptor
Introduce missing values (NaN) using MCAR, MAR, or MNAR mechanisms.
For MCAR (Missing Completely At Random), a random subset of
int(proportion * n_samples) rows is set to NaN per targeted column.
For MAR (Missing At Random), missingness probability is driven by a pivot
column via a logistic function. The pivot column itself is never corrupted
unless it appears in columns.
For MNAR (Missing Not At Random), each column's missingness probability is driven by the column's own pre-corruption values via a logistic function, so higher values are more likely to be missing.
The effective importance of each corrupted column is scaled by the realised missing proportion and all importances are re-normalised.
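The three mechanisms differ only in what drives the missingness mask. The sketch below illustrates that difference; it is not the library's code. The helper names `mcar_mask` and `logistic_mask` are hypothetical, and the exact logistic parametrisation (standardised driver, offset of logit(proportion)) is an assumption, since the reference does not specify it.

```python
import numpy as np


def mcar_mask(n_samples, proportion, rng):
    """Exactly int(proportion * n_samples) rows, chosen uniformly at random."""
    mask = np.zeros(n_samples, dtype=bool)
    k = int(proportion * n_samples)
    mask[rng.choice(n_samples, size=k, replace=False)] = True
    return mask


def logistic_mask(driver, proportion, rng):
    """Rows go missing with probability increasing in `driver`.

    Assumption: the driver is standardised and shifted by logit(proportion),
    so the realised rate is only roughly `proportion`.
    """
    z = (driver - driver.mean()) / (driver.std() + 1e-12)
    offset = np.log(proportion / (1.0 - proportion))
    prob = 1.0 / (1.0 + np.exp(-(z + offset)))
    return rng.random(driver.shape[0]) < prob


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
proportion = 0.2

X_mcar = X.copy()
X_mcar[mcar_mask(len(X), proportion, rng), 1] = np.nan

X_mar = X.copy()    # MAR: column 1 goes missing based on pivot column 0
X_mar[logistic_mask(X[:, 0], proportion, rng), 1] = np.nan

X_mnar = X.copy()   # MNAR: column 1 goes missing based on its own values
mnar = logistic_mask(X[:, 1], proportion, rng)
X_mnar[mnar, 1] = np.nan
```

Note that only MCAR hits the requested count exactly; with the logistic mechanisms the realised missing proportion varies from draw to draw, which is presumably why the importance update uses the realised rather than the requested proportion.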
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| severity | str | Preset missing rate: … | 'medium' |
| proportion | float \| None | Fraction of values to replace with NaN per column. Overrides severity default. | None |
| columns | list[int] \| None | Column indices to corrupt. | None |
| mechanism | str | Missing data mechanism: … | 'mcar' |
| pivot_col | int \| None | Column index used as the pivot for MAR missingness. Ignored for MCAR and MNAR. Defaults to column 0 when … | None |
Bases: BaseCorruptor
Inject outliers into selected feature columns.
For each targeted column c, a random subset of rows (of size
int(proportion * n_samples)) is replaced with extreme values::
    x_new = x_orig + rng.uniform(mag_low, mag_high) * std(col) * sign
where sign is ±1 drawn uniformly. The effective importance of
each corrupted column is scaled by (1 - proportion) and all
importances are then re-normalised.
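The outlier formula and the importance update can be traced with a short NumPy sketch. The values of `mag_low` and `mag_high`, the starting importances, and the choice to draw a separate magnitude per row are assumptions made for illustration; the reference does not state whether the magnitude is drawn once per column or once per row.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
proportion = 0.05
mag_low, mag_high = 3.0, 5.0      # illustrative magnitude_range
col = 0

X_out = X.copy()
n_outliers = int(proportion * X.shape[0])
rows = rng.choice(X.shape[0], size=n_outliers, replace=False)
sign = rng.choice([-1.0, 1.0], size=n_outliers)                 # ±1, drawn uniformly
magnitude = rng.uniform(mag_low, mag_high, size=n_outliers)     # assumed: one draw per row
X_out[rows, col] = X[rows, col] + magnitude * X[:, col].std() * sign

importances = np.array([0.6, 0.4])       # illustrative starting importances
effective = importances.copy()
effective[col] *= 1.0 - proportion       # discount the corrupted column
effective /= effective.sum()             # re-normalise to sum to 1
```

Each selected row moves between `mag_low` and `mag_high` standard deviations away from its original value, in a random direction.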
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| severity | str | Preset: … | 'medium' |
| proportion | float \| None | Fraction of samples to corrupt per column. Overrides severity. | None |
| magnitude_range | tuple[float, float] \| None | | None |
| columns | list[int] \| None | Column indices to corrupt. | None |
Bases: BaseCorruptor
Converts continuous features to integer-encoded bins.
For each targeted column, quantile-based bin edges are computed and
np.digitize is used to assign each sample to a bin index
0, 1, ..., n_bins-1. The column values are replaced with these
integer indices cast to float.
Feature importances are discounted by the factor (1 - 1/n_bins)
for each targeted column, then re-normalized to sum to 1.
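A minimal sketch of quantile binning with `np.digitize` follows. It assumes that only the interior quantile cut points are used as edges, which is one way to obtain indices in the range 0 to n_bins-1; the reference does not say how the edges are built, and the importances shown are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
n_bins = 4

# Interior quantile edges: n_bins - 1 cut points give indices 0..n_bins-1.
edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
binned = np.digitize(x, edges).astype(float)   # integer indices cast to float

importances = np.array([0.5, 0.5])             # illustrative starting importances
effective = importances.copy()
effective[0] *= 1.0 - 1.0 / n_bins             # discount the binned column
effective /= effective.sum()                   # re-normalise to sum to 1
```

With fewer bins the discount factor (1 - 1/n_bins) is smaller, so coarser binning removes more of the column's importance.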
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| severity | str | Controls default … | 'medium' |
| n_bins | int \| None | If provided, overrides the severity-derived number of bins. | None |
| columns | list[int] \| None | Indices of columns to target. | None |
corrupt(X, metadata, random_state)
Bin targeted columns and update importances.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Feature matrix of shape … | required |
| metadata | dict | Metadata dict from the DGP. Must contain either … | required |
| random_state | int | Integer seed. Binning is fully deterministic given X, so … | required |
Returns:
| Name | Type | Description |
|---|---|---|
| X_corrupted | ndarray | Same shape as X; targeted columns contain integer-valued floats. |
| updated_metadata | dict | Metadata with updated … |
get_params()
Return constructor parameters as a plain dict.
Bases: BaseCorruptor
Appends proxy columns that are noisy copies of targeted features.
For each targeted column c, a proxy column is generated as::
    proxy = X[:, c] * scale + N(0, noise_std, n_samples)
The proxy is appended at the end of X, expanding the feature matrix
from (n_samples, n_features) to
(n_samples, n_features + n_targeted).
Feature importances are split between the original and proxy columns using the coefficient of determination (r^2) of the proxy relative to the original signal.
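The sketch below generates proxy columns as described and computes r^2 between each source column and its proxy. The exact rule for splitting importance is not given in this reference, so the split used here (the proxy receives the fraction r2 / (1 + r2) of the source column's importance) is purely an assumption chosen because it preserves the total; the starting importances are illustrative as well.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
scale, noise_std = 1.0, 0.5
columns = [0, 2]

# One noisy copy per targeted column, appended at the end of X.
proxies = [X[:, c] * scale + rng.normal(0.0, noise_std, size=X.shape[0])
           for c in columns]
X_out = np.column_stack([X] + proxies)

importances = np.array([0.5, 0.3, 0.2])        # illustrative starting importances
effective = np.concatenate([importances, np.zeros(len(columns))])
for j, c in enumerate(columns):
    r2 = np.corrcoef(X[:, c], proxies[j])[0, 1] ** 2
    proxy_share = r2 / (1.0 + r2)               # assumed split rule, not from the source
    effective[X.shape[1] + j] = importances[c] * proxy_share
    effective[c] = importances[c] * (1.0 - proxy_share)
```

Under this assumed rule a noisier proxy (lower r^2) takes a smaller share, and a perfect copy would split the importance evenly with its source column.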
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| severity | str | Controls default … | 'medium' |
| noise_std | float \| None | If provided, overrides the severity-derived noise standard deviation. | None |
| scale | float | Multiplicative factor applied to the source column when generating the proxy. Defaults to 1.0. | 1.0 |
| columns | list[int] \| None | Indices of columns to target. | None |
corrupt(X, metadata, random_state)
Append proxy columns and update importances.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Feature matrix of shape … | required |
| metadata | dict | Metadata dict from the DGP. Must contain either … | required |
| random_state | int | Integer seed for reproducibility. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| X_corrupted | ndarray | Shape … |
| updated_metadata | dict | Metadata with updated … |
get_params()
Return constructor parameters as a plain dict.
Bases: BaseLabelCorruptor
Injects label noise into classification or regression targets.
For binary classification: flips floor(noise_rate * n) labels
uniformly at random (binary flip: 0 -> 1, 1 -> 0).
For regression: adds N(0, noise_std) Gaussian noise to all targets.
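Both behaviours can be sketched as small standalone functions. These are illustrative helpers with hypothetical names (`flip_binary_labels`, `add_regression_noise`), not the methods of `LabelNoiseCorruptor`; they assume labels encoded as 0 and 1 and flip indices chosen without replacement.

```python
import math

import numpy as np


def flip_binary_labels(y, noise_rate, random_state):
    """Flip floor(noise_rate * n) labels chosen uniformly at random (0 <-> 1)."""
    rng = np.random.default_rng(random_state)
    n_flip = math.floor(noise_rate * len(y))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_out = y.copy()                  # never modify y in place
    y_out[idx] = 1 - y_out[idx]
    return y_out


def add_regression_noise(y, noise_std, random_state):
    """Add N(0, noise_std) Gaussian noise to every target; returns a new array."""
    rng = np.random.default_rng(random_state)
    return y + rng.normal(0.0, noise_std, size=y.shape)


y_cls = np.array([0, 1] * 50)
y_flipped = flip_binary_labels(y_cls, noise_rate=0.1, random_state=42)

y_reg = np.linspace(0.0, 1.0, 100)
y_noisy = add_regression_noise(y_reg, noise_std=0.1, random_state=42)
```

Sampling the indices without replacement guarantees that exactly floor(noise_rate * n) labels differ from the input, rather than that many on average.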
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| noise_rate | float | Fraction of classification labels to flip (binary only). Ignored for regression. | 0.05 |
| noise_std | float | Standard deviation of Gaussian noise added to regression targets. Ignored for classification. | 0.1 |
Examples:
    >>> corruptor = LabelNoiseCorruptor(noise_rate=0.1)
    >>> y_out, meta = corruptor.corrupt_labels(X, y, metadata, random_state=42)
corrupt_labels(X, y, metadata, random_state)
Apply label noise to y.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Feature matrix (read-only reference, not mutated). | required |
| y | ndarray | Target array of shape (n_samples,). | required |
| metadata | dict | BenchResult metadata dict; must contain … | required |
| random_state | int | Integer seed for reproducible corruption. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| y_corrupted | ndarray | Corrupted target array. |
| updated_metadata | dict | Shallow copy of metadata with … |
Raises:
| Type | Description |
|---|---|
| ValueError | If task_type is … |
get_params()
Return corruptor parameters as a plain dict.