Corruptors

Corruptors transform the feature matrix X to introduce structural messiness (noise, outliers, missing values, collinearity, and categorical encoding) without modifying the target y. Each corruptor also updates effective_feature_importances in the metadata to track how much the corruption degrades each feature's information content.

Example: Corruptor Chain

import synthbench
from synthbench import (
    BenchPipeline,
    LinearDGP,
    MeasurementNoiseCorruptor,
    OutlierCorruptor,
    MissingDataCorruptor,
)

dgp = LinearDGP(complexity="medium", task_type="regression", random_state=0)
pipeline = BenchPipeline(
    dgp,
    corruptors=[
        MeasurementNoiseCorruptor(noise_level=0.5, severity=0.3),
        OutlierCorruptor(proportion=0.05, severity=0.5),
        MissingDataCorruptor(proportion=0.1, severity=0.4),
    ],
)
result = pipeline.run(n_samples=500, n_features=10, random_state=42)

print(result.X.shape)   # (500, 10)
print(result.metadata["corruptor_order"])
print(result.metadata["effective_feature_importances"])

Corruptor Reference

MeasurementNoiseCorruptor
    Key parameters: noise_level (float), severity (0-1)
    Effect: adds Gaussian noise to features; scales importance by the
    signal-to-noise ratio Var(X) / (Var(X) + noise_level^2)

OutlierCorruptor
    Key parameters: proportion (0-1), severity (0-1)
    Effect: replaces a fraction of values with outliers; scales importance
    by (1 - proportion)

MissingDataCorruptor
    Key parameters: proportion (0-1), severity (0-1)
    Effect: introduces NaN values; scales importance by (1 - proportion)

CollinearityCorruptor
    Key parameters: severity (0-1)
    Effect: adds proxy columns correlated with informative features; splits
    importance between original and proxy using r^2

CategoricalCorruptor
    Key parameters: n_bins (int), severity (0-1)
    Effect: discretizes continuous features into bins; discounts importance
    by (1 - 1/n_bins)
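The scaling rules above compose multiplicatively as corruptors are chained. A minimal arithmetic sketch (plain Python, not the library's internals; the variance, parameter, and importance values are made up for illustration):

```python
# Hypothetical values, chosen only to make the arithmetic easy to follow.
var_x = 1.0          # Var(X) of one feature column
noise_level = 0.5    # MeasurementNoiseCorruptor noise_level
proportion = 0.1     # MissingDataCorruptor proportion
n_bins = 4           # CategoricalCorruptor n_bins

importance = 0.8     # pre-corruption feature importance

# MeasurementNoiseCorruptor: scale by the signal-to-noise ratio.
snr_scale = var_x / (var_x + noise_level ** 2)        # 1.0 / 1.25 = 0.8
after_noise = importance * snr_scale                  # ~0.64

# MissingDataCorruptor (or OutlierCorruptor): scale by (1 - proportion).
after_missing = after_noise * (1 - proportion)        # ~0.576

# CategoricalCorruptor: discount by (1 - 1/n_bins).
after_binning = after_missing * (1 - 1 / n_bins)      # ~0.432

print(after_noise, after_missing, after_binning)
```

Each corruptor in a chain sees the importances left by the previous one, which is why effective_feature_importances shrinks monotonically as corruptors are added.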

Canonical Application Order

When multiple corruptors are passed to BenchPipeline, they are always applied in this order regardless of the order in which they were provided:

  1. CollinearityCorruptor
  2. CategoricalCorruptor
  3. MeasurementNoiseCorruptor
  4. OutlierCorruptor
  5. MissingDataCorruptor

If your corruptors are not already in canonical order, BenchPipeline will reorder them and emit a UserWarning. To suppress the warning, pass corruptors in canonical order.
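The reordering is easy to emulate outside the library. A stdlib sketch of the sort (not BenchPipeline's actual code; the rank list simply mirrors the canonical order above):

```python
# Canonical application order, copied from the list above.
CANONICAL_ORDER = [
    "CollinearityCorruptor",
    "CategoricalCorruptor",
    "MeasurementNoiseCorruptor",
    "OutlierCorruptor",
    "MissingDataCorruptor",
]

def canonicalize(corruptor_names):
    """Sort corruptor names into canonical application order."""
    return sorted(corruptor_names, key=CANONICAL_ORDER.index)

# A chain passed out of order, as in the first example on this page:
provided = ["MissingDataCorruptor", "MeasurementNoiseCorruptor", "OutlierCorruptor"]
print(canonicalize(provided))
# ['MeasurementNoiseCorruptor', 'OutlierCorruptor', 'MissingDataCorruptor']
```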

Missing Data Mechanisms

MissingDataCorruptor supports three missing data mechanisms controlled by the mechanism parameter.

MCAR (Missing Completely At Random)

mechanism="mcar" is the default and matches all pre-v1.1 behavior. A uniformly random subset of rows is set to NaN in each targeted column. Missingness is independent of all feature values.

from synthbench import BenchPipeline, LinearDGP, MissingDataCorruptor

pipeline = BenchPipeline(
    LinearDGP(),
    corruptors=[MissingDataCorruptor(proportion=0.2)],
)
result = pipeline.run(n_samples=500, random_state=42)
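The MCAR mechanism itself is simple to reproduce outside the library. A stdlib-only sketch (not synthbench's implementation) that masks a uniformly random subset of each column, independent of all feature values:

```python
import math
import random

def mcar_mask(X, proportion, rng):
    """Set a uniformly random `proportion` of each column to NaN (MCAR)."""
    n_rows = len(X)
    n_missing = round(proportion * n_rows)
    for col in range(len(X[0])):
        for row in rng.sample(range(n_rows), n_missing):
            X[row][col] = math.nan
    return X

rng = random.Random(42)
X = [[float(i + j) for j in range(4)] for i in range(100)]
X = mcar_mask(X, proportion=0.2, rng=rng)

n_nan = sum(math.isnan(v) for row in X for v in row)
print(n_nan / (100 * 4))   # 0.2 exactly, by construction
```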

MAR (Missing At Random)

mechanism="mar" makes missingness depend on an observed pivot column. The probability of being missing is a logistic function of the pivot column's value, calibrated so that the realised missing proportion matches the target proportion. The pivot column itself is not corrupted unless it is explicitly included in columns.

from synthbench import BenchPipeline, LinearDGP, MissingDataCorruptor

pipeline = BenchPipeline(
    LinearDGP(),
    corruptors=[MissingDataCorruptor(proportion=0.2, mechanism="mar")],
)
result = pipeline.run(n_samples=500, random_state=42)

Use the pivot_col parameter to select a specific column index as the MAR driver (defaults to column 0):

MissingDataCorruptor(proportion=0.2, mechanism="mar", pivot_col=2)
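A rough sketch of the MAR probability model (stdlib only): missingness probability is a logistic function of the pivot value, and the intercept is calibrated so that the mean probability hits the target proportion. The bisection calibration and the slope value here are assumptions for illustration; synthbench's actual calibration may differ.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def mar_probs(pivot, proportion, slope=2.0):
    """Missingness probabilities driven by the pivot column, calibrated by
    bisection on the logistic intercept so their mean equals `proportion`."""
    lo, hi = -50.0, 50.0
    for _ in range(100):
        b = (lo + hi) / 2
        mean_p = sum(logistic(slope * x + b) for x in pivot) / len(pivot)
        if mean_p > proportion:
            hi = b      # mean too high: lower the intercept
        else:
            lo = b      # mean too low: raise the intercept
    return [logistic(slope * x + b) for x in pivot]

rng = random.Random(0)
pivot = [rng.gauss(0, 1) for _ in range(5000)]
probs = mar_probs(pivot, proportion=0.2)
print(sum(probs) / len(probs))   # ~0.2 after calibration
```

With a positive slope, rows with larger pivot values are more likely to go missing, which is the dependence on an observed column that distinguishes MAR from MCAR.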

MNAR (Missing Not At Random)

mechanism="mnar" implements self-masking: each column's missingness probability is driven by its own pre-corruption values. Higher values are more likely to be missing. Missingness probability is calibrated via a logistic function so that the realised proportion matches the target.

from synthbench import BenchPipeline, LinearDGP, MissingDataCorruptor

pipeline = BenchPipeline(
    LinearDGP(),
    corruptors=[MissingDataCorruptor(proportion=0.2, mechanism="mnar")],
)
result = pipeline.run(n_samples=500, random_state=42)
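Self-masking can be approximated even more crudely: mask the cells whose own value falls in the top `proportion` of their column. A deterministic stdlib sketch that only illustrates the "higher values are more likely to be missing" behaviour (synthbench uses a calibrated logistic, not a hard threshold):

```python
import math

def mnar_mask_column(col, proportion):
    """Mask the top `proportion` of a column's own values (crude MNAR)."""
    n_missing = round(proportion * len(col))
    threshold = sorted(col)[len(col) - n_missing]
    return [math.nan if v >= threshold else v for v in col]

col = list(range(10))                          # values 0..9
masked = mnar_mask_column(col, proportion=0.2)
print(masked)                                  # 8 and 9 replaced with nan
```

The defining property of MNAR is visible even in this toy version: you cannot tell from the observed data alone which values were removed, because the removal depends on the values themselves.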

Mechanism Summary

Mechanism   Missingness depends on      Default
"mcar"      Nothing (uniform random)    Yes
"mar"       Observed pivot column       No
"mnar"      The column's own values     No

Label Noise

LabelNoiseCorruptor operates on the target y rather than the feature matrix X. It is passed via the label_corruptors parameter of BenchPipeline, not via corruptors.

Classification: Label Flipping

For binary classification, noise_rate controls the fraction of labels that are flipped (0 to 1, 1 to 0). Only binary targets are supported; a ValueError is raised for multiclass targets.

from synthbench import BenchPipeline, LinearDGP, LabelNoiseCorruptor

pipeline = BenchPipeline(
    LinearDGP(task_type="classification"),
    label_corruptors=[LabelNoiseCorruptor(noise_rate=0.1)],
)
result = pipeline.run(n_samples=500, random_state=42)
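The flipping step itself can be sketched in a few lines of stdlib Python (not the library's implementation): draw round(noise_rate * n) distinct indices at random, flip each binary label, and record the indices, mirroring the affected_indices metadata described below.

```python
import random

def flip_labels(y, noise_rate, rng):
    """Flip a `noise_rate` fraction of binary labels; return the new labels
    plus the indices that were flipped."""
    if set(y) - {0, 1}:
        raise ValueError("only binary targets are supported")
    idx = sorted(rng.sample(range(len(y)), round(noise_rate * len(y))))
    y = list(y)                    # copy; leave the caller's labels intact
    for i in idx:
        y[i] = 1 - y[i]
    return y, idx

rng = random.Random(42)
y = [rng.randint(0, 1) for _ in range(500)]
y_noisy, affected = flip_labels(y, noise_rate=0.1, rng=rng)
print(len(affected))   # 50
```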

Regression: Additive Gaussian Noise

For regression, noise_std controls the standard deviation of the zero-mean Gaussian noise added to every target value. The noise_rate parameter is ignored.

from synthbench import BenchPipeline, LinearDGP, LabelNoiseCorruptor

pipeline = BenchPipeline(
    LinearDGP(task_type="regression"),
    label_corruptors=[LabelNoiseCorruptor(noise_std=0.5)],
)
result = pipeline.run(n_samples=500, random_state=42)
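For regression the operation is just y + N(0, noise_std^2) per sample. A stdlib sketch (not synthbench's code):

```python
import random

def add_label_noise(y, noise_std, rng):
    """Add zero-mean Gaussian noise with std `noise_std` to every target."""
    return [v + rng.gauss(0.0, noise_std) for v in y]

rng = random.Random(0)
y = [1.0, 2.0, 3.0]
y_noisy = add_label_noise(y, noise_std=0.5, rng=rng)
print(y_noisy)   # every target shifted by an independent draw
```

Because every sample is perturbed, there is no meaningful list of affected indices in the regression case, which matches the metadata layout below.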

Metadata

LabelNoiseCorruptor records its effect under result.metadata["label_noise"]:

Key                Classification                   Regression
noise_rate         Fraction of labels flipped       None
noise_std          None                             Standard deviation used
affected_indices   List of flipped sample indices   None (all samples affected)

meta = result.metadata["label_noise"]
print(meta["noise_rate"])         # e.g. 0.1
print(meta["affected_indices"])   # e.g. [3, 17, 42, ...]

Combining Label Corruptors with Feature Corruptors

Label corruptors and feature corruptors can be combined freely:

from synthbench import (
    BenchPipeline,
    LinearDGP,
    MissingDataCorruptor,
    LabelNoiseCorruptor,
)

pipeline = BenchPipeline(
    LinearDGP(task_type="classification"),
    corruptors=[MissingDataCorruptor(proportion=0.1, mechanism="mar")],
    label_corruptors=[LabelNoiseCorruptor(noise_rate=0.05)],
)
result = pipeline.run(n_samples=500, random_state=42)

Worked Examples

The following examples show each corruptor family used end-to-end with BenchPipeline. All examples use only synthbench and stdlib (no pandas).

Missing data (MCAR / MAR / MNAR)

from synthbench import BenchPipeline, LinearDGP, MissingDataCorruptor

# MCAR: proportion=0.15 sets 15% of each column's values to NaN,
# chosen uniformly at random, independent of all feature values
pipeline = BenchPipeline(
    LinearDGP(task_type="regression"),
    corruptors=[MissingDataCorruptor(proportion=0.15, mechanism="mcar")],
)
result = pipeline.run(n_samples=500, n_features=10, random_state=0)
print(result.X[:3, :])           # some NaN values present

# MAR: missingness depends on a pivot column (column 0 by default)
pipeline_mar = BenchPipeline(
    LinearDGP(task_type="regression"),
    corruptors=[MissingDataCorruptor(proportion=0.15, mechanism="mar")],
)
result_mar = pipeline_mar.run(n_samples=500, n_features=10, random_state=0)

# MNAR: missingness is self-masking (the value drives its own mask)
pipeline_mnar = BenchPipeline(
    LinearDGP(task_type="regression"),
    corruptors=[MissingDataCorruptor(proportion=0.15, mechanism="mnar")],
)
result_mnar = pipeline_mnar.run(n_samples=500, n_features=10, random_state=0)

Measurement noise

from synthbench import BenchPipeline, LinearDGP, MeasurementNoiseCorruptor

pipeline = BenchPipeline(
    LinearDGP(task_type="regression"),
    corruptors=[MeasurementNoiseCorruptor(noise_level=0.5, severity=0.5)],
)
result = pipeline.run(n_samples=300, n_features=8, random_state=42)
print(result.metadata["effective_feature_importances"])   # scaled down by the signal-to-noise ratio

Outliers

from synthbench import BenchPipeline, LinearDGP, OutlierCorruptor

pipeline = BenchPipeline(
    LinearDGP(task_type="classification"),
    corruptors=[OutlierCorruptor(proportion=0.05, severity=0.8)],
)
result = pipeline.run(n_samples=400, n_features=10, random_state=1)

Label noise (classification)

from synthbench import BenchPipeline, LinearDGP, LabelNoiseCorruptor

pipeline = BenchPipeline(
    LinearDGP(task_type="classification"),
    label_corruptors=[LabelNoiseCorruptor(noise_rate=0.10)],
)
result = pipeline.run(n_samples=500, n_features=10, random_state=0)
noise_meta = result.metadata["label_noise"]
print(noise_meta["noise_rate"])          # 0.1
print(len(noise_meta["affected_indices"]))  # ~50 flipped labels

label_corruptors vs corruptors

Label corruptors modify y only; feature corruptors modify X only. Pass label corruptors in the label_corruptors= argument to BenchPipeline; feature corruptors go in corruptors=. Both lists can be used together.

Combining multiple corruptors

from synthbench import (
    BenchPipeline,
    LinearDGP,
    MissingDataCorruptor,
    MeasurementNoiseCorruptor,
    LabelNoiseCorruptor,
)

pipeline = BenchPipeline(
    LinearDGP(task_type="classification"),
    corruptors=[
        MeasurementNoiseCorruptor(noise_level=0.5, severity=0.2),
        MissingDataCorruptor(proportion=0.10, mechanism="mar"),
    ],
    label_corruptors=[LabelNoiseCorruptor(noise_rate=0.05)],
)
result = pipeline.run(n_samples=500, n_features=10, random_state=7)
print(result.metadata["corruptor_order"])       # feature corruptors in order
print(result.metadata["label_corruptor_order"]) # label corruptors in order