# Corruptors
Corruptors transform the feature matrix X to introduce structural messiness (noise, outliers, missing values, collinearity, and categorical encoding) without modifying the target y. Each corruptor also updates effective_feature_importances in the metadata to track how much the corruption degrades each feature's information content.
## Example: Corruptor Chain
```python
from synthbench import (
    BenchPipeline,
    LinearDGP,
    MeasurementNoiseCorruptor,
    OutlierCorruptor,
    MissingDataCorruptor,
)

dgp = LinearDGP(complexity="medium", task_type="regression", random_state=0)
pipeline = BenchPipeline(
    dgp,
    corruptors=[
        MeasurementNoiseCorruptor(noise_level=0.5, severity=0.3),
        OutlierCorruptor(proportion=0.05, severity=0.5),
        MissingDataCorruptor(proportion=0.1, severity=0.4),
    ],
)

result = pipeline.run(n_samples=500, n_features=10, random_state=42)
print(result.X.shape)  # (500, 10)
print(result.metadata["corruptor_order"])
print(result.metadata["effective_feature_importances"])
```
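Conceptually, each corruptor maps a pair `(X, importances)` to a corrupted `(X', importances')` while leaving `y` alone. The following NumPy sketch illustrates the measurement-noise step under that reading; it is a hypothetical reimplementation, not synthbench's actual code, and assumes the signal-to-noise scaling stated in the reference table.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))
importances = np.ones(10)

# A corruptor maps (X, importances) -> (X_corrupted, importances_scaled);
# the target y is never touched.
def add_measurement_noise(X, importances, noise_level, rng):
    X_noisy = X + rng.normal(scale=noise_level, size=X.shape)
    # Scale each feature's importance by its signal-to-noise ratio.
    snr = X.var(axis=0) / (X.var(axis=0) + noise_level**2)
    return X_noisy, importances * snr

X, importances = add_measurement_noise(X, importances, noise_level=0.5, rng=rng)
```

With unit-variance features and `noise_level=0.5`, each importance shrinks to roughly `1 / (1 + 0.25) = 0.8` of its original value.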
## Corruptor Reference
| Corruptor | Key Parameters | Effect |
|---|---|---|
| MeasurementNoiseCorruptor | noise_level (float), severity (0-1) | Adds Gaussian noise to features; scales importance by the signal-to-noise ratio Var(X) / (Var(X) + noise_level^2) |
| OutlierCorruptor | proportion (0-1), severity (0-1) | Replaces a fraction of values with outliers; scales importance by (1 - proportion) |
| MissingDataCorruptor | proportion (0-1), severity (0-1) | Introduces NaN values; scales importance by (1 - proportion) |
| CollinearityCorruptor | severity (0-1) | Adds proxy columns correlated with informative features; splits importance between original and proxy using r^2 |
| CategoricalCorruptor | n_bins (int), severity (0-1) | Discretizes continuous features into bins; discounts importance by (1 - 1/n_bins) |
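The r^2 quantity used by CollinearityCorruptor can be computed numerically. The table does not specify the exact split rule, so the last line below is one plausible reading (original keeps 1 - r^2, proxy receives r^2), flagged as an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5_000)
proxy = 0.8 * x + rng.normal(scale=0.6, size=x.size)  # noisy proxy of x

# Squared Pearson correlation between the original column and its proxy.
r2 = np.corrcoef(x, proxy)[0, 1] ** 2

# Assumed split of a unit importance between original and proxy columns.
original_share, proxy_share = 1 - r2, r2
```

For this construction the population r^2 is 0.64 / (0.64 + 0.36) = 0.64, so the proxy would absorb roughly two thirds of the importance under this reading.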
## Canonical Application Order
When multiple corruptors are passed to BenchPipeline, they are always applied in this order regardless of the order in which they were provided:
1. CollinearityCorruptor
2. CategoricalCorruptor
3. MeasurementNoiseCorruptor
4. OutlierCorruptor
5. MissingDataCorruptor
If your corruptors are not already in canonical order, BenchPipeline will reorder them and emit a UserWarning. To suppress the warning, pass corruptors in canonical order.
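The reordering amounts to a sort keyed on position in the canonical list. A hypothetical sketch over class names (not the library's actual implementation):

```python
CANONICAL_ORDER = [
    "CollinearityCorruptor",
    "CategoricalCorruptor",
    "MeasurementNoiseCorruptor",
    "OutlierCorruptor",
    "MissingDataCorruptor",
]

def canonical_sort(corruptor_names):
    # Sort class names by their position in the canonical list.
    return sorted(corruptor_names, key=CANONICAL_ORDER.index)

ordered = canonical_sort(
    ["MissingDataCorruptor", "OutlierCorruptor", "CollinearityCorruptor"]
)
# ordered == ["CollinearityCorruptor", "OutlierCorruptor", "MissingDataCorruptor"]
```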
## Missing Data Mechanisms
MissingDataCorruptor supports three missing data mechanisms controlled by the mechanism parameter.
### MCAR (Missing Completely At Random)
mechanism="mcar" is the default and matches all pre-v1.1 behavior. A uniformly random subset of rows is set to NaN in each targeted column. Missingness is independent of all feature values.
```python
from synthbench import BenchPipeline, LinearDGP, MissingDataCorruptor

pipeline = BenchPipeline(
    LinearDGP(),
    corruptors=[MissingDataCorruptor(proportion=0.2)],
)
result = pipeline.run(n_samples=500, random_state=42)
```
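MCAR masking reduces to picking a uniformly random row subset per column, independent of any values. A standalone NumPy sketch of that idea (a hypothetical reimplementation, not the library's code):

```python
import numpy as np

def mcar_mask(X, proportion, rng):
    # For each column, set a uniformly random subset of rows to NaN.
    X = X.astype(float).copy()
    n_rows = X.shape[0]
    n_missing = int(round(proportion * n_rows))
    for j in range(X.shape[1]):
        rows = rng.choice(n_rows, size=n_missing, replace=False)
        X[rows, j] = np.nan
    return X

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))
X_masked = mcar_mask(X, proportion=0.2, rng=rng)
# Exactly 20% of each column is now NaN.
```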
### MAR (Missing At Random)
mechanism="mar" makes missingness depend on an observed pivot column. The probability of being missing is a logistic function of the pivot column's value, calibrated so that the realised missing proportion matches the target proportion. The pivot column itself is not corrupted unless it is explicitly included in columns.
```python
from synthbench import BenchPipeline, LinearDGP, MissingDataCorruptor

pipeline = BenchPipeline(
    LinearDGP(),
    corruptors=[MissingDataCorruptor(proportion=0.2, mechanism="mar")],
)
result = pipeline.run(n_samples=500, random_state=42)
```
Use the pivot_col parameter to select a specific column index as the MAR driver, e.g. MissingDataCorruptor(proportion=0.2, mechanism="mar", pivot_col=3); it defaults to column 0.
### MNAR (Missing Not At Random)
mechanism="mnar" implements self-masking: each column's missingness probability is driven by its own pre-corruption values. Higher values are more likely to be missing. Missingness probability is calibrated via a logistic function so that the realised proportion matches the target.
```python
from synthbench import BenchPipeline, LinearDGP, MissingDataCorruptor

pipeline = BenchPipeline(
    LinearDGP(),
    corruptors=[MissingDataCorruptor(proportion=0.2, mechanism="mnar")],
)
result = pipeline.run(n_samples=500, random_state=42)
```
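MAR and MNAR share the same calibration idea: feed a driver through a logistic function and tune the intercept until the mean missingness probability matches the target proportion. The sketch below does this by bisection; the slope value and search bounds are assumptions, and the library's actual calibration may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def calibrated_probs(driver, proportion, slope=1.0):
    # Bisect on the intercept a so that mean(sigmoid(a + slope * driver))
    # equals the target proportion.
    lo, hi = -30.0, 30.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if sigmoid(mid + slope * driver).mean() < proportion:
            lo = mid
        else:
            hi = mid
    return sigmoid((lo + hi) / 2.0 + slope * driver)

rng = np.random.default_rng(0)
pivot = rng.normal(size=5_000)

# MAR: the driver is an observed pivot column.
# MNAR: the driver is the column's own pre-corruption values (self-masking).
probs = calibrated_probs(pivot, proportion=0.2)
```

Higher driver values get higher missingness probability, and the probabilities average out to the requested proportion.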
### Mechanism Summary
| Mechanism | Missingness depends on | Default |
|---|---|---|
| "mcar" | Nothing (uniform random) | Yes |
| "mar" | Observed pivot column | No |
| "mnar" | The column's own values | No |
## Label Noise
LabelNoiseCorruptor operates on the target y rather than the feature matrix X. It is passed via the label_corruptors parameter of BenchPipeline, not via corruptors.
### Classification: Label Flipping
For binary classification, noise_rate controls the fraction of labels that are flipped (0 to 1, 1 to 0). Only binary targets are supported; a ValueError is raised for multiclass targets.
```python
from synthbench import BenchPipeline, LinearDGP, LabelNoiseCorruptor

pipeline = BenchPipeline(
    LinearDGP(task_type="classification"),
    label_corruptors=[LabelNoiseCorruptor(noise_rate=0.1)],
)
result = pipeline.run(n_samples=500, random_state=42)
```
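The flipping logic itself is simple to sketch with NumPy, including the binary-only guard described above (a hypothetical reimplementation, not the library's code):

```python
import numpy as np

def flip_labels(y, noise_rate, rng):
    # Flip a noise_rate fraction of binary labels (0 -> 1, 1 -> 0).
    if not set(np.unique(y)).issubset({0, 1}):
        raise ValueError("only binary targets are supported")
    y_noisy = y.copy()
    n_flip = int(round(noise_rate * len(y)))
    flipped = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy[flipped] = 1 - y_noisy[flipped]
    return y_noisy, sorted(flipped.tolist())

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
y_noisy, affected = flip_labels(y, noise_rate=0.1, rng=rng)
# 50 of 500 labels are flipped; `affected` lists their indices.
```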
### Regression: Additive Gaussian Noise
For regression, noise_std controls the standard deviation of the Gaussian noise added to every target value; the noise_rate parameter is ignored.
```python
from synthbench import BenchPipeline, LinearDGP, LabelNoiseCorruptor

pipeline = BenchPipeline(
    LinearDGP(task_type="regression"),
    label_corruptors=[LabelNoiseCorruptor(noise_std=0.5)],
)
result = pipeline.run(n_samples=500, random_state=42)
```
### Metadata
LabelNoiseCorruptor records its effect under result.metadata["label_noise"]:
| Key | Classification | Regression |
|---|---|---|
| noise_rate | Fraction of labels flipped | None |
| noise_std | None | Standard deviation used |
| affected_indices | List of flipped sample indices | None (all samples affected) |
```python
meta = result.metadata["label_noise"]
print(meta["noise_rate"])         # e.g. 0.1
print(meta["affected_indices"])   # e.g. [3, 17, 42, ...]
```
### Combining Label Corruptors with Feature Corruptors
Label corruptors and feature corruptors can be combined freely:
```python
from synthbench import (
    BenchPipeline,
    LinearDGP,
    MissingDataCorruptor,
    LabelNoiseCorruptor,
)

pipeline = BenchPipeline(
    LinearDGP(task_type="classification"),
    corruptors=[MissingDataCorruptor(proportion=0.1, mechanism="mar")],
    label_corruptors=[LabelNoiseCorruptor(noise_rate=0.05)],
)
result = pipeline.run(n_samples=500, random_state=42)
```
## Worked Examples
The following examples show each corruptor family used end-to-end with
BenchPipeline. All examples use only synthbench and stdlib (no pandas).
### Missing data (MCAR / MAR / MNAR)
```python
from synthbench import BenchPipeline, LinearDGP, MissingDataCorruptor

# MCAR: proportion=0.15 means 15% of values set to NaN, rows chosen uniformly
pipeline = BenchPipeline(
    LinearDGP(task_type="regression"),
    corruptors=[MissingDataCorruptor(proportion=0.15, mechanism="mcar")],
)
result = pipeline.run(n_samples=500, n_features=10, random_state=0)
print(result.X[:3, :])  # some NaN values present

# MAR: missingness depends on a pivot column (column 0 by default)
pipeline_mar = BenchPipeline(
    LinearDGP(task_type="regression"),
    corruptors=[MissingDataCorruptor(proportion=0.15, mechanism="mar")],
)
result_mar = pipeline_mar.run(n_samples=500, n_features=10, random_state=0)

# MNAR: missingness is self-masking (the value drives its own mask)
pipeline_mnar = BenchPipeline(
    LinearDGP(task_type="regression"),
    corruptors=[MissingDataCorruptor(proportion=0.15, mechanism="mnar")],
)
result_mnar = pipeline_mnar.run(n_samples=500, n_features=10, random_state=0)
```
### Measurement noise
```python
from synthbench import BenchPipeline, LinearDGP, MeasurementNoiseCorruptor

pipeline = BenchPipeline(
    LinearDGP(task_type="regression"),
    corruptors=[MeasurementNoiseCorruptor(severity=0.5)],
)
result = pipeline.run(n_samples=300, n_features=8, random_state=42)
print(result.metadata["effective_feature_importances"])  # scaled down by the signal-to-noise ratio
```
### Outliers
```python
from synthbench import BenchPipeline, LinearDGP, OutlierCorruptor

pipeline = BenchPipeline(
    LinearDGP(task_type="classification"),
    corruptors=[OutlierCorruptor(proportion=0.05, severity=0.8)],
)
result = pipeline.run(n_samples=400, n_features=10, random_state=1)
```
### Label noise (classification)
```python
from synthbench import BenchPipeline, LinearDGP, LabelNoiseCorruptor

pipeline = BenchPipeline(
    LinearDGP(task_type="classification"),
    label_corruptors=[LabelNoiseCorruptor(noise_rate=0.10)],
)
result = pipeline.run(n_samples=500, n_features=10, random_state=0)

noise_meta = result.metadata["label_noise"]
print(noise_meta["noise_rate"])             # 0.1
print(len(noise_meta["affected_indices"]))  # ~50 flipped labels
```
### label_corruptors vs corruptors
Label corruptors modify y only; feature corruptors modify X only.
Pass label corruptors in the label_corruptors= argument to BenchPipeline;
feature corruptors go in corruptors=. Both lists can be used together.
### Combining multiple corruptors
```python
from synthbench import (
    BenchPipeline,
    LinearDGP,
    MissingDataCorruptor,
    MeasurementNoiseCorruptor,
    LabelNoiseCorruptor,
)

pipeline = BenchPipeline(
    LinearDGP(task_type="classification"),
    corruptors=[
        MeasurementNoiseCorruptor(severity=0.2),
        MissingDataCorruptor(proportion=0.10, mechanism="mar"),
    ],
    label_corruptors=[LabelNoiseCorruptor(noise_rate=0.05)],
)
result = pipeline.run(n_samples=500, n_features=10, random_state=7)
print(result.metadata["corruptor_order"])        # feature corruptors in applied order
print(result.metadata["label_corruptor_order"])  # label corruptors in applied order
```