Corruptors¶
synth-bench ships five structural corruptors and one label corruptor. This notebook shows each corruptor applied individually, the MCAR/MAR/MNAR distinctions in MissingDataCorruptor, and how to chain multiple corruptors in a BenchPipeline.
In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from synthbench import (
    BenchPipeline,
    CategoricalCorruptor,
    CollinearityCorruptor,
    LabelNoiseCorruptor,
    LinearDGP,
    MeasurementNoiseCorruptor,
    MissingDataCorruptor,
    OutlierCorruptor,
)

plt.rcParams["figure.dpi"] = 72
Base dataset¶
In [2]:
dgp = LinearDGP(task_type="classification", complexity="medium")
base = BenchPipeline(dgp).run(n_samples=300, n_features=8, random_state=0)
print(f"Clean dataset: X={base.X.shape}, y={base.y.shape}")
print(f"Missing values: {np.isnan(base.X).sum()}")
Clean dataset: X=(300, 8), y=(300,)
Missing values: 0
MissingDataCorruptor — MCAR (three severity levels)¶
In [3]:
rows = []
for sev in ["low", "medium", "high"]:
    dgp = LinearDGP(task_type="classification", complexity="medium")
    c = MissingDataCorruptor(severity=sev, mechanism="mcar")
    result = BenchPipeline(dgp, corruptors=[c]).run(
        n_samples=300, n_features=8, random_state=0
    )
    missing_frac = np.isnan(result.X).mean()
    rows.append({"severity": sev, "missing_fraction": round(missing_frac, 3)})
pd.DataFrame(rows)
Out[3]:
|   | severity | missing_fraction |
|---|---|---|
| 0 | low | 0.05 |
| 1 | medium | 0.15 |
| 2 | high | 0.30 |
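The fractions above track the severity levels (roughly 5%, 15%, and 30% of entries). The mechanism itself is easy to picture; the NumPy sketch below is purely illustrative and is not synthbench's implementation: under MCAR every entry is masked independently with the same probability, regardless of any value in the data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))

p = 0.15  # target missing fraction, roughly "medium" severity
mask = rng.random(X.shape) < p  # independent of X: Missing Completely At Random
X_mcar = X.copy()
X_mcar[mask] = np.nan

print(f"missing fraction: {np.isnan(X_mcar).mean():.3f}")
```

Because the mask ignores the data, the observed entries remain an unbiased sample, which is why MCAR is the most forgiving mechanism for downstream imputation.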
MissingDataCorruptor — MAR vs MNAR¶
MAR (Missing At Random): missingness of each feature is driven by the values of a pivot column. MNAR (Missing Not At Random): each feature's missingness is driven by its own values.
In [4]:
dgp = LinearDGP(task_type="classification", complexity="medium")

# MAR: pivot_col=0 drives missingness probability via logistic function
c_mar = MissingDataCorruptor(severity="medium", mechanism="mar", pivot_col=0)
r_mar = BenchPipeline(dgp, corruptors=[c_mar]).run(
    n_samples=300, n_features=8, random_state=1
)

# MNAR: each column's own values drive its missingness
c_mnar = MissingDataCorruptor(severity="medium", mechanism="mnar")
r_mnar = BenchPipeline(dgp, corruptors=[c_mnar]).run(
    n_samples=300, n_features=8, random_state=1
)

print(f"MAR — missing fraction: {np.isnan(r_mar.X).mean():.3f}")
print(f"MNAR — missing fraction: {np.isnan(r_mnar.X).mean():.3f}")
MAR — missing fraction: 0.145
MNAR — missing fraction: 0.150
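The overall fractions come out similar, so the two mechanisms differ in *where* the gaps fall, not how many there are. A minimal NumPy sketch (illustrative only, not synthbench's implementation) makes the dependence visible: under MAR the mask for every column is driven by the pivot column's values, while under MNAR each column is driven by its own values.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# MAR: missingness in all columns driven by the pivot column (column 0)
pivot = X[:, 0]
p_mar = 0.3 * logistic(pivot)             # larger pivot value -> more missing
mask_mar = rng.random(X.shape) < p_mar[:, None]
mask_mar[:, 0] = False                    # keep the driver itself observed

# MNAR: each column's own values drive its missingness
mask_mnar = rng.random(X.shape) < 0.3 * logistic(X)

X_mar, X_mnar = X.copy(), X.copy()
X_mar[mask_mar] = np.nan
X_mnar[mask_mnar] = np.nan

# Under MAR, rows with an above-median pivot value lose more entries
hi = pivot > np.median(pivot)
print(f"MAR missing | high pivot: {np.isnan(X_mar[hi]).mean():.3f}")
print(f"MAR missing | low pivot:  {np.isnan(X_mar[~hi]).mean():.3f}")
```

Leaving the pivot column fully observed is what keeps the sketch MAR rather than MNAR: the driver of the missingness is always available to the analyst.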
OutlierCorruptor¶
In [5]:
rows = []
for sev in ["low", "medium", "high"]:
    dgp = LinearDGP(task_type="classification", complexity="medium")
    c = OutlierCorruptor(severity=sev)
    result = BenchPipeline(dgp, corruptors=[c]).run(
        n_samples=300, n_features=8, random_state=0
    )
    rows.append(
        {"severity": sev, "bayes_error": round(result.metadata["bayes_error"], 4)}
    )
pd.DataFrame(rows)
Out[5]:
|   | severity | bayes_error |
|---|---|---|
| 0 | low | 0.42 |
| 1 | medium | 0.49 |
| 2 | high | 0.52 |
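The Bayes error climbs with severity because extreme points blur the class-conditional densities. As a hedged sketch of the general idea (OutlierCorruptor's exact scheme may differ), an injector can replace a small fraction of entries with values several standard deviations from the bulk of the data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))

frac, scale = 0.02, 8.0                  # 2% of entries pushed ~8 sigma out
mask = rng.random(X.shape) < frac
X_out = X.copy()
X_out[mask] = rng.choice([-1.0, 1.0], size=mask.sum()) * scale * X.std()

print(f"std before: {X.std():.2f}, after: {X_out.std():.2f}")
```

Even at 2% contamination the sample standard deviation inflates noticeably, which is exactly what breaks estimators that assume well-behaved tails.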
MeasurementNoiseCorruptor¶
In [6]:
rows = []
for sev in ["low", "medium", "high"]:
    dgp = LinearDGP(task_type="classification", complexity="medium")
    c = MeasurementNoiseCorruptor(severity=sev)
    result = BenchPipeline(dgp, corruptors=[c]).run(
        n_samples=300, n_features=8, random_state=0
    )
    rows.append(
        {"severity": sev, "bayes_error": round(result.metadata["bayes_error"], 4)}
    )
pd.DataFrame(rows)
Out[6]:
|   | severity | bayes_error |
|---|---|---|
| 0 | low | 0.4500 |
| 1 | medium | 0.4267 |
| 2 | high | 0.5267 |
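A reasonable mental model for MeasurementNoiseCorruptor (an assumption, not the library's documented formula) is additive zero-mean Gaussian noise whose scale grows with severity; the signal-to-noise ratio drops accordingly, which is what pushes the achievable error up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))

snrs = []
for sigma in [0.1, 0.5, 1.0]:            # hypothetical low/medium/high scales
    X_noisy = X + rng.normal(scale=sigma, size=X.shape)
    snrs.append(X.var() / sigma**2)      # signal variance over noise variance
    print(f"sigma={sigma:.1f}  SNR={snrs[-1]:.1f}")
```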
CategoricalCorruptor and CollinearityCorruptor¶
In [7]:
# CategoricalCorruptor: discretises numeric features into bins
dgp = LinearDGP(task_type="classification", complexity="medium")
c_cat = CategoricalCorruptor(severity="medium")
r_cat = BenchPipeline(dgp, corruptors=[c_cat]).run(
    n_samples=300, n_features=8, random_state=0
)
n_unique = len(np.unique(r_cat.X[:, 0]))
print(f"CategoricalCorruptor — unique values in X[:,0]: {n_unique}")

# CollinearityCorruptor: adds redundant correlated features
c_col = CollinearityCorruptor(severity="medium")
r_col = BenchPipeline(dgp, corruptors=[c_col]).run(
    n_samples=300, n_features=8, random_state=0
)
er = r_col.metadata["effective_rank"]
print(f"CollinearityCorruptor — effective rank: {er:.2f}")
CategoricalCorruptor — unique values in X[:,0]: 5
CollinearityCorruptor — effective rank: 11.64
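effective_rank summarises how evenly X's spectrum is spread. One common definition, used here as an assumption about what synthbench reports, is the exponential of the Shannon entropy of the normalised singular values; redundant columns concentrate the spectrum, so the effective rank stays well below the raw column count.

```python
import numpy as np

def effective_rank(X):
    """exp(entropy) of the normalised singular value distribution."""
    s = np.linalg.svd(X, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))           # 8 roughly independent columns
X_dup = np.hstack([X, X[:, :4]])        # append 4 redundant copies -> 12 columns

print(f"independent (8 cols):  {effective_rank(X):.2f}")
print(f"with duplicates (12):  {effective_rank(X_dup):.2f}")
```

With independent Gaussian columns the effective rank sits just under the true rank of 8; after duplicating columns the matrix has 12 columns but still only 8 informative directions, and the effective rank reflects that.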
LabelNoiseCorruptor (classification)¶
In [8]:
rows = []
for noise_rate in [0.05, 0.15, 0.30]:
    dgp = LinearDGP(task_type="classification", complexity="medium")
    lc = LabelNoiseCorruptor(noise_rate=noise_rate)
    result = BenchPipeline(dgp, label_corruptors=[lc]).run(
        n_samples=300, n_features=8, random_state=0
    )
    rows.append(
        {
            "noise_rate": noise_rate,
            "bayes_error": round(result.metadata["bayes_error"], 4),
        }
    )
pd.DataFrame(rows)
Out[8]:
|   | noise_rate | bayes_error |
|---|---|---|
| 0 | 0.05 | 0.4867 |
| 1 | 0.15 | 0.5067 |
| 2 | 0.30 | 0.5100 |
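The error floor rises with the flip rate, as it should: with symmetric label noise at rate ρ and clean Bayes error e, the noisy floor is roughly e(1 − ρ) + (1 − e)ρ. Flipping itself is simple to sketch (illustrative; LabelNoiseCorruptor's exact scheme may differ, e.g. for multiclass labels):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)         # binary labels

noise_rate = 0.15
flip = rng.random(y.size) < noise_rate   # select ~15% of labels uniformly
y_noisy = np.where(flip, 1 - y, y)       # flip the selected labels

print(f"fraction flipped: {(y_noisy != y).mean():.3f}")
```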
Chained pipeline: three corruptors together¶
In [9]:
dgp = LinearDGP(task_type="classification", complexity="medium")
pipeline = BenchPipeline(
    dgp,
    corruptors=[
        MissingDataCorruptor(severity="low", mechanism="mcar"),
        OutlierCorruptor(severity="low"),
        MeasurementNoiseCorruptor(severity="low"),
    ],
)
result = pipeline.run(n_samples=300, n_features=8, random_state=42)
print(f"Chained pipeline output: X={result.X.shape}")
missing = np.isnan(result.X).mean()
print(f"Missing fraction after chaining: {missing:.3f}")
be = result.metadata["bayes_error"]
print(f"Bayes error: {be:.4f}" if be is not None else "Bayes error: N/A")

# Show before/after DataFrames (first 5 rows)
clean = BenchPipeline(dgp).run(n_samples=300, n_features=8, random_state=42)
print("")
print("Clean X (first 5 rows, first 4 cols):")
print(pd.DataFrame(clean.X[:5, :4]).round(3))
print("")
print("Corrupted X (first 5 rows, first 4 cols):")
print(pd.DataFrame(result.X[:5, :4]).round(3))
C:\Users\Work\AppData\Local\Temp\ipykernel_14188\1150745775.py:2: UserWarning: Corruptors have been reordered to match canonical order.
Provided: ['MissingDataCorruptor', 'OutlierCorruptor', 'MeasurementNoiseCorruptor']
Canonical: ['MeasurementNoiseCorruptor', 'OutlierCorruptor', 'MissingDataCorruptor']
To suppress this warning, pass corruptors in canonical order: ['collinearity', 'categorical', 'measurement_noise', 'outlier', 'missing_data'].
  pipeline = BenchPipeline(
Chained pipeline output: X=(300, 8)
Missing fraction after chaining: 0.050
Bayes error: N/A
Clean X (first 5 rows, first 4 cols):
0 1 2 3
0 -0.164 1.418 1.458 -0.184
1 0.241 0.055 -0.177 -1.916
2 -1.408 -0.695 1.054 0.077
3 -0.752 0.018 0.562 -0.300
4 -0.235 -0.740 -2.097 0.018
Corrupted X (first 5 rows, first 4 cols):
0 1 2 3
0 -0.186 1.220 1.440 -0.219
1 0.402 0.155 -0.250 -2.122
2 -1.468 -0.626 1.130 0.041
3 -0.816 0.023 0.689 NaN
4 -0.126 -0.702 -1.999 0.119