Model Evaluation, Validation, Leakage & Calibration
AI/ML Engineering Track | Complexity: Intermediate | Time: 5-6 hours Prerequisites: Module 1.1,
Pipeline,ColumnTransformer,cross_validate,train_test_split, and comfortable NumPy/pandas.
Learning Outcomes
Section titled “Learning Outcomes”By the end of this module, a practitioner will be able to:
- Design a cross-validation strategy for tabular, grouped, and time-series data without leakage by selecting the right
model_selectionsplitter and arranging it correctly inside aPipeline. - Detect five or more taxonomic leakage failure modes - preprocessing, target, group, temporal, and threshold/metric leakage - by reading existing pipelines and naming the contaminated statistic.
- Evaluate a candidate metric for a problem given class balance, business cost, and threshold sensitivity, and explain why ROC-AUC, PR-AUC, log-loss, and Brier can rank the same models differently.
- Compare Platt scaling versus isotonic regression for a miscalibrated classifier and choose between them based on dataset size, monotonicity assumptions, and the shape of the empirical reliability curve.
- Implement an end-to-end calibration workflow using
CalibratedClassifierCVand verify it with a reliability diagram and Brier/log-loss diagnostics.
These outcomes extend Module 1.1. Module 1.1 taught the sklearn estimator contract. This module teaches the evaluation contract around it. The difference matters. A model can obey the API perfectly and still report a score that is not honest. That happens when the split is wrong, the metric asks the wrong question, or the probabilities mean something different from what the downstream system assumes.
Why This Module Matters
Section titled “Why This Module Matters”The scikit-learn common pitfalls guide defines data leakage as using information during model building that would not be available at prediction time. It also states the practical consequence: performance estimates become overly optimistic, and the model performs worse on genuinely novel data. The official recommendation is direct: split first, fit preprocessing only on training data, and use Pipeline so cross-validation and hyperparameter tuning apply the correct fit and transform calls
to the correct rows. That is the starting point, not the finish line. Leakage is not only StandardScaler.fit_transform(X) before a split. It is also a target encoder that computes category means from validation labels. It is an oversampler that duplicates minority-class examples before folds are created. It is a patient identifier appearing in both train and validation folds. It is a random shuffle on a forecasting problem where future rows explain past rows. It is tuning a classification
threshold on the test set and then reporting that same test score as final evidence. It is selecting the calibration set only after seeing which split makes a model look good. The shape is always the same. Some statistic learned from data outside the permitted training scope influences either the fitted model, the selected model, the selected threshold, or the reported metric. This module gives you names for those failures. Names make reviews sharper. Instead of saying “this feels suspicious”,
you can say “the validation fold contaminates the scaling mean” or “the group boundary is broken”. That is the difference between general caution and an evaluation discipline.
Section 1: The Test Set Is Touched Once
Section titled “Section 1: The Test Set Is Touched Once”The simplest honest split has three functions. Training data fits model parameters. Validation data chooses model settings. Test data estimates final generalization after the choice has already been made.
All labeled data│├── train fit weights, trees, coefficients, scaler means├── validation choose hyperparameters, threshold, calibration method└── test final estimate, read once at the endThe test set is not a debugging surface. It is not a leaderboard. It is not where thresholds are tuned. It is not where feature choices are compared. Every time the test score changes your decision, test information becomes part of model selection. That is leakage through human iteration rather than through code. The scikit-learn cross-validation guide makes the same point when it warns that tuning estimator settings against the test set leaks knowledge about that test set into the model-selection process.
A Three-Way Split in Code
Section titled “A Three-Way Split in Code”The following example uses generated toy data and keeps the functions separate. The first split creates a final test set. The second split divides the remaining development data into train and validation.
import numpy as npfrom sklearn.datasets import make_classificationfrom sklearn.linear_model import LogisticRegressionfrom sklearn.metrics import ( average_precision_score, log_loss, roc_auc_score,)from sklearn.model_selection import train_test_splitfrom sklearn.pipeline import Pipelinefrom sklearn.preprocessing import StandardScaler
X, y = make_classification( n_samples=1200, n_features=12, n_informative=6, weights=[0.9, 0.1], random_state=0,)
X_dev, X_test, y_dev, y_test = train_test_split( X, y, test_size=0.2, stratify=y, random_state=0,)
X_train, X_valid, y_train, y_valid = train_test_split( X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=1,)
model = Pipeline([ ("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000)),])
model.fit(X_train, y_train)valid_proba = model.predict_proba(X_valid)[:, 1]
print("validation roc_auc:", roc_auc_score(y_valid, valid_proba))print("validation pr_auc :", average_precision_score(y_valid, valid_proba))print("validation logloss:", log_loss(y_valid, valid_proba))
# Only after the workflow is frozen:test_proba = model.predict_proba(X_test)[:, 1]print("final test roc_auc:", roc_auc_score(y_test, test_proba))The validation numbers are allowed to influence design. The final test number is evidence, not a steering wheel. If it surprises you, you do not keep tuning against it. You write down the surprise, return to the development protocol, and reserve a new final evaluation only when the experimental question has changed enough to justify new held-out evidence.
When Cross-Validation Replaces a Separate Validation Set
Section titled “When Cross-Validation Replaces a Separate Validation Set”A fixed validation set is easy to reason about. It is also sample-inefficient. The cross-validation guide explains the usual tradeoff: a separate validation split reduces the amount of data available for learning, and results can depend on one random partition. Cross-validation solves this inside the development data. The test set still remains outside the loop.
Outer holdout:
development data final test┌──────────────────────────────────────────────┐ ┌─────────┐│ CV fold 1: train train train train validate │ │ untouched││ CV fold 2: train train train validate train │ │ once ││ CV fold 3: train train validate train train │ │ at end ││ CV fold 4: train validate train train train │ │ ││ CV fold 5: validate train train train train │ │ │└──────────────────────────────────────────────┘ └─────────┘Use cross-validation when you need reliable model selection and the data is not large enough to afford a single validation split. Use a fixed validation set when the evaluation unit is naturally temporal, grouped, or operationally fixed. Use nested cross-validation when the project needs an unbiased estimate of a full model-selection procedure and a single final test set is not available. For most applied pipelines, the practical pattern is:
- Split off a final test set once.
- Run CV on the development data for model and hyperparameter selection.
- Refit the selected workflow on the full development data.
- Evaluate once on the final test set.
The key word is workflow. If preprocessing, feature selection, resampling, threshold tuning, or calibration are part of selection, they belong inside the selection protocol.
Section 2: When KFold Lies to You
Section titled “Section 2: When KFold Lies to You”KFold is honest only when rows are exchangeable enough for the problem. That assumption is stronger than it sounds. The cross-validation guide states the usual iid assumption and then warns that practical data often has time dependence or group structure. If rows are generated by patients, users, sessions, stores, devices, or time windows, a random fold can look statistically tidy while violating the way the model will be used.
Splitter Decision Table
Section titled “Splitter Decision Table”| Splitter | Use when | What it preserves | What goes wrong if misused |
|---|---|---|---|
KFold | Regression or balanced classification with independent rows | Each row appears in one validation fold | Class imbalance or grouped repetition can make folds unrepresentative |
StratifiedKFold | Classification, especially imbalanced classes | Approximate class proportions per fold | It does not prevent the same group from appearing in train and validation |
GroupKFold | Multiple rows share an entity that must stay together | Group disjointness | Class proportions can be skewed when groups carry different label rates |
StratifiedGroupKFold | Classification with both group boundaries and imbalance | Group disjointness plus approximate class balance | It may still struggle when group sizes or label distributions are extreme |
TimeSeriesSplit | Ordered observations where future data must not train past predictions | Forward-only training windows | Randomized CV trains on future rows and validates on past rows |
KFold Timeline
Section titled “KFold Timeline”KFold on iid rows:
row order: 0 1 2 3 4 5 6 7 8 9fold A: [V][V][T][T][T][T][T][T][T][T]fold B: [T][T][V][V][T][T][T][T][T][T]fold C: [T][T][T][T][V][V][T][T][T][T]fold D: [T][T][T][T][T][T][V][V][T][T]fold E: [T][T][T][T][T][T][T][T][V][V]
T = training row, V = validation rowThat is fine when row order carries no meaning. It is wrong when row order is time. It is also wrong when adjacent rows are duplicate measurements from the same entity.
GroupKFold Keeps Entities Apart
Section titled “GroupKFold Keeps Entities Apart”Rows: r0 r1 r2 r3 r4 r5 r6 r7 r8 r9Group ID: A A B B C C D D E E
fold 1: [V][V][T][T][T][T][T][T][T][T] validation groups: Afold 2: [T][T][V][V][T][T][T][T][T][T] validation groups: Bfold 3: [T][T][T][T][V][V][T][T][T][T] validation groups: CThe important invariant is not equal row counts. The invariant is entity separation. If the same patient, user, session, device, or household appears in both train and validation, the validation fold may test recognition instead of generalization. For imbalanced classification with groups, prefer StratifiedGroupKFold when feasible. It tries to preserve class proportions while keeping groups intact, which is exactly the double constraint described in the sklearn cross-validation guide.
TimeSeriesSplit Moves Forward Only
Section titled “TimeSeriesSplit Moves Forward Only”TimeSeriesSplit:
time: t0 t1 t2 t3 t4 t5 t6 t7 t8 t9fold 1: [T][T][T][V][.][.][.][.][.][.]fold 2: [T][T][T][T][V][.][.][.][.][.]fold 3: [T][T][T][T][T][V][.][.][.][.]fold 4: [T][T][T][T][T][T][V][.][.][.]
T = past used for trainingV = next future block used for validation. = not used in that foldThe guide describes TimeSeriesSplit as a variation where successive training sets are supersets of earlier ones. That matches many forecasting and monitoring workflows. It does not magically fix features that themselves peek into the future. A rolling mean computed with future rows is still temporal leakage, even if the splitter is correct.
Worked Example: Choosing a Splitter
Section titled “Worked Example: Choosing a Splitter”import numpy as npfrom sklearn.model_selection import ( GroupKFold, KFold, StratifiedGroupKFold, StratifiedKFold, TimeSeriesSplit,)
X = np.arange(20).reshape(10, 2)y = np.array([0, 0, 1, 1, 0, 0, 1, 1, 0, 1])groups = np.array(["u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4", "u5", "u5"])
splitters = { "iid_regression_or_balanced": KFold(n_splits=5), "imbalanced_classification": StratifiedKFold(n_splits=5, shuffle=True, random_state=0), "grouped": GroupKFold(n_splits=5), "grouped_and_imbalanced": StratifiedGroupKFold(n_splits=5), "time_ordered": TimeSeriesSplit(n_splits=3),}
for name, splitter in splitters.items(): if name in {"grouped", "grouped_and_imbalanced"}: first_train, first_valid = next(splitter.split(X, y, groups=groups)) else: first_train, first_valid = next(splitter.split(X, y)) print(name, "train:", first_train, "valid:", first_valid)Pause and predict - You have clickstream rows where each user can contribute hundreds of events, and the label is whether that user later churned. Which splitter is the first one you should consider? (Answer: a group-aware splitter such as
GroupKFold, orStratifiedGroupKFoldif class balance is also a concern, because the same user must not span train and validation.)
The pause question is not about syntax. It is about identifying the evaluation unit. Rows are not always the unit that must generalize. Sometimes the unit is a user. Sometimes it is a future week. Sometimes it is a manufacturing lot. When the unit changes, the splitter changes.
Section 3: A Taxonomy of Leakage
Section titled “Section 3: A Taxonomy of Leakage”Leakage is easier to detect when you name the contaminated statistic. Ask one question: what information is this line learning, and would that information be available at prediction time? If the answer is no, you have a leak.
Preprocessing Leakage
Section titled “Preprocessing Leakage”Preprocessing leakage occurs when a transformation learns statistics from validation or test rows. The common pitfalls guide uses preprocessing and feature selection as the canonical examples. The same rule applies to scaling, imputation, PCA, feature selection, discretization, and any learned transformer.
from sklearn.datasets import make_classificationfrom sklearn.linear_model import LogisticRegressionfrom sklearn.model_selection import train_test_splitfrom sklearn.preprocessing import StandardScaler
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
scaler = StandardScaler()X_scaled = scaler.fit_transform(X) # leaked: mean_ and scale_ used all rows
X_train, X_test, y_train, y_test = train_test_split( X_scaled, y, test_size=0.2, stratify=y, random_state=0,)
model = LogisticRegression(max_iter=1000)model.fit(X_train, y_train)print(model.score(X_test, y_test))The contaminated statistics are StandardScaler.mean_ and StandardScaler.scale_. The fix is the Module 1.1 pattern: put the scaler inside a Pipeline.
from sklearn.datasets import make_classificationfrom sklearn.linear_model import LogisticRegressionfrom sklearn.model_selection import train_test_splitfrom sklearn.pipeline import Pipelinefrom sklearn.preprocessing import StandardScaler
X, y = make_classification(n_samples=500, n_features=8, random_state=0)X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, stratify=y, random_state=0,)
model = Pipeline([ ("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000)),])
model.fit(X_train, y_train)print(model.score(X_test, y_test))Inside cross-validation, the same construction fits the scaler separately for each training fold. That is why Pipeline is not style. It is a boundary enforcement mechanism.
Target Leakage
Section titled “Target Leakage”Target leakage occurs when the feature construction uses labels, outcomes, or post-outcome information unavailable at prediction time. A target encoder is the classic legitimate-but-dangerous example. It replaces a category with a statistic derived from y. That can be useful only if the statistic is learned inside the training fold.
import pandas as pd
df = pd.DataFrame({ "city": ["A", "A", "B", "B", "C", "C", "D", "D"], "clicked": [1, 1, 0, 0, 1, 0, 0, 1],})
city_rate = df.groupby("city")["clicked"].mean()df["city_encoded"] = df["city"].map(city_rate)
print(df)The contaminated statistic is city_rate. It was computed from every label before any split. If a validation row’s own label contributes to its encoded feature, the validation score is no longer testing unseen behavior. In Module 1.4, you will treat target encoding as a feature-engineering method. Here the rule is simpler: any transformer that reads y must be cross-validation-aware and must be fitted only on the training portion of each fold.
Oversampling Leakage
Section titled “Oversampling Leakage”Oversampling before the split is leakage because duplicated or synthetic minority examples can cross fold boundaries. Even without an external imbalanced-learning library, the shape is visible.
import numpy as np
X = np.array([ [0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [1.2, 1.1],])y = np.array([1, 1, 0, 0, 0])
minority_idx = np.flatnonzero(y == 1)X_oversampled = np.concatenate([X, X[minority_idx]], axis=0)y_oversampled = np.concatenate([y, y[minority_idx]], axis=0)
print(X_oversampled)print(y_oversampled)If you split after this operation, a copied minority example can be in training while its duplicate is in validation. The contaminated statistic is the sampling distribution, and sometimes the duplicated row itself. The correct pattern is to place resampling inside the fold-specific training workflow. With plain sklearn, that usually means class weights, threshold choice, or a model that can handle imbalance. With external resampling libraries, it means their pipeline object must participate inside CV rather than being run once on the full dataset.
Group Leakage
Section titled “Group Leakage”Group leakage happens when related rows cross fold boundaries. The model may learn a fingerprint rather than a generalizable rule.
import numpy as npfrom sklearn.model_selection import GroupKFold, KFold
X = np.arange(12).reshape(6, 2)y = np.array([0, 0, 1, 1, 0, 1])groups = np.array(["p1", "p1", "p2", "p2", "p3", "p3"])
kfold = KFold(n_splits=3)bad_train, bad_valid = next(kfold.split(X, y))print("KFold validation groups:", groups[bad_valid])
group_cv = GroupKFold(n_splits=3)good_train, good_valid = next(group_cv.split(X, y, groups=groups))print("GroupKFold validation groups:", groups[good_valid])print("train groups:", groups[good_train])The contaminated statistic is entity identity. The fix is to pass groups to a group-aware splitter and use that splitter consistently in cross_validate, GridSearchCV, and calibration workflows.
Temporal Leakage
Section titled “Temporal Leakage”Temporal leakage occurs when training uses information from the future. Random shuffling is the obvious cause. Feature construction can be just as dangerous.
import pandas as pd
df = pd.DataFrame({ "day": [1, 2, 3, 4, 5, 6], "sales": [10, 12, 11, 15, 16, 14],})
df["bad_centered_mean"] = df["sales"].rolling(window=3, center=True).mean()df["good_past_mean"] = df["sales"].shift(1).rolling(window=3).mean()
print(df)The contaminated statistic is the centered rolling mean. For day 3, it can include day 4. That feature is not available when predicting day 3. TimeSeriesSplit handles the fold order; it does not inspect your feature logic.
Threshold and Metric Leakage
Section titled “Threshold and Metric Leakage”Classification models often output scores or probabilities. Business decisions often require a threshold. Choosing the threshold on the test set leaks test labels into the final decision.
import numpy as npfrom sklearn.metrics import f1_score
y_test = np.array([0, 0, 0, 1, 1, 1])p_test = np.array([0.05, 0.20, 0.45, 0.35, 0.70, 0.95])
for threshold in [0.25, 0.50, 0.75]: pred = (p_test >= threshold).astype(int) print(threshold, f1_score(y_test, pred))This code is fine for a classroom demonstration. It is not fine as a final evaluation protocol. The contaminated statistic is the selected threshold. Tune thresholds on validation data or inside cross-validation. Then read the test metric once using the frozen threshold.
Reporting Train Metrics as Generalization
Section titled “Reporting Train Metrics as Generalization”Training metrics answer “did the estimator fit the data it saw?” They do not answer “will the estimator generalize?” cross_validate(return_train_score=True) is useful because it shows the gap. It is not permission to report train scores as deployment evidence.
from sklearn.datasets import make_classificationfrom sklearn.ensemble import RandomForestClassifierfrom sklearn.model_selection import StratifiedKFold, cross_validate
X, y = make_classification(n_samples=600, n_features=20, random_state=0)cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)model = RandomForestClassifier(random_state=0)
scores = cross_validate( model, X, y, cv=cv, scoring="roc_auc", return_train_score=True,)
print("train:", scores["train_score"].mean())print("valid:", scores["test_score"].mean())The contaminated claim is the report, not the model. A train score can diagnose overfitting. It cannot certify generalization.
Section 4: Metrics Ask Different Questions
Section titled “Section 4: Metrics Ask Different Questions”Metrics are not interchangeable labels for “model quality”. They encode different claims. Some require a hard class prediction. Some evaluate ranking. Some evaluate probabilities. Some are sensitive to class prevalence. Some ignore calibration entirely.
Classification Metric Decision Table
Section titled “Classification Metric Decision Table”| Metric | Input | Threshold-dependent? | Best use | Main trap |
|---|---|---|---|---|
| Accuracy | class labels | Yes | Balanced classes with symmetric mistake costs | Imbalanced data can look strong at the no-skill rate |
| Precision | class labels | Yes | False positives are expensive | Can be improved by predicting fewer positives |
| Recall | class labels | Yes | False negatives are expensive | Can be improved by predicting more positives |
| F1 | class labels | Yes | Need a single precision/recall compromise | Ignores true negatives and hides threshold choice |
| MCC | class labels | Yes | Imbalanced binary or multiclass classification | Still depends on the chosen threshold |
| ROC-AUC | scores | No | Ranking positives above negatives across thresholds | Can look strong when early precision is poor |
| PR-AUC / AP | scores | No | Rare positives where positive-class retrieval matters | Baseline changes with positive prevalence |
| Log-loss | probabilities | No fixed threshold | Probabilistic prediction with harsh penalty for confident errors | A ranking-good model can lose if probabilities are overconfident |
| Brier | probabilities | No fixed threshold | Squared probability error and calibration diagnostics | Lower Brier is not purely “better calibration” unless decomposed |
Accuracy is the most intuitive metric and the easiest to misuse. If the positive class rate is 5%, a classifier that always predicts negative has 95% accuracy. That no-skill rate must be the first comparison, not an afterthought. The sklearn model-evaluation guide includes a large metrics catalog because no single metric answers every operational question.
Worked Example: Same Labels, Different Story
Section titled “Worked Example: Same Labels, Different Story”import numpy as npfrom sklearn.metrics import ( accuracy_score, f1_score, matthews_corrcoef, precision_score, recall_score,)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
print("accuracy :", accuracy_score(y_true, y_pred))print("precision:", precision_score(y_true, y_pred))print("recall :", recall_score(y_true, y_pred))print("f1 :", f1_score(y_true, y_pred))print("mcc :", matthews_corrcoef(y_true, y_pred))The model gets most rows right. It also misses half the positives. Whether that is acceptable is not a metric question. It is a problem-definition question.
ROC-AUC and PR-AUC Can Disagree
Section titled “ROC-AUC and PR-AUC Can Disagree”ROC-AUC asks how well positives are ranked above negatives across thresholds. Average precision summarizes precision-recall behavior and is especially sensitive to the quality of the top-ranked positive predictions. The sklearn guide notes that average precision with random predictions equals the fraction of positive samples. That makes PR-AUC much more prevalence-aware than ROC-AUC. Here is a small example where two score vectors answer the ranking question differently from the early-retrieval question.
import numpy as npfrom sklearn.metrics import average_precision_score, roc_auc_score
y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0])
model_a = np.array([0.90, 0.55, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20])model_b = np.array([0.70, 0.69, 0.68, 0.67, 0.66, 0.65, 0.10, 0.09])
for name, scores in {"A": model_a, "B": model_b}.items(): print( name, "roc_auc=", round(roc_auc_score(y_true, scores), 3), "average_precision=", round(average_precision_score(y_true, scores), 3), )One model can rank the positive class acceptably in pairwise terms while placing too many negatives near the top of the list. Another can do better at early retrieval for the positives that matter most. That is why rare-event triage often starts with PR-AUC and precision/recall at an operating threshold rather than accuracy.
Pause and predict - A classifier for a 2% positive class reports 98% accuracy and no other metrics. What is the first diagnostic question? (Answer: compare it to the no-skill classifier that always predicts the majority class, then inspect recall, precision, PR-AUC, and the confusion matrix at the chosen threshold.)
Log-Loss and Brier Care About Probability Quality
Section titled “Log-Loss and Brier Care About Probability Quality”ROC-AUC does not care whether a positive row receives 0.51 or 0.99 if the ranking is unchanged. Log-loss and Brier do. Log-loss penalizes confident wrong probabilities sharply. Brier is the mean squared error between predicted probabilities and outcomes in binary classification. The sklearn calibration guide warns that proper scoring rules such as log-loss and Brier combine reliability, resolution, and uncertainty. So a lower Brier score is useful, but it is not a calibration proof by itself unless you decompose or inspect the reliability curve.
import numpy as npfrom sklearn.metrics import brier_score_loss, log_loss
y_true = np.array([0, 0, 1, 1])calm = np.array([0.20, 0.30, 0.70, 0.80])overconfident = np.array([0.01, 0.99, 0.99, 0.99])
for name, proba in {"calm": calm, "overconfident": overconfident}.items(): print(name, "log_loss:", round(log_loss(y_true, proba), 3)) print(name, "brier :", round(brier_score_loss(y_true, proba), 3))The overconfident model is punished hard for the confident wrong second row. That is the behavior you want when probabilities drive downstream risk.
Section 5: Regression Metrics Penalize Different Errors
Section titled “Section 5: Regression Metrics Penalize Different Errors”Regression evaluation has the same problem in quieter clothing. MAE, RMSE, MAPE, and R2 are not synonyms. They encode different error preferences.
| Metric | What it measures | Penalizes | Good when | Trap |
|---|---|---|---|---|
| MAE | Mean absolute error | Linear error size | You want errors in target units and robustness to outliers | Does not punish rare large misses as strongly as RMSE |
| RMSE | Square root of mean squared error | Large errors strongly | Large misses are disproportionately costly | Can be dominated by outliers |
| MAPE | Relative absolute error | Error relative to target magnitude | Stakeholders reason in relative terms | Breaks or explodes near zero; sklearn returns relative values, not 0-100 percentages |
| R2 | Variance explained relative to a constant baseline | Poor explanatory fit | You need a baseline-relative score | Can be negative when the model is worse than the constant mean predictor |
The sklearn guide describes R2 as the proportion of target variance explained by the independent variables. It also notes that the best value is 1.0 and lower is worse. A negative R2 is not a formatting bug. It means the model is worse than the baseline implied by predicting the target mean on the evaluated data.
Worked Example: Same Predictions, Different Penalties
Section titled “Worked Example: Same Predictions, Different Penalties”import numpy as npfrom sklearn.metrics import ( mean_absolute_error, mean_absolute_percentage_error, mean_squared_error, r2_score,)
y_true = np.array([100.0, 110.0, 120.0, 130.0])model_a = np.array([100.0, 111.0, 119.0, 180.0])model_b = np.array([112.0, 112.0, 112.0, 112.0])
for name, pred in {"A": model_a, "B": model_b}.items(): print(name, "mae :", round(mean_absolute_error(y_true, pred), 3)) print(name, "rmse:", round(mean_squared_error(y_true, pred) ** 0.5, 3)) print(name, "mape:", round(mean_absolute_percentage_error(y_true, pred), 3)) print(name, "r2 :", round(r2_score(y_true, pred), 3))Model A is almost perfect on most rows and badly wrong on one. Model B is mediocre everywhere. MAE and RMSE will not rank those patterns the same in every dataset. That is not a problem with the functions. It is the point of having different functions.
When MAPE Breaks
Section titled “When MAPE Breaks”MAPE divides by the true target magnitude, with an epsilon guard in sklearn to avoid undefined division. That makes it scale-relative. It also makes it unstable when target values are zero or near zero.
from sklearn.metrics import mean_absolute_percentage_error
y_true = [0.0, 1.0, 10.0]y_pred = [0.1, 1.2, 12.0]
print(mean_absolute_percentage_error(y_true, y_pred))The output is finite because sklearn guards the denominator. The interpretation is still usually poor. When zeros are meaningful, consider MAE, RMSE, pinball loss for quantiles, or a transformed target that matches the operational question.
Section 6: Calibration Is Not Ranking
Section titled “Section 6: Calibration Is Not Ranking”Calibration asks a probability question. Among samples where the model says 0.8, do about 80% actually belong to the positive class? That is different from ranking. A model can have strong ROC-AUC and still produce probabilities that are too extreme or too timid. The sklearn calibration guide defines a well-calibrated classifier in exactly this confidence-frequency sense. It also explains reliability diagrams: bin predicted probabilities on the x-axis and plot the fraction of positives in each bin on the y-axis.
Reliability diagram:
fractionpositive 1.0 | perfect calibration | / 0.8 | x / | x / 0.6 | x / | x / 0.4 | x / | x / 0.2 | x / | x / 0.0 +---------------------------- 0.0 0.2 0.4 0.6 0.8 1.0 mean predicted probabilityPoints below the diagonal are overconfident for the positive class. Points above the diagonal are underconfident. The histogram under many reliability plots matters because a smooth-looking line based on empty bins is not evidence.
Platt Scaling Versus Isotonic Regression
Section titled “Platt Scaling Versus Isotonic Regression”CalibratedClassifierCV supports method="sigmoid" and method="isotonic". Sigmoid calibration is often called Platt scaling. It fits a logistic mapping from raw classifier scores to probabilities. It is low-capacity and tends to be safer with smaller calibration sets. Isotonic regression fits a non-parametric monotone step function. It can correct more general monotonic distortions. The sklearn calibration guide warns that isotonic is more prone to overfitting, especially on small datasets,
and states that isotonic tends to perform as well as or better than sigmoid when there is enough data, roughly more than 1000 samples.
| Method | sklearn value | Shape assumption | Strength | Risk |
|---|---|---|---|---|
| Platt scaling | method="sigmoid" | S-shaped correction | Lower variance, preserves ranking as a strictly monotonic transform | Cannot fix arbitrary calibration curve shapes |
| Isotonic regression | method="isotonic" | Monotone non-decreasing correction | More flexible; can fix broader monotonic distortions | Higher overfitting risk, introduces ties that can affect AUC |
Use sigmoid when the calibration set is small, the reliability curve looks roughly S-shaped, or you need to preserve ranking metrics as much as possible. Use isotonic when you have enough calibration data and the empirical reliability curve shows a monotonic but non-sigmoid distortion. Do not choose the method after reading the final test set. That is calibration-set leakage.
CalibratedClassifierCV Without Prefit Leakage
Section titled “CalibratedClassifierCV Without Prefit Leakage”CalibratedClassifierCV uses cross-validation to obtain unbiased predictions for fitting calibrators. With ensemble=True, it trains a clone of the base estimator on each training fold, predicts on that fold’s held-out data, and fits a calibrator to those predictions. With ensemble=False, it uses cross-validated predictions to train one calibrator, then fits the base estimator on all data. The documentation also describes calibrating an already fitted estimator with FrozenEstimator, while
placing responsibility on the user to keep model-fitting data disjoint from calibration data. The safest beginner workflow is to let CalibratedClassifierCV own the internal CV.
import numpy as npfrom sklearn.calibration import CalibratedClassifierCV, calibration_curvefrom sklearn.datasets import make_classificationfrom sklearn.ensemble import HistGradientBoostingClassifierfrom sklearn.metrics import brier_score_loss, log_loss, roc_auc_scorefrom sklearn.model_selection import StratifiedKFold, train_test_split
X, y = make_classification( n_samples=2000, n_features=20, n_informative=8, weights=[0.85, 0.15], random_state=0,)
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.25, stratify=y, random_state=0,)
base = HistGradientBoostingClassifier(random_state=0)cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
calibrated = CalibratedClassifierCV( estimator=base, method="sigmoid", cv=cv, ensemble=True,)
calibrated.fit(X_train, y_train)proba = calibrated.predict_proba(X_test)[:, 1]
fraction_positive, mean_predicted = calibration_curve( y_test, proba, n_bins=10, strategy="uniform",)
print("roc_auc:", round(roc_auc_score(y_test, proba), 3))print("brier :", round(brier_score_loss(y_test, proba), 3))print("logloss:", round(log_loss(y_test, proba), 3))print("reliability points:")for x_value, y_value in zip(mean_predicted, fraction_positive): print(round(x_value, 3), round(y_value, 3))This code prints reliability-curve coordinates. In a notebook, plot mean_predicted on the x-axis and fraction_positive on the y-axis with the diagonal y=x. The diagram matters more than a single scalar when the task is calibration diagnosis.
Expected Calibration Error
Section titled “Expected Calibration Error”Expected Calibration Error, often abbreviated ECE, summarizes the average gap between predicted confidence and observed frequency across bins. It is not a core sklearn metric function, but it is easy to compute from bin assignments. Use it as a diagnostic, not as the only truth. Different binning strategies can produce different values.
import numpy as np
def expected_calibration_error(y_true, y_proba, n_bins=10): y_true = np.asarray(y_true) y_proba = np.asarray(y_proba) edges = np.linspace(0.0, 1.0, n_bins + 1) ece = 0.0
for left, right in zip(edges[:-1], edges[1:]): if right == 1.0: in_bin = (y_proba >= left) & (y_proba <= right) else: in_bin = (y_proba >= left) & (y_proba < right) if not np.any(in_bin): continue bin_weight = in_bin.mean() confidence = y_proba[in_bin].mean() accuracy = y_true[in_bin].mean() ece += bin_weight * abs(confidence - accuracy)
return ece
y_true = np.array([0, 0, 1, 1, 1, 0])y_proba = np.array([0.05, 0.20, 0.55, 0.80, 0.90, 0.60])print(round(expected_calibration_error(y_true, y_proba, n_bins=3), 3))ECE compresses a curve into one number. Compression is useful for dashboards. It can hide whether miscalibration happens at low, middle, or high probabilities. Keep the reliability diagram.
Brier Decomposition in Plain Language
Section titled “Brier Decomposition in Plain Language”The calibration guide notes that Brier and log-loss are strictly proper scoring rules but mix several effects. For Brier, the classic decomposition separates:
| Term | Plain meaning | Interpretation |
|---|---|---|
| Reliability | Are predicted probabilities close to observed frequencies? | Calibration error |
| Resolution | Do different bins have meaningfully different event rates? | Discrimination/refinement |
| Uncertainty | How inherently variable is the outcome? | Dataset base-rate difficulty |
This is why a model can have a lower Brier score without being better calibrated. It may have better resolution. It may separate easy positives and negatives so well that the total score improves despite local calibration defects. For calibration work, report Brier and log-loss, but inspect the reliability curve and the binned gaps.
Calibration Workflow Checklist
Section titled “Calibration Workflow Checklist”- Split off final test data before model selection.
- Choose the model family and metric on development data.
- Choose the calibration method using training or validation evidence, not final test results.
- Fit
CalibratedClassifierCVwith a splitter that matches the data structure. - Evaluate probability metrics and reliability curves on held-out data.
- Freeze the threshold only after probabilities are calibrated, unless the downstream system consumes probabilities directly.
Pause and predict - A model has ROC-AUC 0.92, but in the 0.8 probability bin only about half the examples are positive. Is the model “good”? (Answer: it may rank examples well, but it is overconfident and poorly calibrated in that probability region. Calibration is a different property from ranking.)
Did You Know?
Section titled “Did You Know?”- The scikit-learn cross-validation guide explicitly warns that tuning hyperparameters on the test set lets knowledge about that test set leak into the model-selection process: https://scikit-learn.org/stable/modules/cross_validation.html
- The common pitfalls guide recommends
Pipelinebecause it ensures the appropriate method is performed on the correct data subset during fitting and prediction: https://scikit-learn.org/stable/common_pitfalls.html CalibratedClassifierCVcan fit sigmoid or isotonic calibrators, and its cross-validation procedure is designed so calibrators are trained from unbiased predictions: https://scikit-learn.org/stable/modules/calibration.html- The calibration example in the sklearn gallery compares classifiers with reliability curves and probability histograms, which is why calibration review should include both curve shape and bin support: https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html
Common Mistakes
Section titled “Common Mistakes”| Mistake | Contaminated statistic | Why it fails | Safer pattern |
|---|---|---|---|
Calling StandardScaler.fit_transform(X) before splitting | Feature means and scales | Validation/test rows influence preprocessing | Put StandardScaler inside Pipeline |
| Fitting target encoding before CV | Category target means | A row’s label can influence its own validation feature | Fit encoders inside a CV-aware pipeline |
| Oversampling before train/test split | Sampling distribution and duplicated examples | Copies or synthetic neighbors can cross split boundaries | Resample inside training folds or use class weights |
Using KFold when the same entity appears many times | Entity identity | Model recognizes groups instead of generalizing | Use GroupKFold or StratifiedGroupKFold |
| Randomly shuffling time-series rows | Future observations | Model trains on future information | Use TimeSeriesSplit and past-only features |
| Tuning threshold on the final test set | Selected threshold | Test labels become part of model selection | Tune threshold on validation/CV data |
| Reporting train metrics as deployment evidence | Evaluation claim | The model is scored on rows it already saw | Report held-out or CV metrics, with train scores only as diagnostics |
| Selecting a calibration set after seeing test results | Calibration split choice | Test results influence probability correction | Reserve calibration data before final testing |
1. A pipeline scores well with random KFold, but each customer appears in many rows.
Section titled “1. A pipeline scores well with random KFold, but each customer appears in many rows.”What do you check first?
Answer
Check whether the same customer ID appears in both training and validation folds. If it does, this is group leakage. Switch to GroupKFold or StratifiedGroupKFold and pass the customer ID as groups. The earlier score may have measured customer recognition rather than generalization to unseen customers.
2. A model for a rare event reports 96% accuracy and no confusion matrix.
Section titled “2. A model for a rare event reports 96% accuracy and no confusion matrix.”What is the first metric critique?
Answer
Compare the score to the no-skill majority-class baseline. If the positive class is rare, high accuracy may mean the model predicts almost everything negative. Ask for precision, recall, F1 or MCC at the chosen threshold, plus PR-AUC or average precision from scores.
3. A teammate scales the full matrix, then calls cross_validate.
Section titled “3. A teammate scales the full matrix, then calls cross_validate.”Why does CV not save the evaluation?
Answer
The fold splitter runs after the scaling statistic has already been learned from all rows. Each validation fold influenced mean_ and scale_. The scaler must be inside a Pipeline so each fold fits preprocessing only on that fold’s training rows.
4. A time-series model uses TimeSeriesSplit, but a feature is a centered rolling average.
Section titled “4. A time-series model uses TimeSeriesSplit, but a feature is a centered rolling average.”Is the evaluation safe?
Answer
No. The splitter is forward-only, but the feature itself can include future observations. Temporal safety requires both a temporal splitter and feature logic that only uses information available at prediction time.
5. Two classifiers have the same ROC-AUC, but one has much worse log-loss.
Section titled “5. Two classifiers have the same ROC-AUC, but one has much worse log-loss.”What does that suggest?
Answer
Their ranking ability may be similar, but their probability estimates differ. The worse log-loss model may be overconfident on wrong examples or generally miscalibrated. Inspect Brier score, reliability curves, and the probability distribution.
6. Isotonic calibration improves validation Brier score on a tiny calibration set.
Section titled “6. Isotonic calibration improves validation Brier score on a tiny calibration set.”What risk should you name?
Answer
Isotonic regression is flexible and can overfit small calibration sets. Compare it against sigmoid calibration, inspect the reliability curve, and avoid selecting the method using the final test set.
7. A regression model has negative R2 but acceptable MAE.
Section titled “7. A regression model has negative R2 but acceptable MAE.”How can both be true?
Answer
MAE reports average absolute error in target units. R2 compares the model against a constant baseline on variance explained. A model can have errors that look tolerable in units but still perform worse than predicting the evaluation-set mean.
8. A team calibrates a fitted model using data that was also used to train the base estimator.
Section titled “8. A team calibrates a fitted model using data that was also used to train the base estimator.”What is the leakage?
Answer
The calibrator sees predictions from a model evaluated on rows it was trained on. Those predictions are biased toward the training data. Use CalibratedClassifierCV with internal CV, or ensure the fitted base estimator and calibration set are disjoint when using a frozen fitted estimator.
Hands-On Exercise: Build an Honest Calibrated Classifier
Section titled “Hands-On Exercise: Build an Honest Calibrated Classifier”You will build a complete evaluation workflow on generated toy data. The point is not to maximize a score. The point is to prove that every learned statistic belongs to the correct split.
Use sklearn.datasets.make_classification to create an imbalanced binary dataset. Create at least 2000 rows, at least 15 features, and a positive class that is clearly the minority. Keep the code in one notebook or one Python file so you can rerun it from scratch.
Step 1: Reserve the final test set
Section titled “Step 1: Reserve the final test set”- Split once into development and final test sets with
train_test_split. - Use
stratify=y. - Do not inspect final test metrics until the workflow is frozen.
- Write down the positive-class rate in both splits and compare it to the overall rate.
Step 2: Choose and justify the splitter
Section titled “Step 2: Choose and justify the splitter”- Start with
StratifiedKFoldfor ordinary imbalanced tabular data. - Replace it with
GroupKFoldorStratifiedGroupKFoldif you add a synthetic group ID. - Replace it with
TimeSeriesSplitif you reorder the toy data into a temporal simulation. - In one paragraph, name what your splitter preserves.
Step 3: Build the uncalibrated pipeline
Section titled “Step 3: Build the uncalibrated pipeline”- Put
StandardScalerinside aPipeline. - Use a real sklearn classifier with
predict_proba, such asLogisticRegressionorHistGradientBoostingClassifier. - Evaluate with
cross_validateusing at least ROC-AUC, average precision, and negative log-loss. - If you request train scores, label them as diagnostics rather than generalization evidence.
Step 4: Calibrate without touching the final test set
Section titled “Step 4: Calibrate without touching the final test set”- Wrap the base classifier with
CalibratedClassifierCV. - Run
method="sigmoid"first. - Try
method="isotonic"only as a comparison, and decide based on development evidence. - Keep the
cvstrategy aligned with your data structure.
Step 5: Draw the reliability diagram
Section titled “Step 5: Draw the reliability diagram”- Use
calibration_curveto computefraction_positiveandmean_predicted. - Plot the diagonal reference line.
- Add a histogram or count table for predicted-probability bins.
- Explain where the model is overconfident or underconfident.
Step 6: Report probability metrics
Section titled “Step 6: Report probability metrics”- Compute
brier_score_losson held-out data. - Compute
log_losson held-out data. - Compare these to ROC-AUC and average precision.
- Explain whether ranking quality and probability quality tell the same story.
Step 7: Freeze a threshold
Section titled “Step 7: Freeze a threshold”- Pick a threshold using validation or CV evidence.
- Report precision, recall, F1, MCC, and a confusion matrix at that threshold.
- Evaluate the frozen threshold once on the final test set.
- State the no-skill accuracy baseline beside the final accuracy.
Completion Check
Section titled “Completion Check”- No preprocessing is fitted outside a pipeline.
- No threshold is selected on the final test set.
- No calibration method is selected after reading final test metrics.
- The final report separates ranking metrics, threshold metrics, and probability metrics.
- The reliability diagram is interpreted in words, not just displayed.
Sources
Section titled “Sources”- Cross-validation: evaluating estimator performance
- Metrics and scoring: quantifying the quality of predictions
- Probability calibration
- Common pitfalls and recommended practices
- Probability Calibration curves
- Custom refit strategy of a grid search with cross-validation
StratifiedKFoldGroupKFoldStratifiedGroupKFoldTimeSeriesSplitlog_lossbrier_score_lossmatthews_corrcoefcalibration_curveCalibratedClassifierCV- On Calibration of Modern Neural Networks
Next Module
Section titled “Next Module”Continue to Module 1.4: Feature Engineering & Preprocessing.