Module 5.11: Drift-Triggered Auto-Retraining Loop
Цей контент ще не доступний вашою мовою.
Discipline Track | Complexity:
[COMPLEX]| Time: 60-70 minPrerequisites: Module 5.5: Model Monitoring & Observability, Module 5.6: ML Pipelines & Automation, Module 5.8: Great Expectations Data Quality, and Module 5.10: Production Model-Serving Traffic Patterns
Prerequisites
Section titled “Prerequisites”Before starting this module, make sure you can connect the earlier MLOps pieces into one operational story rather than treating monitoring, pipelines, data gates, and serving rollout as isolated tools:
- Module 5.5: Model Monitoring & Observability for drift vocabulary, delayed-label monitoring, Prometheus metrics, and alert-response design
- Module 5.6: ML Pipelines & Automation for Kubeflow Pipelines, Argo Workflows, idempotency, checkpoints, resource controls, and promotion gates
- Module 5.8: Great Expectations Data Quality for data-quality checks before training and reviewed baselines instead of auto-accepted profiles
- Module 5.10: Production Model-Serving Traffic Patterns for KServe canary, shadow, rollback, and traffic-split machinery
- KServe, Argo Workflows, Argo Events, Kubeflow Pipelines, and MLflow basics
- Comfort reading Kubernetes manifests with custom resources, service accounts, RBAC, and ConfigMaps
- Comfort reading Prometheus PromQL, Alertmanager webhooks, and metric labels
Learning Outcomes
Section titled “Learning Outcomes”After completing this module, you will be able to make and defend production decisions about when retraining should start, when it should stop, and when a candidate model should reach users:
- Analyze drift, delayed-label, calibration, and business-performance signals to decide whether a retraining run is justified or whether investigation should stop earlier.
- Design a Kubernetes trigger path that converts Prometheus, Kafka, or scheduled evidence into a deduplicated retraining workflow without flooding the cluster.
- Implement a continuous-training DAG that snapshots data, validates it with Great Expectations, trains a challenger, registers it in MLflow, and sends it through shadow or canary gates.
- Evaluate candidate models against production champions using holdout deltas, slice guarantees, shadow comparison, KServe smoke tests, latency, cost, and human approval policy.
- Estimate the cost and rollback risk of retraining frequency, GPU capacity, retained data snapshots, cross-region feature-store reads, and failed promotion attempts.
Why This Module Matters
Section titled “Why This Module Matters”Hypothetical scenario: the following situation is invented to show the operational stakes clearly; it is not a claim about a specific company or public incident.
A subscription business runs a churn model that still returns low-latency predictions, still serves through KServe, and still has a green dashboard. No deployment failed. The API error rate is flat. The p99 latency sits below the objective. The team therefore assumes the model is healthy until monthly finance reporting shows that retention offers went to the wrong customers for almost three weeks. The model did not crash. The customer population changed, the old training window stopped representing current behavior, and the monitoring system never connected drift evidence to a governed retraining loop.
That is the expensive failure this module addresses. Module 5.5 gave you the vocabulary for feature drift, label drift, concept drift, delayed labels, proxy metrics, and Prometheus alerting. Module 5.6 showed why an ML pipeline must preserve evidence instead of hiding training inside notebooks or cron scripts. Module 5.8 made data validation a gate, not a dashboard decoration. Module 5.10 showed how a candidate reaches production through shadow, canary, and rollback controls instead of a full-traffic leap of faith.
Auto-retraining is where those pieces meet. It is tempting to describe it as “monitoring triggers training,” but that phrase hides the hard part. A drift score is not a root cause. A training run is not a promotion decision. A better offline metric is not permission to replace the champion. A successful Workflow is not proof that the new model is stable under live traffic. The closed loop must observe, detect, trigger, train, validate, promote, and rollback with separate evidence at each boundary.
This module treats automated retraining as a control loop, not a magic button. You will design the trigger path, the training path, and the promotion path as separate pieces with clear ownership. You will see how Evidently can compute drift scores, how NannyML can estimate performance when labels arrive late, how Prometheus and Alertmanager can fire a signal, how Argo Events or KEDA can turn the signal into work, how Kubeflow Pipelines and Argo Workflows can run the work, and how MLflow plus KServe can hold the candidate until it earns traffic.
1. The Closed Retraining Lifecycle
Section titled “1. The Closed Retraining Lifecycle”Automated retraining begins with a simple loop, but the loop only stays safe when each step has a different responsibility. Observability watches the serving world. Detection summarizes evidence. Trigger plumbing decides whether the evidence deserves work. The training pipeline produces a challenger. Validation decides whether the challenger deserves exposure. Promotion changes traffic. Rollback remains available after promotion because the first minutes or hours of production exposure are still evidence, not a victory lap.
sequenceDiagram autonumber participant Serve as KServe champion participant Logs as Prediction logs participant Drift as Evidently / NannyML participant Prom as Prometheus participant Trigger as Argo Events or KEDA participant Pipe as Kubeflow / Argo pipeline participant Reg as MLflow Registry participant Traffic as Shadow / Canary participant Rollback as Separate rollback owner
Serve->>Logs: request, features, prediction, model version Logs->>Drift: batch window or stream chunk Drift->>Prom: bounded drift and performance gauges Prom->>Trigger: alert after threshold and duration Trigger->>Pipe: deduplicated retraining run Pipe->>Pipe: snapshot, validate, train, evaluate Pipe->>Reg: register candidate with evidence tags Reg->>Traffic: expose challenger through shadow or canary Traffic->>Prom: live quality, latency, cost, slice metrics Prom->>Rollback: independent circuit-breaker signal Rollback->>Serve: restore champion traffic if gates failThe three trigger flavors have different jobs. A schedule is a guardrail that checks whether the world has changed on a known cadence, such as nightly or weekly. A drift signal starts work because current data no longer resembles the reference window closely enough. A performance gate starts work because the model appears worse, either through realized labels, estimated performance, or trusted business proxies. In a mature platform these triggers coexist, but they should not all mean “train immediately and promote automatically.”
Scheduled retraining is easy to reason about and easy to budget. It is useful for seasonal models, batch recommendations, and teams that want a predictable window for fresh data, GPU reservations, review, and release notes. Its weakness is latency: the model may decay hours after the scheduled window, and the platform will wait until the next run unless monitoring also pages or opens a ticket. The schedule is a seatbelt, not the whole safety system.
Drift-triggered retraining is more responsive but also noisier. A feature can drift because the population changed, because a data pipeline broke, because a promotion campaign brought new traffic, because an upstream team changed an encoding, or because the baseline window is wrong. Retaining the lesson from Module 5.5, drift is a symptom. The trigger should start a decision pipeline that validates data and trains a challenger only when the evidence survives additional gates.
Performance-triggered retraining is the strongest signal when labels are available quickly and accurately. If the champion’s realized F1, calibration, or business outcome falls below a threshold, a challenger should usually be prepared. The challenge is delay. Fraud labels may arrive days later. Churn labels may arrive after a billing cycle. Loan defaults may arrive much later. That is why the loop often combines realized labels with NannyML-style estimated performance, prediction-distribution shifts, and business proxies.
Think of the loop as a series of locks. One key opens the training run. Another key opens registry promotion. Another key opens live traffic. A different key closes traffic again if the candidate behaves badly. When one service controls every key, the system becomes brittle; the thing being evaluated can also hide or delay the decision that should stop it.
+-------------------+ +--------------------+ +------------------+| observe |----->| detect |----->| trigger || logs, labels, | | drift, estimates, | | schedule, drift, || business metrics | | realized outcomes | | performance gate |+-------------------+ +--------------------+ +---------+--------+ | v+-------------------+ +--------------------+ +------------------+| rollback |<-----| promote |<-----| validate || split traffic, | | shadow, canary, | | GE, holdout, || restore champion | | registry stage | | slices, cost |+-------------------+ +--------------------+ +---------+--------+ ^ | +----------+-------+ | train challenger | | fresh snapshot, | | reproducible DAG | +------------------+The diagram deliberately does not show a direct arrow from “detect” to “promote.” A detector can be wrong. A retrained model can be worse. A fresh training dataset can contain corrupted labels. A training image can pull a different dependency. A candidate can pass holdout evaluation and then fail on premium users under shadow comparison. The loop is useful because it turns these risks into named gates rather than because it eliminates risk.
Pause and predict: A detector reports PSI above
0.2on a high-importance feature for ten minutes, but the Great Expectations checkpoint fails because the same feature has a sudden null-rate spike. Should the platform train on the fresh data, open an upstream-data incident, or do both? What evidence would you require before allowing training to continue?
The conservative answer is to halt training and treat the failing data-quality gate as an upstream incident until proven otherwise. A drift trigger can create an investigation workflow, but Module 5.8 exists because training on malformed data often turns a recoverable data incident into a new bad model. The loop may still snapshot the data for forensics, but it should not register a candidate from data that violates a critical validation rule.
2. Drift Signals Worth Retraining On
Section titled “2. Drift Signals Worth Retraining On”Not all drift deserves retraining. A categorical browser-version feature can shift because a new mobile release rolled out. A timestamp-derived feature can shift every weekend. A feature with low model importance can move without changing model decisions. A feature with high importance can move because the business successfully entered a new market. The retraining loop must separate “the world is different” from “the model is probably less trustworthy” and “the data is broken.”
Covariate drift, also called feature distribution shift, means the input
distribution P(x) changed. Population Stability Index, the Kolmogorov-Smirnov
test, Wasserstein distance, and Jensen-Shannon distance are common ways to
compare current and reference windows. PSI is popular in regulated tabular
workflows because it is easy to review, but it depends on binning choices and
can hide segment-specific movement. KS works well for continuous one-dimensional
features, Wasserstein gives a distance-like view of numeric movement, and
Jensen-Shannon is useful for probability-like distributions because it is
symmetric and bounded.
Label drift means the target distribution changed. A fraud model can see the share of confirmed fraud rise. A churn model can see cancellations fall after a pricing change. Label drift is often more meaningful than feature drift, but it arrives late when labels are delayed. It also does not tell you whether the relationship between features and target changed; it may reflect a real business shift that the current model can still rank well.
Concept drift means P(y|x) changed. The same feature vector now implies a
different target probability than it did during training. This is the hardest
production case because it is usually invisible from inputs alone. A model can
receive stable-looking features and stable-looking prediction volumes while the
meaning of behavior changes underneath it. You detect concept drift through
realized performance, delayed labels, calibrated probabilities, segment
outcomes, or trusted proxies.
Performance regression is the clearest retraining signal when labels exist. The champion is no longer meeting its quality objective. In a classification model, that might be precision, recall, ROC AUC, F1, calibration error, or business value. In a ranking model, it might be NDCG or click-through-rate by cohort. In a regression model, it might be RMSE or MAE. The retraining loop should compare the candidate to the current champion, not merely to a fixed minimum threshold, because a model can pass an old threshold and still be worse than production.
Calibration drift deserves its own mention because many automated decisions depend on probability quality, not only class choice. A model that says “0.8” should be correct roughly eight times out of ten over a calibrated window. If the probabilities become overconfident, downstream thresholds, risk tiers, queues, and human review policies may all become unsafe even when raw accuracy appears acceptable. Calibration drift is also a warning sign for NannyML CBPE because CBPE relies on useful confidence information.
The next snippet shows a pragmatic Evidently workflow for producing a drift report and computing PSI per feature for the trigger path. The Evidently report is the human and machine-readable evaluation. The small PSI function is included because many teams still emit one bounded gauge per monitored feature to Prometheus, and that metric must remain stable even if report JSON structure changes between library versions.
from __future__ import annotations
import jsonfrom pathlib import Path
import numpy as npimport pandas as pdfrom evidently import Reportfrom evidently.presets import DataDriftPreset
FEATURES = ["age_days", "orders_30d", "avg_cart_value"]
def psi(reference: pd.Series, current: pd.Series, bins: int = 10) -> float: """Compute population stability index for one numeric feature.""" quantiles = np.linspace(0, 1, bins + 1) edges = np.unique(reference.quantile(quantiles).to_numpy()) if len(edges) < 3: edges = np.linspace(reference.min(), reference.max(), bins + 1) ref_counts, _ = np.histogram(reference, bins=edges) cur_counts, _ = np.histogram(current, bins=edges) ref_pct = np.maximum(ref_counts / max(ref_counts.sum(), 1), 1e-6) cur_pct = np.maximum(cur_counts / max(cur_counts.sum(), 1), 1e-6) return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
def main() -> None: reference = pd.read_parquet("reference.parquet") current = pd.read_parquet("current.parquet")
report = Report([DataDriftPreset()]) evaluation = report.run(current_data=current, reference_data=reference) evaluation.save_json("evidently-data-drift.json")
scores = { feature: round(psi(reference[feature], current[feature]), 6) for feature in FEATURES } Path("psi_scores.json").write_text(json.dumps(scores, indent=2), encoding="utf-8") print(json.dumps(scores, indent=2))
if __name__ == "__main__": main()That file uses Evidently’s current report API, where a Report receives a
DataDriftPreset and run() compares current data to reference data. The
custom PSI function is intentionally small and reviewable. In production you
would pin the Evidently version, test the extraction format, and store the full
JSON report as an artifact beside the trigger decision so an engineer can
explain why the workflow ran.
Prometheus should receive bounded, low-cardinality metrics. Do not label metrics by request ID, customer ID, raw feature value, or model input hash. Those belong in logs or object storage. The trigger metric needs stable labels such as model name, model version, feature name, segment, and window. This is the same cardinality discipline from Module 5.5, now used to start work rather than only show a dashboard.
from __future__ import annotations
import jsonimport timefrom pathlib import Path
from prometheus_client import Gauge, start_http_server
PSI_SCORE = Gauge( "ml_feature_psi_score", "Population stability index for a model feature window", ["model_name", "model_version", "feature_name", "segment"],)
DATASET_DRIFT = Gauge( "ml_dataset_drift_detected", "Dataset-level drift decision for the current monitoring window", ["model_name", "model_version", "segment"],)
def publish(scores: dict[str, float]) -> None: model_name = "churn" model_version = "champion-2026-05-01" segment = "all" drifted = 0 for feature_name, score in scores.items(): PSI_SCORE.labels(model_name, model_version, feature_name, segment).set(score) if score > 0.2: drifted += 1 DATASET_DRIFT.labels(model_name, model_version, segment).set(1 if drifted >= 2 else 0)
def main() -> None: start_http_server(9108) while True: scores = json.loads(Path("psi_scores.json").read_text(encoding="utf-8")) publish(scores) time.sleep(30)
if __name__ == "__main__": main()The corresponding PromQL should require duration and context. A single scrape above threshold is a clue, not a trigger. A sustained PSI score on one high-importance feature might open a human investigation. A sustained score on multiple important features plus prediction-distribution movement might start the retraining workflow. A sustained score plus a failed data-quality gate should start an upstream incident instead.
max_over_time( ml_feature_psi_score{ model_name="churn", segment="all", feature_name=~"orders_30d|avg_cart_value" }[5m]) > 0.2Delayed labels require a different technique. NannyML’s CBPE estimates classification performance from prediction confidence when ground truth is not available yet, and DLE estimates regression loss through an auxiliary model. These estimates do not replace realized performance; they buy time while the truth is delayed. The loop should treat estimated performance as a signal that can start a challenger, then require realized or shadow evidence before broad promotion when the model is high stakes.
from __future__ import annotations
import nannyml as nml
def estimate_classification_performance() -> None: reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()
estimator = nml.CBPE( y_pred_proba="y_pred_proba", y_pred="y_pred", y_true="repaid", timestamp_column_name="timestamp", metrics=["roc_auc", "f1"], chunk_size=5000, problem_type="classification_binary", ) estimator.fit(reference_df) estimated = estimator.estimate(analysis_df) print("Estimated Performance (CBPE)") print(estimated.filter(period="analysis").to_df().tail(3))
with_targets = analysis_df.merge(analysis_targets_df, left_index=True, right_index=True) calculator = nml.PerformanceCalculator( y_pred_proba="y_pred_proba", y_pred="y_pred", y_true="repaid", timestamp_column_name="timestamp", metrics=["roc_auc", "f1"], chunk_size=5000, problem_type="classification_binary", ) calculator.fit(reference_df) realized = calculator.calculate(with_targets) print("Realized Performance") print(realized.filter(period="analysis").to_df().tail(3))
def estimate_regression_performance() -> None: reference_df, analysis_df, _ = nml.load_synthetic_car_price_dataset() estimator = nml.DLE( feature_column_names=[ "car_age", "km_driven", "price_new", "accident_count", "door_count", "fuel", "transmission", ], y_pred="y_pred", y_true="y_true", timestamp_column_name="timestamp", metrics=["rmse", "mae"], chunk_size=6000, ) estimator.fit(reference_df) estimated = estimator.estimate(analysis_df) print("Estimated Performance (DLE)") print(estimated.filter(period="analysis").to_df().tail(3))
if __name__ == "__main__": estimate_classification_performance() estimate_regression_performance()The NannyML output is useful because it gives you rows by time chunk, metric, estimated value, thresholds, and alert flags. The retraining loop can translate those rows into Prometheus gauges or directly into a workflow parameter. The important design point is that CBPE and DLE are not “autopromote” evidence. They are early warning signals, especially valuable when the cost of waiting for labels is high and the cost of a challenger training run is acceptable.
Seldon Alibi Detect is another useful library when you need specialized drift detectors, such as KS tests, maximum mean discrepancy, classifier-based drift, or detectors for images and embeddings. In this module, Evidently is the primary reporting layer because it integrates naturally with tabular reports and automated checks. Alibi Detect is a strong alternative when the detector itself is the main engineering problem, especially for non-tabular features.
3. Trigger Plumbing on Kubernetes
Section titled “3. Trigger Plumbing on Kubernetes”Once the signal exists, the platform has to start exactly the right amount of work. Trigger plumbing is easy to underestimate because the YAML looks smaller than the model code. In practice it is where many retraining systems fail. A single drift spike can fire repeated Alertmanager notifications, a Kafka topic can contain many messages for the same model window, and a scheduled workflow can overlap with a drift-triggered workflow. Without debounce and deduplication, the platform can launch a dozen expensive training runs for one incident.
The canonical stack in this module uses Prometheus and Alertmanager for the signal, Argo Events for webhook-to-WorkflowTemplate triggering, KEDA ScaledJob as an alternative queue-less trigger, and Argo CronWorkflow as a scheduled guardrail. The same pattern works with Kafka if your prediction logs or drift events already land on a topic. The goal is not to crown one trigger tool. The goal is to make trigger ownership explicit and keep training idempotent.
Argo Events has three moving pieces. An EventSource receives events, such as
an Alertmanager webhook. A Sensor filters or transforms the event. A trigger
creates a Kubernetes resource, often an Argo Workflow derived from a
WorkflowTemplate. The EventSource should not contain training logic. The Sensor
should not decide promotion. It should submit a workflow with enough parameters
for the pipeline to fetch evidence, claim a deduplication key, and make the
next decision.
apiVersion: argoproj.io/v1alpha1kind: EventSourcemetadata: name: drift-alert-webhook namespace: mlopsspec: service: ports: - port: 12000 targetPort: 12000 webhook: drift: port: "12000" endpoint: /drift method: POSTAlertmanager can send drift alerts to the EventSource service. In production, you would authenticate this path through network policy, a private service, or a gateway that verifies a shared secret. The example keeps the URL boring on purpose so no realistic secret appears in the curriculum.
apiVersion: v1kind: ConfigMapmetadata: name: alertmanager-drift-route namespace: monitoringdata: alertmanager.yml: | route: receiver: argo-events-drift group_by: ["alertname", "model_name", "segment"] group_wait: 30s group_interval: 10m repeat_interval: 2h receivers: - name: argo-events-drift webhook_configs: - url: http://drift-alert-webhook-eventsource-svc.mlops.svc:12000/drift send_resolved: falseThe Sensor below submits a Workflow that references a WorkflowTemplate. It
passes the model name, segment, trigger source, and a deduplication key. The
deduplication key should be derived from stable fields such as model, segment,
alert name, and monitoring-window start. Do not use the webhook delivery time
as the only key, because every retry would look unique.
apiVersion: argoproj.io/v1alpha1kind: Sensormetadata: name: drift-retraining-sensor namespace: mlopsspec: template: serviceAccountName: operate-workflows dependencies: - name: drift-alert eventSourceName: drift-alert-webhook eventName: drift triggers: - template: name: submit-retraining-workflow k8s: operation: create source: resource: apiVersion: argoproj.io/v1alpha1 kind: Workflow metadata: generateName: drift-retrain- namespace: mlops labels: app.kubernetes.io/name: drift-retraining spec: workflowTemplateRef: name: drift-retraining-template arguments: parameters: - name: model-name value: churn - name: segment value: all - name: trigger-source value: prometheus-alertmanager - name: dedupe-key value: churn-all-drift-window parameters: - src: dependencyName: drift-alert dataKey: body.alerts.0.labels.model_name dest: spec.arguments.parameters.0.value - src: dependencyName: drift-alert dataKey: body.alerts.0.labels.segment dest: spec.arguments.parameters.1.value - src: dependencyName: drift-alert dataKey: body.groupKey dest: spec.arguments.parameters.3.valueThe Sensor is intentionally small because the WorkflowTemplate should own the retraining decision. A team that embeds all decision logic in Sensor transformations will eventually create a trigger layer that is hard to test, hard to version, and hard to replay. Keep Sensor logic at the edge: receive, filter, pass evidence, and create work.
KEDA gives another pattern when you want a Kubernetes-native job to appear only
while a Prometheus query remains above threshold. A ScaledJob is especially
useful for small retraining decision jobs, such as “claim the incident and
submit a proper workflow.” It is less expressive than Argo Events for webhook
payloads, but it is simple and works well for metrics-driven work.
apiVersion: keda.sh/v1alpha1kind: ScaledJobmetadata: name: drift-retraining-trigger namespace: mlopsspec: pollingInterval: 30 successfulJobsHistoryLimit: 3 failedJobsHistoryLimit: 5 maxReplicaCount: 1 jobTargetRef: backoffLimit: 1 template: metadata: labels: app.kubernetes.io/name: drift-retraining-trigger spec: serviceAccountName: submit-retraining restartPolicy: Never containers: - name: submit image: ghcr.io/example/mlops-trigger:0.1.0 imagePullPolicy: IfNotPresent env: - name: MODEL_NAME value: churn - name: SEGMENT value: all command: - /bin/sh - -c - | argo submit \ --from workflowtemplate/drift-retraining-template \ -n mlops \ -p model-name="${MODEL_NAME}" \ -p segment="${SEGMENT}" \ -p trigger-source="keda-prometheus" \ -p dedupe-key="${MODEL_NAME}-${SEGMENT}-prometheus" triggers: - type: prometheus metadata: serverAddress: http://prometheus-operated.monitoring.svc:9090 metricName: churn_feature_psi_above_threshold query: | max_over_time(ml_feature_psi_score{model_name="churn",segment="all"}[5m]) threshold: "0.2"The maxReplicaCount: 1 setting is not a complete debounce strategy, but it
prevents a single scaler from creating several simultaneous jobs. You still
need an idempotent claim inside the workflow because another trigger source can
submit the same work. The retraining system should assume events can be
delivered at least once and workflows can be retried.
Argo CronWorkflow is the guardrail. It runs on a cadence, inspects recent monitoring windows, and may decide to train or to exit cleanly. The schedule is valuable even when drift triggers exist because it catches missed alerts, recovers from monitor downtime, and gives operators a predictable daily evidence artifact.
apiVersion: argoproj.io/v1alpha1kind: CronWorkflowmetadata: name: nightly-drift-check namespace: mlopsspec: schedule: "15 2 * * *" timezone: Etc/UTC concurrencyPolicy: Forbid successfulJobsHistoryLimit: 3 failedJobsHistoryLimit: 3 workflowSpec: workflowTemplateRef: name: drift-retraining-template arguments: parameters: - name: model-name value: churn - name: segment value: all - name: trigger-source value: cron-guardrail - name: dedupe-key value: churn-all-nightlyThe WorkflowTemplate should own deduplication. Argo Workflows supports
synchronization.mutex, which prevents two workflows with the same mutex from
running the critical section concurrently. That is necessary but not sufficient
for long debounce windows. A mutex stops overlap; a sentinel claim says “we
already handled this drift window recently.” A ConfigMap, Lease, database row,
or small internal API can hold that sentinel.
apiVersion: argoproj.io/v1alpha1kind: WorkflowTemplatemetadata: name: drift-retraining-template namespace: mlopsspec: entrypoint: retrain-or-exit synchronization: mutex: name: churn-retraining-mutex arguments: parameters: - name: model-name - name: segment - name: trigger-source - name: dedupe-key templates: - name: retrain-or-exit dag: tasks: - name: claim-window template: claim-window - name: run-retraining dependencies: - claim-window when: "{{tasks.claim-window.outputs.parameters.claimed}} == true" template: run-retraining - name: claim-window container: image: bitnami/kubectl:1.35 command: - /bin/sh - -c args: - | key="{{workflow.parameters.dedupe-key}}" cm="drift-claim-${key}" if kubectl get configmap "${cm}" -n mlops >/dev/null 2>&1; then echo -n false > /tmp/claimed exit 0 fi kubectl create configmap "${cm}" \ -n mlops \ --from-literal=createdAt="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \ --from-literal=model="{{workflow.parameters.model-name}}" \ --from-literal=segment="{{workflow.parameters.segment}}" echo -n true > /tmp/claimed outputs: parameters: - name: claimed valueFrom: path: /tmp/claimed - name: run-retraining container: image: ghcr.io/example/retraining-runner:0.1.0 command: - /app/run-retraining args: - --model-name - "{{workflow.parameters.model-name}}" - --segment - "{{workflow.parameters.segment}}" - --trigger-source - "{{workflow.parameters.trigger-source}}"This ConfigMap claim is deliberately simple so the moving parts are visible. Production teams often use a database or a service because they need expiry, multi-cluster coordination, richer state, and audit. The principle is the same: a trigger must be safe to deliver twice, and a workflow must be safe to retry without creating a second candidate for the same window.
4. The Retraining Pipeline Itself
Section titled “4. The Retraining Pipeline Itself”The retraining pipeline is not “train on latest data.” It is a reproducible candidate factory. It fetches fresh data, saves an immutable snapshot, validates that snapshot, trains a challenger, evaluates it against the champion, registers the candidate, runs shadow or canary checks, and promotes or discards the candidate. Each step should leave evidence. If a candidate wins, you need to explain why. If it loses, you need to know whether it lost because data was bad, training was unstable, evaluation was too strict, or production behavior diverged.
+------------------+ +------------------+ +------------------+| fetch fresh data |---->| DVC snapshot |---->| GX checkpoint || logs, labels, | | immutable hash, | | schema, nulls, || feature store | | training window | | ranges, meaning |+------------------+ +------------------+ +--------+---------+ | v+------------------+ +------------------+ +------------------+| promote/discard |<----| shadow/canary |<----| train + evaluate || MLflow stage, | | paired outputs, | | champion delta, || KServe split | | latency, slices | | slices, cost |+------------------+ +------------------+ +------------------+The DVC step connects directly to Module 5.7. Even though that module is not one of the prerequisite links in the metadata, the data snapshot matters here. A drift-triggered run that trains on “whatever the warehouse returned at 03:12” is difficult to reproduce. A snapshot with a DVC hash, query version, feature-store read time, label cutoff, and code commit can be reviewed, retrained, compared, and retained for rollback analysis.
The Great Expectations step connects directly to Module 5.8. The trigger might be a drift score, but training should still fail if required columns are missing, null rates exceed reviewed limits, categorical values are outside approved sets, or timestamp ranges do not match the intended window. Drift tells you data moved. GX tells you whether the new data is usable enough to train on. Those are different checks.
Kubeflow Pipelines is a good authoring layer for the candidate factory because it makes the DAG visible and compiles the workflow into a package. The example below uses lightweight components that pass paths and scalar values. In a real platform, each component image would be pinned, resource requests would be set, and artifact storage would be configured by the cluster profile.
from __future__ import annotations
from kfp import compiler, dsl
@dsl.component(base_image="python:3.12", packages_to_install=["pandas==2.2.3"])def fetch_fresh_data(model_name: str, output_path: dsl.OutputPath(str)) -> None: from pathlib import Path
Path(output_path).write_text(f"snapshot for {model_name}\n", encoding="utf-8")
@dsl.component(base_image="python:3.12", packages_to_install=["great-expectations==1.5.11"])def validate_with_gx(snapshot_path: dsl.InputPath(str), validation_flag: dsl.OutputPath(str)) -> None: from pathlib import Path
# In production this runs a named GX Checkpoint from reviewed configuration. text = Path(snapshot_path).read_text(encoding="utf-8") if "snapshot" not in text: raise ValueError("fresh data snapshot did not pass the minimal exercise gate") Path(validation_flag).write_text("passed", encoding="utf-8")
@dsl.component(base_image="python:3.12", packages_to_install=["scikit-learn==1.5.2"])def train(validation_flag: dsl.InputPath(str), model_uri: dsl.OutputPath(str)) -> None: from pathlib import Path
if Path(validation_flag).read_text(encoding="utf-8") != "passed": raise ValueError("training blocked by data validation") Path(model_uri).write_text("runs:/candidate/model", encoding="utf-8")
@dsl.component(base_image="python:3.12")def evaluate(model_uri: dsl.InputPath(str), accuracy_delta: dsl.OutputPath(float)) -> None: from pathlib import Path
_ = Path(model_uri).read_text(encoding="utf-8") Path(accuracy_delta).write_text("0.018", encoding="utf-8")
@dsl.component(base_image="python:3.12", packages_to_install=["mlflow==2.17.2"])def register_candidate(model_uri: dsl.InputPath(str), accuracy_delta: dsl.InputPath(float)) -> None: from pathlib import Path
print("registering", Path(model_uri).read_text(encoding="utf-8").strip()) print("accuracy_delta", Path(accuracy_delta).read_text(encoding="utf-8").strip())
@dsl.component(base_image="python:3.12")def run_shadow() -> str: return "shadow_passed"
@dsl.component(base_image="python:3.12")def promote_or_discard(shadow_status: str) -> None: if shadow_status != "shadow_passed": print("discard candidate") return print("promote candidate after registry and traffic gates")
@dsl.pipeline(name="drift-triggered-retraining")def drift_retraining_pipeline(model_name: str = "churn") -> None: fetched = fetch_fresh_data(model_name=model_name) validated = validate_with_gx(snapshot_path=fetched.output) trained = train(validation_flag=validated.output) evaluated = evaluate(model_uri=trained.output) registered = register_candidate( model_uri=trained.output, accuracy_delta=evaluated.output, ) shadow = run_shadow().after(registered) promote_or_discard(shadow_status=shadow.output)
if __name__ == "__main__": compiler.Compiler().compile( pipeline_func=drift_retraining_pipeline, package_path="drift-retraining-pipeline.yaml", )The important shape is the DAG, not the toy training logic. The validation task must precede training. Evaluation must compare against the champion. Registry registration must include evidence tags. Shadow or canary must run before live promotion. Promotion must be able to discard a candidate cleanly. That is the pipeline vocabulary from Module 5.6 applied to a drift-triggered lifecycle.
MLflow Model Registry gives the candidate a lifecycle separate from the
training run. Many teams still model the path as None -> Staging -> Production
and attach tags that prove validation happened. Newer MLflow usage may prefer
aliases and richer tags for some workflows, but the core rule is unchanged:
the registry should record what evidence allowed the model to move.
from __future__ import annotations
import mlflowfrom mlflow import MlflowClient
def promote_candidate( registered_model_name: str, run_id: str, model_artifact_path: str, accuracy_delta: float, shadow_windows_passed: int,) -> None: client = MlflowClient() model_uri = f"runs:/{run_id}/{model_artifact_path}" result = mlflow.register_model(model_uri, registered_model_name)
client.set_model_version_tag( name=registered_model_name, version=result.version, key="validation.accuracy_delta", value=f"{accuracy_delta:.4f}", ) client.set_model_version_tag( name=registered_model_name, version=result.version, key="validation.shadow_windows_passed", value=str(shadow_windows_passed), ) client.set_model_version_tag( name=registered_model_name, version=result.version, key="validation.signature_checked", value="true", )
if accuracy_delta <= 0 or shadow_windows_passed < 3: client.set_model_version_tag( name=registered_model_name, version=result.version, key="promotion.decision", value="discarded", ) return
client.transition_model_version_stage( name=registered_model_name, version=result.version, stage="Staging", archive_existing_versions=False, ) client.set_model_version_tag( name=registered_model_name, version=result.version, key="promotion.decision", value="ready_for_canary", )Champion/challenger continuous training keeps the current production model serving while a challenger trains on fresh data. The challenger wins only if it beats the champion by a defined amount over a defined number of evaluation windows. The threshold should be larger than noise. A candidate that improves global accuracy by a tiny amount while degrading a critical segment is not a winner. A candidate that wins once and loses the next two windows is not stable enough for full promotion.
This pattern is different from online learning. Champion/challenger continuous training produces new batch-trained artifacts and promotes them through gates. Online learning updates the model continuously or near-continuously as events arrive. Online learning can be powerful for domains with rapid feedback and low-stakes corrections, but it is harder to audit, harder to roll back, and harder to reproduce. Most Kubernetes platform teams should master governed challenger promotion before allowing online mutation of production models.
5. Validation Gates Before Promotion
Section titled “5. Validation Gates Before Promotion”Validation before promotion is the second-most-important part of the loop after the trigger itself. A drift trigger tells you that the champion may be stale. It does not tell you that the challenger is better. The challenger must pass offline evidence, data evidence, serving evidence, slice evidence, cost evidence, and sometimes human review before it earns production traffic. This is where automated retraining becomes engineering discipline rather than training enthusiasm.
The first gate is holdout delta against the current production model. Do not evaluate only against an absolute threshold such as “accuracy greater than 0.8.” A stale threshold can let worse candidates pass. Compare the challenger and champion on the same holdout window, record the metric delta, and require a minimum improvement or a documented reason to accept parity. For imbalanced classification, use metrics that match business risk, such as recall at a precision floor or cost-weighted business value.
The second gate is slice and segment guarantees. A challenger that improves the global metric while hurting a small but important segment is not safe. Define the segments before training: region, product, tenant tier, language, device, risk band, class label, or protected category where legally and ethically appropriate. The gate should say something like “no critical slice may regress by more than two percent relative unless a human approves the exception with context.”
The third gate is shadow traffic comparison, forward-linked to Module 5.10. Shadow mode sends the same production request to the challenger while the champion still serves the answer. The challenger response is logged and discarded. This catches feature encoding, serving-runtime, latency, output shape, and offline-online divergence before users depend on the response. It also adds cost, so shadow windows should be sampled and time-bounded.
The fourth gate is a smoke test against the KServe predictor with synthetic inputs. This is not model validation. It checks that the model loads, accepts the production request shape, returns the expected response shape, respects timeouts, and records the expected metadata. Smoke tests should run before shadow and canary so you do not spend live comparison budget on a broken container, missing runtime, or invalid model signature.
kubectl -n ml-serving get inferenceservice churn-challenger
SERVICE_HOSTNAME="$(kubectl -n ml-serving get inferenceservice churn-challenger \ -o jsonpath='{.status.url}' | sed 's#https\?://##')"
kubectl -n istio-system port-forward svc/istio-ingressgateway 8080:80curl -sS \ -H "Host: ${SERVICE_HOSTNAME}" \ -H "Content-Type: application/json" \ http://127.0.0.1:8080/v1/models/churn-challenger:predict \ -d '{ "instances": [ { "age_days": 183, "orders_30d": 4, "avg_cart_value": 38.5 } ] }'The fifth gate is latency and cost regression. A challenger can be more accurate and still unacceptable if it doubles GPU memory, increases p99 beyond the objective, calls a remote feature store in another region, or requires a larger serving node pool. The validation report should include both model quality and operational cost. This is especially important when drift-triggered workflows can run outside ordinary release windows and when the cost owner is not the data scientist who started the run.
The sixth gate is human approval when consequences justify it. Regulated domains, high-stakes models, legal exposure, safety decisions, protected-class impact, and ambiguous slice results should require a person to review evidence before production promotion. Low-stakes, well-instrumented models can skip manual approval when the trigger, validation, shadow, canary, rollback, and retention policies are mature. The decision should be in policy, not improvised at 02:00 during an alert.
apiVersion: argoproj.io/v1alpha1kind: WorkflowTemplatemetadata: name: promotion-gated-retraining namespace: mlopsspec: entrypoint: gated-promotion templates: - name: gated-promotion steps: - - name: evaluate-offline template: evaluate-offline - - name: run-shadow template: run-shadow when: "{{steps.evaluate-offline.outputs.parameters.offline-pass}} == true" - - name: approval template: human-approval when: "{{steps.run-shadow.outputs.parameters.needs-approval}} == true" - - name: promote template: promote when: "{{steps.run-shadow.outputs.parameters.shadow-pass}} == true" - name: human-approval suspend: {} - name: evaluate-offline container: image: ghcr.io/example/model-evaluator:0.1.0 command: ["/app/evaluate"] outputs: parameters: - name: offline-pass valueFrom: path: /tmp/offline-pass - name: run-shadow container: image: ghcr.io/example/shadow-evaluator:0.1.0 command: ["/app/shadow"] outputs: parameters: - name: shadow-pass valueFrom: path: /tmp/shadow-pass - name: needs-approval valueFrom: path: /tmp/needs-approval - name: promote container: image: ghcr.io/example/model-promoter:0.1.0 command: ["/app/promote"]Before running this in a real cluster: Which gate would fail first if the challenger was trained on a valid snapshot but its serving image expected a different JSON field name than the champion? Which logs and artifacts would you inspect before blaming the model weights?
The smoke test and shadow comparison should fail before broad traffic moves. The model weights may be perfectly reasonable; the serving wrapper, feature transformer, model signature, or request encoder may be wrong. This is why promotion gates must test the deployed predictor and not only a notebook or training container.
6. Rollback, Circuit-Breakers, and Forensic Evidence
Section titled “6. Rollback, Circuit-Breakers, and Forensic Evidence”Rollback cannot be owned by the retraining service alone. A failing retraining service may be unavailable, overloaded, or convinced its candidate is fine because the same flawed metrics fed both promotion and rollback. The rollback signal should live in a separate Argo Workflow, GitOps controller, feature flag system, platform API, or KServe traffic-split operation. Module 5.10 called out the anti-pattern directly: the candidate model must not control the only kill switch that can remove it from traffic.
Auto-rollback triggers should combine infrastructure and model behavior. Error rate jumps, p99 latency jumps, pod restarts, queue depth, and GPU memory pressure are serving-health signals. Calibration-drift jumps, prediction distribution shifts, champion/challenger disagreement, slice regression, and business-metric movement are model-health signals. A severe infrastructure failure can roll back immediately. A model-quality signal may first pause traffic expansion, hold canary percentage steady, or require human review, depending on stakes and confidence.
apiVersion: argoproj.io/v1alpha1kind: Workflowmetadata: generateName: rollback-churn- namespace: mlopsspec: entrypoint: rollback serviceAccountName: traffic-operator arguments: parameters: - name: inferenceservice value: churn-router - name: namespace value: ml-serving templates: - name: rollback container: image: bitnami/kubectl:1.35 command: - /bin/sh - -c args: - | kubectl patch inferenceservice "{{workflow.parameters.inferenceservice}}" \ -n "{{workflow.parameters.namespace}}" \ --type merge \ -p '{"spec":{"predictor":{"canaryTrafficPercent":0}}}' kubectl annotate inferenceservice "{{workflow.parameters.inferenceservice}}" \ -n "{{workflow.parameters.namespace}}" \ "mlops.kubedojo.dev/rollback-at=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \ --overwriteThe rollback window is a policy decision. A low-stakes recommender may declare a model stable after several hours of canary traffic and a full business cycle of sampled shadow comparison. A fraud model may need multiple delayed-label windows. A regulated credit model may need formal review before and after promotion. The window should be long enough to observe the metrics that can actually move and short enough that old rollback artifacts are still warm, valid, and affordable.
Forensic evidence is not optional. Keep the champion and challenger model versions, training data snapshot IDs, validation reports, drift report, feature schema, label cutoff, code commit, image digest, KServe revision, traffic split history, canary cohorts, paired shadow outputs, and rollback decision. This evidence lets the team answer whether the candidate was bad, the data was bad, the gate was weak, the trigger was noisy, or the business changed faster than the model lifecycle.
The retraining loop should also have a “stop training” circuit-breaker. If three consecutive retraining runs fail data validation, do not keep launching GPU jobs. If two candidates lose to the champion in the same drift incident, stop and escalate to data science review. If the feature store is cross-region and egress cost spikes, pause non-urgent retraining. A circuit-breaker protects both reliability and budget.
7. Cost Lens for Closed-Loop Retraining
Section titled “7. Cost Lens for Closed-Loop Retraining”Automated retraining has a direct cost and an opportunity cost. The direct cost is compute, storage, orchestration, monitoring, duplicated serving, registry artifacts, and engineer review. The opportunity cost is model staleness: wrong predictions, missed conversions, unnecessary retention offers, fraud loss, manual review load, or user harm. A serious design compares those costs instead of assuming “fresh is always better.”
Consider a small worked example. A training run uses one on-demand GPU node
that costs 2.40 USD per hour and runs for three hours. Validation and shadow
comparison add 1.10 USD in CPU, storage, and duplicated inference. One run
therefore costs about 8.30 USD before human review. Daily retraining costs
about 249 USD in a 30-day month. Hourly retraining would cost about 5,976 USD before failed runs and shadow traffic. If the stale model costs about
3,000 USD per month in measurable business loss, hourly retraining is hard
to justify unless it prevents rare but severe incidents.
Now invert the example. A fraud model’s stale recall can allow 15,000 USD of
avoidable loss during one high-risk weekend. A weekend drift-triggered training
run plus shadow comparison that costs 60 USD is easy to justify if the
validation gates are strong and rollback is ready. The economics are not
“training is expensive” or “training is cheap.” The economics are frequency,
trigger precision, model stakes, and how much value a fresh candidate can
capture before the world changes again.
Use spot or preemptible GPU instances for training when the pipeline supports checkpointing and retries. Use on-demand capacity for serving paths where eviction would violate availability or latency objectives. Keep the old champion warm during the rollback window if rollback time matters. That costs money, but cold rollback can cost more when users wait for image pulls, model downloads, GPU allocation, or cache warming during an incident.
Retraining frequency should follow drift tolerance. Some models can tolerate small feature shifts for weeks because the business process is stable and the model has margin. Some models need daily windows because the population changes quickly. Some models should train only when performance or drift evidence crosses a threshold. A daily schedule that always trains even when drift is flat can cost more than the degradation it prevents, and it can create review fatigue that weakens real promotion decisions.
Storage costs matter because reproducibility requires retention. Data snapshots, model artifacts, reports, paired shadow outputs, and feature logs can grow quickly. Retain enough evidence to debug, audit, and roll back. Tier old artifacts to cheaper storage after the rollback and review window closes. Do not keep every raw feature vector forever unless policy requires it and the privacy model supports it. The evidence plan should name retention duration, compression, partitioning, and deletion ownership.
Cross-region egress is a quiet cost multiplier. A feature store in one region and a training cluster in another can turn every retraining run into a network bill and a latency problem. The first fix is architectural: train near the data or replicate approved snapshots intentionally. The second fix is operational: cache immutable snapshots, avoid repeated full reads during retries, and make egress visible in the candidate run summary.
Operational Budget Guardrails
Section titled “Operational Budget Guardrails”Cost control belongs in the workflow, not only in a quarterly platform review. Add a budget check near the beginning of the decision pipeline that reads the month-to-date training spend, the number of candidates already trained for the model, and the current severity of the drift or performance signal. A low severity event can open a ticket when the budget is nearly exhausted, while a high severity event can consume an emergency allowance and record the reason.
The budget check should distinguish candidate creation from candidate exposure. Training a small challenger may be cheap enough to run during a drift incident, but shadowing every production request against a larger model can dominate the bill. That is why the validation report should include a serving-cost estimate before the shadow window starts. A candidate that is affordable to train may still be too expensive to serve at the desired latency and scale.
Resource requests are part of the cost policy. If a training workflow requests a full GPU for a preprocessing step, the cluster scheduler will reserve scarce capacity for work that could run on CPU. If a workflow omits requests entirely, the job may be evicted, starve other tenants, or run on an expensive node by accident. The pipeline owner should set explicit CPU, memory, GPU, and ephemeral storage requests per task, then review actual usage from completed runs.
Spot and preemptible capacity should be paired with checkpoints. A retraining job that can resume after preemption is a good candidate for cheaper capacity. A job that must restart a six-hour training run from the beginning after every eviction may burn more money than it saves. The checkpoint interval should be short enough to preserve progress and long enough that checkpoint writes do not dominate training time or object-store costs.
The storage plan should split retention by evidence type. Keep raw training snapshots long enough to reproduce candidates during the rollback and audit window. Keep compact drift summaries and validation reports longer because they are cheaper and useful for trend analysis. Keep paired shadow outputs only as long as policy allows, because they can contain sensitive features, predictions, and business outcomes. Deletion should be automated and logged.
Finally, make “do nothing” visible. A decision pipeline that exits because drift was transient, labels contradicted the proxy signal, or the budget gate blocked non-urgent training should still write a small evidence record. Those records prevent future teams from asking why no candidate exists for a known alert. They also help tune thresholds because false positives become measurable instead of disappearing into chat history.
Governance and Ownership Boundaries
Section titled “Governance and Ownership Boundaries”Closed-loop retraining crosses team boundaries, so ownership must be explicit before autonomy increases. The platform team usually owns the trigger path, orchestrator reliability, namespace policy, service accounts, and rollback mechanics. The data team owns source freshness, feature definitions, data quality suites, and snapshot retention. The modeling team owns training code, evaluation metrics, challenger design, and slice analysis. The product or risk owner owns business thresholds and human approval policy.
Those boundaries should appear in the runbook. A drift alert caused by missing feature values routes to the data owner. A candidate that wins offline but fails KServe smoke tests routes to the platform and modeling owners together. A candidate that improves global metrics but regresses a protected or critical segment routes to the risk owner before any broader exposure. Routing rules reduce confusion when an automated loop pauses and asks for human judgment.
Access boundaries matter as much as ownership. The component that reads raw training data does not need permission to patch KServe traffic. The component that patches traffic does not need permission to read raw labels. The Sensor that receives Alertmanager webhooks does not need registry write access. Narrow permissions limit blast radius when a container image, workflow parameter, or service account is misconfigured.
Audit language should be plain enough for an incident review. Instead of
recording “workflow succeeded,” record “drift signal for orders_30d exceeded
the threshold for five minutes, data validation passed, challenger version 18
beat champion version 16 by 0.018 accuracy globally, no critical slice
regressed beyond policy, shadow ran for three windows, and canary promotion
started at ten percent.” That sentence is longer, but it is evidence.
That same evidence sentence should appear in the pull-request comment, model registry tags, or incident timeline that humans already review. Autonomy is easier to trust when the machine leaves a trail in the places operators actually inspect.
Patterns and Anti-Patterns
Section titled “Patterns and Anti-Patterns”The patterns below are the production shapes that keep a drift-triggered loop useful after the demo. Each pattern separates evidence from action. Each anti-pattern collapses evidence and action into one fragile step.
| Pattern | When to Use | Why It Works | Scaling Consideration |
|---|---|---|---|
| Decision pipeline before training | Drift alerts are noisy, labels are delayed, or upstream data quality can fail | The trigger starts evidence collection before spending GPU time | Keep the decision step cheap and idempotent |
| Champion/challenger with shadow first | Candidate behavior must be compared on real requests before users see answers | The champion stays live while the challenger proves serving compatibility | Sample shadow traffic and set an exit date |
| Data-quality gate before training | Fresh data comes from mutable pipelines, vendors, or feature stores | Bad data is stopped before it becomes a bad model | Own GX suites as reviewed code, not auto-generated noise |
| Registry tags as promotion evidence | Multiple teams train, review, serve, and audit models | MLflow records why a candidate moved stages | Enforce required tags in CI or a registry wrapper |
| Separate rollback workflow | A candidate can fail after promotion or during canary | Rollback does not depend on the service being evaluated | Practice rollback and keep old artifacts warm long enough |
| Anti-Pattern | What Goes Wrong | Why Teams Fall Into It | Better Alternative |
|---|---|---|---|
| Retrain on every drift alert | GPU spend spikes and candidates train on temporary or broken data | Drift feels objective because it has a number | Require sustained drift plus quality and context gates |
| Promote from the training job | A successful train becomes a production change without live evidence | Teams confuse pipeline success with model success | Register a candidate, then shadow or canary through serving gates |
| Let the retraining service own rollback | A failing system may be unable or unwilling to stop itself | It is convenient to keep all logic in one service | Put rollback in KServe traffic split, ArgoCD, or a separate workflow |
| Compare candidate only to a fixed threshold | A model can pass the threshold while losing to the champion | Old thresholds are easy to encode and forget | Compare challenger and champion on the same windows and slices |
| Ignore delayed labels | The loop optimizes proxy metrics that may not match outcomes | Immediate metrics are easier to automate | Use CBPE or DLE for early warning and realized labels for final evidence |
| Keep all snapshots forever | Storage and privacy risk grow without improving decisions | Reproducibility gets interpreted as infinite retention | Set retention tiers for rollback, audit, and deletion windows |
Decision Framework
Section titled “Decision Framework”Choosing a retraining strategy starts with two questions: how quickly the world changes and how expensive a wrong update would be. A model with slow drift and low serving stakes can use scheduled retraining. A model with fast drift and clear monitoring can use drift-triggered challengers. A model with reliable labels can use performance-triggered training. A model with high traffic and fast feedback can use champion/challenger continuous training. A model that must adapt within seconds or minutes may need online learning, but that choice raises audit and rollback complexity.
| Strategy | Use When | Avoid When | Primary Evidence | Promotion Posture |
|---|---|---|---|---|
| Scheduled retraining | Drift is predictable, compute windows are planned, and review capacity is limited | The model can decay badly between schedules | Nightly or weekly validation summary | Batch candidate, ordinary review |
| Drift-triggered retraining | Feature or prediction drift is measurable and staleness cost is meaningful | Baselines are noisy or data quality often fails | Sustained PSI, KS, Wasserstein, JS, prediction shift | Train challenger, require validation |
| Performance-triggered retraining | Labels or trusted business outcomes arrive quickly enough | Labels are delayed, sparse, biased, or unjoined | Realized metric regression | Train challenger, compare to champion |
| Champion/challenger continuous training | Fresh data is valuable and candidate comparison is routine | Traffic is too small for stable windows | Holdout, shadow, canary, slice win-rate | Promote only after repeated wins |
| Online learning | Feedback is immediate, low-risk, and model mutation is auditable | Regulated, high-stakes, or hard-to-reproduce decisions | Streaming loss or reward | Use strict guardrails and rollback design |
Start | +-- Are labels reliable within the decision window? | | | +-- yes --> Use performance-triggered training as the main signal. | | | +-- no --> Use drift, estimated performance, and proxy metrics. | +-- Is staleness cost higher than routine training cost? | | | +-- no --> Prefer scheduled checks and human review. | | | +-- yes --> Allow drift-triggered challenger creation. | +-- Can the candidate be compared safely on live requests? | | | +-- yes --> Shadow, then canary or A/B promotion. | | | +-- no --> Require stronger offline gates and human approval. | +-- Does the model need continuous mutation to be useful? | +-- yes --> Consider online learning only with audit and rollback. | +-- no --> Use champion/challenger continuous training.When in doubt, choose the less autonomous option first. A scheduled check-and-maybe-train workflow is easier to audit than a fully autonomous drift-to-production loop. You can add autonomy after the evidence is boring: few false triggers, stable validation, reliable labels, practiced rollback, and cost within the budget.
Did You Know?
Section titled “Did You Know?”- Evidently’s current documentation describes more than
100evaluation metrics and more than40 milliondownloads for the open-source library. - KEDA
2.16documents a Prometheus scaler that can drive aScaledJobfrom a PromQL query and threshold. - Argo CronWorkflow schedules use standard cron expressions such as
15 2 * * *, giving teams a predictable guardrail beside event-driven triggers. - MLflow Model Registry stage workflows commonly use
None,Staging,Production, andArchived, but production teams should still attach explicit validation tags.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Treating drift as proof that retraining is safe | Drift is a symptom, not a root cause | Run data-quality and root-cause checks before training |
| Training on mutable “latest” data | It is easy to query and hard to reproduce later | Snapshot data with DVC or immutable object paths and record the query |
| Skipping champion comparison | Teams focus on a minimum metric threshold | Compare the candidate and champion on identical holdout and slice windows |
| Ignoring calibration drift | Accuracy stays acceptable while probabilities become misleading | Track calibration error and threshold behavior by segment |
| Launching duplicate workflows | Alertmanager retries, Kafka replay, and schedules overlap | Use debounce windows, sentinel claims, and Argo mutexes |
| Promoting without shadow or canary evidence | Offline validation feels complete | Use KServe smoke tests, shadow comparison, and bounded canary exposure |
| Letting retraining own rollback | One service is easier to build than two control paths | Put rollback in an independent workflow or traffic control plane |
| Forgetting cost and retention | The first demo is small and cheap | Budget GPU hours, shadow duplication, egress, and artifact retention |
1. A drift alert fires for a high-importance feature, but the Great Expectations checkpoint fails on the same dataset because nulls jumped above the reviewed limit. What should the retraining loop do first?
The loop should stop before training and treat the situation as a data-quality or upstream-data incident. Drift shows that the data moved, but the failed checkpoint says the data may not be trustworthy enough for training. The workflow can retain the snapshot and drift report for analysis, but it should not register a candidate trained on data that violates a critical validation gate.
2. Your fraud model has delayed labels that arrive two days after prediction. The drift score is stable, calibration looks worse, and NannyML CBPE estimates that ROC AUC is falling. How should you analyze drift, delayed-label, calibration, and business-performance signals before deciding whether retraining is justified?
It should not wait passively if the cost of staleness is meaningful. CBPE can start a challenger-training decision while the realized labels are delayed, but the candidate should still pass data-quality, holdout, slice, shadow, and eventual realized-performance review before broad promotion. Estimated performance is early warning evidence, not final proof.
3. A scheduled nightly retraining run and an Alertmanager drift webhook submit workflows for the same model and segment within five minutes. What design prevents two expensive candidates from training?
The trigger path should use both concurrency control and a deduplication claim.
Argo synchronization.mutex can prevent overlapping critical sections, while a
sentinel ConfigMap, Lease, or database row can record that a specific
model-segment-window was already claimed. The training tasks must also be
idempotent so a retry does not overwrite or confuse artifacts from the first
run.
4. A challenger improves global accuracy by one percent but reduces recall for a regulated segment by four percent. The product team wants to promote because the overall metric is green. What should the gate do?
The gate should block automatic promotion and require either discard, redesign, or explicit human approval with documented rationale. Global improvement does not erase a critical slice regression, especially in regulated or high-stakes settings. A mature gate defines slice limits before training so teams cannot reinterpret the result after seeing a convenient global win.
5. A KServe canary for the retrained model shows stable accuracy proxies but p99 latency doubles and GPU memory pressure rises. Is this a model-quality issue or a promotion issue?
It is a promotion-blocking serving issue even if model quality appears better. The candidate may be too expensive, too slow, or packaged with a runtime that does not meet the serving objective. The canary should stop expanding, and the team should inspect model size, runtime image, batching, feature-store calls, and GPU allocation before deciding whether to optimize or discard the candidate.
6. A team wants the retraining service to patch its own KServe traffic split back to the champion if it detects failure. Why is that risky?
The failing service might be unavailable, overloaded, or using the same flawed metrics that allowed the bad candidate to promote. Rollback should be owned by a separate workflow, GitOps controller, platform API, or traffic-control layer with its own permissions and audit. The retraining service can emit evidence, but it should not be the only actor able to remove its candidate from traffic.
7. Your team retrains daily because it is easy, but each run costs about `8 USD` and almost every candidate loses to the champion. What should change?
The team should move from unconditional daily training to a scheduled check-and-maybe-train posture or a drift/performance-triggered challenger. The schedule can still run monitoring and validation, but GPU training should require evidence that staleness risk is higher than routine training cost. The candidate-loss history is itself evidence that the trigger threshold is too eager or the data window is not adding value.
Hands-On Exercise: Build a Minimal Closed-Loop Retraining Demo
Section titled “Hands-On Exercise: Build a Minimal Closed-Loop Retraining Demo”Exercise scenario: this lab uses a local churn model to practice the same closed-loop boundaries you would use in production, while keeping the data and promotion decision small enough to inspect by hand.
You operate a small churn model on a local kind or minikube cluster. The
champion model is served through KServe. Recent prediction logs are summarized
by an Evidently job that emits ml_feature_psi_score to Prometheus. When PSI
stays above 0.2 for five minutes, Alertmanager sends a webhook to Argo
Events. The Sensor submits an Argo Workflow. The Workflow validates fresh data
with a Great Expectations checkpoint, trains a candidate, registers it in
MLflow Staging, runs a shadow comparison, and promotes only when
accuracy_delta > 0.
This lab is intentionally small, but it uses the same control points as a real platform. Required steps are marked as required. Optional cost steps use extra compute or install heavier components. If you do not have a cluster with KServe and Argo installed, read the manifests and compare them to your platform’s equivalents; the expected output blocks show the pinned versions used for the exercise.
Exercise Versions
Section titled “Exercise Versions”Use these pinned versions so your local exercise matches the manifests and expected output shown below. Newer versions may work, but verify CRD fields, CLI flags, and Python APIs before changing the pinned set.
Kubernetes: 1.35+KServe: 0.15.xArgo Workflows: 3.6.xArgo Events: 1.9.xKEDA: 2.16.xMLflow: 2.17.xGreat Expectations: 1.5.xEvidently: 0.7.xTask 1: Create the Namespace and Local Tooling
Section titled “Task 1: Create the Namespace and Local Tooling”This task is required because the namespaces separate serving, monitoring, and workflow responsibilities. Keeping those resources separate makes RBAC, cleanup, and troubleshooting easier during the rest of the exercise.
kubectl create namespace mlopskubectl create namespace ml-servingkubectl create namespace monitoringmkdir -p drift-loop-democd drift-loop-demo.venv/bin/python -m pip install \ evidently==0.7.15 \ great-expectations==1.5.11 \ mlflow==2.17.2 \ nannyml==0.13.1 \ pandas==2.2.3 \ prometheus-client==0.21.0 \ scikit-learn==1.5.2The expected output is abbreviated to the resource names and success messages that prove the namespaces and Python packages are available for later steps. Exact package resolver lines can vary by platform.
namespace/mlops creatednamespace/ml-serving creatednamespace/monitoring createdSuccessfully installed evidently great-expectations mlflow nannyml pandas prometheus-client scikit-learnTask 2: Deploy a Baseline KServe InferenceService
Section titled “Task 2: Deploy a Baseline KServe InferenceService”This task is required when KServe is installed because the rest of the loop needs a champion endpoint to protect. If your local cluster lacks KServe, use a small HTTP model server and keep the same champion/challenger traffic idea.
Use an MLflow model artifact that your local cluster can reach. In a quick
exercise, a PVC, MinIO bucket, or local model server can hold the artifact. The
important field is storageUri: the retraining workflow later registers a
candidate and updates traffic rather than overwriting the champion artifact.
apiVersion: serving.kserve.io/v1beta1kind: InferenceServicemetadata: name: churn-router namespace: ml-serving annotations: serving.kserve.io/deploymentMode: Serverlessspec: predictor: minReplicas: 1 model: modelFormat: name: sklearn storageUri: pvc://ml-models/churn/champion resources: requests: cpu: "500m" memory: 1Gi limits: cpu: "1" memory: 2Gikubectl apply -f kserve-churn-champion.yamlkubectl wait --for=condition=Ready inferenceservice/churn-router \ -n ml-serving \ --timeout=300sThe expected output confirms that the InferenceService resource exists and
that KServe marked it ready. If readiness times out, inspect predictor pods,
storage access, and ServingRuntime availability before continuing.
inferenceservice.serving.kserve.io/churn-router createdinferenceservice.serving.kserve.io/churn-router condition metTask 3: Run Evidently and Export PSI to Prometheus
Section titled “Task 3: Run Evidently and Export PSI to Prometheus”This task is required because the trigger path needs a metric, not just a notebook report. The generated drift is intentionally obvious so you can test the loop before tuning realistic thresholds.
Create a tiny data generator and drift exporter. In a production system this job reads prediction logs, request features, model version, and segment labels. For the lab it writes reference and current data locally so the drift is controlled and easy to verify.
from __future__ import annotations
import numpy as npimport pandas as pd
rng = np.random.default_rng(20260511)reference = pd.DataFrame( { "age_days": rng.normal(180, 30, 1000), "orders_30d": rng.poisson(3, 1000), "avg_cart_value": rng.normal(42, 8, 1000), })current = pd.DataFrame( { "age_days": rng.normal(220, 35, 1000), "orders_30d": rng.poisson(7, 1000), "avg_cart_value": rng.normal(51, 12, 1000), })reference.to_parquet("reference.parquet")current.to_parquet("current.parquet").venv/bin/python make_drift_data.py.venv/bin/python drift_window.py.venv/bin/python export_drift_metrics.pyIn another terminal, scrape the exporter directly before you wire it into Prometheus. This isolates Python metric generation from Kubernetes service discovery and keeps the first debugging step small.
curl -sS http://127.0.0.1:9108/metrics | grep ml_feature_psi_scoreThe expected output should contain one Prometheus gauge per monitored feature with bounded labels. The exact PSI values depend on the random sample, but the shifted features should exceed the alert threshold.
ml_feature_psi_score{feature_name="age_days",model_name="churn",model_version="champion-2026-05-01",segment="all"} 0.2...ml_feature_psi_score{feature_name="orders_30d",model_name="churn",model_version="champion-2026-05-01",segment="all"} 0.2...Task 4: Configure Prometheus Alerting and Argo Events
Section titled “Task 4: Configure Prometheus Alerting and Argo Events”This task is required when Prometheus, Alertmanager, and Argo Events are installed because it converts the monitoring signal into workflow submission. If those controllers are absent, read the manifests and use the manual submit command in the next task.
The Prometheus rule requires drift to remain above threshold for five minutes. That duration is the first debounce layer. The Argo Events Sensor and workflow claim are the second layer.
apiVersion: monitoring.coreos.com/v1kind: PrometheusRulemetadata: name: churn-drift-alert namespace: monitoringspec: groups: - name: churn-drift.rules rules: - alert: ChurnFeatureDriftHigh expr: | max_over_time(ml_feature_psi_score{model_name="churn",segment="all"}[5m]) > 0.2 for: 5m labels: severity: warning model_name: churn segment: all annotations: summary: Churn model feature drift remained above threshold description: Drift should start a deduplicated retraining decision workflow.kubectl apply -f prometheus-drift-rule.yamlkubectl apply -f argo-events-eventsource.yamlkubectl apply -f argo-events-sensor.yamlThe expected output confirms that the alert rule, EventSource, and Sensor were accepted by the Kubernetes API. Controller readiness and webhook delivery are cluster-specific, so inspect controller logs if events do not arrive.
prometheusrule.monitoring.coreos.com/churn-drift-alert configuredeventsource.argoproj.io/drift-alert-webhook createdsensor.argoproj.io/drift-retraining-sensor createdTask 5: Create the Retraining Workflow
Section titled “Task 5: Create the Retraining Workflow”This task is required because it defines the retraining action that every trigger source calls. The template is intentionally compact, but the ordering matches the production loop: validate, train, register, shadow, and promote.
The lab WorkflowTemplate is compact. It validates data, trains a candidate,
registers it as MLflow Staging, runs a shadow comparison, and promotes only
when accuracy_delta > 0. The container images are placeholders because every
local cluster packages training differently; the command shape and parameters
are the important part.
apiVersion: argoproj.io/v1alpha1kind: WorkflowTemplatemetadata: name: churn-closed-loop-retraining namespace: mlopsspec: entrypoint: closed-loop synchronization: mutex: name: churn-closed-loop arguments: parameters: - name: model-name value: churn - name: dedupe-key value: churn-all-demo templates: - name: closed-loop dag: tasks: - name: validate-fresh-data template: gx-checkpoint - name: train-candidate dependencies: - validate-fresh-data template: train - name: register-staging dependencies: - train-candidate template: register - name: run-shadow dependencies: - register-staging template: shadow - name: promote dependencies: - run-shadow when: "{{tasks.run-shadow.outputs.parameters.accuracy-delta}} > 0" template: promote - name: gx-checkpoint container: image: ghcr.io/example/gx-runner:0.1.0 command: ["/app/run-checkpoint"] args: - --checkpoint - churn_fresh_data - name: train container: image: ghcr.io/example/churn-trainer:0.1.0 command: ["/app/train"] args: - --model-name - "{{workflow.parameters.model-name}}" - name: register container: image: ghcr.io/example/mlflow-register:0.1.0 command: ["/app/register"] args: - --stage - Staging - name: shadow container: image: ghcr.io/example/shadow-compare:0.1.0 command: ["/app/compare"] outputs: parameters: - name: accuracy-delta valueFrom: path: /tmp/accuracy-delta - name: promote container: image: ghcr.io/example/kserve-promote:0.1.0 command: ["/app/promote"]kubectl apply -f churn-closed-loop-workflowtemplate.yamlargo submit \ --from workflowtemplate/churn-closed-loop-retraining \ -n mlops \ -p model-name=churn \ -p dedupe-key=churn-all-manual-testThe expected output shows both the template creation and a submitted workflow. If the workflow stays pending, inspect service account permissions, image pull events, and Argo controller logs before editing the model code.
workflowtemplate.argoproj.io/churn-closed-loop-retraining createdName: churn-closed-loop-retraining-...Namespace: mlopsServiceAccount: unsetStatus: PendingTask 6: Inject Drift and Watch the Loop
Section titled “Task 6: Inject Drift and Watch the Loop”This task is required because it proves the loop reacts to a known data shift. The manual checks also teach where to look when an alert fires but no workflow appears.
Increase the current-window shift, restart the exporter, and let Prometheus
scrape the new scores. Then watch the alert, Sensor, and Workflow. In a real
cluster, you would inspect Alertmanager and Argo Events logs; in the exercise,
the manual argo submit command proves the workflow path while the alerting
manifests show the automated path.
.venv/bin/python make_drift_data.py.venv/bin/python drift_window.py
kubectl -n monitoring get prometheusrule churn-drift-alertkubectl -n mlops get eventsourceskubectl -n mlops get sensorsargo list -n mlopsargo watch -n mlops @latestThe expected output shows that the Prometheus rule and Argo Events resources exist. A live cluster should also show an alert notification and a workflow submission after the five-minute threshold window.
NAME AGEchurn-drift-alert 1mNAME AGEdrift-alert-webhook 1mNAME AGEdrift-retraining-sensor 1mTask 7: Confirm Promotion or Discard
Section titled “Task 7: Confirm Promotion or Discard”This task is required because promotion is the point where the loop can create user impact. A successful training run is not enough; the workflow must either promote with evidence or leave the champion untouched.
The candidate promotes only when the shadow comparison writes an
accuracy_delta greater than zero. If the candidate fails, the workflow should
leave the champion serving and record the discard reason. If it passes, the
promotion step updates MLflow and KServe traffic according to your local
implementation.
argo get -n mlops @latestkubectl -n ml-serving describe inferenceservice churn-routerThe expected output is intentionally minimal because local promotion mechanics
vary. Your evidence should show a succeeded workflow and either a promotion
decision or a discard reason tied to accuracy_delta.
Phase: Succeededpromotion.decision: promotedOptional Task: Try KEDA Instead of Argo Events
Section titled “Optional Task: Try KEDA Instead of Argo Events”This task is optional because KEDA adds another controller path and can create jobs while the metric remains above threshold. Use it when you want to compare metric-driven jobs with webhook-driven Argo Events.
Apply the ScaledJob from the trigger section and point it at your Prometheus
service. Keep maxReplicaCount: 1 and keep the workflow-level dedupe claim.
Then raise and lower the PSI metric to verify that the job appears only while
the threshold is exceeded.
kubectl apply -f keda-scaledjob-drift-trigger.yamlkubectl -n mlops get scaledjob drift-retraining-triggerkubectl -n mlops get jobs -l app.kubernetes.io/name=drift-retraining-triggerThe expected output confirms that KEDA accepted the ScaledJob. Job creation
depends on the Prometheus query result, polling interval, and whether the
metric remains above threshold long enough for the scaler to react.
scaledjob.keda.sh/drift-retraining-trigger createdSuccess Criteria
Section titled “Success Criteria”- A baseline KServe
InferenceServiceexists for the champion model or you can explain which local serving substitute you used. - Evidently produces a drift report and a per-feature PSI JSON file for reference and current windows.
- Prometheus can scrape
ml_feature_psi_scorewith bounded labels. - Alertmanager or a manual equivalent submits an Argo Workflow through the retraining template.
- The Workflow validates fresh data before training the candidate.
- MLflow receives a Staging candidate with validation tags or you can show the equivalent registry metadata.
- Shadow comparison records
accuracy_delta, and promotion happens only when the delta is positive. - You can describe the rollback actor and show that it is separate from the retraining service.
Sources
Section titled “Sources”- Evidently Data Drift Preset
- Evidently Report API
- Evidently Output Formats
- NannyML Performance Estimation
- NannyML Comparing Estimated and Realized Performance
- Argo Events
- Argo Events Workflow Trigger
- Argo Events Webhook EventSource
- KEDA Prometheus Scaler
- Argo Workflows CronWorkflows
- Argo Workflows WorkflowTemplates
- Kubeflow Pipelines
- MLflow Model Registry
- Seldon Alibi Detect
- Prometheus Alertmanager
- Prometheus Exposition Formats
- Great Expectations Core Introduction
- KServe Canary Rollout Strategy
Next Module
Section titled “Next Module”Module 5.12: CML for ML CI is in progress; it will show how to bring model tests, reports, metrics, and reviewer evidence into pull-request automation.