Module 3.4: Drift Detection and Remediation
Module 3.4: Drift Detection and Remediation
Section titled “Module 3.4: Drift Detection and Remediation”Discipline Module | Complexity:
[MEDIUM]| Time: 35-45 min | Track: Platform Engineering / GitOps
Prerequisites
Section titled “Prerequisites”Before starting this module, you should already be comfortable with the GitOps loop from Module 3.1: What is GitOps? and the promotion patterns from Module 3.3: Environment Promotion. You should also be able to inspect Kubernetes workloads with kubectl; after this module introduces the common Kubernetes shorthand, examples may use k as the same command.
This module assumes Kubernetes 1.35+ behavior for server-side defaults, controller reconciliation, managed fields, and autoscaling resources. The examples focus on ArgoCD and Flux because those are common GitOps controllers, but the diagnostic method applies to any reconciler that compares desired state with observed state.
Learning Outcomes
Section titled “Learning Outcomes”After completing this module, you will be able to:
- Analyze the difference between harmful drift, expected controller-managed variance, and harmless runtime metadata changes in Kubernetes resources.
- Design drift detection signals that separate urgent production risk from normal autoscaler, webhook, and status-field behavior.
- Debug an out-of-sync GitOps application by comparing desired manifests, live resources, controller events, and ownership metadata.
- Evaluate whether to auto-heal, alert, ignore, or backport a drifted change into Git based on operational risk and source-of-truth confidence.
- Build a remediation runbook that preserves GitOps discipline while still allowing controlled emergency response.
Why This Module Matters
Section titled “Why This Module Matters”A platform team believed Git described production, so every incident review started by reading the repository. During one outage, the repository said the checkout service had conservative CPU limits, but production was running with much higher limits that had been added during a previous emergency. The team spent hours tuning database pools and retry logic before realizing that the real workload shape did not match the file everyone trusted.
That failure was not just a missing commit; it was a broken source-of-truth contract. GitOps gives teams a powerful promise: if the repository is reviewed, versioned, and reconciled, then production should be explainable from code. Drift breaks that promise quietly. The cluster may keep serving traffic, dashboards may remain green, and deployments may continue, while the system becomes harder to reason about each day.
Drift detection is the practice that keeps GitOps honest. It is not only a tool feature that marks an application OutOfSync; it is a discipline for deciding which differences matter, which differences are expected, and which differences prove that the operating model has been bypassed. Senior platform engineers do not chase every diff blindly. They classify drift, design tolerances, preserve auditability, and choose the least dangerous correction.
This module teaches drift as an operational decision problem rather than a simple equality check. You will learn how to compare Git and cluster state, how controllers create legitimate differences, how ArgoCD and Flux report divergence, and how to decide whether the right answer is sync, alert, ignore, or commit the live change back to Git.
Core Content
Section titled “Core Content”1. Drift Is a Broken Agreement, Not Just a Diff
Section titled “1. Drift Is a Broken Agreement, Not Just a Diff”Drift occurs when the desired state declared in Git differs from the actual state running in the cluster. That definition sounds simple, but the engineering challenge is deciding whether the difference represents danger, normal Kubernetes behavior, or a missing rule in the GitOps configuration. A useful drift process starts with intent: what should Git own, what should a controller own, and what should nobody edit directly?
graph LR subgraph Git["Desired State in Git"] direction TB G1["Deployment replicas: 3"] G2["Image: registry.example.com/api:v1.2.3"] G3["Memory limit: 512Mi"] G4["NetworkPolicy: enabled"] end
subgraph Cluster["Observed State in Cluster"] direction TB C1["Deployment replicas: 9"] C2["Image: registry.example.com/api:v1.2.3"] C3["Memory limit: 2Gi"] C4["NetworkPolicy: enabled"] end
Git -. "comparison finds differences" .-> ClusterThe example above has two differences, but they should not be treated the same. Replica count may be owned by a HorizontalPodAutoscaler during a traffic surge, while memory limits are usually part of the deployment contract and should remain under Git review. Treating both as equally urgent creates alert fatigue; treating both as harmless creates hidden production risk.
A GitOps controller usually compares rendered manifests with live objects, then reports whether the application is synchronized. The rendered manifest may come from plain YAML, Kustomize, Helm, Jsonnet, or another generator, but the comparison problem stays the same. The controller asks, “If I applied what Git currently declares, would the cluster change?”
+----------------------+ render +----------------------+| Git repository |------------------>| Desired manifest set || branch, commit, path | | normalized YAML |+----------------------+ +----------+-----------+ | | compare v+----------------------+ observe +----------+-----------+| Kubernetes API |------------------>| Live resource state || objects and fields | | selected fields |+----------------------+ +----------------------+The comparison is not a plain text diff between files. Kubernetes adds fields such as uid, resourceVersion, generation, timestamps, status, and managed ownership details after an object is created. Admission webhooks may also default or mutate fields before storage. A drift detector that treats every runtime field as an application defect will be noisy enough that humans stop trusting it.
The better mental model is ownership. Git should own the fields that define application intent, such as images, ports, policies, resource requests, and most environment variables. Kubernetes should own identity and status fields. Autoscalers, operators, and admission controllers may own specific fields when the platform has intentionally delegated that responsibility.
Pause and predict: A deployment is declared in Git with
replicas: 3, an HPA targets the deployment with a maximum of15, and live state showsreplicas: 11. Before reading further, decide whether a GitOps controller should auto-heal this difference. Your answer should mention who owns the replica field during normal traffic and what outage could happen if the controller wins the field fight.
A field fight happens when two controllers believe they own the same field and keep overwriting each other. In a GitOps environment, strict self-healing may set replicas back to 3 every reconciliation interval, while the HPA scales back to 11 based on metrics. The cluster can appear healthy in brief snapshots but still oscillate under load, which is worse than a clean failure because it hides the cause inside competing automation.
Drift detection therefore requires an ownership map. The map does not need to be a separate system at first; it can be encoded as ignore rules, policies, alerts, and runbooks. The important step is making ownership explicit enough that drift signals mean something actionable.
| Drift category | Typical owner | Example difference | Default response |
|---|---|---|---|
| Desired configuration drift | Git and review process | Image tag, resource limit, env var, security context changed live | Investigate quickly, then sync to Git or backport to Git |
| Runtime metadata drift | Kubernetes API server | uid, resourceVersion, creationTimestamp, status | Exclude from comparison and do not alert |
| Controller-managed drift | HPA, VPA, operator, webhook | Replica count, injected annotation, generated secret reference | Ignore specific fields only when ownership is intentional |
| Orphan resource drift | GitOps inventory and pruning policy | Live object remains after deletion from Git | Prune automatically where safe, alert in sensitive namespaces |
| Sync availability drift | GitOps controller and infrastructure | Agent cannot fetch Git or cannot apply manifests | Alert on controller health before changing application state |
This table is the first decision filter. It prevents a common beginner mistake: assuming that every OutOfSync result means the cluster is wrong. Sometimes Git is stale, sometimes the cluster has a manual change, and sometimes the comparison rules are incomplete. The job is to prove which case you are in before remediation changes production.
2. Why Drift Happens in Real Clusters
Section titled “2. Why Drift Happens in Real Clusters”Drift is usually caused by one of four mechanisms: manual intervention, another controller, Kubernetes defaulting, or reconciliation failure. Each mechanism leaves different evidence, so the fastest debugging path starts by identifying the likely source. When you know the mechanism, you can decide whether the fix belongs in Git, RBAC, ignore rules, controller configuration, or incident process.
flowchart TD A["Drift detected"] --> B{"Who changed the field?"} B --> C["Human or script used kubectl, Helm, or API"] B --> D["Controller or operator reconciled its own intent"] B --> E["Kubernetes defaulted or recorded runtime state"] B --> F["GitOps agent failed to sync, prune, or fetch source"] C --> G["Audit access, backport or revert, tighten production permissions"] D --> H["Assign ownership and configure precise ignore rules"] E --> I["Normalize comparison and exclude runtime-only fields"] F --> J["Fix credentials, network, CRDs, hooks, or controller health"]Manual drift is the most culturally dangerous source because it trains teams to bypass the delivery path. A person runs kubectl edit, kubectl scale, kubectl set image, or a one-off Helm command directly against production. The change may be rational during an incident, but if it is not backported into Git or reverted deliberately, the repository stops describing reality.
kubectl scale deployment checkout-api --replicas=8 -n prodkubectl set image deployment/checkout-api app=registry.example.com/checkout:v1.2.4-hotfix -n prodkubectl edit configmap checkout-settings -n prodThose commands are not inherently evil; they are dangerous when they become invisible. In a mature platform, emergency manual changes are either blocked by RBAC, routed through an audited break-glass role, or followed by a required backport task. Drift detection gives the team evidence that the contract was bypassed, but policy and process decide whether the bypass happens again.
Controller drift happens when another reconciler is supposed to modify a field. HPAs adjust replica counts, cert-manager writes certificate status and secret data, service meshes inject sidecars or annotations, and operators manage child objects from higher-level custom resources. If GitOps owns the parent object while an operator owns generated children, the comparison must respect that boundary.
apiVersion: apps/v1kind: Deploymentmetadata: name: checkout-api namespace: prodspec: replicas: 3 selector: matchLabels: app: checkout-api template: metadata: labels: app: checkout-api spec: containers: - name: app image: registry.example.com/checkout:v1.2.3 resources: requests: cpu: 250m memory: 256Mi limits: cpu: 1000m memory: 512MiIf this deployment is paired with an HPA, the replicas field becomes conditional. Git may still set the initial replica count, but day-to-day scaling belongs to the autoscaler. A precise drift configuration ignores only /spec/replicas for this workload, not the entire deployment, because the image, resources, selector, and security fields still matter.
Kubernetes internal changes are the least interesting but the most common source of noisy diffs. The API server writes object identity, versioning, timestamps, managed fields, and status. Controllers update observed generation, conditions, and readiness data. These fields belong in live state; they do not belong in Git for ordinary application manifests.
metadata: name: checkout-api namespace: prod uid: 6f6a6f5c-2b55-4b1e-9a0e-21a8f0b98c52 resourceVersion: "826103" generation: 6 creationTimestamp: "2026-04-26T08:10:00Z"status: observedGeneration: 6 replicas: 8 readyReplicas: 8Sync availability drift is different because the cluster may not have been changed by anyone. If the GitOps controller cannot reach the repository, authenticate to Git, render Kustomize, apply a CRD, or pass an admission policy, the desired state may advance in Git while the cluster remains on an older revision. In that case, the drift is a delivery failure rather than an unauthorized edit.
graph LR Git["Git commit main@abc123"] --> Agent["GitOps controller"] Agent -->|apply denied by policy| API["Kubernetes API"] API --> Live["Live state remains main@old789"] Agent --> Status["Application status: OutOfSync or degraded"]A production-grade drift program monitors both application drift and controller health. If the controller is down, a clean sync status is not reassuring because no one is comparing the current commit to the cluster. If the controller is healthy but one resource is out of sync, the investigation narrows to ownership, failed apply, or intentional mutation.
| Cause | Evidence to inspect | Likely fix | Risk if ignored |
|---|---|---|---|
| Manual production change | Audit logs, events, kubectl.kubernetes.io/last-applied-configuration, user reports | Revert or backport, then restrict direct write paths | Git becomes a suggestion instead of the source of truth |
| Autoscaler field ownership | HPA object, metrics, deployment replica diff only | Ignore /spec/replicas for that managed workload | Self-healing may fight autoscaling during traffic spikes |
| Admission mutation | Diff shows injected labels, annotations, defaults, sidecars | Configure targeted ignore rules or move defaults into Git | Endless sync loops and noisy alerts reduce trust |
| Operator reconciliation | Owner references, custom resources, generated children | Manage the parent resource and exclude operator-owned children | GitOps may delete or rewrite resources the operator needs |
| Git or render failure | Controller logs, source revision, auth errors, dry-run output | Fix credentials, dependencies, CRDs, or rendering inputs | New Git commits do not reach production |
| Prune disabled or blocked | Resource exists live but not in rendered manifests | Enable prune where safe, or document retained resources | Old objects keep affecting traffic, policy, or cost |
This cause table is also a prioritization tool. Drift caused by a stale status field is not a pager. Drift caused by a production image changed outside review may be an urgent security and audit issue. Drift caused by a failed Git fetch may block every future deployment, even if the running workload looks fine right now.
3. A Worked Example: Debugging a Drift Alert Without Guessing
Section titled “3. A Worked Example: Debugging a Drift Alert Without Guessing”The safest way to debug drift is to compare desired state, live state, ownership, and recent events in that order. Jumping directly to “sync” is tempting because GitOps tools make it easy, but a sync is a write operation against production. Before writing, prove whether Git is right, the cluster is right, or both are incomplete.
Assume the platform receives an alert: checkout-api has been out of sync for more than ten minutes. The service is healthy, but a diff shows that production has a higher memory limit than Git. A junior operator wants to click self-heal immediately. A senior operator slows the process down because resource limits can crash a service if reverted blindly.
First, render the desired state from Git. The exact renderer depends on your repository, but the goal is to inspect what the GitOps controller should apply. This example uses Kustomize because it is common in GitOps repositories.
mkdir -p /tmp/kubedojo-drift-demo/desiredcat > /tmp/kubedojo-drift-demo/desired/deployment.yaml <<'YAML'apiVersion: apps/v1kind: Deploymentmetadata: name: checkout-api namespace: prodspec: replicas: 3 selector: matchLabels: app: checkout-api template: metadata: labels: app: checkout-api spec: containers: - name: app image: registry.example.com/checkout:v1.2.3 resources: requests: cpu: 250m memory: 256Mi limits: cpu: 1000m memory: 512MiYAML
cat > /tmp/kubedojo-drift-demo/desired/kustomization.yaml <<'YAML'resources: - deployment.yamlYAML
kubectl kustomize /tmp/kubedojo-drift-demo/desiredNext, inspect the live object. In a real cluster, you would run kubectl get deployment checkout-api -n prod -o yaml. For a local learning exercise, the following command creates a live-state file that mirrors the kind of difference you might see in production.
mkdir -p /tmp/kubedojo-drift-demo/livecat > /tmp/kubedojo-drift-demo/live/deployment.yaml <<'YAML'apiVersion: apps/v1kind: Deploymentmetadata: name: checkout-api namespace: prod uid: 6f6a6f5c-2b55-4b1e-9a0e-21a8f0b98c52 resourceVersion: "826103" generation: 6 creationTimestamp: "2026-04-26T08:10:00Z"spec: replicas: 3 selector: matchLabels: app: checkout-api template: metadata: labels: app: checkout-api spec: containers: - name: app image: registry.example.com/checkout:v1.2.3 resources: requests: cpu: 250m memory: 256Mi limits: cpu: 1000m memory: 2Gistatus: observedGeneration: 6 replicas: 3 readyReplicas: 3YAMLA raw diff will show both meaningful and meaningless differences. Runtime metadata and status are expected, while the memory limit deserves investigation. Normalization removes fields that Kubernetes owns so the remaining diff is easier to reason about.
yq eval 'del(.metadata.uid, .metadata.resourceVersion, .metadata.generation, .metadata.creationTimestamp, .status)' \ /tmp/kubedojo-drift-demo/live/deployment.yaml \ > /tmp/kubedojo-drift-demo/live/normalized.yaml
kubectl kustomize /tmp/kubedojo-drift-demo/desired \ > /tmp/kubedojo-drift-demo/desired/rendered.yaml
diff -u /tmp/kubedojo-drift-demo/desired/rendered.yaml /tmp/kubedojo-drift-demo/live/normalized.yamlThe meaningful diff is the memory limit. At this point, the wrong question is “How do I make the diff disappear?” The right question is “Which state is safe and intended?” If Git says 512Mi but production has survived for weeks on 2Gi, syncing to Git might reintroduce OOM kills. If the production value was an accidental edit, leaving it live makes future deployments unpredictable.
Now inspect who or what could have changed the field. Kubernetes events may not retain old history long enough, but they are still a useful first check. If audit logs are enabled, they are stronger evidence because they can identify the actor and API verb.
kubectl describe deployment checkout-api -n prodkubectl get events -n prod --field-selector involvedObject.name=checkout-api --sort-by=.lastTimestampkubectl auth can-i update deployments -n prod --as=alice@example.comIf your shell defines alias k=kubectl, the same checks can be shorter: k describe deployment checkout-api -n prod and k get events -n prod --field-selector involvedObject.name=checkout-api --sort-by=.lastTimestamp. The alias is convenient during incident work, but runbooks should define it once so responders do not wonder whether k is a custom tool.
A careful remediation record includes the observed diff, the suspected cause, the risk of each response, and the chosen action. If the higher memory limit was necessary, the team should update Git through review and annotate the incident or pull request. If the higher value was an unauthorized change, the team should sync to Git, then adjust RBAC or break-glass controls to prevent recurrence.
What would happen if: You clicked “Sync” immediately after seeing the memory drift, and the live
2Gilimit was an uncommitted production hotfix for repeated OOM kills. Write down the blast radius before continuing: which pods restart, which SLOs might burn, which dashboards would confirm the rollback was unsafe, and which team should own the follow-up commit?
The worked example shows why drift detection is not a button-clicking exercise. A reconciler can restore Git quickly, but it cannot decide whether Git is stale, unsafe, or incomplete. Human judgment is still required when the difference affects capacity, security, traffic routing, data retention, or identity.
4. Reading Drift Signals in ArgoCD and Flux
Section titled “4. Reading Drift Signals in ArgoCD and Flux”ArgoCD usually presents drift through sync status. An application can be Synced or OutOfSync, and it can separately be Healthy, Progressing, Degraded, or another health state. That distinction matters because an application may be healthy while drifted, or synchronized while unhealthy. Drift answers “does live match desired,” while health answers “is the workload behaving.”
graph TD subgraph ArgoCD["ArgoCD Application: checkout-api"] direction TB App["Application object"] Sync["Sync status: OutOfSync"] Health["Health status: Healthy"] Revision["Target revision: main@abc123"] Diff["Diff: memory limit 512Mi -> 2Gi"] App --> Sync App --> Health App --> Revision Sync --> Diff endUse the ArgoCD CLI to read status before writing changes. The first command tells you which resources are out of sync, the second shows the diff, and the third filters the portfolio for applications that need attention. In production, pair these commands with the UI, alerts, and controller logs so the investigation is not dependent on one interface.
argocd app get checkout-apiargocd app diff checkout-apiargocd app list --sync-status OutOfSyncargocd app history checkout-apiArgoCD also exposes metrics that can feed Prometheus alerts. A useful alert should wait long enough to avoid transient reconciliation noise but not so long that drift becomes normal. The labels in your environment may differ by chart and version, so confirm metric names from your running controller before making the alert a paging rule.
groups: - name: gitops-drift rules: - alert: ArgoCDApplicationOutOfSync expr: argocd_app_info{sync_status="OutOfSync"} == 1 for: 10m labels: severity: warning team: platform annotations: summary: "ArgoCD application is out of sync" description: "Application {{ $labels.name }} has reported OutOfSync for more than ten minutes."Flux reports reconciliation through Kubernetes custom resources and status conditions. A Kustomization or HelmRelease has conditions such as Ready, Reconciling, and failure reasons. Flux is CLI-first, but the diagnostic question is the same: what revision did it try, what revision is applied, and which object blocked convergence?
flux get kustomizations --all-namespacesflux get helmreleases --all-namespaceskubectl get kustomization checkout-api -n flux-system -o yamlkubectl describe kustomization checkout-api -n flux-systemA healthy Flux status includes the last attempted and last applied revision. If those revisions differ, Flux may be unable to apply the current Git state. If they match but live resources differ, a controller, manual edit, or ignored field may be involved. Read conditions closely because “ready” and “applied latest commit” are related but not identical ideas.
apiVersion: kustomize.toolkit.fluxcd.io/v1kind: Kustomizationmetadata: name: checkout-api namespace: flux-systemspec: interval: 5m path: ./apps/checkout/overlays/prod prune: true sourceRef: kind: GitRepository name: platform-configstatus: conditions: - type: Ready status: "True" reason: ReconciliationSucceeded message: "Applied revision: main@sha1:abc123" lastAppliedRevision: main@sha1:abc123 lastAttemptedRevision: main@sha1:abc123Flux alerting often focuses on controller conditions rather than a single “sync status” label. You can watch gotk_reconcile_condition metrics, Kubernetes events, or notification-controller messages. The key design choice is to alert on persistent failure and meaningful drift, not every short reconciliation attempt.
groups: - name: flux-reconciliation rules: - alert: FluxKustomizationNotReady expr: gotk_reconcile_condition{kind="Kustomization",type="Ready",status="False"} == 1 for: 10m labels: severity: warning team: platform annotations: summary: "Flux Kustomization is not ready" description: "Kustomization {{ $labels.name }} in {{ $labels.namespace }} has not reconciled successfully."| Signal | ArgoCD example | Flux example | Interpretation |
|---|---|---|---|
| Desired revision | Target revision in application details | lastAttemptedRevision | What Git state the controller tried to apply |
| Applied revision | Sync history and live object tracking | lastAppliedRevision | What Git state appears to be applied successfully |
| Drift flag | OutOfSync | Not-ready condition, diff tooling, inventory mismatch | Live state does not match desired or reconciliation failed |
| Workload health | ArgoCD health assessment | Kubernetes readiness and custom health checks | Application behavior may be good or bad independent of drift |
| Controller health | Repo server, application controller, Redis, API access | source-controller, kustomize-controller, helm-controller | Drift signals are trustworthy only if controllers are running |
| Diff detail | argocd app diff | flux diff, kubectl diff, rendered manifest comparison | The exact fields that changed and need classification |
The senior-level habit is to separate signal layers. Do not page an application team because the GitOps controller restarted once. Do not ignore an application image drift just because pods are currently ready. Do not let a false-positive annotation diff train responders to mute all GitOps alerts. A drift signal is useful only when it points to an expected decision.
5. Designing the Response: Auto-Heal, Alert, Ignore, or Backport
Section titled “5. Designing the Response: Auto-Heal, Alert, Ignore, or Backport”Every drift response chooses a source of truth. Auto-heal says Git is correct and the cluster should be rewritten. Alert-only says the difference requires human judgment before changing state. Ignore says the difference is expected and should not be considered drift. Backport says the live cluster change is correct and Git must be updated to restore the source-of-truth contract.
flowchart TD A["Drift detected"] --> B{"Is the difference expected and delegated?"} B -- "Yes" --> C["Add precise ignore rule"] B -- "No" --> D{"Is Git known to be safe and current?"} D -- "Yes" --> E["Auto-heal or manual sync to Git"] D -- "No" --> F{"Is live state safer than Git?"} F -- "Yes" --> G["Backport live intent into Git through review"] F -- "No" --> H["Investigate before writing production"] H --> I["Check diff, events, audit logs, metrics, owner"] I --> DAuto-healing is powerful when the team has high confidence that Git is reviewed, tested, and current. It is appropriate for many stateless application fields, policy resources, and lower-risk environments. It is dangerous when Git lags behind emergency fixes, when another controller legitimately owns a field, or when an automated rollback could cause a larger incident than the drift itself.
apiVersion: argoproj.io/v1alpha1kind: Applicationmetadata: name: checkout-api namespace: argocdspec: project: default source: repoURL: https://git.example.com/platform/apps.git targetRevision: main path: apps/checkout/overlays/prod destination: server: https://kubernetes.default.svc namespace: prod syncPolicy: automated: prune: true selfHeal: true allowEmpty: falsePruning deserves special attention. selfHeal reverts live changes to tracked objects, while prune deletes live objects that are no longer declared. Pruning is essential for preventing orphaned services, policies, and roles, but it can be destructive when ownership is unclear. Mature teams use pruning broadly after inventory is clean, while protecting shared namespaces and generated resources with explicit ownership rules.
Flux uses reconciliation intervals and pruning behavior rather than the same ArgoCD sync policy model. A Kustomization with prune: true will remove objects previously applied by that Kustomization when they disappear from source. The controller should be given a clear inventory boundary so it does not fight other systems.
apiVersion: kustomize.toolkit.fluxcd.io/v1kind: Kustomizationmetadata: name: checkout-api namespace: flux-systemspec: interval: 5m path: ./apps/checkout/overlays/prod prune: true sourceRef: kind: GitRepository name: platform-config targetNamespace: prodAlert-only mode is not a weak GitOps posture when used deliberately. Some production changes should not be rewritten automatically because the cost of a wrong revert is high. Databases, traffic routing, identity providers, network policies, and emergency resource tuning often deserve human confirmation even when the eventual answer is to restore Git.
groups: - name: gitops-critical-drift rules: - alert: CriticalApplicationDrift expr: argocd_app_info{sync_status="OutOfSync",project="critical"} == 1 for: 5m labels: severity: page response: human-review-required annotations: summary: "Critical GitOps application drift requires review" description: "Application {{ $labels.name }} is out of sync and belongs to a critical project."Ignore rules are safe only when they are narrow. Ignoring all differences on a deployment because an HPA changes replicas hides image changes, environment changes, security-context changes, and accidental edits. A good ignore rule names the group, kind, optional object name, and exact JSON pointer or expression for the delegated field.
apiVersion: argoproj.io/v1alpha1kind: Applicationmetadata: name: checkout-api namespace: argocdspec: project: default source: repoURL: https://git.example.com/platform/apps.git targetRevision: main path: apps/checkout/overlays/prod destination: server: https://kubernetes.default.svc namespace: prod ignoreDifferences: - group: apps kind: Deployment name: checkout-api namespace: prod jsonPointers: - /spec/replicas - group: autoscaling kind: HorizontalPodAutoscaler name: checkout-api namespace: prod jqPathExpressions: - .statusSome differences should be moved into Git instead of ignored. If an admission webhook always adds a label that your organization requires, it may be better to add that label to the manifest so reviewers see it. If a mutating webhook injects a runtime-only timestamp, an ignore rule is better. The design question is whether the field carries intentional configuration or operational bookkeeping.
| Response | Choose it when | Avoid it when | Verification |
|---|---|---|---|
| Auto-heal | Git is trusted, tested, and should override direct edits | Another controller owns the field or Git may be stale | Diff clears, rollout remains healthy, no repeated field fight |
| Alert-only | Human judgment is required before writing production | The drift is routine and safe to fix automatically | Alert includes resource, diff, owner, and runbook link |
| Ignore | A specific field is intentionally managed outside Git | The ignore hides broad configuration or security changes | Only the expected field disappears from drift reports |
| Backport to Git | Live state is correct and Git is incomplete | Live state is accidental, unreviewed, or unsafe | Pull request explains evidence and future desired state |
| Prune | Live orphan resources are within GitOps ownership | Shared or generated resources may be deleted incorrectly | Inventory confirms resource was formerly managed by the app |
| Escalate | Ownership, safety, or impact is unclear | A low-risk automated policy already covers the case | Incident notes identify the decision maker and next action |
Backporting is often the most overlooked response. Engineers sometimes treat “cluster differs from Git” as proof that the cluster must be wrong. In real incidents, the cluster may contain a necessary emergency fix that never made it back to code. Backporting restores GitOps by turning a live discovery into a reviewed change, not by pretending the emergency never happened.
6. Preventing Recurrence With Policy, Runbooks, and Feedback Loops
Section titled “6. Preventing Recurrence With Policy, Runbooks, and Feedback Loops”Drift detection is reactive; drift prevention changes how teams operate. The prevention layer includes access control, admission policy, controller configuration, emergency workflow, and periodic review. Without those controls, the same alert will recur and responders will waste time rediscovering the same cause.
Start with RBAC. Most developers do not need write access to production Deployments, ConfigMaps, Services, or RBAC objects if GitOps is the delivery path. They may need read access, logs, port-forwarding in limited namespaces, or a break-glass role. The fewer direct write paths exist, the less manual drift you need to explain later.
apiVersion: rbac.authorization.k8s.io/v1kind: Rolemetadata: name: production-readonly namespace: prodrules: - apiGroups: ["", "apps", "batch", "networking.k8s.io", "autoscaling"] resources: ["pods", "services", "configmaps", "deployments", "jobs", "networkpolicies", "horizontalpodautoscalers"] verbs: ["get", "list", "watch"]RBAC alone is not enough because automation accounts, CI jobs, and operators may still write resources. Admission policy can require ownership labels, block direct changes to protected resources, or allow writes only from known service accounts. The goal is not to make incidents impossible to handle; it is to make exceptional writes visible, intentional, and reviewable.
A practical runbook should make the first five minutes of drift response boring. It should tell responders where to find the diff, how to identify the owner, which fields are expected, how to decide between sync and backport, and when to page an application team. Boring runbooks reduce panic-driven syncs.
+-------------------+ +-------------------+ +-------------------+| 1. Read alert | --> | 2. Classify diff | --> | 3. Check owner || app, resource | | harmful/expected | | Git/controller |+-------------------+ +-------------------+ +-------------------+ | | | v v v+-------------------+ +-------------------+ +-------------------+| 4. Choose action | --> | 5. Verify result | --> | 6. Prevent again || sync/backport/etc | | health and diff | | RBAC/rule/process|+-------------------+ +-------------------+ +-------------------+Periodic drift review catches weak patterns before they become incidents. A weekly review might look at the number of out-of-sync applications, repeated ignored fields, manual production writes, prune candidates, and alerts that were acknowledged but not resolved. The value is not the meeting; the value is turning noisy signals into better ownership rules.
| Prevention control | What it prevents | Implementation hint | Failure mode to watch |
|---|---|---|---|
| Production read-only RBAC | Casual kubectl edit and kubectl scale changes | Grant write access only to GitOps controllers and break-glass roles | Teams create shadow automation with broad credentials |
| Break-glass workflow | Invisible emergency edits | Require ticket, expiration, audit log, and backport decision | The process is too slow, so responders bypass it |
| Precise ignore catalog | Repeated false-positive drift alerts | Store ignore rules near the application or platform policy | Broad ignores hide real configuration drift |
| Drift alert routing | Unowned alerts and delayed response | Route by application label, project, or namespace owner | Every alert goes to platform, creating bottlenecks |
| Prune inventory review | Orphan services, roles, and policies | Enable prune after confirming ownership boundaries | Shared objects are deleted because inventory was unclear |
| Post-incident backport check | Production fixes missing from Git | Add a drift item to incident closure criteria | Incidents close before configuration truth is restored |
| Controller health monitoring | Silent loss of reconciliation | Alert on GitOps controller readiness and source fetch failures | Teams trust stale sync status while controllers are down |
There is one final senior-level nuance: the goal is not mathematically zero drift at every instant. Kubernetes is a reconciled system, so short-lived differences can happen during rollouts, autoscaling, admission, and controller restarts. The goal is known drift with explicit ownership, fast detection of harmful divergence, and reliable restoration of the source-of-truth contract.
Did You Know?
Section titled “Did You Know?”-
Autoscaling creates intentional field movement. A HorizontalPodAutoscaler does not violate GitOps simply because the live replica count changes; it violates GitOps only if the platform never declared that the autoscaler owns that field.
-
A synchronized application can still be broken. Sync status compares desired and live configuration, while health depends on runtime behavior such as readiness, error rates, dependencies, and capacity.
-
Orphaned resources are drift too. A Service, RoleBinding, or NetworkPolicy left behind after deletion from Git can keep affecting traffic or permissions even when the main workload looks clean.
-
Backporting can be remediation. If production was manually changed for a valid emergency reason, the correct fix may be a reviewed Git commit that captures the new intent rather than an automatic rollback.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Why it hurts | Better practice |
|---|---|---|
Treating every OutOfSync result as an emergency | Teams page on harmless status or metadata differences and eventually ignore real drift | Classify by ownership, field, resource criticality, and duration before choosing severity |
| Enabling self-heal without ignore rules for autoscaled workloads | The GitOps controller can fight HPA or VPA behavior during peak traffic | Ignore only the delegated fields, then verify the application still reports meaningful drift |
| Ignoring entire resources to silence noisy diffs | Image, security, policy, or environment changes can hide behind a broad ignore rule | Use object-specific JSON pointers or expressions and review ignores periodically |
| Clicking sync before investigating long-lived production drift | Git may be stale, and reverting live state can reintroduce an old outage | Compare metrics, events, incidents, and ownership before deciding whether to sync or backport |
| Monitoring application drift but not controller health | A failed GitOps controller can make stale state look stable because no reconciliation is happening | Alert on controller readiness, source fetch failures, apply failures, and queue backlog |
| Leaving manual emergency changes outside Git | Future deploys become unpredictable and incident reviews lose the real configuration history | Require a post-incident decision to revert, backport, or document the exception |
| Disabling prune everywhere out of fear | Deleted services, roles, policies, and ingresses can keep affecting production after Git cleanup | Enable prune with clear ownership boundaries and protect shared generated resources explicitly |
| Routing all drift alerts to the platform team | Application owners lose accountability for their own desired state and platform becomes a bottleneck | Route alerts using labels, projects, namespaces, and service ownership metadata |
Quiz: Check Your Understanding
Section titled “Quiz: Check Your Understanding”Question 1
Section titled “Question 1”Your team deploys orders-api with ArgoCD. Git declares replicas: 3, live state shows replicas: 12, and an HPA exists with minReplicas: 3 and maxReplicas: 20. The application is healthy, traffic is elevated, and the only diff is /spec/replicas. What should you do?
Show Answer
Do not auto-heal the deployment back to three replicas. The HPA is intentionally managing the replica field during the traffic spike, so the correct response is to configure a precise ignore rule for /spec/replicas on that autoscaled deployment. You should still verify that the HPA object is expected, metrics are healthy, and no other fields have drifted. A broad ignore on the whole deployment would be unsafe because it could hide image, resource, or security-context drift.
Question 2
Section titled “Question 2”A payment service has been stable for six weeks with a live memory limit of 2Gi, but Git still declares 512Mi. ArgoCD reports the application as OutOfSync, and a teammate suggests pressing “Sync” because Git is supposed to be the source of truth. How should you proceed?
Show Answer
Investigate before syncing because the live value may be an uncommitted emergency fix. Compare recent incidents, OOM kill history, performance metrics, deployment events, and audit logs to determine why the limit changed. If the higher limit is required, open a reviewed Git change that updates the manifest and explains the evidence. If the live value is accidental or unsafe, sync back to Git and then address the access path that allowed the drift.
Question 3
Section titled “Question 3”Flux reports a Kustomization as not ready after a Git commit, and live objects remain on the previous image tag. No one edited the deployment manually. The Flux status shows lastAttemptedRevision at the new commit, while lastAppliedRevision is still the old commit. What is the most likely class of drift, and where do you look first?
Show Answer
This is most likely reconciliation or apply failure rather than manual configuration drift. Flux attempted the new revision but did not successfully apply it, so you should inspect the Kustomization conditions, controller logs, rendered manifests, missing CRDs, admission denials, and Kubernetes events. The fix is probably in source rendering, permissions, policy, or dependencies, not in reverting a manual cluster edit.
Question 4
Section titled “Question 4”A third-party operator is installed through GitOps. After every sync, ArgoCD reports differences on generated child resources that have owner references pointing to the operator’s custom resource. The operator immediately recreates those fields when ArgoCD removes them. What design change should you make?
Show Answer
Manage the operator’s intended parent custom resource through Git and stop trying to own the operator-generated child fields directly. Depending on the tool and object model, configure ignore rules for operator-owned generated fields or exclude generated children from GitOps ownership. The goal is to avoid a controller fight while still keeping the declarative parent configuration reviewed in Git.
Question 5
Section titled “Question 5”A namespace contains several Services that were removed from Git months ago, but prune was disabled during the original migration. One old Service still selects pods from a current deployment and exposes an unintended internal endpoint. What kind of drift is this, and how should remediation be planned?
Show Answer
This is orphan resource drift. The live resource remains after its declaration disappeared from Git, and it still changes system behavior. You should identify which resources are within GitOps ownership, verify that deleting them will not break shared dependencies, then enable or perform pruning in a controlled change. After cleanup, keep prune enabled where ownership boundaries are clear so deleted Git resources do not silently remain active.
Question 6
Section titled “Question 6”A mutating admission webhook injects a required compliance label into every Deployment. ArgoCD shows the injected label as drift because the label is missing from Git. Should you ignore the label or add it to Git?
Show Answer
It depends on whether the label represents intentional configuration that reviewers should see. If the label is stable compliance intent and belongs to the application contract, add it to Git so the repository shows the full desired configuration. If the webhook writes runtime bookkeeping or values that change per cluster, use a narrow ignore rule. The important distinction is whether the field is meaningful desired state or operational mutation.
Question 7
Section titled “Question 7”A production team repeatedly uses a break-glass role to patch environment variables directly during incidents, then forgets to open follow-up pull requests. Drift alerts are accurate, but they keep recurring. What should the platform team change?
Show Answer
The issue is no longer detection; it is recurrence prevention. Keep the break-glass path for real incidents, but require an expiring role, an incident or change record, audit logging, and a closure step that decides whether to revert or backport the live change. You can also route repeated drift alerts to service ownership and review RBAC scope. The goal is to preserve emergency response while restoring Git as the reviewed source of truth after the incident.
Hands-On Exercise: Build a Drift Detection and Remediation Plan
Section titled “Hands-On Exercise: Build a Drift Detection and Remediation Plan”In this exercise, you will create a small drift plan for a GitOps-managed application. You do not need a live cluster to complete the design portions, but the commands are written so you can run the comparison workflow locally if you have kubectl, yq, and either ArgoCD or Flux available. The goal is to practice classification and response, not to memorize tool flags.
Scenario
Section titled “Scenario”You own a production application called checkout-api. It is deployed from Git into the prod namespace, uses an HPA, receives injected annotations from a platform webhook, and has recently triggered drift alerts. Your task is to design the ignore rules, alerting policy, diagnostic commands, and remediation runbook that an on-call engineer should follow.
Part 1: Inventory Expected and Unexpected Drift
Section titled “Part 1: Inventory Expected and Unexpected Drift”Start by listing fields that may differ between Git and the cluster. Do not mark a field as expected just because it is noisy. A field is expected only when you can name the controller or platform component that owns it.
| Resource | Field | Expected owner | Drift classification | Planned action |
|---|---|---|---|---|
Deployment checkout-api | /spec/replicas | HorizontalPodAutoscaler | Expected controller-managed drift | Ignore this exact field |
Deployment checkout-api | Container image | Git review process | Harmful configuration drift | Alert and sync or backport after investigation |
Deployment checkout-api | Memory limits | Git review process | Harmful configuration drift | Investigate before sync because capacity changes can be risky |
Deployment checkout-api | Platform webhook annotations | Admission webhook | Expected mutation if documented | Ignore exact annotation prefix or move stable labels into Git |
Service checkout-api | /spec/clusterIP | Kubernetes API server | Expected runtime assignment | Exclude from comparison |
| Any managed object | /status | Kubernetes controllers | Expected runtime status | Exclude from comparison |
| Old Service not in Git | Entire object | No current owner | Orphan resource drift | Prune after ownership review |
Success criteria for this part:
- You identified at least three expected drift scenarios and named the component that owns each one.
- You identified at least two harmful drift scenarios that should not be ignored.
- You included at least one orphan-resource scenario rather than focusing only on field-level drift.
- You avoided broad language such as “ignore metadata” unless you can narrow the exact field or prefix.
Part 2: Configure Precise Ignore Rules
Section titled “Part 2: Configure Precise Ignore Rules”Write ignore rules for expected drift only. The following ArgoCD example ignores HPA-managed replicas and status fields while preserving drift detection on images, resources, environment variables, security context, and policies.
apiVersion: argoproj.io/v1alpha1kind: Applicationmetadata: name: checkout-api namespace: argocdspec: project: prod source: repoURL: https://git.example.com/platform/apps.git targetRevision: main path: apps/checkout/overlays/prod destination: server: https://kubernetes.default.svc namespace: prod ignoreDifferences: - group: apps kind: Deployment name: checkout-api namespace: prod jsonPointers: - /spec/replicas - group: "*" kind: "*" jqPathExpressions: - .status - group: apps kind: Deployment name: checkout-api namespace: prod jqPathExpressions: - .metadata.annotations | select(. != null) | with_entries(select(.key | startswith("platform.example.com/injected-")))If you use Flux, field-level comparison behavior is often handled through the manifests you apply, controller ownership, server-side apply behavior, and policy around generated resources. You may still use patches to avoid declaring fields that another controller owns, but be careful not to remove fields that should be reviewed in Git.
apiVersion: kustomize.config.k8s.io/v1beta1kind: Kustomizationresources: - deployment.yamlpatches: - target: group: apps version: v1 kind: Deployment name: checkout-api patch: |- - op: remove path: /spec/replicasSuccess criteria for this part:
- Your ignore rule is scoped to a resource kind, name, namespace, or exact field where possible.
- Your configuration does not ignore the whole Deployment just to silence one field.
- Your comments or surrounding documentation explain why another controller owns the ignored field.
- You can describe which harmful drift would still be detected after the ignore rule is applied.
Part 3: Configure Drift Alerts With Actionable Context
Section titled “Part 3: Configure Drift Alerts With Actionable Context”Create at least one warning alert for persistent application drift and one higher-severity route for critical projects. Alerts should include the application name, duration, owner or team labels if available, and a runbook link. A good alert tells the responder where to start instead of merely saying something is wrong.
groups: - name: checkout-gitops-drift rules: - alert: CheckoutApplicationOutOfSync expr: argocd_app_info{name="checkout-api",sync_status="OutOfSync"} == 1 for: 10m labels: severity: warning service: checkout-api team: payments annotations: summary: "checkout-api is out of sync with Git" description: "checkout-api has reported OutOfSync for more than ten minutes. Review the diff before syncing." runbook_url: "https://runbooks.example.com/gitops/drift"
- alert: CriticalGitOpsProjectOutOfSync expr: argocd_app_info{project="critical",sync_status="OutOfSync"} == 1 for: 5m labels: severity: page team: platform annotations: summary: "Critical GitOps application drift" description: "Application {{ $labels.name }} in a critical project is out of sync and requires human review." runbook_url: "https://runbooks.example.com/gitops/critical-drift"Success criteria for this part:
- Your alert waits long enough to avoid transient reconciliation noise.
- Your alert includes labels that support routing to the right owner.
- Your alert description instructs responders to inspect the diff before writing changes.
- Your higher-severity alert is based on business criticality, not merely the existence of any diff.
Part 4: Practice the Manual Diff Workflow
Section titled “Part 4: Practice the Manual Diff Workflow”Use the worked example pattern to compare desired and live YAML after removing runtime-only fields. If you have a real cluster, replace the sample file with kubectl get deployment checkout-api -n prod -o yaml. If you do not have a cluster, the sample files still let you practice normalization and classification.
mkdir -p /tmp/kubedojo-drift-lab/git /tmp/kubedojo-drift-lab/live
cat > /tmp/kubedojo-drift-lab/git/deployment.yaml <<'YAML'apiVersion: apps/v1kind: Deploymentmetadata: name: checkout-api namespace: prodspec: replicas: 3 selector: matchLabels: app: checkout-api template: metadata: labels: app: checkout-api spec: containers: - name: app image: registry.example.com/checkout:v1.2.3 resources: limits: cpu: 1000m memory: 512MiYAML
cat > /tmp/kubedojo-drift-lab/live/deployment.yaml <<'YAML'apiVersion: apps/v1kind: Deploymentmetadata: name: checkout-api namespace: prod uid: aaaaaaaa-bbbb-4ccc-8ddd-eeeeeeeeeeee resourceVersion: "912300" creationTimestamp: "2026-04-26T10:00:00Z" annotations: platform.example.com/injected-policy: baselinespec: replicas: 9 selector: matchLabels: app: checkout-api template: metadata: labels: app: checkout-api spec: containers: - name: app image: registry.example.com/checkout:v1.2.3 resources: limits: cpu: 1000m memory: 2Gistatus: replicas: 9 readyReplicas: 9YAML
yq eval 'del(.metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.annotations."platform.example.com/injected-policy", .status)' \ /tmp/kubedojo-drift-lab/live/deployment.yaml \ > /tmp/kubedojo-drift-lab/live/normalized.yaml
diff -u /tmp/kubedojo-drift-lab/git/deployment.yaml /tmp/kubedojo-drift-lab/live/normalized.yamlSuccess criteria for this part:
- You can identify which remaining diff is HPA-managed and which diff affects application capacity.
- You can explain why deleting runtime metadata made the diff easier to read.
- You can name at least one piece of evidence needed before reverting the memory limit.
- You can state whether the injected annotation should be ignored or moved into Git.
Part 5: Write the Remediation Runbook
Section titled “Part 5: Write the Remediation Runbook”Create a short runbook that a responder can follow during an alert. The runbook must force a decision instead of ending with “sync the app.” Use the checklist below as a minimum.
## Drift Remediation Runbook: checkout-api
### Step 1: Acknowledge and scope the alert- [ ] Identify the GitOps application, namespace, resource kind, and field that drifted.- [ ] Check whether the application is healthy or degraded separately from sync status.- [ ] Confirm the GitOps controller is healthy enough for drift status to be trustworthy.
### Step 2: Inspect the diff- [ ] Run `argocd app diff checkout-api` or the equivalent Flux and `kubectl diff` workflow.- [ ] Normalize runtime-only fields such as status, UID, resourceVersion, and timestamps.- [ ] Compare the remaining diff against the expected drift inventory.
### Step 3: Identify ownership and cause- [ ] Check for HPA, VPA, operator, webhook, or generated-resource ownership.- [ ] Check Kubernetes events and audit logs for direct production writes.- [ ] Check recent incidents and changes for emergency patches that were not backported.
### Step 4: Choose the response- [ ] If Git is correct and safe, sync or allow auto-heal to restore desired state.- [ ] If live state is correct, open a pull request that backports the desired live intent into Git.- [ ] If the difference is expected, add or refine a narrow ignore rule.- [ ] If ownership is unclear, escalate to the service owner before writing production.
### Step 5: Verify and prevent recurrence- [ ] Confirm the diff cleared or was intentionally ignored.- [ ] Confirm workload health, error rate, latency, and saturation remain acceptable.- [ ] Record whether RBAC, webhook defaults, prune policy, or break-glass process needs improvement.Success criteria for this part:
- Your runbook distinguishes sync, backport, ignore, prune, and escalation.
- Your runbook checks workload health after remediation, not only GitOps status.
- Your runbook includes a prevention step so the same drift does not recur silently.
- Your runbook is specific enough that a new on-call engineer can follow it under pressure.
Part 6: Senior Review Questions for Your Design
Section titled “Part 6: Senior Review Questions for Your Design”Before calling the exercise complete, challenge your own plan. A drift system that looks correct on paper can still fail under pressure if it hides ownership, alerts the wrong team, or automates a risky rollback. Use these questions to find weak spots before production does.
- Which field in your plan is most likely to be incorrectly ignored, and what real drift would that hide?
- Which field in your plan is most dangerous to auto-heal, and what metric would reveal harm quickly?
- Which team owns the application diff, and which team owns the GitOps controller health signal?
- What evidence would convince you that live state should be backported into Git instead of reverted?
- Which resources can be pruned automatically, and which resources need an ownership review first?
- How will break-glass changes be connected to follow-up pull requests after an incident?
- How will you detect that the GitOps controller itself is unhealthy and drift signals may be stale?
Next Module
Section titled “Next Module”Continue to Module 3.5: Secrets in GitOps to learn how to manage sensitive values without breaking the GitOps source-of-truth model.
Sources
Section titled “Sources”- Argo CD Diff Customization — Best primary reference for precise ignore rules, manager-owned fields, and status-field handling in drift detection.
- Flux Kustomization — Covers reconciliation conditions, revision tracking, inventory, drift correction, and prune behavior in Flux.
- Horizontal Pod Autoscaling — Primary source for replica ownership and autoscaling behavior, which is central to distinguishing real drift from delegated scaling.