Module 3.4: Drift Detection and Remediation

Discipline Module | Complexity: [MEDIUM] | Time: 30-35 min

Prerequisites

Before starting this module:

Required: Module 3.1: What is GitOps? — GitOps fundamentals
Required: Module 3.3: Environment Promotion — Environment management
Recommended: Experience with Kubernetes kubectl debugging

What You’ll Be Able to Do

After completing this module, you will be able to:

Implement drift detection that identifies when cluster state diverges from Git-declared configuration
Design alerting policies that distinguish critical drift requiring immediate attention from benign changes
Build automated remediation workflows that reconcile drift without manual intervention
Analyze drift patterns to identify systemic causes — manual kubectl edits, mutating webhooks, operator conflicts

Why This Module Matters

GitOps principle #4: “Continuously reconciled.”

But what happens when someone runs kubectl edit directly? Or when a Kubernetes controller modifies a resource? Or when network issues prevent syncing?

Drift occurs when the cluster state doesn’t match Git.

Without drift detection:

You don’t know your Git repo lies
Manual changes go unnoticed
Troubleshooting becomes “is it Git or is it the cluster?”
GitOps becomes “Git-sometimes-Ops”

This module teaches you to detect drift, understand why it happens, and decide whether to remediate or accept it.

What Is Drift?

Drift is the difference between desired state (Git) and actual state (cluster).

┌─────────────────────────────────────────────────────────────┐
│                     Desired State (Git)                      │
│                                                              │
│    replicas: 3                                               │
│    image: my-app:v1.2.3                                      │
│    memory: 512Mi                                             │
└─────────────────────────────────────────────────────────────┘
                              ≠
┌─────────────────────────────────────────────────────────────┐
│                   Actual State (Cluster)                     │
│                                                              │
│    replicas: 5        ← Someone scaled manually              │
│    image: my-app:v1.2.3                                      │
│    memory: 512Mi                                             │
└─────────────────────────────────────────────────────────────┘

                        DRIFT DETECTED

Types of Drift

Configuration Drift:

Manual kubectl changes
Helm release modifications outside GitOps
Direct API changes

Operational Drift:

HPA scaling pods (expected)
VPA adjusting resources
Cluster autoscaler adding nodes

Structural Drift:

Resources deleted from Git but still in cluster
Resources in cluster that were never in Git
Namespace changes

What Causes Drift?

1. Manual Changes (The Usual Suspect)

# The classic drift-causer
kubectl scale deployment my-app --replicas=10
kubectl set image deployment/my-app app=my-app:hotfix
kubectl edit configmap my-config

Why it happens:

Debugging production issues
Emergency fixes
“Just this once”
Forgetting GitOps exists

2. Controllers and Operators

# Git says:
spec:
  replicas: 3

# HPA says:
# "Traffic is high, I'm scaling to 10"
# Cluster now has 10 replicas

# Is this drift? Depends on how you define it.

Why it happens:

HPA, VPA, Cluster Autoscaler
Custom operators
Admission webhooks modifying resources

3. Kubernetes Internal Changes

# Git says:
metadata:
  name: my-app

# Cluster has:
metadata:
  name: my-app
  uid: abc-123-def
  resourceVersion: "12345"
  creationTimestamp: "2024-01-15T10:00:00Z"
  generation: 3

Why it happens:

Kubernetes adds metadata
Status fields updated
Annotations added by controllers

4. Network/Sync Issues

GitOps Agent ──✕──> Git Repository
              ^
              │
         Network failure

Agent can't sync, cluster drifts over time

Why it happens:

Git authentication issues
Network partitions
Agent crashes/restarts

Try This: Detect Drift Manually

Check for drift in your cluster:

# Export what's in Git
kustomize build overlays/prod > /tmp/desired.yaml

# Export what's in cluster
kubectl get deployment my-app -o yaml > /tmp/actual.yaml

# Compare (ignore metadata/status)
diff <(yq 'del(.metadata.uid, .metadata.resourceVersion,
       .metadata.creationTimestamp, .metadata.generation,
       .metadata.managedFields, .status)' /tmp/desired.yaml) \
     <(yq 'del(.metadata.uid, .metadata.resourceVersion,
       .metadata.creationTimestamp, .metadata.generation,
       .metadata.managedFields, .status)' /tmp/actual.yaml)

GitOps Tool Drift Detection

ArgoCD Sync Status

ArgoCD shows drift as “OutOfSync”:

┌─────────────────────────────────────────────────────────────┐
│                    ArgoCD Application                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Application: my-app                                         │
│  Sync Status: OutOfSync  ⚠️                                  │
│  Health: Healthy                                             │
│                                                              │
│  Diff:                                                       │
│  - spec.replicas: 3                                          │
│  + spec.replicas: 5                                          │
│                                                              │
│  [Sync] [Refresh] [History]                                  │
└─────────────────────────────────────────────────────────────┘

ArgoCD CLI:

# Check sync status
argocd app get my-app

# Show diff
argocd app diff my-app

# List all out-of-sync apps
argocd app list --sync-status OutOfSync

Flux Kustomization Status

Flux shows drift in conditions:

# Check kustomization status
flux get kustomizations

# Detailed status
kubectl get kustomization my-app -n flux-system -o yaml

status:
  conditions:
    - type: Ready
      status: "True"
      reason: ReconciliationSucceeded
    - type: Reconciling
      status: "False"
      reason: ReconciliationSucceeded
  lastAppliedRevision: main/abc123
  lastAttemptedRevision: main/abc123

Custom Drift Detection

For advanced needs, build custom detection:

# Prometheus alert for drift
groups:
  - name: gitops
    rules:
      - alert: GitOpsDrift
        expr: |
          argocd_app_info{sync_status="OutOfSync"} == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GitOps drift detected"
          description: "{{ $labels.name }} is out of sync for 10+ minutes"

Did You Know?

Netflix’s approach: They run continuous “drift detectors” that compare running services to their source of truth, alerting when differences are found.
Google SRE coined “Configuration Drift” as one of the major causes of outages. Their solution: continuous reconciliation (what became GitOps).
Some drift is intentional and acceptable. The key is knowing which drift matters and which doesn’t. HPA scaling isn’t a problem; manual hotfixes are.
AWS Config was created specifically to address drift detection at cloud infrastructure level. The concept of “desired state vs actual state” reconciliation exists at every layer of the stack, from cloud resources to Kubernetes to application config.

War Story: The Drift That Hid a Bug

A team I worked with had a puzzling situation:

The Problem:

Users reported intermittent 500 errors
Logs showed database connection timeouts
Dev said “but it works in staging!”

Investigation:

# Git says:
resources:
  limits:
    memory: 512Mi

# Cluster had:
resources:
  limits:
    memory: 2Gi

What Happened:

Months ago, an engineer saw OOM kills and ran:

kubectl set resources deployment/api --limits=memory=2Gi

The fix worked. They forgot to commit it to Git. The GitOps agent was in “manual” sync mode (sync only on changes), so it never reverted.

Fast forward: staging was deployed fresh from Git (512Mi). Production had the manual change (2Gi). The bug appeared in staging but not production.

The Reveal:

When they finally noticed the ArgoCD “OutOfSync” status:

Production had been drifted for 3 months
47 resources were out of sync
Nobody knew what the “real” config was

The Fix:

Audit all drift (exported actual state)
Decided which to keep (2Gi was correct)
Updated Git to match desired actual state
Enabled auto-sync to prevent future drift

Lesson: Drift detection isn’t just about fixing things — it’s about knowing the truth.

Auto-Heal vs Alert

When drift is detected, you have two choices:

Option 1: Auto-Heal (Self-Healing)

Automatically revert drift to match Git.

# ArgoCD auto-sync
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
spec:
  syncPolicy:
    automated:
      prune: true      # Delete resources not in Git
      selfHeal: true   # Revert manual changes
      allowEmpty: false

# Flux auto-remediation
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
spec:
  interval: 5m
  force: true  # Override manual changes
  prune: true  # Delete orphaned resources

Pros:

Git is always truth
No manual intervention
Consistent state

Cons:

Can revert intentional changes
May cause issues if Git is wrong
Less flexibility

Option 2: Alert Only

Detect drift, notify humans, don’t auto-fix.

# ArgoCD manual sync
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
spec:
  syncPolicy:
    automated: null  # No auto-sync
    # Or:
    automated:
      selfHeal: false  # Don't auto-fix drift

Pros:

Human review before changes
Can investigate drift cause
Safer for critical systems

Cons:

Drift persists until human acts
Requires monitoring
Can be forgotten

The Middle Ground

Many teams use a hybrid:

# Auto-sync during business hours, alert-only at night
syncPolicy:
  automated:
    selfHeal: true
  syncWindows:
    - kind: allow
      schedule: '0 9-17 * * 1-5'  # Weekdays 9am-5pm
      duration: 8h
      manualSync: true
    - kind: deny
      schedule: '0 17 * * *'  # After 5pm
      duration: 16h

Handling Legitimate Drift

Not all drift should be reverted. Some is expected.

Ignoring Specific Fields

# ArgoCD ignore differences
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas  # Ignore HPA-managed replicas
    - group: autoscaling
      kind: HorizontalPodAutoscaler
      jqPathExpressions:
        - .status  # Ignore all status fields

Common Fields to Ignore

Resource	Field	Why
Deployment	`/spec/replicas`	HPA manages this
Service	`/spec/clusterIP`	Kubernetes assigns
PVC	`/spec/volumeName`	Dynamic provisioning
All	`/metadata/annotations/kubectl.kubernetes.io/*`	kubectl artifacts
All	`/status`	Runtime status

Flux Field Exclusions

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
spec:
  patches:
    - patch: |
        - op: remove
          path: /spec/replicas
      target:
        kind: Deployment

Drift Remediation Strategies

Strategy 1: Revert to Git (Most Common)

Cluster is wrong, Git is right.

# ArgoCD
argocd app sync my-app --force

# Flux
flux reconcile kustomization my-app --with-source

Strategy 2: Update Git to Match Cluster

Cluster is right, Git needs updating.

# Export actual state
kubectl get deployment my-app -o yaml > deployment.yaml

# Clean up (remove runtime fields)
yq 'del(.metadata.uid, .metadata.resourceVersion, ...)' deployment.yaml

# Commit to Git
cp deployment.yaml overlays/prod/deployment.yaml
git add . && git commit -m "Update deployment to match production"
git push

Strategy 3: Investigate and Decide

Not sure what’s right — investigate first.

# See the diff
argocd app diff my-app

# Check who/what changed it
kubectl describe deployment my-app | grep -A 10 "Events"

# Check audit logs (if enabled)
kubectl get events --field-selector involvedObject.name=my-app

Decision Framework

Drift Detected
     │
     ▼
Was it intentional?
     │
  ┌──┴──┐
  │     │
 Yes    No
  │     │
  ▼     ▼
Is it   Revert to Git
correct?
  │
  ▼
Update Git
to match

Common Mistakes

Mistake	Problem	Solution
Ignoring drift alerts	Cluster diverges from Git	Act on alerts, automate where safe
Auto-healing everything	Reverts intentional changes	Use ignore rules for expected drift
Never auto-healing	Drift accumulates	Auto-heal non-critical resources
Not auditing drift causes	Same drift keeps happening	Investigate and fix root cause
Manual changes “just this once”	Becomes habit	Enforce GitOps discipline
No alerts on drift	Don’t know when it happens	Set up monitoring

Quiz: Check Your Understanding

Question 1

Your deployment has 10 replicas but Git says 3. HPA is configured with min=3, max=20. Is this drift?

Show Answer

It depends on your definition, but typically: No, this is not problematic drift.

The HPA is legitimately managing replica count based on load. This is expected behavior.

Best practice:

Ignore /spec/replicas in drift detection for HPA-managed deployments
Monitor that HPA exists and is configured correctly
Git defines the HPA policy, HPA controls the replicas

# ArgoCD ignore
ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
      - /spec/replicas

If HPA didn’t exist, then yes, this would be drift to investigate.

Question 2

You find drift that’s been present for 2 months. How do you decide whether to revert to Git or update Git?

Show Answer

Investigation questions:

Why does drift exist?
- Manual hotfix? → Probably should be in Git
- Controller change? → Configure ignore rules
- Bug in GitOps? → Fix the tooling
Which state is correct?
- Does the cluster version work? → It’s validated
- Does Git version work? → Test in staging first
- Neither? → Debug both
What’s the risk of each option?
- Reverting to Git: May break production
- Updating Git: May lose intended config

Safe approach for long-standing drift:

Don’t auto-revert immediately
Deploy Git version to staging
Test thoroughly
If staging works, then sync production
If staging fails, update Git to match cluster

Document the decision — future you will want to know why.

Question 3

How do you prevent manual kubectl changes from causing drift?

Show Answer

Technical controls:

RBAC: Restrict kubectl write access

# Read-only role for most users
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["get", "list", "watch"]

Admission webhooks: Block manual changes

# OPA/Gatekeeper policy
deny manual changes to GitOps-managed resources

Auto-sync with self-heal: Revert changes immediately
```
syncPolicy:
  automated:
    selfHeal: true
```
Alerting: Notify on drift
```
Alert when OutOfSync for > 5 minutes
```

Process controls:

Culture: “All changes through Git”
Training: Teach GitOps workflow
Runbooks: Document how to make emergency changes through Git
Postmortems: Treat manual changes as incidents to learn from

Question 4

ArgoCD shows “OutOfSync” but the diff is just metadata fields. What should you do?

Show Answer

Configure ignore rules for expected metadata:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
spec:
  ignoreDifferences:
    # Kubernetes-added metadata
    - group: "*"
      kind: "*"
      jsonPointers:
        - /metadata/annotations/kubectl.kubernetes.io/last-applied-configuration

    # Controller-added annotations
    - group: apps
      kind: Deployment
      jsonPointers:
        - /metadata/annotations/deployment.kubernetes.io/revision

    # Status fields (shouldn't be in manifests anyway)
    - group: "*"
      kind: "*"
      managedFieldsManagers:
        - kube-controller-manager

Or use JQ expressions:

ignoreDifferences:
  - group: "*"
    kind: "*"
    jqPathExpressions:
      - .metadata.annotations | select(. != null) | with_entries(select(.key | startswith("kubectl")))

Prevention: Don’t include these fields in your Git manifests in the first place. Use kubectl apply --dry-run=client -o yaml to see clean output.

Hands-On Exercise: Drift Detection Setup

Configure comprehensive drift detection for a GitOps deployment.

Part 1: Identify Expected Drift

List resources where drift is expected:

## Expected Drift Inventory

| Resource | Field | Why Drift Expected | Action |
|----------|-------|-------------------|--------|
| Deployment | /spec/replicas | HPA manages | Ignore |
| Service | /spec/clusterIP | K8s assigns | Ignore |
| | | | |
| | | | |

Part 2: Configure Ignore Rules

Write the ignore configuration:

# For ArgoCD
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
spec:
  ignoreDifferences:
    # Add your rules:
    - group: ___
      kind: ___
      jsonPointers:
        - ___

# For Flux (if applicable)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
spec:
  # Add patches to handle expected drift
  patches:
    - ___

Part 3: Configure Alerts

Set up drift alerting:

## Prometheus Alert
groups:
  - name: gitops-drift
    rules:
      - alert: ___
        expr: ___
        for: ___
        labels:
          severity: ___
        annotations:
          summary: ___
          description: ___

Part 4: Remediation Runbook

Document the drift response process:

## Drift Remediation Runbook

### Step 1: Acknowledge Alert
- [ ] Check which resource is out of sync
- [ ] Command: _______________

### Step 2: Understand the Drift
- [ ] View the diff
- [ ] Command: _______________
- [ ] Check if drift is expected (see ignore list)

### Step 3: Investigate Cause
- [ ] Check recent kubectl activity
- [ ] Check controller logs
- [ ] Command: _______________

### Step 4: Decide Action
- [ ] If cluster is wrong: Sync to Git
      Command: _______________
- [ ] If Git is wrong: Update Git
      Steps: _______________
- [ ] If unclear: Escalate to _______________

### Step 5: Prevent Recurrence
- [ ] Add ignore rule if expected drift
- [ ] Improve RBAC if manual change
- [ ] Fix tooling if sync issue

Success Criteria

Identified at least 3 expected drift scenarios
Created appropriate ignore rules
Configured at least one drift alert
Documented remediation runbook

Key Takeaways

Drift = cluster != Git: The core GitOps promise is broken when drift exists
Not all drift is bad: HPA, VPA, and some controllers cause expected drift
Ignore expected drift: Configure rules to avoid alert fatigue
Choose your response: Auto-heal for safety, alert-only for critical systems
Investigate root causes: Repeated drift indicates a process problem

IaC Drift: For IaC-specific drift remediation (Terraform state, CloudFormation drift), see IaC Drift Remediation.

Summary

Drift is the enemy of GitOps. When the cluster doesn’t match Git:

You don’t know what’s actually running
Troubleshooting becomes guesswork
Git history becomes unreliable

To manage drift effectively:

Detect it: Use GitOps tools, alerts, monitoring
Classify it: Expected (HPA scaling) vs unexpected (manual changes)
Ignore expected drift: Configure rules to avoid noise
Remediate unexpected drift: Auto-heal or alert based on criticality
Prevent recurrence: RBAC, policies, culture

The goal isn’t zero drift — it’s knowing what drift exists and why.

Next Module

Continue to Module 3.5: Secrets in GitOps to learn how to handle sensitive data in a GitOps workflow.

“If the cluster and Git disagree, one of them is wrong. Find out which.” — GitOps Wisdom

Module 3.4: Drift Detection and Remediation

Prerequisites

What You’ll Be Able to Do

Why This Module Matters

What Is Drift?

Types of Drift

What Causes Drift?

1. Manual Changes (The Usual Suspect)

2. Controllers and Operators

3. Kubernetes Internal Changes

4. Network/Sync Issues

Try This: Detect Drift Manually

GitOps Tool Drift Detection

ArgoCD Sync Status

Flux Kustomization Status

Custom Drift Detection

Did You Know?

War Story: The Drift That Hid a Bug

Auto-Heal vs Alert

Option 1: Auto-Heal (Self-Healing)

Option 2: Alert Only

The Middle Ground

Handling Legitimate Drift

Ignoring Specific Fields

Common Fields to Ignore

Flux Field Exclusions

Drift Remediation Strategies

Strategy 1: Revert to Git (Most Common)

Strategy 2: Update Git to Match Cluster

Strategy 3: Investigate and Decide

Decision Framework

Common Mistakes

Quiz: Check Your Understanding

Question 1

Question 2

Question 3

Question 4

Hands-On Exercise: Drift Detection Setup

Part 1: Identify Expected Drift

Part 2: Configure Ignore Rules

Part 3: Configure Alerts

Part 4: Remediation Runbook

Success Criteria

Key Takeaways

Related Modules

Further Reading

Summary

Next Module