
Module 6.5: Drift Detection and Remediation


Before starting this module, you should have completed:


After completing this module, you will be able to:

  • Implement drift detection pipelines that identify when infrastructure state diverges from IaC definitions
  • Design automated remediation workflows that reconcile drift without manual intervention
  • Build alerting and escalation procedures for drift that cannot be automatically resolved
  • Analyze drift root causes — manual changes, incomplete IaC coverage, provider bugs — to prevent recurrence

The Silent Security Group

At 2:47 AM, a senior engineer at a healthcare company received an urgent page. Their compliance monitoring system had detected an anomaly: a production database was accepting connections from an IP range that wasn’t in any approved configuration. The engineer’s first thought was a breach. Their second thought was worse: when did this happen?

The investigation revealed a troubling timeline. Six weeks earlier, during an incident, an on-call engineer had manually added a security group rule to allow access from a vendor’s IP for emergency debugging. The incident was resolved. The temporary rule was forgotten. For 42 days, the production database had a security hole that existed nowhere in their Terraform configurations.

The rule itself wasn’t exploited—they got lucky. But the audit finding triggered a mandatory penetration test, delayed their SOC2 certification by three months, and cost $340,000 in remediation and consulting fees.

This module teaches you how to detect and remediate infrastructure drift—because the most dangerous configuration changes are the ones you don’t know about.


Infrastructure drift occurs when the actual state of resources diverges from the desired state defined in code.

graph LR
subgraph Desired State [Desired State - Terraform]
direction TB
DS_Inst["instance_type = t3.medium"]
DS_Tags["tags = { env: prod }"]
DS_SG["sg_rules: port: 443"]
end
subgraph Actual State [Actual State - Cloud]
direction TB
AS_Inst["instance_type = t3.large"]
AS_Tags["tags = { env: prod, temp: true }"]
AS_SG["sg_rules: port: 443, port: 22"]
end
DS_Inst -.->|"DRIFT!"| AS_Inst
DS_Tags -.->|"DRIFT!"| AS_Tags
DS_SG -.->|"DRIFT!"| AS_SG
subgraph Sources [Common Drift Sources]
direction TB
C1["Manual console changes"]
C2["Emergency fixes not back-ported to code"]
C3["Auto-scaling / self-healing systems"]
C4["Other automation tools (scripts, Lambda)"]
C5["Cloud provider auto-updates"]
C6["Malicious actors"]
end

Stop and think: If an automated incident response tool correctly modifies an auto-scaling group’s size to mitigate an attack, but Terraform runs its next plan, will Terraform see this as drift? How should your configuration handle this?

Resource attributes differ from code:

# In Terraform:
resource "aws_instance" "app" {
instance_type = "t3.medium" # Desired
}
# In AWS:
# Instance is t3.large (someone resized manually)

Resources exist that aren’t in state:

# Resources in AWS but not in Terraform:
# - aws_security_group_rule (manually added)
# - aws_ebs_volume (created by console)
# - aws_iam_policy (created by script)
# These are "shadow IT" - unmanaged infrastructure

State doesn’t match code (unapplied changes):

# main.tf has:
instance_type = "t3.large"
# terraform.tfstate has:
"instance_type": "t3.medium"
# Developer made local changes but never applied
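
The three cases above can be told apart by comparing three views of the same attribute: the value in code, the value recorded in state, and the value in the cloud. The following is a minimal Python sketch of that comparison. The helper names `classify_attribute` and `find_unmanaged` are hypothetical, and plain strings and sets stand in for real Terraform and cloud API data.

```python
def classify_attribute(code_value, state_value, actual_value):
    """Return drift labels for one attribute seen from three viewpoints."""
    labels = []
    if state_value != actual_value:
        # The live resource was changed after the last apply.
        labels.append("configuration drift")
    if code_value != state_value:
        # The code was edited but the change was never applied.
        labels.append("unapplied change")
    return labels or ["in sync"]


def find_unmanaged(cloud_ids, state_ids):
    """Resources that exist in the cloud but are absent from state."""
    return sorted(set(cloud_ids) - set(state_ids))


# Manual resize: code and state say t3.medium, the cloud says t3.large.
print(classify_attribute("t3.medium", "t3.medium", "t3.large"))
# Local edit never applied: code says t3.large, state and cloud say t3.medium.
print(classify_attribute("t3.large", "t3.medium", "t3.medium"))
# A security group created in the console never appears in state.
print(find_unmanaged({"sg-111", "sg-222"}, {"sg-111"}))
```

Note that the third case needs an inventory of cloud resources from outside Terraform, which is why a plan alone cannot report it.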

Pause and predict: If you rely solely on terraform plan to detect drift, what blind spots remain in your infrastructure visibility? What types of shadow IT might slip past this check?

The simplest drift detection—run plan and look for unexpected changes:

# Basic drift detection
terraform plan -detailed-exitcode
# Exit codes:
# 0 = No changes (no drift)
# 1 = Error
# 2 = Changes detected (drift!)
# Script for CI/CD
#!/bin/bash
terraform plan -detailed-exitcode -out=tfplan
EXIT_CODE=$?
if [ $EXIT_CODE -eq 2 ]; then
echo "⚠️ DRIFT DETECTED"
terraform show tfplan
# Send alert
curl -X POST "$SLACK_WEBHOOK" \
-H "Content-Type: application/json" \
-d '{"text":"🚨 Infrastructure drift detected in production!"}'
fi
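
The same exit-code handling can be expressed in Python when the check runs inside a larger automation. This is a sketch under stated assumptions: the function names and the label strings are illustrative, and `run_drift_check` expects a `terraform` binary on the `PATH` and an initialized working directory.

```python
import subprocess

# Meaning of each exit code when -detailed-exitcode is set.
EXIT_MEANINGS = {0: "no_changes", 1: "error", 2: "changes_present"}


def classify_plan_exit_code(code: int) -> str:
    """Map a terraform plan exit code to a label; unknown codes count as errors."""
    return EXIT_MEANINGS.get(code, "error")


def run_drift_check(workdir: str) -> str:
    """Run terraform plan in workdir and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)


for code in (0, 1, 2):
    print(code, classify_plan_exit_code(code))
```

Treating any unexpected exit code as an error keeps a crashed or killed process from being mistaken for a clean plan.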

Method 2: Terraform Refresh (Deprecated Approach)

# Old way - updates state from actual infrastructure
terraform refresh
# New way - use plan with refresh
terraform plan -refresh-only
# Shows what would change in state without modifying resources
.github/workflows/drift-detection.yml
name: Drift Detection
on:
  schedule:
    - cron: '0 */6 * * *' # Every 6 hours
  workflow_dispatch:
permissions:
  id-token: write # Needed when role-to-assume uses OIDC
  contents: read
  issues: write # Needed to open drift issues
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging, production]
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1
      - name: Terraform Init
        working-directory: environments/${{ matrix.environment }}
        run: terraform init
      - name: Check for Drift
        id: drift
        working-directory: environments/${{ matrix.environment }}
        run: |
          # $? after a pipe reports tee's status, so read terraform's from PIPESTATUS
          terraform plan -detailed-exitcode -out=tfplan 2>&1 | tee plan.txt
          echo "exit_code=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
        continue-on-error: true
      - name: Report Drift
        if: steps.drift.outputs.exit_code == '2'
        env:
          GH_TOKEN: ${{ github.token }} # The gh CLI reads its token from here
        run: |
          # Create GitHub Issue
          gh issue create \
            --title "🚨 Drift detected in ${{ matrix.environment }}" \
            --body "$(cat environments/${{ matrix.environment }}/plan.txt)" \
            --label "drift,infrastructure"
          # Slack notification
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H "Content-Type: application/json" \
            -d '{
              "text": "🚨 Infrastructure drift detected",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Environment:* ${{ matrix.environment }}\n*Details:* See GitHub Issue"
                  }
                }
              ]
            }'
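
A drift issue is easier to triage when it lists the affected resources rather than the full plan text. The sketch below assumes the saved plan has been converted with `terraform show -json tfplan` and reads the `resource_changes` array from that output. The function name `summarize_plan` and the sample document are illustrative only.

```python
import json


def summarize_plan(plan_json: str) -> dict:
    """Map each changed resource address to its planned actions."""
    plan = json.loads(plan_json)
    summary = {}
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        # "no-op" marks resources that already match the configuration.
        if actions and actions != ["no-op"]:
            summary[resource["address"]] = "/".join(actions)
    return summary


# Hand-written sample in the shape of `terraform show -json` output.
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "aws_instance.app", "change": {"actions": ["update"]}},
        {"address": "aws_security_group.db", "change": {"actions": ["no-op"]}},
        {"address": "aws_ebs_volume.data", "change": {"actions": ["delete", "create"]}},
    ]
})

for address, actions in summarize_plan(SAMPLE_PLAN).items():
    print(f"{address}: {actions}")
```

A two-element action list such as `delete/create` indicates a replacement, which usually deserves a higher severity than an in-place update.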

Detect drift at the cloud provider level:

# AWS Config rule for security group drift
resource "aws_config_config_rule" "security_group_ssh" {
name = "restricted-ssh"
source {
owner = "AWS"
source_identifier = "INCOMING_SSH_DISABLED"
}
scope {
compliance_resource_types = ["AWS::EC2::SecurityGroup"]
}
}
# AWS managed rule for required tags
resource "aws_config_config_rule" "required_tags" {
name = "required-tags"
source {
owner = "AWS"
source_identifier = "REQUIRED_TAGS"
}
input_parameters = jsonencode({
tag1Key = "Environment"
tag2Key = "Team"
tag3Key = "ManagedBy"
})
}
# Aggregate compliance across accounts
resource "aws_config_configuration_aggregator" "organization" {
name = "organization-aggregator"
organization_aggregation_source {
all_regions = true
role_arn = aws_iam_role.config_aggregator.arn
}
}
# Alert on non-compliant resources
resource "aws_cloudwatch_event_rule" "config_compliance" {
name = "config-compliance-change"
event_pattern = jsonencode({
source = ["aws.config"]
detail-type = ["Config Rules Compliance Change"]
detail = {
newEvaluationResult = {
complianceType = ["NON_COMPLIANT"]
}
}
})
}
resource "aws_cloudwatch_event_target" "sns" {
rule = aws_cloudwatch_event_rule.config_compliance.name
target_id = "send-to-sns"
arn = aws_sns_topic.security_alerts.arn
}
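
Once the event reaches a subscriber, it usually needs to be turned into a readable alert. The sketch below formats a compliance-change event into a one-line message. The field names (`configRuleName`, `resourceType`, `resourceId`, `newEvaluationResult.complianceType`) reflect my understanding of the event shape and should be checked against a real event. The function name and the sample event are hypothetical.

```python
def format_compliance_alert(event: dict) -> str:
    """Build a one-line alert from a Config compliance-change event."""
    detail = event.get("detail", {})
    result = detail.get("newEvaluationResult", {})
    return (
        f"{result.get('complianceType', 'UNKNOWN')}: "
        f"{detail.get('resourceType', '?')} {detail.get('resourceId', '?')} "
        f"violates rule {detail.get('configRuleName', '?')}"
    )


# Illustrative event; the IDs are made up.
sample_event = {
    "source": "aws.config",
    "detail-type": "Config Rules Compliance Change",
    "detail": {
        "configRuleName": "restricted-ssh",
        "resourceType": "AWS::EC2::SecurityGroup",
        "resourceId": "sg-0123456789abcdef0",
        "newEvaluationResult": {"complianceType": "NON_COMPLIANT"},
    },
}

print(format_compliance_alert(sample_event))
```

Using `.get` with defaults keeps the formatter from failing on events that lack a field, at the cost of a less specific message.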

Specialized tool for drift detection:

# Install driftctl
brew install driftctl
# Scan for drift
driftctl scan
# Output example:
# Found 142 resource(s)
# - 134 covered by IaC
# - 5 not covered by IaC ◄── Unmanaged resources
# - 3 missing on cloud ◄── Deleted outside Terraform
#
# Coverage: 94%
# Scan with specific filters
driftctl scan --filter "Type=='aws_security_group'"
# Output as JSON for automation
driftctl scan --output json://drift-report.json
# Generate .driftignore for known exceptions
driftctl gen-driftignore
# .driftignore - Known acceptable drift
# Auto-generated resources
aws_autoscaling_group.*
aws_launch_template.*
# Managed by other teams
aws_iam_role.external-*
# AWS-managed resources
aws_iam_service_linked_role.*

Stop and think: If we block all manual console changes using SCPs, how will an on-call engineer address an emergency outage at 3 AM if the CI/CD pipeline is also down? What kind of “break-glass” mechanism is needed?

For non-critical drift, automatically apply to restore desired state:

.github/workflows/auto-remediate.yml
name: Auto-Remediate Drift
on:
  workflow_dispatch:
    inputs:
      environment:
        description: 'Environment to remediate'
        required: true
        type: choice
        options: [dev, staging] # Not production!
permissions:
  id-token: write # Needed when role-to-assume uses OIDC
  contents: read
jobs:
  remediate:
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1
      - name: Terraform Init
        working-directory: environments/${{ inputs.environment }}
        run: terraform init
      - name: Terraform Apply
        working-directory: environments/${{ inputs.environment }}
        run: terraform apply -auto-approve
      - name: Notify Success
        run: |
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H "Content-Type: application/json" \
            -d '{"text":"✅ Drift remediated in ${{ inputs.environment }}"}'

When drift is intentional, bring resources under management:

# Discover unmanaged resource
driftctl scan --filter "Type=='aws_security_group'"
# Found: aws_security_group.sg-12345 (not in state)
# Add to Terraform configuration
cat >> security_groups.tf << 'EOF'
resource "aws_security_group" "imported" {
name = "manually-created-sg"
description = "Previously manual, now managed"
vpc_id = aws_vpc.main.id
# Add rules to match actual state
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["10.0.0.0/8"]
}
}
EOF
# Import into state
terraform import aws_security_group.imported sg-12345
# Verify
terraform plan # Should show no changes

For resources managed elsewhere:

# Auto-scaling managed instance counts
resource "aws_autoscaling_group" "app" {
name = "app-asg"
desired_capacity = 2
min_size = 2
max_size = 10
lifecycle {
ignore_changes = [
desired_capacity, # Managed by ASG policies
target_group_arns, # Managed by ALB
]
}
}
# EKS node group with cluster autoscaler
resource "aws_eks_node_group" "workers" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "workers"
scaling_config {
desired_size = 3
min_size = 2
max_size = 20
}
lifecycle {
ignore_changes = [
scaling_config[0].desired_size, # Cluster autoscaler manages this
]
}
}
# Secrets rotated externally
resource "aws_secretsmanager_secret_version" "db_password" {
secret_id = aws_secretsmanager_secret.db.id
secret_string = random_password.db.result
lifecycle {
ignore_changes = [secret_string] # Rotated by Lambda
}
}

Block manual changes entirely:

# SCP to prevent manual changes to tagged resources
resource "aws_organizations_policy" "prevent_manual_changes" {
name = "PreventManualChangesToManagedResources"
type = "SERVICE_CONTROL_POLICY"
content = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "DenyManualChangesToManagedResources"
Effect = "Deny"
Action = [
"ec2:ModifyInstanceAttribute",
"ec2:ModifySecurityGroupRules",
"rds:ModifyDBInstance",
"s3:PutBucketPolicy",
"s3:DeleteBucketPolicy"
]
Resource = "*"
Condition = {
StringEquals = {
"aws:ResourceTag/ManagedBy" = "terraform"
}
# Exception for Terraform role
"ArnNotLike" = {
"aws:PrincipalArn" = "arn:aws:iam::*:role/TerraformRole"
}
}
}
]
})
}
# IAM policy preventing modifications
resource "aws_iam_policy" "read_only_managed" {
name = "ReadOnlyManagedResources"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "DenyWriteToManagedResources"
Effect = "Deny"
Action = [
"ec2:*",
"rds:*",
"s3:*"
]
Resource = "*"
Condition = {
StringEquals = {
"aws:ResourceTag/ManagedBy" = "terraform"
}
}
},
{
Sid = "AllowReadToManagedResources"
Effect = "Allow"
Action = [
"ec2:Describe*",
"rds:Describe*",
"s3:Get*",
"s3:List*"
]
Resource = "*"
}
]
})
}
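
The SCP above denies a request only when all of its conditions hold at once: the action is on the list, the resource carries the `ManagedBy = terraform` tag, and the caller is not the Terraform role. The following Python sketch models that logic to make the combination explicit. It is a simplified illustration, not a reimplementation of IAM policy evaluation, and the function name and sample ARNs are hypothetical.

```python
from fnmatch import fnmatchcase

# Actions and role pattern copied from the SCP above.
DENIED_ACTIONS = {
    "ec2:ModifyInstanceAttribute",
    "ec2:ModifySecurityGroupRules",
    "rds:ModifyDBInstance",
    "s3:PutBucketPolicy",
    "s3:DeleteBucketPolicy",
}
TERRAFORM_ROLE_PATTERN = "arn:aws:iam::*:role/TerraformRole"


def scp_denies(action: str, resource_tags: dict, principal_arn: str) -> bool:
    """True when every condition of the Deny statement matches."""
    if action not in DENIED_ACTIONS:
        return False
    is_managed = resource_tags.get("ManagedBy") == "terraform"
    is_terraform_role = fnmatchcase(principal_arn, TERRAFORM_ROLE_PATTERN)
    return is_managed and not is_terraform_role


developer = "arn:aws:iam::123456789012:role/Developer"
pipeline = "arn:aws:iam::123456789012:role/TerraformRole"
managed = {"ManagedBy": "terraform"}

print(scp_denies("ec2:ModifySecurityGroupRules", managed, developer))  # denied
print(scp_denies("ec2:ModifySecurityGroupRules", managed, pipeline))   # allowed
print(scp_denies("ec2:ModifySecurityGroupRules", {}, developer))       # untagged
```

The third call shows the main gap in a tag-based control: resources that were never tagged are not protected by it.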

flowchart TD
Sched["Scheduled Job<br>(6 hours)"] --> Detect["Drift Detection Service"]
CT["CloudTrail Events"] --> Detect
Config["AWS Config Rules"] --> Detect
Detect --> Metrics["Metrics Dashboard"]
Detect --> Alerts["Alerts (Slack/PD)"]
Detect --> Reports["Weekly Reports"]
Metrics --> Runbook["Remediation Runbook"]
Alerts --> Runbook
Reports --> Runbook
Runbook --> Auto["Auto-Remediate (Dev)"]
Runbook --> Manual["Manual Review (Prod)"]
Runbook --> Accept["Accept & Document (Known)"]
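
The three runbook outcomes in the flowchart can be sketched as a small routing function. This is an illustration under stated assumptions: the function name, the outcome labels, and the idea of keeping known exceptions in a set of resource addresses are all hypothetical, and the auto-remediation environments mirror the `dev` and `staging` options used by the workflow earlier in this module.

```python
# Environments where automatic apply is acceptable.
AUTO_REMEDIATE_ENVS = {"dev", "staging"}


def route_drift(environment: str, address: str, known_exceptions: set) -> str:
    """Pick a runbook outcome for one drifted resource."""
    if address in known_exceptions:
        return "accept_and_document"
    if environment in AUTO_REMEDIATE_ENVS:
        return "auto_remediate"
    # Production and any unrecognized environment get a human review.
    return "manual_review"


known = {"aws_autoscaling_group.app"}
print(route_drift("dev", "aws_instance.app", known))
print(route_drift("production", "aws_security_group.db", known))
print(route_drift("production", "aws_autoscaling_group.app", known))
```

Defaulting to manual review means a newly added environment is never auto-remediated by accident.
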
resource "aws_cloudwatch_dashboard" "drift_monitoring" {
dashboard_name = "InfrastructureDrift"
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
x = 0
y = 0
width = 12
height = 6
properties = {
title = "Drift Detection Results"
region = "us-east-1"
metrics = [
["Terraform", "DriftDetected", "Environment", "production", { stat = "Sum", period = 86400 }],
["Terraform", "DriftDetected", "Environment", "staging", { stat = "Sum", period = 86400 }],
["Terraform", "DriftDetected", "Environment", "dev", { stat = "Sum", period = 86400 }]
]
view = "timeSeries"
}
},
{
type = "metric"
x = 12
y = 0
width = 12
height = 6
properties = {
title = "AWS Config Compliance"
region = "us-east-1"
metrics = [
["AWS/Config", "ComplianceByConfigRule", "ConfigRuleName", "required-tags"],
["AWS/Config", "ComplianceByConfigRule", "ConfigRuleName", "restricted-ssh"],
["AWS/Config", "ComplianceByConfigRule", "ConfigRuleName", "encrypted-volumes"]
]
}
},
{
type = "text"
x = 0
y = 6
width = 24
height = 2
properties = {
markdown = "## Drift Remediation\n\n| Priority | Action |\n|----------|--------|\n| Critical | [Run Auto-Remediation](https://github.com/company/infra/actions/workflows/remediate.yml) |\n| Review | [View Drift Report](https://github.com/company/infra/issues?q=label:drift) |"
}
}
]
})
}
# Lambda function to publish drift metrics
import boto3
import subprocess
import json

cloudwatch = boto3.client('cloudwatch')


def lambda_handler(event, context):
    environment = event['environment']
    # Run terraform plan; -json emits one JSON object per line
    result = subprocess.run(
        ['terraform', 'plan', '-detailed-exitcode', '-json'],
        cwd=f'/terraform/{environment}',
        capture_output=True,
        text=True
    )
    # Exit code 2 means the plan succeeded and found changes
    drift_detected = result.returncode == 2
    changes = parse_plan_changes(result.stdout)
    # Publish one metric per counter, all tagged with the environment
    dimensions = [{'Name': 'Environment', 'Value': environment}]
    values = {
        'DriftDetected': 1 if drift_detected else 0,
        'ResourcesWithDrift': changes['total'],
        'AddedResources': changes['add'],
        'ChangedResources': changes['change'],
        'DestroyedResources': changes['destroy'],
    }
    cloudwatch.put_metric_data(
        Namespace='Terraform',
        MetricData=[
            {
                'MetricName': name,
                'Dimensions': dimensions,
                'Value': value,
                'Unit': 'Count'
            }
            for name, value in values.items()
        ]
    )
    return {
        'drift_detected': drift_detected,
        'changes': changes
    }


def parse_plan_changes(plan_json):
    changes = {'add': 0, 'change': 0, 'destroy': 0, 'total': 0}
    for line in plan_json.split('\n'):
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            continue
        # The "change_summary" message carries the same counts as
        # "Plan: X to add, Y to change, Z to destroy"
        if isinstance(data, dict) and data.get('type') == 'change_summary':
            summary = data.get('changes', {})
            changes['add'] = summary.get('add', 0)
            changes['change'] = summary.get('change', 0)
            changes['destroy'] = summary.get('remove', 0)
    changes['total'] = changes['add'] + changes['change'] + changes['destroy']
    return changes

Company: Healthcare SaaS provider

Incident: Undiscovered security group drift for 6 weeks

Timeline:

  • Day 0 (Friday 11 PM): Production incident—vendor can’t connect for emergency support
  • Day 0 (11:30 PM): On-call engineer adds temporary security group rule via console
  • Day 0 (Saturday 1 AM): Incident resolved, engineer goes to sleep
  • Day 1 (Saturday): Engineer has day off, forgets to remove rule
  • Day 2-41: Rule persists, nobody notices
  • Day 42 (Friday 2 PM): Compliance scanner detects anomaly
  • Day 42 (3 PM): Security team investigates, finds 42-day-old rule
  • Day 42 (5 PM): Incident declared, forensics begins

What the Security Group Looked Like:

# In Terraform (desired state):
resource "aws_security_group_rule" "db_access" {
type = "ingress"
from_port = 5432
to_port = 5432
protocol = "tcp"
cidr_blocks = ["10.0.0.0/8"] # Internal only
security_group_id = aws_security_group.db.id
}
# Actual AWS state (unknown to Terraform):
# - Original rule (10.0.0.0/8) ✓
# - Manual rule (vendor IP: 203.0.113.0/24) ← 42 days old!

Financial Impact:

  • Incident response team: $45K
  • External forensics: $120K
  • Compliance consultant: $85K
  • Penetration test (required): $65K
  • SOC2 delay (3 months): Lost deal worth $2.1M
  • Additional security monitoring: $25K/year
  • Total: $340K direct + $2.1M opportunity cost
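
The direct total can be checked by summing the line items above. This short sketch uses the figures from the list. The dictionary keys are illustrative labels, and the monitoring item is a yearly cost counted once, so the sum represents the first year only.

```python
# Direct costs from the list above, in US dollars.
direct_costs_usd = {
    "incident_response": 45_000,
    "external_forensics": 120_000,
    "compliance_consultant": 85_000,
    "penetration_test": 65_000,
    "security_monitoring_first_year": 25_000,
}
opportunity_cost_usd = 2_100_000  # Lost deal attributed to the SOC2 delay

direct_total = sum(direct_costs_usd.values())
print(f"Direct: ${direct_total:,}")
print(f"Direct plus opportunity cost: ${direct_total + opportunity_cost_usd:,}")
```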

Detection That Would Have Caught This:

# If they had drift detection running...
# Day 1 detection would have shown:
# terraform plan output:
# aws_security_group.db has changed
#
# ~ resource "aws_security_group" "db" {
# ~ ingress {
# + {
# + cidr_blocks = ["203.0.113.0/24"] # UNEXPECTED!
# + from_port = 5432
# + to_port = 5432
# + protocol = "tcp"
# }
# }
# }
#
# 1 resource has drift

Prevention Measures Implemented:

# 1. SCP preventing manual security group changes
resource "aws_organizations_policy" "no_manual_sg" {
name = "NoManualSecurityGroupChanges" # Illustrative name
type = "SERVICE_CONTROL_POLICY"
content = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Deny"
Action = ["ec2:AuthorizeSecurityGroupIngress"]
Resource = "*"
Condition = {
ArnNotLike = {
"aws:PrincipalArn" = "arn:aws:iam::*:role/TerraformRole"
}
}
}]
})
}
# 2. Drift detection every 6 hours
# (GitHub Actions workflow)
# 3. Break-glass procedure for emergencies
# - Requires ticket number
# - Auto-expires after 4 hours
# - Creates GitHub issue for follow-up
# 4. CloudTrail alerting on any SG changes
resource "aws_cloudwatch_event_rule" "sg_changes" {
event_pattern = jsonencode({
source = ["aws.ec2"]
detail-type = ["AWS API Call via CloudTrail"]
detail = {
eventName = [
"AuthorizeSecurityGroupIngress",
"AuthorizeSecurityGroupEgress",
"RevokeSecurityGroupIngress",
"RevokeSecurityGroupEgress"
]
}
})
}
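
The break-glass procedure listed in measure 3 has two rules that are easy to encode: access requires a ticket number, and it expires after 4 hours. The sketch below models only those two rules. The class name, field names, and sample timestamps are hypothetical, and creating the follow-up GitHub issue is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class BreakGlassGrant:
    """A time-boxed emergency access grant tied to a ticket."""
    ticket: str
    granted_at: datetime
    ttl: timedelta = timedelta(hours=4)

    def __post_init__(self):
        if not self.ticket.strip():
            raise ValueError("break-glass access requires a ticket number")

    def expires_at(self) -> datetime:
        return self.granted_at + self.ttl

    def is_active(self, now: datetime) -> bool:
        """True while the grant is inside its validity window."""
        return self.granted_at <= now < self.expires_at()


start = datetime(2024, 1, 5, 23, 30, tzinfo=timezone.utc)
grant = BreakGlassGrant(ticket="ticket-12345", granted_at=start)

print(grant.is_active(start + timedelta(hours=1)))  # inside the window
print(grant.is_active(start + timedelta(hours=5)))  # after expiry
```

A grant like this only limits the access window. The manual change itself still has to be reverted or codified afterwards.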

| Mistake | Problem | Solution |
|---------|---------|----------|
| No drift detection | Silent divergence from desired state | Scheduled terraform plan |
| Ignoring drift alerts | Alert fatigue, missed critical drift | Categorize severity, auto-remediate low-risk |
| Manual console access | Primary drift source | SCPs, read-only console access |
| No break-glass procedure | Emergency changes bypass IaC entirely | Tracked emergency access with auto-expiry |
| Too broad ignore_changes | Masks legitimate drift | Only ignore what’s truly managed elsewhere |
| No drift metrics | Can’t track drift trends | CloudWatch metrics, dashboards |
| Remediation without review | Apply might break things | Always review production drift before apply |
| No root cause analysis | Same drift recurs | Track drift sources, fix process gaps |

1. A junior engineer manually resizes a database instance from the AWS console during a load spike, while another engineer updates the Terraform code to change a security group rule, but hasn't applied it yet. What types of drift are occurring in this scenario?

Answer: Two distinct types of drift are occurring at once. First, there is configuration drift: the database instance size in the cloud now differs from the desired state defined in the Terraform configuration. Second, there is code drift: the unapplied security group change means the Terraform code no longer matches the state file or the deployed infrastructure. The distinction matters because a standard plan will surface both, but only the manual resize is an out-of-band change that circumvented the deployment process.

2. You are configuring a CI/CD pipeline to automatically check for drift every night. You write a bash script that runs `terraform plan -detailed-exitcode`. The pipeline fails and reports an exit code of 2. What exactly does this indicate, and how should your pipeline handle it?

Answer: An exit code of 2 from terraform plan -detailed-exitcode specifically indicates that Terraform ran successfully but detected a non-empty diff, meaning drift or unapplied changes exist. This is distinct from an exit code of 0 (no changes) or an exit code of 1 (a runtime or syntax error during the planning phase). Your CI/CD pipeline should be configured to interpret exit code 2 not as a pipeline crash, but as a trigger to generate an alert, capture the plan output, and notify the infrastructure team that a divergence requires their attention. By capturing the specific drift output before exiting, the pipeline ensures the infrastructure team has immediate context on exactly which resources diverged without needing to rerun the check manually.

3. Your team uses an AWS Lambda function to automatically rotate a database password stored in AWS Secrets Manager every 30 days. However, your daily Terraform pipeline keeps detecting drift on the `secret_string` attribute and attempting to revert the password. How do you resolve this conflict?

Answer: You should resolve this by adding a lifecycle { ignore_changes = [secret_string] } block to the secret version resource in your Terraform code. This tells Terraform to ignore any future changes to that specific attribute after the resource is initially created. You apply this strategy because the attribute is intentionally managed by a secondary, dynamic system rather than static IaC. If you instead tried to “fix” the drift by running Terraform, you would overwrite the securely rotated password with the original state, causing an immediate database authentication outage.

4. Your leadership is hesitant to invest engineering time into a scheduled drift detection pipeline. They argue that manual reviews are sufficient. If an unmanaged change causes one 8-hour outage per quarter at $50,000 per hour, how can you financially justify implementing a 6-hour automated detection window?

Answer: Without automated detection, one 8-hour outage per quarter costs the company $400,000 per quarter, or $1.6 million annually. By implementing a 6-hour automated detection window, the maximum time a drifting configuration can exist undetected shrinks drastically, making it highly probable to catch the anomaly before it triggers an outage. Even if the automated detection only prevents 75 percent of the outages, it would save the organization $1.2 million per year. This massive cost avoidance far outweighs the minimal engineering investment required to set up a scheduled GitHub Action or CloudWatch rule, proving that drift detection is a highly profitable operational safeguard.

5. You run `terraform plan` and it reports zero changes. However, your security team informs you that a new, unapproved S3 bucket exists in the production account. Why did Terraform miss this, and what tool should you use to catch it next time?

Answer: Terraform missed the S3 bucket because standard Terraform commands only track resources that are already defined in its state file; it ignores unmanaged resources entirely. The S3 bucket represents “state drift” or shadow IT, which falls outside Terraform’s default purview. To catch this, you should use a specialized tool like driftctl, which actively scans the entire cloud environment and compares all discovered resources against the Terraform state. This provides a comprehensive coverage report identifying exactly what exists in the cloud but is missing from your IaC definitions, allowing you to quickly identify shadow resources.

6. Despite strict company policies, developers continue to manually modify EC2 security groups through the AWS console to quickly test new features. You need to implement a technical control that physically prevents this behavior for Terraform-managed resources without blocking read access. What is the most robust implementation?

Answer: The most robust implementation is to apply an AWS Organizations Service Control Policy (SCP) that explicitly denies modification actions, such as ec2:ModifySecurityGroupRules, to any resource tagged as managed by Terraform. You must include an exception condition that allows these actions only if the caller’s Principal ARN matches your dedicated Terraform execution role. This approach represents a hard boundary that cannot be overridden by individual IAM permissions or local account administrators. It effectively forces developers to route all changes through the IaC pipeline while preserving their ability to view resources via the console.

7. During a routine audit, you discover a manually added ingress rule on a production database security group that has been present for 42 days. The engineer who added it has left the company, and there is no documentation explaining its purpose. What is the safest, most systematic way to handle this drift?

Answer: The safest approach is to first assess the immediate security risk of the rule and document the finding in an incident or audit report. Since the rule’s purpose is unknown and reverting it blindly might break a critical undocumented integration, you should temporarily codify the rule into your Terraform configuration to bring it under management. Once the state is synchronized, you can monitor network traffic to see if the rule is actively used by reviewing VPC flow logs. If it is unused or deemed too risky, you then remove it via standard Terraform deployment processes, ensuring the removal is tracked, reviewed, and easily reversible.

8. You successfully configured a drift remediation script that automatically runs `terraform apply` whenever drift is detected. You deployed this to your staging environment and it worked perfectly. Why should you strongly reconsider deploying this same auto-remediation script to your production environment?

Answer: Deploying auto-remediation to production is highly dangerous because manual changes often represent emergency “break-glass” fixes implemented during critical outages. If an automated system immediately reverts those fixes, it will instantly recreate the outage, leading to a looping battle between the incident responders and the CI/CD pipeline. Additionally, automatically applying state changes strips away the human review process necessary to understand the blast radius of the remediation. For production environments, drift should trigger high-priority alerts and tickets, requiring human judgment to decide whether to codify the drift into the official configuration or safely revert it during a maintenance window.


Objective: Set up drift detection and demonstrate remediation strategies.

# Create test infrastructure
mkdir -p drift-lab
cd drift-lab
cat > main.tf << 'EOF'
terraform {
required_providers {
local = {
source = "hashicorp/local"
version = "~> 2.0"
}
}
}
# Create a file that we'll manually modify
resource "local_file" "config" {
filename = "${path.module}/config.json"
content = jsonencode({
environment = "production"
debug = false
timeout = 30
tags = {
ManagedBy = "terraform"
}
})
}
output "config_path" {
value = local_file.config.filename
}
EOF
# Apply initial configuration
terraform init
terraform apply -auto-approve
# Verify file contents
cat config.json
# Manually modify the file (simulating console change)
cat > config.json << 'EOF'
{
"environment": "production",
"debug": true,
"timeout": 60,
"tags": {
"ManagedBy": "terraform",
"TempFix": "ticket-12345"
}
}
EOF
# Check for drift
terraform plan -detailed-exitcode
echo "Exit code: $?"
# View the drift
terraform plan
# Option A: Revert to desired state
terraform apply -auto-approve
cat config.json # Back to original
# Option B: Accept drift by updating code
# First, introduce drift again
cat > config.json << 'EOF'
{
"environment": "production",
"debug": true,
"timeout": 60
}
EOF
# Update Terraform to match actual state
cat > main.tf << 'EOF'
terraform {
required_providers {
local = {
source = "hashicorp/local"
version = "~> 2.0"
}
}
}
resource "local_file" "config" {
filename = "${path.module}/config.json"
content = jsonencode({
environment = "production"
debug = true # Updated to match reality
timeout = 60 # Updated to match reality
tags = {
ManagedBy = "terraform"
}
})
}
EOF
terraform plan # Should show changes to match our new code
# Create drift detection script
cat > detect-drift.sh << 'EOF'
#!/bin/bash
set -e
echo "🔍 Running drift detection..."
terraform plan -detailed-exitcode -out=tfplan 2>&1 | tee plan.txt
EXIT_CODE=${PIPESTATUS[0]}
if [ $EXIT_CODE -eq 0 ]; then
echo "✅ No drift detected"
elif [ $EXIT_CODE -eq 2 ]; then
echo "⚠️ DRIFT DETECTED!"
echo ""
echo "Changes found:"
terraform show tfplan
echo ""
echo "Remediation options:"
echo "1. Run 'terraform apply' to restore desired state"
echo "2. Update Terraform code to accept changes"
echo "3. Add lifecycle { ignore_changes } for managed attributes"
else
echo "❌ Error during drift detection"
exit 1
fi
exit $EXIT_CODE
EOF
chmod +x detect-drift.sh
# Test the script
./detect-drift.sh
  • Initial infrastructure deployed successfully
  • Manual changes introduced (drift created)
  • terraform plan -detailed-exitcode returns exit code 2
  • Drift details visible in plan output
  • At least one remediation option successfully executed
  • Detection script works and provides actionable output

  • Drift is inevitable - Manual changes, emergencies, other automation all cause drift
  • Detect early, detect often - Run terraform plan on a schedule (at least every 6 hours)
  • Categorize drift severity - Security drift is critical, tag drift might be informational
  • Automate detection, review remediation - Auto-apply only for non-production
  • Use ignore_changes sparingly - Only for truly externally-managed attributes
  • Prevent at source - SCPs, read-only console access, break-glass procedures
  • Track metrics - Drift frequency, time-to-detect, time-to-remediate
  • Root cause analysis - Fix the process that caused drift, not just the drift
  • Document exceptions - Known drift should be in .driftignore with explanation
  • Break-glass needs follow-up - Emergency changes must be codified within 24 hours
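
The time-to-detect and time-to-remediate metrics named in the list can be computed from three timestamps per drift event. The sketch below assumes each event is recorded as a dictionary with `introduced`, `detected`, and `remediated` keys. That schema, the function name, and the sample dates are hypothetical. The first record loosely mirrors the 42-day case study and the second a drift caught by a scheduled check.

```python
from datetime import datetime
from statistics import mean


def mean_hours(records, start_key, end_key):
    """Average gap in hours between two timestamps across drift records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 3600 for r in records
    )


records = [
    {
        "introduced": datetime(2024, 1, 5, 23, 30),
        "detected": datetime(2024, 2, 16, 14, 0),
        "remediated": datetime(2024, 2, 16, 17, 0),
    },
    {
        "introduced": datetime(2024, 3, 1, 9, 0),
        "detected": datetime(2024, 3, 1, 12, 0),
        "remediated": datetime(2024, 3, 1, 13, 0),
    },
]

print("Mean time to detect (h):", mean_hours(records, "introduced", "detected"))
print("Mean time to remediate (h):", mean_hours(records, "detected", "remediated"))
```

A single long-lived drift dominates the mean, so a median or a per-severity breakdown may describe the trend better.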

Drift Statistics: A 2023 survey found that 73% of organizations experience infrastructure drift within 24 hours of deployment, and 91% experience drift within one week.

Detection Gap: The average time to detect infrastructure drift is 12 days, but security-related drift takes an average of 27 days to discover.

Driftctl Origins: Driftctl was created by Cloudskiff (now part of Snyk) in 2020 specifically to solve the problem of detecting unmanaged cloud resources that Terraform plan cannot find.

Cost of Drift: According to a 2023 report, organizations spend an average of 4.2 hours per week manually investigating and remediating infrastructure drift, costing approximately $35,000 per engineer annually.


GitOps Drift: For GitOps-specific drift detection (ArgoCD sync, Flux reconciliation), see GitOps Drift Detection.


Continue to Module 6.6: IaC Cost Management to learn how to estimate, track, and optimize infrastructure costs directly in your Terraform workflow.