Module 6.5: Drift Detection and Remediation
Complexity: [MEDIUM]
Time to Complete: 45 minutes
Prerequisites
Before starting this module, you should have completed:
- Module 6.1: IaC Fundamentals - Core IaC concepts
- Module 6.4: IaC at Scale - Scale challenges
- Basic understanding of desired state vs. actual state
What You’ll Be Able to Do
After completing this module, you will be able to:
- Implement drift detection pipelines that identify when infrastructure state diverges from IaC definitions
- Design automated remediation workflows that reconcile drift without manual intervention
- Build alerting and escalation procedures for drift that cannot be automatically resolved
- Analyze drift root causes — manual changes, incomplete IaC coverage, provider bugs — to prevent recurrence
Why This Module Matters
The Silent Security Group
At 2:47 AM, a senior engineer at a healthcare company received an urgent page. Their compliance monitoring system had detected an anomaly: a production database was accepting connections from an IP range that wasn’t in any approved configuration. The engineer’s first thought was a breach. Their second thought was worse: when did this happen?
The investigation revealed a troubling timeline. Six weeks earlier, during an incident, an on-call engineer had manually added a security group rule to allow access from a vendor’s IP for emergency debugging. The incident was resolved. The temporary rule was forgotten. For 42 days, the production database had a security hole that existed nowhere in their Terraform configurations.
The rule itself wasn’t exploited—they got lucky. But the audit finding triggered a mandatory penetration test, delayed their SOC2 certification by three months, and cost $340,000 in remediation and consulting fees.
This module teaches you how to detect and remediate infrastructure drift—because the most dangerous configuration changes are the ones you don’t know about.
Understanding Infrastructure Drift
Infrastructure drift occurs when the actual state of resources diverges from the desired state defined in code.

    graph LR
        subgraph Desired State [Desired State - Terraform]
            direction TB
            DS_Inst["instance_type = t3.medium"]
            DS_Tags["tags = { env: prod }"]
            DS_SG["sg_rules: port: 443"]
        end

        subgraph Actual State [Actual State - Cloud]
            direction TB
            AS_Inst["instance_type = t3.large"]
            AS_Tags["tags = { env: prod, temp: true }"]
            AS_SG["sg_rules: port: 443, port: 22"]
        end

        DS_Inst -.->|"DRIFT!"| AS_Inst
        DS_Tags -.->|"DRIFT!"| AS_Tags
        DS_SG -.->|"DRIFT!"| AS_SG

        subgraph Sources [Common Drift Sources]
            direction TB
            C1["Manual console changes"]
            C2["Emergency fixes not back-ported to code"]
            C3["Auto-scaling / self-healing systems"]
            C4["Other automation tools (scripts, Lambda)"]
            C5["Cloud provider auto-updates"]
            C6["Malicious actors"]
        end

Types of Drift
Stop and think: If an automated incident response tool correctly modifies an auto-scaling group’s size to mitigate an attack, but Terraform runs its next plan, will Terraform see this as drift? How should your configuration handle this?
1. Configuration Drift
Resource attributes differ from code:

    # In Terraform:
    resource "aws_instance" "app" {
      instance_type = "t3.medium"  # Desired
    }

    # In AWS:
    # Instance is t3.large (someone resized manually)

2. State Drift
Resources exist that aren’t in state:

    # Resources in AWS but not in Terraform:
    # - aws_security_group_rule (manually added)
    # - aws_ebs_volume (created by console)
    # - aws_iam_policy (created by script)

    # These are "shadow IT" - unmanaged infrastructure

3. Code Drift
State doesn’t match code (unapplied changes):

    # main.tf has:
    instance_type = "t3.large"

    # terraform.tfstate has:
    "instance_type": "t3.medium"

    # Developer made local changes but never applied

Detecting Drift
Pause and predict: If you rely solely on `terraform plan` to detect drift, what blind spots remain in your infrastructure visibility? What types of shadow IT might slip past this check?
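The three drift types described above can be thought of as mismatches between three views of the same infrastructure: the code, the state file, and the cloud. The following is a minimal sketch of that comparison, not how Terraform itself works. The `Views` container, the `classify_drift` function, and the resource keys are all hypothetical, and the sketch assumes the state file stands for the last applied configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Views:
    """Three hypothetical snapshots, each mapping a resource key to its attributes."""
    code: dict    # what the .tf files declare
    state: dict   # what the state file recorded at the last apply
    actual: dict  # what the cloud API reports right now


def classify_drift(views: Views) -> dict:
    """Return {resource key: [drift types]} for every resource that has drifted."""
    findings = {}
    all_keys = set(views.code) | set(views.state) | set(views.actual)
    for key in sorted(all_keys):
        in_code = key in views.code
        in_state = key in views.state
        in_cloud = key in views.actual
        kinds = []
        # Exists in the cloud but is unknown to both code and state.
        if in_cloud and not in_state and not in_code:
            kinds.append("state drift (unmanaged)")
        # The cloud no longer matches what was last applied.
        if in_state and in_cloud and views.state[key] != views.actual[key]:
            kinds.append("configuration drift")
        # The code was edited but never applied.
        if in_code and in_state and views.code[key] != views.state[key]:
            kinds.append("code drift")
        if kinds:
            findings[key] = kinds
    return findings


views = Views(
    code={
        "aws_instance.app": {"instance_type": "t3.medium"},
        "aws_security_group.web": {"port": 443},
    },
    state={
        "aws_instance.app": {"instance_type": "t3.medium"},
        "aws_security_group.web": {"port": 80},
    },
    actual={
        "aws_instance.app": {"instance_type": "t3.large"},
        "aws_security_group.web": {"port": 80},
        "vol-example": {"size": 100},
    },
)

report = classify_drift(views)
for key, kinds in report.items():
    print(key, "->", ", ".join(kinds))
```

A plan only walks resources that appear in the code or the state, so the first category needs a separate inventory of what exists in the cloud.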
Method 1: Terraform Plan
The simplest drift detection: run plan and look for unexpected changes.

    # Basic drift detection
    terraform plan -detailed-exitcode

    # Exit codes:
    # 0 = No changes (no drift)
    # 1 = Error
    # 2 = Changes detected (drift or unapplied code changes)

    #!/bin/bash
    # Script for CI/CD
    terraform plan -detailed-exitcode -out=tfplan
    EXIT_CODE=$?

    if [ "$EXIT_CODE" -eq 2 ]; then
      echo "⚠️ DRIFT DETECTED"
      terraform show tfplan
      # Send alert
      curl -X POST "$SLACK_WEBHOOK" \
        -H "Content-Type: application/json" \
        -d '{"text":"🚨 Infrastructure drift detected in production!"}'
    fi

Method 2: Terraform Refresh (Deprecated Approach)

    # Old way - updates state from actual infrastructure
    terraform refresh

    # New way - use plan with refresh
    terraform plan -refresh-only

    # Shows what would change in state without modifying resources

Method 3: Scheduled Drift Detection

    name: Drift Detection

    on:
      schedule:
        - cron: '0 */6 * * *'  # Every 6 hours
      workflow_dispatch:

    permissions:
      id-token: write  # OIDC token for role-to-assume
      contents: read
      issues: write    # Needed by gh issue create

    jobs:
      detect-drift:
        runs-on: ubuntu-latest
        strategy:
          matrix:
            environment: [dev, staging, production]

        steps:
          - uses: actions/checkout@v4

          - name: Setup Terraform
            uses: hashicorp/setup-terraform@v3

          - name: Configure AWS
            uses: aws-actions/configure-aws-credentials@v4
            with:
              role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
              aws-region: us-east-1

          - name: Terraform Init
            working-directory: environments/${{ matrix.environment }}
            run: terraform init

          - name: Check for Drift
            id: drift
            working-directory: environments/${{ matrix.environment }}
            run: |
              terraform plan -detailed-exitcode -out=tfplan 2>&1 | tee plan.txt
              # $? here would be tee's status; PIPESTATUS[0] is terraform's
              echo "exit_code=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
            continue-on-error: true

          - name: Report Drift
            if: steps.drift.outputs.exit_code == '2'
            env:
              GH_TOKEN: ${{ github.token }}  # gh CLI reads its token from here
            run: |
              # Create GitHub Issue
              gh issue create \
                --title "🚨 Drift detected in ${{ matrix.environment }}" \
                --body "$(cat environments/${{ matrix.environment }}/plan.txt)" \
                --label "drift,infrastructure"

              # Slack notification
              curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
                -H "Content-Type: application/json" \
                -d '{
                  "text": "🚨 Infrastructure drift detected",
                  "blocks": [
                    {
                      "type": "section",
                      "text": {
                        "type": "mrkdwn",
                        "text": "*Environment:* ${{ matrix.environment }}\n*Details:* See GitHub Issue"
                      }
                    }
                  ]
                }'

Method 4: AWS Config Rules
Detect drift at the cloud provider level:

    # AWS Config rule for security group drift
    resource "aws_config_config_rule" "security_group_ssh" {
      name = "restricted-ssh"

      source {
        owner             = "AWS"
        source_identifier = "INCOMING_SSH_DISABLED"
      }

      scope {
        compliance_resource_types = ["AWS::EC2::SecurityGroup"]
      }
    }

    # AWS managed rule for required tags
    resource "aws_config_config_rule" "required_tags" {
      name = "required-tags"

      source {
        owner             = "AWS"
        source_identifier = "REQUIRED_TAGS"
      }

      input_parameters = jsonencode({
        tag1Key = "Environment"
        tag2Key = "Team"
        tag3Key = "ManagedBy"
      })
    }

    # Aggregate compliance across accounts
    resource "aws_config_configuration_aggregator" "organization" {
      name = "organization-aggregator"

      organization_aggregation_source {
        all_regions = true
        role_arn    = aws_iam_role.config_aggregator.arn
      }
    }

    # Alert on non-compliant resources
    resource "aws_cloudwatch_event_rule" "config_compliance" {
      name = "config-compliance-change"

      event_pattern = jsonencode({
        source      = ["aws.config"]
        detail-type = ["Config Rules Compliance Change"]
        detail = {
          newEvaluationResult = {
            complianceType = ["NON_COMPLIANT"]
          }
        }
      })
    }

    resource "aws_cloudwatch_event_target" "sns" {
      rule      = aws_cloudwatch_event_rule.config_compliance.name
      target_id = "send-to-sns"
      arn       = aws_sns_topic.security_alerts.arn
    }

Method 5: Driftctl (Open Source)
Specialized tool for drift detection:

    # Install driftctl
    brew install driftctl

    # Scan for drift
    driftctl scan

    # Output example:
    # Found 142 resource(s)
    # - 134 covered by IaC
    # - 5 not covered by IaC    ◄── Unmanaged resources
    # - 3 missing on cloud      ◄── Deleted outside Terraform
    #
    # Coverage: 94%

    # Scan with specific filters
    driftctl scan --filter "Type=='aws_security_group'"

    # Output as JSON for automation
    driftctl scan --output json://drift-report.json

    # Generate .driftignore for known exceptions
    driftctl gen-driftignore

    # .driftignore - Known acceptable drift
    # Auto-generated resources
    aws_autoscaling_group.*
    aws_launch_template.*

    # Managed by other teams
    aws_iam_role.external-*

    # AWS-managed resources
    aws_iam_service_linked_role.*

Remediation Strategies
Stop and think: If we block all manual console changes using SCPs, how will an on-call engineer address an emergency outage at 3 AM if the CI/CD pipeline is also down? What kind of “break-glass” mechanism is needed?
Strategy 1: Auto-Remediate with Apply
For non-critical drift, automatically apply to restore desired state:

    name: Auto-Remediate Drift

    on:
      workflow_dispatch:
        inputs:
          environment:
            description: 'Environment to remediate'
            required: true
            type: choice
            options: [dev, staging]  # Not production!

    permissions:
      id-token: write  # OIDC token for role-to-assume
      contents: read

    jobs:
      remediate:
        runs-on: ubuntu-latest
        environment: ${{ inputs.environment }}

        steps:
          - uses: actions/checkout@v4

          - name: Setup Terraform
            uses: hashicorp/setup-terraform@v3

          - name: Configure AWS
            uses: aws-actions/configure-aws-credentials@v4
            with:
              role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
              aws-region: us-east-1

          - name: Terraform Init
            working-directory: environments/${{ inputs.environment }}
            run: terraform init

          - name: Terraform Apply
            working-directory: environments/${{ inputs.environment }}
            run: terraform apply -auto-approve

          - name: Notify Success
            run: |
              curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
                -H "Content-Type: application/json" \
                -d '{"text":"✅ Drift remediated in ${{ inputs.environment }}"}'

Strategy 2: Import Unmanaged Resources
When drift is intentional, bring resources under management:

    # Discover unmanaged resource
    driftctl scan --filter "Type=='aws_security_group'"
    # Found: aws_security_group.sg-12345 (not in state)

    # Add to Terraform configuration
    cat >> security_groups.tf << 'EOF'
    resource "aws_security_group" "imported" {
      name        = "manually-created-sg"
      description = "Previously manual, now managed"
      vpc_id      = aws_vpc.main.id

      # Add rules to match actual state
      ingress {
        from_port   = 443
        to_port     = 443
        protocol    = "tcp"
        cidr_blocks = ["10.0.0.0/8"]
      }
    }
    EOF

    # Import into state
    terraform import aws_security_group.imported sg-12345

    # Verify
    terraform plan  # Should show no changes

Strategy 3: Accept Drift (Ignore Changes)
For resources managed elsewhere:

    # Auto-scaling managed instance counts
    resource "aws_autoscaling_group" "app" {
      name             = "app-asg"
      desired_capacity = 2
      min_size         = 2
      max_size         = 10

      lifecycle {
        ignore_changes = [
          desired_capacity,   # Managed by ASG policies
          target_group_arns,  # Managed by ALB
        ]
      }
    }

    # EKS node group with cluster autoscaler
    resource "aws_eks_node_group" "workers" {
      cluster_name    = aws_eks_cluster.main.name
      node_group_name = "workers"

      scaling_config {
        desired_size = 3
        min_size     = 2
        max_size     = 20
      }

      lifecycle {
        ignore_changes = [
          scaling_config[0].desired_size,  # Cluster autoscaler manages this
        ]
      }
    }

    # Secrets rotated externally
    resource "aws_secretsmanager_secret_version" "db_password" {
      secret_id     = aws_secretsmanager_secret.db.id
      secret_string = random_password.db.result

      lifecycle {
        ignore_changes = [secret_string]  # Rotated by Lambda
      }
    }

Strategy 4: Prevent Drift at Source
Block manual changes entirely:

    # SCP to prevent manual changes to tagged resources
    resource "aws_organizations_policy" "prevent_manual_changes" {
      name = "PreventManualChangesToManagedResources"
      type = "SERVICE_CONTROL_POLICY"

      content = jsonencode({
        Version = "2012-10-17"
        Statement = [
          {
            Sid    = "DenyManualChangesToManagedResources"
            Effect = "Deny"
            Action = [
              "ec2:ModifyInstanceAttribute",
              "ec2:ModifySecurityGroupRules",
              "rds:ModifyDBInstance",
              "s3:PutBucketPolicy",
              "s3:DeleteBucketPolicy"
            ]
            Resource = "*"
            Condition = {
              StringEquals = {
                "aws:ResourceTag/ManagedBy" = "terraform"
              }
              # Exception for Terraform role
              "ArnNotLike" = {
                "aws:PrincipalArn" = "arn:aws:iam::*:role/TerraformRole"
              }
            }
          }
        ]
      })
    }

    # IAM policy preventing modifications
    resource "aws_iam_policy" "read_only_managed" {
      name = "ReadOnlyManagedResources"

      policy = jsonencode({
        Version = "2012-10-17"
        Statement = [
          {
            Sid    = "DenyWriteToManagedResources"
            Effect = "Deny"
            Action = [
              "ec2:*",
              "rds:*",
              "s3:*"
            ]
            Resource = "*"
            Condition = {
              StringEquals = {
                "aws:ResourceTag/ManagedBy" = "terraform"
              }
            }
          },
          {
            Sid    = "AllowReadToManagedResources"
            Effect = "Allow"
            Action = [
              "ec2:Describe*",
              "rds:Describe*",
              "s3:Get*",
              "s3:List*"
            ]
            Resource = "*"
          }
        ]
      })
    }

Continuous Drift Monitoring
Architecture

    flowchart TD
        Sched["Scheduled Job<br>(6 hours)"] --> Detect["Drift Detection Service"]
        CT["CloudTrail Events"] --> Detect
        Config["AWS Config Rules"] --> Detect

        Detect --> Metrics["Metrics Dashboard"]
        Detect --> Alerts["Alerts (Slack/PD)"]
        Detect --> Reports["Weekly Reports"]

        Metrics --> Runbook["Remediation Runbook"]
        Alerts --> Runbook
        Reports --> Runbook

        Runbook --> Auto["Auto-Remediate (Dev)"]
        Runbook --> Manual["Manual Review (Prod)"]
        Runbook --> Accept["Accept & Document (Known)"]

CloudWatch Dashboard

    resource "aws_cloudwatch_dashboard" "drift_monitoring" {
      dashboard_name = "InfrastructureDrift"

      dashboard_body = jsonencode({
        widgets = [
          {
            type   = "metric"
            x      = 0
            y      = 0
            width  = 12
            height = 6
            properties = {
              title  = "Drift Detection Results"
              region = "us-east-1"
              metrics = [
                ["Terraform", "DriftDetected", "Environment", "production", { stat = "Sum", period = 86400 }],
                ["Terraform", "DriftDetected", "Environment", "staging", { stat = "Sum", period = 86400 }],
                ["Terraform", "DriftDetected", "Environment", "dev", { stat = "Sum", period = 86400 }]
              ]
              view = "timeSeries"
            }
          },
          {
            type   = "metric"
            x      = 12
            y      = 0
            width  = 12
            height = 6
            properties = {
              title  = "AWS Config Compliance"
              region = "us-east-1"
              metrics = [
                ["AWS/Config", "ComplianceByConfigRule", "ConfigRuleName", "required-tags"],
                ["AWS/Config", "ComplianceByConfigRule", "ConfigRuleName", "restricted-ssh"],
                ["AWS/Config", "ComplianceByConfigRule", "ConfigRuleName", "encrypted-volumes"]
              ]
            }
          },
          {
            type   = "text"
            x      = 0
            y      = 6
            width  = 24
            height = 2
            properties = {
              markdown = "## Drift Remediation\n\n| Priority | Action |\n|----------|--------|\n| Critical | [Run Auto-Remediation](https://github.com/company/infra/actions/workflows/remediate.yml) |\n| Review | [View Drift Report](https://github.com/company/infra/issues?q=label:drift) |"
            }
          }
        ]
      })
    }

Custom Drift Metrics

    # Lambda function to publish drift metrics
    import boto3
    import subprocess
    import json

    cloudwatch = boto3.client('cloudwatch')

    def lambda_handler(event, context):
        environment = event['environment']

        # Run terraform plan
        result = subprocess.run(
            ['terraform', 'plan', '-detailed-exitcode', '-json'],
            cwd=f'/terraform/{environment}',
            capture_output=True,
            text=True
        )

        # Parse plan output
        drift_detected = result.returncode == 2
        changes = parse_plan_changes(result.stdout)

        # Publish one metric per counter, all sharing the Environment dimension
        dimensions = [{'Name': 'Environment', 'Value': environment}]
        values = {
            'DriftDetected': 1 if drift_detected else 0,
            'ResourcesWithDrift': changes['total'],
            'AddedResources': changes['add'],
            'ChangedResources': changes['change'],
            'DestroyedResources': changes['destroy'],
        }
        cloudwatch.put_metric_data(
            Namespace='Terraform',
            MetricData=[
                {'MetricName': name, 'Dimensions': dimensions, 'Value': value, 'Unit': 'Count'}
                for name, value in values.items()
            ]
        )

        return {
            'drift_detected': drift_detected,
            'changes': changes
        }

    def parse_plan_changes(plan_json):
        """Read the counts from the change_summary message of `terraform plan -json`."""
        changes = {'add': 0, 'change': 0, 'destroy': 0, 'total': 0}
        for line in plan_json.split('\n'):
            try:
                data = json.loads(line)
            except json.JSONDecodeError:
                continue
            if isinstance(data, dict) and data.get('type') == 'change_summary':
                summary = data.get('changes', {})
                changes['add'] = summary.get('add', 0)
                changes['change'] = summary.get('change', 0)
                # The JSON stream reports deletions under the key "remove"
                changes['destroy'] = summary.get('remove', 0)
                changes['total'] = changes['add'] + changes['change'] + changes['destroy']
        return changes

War Story: The 42-Day Security Hole
Company: Healthcare SaaS provider
Incident: Undiscovered security group drift for 6 weeks
Timeline:
- Day 0 (Friday 11 PM): Production incident—vendor can’t connect for emergency support
- Day 0 (11:30 PM): On-call engineer adds temporary security group rule via console
- Day 0 (Saturday 1 AM): Incident resolved, engineer goes to sleep
- Day 1 (Saturday): Engineer has day off, forgets to remove rule
- Day 2-41: Rule persists, nobody notices
- Day 42 (Wednesday 2 PM): Compliance scanner detects anomaly
- Day 42 (3 PM): Security team investigates, finds 42-day-old rule
- Day 42 (5 PM): Incident declared, forensics begins
What the Security Group Looked Like:

    # In Terraform (desired state):
    resource "aws_security_group_rule" "db_access" {
      type              = "ingress"
      from_port         = 5432
      to_port           = 5432
      protocol          = "tcp"
      cidr_blocks       = ["10.0.0.0/8"]  # Internal only
      security_group_id = aws_security_group.db.id
    }

    # Actual AWS state (unknown to Terraform):
    # - Original rule (10.0.0.0/8) ✓
    # - Manual rule (vendor IP: 203.0.113.0/24) ← 42 days old!

Financial Impact:
- Incident response team: $45K
- External forensics: $120K
- Compliance consultant: $85K
- Penetration test (required): $65K
- SOC2 delay (3 months): Lost deal worth $2.1M
- Additional security monitoring: $25K/year
- Total: $340K direct costs + $2.1M opportunity cost
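As a quick check, the direct line items above add up to the $340,000 figure quoted at the start of the module. This sketch assumes the recurring monitoring cost is counted for its first year only; the variable names are illustrative.

```python
# Direct costs from the list above, in thousands of US dollars.
one_time_costs_k = {
    "Incident response team": 45,
    "External forensics": 120,
    "Compliance consultant": 85,
    "Penetration test (required)": 65,
}
# Assumption: only the first year of the recurring monitoring cost is included.
first_year_monitoring_k = 25

direct_total_k = sum(one_time_costs_k.values()) + first_year_monitoring_k
opportunity_cost_k = 2_100  # The lost deal, tracked separately from direct costs

print(f"Direct costs: ${direct_total_k}K")
print(f"Opportunity cost: ${opportunity_cost_k / 1000}M")
```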
Detection That Would Have Caught This:

    # If they had drift detection running...
    # Day 1 detection would have shown:

    # terraform plan output:
    # aws_security_group.db has changed
    #
    # ~ resource "aws_security_group" "db" {
    #     ~ ingress {
    #         + {
    #             + cidr_blocks = ["203.0.113.0/24"]  # UNEXPECTED!
    #             + from_port   = 5432
    #             + to_port     = 5432
    #             + protocol    = "tcp"
    #           }
    #       }
    #   }
    #
    # 1 resource has drift

Prevention Measures Implemented:

    # 1. SCP preventing manual security group changes
    resource "aws_organizations_policy" "no_manual_sg" {
      name = "NoManualSecurityGroupChanges"  # Illustrative name; the resource requires one
      type = "SERVICE_CONTROL_POLICY"

      content = jsonencode({
        Version = "2012-10-17"
        Statement = [{
          Effect   = "Deny"
          Action   = ["ec2:AuthorizeSecurityGroupIngress"]
          Resource = "*"
          Condition = {
            ArnNotLike = {
              "aws:PrincipalArn" = "arn:aws:iam::*:role/TerraformRole"
            }
          }
        }]
      })
    }

    # 2. Drift detection every 6 hours
    # (GitHub Actions workflow)

    # 3. Break-glass procedure for emergencies
    # - Requires ticket number
    # - Auto-expires after 4 hours
    # - Creates GitHub issue for follow-up

    # 4. CloudTrail alerting on any SG changes
    resource "aws_cloudwatch_event_rule" "sg_changes" {
      event_pattern = jsonencode({
        source      = ["aws.ec2"]
        detail-type = ["AWS API Call via CloudTrail"]
        detail = {
          eventName = [
            "AuthorizeSecurityGroupIngress",
            "AuthorizeSecurityGroupEgress",
            "RevokeSecurityGroupIngress",
            "RevokeSecurityGroupEgress"
          ]
        }
      })
    }

Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| No drift detection | Silent divergence from desired state | Scheduled terraform plan |
| Ignoring drift alerts | Alert fatigue, missed critical drift | Categorize severity, auto-remediate low-risk |
| Manual console access | Primary drift source | SCPs, read-only console access |
| No break-glass procedure | Emergency changes bypass IaC entirely | Tracked emergency access with auto-expiry |
| Too broad ignore_changes | Masks legitimate drift | Only ignore what’s truly managed elsewhere |
| No drift metrics | Can’t track drift trends | CloudWatch metrics, dashboards |
| Remediation without review | Apply might break things | Always review production drift before apply |
| No root cause analysis | Same drift recurs | Track drift sources, fix process gaps |
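Several rows in the table come down to deciding what happens to a drift finding once it is detected. The sketch below shows one possible routing policy, assuming that security-related resource types are critical, tag-only changes are informational, and auto-remediation is allowed only outside production. The `DriftFinding` type, the type lists, and the action names are hypothetical and would need tuning for a real environment.

```python
from dataclasses import dataclass

# Assumed classification; adjust both sets to match your own risk model.
CRITICAL_TYPES = {
    "aws_security_group",
    "aws_security_group_rule",
    "aws_iam_policy",
    "aws_iam_role",
}
LOW_RISK_ATTRIBUTES = {"tags"}


@dataclass(frozen=True)
class DriftFinding:
    environment: str
    resource_type: str
    changed_attributes: frozenset


def severity(finding: DriftFinding) -> str:
    """Security-related drift is critical; tag-only drift is informational."""
    if finding.resource_type in CRITICAL_TYPES:
        return "critical"
    if finding.changed_attributes and finding.changed_attributes <= LOW_RISK_ATTRIBUTES:
        return "info"
    return "warning"


def route(finding: DriftFinding) -> str:
    """Pick an action: page, open a ticket for human review, or auto-remediate."""
    level = severity(finding)
    if level == "critical":
        return "page-on-call"
    if finding.environment == "production":
        # Production drift is always reviewed by a human before any apply.
        return "open-ticket-for-review"
    if level == "info":
        return "auto-remediate"
    return "open-ticket-for-review"


examples = [
    DriftFinding("dev", "aws_instance", frozenset({"tags"})),
    DriftFinding("production", "aws_instance", frozenset({"tags"})),
    DriftFinding("staging", "aws_security_group", frozenset({"ingress"})),
    DriftFinding("dev", "aws_instance", frozenset({"instance_type"})),
]
for finding in examples:
    print(finding.environment, finding.resource_type, "->", route(finding))
```

Keeping the policy in one small function makes it easy to review and test, which matters because this is the code that decides whether an apply runs without a human in the loop.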
1. A junior engineer manually resizes a database instance from the AWS console during a load spike, while another engineer updates the Terraform code to change a security group rule, but hasn't applied it yet. What types of drift are occurring in this scenario?
Answer: In this scenario, two distinct types of drift are occurring simultaneously. First, there is configuration drift because the database instance size in the cloud now differs from the desired state defined in the Terraform configuration. Second, there is code drift because the security group change exists in the Terraform code but has not been applied, so the code no longer matches the recorded state. Understanding the difference is crucial because a standard plan will surface both, but only the manual resize is a change that circumvented the deployment process.
2. You are configuring a CI/CD pipeline to automatically check for drift every night. You write a bash script that runs `terraform plan -detailed-exitcode`. The pipeline fails and reports an exit code of 2. What exactly does this indicate, and how should your pipeline handle it?
Answer: An exit code of 2 from `terraform plan -detailed-exitcode` specifically indicates that Terraform ran successfully but detected a non-empty diff, meaning drift or unapplied changes exist. This is distinct from an exit code of 0 (no changes) or an exit code of 1 (a runtime or syntax error during the planning phase). Your CI/CD pipeline should be configured to interpret exit code 2 not as a pipeline crash, but as a trigger to generate an alert, capture the plan output, and notify the infrastructure team that a divergence requires their attention. By capturing the specific drift output before exiting, the pipeline ensures the infrastructure team has immediate context on exactly which resources diverged without needing to rerun the check manually.
3. Your team uses an AWS Lambda function to automatically rotate a database password stored in AWS Secrets Manager every 30 days. However, your daily Terraform pipeline keeps detecting drift on the `secret_string` attribute and attempting to revert the password. How do you resolve this conflict?
Answer: You should resolve this by adding a `lifecycle { ignore_changes = [secret_string] }` block to the secret version resource in your Terraform code. This tells Terraform to ignore any future changes to that specific attribute after the resource is initially created. You apply this strategy because the attribute is intentionally managed by a secondary, dynamic system rather than static IaC. If you instead tried to “fix” the drift by running Terraform, you would overwrite the securely rotated password with the original state, causing an immediate database authentication outage.
4. Your leadership is hesitant to invest engineering time into a scheduled drift detection pipeline. They argue that manual reviews are sufficient. If an unmanaged change causes one 8-hour outage per quarter at $50,000 per hour, how can you financially justify implementing a 6-hour automated detection window?
Answer: Without automated detection, a single 8-hour outage costs the company $400,000 (8 hours at $50,000 per hour), and one such outage per quarter adds up to $1.6 million annually. By implementing a 6-hour automated detection window, the maximum time a drifting configuration can exist undetected shrinks drastically, making it likely that the anomaly is caught before it triggers an outage. Even if the automated detection only prevents 75 percent of the outages, it would save the organization $1.2 million per year. This cost avoidance far outweighs the minimal engineering investment required to set up a scheduled GitHub Action or CloudWatch rule, which makes drift detection a highly profitable operational safeguard.
5. You run `terraform plan` and it reports zero changes. However, your security team informs you that a new, unapproved S3 bucket exists in the production account. Why did Terraform miss this, and what tool should you use to catch it next time?
Answer: Terraform missed the S3 bucket because standard Terraform commands only track resources that are already defined in its state file; it ignores unmanaged resources entirely. The S3 bucket represents “state drift” or shadow IT, which falls outside Terraform’s default purview. To catch this, you should use a specialized tool like driftctl, which actively scans the entire cloud environment and compares all discovered resources against the Terraform state. This provides a comprehensive coverage report identifying exactly what exists in the cloud but is missing from your IaC definitions, allowing you to quickly identify shadow resources.
6. Despite strict company policies, developers continue to manually modify EC2 security groups through the AWS console to quickly test new features. You need to implement a technical control that physically prevents this behavior for Terraform-managed resources without blocking read access. What is the most robust implementation?
Answer: The most robust implementation is to apply an AWS Organizations Service Control Policy (SCP) that explicitly denies modification actions, such as `ec2:ModifySecurityGroupRules`, to any resource tagged as managed by Terraform. You must include an exception condition that allows these actions only if the caller’s Principal ARN matches your dedicated Terraform execution role. This approach represents a hard boundary that cannot be overridden by individual IAM permissions or local account administrators. It effectively forces developers to route all changes through the IaC pipeline while preserving their ability to view resources via the console.
7. During a routine audit, you discover a manually added ingress rule on a production database security group that has been present for 42 days. The engineer who added it has left the company, and there is no documentation explaining its purpose. What is the safest, most systematic way to handle this drift?
Answer: The safest approach is to first assess the immediate security risk of the rule and document the finding in an incident or audit report. Since the rule’s purpose is unknown and reverting it blindly might break a critical undocumented integration, you should temporarily codify the rule into your Terraform configuration to bring it under management. Once the state is synchronized, you can monitor network traffic to see if the rule is actively used by reviewing VPC flow logs. If it is unused or deemed too risky, you then remove it via standard Terraform deployment processes, ensuring the removal is tracked, reviewed, and easily reversible.
8. You successfully configured a drift remediation script that automatically runs `terraform apply` whenever drift is detected. You deployed this to your staging environment and it worked perfectly. Why should you strongly reconsider deploying this same auto-remediation script to your production environment?
Answer: Deploying auto-remediation to production is highly dangerous because manual changes often represent emergency “break-glass” fixes implemented during critical outages. If an automated system immediately reverts those fixes, it will instantly recreate the outage, leading to a looping battle between the incident responders and the CI/CD pipeline. Additionally, automatically applying state changes strips away the human review process necessary to understand the blast radius of the remediation. For production environments, drift should trigger high-priority alerts and tickets, requiring human judgment to decide whether to codify the drift into the official configuration or safely revert it during a maintenance window.
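The cost argument in question 4 can be checked with a few lines of arithmetic. This is only a sketch of the calculation using the figures given in the question; the 75 percent prevention rate is the answer's own assumption, not a measured value.

```python
# Figures taken from question 4.
OUTAGE_HOURS = 8
COST_PER_HOUR = 50_000
OUTAGES_PER_YEAR = 4      # One per quarter
PREVENTED_PERCENT = 75    # Assumed share of outages that early detection avoids

cost_per_outage = OUTAGE_HOURS * COST_PER_HOUR
annual_cost = cost_per_outage * OUTAGES_PER_YEAR
annual_savings = annual_cost * PREVENTED_PERCENT // 100

print(f"Cost per outage: ${cost_per_outage:,}")
print(f"Annual cost without detection: ${annual_cost:,}")
print(f"Annual savings at {PREVENTED_PERCENT}% prevention: ${annual_savings:,}")
```

Changing `PREVENTED_PERCENT` shows how sensitive the business case is to that assumption: even at 25 percent the avoided cost is one full outage per year.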
Hands-On Exercise
Objective: Set up drift detection and demonstrate remediation strategies.
Part 1: Create Driftable Infrastructure

    # Create test infrastructure
    mkdir -p drift-lab
    cd drift-lab

    cat > main.tf << 'EOF'
    terraform {
      required_providers {
        local = {
          source  = "hashicorp/local"
          version = "~> 2.0"
        }
      }
    }

    # Create a file that we'll manually modify
    resource "local_file" "config" {
      filename = "${path.module}/config.json"
      content = jsonencode({
        environment = "production"
        debug       = false
        timeout     = 30
        tags = {
          ManagedBy = "terraform"
        }
      })
    }

    output "config_path" {
      value = local_file.config.filename
    }
    EOF

    # Apply initial configuration
    terraform init
    terraform apply -auto-approve

    # Verify file contents
    cat config.json

Part 2: Introduce Drift

    # Manually modify the file (simulating console change)
    cat > config.json << 'EOF'
    {
      "environment": "production",
      "debug": true,
      "timeout": 60,
      "tags": {
        "ManagedBy": "terraform",
        "TempFix": "ticket-12345"
      }
    }
    EOF

    # Check for drift
    terraform plan -detailed-exitcode
    echo "Exit code: $?"

    # View the drift
    terraform plan

Part 3: Remediation Options

    # Option A: Revert to desired state
    terraform apply -auto-approve
    cat config.json  # Back to original

    # Option B: Accept drift by updating code
    # First, introduce drift again
    cat > config.json << 'EOF'
    {
      "environment": "production",
      "debug": true,
      "timeout": 60
    }
    EOF

    # Update Terraform to match actual state
    cat > main.tf << 'EOF'
    terraform {
      required_providers {
        local = {
          source  = "hashicorp/local"
          version = "~> 2.0"
        }
      }
    }

    resource "local_file" "config" {
      filename = "${path.module}/config.json"
      content = jsonencode({
        environment = "production"
        debug       = true  # Updated to match reality
        timeout     = 60    # Updated to match reality
        tags = {
          ManagedBy = "terraform"
        }
      })
    }
    EOF

    terraform plan  # Should show changes to match our new code

Part 4: Automated Detection Script

    # Create drift detection script
    cat > detect-drift.sh << 'EOF'
    #!/bin/bash
    set -e

    echo "🔍 Running drift detection..."

    terraform plan -detailed-exitcode -out=tfplan 2>&1 | tee plan.txt
    EXIT_CODE=${PIPESTATUS[0]}

    if [ "$EXIT_CODE" -eq 0 ]; then
      echo "✅ No drift detected"
    elif [ "$EXIT_CODE" -eq 2 ]; then
      echo "⚠️ DRIFT DETECTED!"
      echo ""
      echo "Changes found:"
      terraform show tfplan
      echo ""
      echo "Remediation options:"
      echo "1. Run 'terraform apply' to restore desired state"
      echo "2. Update Terraform code to accept changes"
      echo "3. Add lifecycle { ignore_changes } for managed attributes"
    else
      echo "❌ Error during drift detection"
      exit 1
    fi

    exit $EXIT_CODE
    EOF

    chmod +x detect-drift.sh

    # Test the script
    ./detect-drift.sh

Success Criteria
- Initial infrastructure deployed successfully
- Manual changes introduced (drift created)
- `terraform plan -detailed-exitcode` returns exit code 2
- Drift details visible in plan output
- At least one remediation option successfully executed
- Detection script works and provides actionable output
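For pipelines that drive Terraform from Python rather than bash, the exit-code handling exercised in this lab could be sketched as follows. The `PlanResult` enum and the helper names are hypothetical; only the three exit codes of `terraform plan -detailed-exitcode` come from the module. The `run_plan` helper is defined but not called here, because it needs the `terraform` binary on the PATH.

```python
import json
import subprocess
from enum import Enum
from typing import Optional


class PlanResult(Enum):
    NO_CHANGES = 0
    ERROR = 1
    DRIFT = 2


def interpret_exit_code(code: int) -> PlanResult:
    """Map a `terraform plan -detailed-exitcode` status to a result."""
    try:
        return PlanResult(code)
    except ValueError:
        # Any unexpected status (for example 127, command not found) is an error.
        return PlanResult.ERROR


def run_plan(workdir: str) -> PlanResult:
    """Run the plan in `workdir` and classify the outcome."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_exit_code(proc.returncode)


def alert_payload(environment: str, result: PlanResult) -> Optional[str]:
    """Build a Slack-style JSON body, or None when there is nothing to report."""
    if result is PlanResult.NO_CHANGES:
        return None
    status = "drift detected" if result is PlanResult.DRIFT else "plan failed"
    return json.dumps({"text": f"Terraform {status} in {environment}"})


print(interpret_exit_code(2).name)
print(alert_payload("production", PlanResult.DRIFT))
```

Treating an unknown status as an error, rather than as "no drift", keeps a broken pipeline from silently reporting a clean result.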
Key Takeaways
- Drift is inevitable - Manual changes, emergencies, other automation all cause drift
- Detect early, detect often - Run terraform plan on a schedule (at least every 6 hours)
- Categorize drift severity - Security drift is critical, tag drift might be informational
- Automate detection, review remediation - Auto-apply only for non-production
- Use ignore_changes sparingly - Only for truly externally-managed attributes
- Prevent at source - SCPs, read-only console access, break-glass procedures
- Track metrics - Drift frequency, time-to-detect, time-to-remediate
- Root cause analysis - Fix the process that caused drift, not just the drift
- Document exceptions - Known drift should be in .driftignore with explanation
- Break-glass needs follow-up - Emergency changes must be codified within 24 hours
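The "track metrics" takeaway can be made concrete with a small calculation of time-to-detect and time-to-remediate. This is a minimal sketch: the `DriftEvent` record and the sample timestamps are hypothetical, and it assumes each event has all three timestamps, which in practice would come from CloudTrail, the detection job, and the remediation ticket.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class DriftEvent:
    introduced_at: datetime   # When the out-of-band change was made
    detected_at: datetime     # When drift detection flagged it
    remediated_at: datetime   # When it was reverted or codified


def _hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def summarize(events: list) -> dict:
    """Return drift frequency and mean detection/remediation times in hours.

    Raises statistics.StatisticsError if `events` is empty.
    """
    return {
        "drift_count": len(events),
        "mean_time_to_detect_hours": mean(
            _hours(e.introduced_at, e.detected_at) for e in events
        ),
        "mean_time_to_remediate_hours": mean(
            _hours(e.detected_at, e.remediated_at) for e in events
        ),
    }


# Hypothetical sample data for illustration only.
events = [
    DriftEvent(
        introduced_at=datetime(2024, 1, 1, 0, 0),
        detected_at=datetime(2024, 1, 1, 6, 0),
        remediated_at=datetime(2024, 1, 1, 8, 0),
    ),
    DriftEvent(
        introduced_at=datetime(2024, 1, 2, 12, 0),
        detected_at=datetime(2024, 1, 2, 14, 0),
        remediated_at=datetime(2024, 1, 2, 18, 0),
    ),
]

summary = summarize(events)
for name, value in summary.items():
    print(name, value)
```

With a 6-hour detection schedule, the mean time to detect should stay under 6 hours; a value well above that suggests the schedule is failing or the alerts are being ignored.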
Did You Know?
Drift Statistics: A 2023 survey found that 73% of organizations experience infrastructure drift within 24 hours of deployment, and 91% experience drift within one week.
Detection Gap: The average time to detect infrastructure drift is 12 days, but security-related drift takes an average of 27 days to discover.
Driftctl Origins: Driftctl was created by Cloudskiff (now part of Snyk) in 2020 specifically to solve the problem of detecting unmanaged cloud resources that Terraform plan cannot find.
Cost of Drift: According to a 2023 report, organizations spend an average of 4.2 hours per week manually investigating and remediating infrastructure drift, costing approximately $35,000 per engineer annually.
Related Modules
GitOps Drift: For GitOps-specific drift detection (ArgoCD sync, Flux reconciliation), see GitOps Drift Detection.
Next Module
Continue to Module 6.6: IaC Cost Management to learn how to estimate, track, and optimize infrastructure costs directly in your Terraform workflow.