Module 8.10: Scaling IaC & State Management
Complexity:
[MEDIUM]Time to Complete: 2 hours
Prerequisites: Basic Terraform experience (variables, modules, state), familiarity with Git workflows
Track: Advanced Cloud Operations
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Design Terraform or Pulumi state management strategies for large-scale multi-account infrastructure
- Implement modular IaC patterns with versioned modules, workspace isolation, and policy-as-code validation
- Configure automated drift detection and remediation pipelines for infrastructure managed by Terraform or CloudFormation
- Evaluate IaC tools (Terraform, Pulumi, CloudFormation, Bicep, Crossplane) for multi-cloud Kubernetes platform engineering
Why This Module Matters
Section titled “Why This Module Matters”May 2023. A platform engineering team at a logistics company. 280 Terraform resources. One state file.
A routine terraform plan took 12 minutes. A terraform apply for a single security group change took 18 minutes because Terraform refreshed the state of all 280 resources before making one change. The team had adopted Terraform early, put everything in one root module, and never refactored. The state file was 45MB of JSON. Merging pull requests was a nightmare because two engineers running terraform apply simultaneously could corrupt the state file. The team had experienced three state corruption incidents in six months, each requiring manual state surgery with terraform state rm and terraform import.
The breaking point came during a production incident. A load balancer needed a target group update. The engineer ran terraform apply. Twelve minutes of state refresh. During those twelve minutes, the incident escalated from P2 to P1. The engineer resorted to making the change manually in the AWS console — which worked immediately but introduced drift that broke the next Terraform run.
This module teaches you how to structure Terraform (and other IaC tools) for scale: how to split state files to keep operations fast, how to design reusable modules for Kubernetes infrastructure, how to bridge IaC with GitOps (Crossplane vs. Terraform Operator), and how to detect and prevent configuration drift. These are the patterns that let platform teams manage hundreds of resources across dozens of accounts without drowning in state files.
The State File Problem
Section titled “The State File Problem”Terraform state is the mapping between your configuration files and real infrastructure. Every resource you manage adds to the state file. As the state grows, every operation slows down because Terraform refreshes the entire state before making changes.
Pause and predict: If two engineers simultaneously run
terraform applyon a local monolithic state file without any remote backend or locking configured, what exactly happens to the JSON file?
STATE FILE GROWTH AND ITS CONSEQUENCES════════════════════════════════════════════════════════════════
Resources State Size Plan Time Apply Time Risk────────────────────────────────────────────────────────────10 ~100KB 5 sec 30 sec Low50 ~500KB 30 sec 2 min Low100 ~2MB 2 min 5 min Medium250 ~10MB 8 min 15 min High500 ~30MB 15 min 25 min Very High1000+ ~100MB+ 30+ min 45+ min Extreme
At 250+ resources, you WILL experience:- Slow plans blocking CI/CD pipelines- State lock timeouts- Team members waiting to apply changes- Temptation to make manual changes (drift)- State corruption from concurrent operationsState Splitting Strategy
Section titled “State Splitting Strategy”The solution is to split your Terraform configuration into multiple independent state files, each managing a logical group of resources.
Stop and think: In the split-state architecture shown below, if a syntax error breaks the
databases/configuration, can the platform team still deploy updates to the EKS cluster or IAM roles? How does this impact deployment velocity during an incident?
STATE SPLITTING ARCHITECTURE════════════════════════════════════════════════════════════════
BEFORE (monolith): terraform/ └── main.tf # 280 resources, one state file ├── VPC ├── Subnets ├── EKS cluster ├── Node groups ├── RDS ├── ElastiCache ├── S3 buckets ├── IAM roles ├── Route53 └── CloudWatch
AFTER (split by concern): terraform/ ├── networking/ # 30 resources, own state │ ├── vpc.tf │ ├── subnets.tf │ ├── transit-gw.tf │ └── outputs.tf # VPC ID, subnet IDs exported │ ├── eks-cluster/ # 25 resources, own state │ ├── cluster.tf │ ├── node-groups.tf │ ├── addons.tf │ └── data.tf # Reads networking outputs │ ├── databases/ # 20 resources, own state │ ├── rds.tf │ ├── elasticache.tf │ └── data.tf # Reads networking outputs │ ├── iam/ # 40 resources, own state │ ├── roles.tf │ ├── policies.tf │ └── irsa.tf │ └── dns/ # 15 resources, own state ├── zones.tf └── records.tf
Each directory: independent terraform init, plan, apply Each has ~20-40 resources: plans take seconds, not minutesRemote State Data Sources
Section titled “Remote State Data Sources”Split state files need to reference each other. Use terraform_remote_state or data sources.
# networking/outputs.tf -- Export values from networking stateoutput "vpc_id" { value = aws_vpc.main.id}
output "private_subnet_ids" { value = aws_subnet.private[*].id}
output "eks_security_group_id" { value = aws_security_group.eks.id}
# eks-cluster/data.tf -- Read from networking statedata "terraform_remote_state" "networking" { backend = "s3" config = { bucket = "company-terraform-state" key = "networking/terraform.tfstate" region = "us-east-1" }}
# eks-cluster/cluster.tf -- Use the imported valuesresource "aws_eks_cluster" "main" { name = "prod-cluster" role_arn = aws_iam_role.eks.arn
vpc_config { subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids security_group_ids = [data.terraform_remote_state.networking.outputs.eks_security_group_id] }}Better Alternative: Use Data Sources Instead of Remote State
Section titled “Better Alternative: Use Data Sources Instead of Remote State”# Instead of remote state, query AWS directly# This avoids tight coupling between state files
data "aws_vpc" "main" { tags = { Name = "production-vpc" Environment = "production" }}
data "aws_subnets" "private" { filter { name = "vpc-id" values = [data.aws_vpc.main.id] } tags = { Tier = "private" }}
resource "aws_eks_cluster" "main" { name = "prod-cluster" role_arn = aws_iam_role.eks.arn
vpc_config { subnet_ids = data.aws_subnets.private.ids }}Remote Backends and State Locking
Section titled “Remote Backends and State Locking”Stop and think: What happens if an engineer gets impatient during a long
terraform apply, force-quits their terminal, and then manually deletes the DynamoDB lock record so they can try again?
REMOTE BACKEND WITH LOCKING════════════════════════════════════════════════════════════════
Engineer A Engineer B terraform apply terraform apply │ │ ▼ ▼ ┌────────────────┐ ┌────────────────┐ │ Lock state │ │ Lock state │ │ (DynamoDB) │ │ (DynamoDB) │ │ SUCCESS │ │ FAILED: locked │ └────────┬───────┘ │ "State locked │ │ │ by Engineer A" │ ▼ └────────────────┘ ┌────────────────┐ │ Read state │ │ (S3 bucket) │ │ │ │ Apply changes │ │ │ │ Write state │ │ (S3 bucket) │ │ │ │ Release lock │ └────────────────┘# Backend configuration (per-state-file)terraform { backend "s3" { bucket = "company-terraform-state" key = "prod/us-east-1/networking/terraform.tfstate" region = "us-east-1" encrypt = true kms_key_id = "alias/terraform-state" dynamodb_table = "terraform-state-lock" }}
# Create the DynamoDB lock table (one-time setup)# aws dynamodb create-table \# --table-name terraform-state-lock \# --attribute-definitions AttributeName=LockID,AttributeType=S \# --key-schema AttributeName=LockID,KeyType=HASH \# --billing-mode PAY_PER_REQUESTState File Organization Pattern
Section titled “State File Organization Pattern”S3 BUCKET STRUCTURE FOR TERRAFORM STATE════════════════════════════════════════════════════════════════
s3://company-terraform-state/├── global/│ ├── iam/terraform.tfstate│ ├── dns/terraform.tfstate│ └── organizations/terraform.tfstate│├── prod/│ ├── us-east-1/│ │ ├── networking/terraform.tfstate│ │ ├── eks-cluster/terraform.tfstate│ │ ├── databases/terraform.tfstate│ │ └── monitoring/terraform.tfstate│ ││ └── eu-west-1/│ ├── networking/terraform.tfstate│ ├── eks-cluster/terraform.tfstate│ └── databases/terraform.tfstate│├── staging/│ └── us-east-1/│ ├── networking/terraform.tfstate│ └── eks-cluster/terraform.tfstate│└── sandbox/ └── terraform.tfstate
Key/path structure: {env}/{region}/{component}/terraform.tfstateThis matches your OU structure from Module 8.1.Terraform Modules for Kubernetes Clusters
Section titled “Terraform Modules for Kubernetes Clusters”Well-designed modules are the building blocks for managing Kubernetes infrastructure at scale. A good module encapsulates a logical unit of infrastructure with a clean interface.
Pause and predict: If a module has 50 variables to account for every possible AWS configuration, how does that impact the readability of the root module consuming it? Is it actually better than writing raw resources?
EKS Cluster Module
Section titled “EKS Cluster Module”variable "cluster_name" { type = string description = "Name of the EKS cluster"}
variable "cluster_version" { type = string description = "Kubernetes version" default = "1.35"}
variable "vpc_id" { type = string description = "VPC ID where the cluster will be created"}
variable "subnet_ids" { type = list(string) description = "Subnet IDs for the cluster (private subnets)"}
variable "node_groups" { type = map(object({ instance_types = list(string) desired_size = number min_size = number max_size = number capacity_type = optional(string, "ON_DEMAND") labels = optional(map(string), {}) taints = optional(list(object({ key = string value = string effect = string })), []) })) description = "Node group configurations"}
variable "enable_karpenter" { type = bool default = true description = "Install Karpenter for autoscaling"}
variable "cluster_addons" { type = map(object({ version = optional(string) })) default = { vpc-cni = {} coredns = {} kube-proxy = {} aws-ebs-csi-driver = {} }}
variable "tags" { type = map(string) default = {}}resource "aws_eks_cluster" "this" { name = var.cluster_name version = var.cluster_version role_arn = aws_iam_role.cluster.arn
vpc_config { subnet_ids = var.subnet_ids endpoint_private_access = true endpoint_public_access = false security_group_ids = [aws_security_group.cluster.id] }
access_config { authentication_mode = "API_AND_CONFIG_MAP" bootstrap_cluster_creator_admin_permissions = true }
encryption_config { provider { key_arn = aws_kms_key.eks.arn } resources = ["secrets"] }
tags = merge(var.tags, { "kubernetes.io/cluster/${var.cluster_name}" = "owned" })
depends_on = [ aws_iam_role_policy_attachment.cluster_policy, aws_iam_role_policy_attachment.cluster_vpc_policy, ]}
resource "aws_eks_node_group" "this" { for_each = var.node_groups
cluster_name = aws_eks_cluster.this.name node_group_name = each.key node_role_arn = aws_iam_role.node.arn subnet_ids = var.subnet_ids instance_types = each.value.instance_types capacity_type = each.value.capacity_type
scaling_config { desired_size = each.value.desired_size min_size = each.value.min_size max_size = each.value.max_size }
labels = merge(each.value.labels, { "node-group" = each.key })
dynamic "taint" { for_each = each.value.taints content { key = taint.value.key value = taint.value.value effect = taint.value.effect } }
tags = var.tags}
# EKS addonsresource "aws_eks_addon" "this" { for_each = var.cluster_addons
cluster_name = aws_eks_cluster.this.name addon_name = each.key addon_version = each.value.version resolve_conflicts_on_create = "OVERWRITE" resolve_conflicts_on_update = "PRESERVE"}output "cluster_name" { value = aws_eks_cluster.this.name}
output "cluster_endpoint" { value = aws_eks_cluster.this.endpoint}
output "cluster_ca_certificate" { value = aws_eks_cluster.this.certificate_authority[0].data}
output "oidc_provider_arn" { value = aws_iam_openid_connect_provider.eks.arn}
output "oidc_provider_url" { value = aws_eks_cluster.this.identity[0].oidc[0].issuer}
output "node_security_group_id" { value = aws_eks_cluster.this.vpc_config[0].cluster_security_group_id}Using the Module
Section titled “Using the Module”module "eks" { source = "../../../../modules/eks-cluster"
cluster_name = "prod-us-east-1" cluster_version = "1.35" vpc_id = data.aws_vpc.prod.id subnet_ids = data.aws_subnets.private.ids
node_groups = { general = { instance_types = ["m7i.xlarge"] desired_size = 3 min_size = 3 max_size = 10 capacity_type = "ON_DEMAND" labels = { "workload-class" = "general" } }
spot-workers = { instance_types = ["m7i.xlarge", "m6i.xlarge", "c7i.xlarge"] desired_size = 5 min_size = 2 max_size = 20 capacity_type = "SPOT" labels = { "workload-class" = "batch" "node-type" = "spot" } taints = [ { key = "spot" value = "true" effect = "NO_SCHEDULE" } ] } }
enable_karpenter = true
tags = { Environment = "production" Team = "platform" CostCenter = "CC-1000" }}IaC + GitOps: Crossplane vs. Terraform Operator
Section titled “IaC + GitOps: Crossplane vs. Terraform Operator”The traditional model is: Terraform manages cloud infrastructure, GitOps (ArgoCD/Flux) manages Kubernetes workloads. But there is a growing movement to manage cloud infrastructure through Kubernetes itself, using Crossplane or the Terraform Operator.
IAC + GITOPS MODELS════════════════════════════════════════════════════════════════
MODEL 1: Traditional (Terraform + GitOps) Git Repo ──▶ CI Pipeline ──▶ terraform apply ──▶ Cloud Resources Git Repo ──▶ ArgoCD ──────▶ kubectl apply ──▶ K8s Resources
Pros: Mature, well-understood, terraform plan in PR Cons: Two tools, two workflows, two state systems
MODEL 2: Crossplane (Everything is K8s) Git Repo ──▶ ArgoCD ──▶ Crossplane CRDs ──▶ Cloud + K8s Resources
Pros: Single GitOps workflow, K8s-native, reconciliation loop Cons: Newer, fewer providers, debugging is harder
MODEL 3: Terraform Operator (Terraform inside K8s) Git Repo ──▶ ArgoCD ──▶ TF Operator CRD ──▶ terraform apply ──▶ Cloud Resources
Pros: Uses existing TF modules, K8s-native workflow Cons: State management complexity, TF inside K8s adds a layerCrossplane Example
Section titled “Crossplane Example”Stop and think: If Crossplane continuously reconciles state every 60 seconds, how do you handle “break-glass” emergency changes where an engineer must temporarily manually modify an AWS resource in the console to stop a critical incident?
# Create an RDS instance using CrossplaneapiVersion: rds.aws.upbound.io/v1beta2kind: Instancemetadata: name: payments-db namespace: crossplane-systemspec: forProvider: region: us-east-1 instanceClass: db.r7g.large engine: postgres engineVersion: "16" allocatedStorage: 100 storageType: gp3 dbName: payments masterUsername: admin masterPasswordSecretRef: name: rds-password namespace: crossplane-system key: password vpcSecurityGroupIds: - sg-abc123 dbSubnetGroupName: prod-db-subnets publiclyAccessible: false backupRetentionPeriod: 14 multiAz: true tags: Environment: production Team: payments CostCenter: CC-2000 providerConfigRef: name: aws-provider---# The Crossplane controller continuously reconciles:# If someone changes the RDS instance in the console,# Crossplane will revert it to match this spec.# This is drift detection + correction built in.When to Use Each Approach
Section titled “When to Use Each Approach”| Factor | Terraform | Crossplane | TF Operator |
|---|---|---|---|
| Existing TF codebase | Keep Terraform | Consider migration | Use TF Operator |
| Team skill set | TF experts | K8s experts | TF experts |
| Cloud resource coverage | Excellent (all providers) | Good (growing) | Uses TF providers |
| Drift correction | Manual (terraform apply) | Automatic (reconciliation) | Periodic (terraform apply) |
| State management | S3 + DynamoDB | etcd (K8s) | S3 + DynamoDB |
| PR workflow | terraform plan in PR | kubectl diff in PR | terraform plan in PR |
| Multi-cloud | Excellent | Good | Excellent |
Drift Detection
Section titled “Drift Detection”Configuration drift occurs when the actual state of infrastructure diverges from the desired state in code. Someone makes a manual change in the console. An automated process modifies a resource. A provider update changes a default value.
Pause and predict: Aside from catching manual operational changes, why is scheduled drift detection considered a critical security control?
Detecting Drift
Section titled “Detecting Drift”# Terraform: Detect drift with refresh-only planterraform plan -refresh-only
# Expected output when drift exists:# Note: Objects have changed outside of Terraform## Terraform detected the following changes made outside of Terraform# since the last "terraform apply":## # aws_security_group.eks has been changed# ~ resource "aws_security_group" "eks" {# ~ ingress {# + cidr_blocks = ["0.0.0.0/0"] <-- SOMEONE OPENED THIS TO THE WORLD# }# }
# Run drift detection on a schedule (CI/CD)# GitHub Actions example:name: Terraform Drift Detectionon: schedule: - cron: '0 6 * * *' # Daily at 6 AM UTC workflow_dispatch:
jobs: detect-drift: runs-on: ubuntu-latest strategy: matrix: component: - networking - eks-cluster - databases - iam steps: - uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3 with: terraform_version: 1.9.0
- name: Configure AWS credentials uses: aws-actions/configure-aws-credentials@v4 with: role-to-assume: arn:aws:iam::111111111111:role/terraform-drift-detector aws-region: us-east-1
- name: Terraform Init working-directory: terraform/prod/us-east-1/${{ matrix.component }} run: terraform init -input=false
- name: Detect Drift id: drift working-directory: terraform/prod/us-east-1/${{ matrix.component }} run: | terraform plan -refresh-only -detailed-exitcode -input=false 2>&1 | tee plan.txt EXIT_CODE=${PIPESTATUS[0]} if [ $EXIT_CODE -eq 2 ]; then echo "drift_detected=true" >> $GITHUB_OUTPUT echo "DRIFT DETECTED in ${{ matrix.component }}" elif [ $EXIT_CODE -eq 0 ]; then echo "drift_detected=false" >> $GITHUB_OUTPUT echo "No drift in ${{ matrix.component }}" else echo "Error running terraform plan" exit 1 fi
- name: Alert on Drift if: steps.drift.outputs.drift_detected == 'true' run: | curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \ -H 'Content-type: application/json' \ --data "{ \"text\": \"DRIFT DETECTED in terraform/prod/us-east-1/${{ matrix.component }}\nRun terraform plan to see details.\" }"Terratest: Testing Infrastructure Code
Section titled “Terratest: Testing Infrastructure Code”Terratest is a Go library that writes automated tests for Terraform modules. It deploys real infrastructure, validates it, and tears it down.
Stop and think: Notice the
defer terraform.Destroy(t, terraformOptions)line in the Terratest example. What happens to the AWS resources if a test assertion fails halfway through the execution?
package test
import ( "testing" "time"
"github.com/gruntwork-io/terratest/modules/aws" "github.com/gruntwork-io/terratest/modules/k8s" "github.com/gruntwork-io/terratest/modules/terraform" "github.com/stretchr/testify/assert" "github.com/stretchr/testify/require")
func TestEksCluster(t *testing.T) { t.Parallel()
terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{ TerraformDir: "../modules/eks-cluster", Vars: map[string]interface{}{ "cluster_name": "test-cluster-" + time.Now().Format("20060102150405"), "cluster_version": "1.35", "vpc_id": "vpc-test123", "subnet_ids": []string{"subnet-a", "subnet-b"}, "node_groups": map[string]interface{}{ "test": map[string]interface{}{ "instance_types": []string{"t3.medium"}, "desired_size": 1, "min_size": 1, "max_size": 2, "capacity_type": "SPOT", }, }, }, })
// Clean up at the end defer terraform.Destroy(t, terraformOptions)
// Deploy the module terraform.InitAndApply(t, terraformOptions)
// Validate outputs clusterName := terraform.Output(t, terraformOptions, "cluster_name") assert.Contains(t, clusterName, "test-cluster-")
clusterEndpoint := terraform.Output(t, terraformOptions, "cluster_endpoint") assert.NotEmpty(t, clusterEndpoint)
// Validate the cluster is actually functional kubeconfig := aws.GetKubeConfigForEksCluster(t, clusterName, "us-east-1")
options := k8s.NewKubectlOptions("", kubeconfig, "default")
// Check that nodes are ready nodes := k8s.GetNodes(t, options) require.GreaterOrEqual(t, len(nodes), 1)
for _, node := range nodes { for _, condition := range node.Status.Conditions { if condition.Type == "Ready" { assert.Equal(t, "True", string(condition.Status)) } } }
// Check that core addons are running k8s.WaitUntilDeploymentAvailable(t, options, "coredns", 5, 30*time.Second)}# Run Terratestcd test/go test -v -timeout 30m -run TestEksClusterDid You Know?
Section titled “Did You Know?”-
Terraform state files have caused more production incidents than Terraform configurations according to a 2024 survey by Spacelift. The most common state-related incidents are: state corruption from concurrent applies (31%), state lock not released after a failed apply (28%), sensitive data exposed in state files (22%), and state lost due to backend misconfiguration (19%). This is why state management is not an afterthought — it is the most operationally critical aspect of Terraform.
-
Crossplane has over 200 managed resource types for AWS alone as of 2025, covering the most commonly used services (EKS, RDS, S3, IAM, VPC, Lambda, DynamoDB, and many more). However, it still has gaps compared to Terraform’s AWS provider, which covers over 1,200 resource types. The gap is closing — the Upbound Marketplace now generates Crossplane providers directly from Terraform providers using a tool called upjet.
-
The average Terraform module in the public registry has 2.3 variables that are never used according to an analysis by Bridgecrew/Prisma Cloud. Module bloat is a real problem: teams copy modules from the registry, add variables for every possible configuration, and end up with modules that are harder to understand than raw resources. The best modules have 5-10 input variables with sensible defaults, not 50+ variables that try to cover every edge case.
-
HashiCorp changed Terraform’s license from Mozilla Public License 2.0 to Business Source License (BSL) in August 2023, which triggered the creation of OpenTofu — a community-maintained fork under the Linux Foundation. As of 2025, both projects continue active development with diverging feature sets. OpenTofu added client-side state encryption (a long-requested feature) before HashiCorp, while Terraform added native testing and ephemeral values. Most organizations continue using Terraform, but the fork ensures that an open-source alternative exists.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| One monolithic state file for everything | ”It started small and grew” | Split by concern: networking, compute, databases, IAM. Each component in its own directory with its own state. Do this early — splitting later is painful. |
| Not using state locking | ”We’ll coordinate manually” | Always use DynamoDB (AWS), GCS (GCP), or Blob Storage (Azure) for locking. Without locking, concurrent applies WILL corrupt state. |
| Storing secrets in state | ”It’s encrypted at rest” | Terraform state contains plaintext values of all managed resources, including passwords. Use separate secret management (Vault, AWS Secrets Manager) and reference secrets by ARN, not value. |
| Writing modules that are too generic | ”We’ll configure everything through variables” | Write modules for YOUR use case. A module with 50 variables is worse than raw resources. Start specific, generalize only when you have three proven use cases. |
| No automated drift detection | ”We run terraform plan manually before changes” | Drift happens between planned changes. Schedule daily drift detection in CI. Alert on drift immediately — it is often a security issue. |
Using terraform taint to force recreation | ”The resource is broken, just recreate it” | terraform taint is destructive and deprecated in favor of -replace. Understand why the resource is broken before recreating. Tainting a node group kills all pods on those nodes. |
| Not testing modules before use | ”It works on my machine” | Use Terratest or terraform test (built-in since 1.6) to validate modules create functional infrastructure. Test in an isolated account to avoid production impact. |
| Manual state manipulation without backup | ”I’ll just terraform state rm this broken resource” | Always back up state before manipulation: terraform state pull > backup.tfstate. State operations are irreversible. One wrong state rm and you orphan real resources. |
1. Scenario: You just joined a team where a single `terraform plan` takes 14 minutes. The state file contains 800 resources across VPCs, EKS clusters, and RDS instances. What is happening under the hood during those 14 minutes, and how does splitting the state resolve this?
Before every plan or apply, Terraform performs a “state refresh” where it queries the cloud provider API for the current state of every resource in the state file. With 800 resources, these API calls compound, taking minutes just to verify existing infrastructure before planning new changes. Additionally, Terraform must evaluate dependencies across the entire monolithic graph and hold the massive JSON state in memory. Splitting the state drastically reduces the number of API calls and limits the dependency graph for any single operation. This ensures that a change to an RDS instance only refreshes the database resources, keeping plan times under a minute.
2. Scenario: Your team split the networking and compute state files. You need to pass the VPC ID from the networking state to the compute state. You can either use `terraform_remote_state` or an `aws_vpc` data source. Which approach creates tighter coupling, and why might you prefer the other?
Using terraform_remote_state creates tight coupling because the compute configuration must know exactly where the networking state file is stored and what its outputs are named. If the networking team moves their state file or renames an output, the compute deployment breaks. AWS data sources offer loose coupling by querying the cloud provider API directly based on resource tags or attributes. The compute module doesn’t need to know how the VPC was created, only how to find it. While data sources require consistent tagging conventions and add API calls during the plan phase, they are generally preferred because they survive organizational restructuring and state file migrations.
3. Scenario: Your organization relies heavily on ArgoCD and wants developers to self-service RDS databases using Kubernetes manifests, rather than learning HCL. Should you adopt Crossplane or stick with Terraform, and why?
You should adopt Crossplane for this scenario because it aligns perfectly with a Kubernetes-native, self-service model. Crossplane allows developers to provision cloud resources using the same Kubernetes YAML and GitOps pipelines (like ArgoCD) they already use for application deployments. Instead of forcing developers to learn Terraform HCL and maintain separate CI/CD pipelines, they simply submit a Custom Resource Definition (CRD) to the cluster. Furthermore, Crossplane’s continuous reconciliation loop ensures that the RDS database remains in its desired state automatically, without requiring developers to run terraform apply.
4. Scenario: A junior engineer temporarily opens port 22 to the world on a production security group via the AWS Console at 3 AM. Your IaC tool is Crossplane. How will the system react compared to a traditional Terraform setup running in a daily CI pipeline?
Because Crossplane operates as a Kubernetes controller, its continuous reconciliation loop will detect the manual console change during its next cycle (typically within 60 seconds). Crossplane will automatically revert the security group back to the desired state defined in the Git repository, closing the unauthorized port without any human intervention. In contrast, a traditional Terraform setup would not detect this drift until the scheduled daily CI pipeline runs its terraform plan -refresh-only. The port would remain open and vulnerable for hours, requiring manual review of the drift report and a subsequent terraform apply to fix it.
5. Scenario: A team proposes committing their Terraform state file to their private Git repository to keep code and state versioned together, arguing that the repo is secure. Why is this a dangerous anti-pattern, and what three specific problems does a remote backend solve?
Committing state files to Git is dangerous because Terraform stores the plaintext values of all managed resources, meaning database passwords, private keys, and API tokens would be permanently recorded in the commit history. Furthermore, Git cannot provide the state locking necessary to prevent two engineers from simultaneously applying changes and corrupting the infrastructure state. A remote backend solves these issues by providing encryption at rest, centralized access control, and state locking via mechanisms like DynamoDB. It also prevents the repository from bloating with massive JSON files that change on every infrastructure update.
6. Scenario: You maintain an EKS Terraform module used by 15 different product teams. One team requests a new feature that fundamentally changes how node groups are defined. How do you implement this change without breaking the module for the other 14 teams?
You must treat the module as a versioned software artifact and implement the change using semantic versioning. Add the new feature by introducing optional variables with default values that strictly preserve the existing behavior for the 14 other teams. If the change is fundamentally breaking and cannot be made backward-compatible, you must release a new major version of the module (e.g., v2.0.0). The other teams will remain pinned to the v1 release and can plan their migration to the new architecture independently, ensuring that your feature addition causes zero operational disruption.
Hands-On Exercise: Structure and Test Terraform for Multi-Account EKS
Section titled “Hands-On Exercise: Structure and Test Terraform for Multi-Account EKS”In this exercise, you will restructure a monolithic Terraform configuration into a modular, multi-state design.
Scenario
Section titled “Scenario”You have a monolithic Terraform file that creates a VPC, EKS cluster, RDS database, and IAM roles. Split it into independent state files and create a basic module test.
Task 1: Design the Directory Structure
Section titled “Task 1: Design the Directory Structure”Solution
terraform/├── modules/│ ├── networking/│ │ ├── main.tf│ │ ├── variables.tf│ │ └── outputs.tf│ ├── eks-cluster/│ │ ├── main.tf│ │ ├── variables.tf│ │ └── outputs.tf│ └── database/│ ├── main.tf│ ├── variables.tf│ └── outputs.tf│├── environments/│ ├── prod/│ │ └── us-east-1/│ │ ├── networking/│ │ │ ├── main.tf # Uses modules/networking│ │ │ ├── backend.tf # S3 state: prod/us-east-1/networking│ │ │ └── terraform.tfvars│ │ ├── eks-cluster/│ │ │ ├── main.tf # Uses modules/eks-cluster│ │ │ ├── backend.tf # S3 state: prod/us-east-1/eks-cluster│ │ │ ├── data.tf # Reads VPC from networking state│ │ │ └── terraform.tfvars│ │ └── database/│ │ ├── main.tf # Uses modules/database│ │ ├── backend.tf # S3 state: prod/us-east-1/database│ │ ├── data.tf # Reads VPC from networking state│ │ └── terraform.tfvars│ ││ └── staging/│ └── us-east-1/│ └── ... # Same structure, different values│└── test/ ├── networking_test.go └── eks_cluster_test.goTask 2: Write the Networking Module
Section titled “Task 2: Write the Networking Module”Solution
variable "environment" { type = string}
variable "region" { type = string}
variable "vpc_cidr" { type = string default = "10.0.0.0/16"}
variable "azs" { type = list(string) default = ["us-east-1a", "us-east-1b", "us-east-1c"]}
# modules/networking/main.tfresource "aws_vpc" "main" { cidr_block = var.vpc_cidr enable_dns_hostnames = true enable_dns_support = true
tags = { Name = "${var.environment}-vpc" Environment = var.environment }}
resource "aws_subnet" "private" { count = length(var.azs) vpc_id = aws_vpc.main.id cidr_block = cidrsubnet(var.vpc_cidr, 4, count.index) availability_zone = var.azs[count.index]
tags = { Name = "${var.environment}-private-${var.azs[count.index]}" "kubernetes.io/role/internal-elb" = "1" Tier = "private" }}
resource "aws_subnet" "public" { count = length(var.azs) vpc_id = aws_vpc.main.id cidr_block = cidrsubnet(var.vpc_cidr, 4, count.index + length(var.azs)) availability_zone = var.azs[count.index] map_public_ip_on_launch = true
tags = { Name = "${var.environment}-public-${var.azs[count.index]}" "kubernetes.io/role/elb" = "1" Tier = "public" }}
# modules/networking/outputs.tfoutput "vpc_id" { value = aws_vpc.main.id}
output "private_subnet_ids" { value = aws_subnet.private[*].id}
output "public_subnet_ids" { value = aws_subnet.public[*].id}
output "vpc_cidr" { value = aws_vpc.main.cidr_block}Task 3: Write the Environment Configuration
Section titled “Task 3: Write the Environment Configuration”Solution
terraform { backend "s3" { bucket = "company-terraform-state" key = "prod/us-east-1/networking/terraform.tfstate" region = "us-east-1" encrypt = true dynamodb_table = "terraform-state-lock" }}
# environments/prod/us-east-1/networking/main.tfmodule "networking" { source = "../../../../modules/networking"
environment = "production" region = "us-east-1" vpc_cidr = "10.0.0.0/16" azs = ["us-east-1a", "us-east-1b", "us-east-1c"]}
output "vpc_id" { value = module.networking.vpc_id}
output "private_subnet_ids" { value = module.networking.private_subnet_ids}
# environments/prod/us-east-1/eks-cluster/data.tf# Read VPC info from the networking statedata "terraform_remote_state" "networking" { backend = "s3" config = { bucket = "company-terraform-state" key = "prod/us-east-1/networking/terraform.tfstate" region = "us-east-1" }}
# environments/prod/us-east-1/eks-cluster/main.tfmodule "eks" { source = "../../../../modules/eks-cluster"
cluster_name = "prod-us-east-1" vpc_id = data.terraform_remote_state.networking.outputs.vpc_id subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
node_groups = { general = { instance_types = ["m7i.xlarge"] desired_size = 3 min_size = 3 max_size = 10 } }
tags = { Environment = "production" CostCenter = "CC-1000" }}Task 4: Write a Drift Detection Script
Section titled “Task 4: Write a Drift Detection Script”Solution
#!/bin/bashset -e
COMPONENTS=("networking" "eks-cluster" "database")ENVIRONMENT="prod"REGION="us-east-1"DRIFT_FOUND=0
echo "=== Terraform Drift Detection ==="echo "Environment: $ENVIRONMENT"echo "Region: $REGION"echo "Date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"echo ""
for COMPONENT in "${COMPONENTS[@]}"; do DIR="terraform/environments/${ENVIRONMENT}/${REGION}/${COMPONENT}"
if [ ! -d "$DIR" ]; then echo "SKIP: $COMPONENT (directory not found)" continue fi
echo "--- Checking $COMPONENT ---" cd "$DIR"
terraform init -input=false -no-color > /dev/null 2>&1
# Run refresh-only plan with detailed exit code # Exit code 0 = no changes, 1 = error, 2 = changes detected set +e PLAN_OUTPUT=$(terraform plan -refresh-only -detailed-exitcode -input=false -no-color 2>&1) EXIT_CODE=$? set -e
if [ $EXIT_CODE -eq 2 ]; then echo "DRIFT DETECTED in $COMPONENT" echo "$PLAN_OUTPUT" | grep -A 5 "has been changed" DRIFT_FOUND=1 elif [ $EXIT_CODE -eq 0 ]; then echo "OK: No drift in $COMPONENT" else echo "ERROR: Failed to check $COMPONENT" echo "$PLAN_OUTPUT" fi
cd - > /dev/null echo ""done
if [ $DRIFT_FOUND -eq 1 ]; then echo "=== DRIFT DETECTED ===" echo "Run 'terraform plan' in the affected components for details." exit 2else echo "=== ALL CLEAR ===" echo "No drift detected in any component." exit 0fiTask 5: Write a Basic Terraform Test
Section titled “Task 5: Write a Basic Terraform Test”Solution
# (Terraform native testing, available since v1.6)
variables { environment = "test" region = "us-east-1" vpc_cidr = "10.99.0.0/16" azs = ["us-east-1a", "us-east-1b"]}
run "vpc_is_created" { command = plan
assert { condition = aws_vpc.main.cidr_block == "10.99.0.0/16" error_message = "VPC CIDR should be 10.99.0.0/16" }
assert { condition = aws_vpc.main.enable_dns_hostnames == true error_message = "VPC should have DNS hostnames enabled" }
assert { condition = aws_vpc.main.tags["Environment"] == "test" error_message = "VPC should be tagged with Environment=test" }}
run "subnets_are_created" { command = plan
assert { condition = length(aws_subnet.private) == 2 error_message = "Should create 2 private subnets (one per AZ)" }
assert { condition = length(aws_subnet.public) == 2 error_message = "Should create 2 public subnets (one per AZ)" }
assert { condition = aws_subnet.private[0].tags["Tier"] == "private" error_message = "Private subnets should be tagged Tier=private" }}
run "subnets_have_unique_cidrs" { command = plan
assert { condition = aws_subnet.private[0].cidr_block != aws_subnet.private[1].cidr_block error_message = "Private subnets should have different CIDR blocks" }}# Run the testscd modules/networkingterraform test
# Expected output:# tests/networking.tftest.hcl... in progress# run "vpc_is_created"... pass# run "subnets_are_created"... pass# run "subnets_have_unique_cidrs"... pass# tests/networking.tftest.hcl... tearing down# tests/networking.tftest.hcl... pass## Success! 3 passed, 0 failed.Success Criteria
Section titled “Success Criteria”- Directory structure separates networking, EKS, and database into independent state files
- Each component has its own backend configuration with unique state key
- EKS component reads VPC info from networking state (via remote_state or data sources)
- Drift detection script checks all components and reports findings
- Terraform tests validate module behavior without deploying real resources
Next Module
Section titled “Next Module”Return to the Advanced Operations README for a summary of all modules in this phase and guidance on what to learn next. You have covered the full spectrum of advanced cloud operations: from multi-account architecture through transit networking, identity, disaster recovery, active-active deployments, migration, cost optimization, observability, and infrastructure as code at scale.