Module 6.4: Infrastructure as Code at Scale
Complexity: [COMPLEX]
Time to Complete: 55 minutes
Prerequisites
Before starting this module, you should have completed:
- Module 6.1: IaC Fundamentals - Core concepts
- Module 6.2: IaC Testing - Testing strategies
- Module 6.3: IaC Security - Security practices
- Basic understanding of organizational structures
What You’ll Be Able to Do
After completing this module, you will be able to:
- Design IaC module registries and versioning strategies that support hundreds of consuming teams
- Implement state management patterns — workspaces, accounts, backends — for large multi-team environments
- Build dependency management workflows that prevent breaking changes across shared IaC modules
- Evaluate monorepo versus multi-repo strategies for IaC at organizational scale
Why This Module Matters
The Great Terraform Migration Crisis
The platform team at a rapidly growing fintech company had what seemed like a great problem: they’d grown from 3 to 47 engineering teams in two years. Each team had been given autonomy to manage their own infrastructure, and they’d all chosen Terraform. The company now had over 2,400 Terraform configurations spread across 180 repositories.
Then came the compliance audit.
The auditors needed to verify that all databases were encrypted, all S3 buckets had versioning enabled, and all security groups followed the corporate baseline. The platform team spent three weeks writing scripts to scan the configurations. They found that 23% of databases weren’t encrypted, 41% of S3 buckets lacked versioning, and security group configurations ranged from locked-down to completely open.
Fixing these issues took 4 months. During that time, teams couldn’t ship new features because every infrastructure change required security review. The total cost in delayed features, overtime, and audit fees: $8.7 million.
This module teaches you how to scale infrastructure as code without losing control—because managing IaC for 3 teams is fundamentally different from managing it for 300.
The Scale Challenge
As organizations grow, IaC complexity increases exponentially.
Complexity rises steeply with team count: what is manageable at 3 teams becomes overwhelming well before 200+. Typical scale challenges include:

- State file conflicts and corruption
- Inconsistent module versions across teams
- Divergent security and compliance standards
- Duplicate infrastructure definitions
- Long plan/apply times (10+ minutes)
- Difficulty tracking who owns what
- Cascading breaking changes from module updates
- Knowledge silos and tribal knowledge

Repository Strategies
Monorepo: Single Repository for All Infrastructure
```
infrastructure/
├── modules/                    # Shared, versioned modules
│   ├── vpc/
│   │   ├── v1.0.0/
│   │   ├── v1.1.0/
│   │   └── v2.0.0/
│   ├── eks/
│   ├── rds/
│   └── s3/
├── environments/               # Environment-specific configs
│   ├── shared/                 # Cross-environment resources
│   │   ├── networking/
│   │   └── security/
│   ├── dev/
│   │   ├── team-alpha/
│   │   ├── team-beta/
│   │   └── team-gamma/
│   ├── staging/
│   │   └── ...
│   └── production/
│       ├── us-east-1/
│       ├── us-west-2/
│       └── eu-west-1/
├── policies/                   # OPA/Sentinel policies
├── tests/                      # Integration tests
├── .github/
│   └── workflows/
│       ├── module-release.yml
│       ├── policy-check.yml
│       └── deploy.yml
└── CODEOWNERS                  # Per-directory ownership
```

Advantages:
- Single source of truth
- Easy cross-team collaboration
- Consistent tooling and CI/CD
- Atomic changes across environments
Disadvantages:
- Requires strong access controls (CODEOWNERS)
- Large repository size over time
- All teams affected by CI/CD issues
- Permission boundaries harder to enforce
Polyrepo: Separate Repository per Team/Project
```
org-infrastructure/
├── terraform-modules/        # Central modules repo
├── platform-core/            # Platform team's infra
├── team-alpha-infra/         # Each team owns their repo
├── team-beta-infra/
├── team-gamma-infra/
└── compliance-policies/      # Central policy repo
```

Advantages:
- Clear ownership boundaries
- Independent release cycles
- Simpler permissions (repo-level)
- Smaller, focused repositories
Disadvantages:
- Module version fragmentation (see the sketch after this list)
- Inconsistent practices
- Harder to enforce standards
- Duplication of common patterns
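To make the version-fragmentation point concrete: with polyrepos, nothing stops two teams from pinning very different versions of the same central module. A minimal sketch, assuming a central `terraform-modules` repository addressed by git source (the repository URL is illustrative):

```hcl
# team-alpha-infra/main.tf: still pinned to last year's interface
module "vpc" {
  source = "git::https://github.com/org/terraform-modules//vpc?ref=v1.4.0"
}

# team-beta-infra/main.tf: already migrated to the v2 interface
module "vpc" {
  source = "git::https://github.com/org/terraform-modules//vpc?ref=v2.1.0"
}
```

Neither team is wrong, but the platform team now has to support both interfaces, and a security fix must be backported to every pinned version still in use.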
Hybrid: Best of Both Worlds
```
# Centralized
platform-infrastructure/
├── modules/        # Blessed modules
├── global/         # Org-wide resources
└── policies/       # Mandatory policies

# Team-owned (using central modules)
team-alpha-infrastructure/
└── environments/
    ├── dev/
    │   └── main.tf         # Uses central modules
    └── production/
        └── main.tf
```

```hcl
# Example team configuration
module "vpc" {
  source = "git::https://github.com/org/platform-infrastructure//modules/vpc?ref=v2.1.0"
  # ...
}
```

Module Registry and Versioning
Private Module Registry
Terraform Cloud/Enterprise or self-hosted registry for internal modules:
```hcl
# terraform.tf - Using private registry
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Using module from private registry
module "vpc" {
  source  = "app.terraform.io/company/vpc/aws"
  version = "2.1.0"

  environment = var.environment
  cidr_block  = "10.0.0.0/16"
}

module "eks" {
  source  = "app.terraform.io/company/eks/aws"
  version = "~> 3.0"  # Allow minor/patch updates

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids
}
```

Semantic Versioning for Modules
```
version = "MAJOR.MINOR.PATCH"

MAJOR: Breaking changes (interface changes, removed features)
MINOR: New features (backward compatible)
PATCH: Bug fixes (backward compatible)
```

```markdown
# Changelog

## [3.0.0] - 2024-01-15
### Breaking Changes
- Removed `enable_nat_gateway` variable, NAT is now always enabled
- Changed output `subnet_ids` to `private_subnet_ids` and `public_subnet_ids`
- Minimum Terraform version now 1.5.0

### Added
- Support for IPv6
- Transit gateway attachment option

## [2.3.0] - 2024-01-01
### Added
- VPC flow logs enabled by default
- New `log_retention_days` variable

## [2.2.1] - 2023-12-15
### Fixed
- Subnet CIDR calculation for large VPCs
```

Version Constraints
```hcl
# Exact version - most predictable
version = "2.1.0"

# Pessimistic constraint - allow patches
version = "~> 2.1.0"  # Allows 2.1.x, not 2.2.0

# Allow minor updates
version = "~> 2.1"    # Allows 2.x.x, not 3.0.0

# Range constraint
version = ">= 2.0.0, < 3.0.0"

# Best practice: Use pessimistic for stability
module "vpc" {
  source  = "app.terraform.io/company/vpc/aws"
  version = "~> 2.1.0"  # Get patches, avoid breaking changes
}
```

State Management at Scale
State File Organization
Approach 1: Monolithic state (bad for scale). A single `terraform.tfstate` holds all VPCs, all EKS clusters, all databases, 50,000+ resources. Plans take 10+ minutes, a single team blocks all changes, and corruption affects everything.

Approach 2: State per environment (better).

```
s3://state/
├── dev/terraform.tfstate
├── staging/terraform.tfstate
└── production/terraform.tfstate
```

Approach 3: State per component (best for scale).

```
s3://state/production/
├── networking/terraform.tfstate   # VPC, subnets
├── security/terraform.tfstate     # IAM, KMS
├── eks/terraform.tfstate          # EKS cluster
├── databases/terraform.tfstate    # RDS, DynamoDB
├── team-alpha/terraform.tfstate   # Team workloads
├── team-beta/terraform.tfstate
└── team-gamma/terraform.tfstate
```

Workspaces vs. Directories
```bash
# Approach A: Terraform Workspaces
# Same code, different state per workspace

# Select workspace
terraform workspace select production
```

```hcl
# Backend uses workspace name
terraform {
  backend "s3" {
    bucket = "terraform-state"
    key    = "infrastructure/terraform.tfstate"
    region = "us-east-1"
    # State file: terraform-state/env:/production/infrastructure/terraform.tfstate
  }
}

# Reference workspace in code
locals {
  environment = terraform.workspace

  instance_type = {
    dev        = "t3.small"
    staging    = "t3.medium"
    production = "t3.large"
  }[terraform.workspace]
}
```

```
# Approach B: Directory Structure (Recommended for Scale)
environments/
├── dev/
│   ├── backend.tf           # Points to dev state
│   ├── main.tf
│   └── terraform.tfvars     # Dev-specific values
├── staging/
│   ├── backend.tf           # Points to staging state
│   ├── main.tf
│   └── terraform.tfvars
└── production/
    ├── backend.tf           # Points to production state
    ├── main.tf
    └── terraform.tfvars
```

Directory approach advantages:
- Clear separation of environments
- Different providers/versions per environment
- Easier to understand state location (see the backend sketch below)
- No workspace switching mistakes
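Under this layout, each environment directory carries its own backend.tf pointing at a distinct state key. A minimal sketch, reusing the bucket and lock-table names that appear elsewhere in this module:

```hcl
# environments/production/backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"       # shared state bucket
    key            = "production/terraform.tfstate"  # unique key per environment
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"               # state locking
  }
}
```

The dev and staging directories differ only in the `key`, so there is no workspace to select and no way to accidentally apply against the wrong environment.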
Cross-State References
```hcl
output "vpc_id" {
  value = aws_vpc.main.id
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}

# eks/main.tf - Reference networking state
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "terraform-state"
    key    = "production/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

module "eks" {
  source = "../modules/eks"

  vpc_id     = data.terraform_remote_state.networking.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```

Terragrunt for DRY Configuration
```hcl
# terragrunt.hcl (root)
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "company-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = var.aws_region
  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Environment = var.environment
    }
  }
}
EOF
}
```

```hcl
# Child terragrunt.hcl (per component, includes the root config above)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../../modules/eks"
}

dependency "vpc" {
  config_path = "../networking"
}

inputs = {
  environment = "production"
  vpc_id      = dependency.vpc.outputs.vpc_id
  subnet_ids  = dependency.vpc.outputs.private_subnet_ids
}
```

Policy as Code at Scale
OPA/Gatekeeper for Kubernetes
```yaml
# Constraint Template
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("Missing required labels: %v", [missing])
        }
---
# Constraint
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-team-labels
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace", "Pod"]
  parameters:
    labels:
      - "team"
      - "environment"
      - "cost-center"
```

Sentinel for Terraform Enterprise
```
import "tfplan/v2" as tfplan

# Get all S3 buckets
s3_buckets = filter tfplan.resource_changes as _, rc {
  rc.type is "aws_s3_bucket" and
  rc.mode is "managed" and
  (rc.change.actions contains "create" or rc.change.actions contains "update")
}

# Check encryption configuration exists
require_encryption = rule {
  all s3_buckets as _, bucket {
    bucket.change.after.server_side_encryption_configuration is not null
  }
}

# Main rule
main = rule {
  require_encryption
}
```

Conftest for CI/CD
```rego
# policy/terraform/required_tags.rego
package terraform.required_tags

import future.keywords.in

required_tags := ["Environment", "Team", "CostCenter"]

deny[msg] {
  resource := input.resource_changes[_]
  resource.change.actions[_] in ["create", "update"]

  # Resources that should have tags
  taggable := ["aws_instance", "aws_s3_bucket", "aws_rds_cluster", "aws_vpc"]
  resource.type in taggable

  tags := object.get(resource.change.after, "tags", {})

  missing := [tag |
    tag := required_tags[_]
    not tags[tag]
  ]

  count(missing) > 0
  msg := sprintf("%s is missing required tags: %v", [resource.address, missing])
}
```

```yaml
# .github/workflows/policy-check.yml
name: Policy Check

on:
  pull_request:
    paths: ['terraform/**']

jobs:
  policy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Generate Plan
        run: |
          cd terraform/environments/production
          terraform init -backend=false
          terraform plan -out=tfplan
          terraform show -json tfplan > tfplan.json

      - name: Install Conftest
        run: |
          wget -q https://github.com/open-policy-agent/conftest/releases/download/v0.45.0/conftest_0.45.0_Linux_x86_64.tar.gz
          tar xzf conftest_0.45.0_Linux_x86_64.tar.gz
          sudo mv conftest /usr/local/bin/

      - name: Run Policy Tests
        run: |
          conftest test terraform/environments/production/tfplan.json \
            --policy policy/terraform/ \
            --output table
```

Self-Service Infrastructure
Platform Engineering Approach
In the traditional model, a developer files a ticket, Ops/SRE handles it manually, and infrastructure arrives days later. In the self-service model, a developer submits a PR or form to a platform API and automation provisions the infrastructure within minutes, constrained by policy gates (guardrails) and a module library (golden paths).

Service Catalog with Backstage
```yaml
# catalog-info.yaml - Backstage template
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: kubernetes-microservice
  title: Kubernetes Microservice
  description: Create a new microservice with EKS deployment
  tags:
    - kubernetes
    - microservice
    - recommended
spec:
  owner: platform-team
  type: service

  parameters:
    - title: Service Information
      required:
        - name
        - team
      properties:
        name:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]*$'
        team:
          title: Owning Team
          type: string
          ui:field: OwnerPicker
        description:
          title: Description
          type: string

    - title: Infrastructure Options
      properties:
        environment:
          title: Environment
          type: string
          enum: ['dev', 'staging', 'production']
          default: 'dev'
        instanceSize:
          title: Instance Size
          type: string
          enum: ['small', 'medium', 'large']
          default: 'small'
          description: |
            small: 0.5 CPU, 512MB RAM
            medium: 1 CPU, 1GB RAM
            large: 2 CPU, 2GB RAM
        needsDatabase:
          title: Needs Database?
          type: boolean
          default: false
        databaseType:
          title: Database Type
          type: string
          enum: ['postgresql', 'mysql']
          ui:disabled: '{{ not parameters.needsDatabase }}'

  steps:
    - id: fetch-template
      name: Fetch Template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          team: ${{ parameters.team }}
          environment: ${{ parameters.environment }}

    - id: create-terraform
      name: Create Infrastructure
      action: terraform:apply
      input:
        workspace: ${{ parameters.environment }}
        variables:
          service_name: ${{ parameters.name }}
          instance_size: ${{ parameters.instanceSize }}
          enable_database: ${{ parameters.needsDatabase }}
          database_type: ${{ parameters.databaseType }}

    - id: create-repo
      name: Create Repository
      action: publish:github
      input:
        repoUrl: github.com?owner=company&repo=${{ parameters.name }}
        defaultBranch: main

  output:
    links:
      - title: Repository
        url: ${{ steps.create-repo.output.remoteUrl }}
      - title: Infrastructure
        url: https://terraform.company.com/workspaces/${{ parameters.name }}
```

Terraform Module as Service
```hcl
# Complete microservice infrastructure in one module

variable "service_name" {
  description = "Name of the microservice"
  type        = string
}

variable "team" {
  description = "Owning team"
  type        = string
}

variable "environment" {
  description = "Deployment environment"
  type        = string
}

variable "instance_size" {
  description = "Resource allocation tier"
  type        = string
  default     = "small"

  validation {
    condition     = contains(["small", "medium", "large"], var.instance_size)
    error_message = "Instance size must be small, medium, or large"
  }
}

variable "enable_database" {
  description = "Create associated database"
  type        = bool
  default     = false
}

locals {
  # Size mappings (numeric values so the quota arithmetic below works;
  # units are appended where the values are used)
  sizes = {
    small  = { cpu_millicores = 500, memory_mi = 512, replicas = 2 }
    medium = { cpu_millicores = 1000, memory_mi = 1024, replicas = 3 }
    large  = { cpu_millicores = 2000, memory_mi = 2048, replicas = 5 }
  }

  size = local.sizes[var.instance_size]

  common_tags = {
    Service     = var.service_name
    Team        = var.team
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# ECR repository for container images
resource "aws_ecr_repository" "service" {
  name                 = var.service_name
  image_tag_mutability = "IMMUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }

  tags = local.common_tags
}

# Kubernetes namespace
resource "kubernetes_namespace" "service" {
  metadata {
    name = var.service_name
    labels = {
      "team"        = var.team
      "environment" = var.environment
    }
  }
}

# Resource quota for namespace
resource "kubernetes_resource_quota" "service" {
  metadata {
    name      = "${var.service_name}-quota"
    namespace = kubernetes_namespace.service.metadata[0].name
  }

  spec {
    hard = {
      "requests.cpu"    = "${local.size.cpu_millicores * local.size.replicas * 2}m"
      "requests.memory" = "${local.size.memory_mi * local.size.replicas * 2}Mi"
      "limits.cpu"      = "${local.size.cpu_millicores * local.size.replicas * 4}m"
      "limits.memory"   = "${local.size.memory_mi * local.size.replicas * 4}Mi"
      "pods"            = "${local.size.replicas * 4}"
    }
  }
}

# Network policy
resource "kubernetes_network_policy" "service" {
  metadata {
    name      = "${var.service_name}-policy"
    namespace = kubernetes_namespace.service.metadata[0].name
  }

  spec {
    pod_selector {}

    ingress {
      from {
        namespace_selector {
          match_labels = {
            "name" = "ingress-nginx"
          }
        }
      }
    }

    egress {
      to {
        namespace_selector {
          match_labels = {
            "environment" = var.environment
          }
        }
      }
    }

    policy_types = ["Ingress", "Egress"]
  }
}

# Optional database (random_password.db, aws_security_group.db, and the
# DB subnet group data source are defined elsewhere in the module)
resource "aws_db_instance" "service" {
  count = var.enable_database ? 1 : 0

  identifier     = "${var.service_name}-${var.environment}"
  engine         = "postgres"
  engine_version = "15"
  instance_class = var.environment == "production" ? "db.t3.medium" : "db.t3.micro"

  allocated_storage     = 20
  max_allocated_storage = var.environment == "production" ? 100 : 50
  storage_encrypted     = true

  db_name  = replace(var.service_name, "-", "_")
  username = "app"
  password = random_password.db[0].result

  vpc_security_group_ids = [aws_security_group.db[0].id]
  db_subnet_group_name   = data.aws_db_subnet_group.main.name

  backup_retention_period = var.environment == "production" ? 30 : 7
  skip_final_snapshot     = var.environment != "production"

  tags = local.common_tags
}

# Outputs for team consumption
output "ecr_repository_url" {
  description = "ECR repository URL for pushing images"
  value       = aws_ecr_repository.service.repository_url
}

output "namespace" {
  description = "Kubernetes namespace"
  value       = kubernetes_namespace.service.metadata[0].name
}

output "database_endpoint" {
  description = "Database endpoint (if enabled)"
  value       = var.enable_database ? aws_db_instance.service[0].endpoint : null
}
```

Organizational Patterns
Team Topologies for IaC
The platform team operates as an enabling team with these responsibilities:

- Build and maintain the module library
- Define and enforce policies
- Manage shared infrastructure (networking, security)
- Provide self-service capabilities
- Train and support stream-aligned teams

Stream-aligned teams (Alpha, Beta, Gamma) each own their own service, consume the shared modules, and deploy through self-service.

Interaction modes:

- Collaboration: Platform helps a team with complex needs
- X-as-a-Service: Teams consume modules independently
- Facilitation: Platform trains teams on IaC practices

Ownership Model
```
# CODEOWNERS - Define ownership at scale

# Platform team owns shared infrastructure
/modules/                      @platform-team
/environments/*/networking/    @platform-team
/environments/*/security/      @platform-team
/policies/                     @platform-team @security-team

# Teams own their own infrastructure
/environments/*/team-alpha/    @team-alpha
/environments/*/team-beta/     @team-beta
/environments/*/team-gamma/    @team-gamma

# Security review required for production
/environments/production/      @platform-team @security-team
```

```hcl
# Enforce ownership via tags
variable "team" {
  description = "Team that owns this infrastructure"
  type        = string

  validation {
    condition = contains([
      "platform", "team-alpha", "team-beta", "team-gamma"
    ], var.team)
    error_message = "Team must be a valid team identifier"
  }
}

resource "aws_instance" "app" {
  # ...

  tags = {
    Team       = var.team
    ManagedBy  = "terraform"
    Repository = "github.com/company/infrastructure"
    CostCenter = local.team_cost_centers[var.team]
  }
}
```

War Story: The 47-Team Consolidation
Company: Fast-growing fintech
Challenge: 47 teams, 180 repositories, 2,400 Terraform configurations
The Problem:
```
Before Consolidation:
├── 180 repositories with Terraform
├── 47 different VPC designs
├── 23 different RDS configurations
├── 12 different EKS setups
├── 0 standardization
├── 41% of S3 buckets without versioning
├── 23% of databases unencrypted
└── 4-month compliance remediation
```

The Solution:
Phase 1: Assessment (2 weeks)
```bash
# Inventory script across all repos
for repo in $(gh repo list company --json name -q '.[].name'); do
  gh repo clone company/$repo temp-repo
  find temp-repo -name "*.tf" -exec cat {} \; | \
    grep -E "^resource|^module" >> inventory.txt
  rm -rf temp-repo
done

# Result: 2,400 configurations, 847 unique resource types
```

Phase 2: Module Library (6 weeks)
```
Consolidated to 15 blessed modules:
├── vpc (replaced 47 variants)
├── eks (replaced 12 variants)
├── rds-postgresql
├── rds-mysql
├── s3-bucket
├── lambda-function
├── api-gateway
├── cloudfront
├── elasticache
├── sns-sqs
├── ecr
├── iam-role
├── security-group
├── route53
└── acm-certificate
```

Phase 3: Migration (12 weeks)
```
# Migration pattern: Import existing, switch to module

# Step 1: Document existing
terraform state list > existing-resources.txt

# Step 2: Create new config with module
module "vpc" {
  source  = "app.terraform.io/company/vpc/aws"
  version = "1.0.0"
  # Match existing settings
}

# Step 3: Import existing resources
terraform import module.vpc.aws_vpc.main vpc-12345

# Step 4: Plan and verify no changes
terraform plan   # Should show no changes

# Step 5: Remove old config, keep module
```

Phase 4: Policy Enforcement (4 weeks)
```
# Sentinel policy requiring module usage
import "tfplan/v2" as tfplan

# Modules that must come from registry
required_modules = [
  "vpc",
  "eks",
  "rds-postgresql",
  "rds-mysql",
  "s3-bucket",
]

# Check module sources
module_sources = rule {
  all tfplan.module_calls as _, mc {
    mc.source_type is "registry" or
    mc.source_type is "remote"
  }
}

main = rule {
  module_sources
}
```

Results After 6 Months:
```
After Consolidation:
├── 15 blessed modules (from 847 variants)
├── 100% compliance with security baseline
├── 90% reduction in infrastructure tickets
├── Plan time: 45 seconds (from 8+ minutes)
├── Mean time to provision: 15 minutes (from 3 days)
├── $2.1M/year saved in duplicate infrastructure
└── 0 compliance findings in next audit
```

Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Monolithic state files | Slow plans, broad blast radius | Split by component/team |
| No module versioning | Breaking changes affect everyone | Semantic versioning, registry |
| Copy-paste modules | Drift, inconsistency | Central module library |
| No policy enforcement | Security/compliance issues | OPA/Sentinel in CI/CD |
| Teams reinvent wheels | Wasted effort, inconsistency | Self-service with golden paths |
| No ownership model | Nobody responsible, nobody fixes | CODEOWNERS, team tags |
| All-or-nothing permissions | Too broad or too restrictive | Environment-scoped IAM |
| Manual state management | Corruption, lost state | Remote backend with locking |
1. What are the three main repository strategies for IaC at scale?
Answer:
- Monorepo: All infrastructure in one repository
  - Pros: Single source of truth, atomic changes, consistent tooling
  - Cons: Large repo, requires strong access controls
- Polyrepo: Separate repository per team/project
  - Pros: Clear ownership, independent releases, simpler permissions
  - Cons: Module fragmentation, inconsistent practices
- Hybrid: Central modules + team-owned configurations
  - Pros: Shared standards with team autonomy
  - Cons: More complex to set up
2. Why should state files be split by component rather than having one large state file?
Answer: Splitting state files:
- Faster operations: Plan/apply only processes relevant resources
- Reduced blast radius: Errors affect only one component
- Parallel changes: Teams can work simultaneously
- Clearer ownership: Each state file has defined owners
- Easier debugging: Smaller state files are easier to inspect
- Less lock contention: Different teams don’t block each other
A 50,000-resource state file might take 10+ minutes to plan; split into 50 files of 1,000 resources each, plans complete in seconds.
3. What is the purpose of semantic versioning for Terraform modules?
Answer: Semantic versioning (MAJOR.MINOR.PATCH):
- MAJOR: Breaking changes - consumers must update their code
- MINOR: New features - backward compatible additions
- PATCH: Bug fixes - backward compatible fixes
This allows consumers to:
- Pin exact versions for stability (`version = "2.1.0"`)
- Accept patches automatically (`version = "~> 2.1.0"`)
- Accept minor updates (`version = "~> 2.0"`)
- Make informed decisions about upgrades
- Avoid surprise breaking changes
4. Calculate the time savings if 47 teams each spend 2 hours/week on infrastructure requests, and self-service reduces this to 15 minutes/week.
Answer:
- Before: 47 teams × 2 hours/week = 94 hours/week
- After: 47 teams × 0.25 hours/week = 11.75 hours/week
- Weekly savings: 94 - 11.75 = 82.25 hours/week
- Annual savings: 82.25 × 52 = 4,277 hours/year
- At $100/hour engineering cost: $427,700/year saved
Plus qualitative benefits: faster time-to-market, reduced context switching, fewer errors.
5. What is the role of a Platform Team in IaC at scale?
Answer: Platform Team responsibilities:
- Build module library: Create and maintain blessed modules
- Define policies: Security, compliance, and architectural standards
- Manage shared infrastructure: Networking, security, identity
- Enable self-service: Build tooling for team autonomy
- Provide guardrails: Policy as code enforcement
- Support teams: Training, documentation, troubleshooting
- Reduce cognitive load: Abstract complexity behind simple interfaces
They operate as an “enabling team” that makes stream-aligned teams more effective.
6. What is Terragrunt and when should you use it?
Answer: Terragrunt is a thin wrapper for Terraform that helps with:
- DRY configuration: Generate backend, provider blocks automatically
- Dependencies: Manage cross-module dependencies explicitly
- Multi-environment: Apply same code across environments with different inputs (see the sketch below)
- Remote state: Consistent state configuration across all modules
Use it when:
- You have many environments with similar configurations
- You need to manage complex module dependencies
- You want to reduce boilerplate in Terraform configurations
- You need run-all commands across multiple modules
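A minimal sketch of the multi-environment point, assuming a layout like the one earlier in this module (the `node_count` input is purely illustrative):

```hcl
# environments/dev/eks/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../../modules/eks"
}

inputs = {
  environment = "dev"
  node_count  = 2   # hypothetical module input
}

# environments/production/eks/terragrunt.hcl would be identical except for:
#   inputs = { environment = "production", node_count = 6 }
```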
7. How do CODEOWNERS files help with IaC governance at scale?
Answer: CODEOWNERS:
- Automatic review assignment: PRs automatically request review from owners
- Enforce approval requirements: Changes require owner approval
- Clear responsibility: Everyone knows who owns what
- Security boundaries: Security team reviews sensitive changes
- Compliance: Audit trail of who approved what
Example governance:
- Platform team must approve module changes
- Security team must approve production changes
- Teams can self-serve within their owned directories
8. A company has 180 Terraform repositories with 847 unique resource configurations. They want to consolidate to a module library. What would be a reasonable number of modules to target?
Answer: Target 15-25 modules covering:
- Core networking (VPC, subnets, security groups)
- Compute (EKS, ECS, Lambda, EC2)
- Databases (RDS variants, DynamoDB, ElastiCache)
- Storage (S3, EFS)
- Messaging (SQS, SNS, EventBridge)
- Security (IAM roles, KMS, ACM)
- CDN/API (CloudFront, API Gateway)
Reasoning:
- 847 variants → 15-25 modules = ~95% reduction
- Each module should be cohesive and reusable
- Too few = modules too complex, too many = no standardization benefit
- Pareto principle: 20% of resource types cover 80% of use cases
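To make "blessed module" concrete, here is a hedged sketch of what a consolidated s3-bucket module might look like; the resources follow the standard AWS provider, but the interface and names are illustrative rather than taken from the source:

```hcl
# Hypothetical blessed s3-bucket module: secure defaults baked in
variable "bucket_name" {
  type = string
}

variable "team" {
  type = string
}

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
  tags = {
    Team      = var.team
    ManagedBy = "terraform"
  }
}

# Versioning and encryption are always on rather than optional, which is how
# a module library closes the kind of compliance gaps found in the audit story
resource "aws_s3_bucket_versioning" "this" {
  bucket = aws_s3_bucket.this.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}
```

Because consumers cannot opt out of versioning or encryption, every team using the module automatically satisfies the baseline that the audit measured.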
Hands-On Exercise
Objective: Design and implement a scalable IaC structure for a multi-team organization.
Part 1: Repository Structure
```bash
# Create scalable directory structure
mkdir -p iac-at-scale/{modules,environments,policies,tests}

# Create module structure
for module in vpc eks rds s3; do
  mkdir -p iac-at-scale/modules/$module/{v1.0.0,tests}
done

# Create environment structure
for env in dev staging production; do
  for team in platform alpha beta; do
    mkdir -p iac-at-scale/environments/$env/$team
  done
done

# Create CODEOWNERS
cat > iac-at-scale/CODEOWNERS << 'EOF'
# Platform team owns shared infrastructure
/modules/                     @platform-team
/environments/*/platform/     @platform-team
/policies/                    @platform-team @security-team

# Production requires security review
/environments/production/     @platform-team @security-team

# Teams own their directories
/environments/*/alpha/        @team-alpha
/environments/*/beta/         @team-beta
EOF

# View structure
find iac-at-scale -type d | head -30
```

Part 2: Create Base Module
```bash
# Create VPC module
cat > iac-at-scale/modules/vpc/v1.0.0/main.tf << 'EOF'
variable "environment" {
  type = string
}

variable "team" {
  type = string
}

variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}

locals {
  tags = {
    Environment = var.environment
    Team        = var.team
    ManagedBy   = "terraform"
    Module      = "vpc/v1.0.0"
  }
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags = merge(local.tags, {
    Name = "${var.environment}-${var.team}-vpc"
  })
}

output "vpc_id" {
  value = aws_vpc.main.id
}
EOF
```

Part 3: Create Team Configuration
```bash
# Create team configuration using module
cat > iac-at-scale/environments/dev/alpha/main.tf << 'EOF'
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "dev/alpha/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

module "vpc" {
  source = "../../../modules/vpc/v1.0.0"

  environment = "dev"
  team        = "alpha"
  vpc_cidr    = "10.1.0.0/16"
}
EOF
```

Part 4: Create Policy
```bash
# Create OPA policy for required tags
cat > iac-at-scale/policies/required_tags.rego << 'EOF'
package terraform.tags

required_tags := ["Environment", "Team", "ManagedBy"]

deny[msg] {
  resource := input.resource_changes[_]
  resource.change.actions[_] == "create"

  tags := object.get(resource.change.after, "tags", {})

  missing := [tag |
    tag := required_tags[_]
    not tags[tag]
  ]

  count(missing) > 0
  msg := sprintf("%s missing tags: %v", [resource.address, missing])
}
EOF
```

Success Criteria
- Directory structure supports multiple teams and environments
- CODEOWNERS defines clear ownership boundaries
- Modules are versioned and reusable
- Teams can independently manage their infrastructure
- Policies enforce organizational standards
Key Takeaways
- Choose repository strategy wisely - Monorepo, polyrepo, or hybrid based on organization size
- Split state files - By component/team for speed and reduced blast radius
- Version your modules - Semantic versioning enables safe updates
- Use a module registry - Central source of truth for blessed modules
- Policy as code - Enforce standards automatically in CI/CD
- Enable self-service - Platform team builds, stream teams consume
- Define ownership clearly - CODEOWNERS and tags for accountability
- Standardize gradually - Consolidate over time, don’t big-bang
- Measure and iterate - Track time-to-provision, compliance, satisfaction
- Document everything - Golden paths need clear documentation
Did You Know?
Module Reuse Statistics: Organizations with mature IaC practices reuse modules an average of 47 times each, compared to 3 times for organizations without a module strategy.
State File Growth: A typical enterprise Terraform state file grows by approximately 1,000 resources per year. Without splitting, plan times can exceed 30 minutes after 5 years.
Team Topology Origins: The Platform Team model comes from the book “Team Topologies” (2019), which identified four fundamental team types including the “Platform Team” that provides internal services to other teams.
Cost of Inconsistency: A 2023 study found that organizations with standardized IaC modules spent 67% less time on infrastructure incidents compared to those with ad-hoc configurations.
Next Module
Continue to Module 6.5: Drift Detection and Remediation to learn how to detect, prevent, and automatically fix infrastructure drift from your desired state.