Module 8.1: Multi-Account Architecture & Org Design
Complexity: [COMPLEX]
Time to Complete: 2.5 hours
Prerequisites: Cloud Architecture Patterns, familiarity with at least one hyperscaler (AWS, GCP, or Azure)
Track: Advanced Cloud Operations
What You’ll Be Able to Do
After completing this module, you will be able to:
- Design multi-account organization structures using AWS Organizations, GCP Folders, and Azure Management Groups
- Implement automated account vending pipelines that provision new cloud accounts with security guardrails built in
- Configure cross-account networking with Transit Gateway, Shared VPC, and VNet peering for hub-spoke topologies
- Evaluate account-per-team vs account-per-environment strategies for blast radius isolation and compliance
Why This Module Matters
March 2019. A mid-sized fintech company. 42 engineers. One AWS account.
Everything lived together: production databases, staging environments, CI/CD pipelines, developer sandboxes, and the shared services that glued it all together. One Friday afternoon, a junior developer running load tests in what they believed was a staging environment accidentally saturated the NAT Gateway that production traffic also depended on. Payment processing halted for 93 minutes. The incident cost $2.1 million in failed transactions and triggered a PCI-DSS audit that consumed two months of engineering time.
The root cause was not the load test. It was the architecture — or rather, the lack of it. When everything lives in a single account, there are no blast radius boundaries. IAM policies become impossibly complex. Cost attribution is guesswork. Audit trails are a tangled mess of production and development activity. And one mistake in any environment can cascade into every other.
This module teaches you how to design multi-account architectures across AWS, GCP, and Azure. You will learn to build organizational hierarchies that enforce isolation by default, centralize what should be shared (logging, security, networking), and keep what should be separate truly separate. More importantly, you will understand how these decisions directly impact your Kubernetes clusters — where they live, how they communicate, and who controls their lifecycle.
The Single-Account Trap
Before we design multi-account architectures, let’s understand why teams end up in single-account messes. The pattern is always the same:
- Start a project. Create one account. Deploy everything.
- Grow the team. Add more workloads. Still one account.
- Need staging. Create a namespace or tag. Still one account.
- Need compliance. Realize IAM policies are spaghetti. Panic.
The single-account model works for a solo developer building a side project. It stops working the moment you need any of these: environment isolation, cost visibility, compliance boundaries, or team autonomy.
Stop and think: Consider a scenario where an attacker compromises a developer’s IAM credentials in a single-account setup. Even if the developer only has permissions for staging resources, how might the shared underlying control plane (like API rate limits or centralized networking) still allow the attacker to impact production availability?
```text
SINGLE-ACCOUNT ANTI-PATTERN
════════════════════════════════════════════════════════════════

┌─────────────────────────────────────────────────────┐
│  AWS Account: 123456789012                          │
│                                                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │
│  │  Prod    │  │ Staging  │  │  Dev Sandbox     │   │
│  │  EKS +   │  │  EKS +   │  │  Random EC2s +   │   │
│  │  RDS +   │  │  RDS +   │  │  Load tests +    │   │
│  │  S3      │  │  S3      │  │  Experiments     │   │
│  └──────────┘  └──────────┘  └──────────────────┘   │
│                                                     │
│  Shared: VPC, NAT GW, IAM roles, CloudTrail         │
│  Result: One blast radius. One bill. One nightmare. │
└─────────────────────────────────────────────────────┘

Problems:
  - Dev load test saturates prod NAT Gateway
  - IAM role for staging accidentally gets prod DB access
  - Cost report says "$84,000 this month" but WHO spent it?
  - CloudTrail logs mix dev experiments with prod audit events
```

The multi-account model solves all of these by creating hard boundaries. An AWS account, a GCP project, or an Azure subscription is the strongest isolation boundary each cloud offers below the organization level.
Organizational Hierarchies Across Clouds
Every hyperscaler provides a hierarchy for organizing accounts. The terminology differs, but the concept is the same: nest accounts inside groupings that inherit policies downward.
The Rosetta Stone of Cloud Organization
```text
ORGANIZATIONAL HIERARCHY COMPARISON
════════════════════════════════════════════════════════════════

AWS                     GCP                     Azure
───────────────────     ───────────────────     ───────────────────
Organization            Organization            Tenant (Entra ID)
 │                       │                       │
 ├─ Root OU              ├─ Folder               ├─ Management Group
 │   │                   │   │                   │   │
 │   ├─ OU               │   ├─ Folder           │   ├─ Management Group
 │   │   │               │   │   │               │   │   │
 │   │   ├─ Account      │   │   ├─ Project      │   │   ├─ Subscription
 │   │   └─ Account      │   │   └─ Project      │   │   └─ Subscription
 │   │                   │   │                   │   │
 │   └─ OU               │   └─ Folder           │   └─ Management Group
 │       │               │       │               │       │
 │       └─ Account      │       └─ Project      │       └─ Subscription
 │                       │                       │
 └─ Root OU              └─ Folder               └─ Management Group

Policy Mechanism:       Policy Mechanism:       Policy Mechanism:
 Service Control         Organization            Azure Policy
 Policies (SCPs)         Policies                (assigned at MG level)
 Inherited downward      Inherited downward      Inherited downward
```

Key Differences That Matter
| Feature | AWS (Organizations) | GCP (Resource Manager) | Azure (Management Groups) |
|---|---|---|---|
| Isolation unit | Account | Project | Subscription |
| Max nesting depth | 5 levels of OUs | 10 levels of folders | 6 levels of MGs |
| Policy mechanism | SCPs (deny-only) | Org Policies (boolean/list) | Azure Policy (deny + audit) |
| Billing boundary | Account-level | Project-level or Billing Account | Subscription-level |
| Hard resource limits | Per-account quotas | Per-project quotas | Per-subscription quotas |
| Cross-boundary networking | VPC Peering, Transit GW | Shared VPC, VPC Peering | VNet Peering, Virtual WAN |
One critical nuance: AWS SCPs can only deny — they cannot grant permissions. This means your SCP strategy is about guardrails, not access grants. GCP Organization Policies work differently: they constrain resource configurations (e.g., “VMs can only be created in these regions”). Azure Policy can both deny and audit, making it the most flexible but also the most complex to reason about.
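To make the deny-only nature of SCPs concrete, here is a minimal sketch of a region-restriction guardrail. The approved region list, the set of exempted global services, and the file name are illustrative assumptions, not values from this module; adjust them before using anything like this. Note that the policy never grants anything: it only denies requests outside the approved regions, and IAM policies inside each account still decide what is actually allowed.

```bash
# Sketch: generate a deny-only SCP that blocks requests outside approved regions.
# APPROVED_REGIONS and POLICY_FILE are hypothetical values chosen for illustration.
APPROVED_REGIONS='"us-east-1", "eu-west-1"'
POLICY_FILE="deny-unapproved-regions.json"

# NotAction exempts global services whose API calls are not tied to a workload region.
cat > "$POLICY_FILE" <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideApprovedRegions",
      "Effect": "Deny",
      "NotAction": ["iam:*", "organizations:*", "route53:*", "cloudfront:*", "sts:*", "support:*"],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": [${APPROVED_REGIONS}]
        }
      }
    }
  ]
}
EOF

echo "Wrote $POLICY_FILE"

# To register the policy (requires management-account credentials):
#   aws organizations create-policy \
#     --name "DenyUnapprovedRegions" \
#     --description "Deny API calls outside approved regions" \
#     --type SERVICE_CONTROL_POLICY \
#     --content "file://$POLICY_FILE"
```

Generating the file separately from the `create-policy` call keeps the policy reviewable in version control before it is attached to any OU.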
Designing Your OU Structure
The organizational unit (OU) structure is the skeleton of your cloud architecture. Get it wrong, and you will fight it for years. Get it right, and it becomes invisible, quietly enforcing isolation, compliance, and cost boundaries.
The Reference Architecture
Here is a battle-tested OU structure used by organizations running 20-200 AWS accounts. The same pattern maps to GCP folders and Azure management groups.

```text
RECOMMENDED OU STRUCTURE
════════════════════════════════════════════════════════════════

Root
├── Security OU
│   ├── Log Archive Account        (centralized logging)
│   ├── Security Tooling Account   (GuardDuty, SecurityHub)
│   └── Audit Account              (read-only cross-account)
│
├── Infrastructure OU
│   ├── Network Hub Account        (Transit Gateway, DNS)
│   ├── Shared Services Account    (CI/CD, artifact registries)
│   └── Identity Account           (SSO, directory services)
│
├── Workloads OU
│   ├── Production OU
│   │   ├── Team-A Prod Account    (EKS cluster + workloads)
│   │   ├── Team-B Prod Account    (EKS cluster + workloads)
│   │   └── Data Platform Prod     (analytics + ML)
│   │
│   ├── Staging OU
│   │   ├── Team-A Staging Account
│   │   └── Team-B Staging Account
│   │
│   └── Development OU
│       ├── Team-A Dev Account
│       └── Team-B Dev Account
│
├── Sandbox OU
│   ├── Developer-1 Sandbox
│   └── Developer-2 Sandbox
│
└── Suspended OU                   (decommissioned accounts)
```

Why This Structure Works
Security OU at the top: Security accounts have the most restrictive SCPs. The Log Archive account is write-only for other accounts and read-only for security teams. No one can delete CloudTrail logs. No one can disable GuardDuty findings.
Infrastructure OU is separate from Workloads: The networking team manages Transit Gateways and DNS without needing access to application workloads. The CI/CD pipeline runs in a shared services account, pushing artifacts that workload accounts pull.
Workloads OU splits by environment, not team: This is critical. If you split by team first, you end up with Team-A-Prod, Team-A-Staging, Team-A-Dev all in one OU. This makes it impossible to apply environment-specific policies (like “production accounts cannot have public S3 buckets”) without per-account exceptions.
Pause and predict: If an organization structures its top-level OUs by business unit (e.g., Marketing, Engineering, HR) instead of environment (Prod, Staging, Dev), how will the cloud platform team have to manage SCPs for organization-wide security mandates? What operational bottlenecks will this create during compliance audits?
Sandbox OU has aggressive cost controls: Sandbox accounts get auto-nuke policies (using tools like aws-nuke) that clean up resources older than 72 hours. Budget alarms fire at $50/month. This gives developers freedom to experiment without burning money.
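The 72-hour cleanup rule for sandboxes comes down to a simple age comparison. The sketch below shows only that comparison; it is not how aws-nuke is configured, and the function name `is_expired` is a hypothetical helper invented for this illustration. Times are passed in explicitly as Unix epoch seconds so the rule is easy to test.

```bash
# Sketch: decide whether a sandbox resource has outlived its retention window.
# Arguments: creation time (epoch seconds), current time (epoch seconds), max age in hours.
# Returns success (exit status 0) when the resource is older than the window.
is_expired() {
  local created_epoch="$1"
  local now_epoch="$2"
  local max_age_hours="$3"
  local age_seconds=$(( now_epoch - created_epoch ))
  [ "$age_seconds" -gt $(( max_age_hours * 3600 )) ]
}

# Example: a resource created 80 hours ago, checked against the 72-hour sandbox window.
now=$(date +%s)
created=$(( now - 80 * 3600 ))
if is_expired "$created" "$now" 72; then
  echo "expired: eligible for cleanup"
else
  echo "within retention window"
fi
```

In practice the creation timestamp would come from resource metadata or a tag, and the cleanup tool would apply the same cutoff across every resource type in the sandbox account.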
Setting Up AWS Organizations
```bash
# Create the organization (from the management account)
aws organizations create-organization --feature-set ALL

# Look up the root ID; top-level OUs are created under it
ROOT_ID=$(aws organizations list-roots --query 'Roots[0].Id' --output text)

# Create top-level OUs
SECURITY_OU=$(aws organizations create-organizational-unit \
  --parent-id "$ROOT_ID" \
  --name "Security" \
  --query 'OrganizationalUnit.Id' --output text)

INFRA_OU=$(aws organizations create-organizational-unit \
  --parent-id "$ROOT_ID" \
  --name "Infrastructure" \
  --query 'OrganizationalUnit.Id' --output text)

WORKLOADS_OU=$(aws organizations create-organizational-unit \
  --parent-id "$ROOT_ID" \
  --name "Workloads" \
  --query 'OrganizationalUnit.Id' --output text)

# Create environment sub-OUs under Workloads
PROD_OU=$(aws organizations create-organizational-unit \
  --parent-id "$WORKLOADS_OU" \
  --name "Production" \
  --query 'OrganizationalUnit.Id' --output text)

STAGING_OU=$(aws organizations create-organizational-unit \
  --parent-id "$WORKLOADS_OU" \
  --name "Staging" \
  --query 'OrganizationalUnit.Id' --output text)

DEV_OU=$(aws organizations create-organizational-unit \
  --parent-id "$WORKLOADS_OU" \
  --name "Development" \
  --query 'OrganizationalUnit.Id' --output text)

# Create a new account (the request is asynchronous; new accounts start under the root)
aws organizations create-account \
  --email "team-a-prod@company.com" \
  --account-name "Team-A-Production"

# Move the account to the Production OU once creation has finished
aws organizations move-account \
  --account-id 111122223333 \
  --source-parent-id "$ROOT_ID" \
  --destination-parent-id "$PROD_OU"
```

GCP Equivalent with Folders
```bash
# Look up the organization ID
ORG_ID=$(gcloud organizations list --format="value(ID)")

# Create top-level folders
gcloud resource-manager folders create \
  --display-name="Security" \
  --organization="$ORG_ID"

gcloud resource-manager folders create \
  --display-name="Infrastructure" \
  --organization="$ORG_ID"

# The folder name is returned as "folders/NUMERIC_ID"; strip the prefix
# to get the numeric ID that --folder expects.
WORKLOADS_FOLDER=$(gcloud resource-manager folders create \
  --display-name="Workloads" \
  --organization="$ORG_ID" \
  --format="value(name)")
WORKLOADS_FOLDER_ID="${WORKLOADS_FOLDER#folders/}"

# Create environment sub-folders, capturing the Production folder ID for later use
PROD_FOLDER=$(gcloud resource-manager folders create \
  --display-name="Production" \
  --folder="$WORKLOADS_FOLDER_ID" \
  --format="value(name)")
PROD_FOLDER_ID="${PROD_FOLDER#folders/}"

gcloud resource-manager folders create \
  --display-name="Staging" \
  --folder="$WORKLOADS_FOLDER_ID"

# Create a project in the Production folder
gcloud projects create team-a-prod-2026 \
  --folder="$PROD_FOLDER_ID" \
  --name="Team A Production"
```

Workload Isolation Patterns
Not every team needs its own account. And not every workload needs its own cluster. The art is matching isolation level to actual requirements.
Isolation Decision Matrix
| Requirement | Same Account, Same Cluster | Same Account, Separate Clusters | Separate Accounts |
|---|---|---|---|
| Team autonomy | Low (shared RBAC) | Medium (cluster admin) | High (account admin) |
| Blast radius | Pod/Namespace level | Cluster level | Account level |
| Compliance boundary | Very hard to defend for PCI/HIPAA | Possible with effort | Clean boundary |
| Cost visibility | Tags only | Tags + cluster | Account-level billing |
| Network isolation | NetworkPolicy | VPC/subnet separation | VPC per account |
| Resource contention | High risk | Medium risk | Zero risk |
| Operational overhead | Low | Medium | High |
The War Story: When Namespace Isolation Isn’t Enough
A healthcare company ran production workloads in a single EKS cluster, using namespaces for isolation between teams. Their compliance team signed off because NetworkPolicies were in place. Then, during a PCI audit, the auditor asked: “Can a pod in namespace team-b read the Kubernetes API and discover that namespace team-a-pci exists?” The answer was yes. In their cluster, kubectl get namespaces worked for any authenticated service account. Default Kubernetes RBAC does not grant that permission, but broad ClusterRoleBindings added for convenience often do. The auditor flagged it as a data leakage risk: no data was exposed, but the existence of a PCI workload was discoverable.
The fix required separate clusters. But by then, 14 teams had built tooling assuming a single cluster. The migration took five months.
Lesson: decide your isolation boundaries before you have tenants, not after.
Stop and think: NetworkPolicies in Kubernetes can restrict traffic between namespaces, but they cannot restrict access to the Kubernetes API itself. If two distinct compliance zones (like PCI and non-PCI) share a cluster, what specific API discovery techniques could a compromised non-PCI pod use to map out the PCI infrastructure, even with perfectly configured NetworkPolicies?
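One way to test for the discovery problem from the war story before an auditor does is to ask the API server directly what a given service account may do. The sketch below wraps `kubectl auth can-i` with impersonation; the function names and the namespace and service account used in the usage line are hypothetical, and the caller needs impersonation rights on the cluster.

```bash
# Sketch: report whether a service account can enumerate namespaces.
# kubectl auth can-i prints "yes" or "no"; stderr is discarded because kubectl may
# warn that "namespaces" is a cluster-scoped resource.
can_list_namespaces() {
  local namespace="$1"
  local service_account="$2"
  local answer
  answer=$(kubectl auth can-i list namespaces \
    --as="system:serviceaccount:${namespace}:${service_account}" 2>/dev/null)
  [ "$answer" = "yes" ]
}

# Print a one-line audit result for the given service account.
audit_namespace_discovery() {
  local namespace="$1"
  local service_account="$2"
  if can_list_namespaces "$namespace" "$service_account"; then
    echo "FAIL: ${namespace}/${service_account} can list namespaces"
  else
    echo "OK: ${namespace}/${service_account} cannot list namespaces"
  fi
}

# Usage (requires cluster access and impersonation rights):
#   audit_namespace_discovery team-b default
```

Running a check like this for every tenant service account shows whether namespace names leak across compliance zones, though it covers only one of the discovery paths the question above asks about.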
Kubernetes Lifecycle in a Multi-Account World
Each account that runs Kubernetes clusters needs a clear lifecycle model:

```text
K8S CLUSTER LIFECYCLE PER ACCOUNT
════════════════════════════════════════════════════════════════

Shared Services Account              Workload Account (Team-A Prod)
┌────────────────────────┐           ┌────────────────────────────┐
│                        │           │                            │
│ Terraform/Crossplane   │──creates──▶  EKS/GKE/AKS Cluster       │
│ (IaC source of truth)  │           │                            │
│                        │           │  ┌──────────────────────┐  │
│ ArgoCD / Flux          │──deploys──▶  │ Workloads            │  │
│ (GitOps controller)    │           │  │ - app deployments    │  │
│                        │           │  │ - ingress configs    │  │
│ ECR / Artifact Reg     │──images───▶  │ - secrets (ESO)      │  │
│ (shared registry)      │           │  └──────────────────────┘  │
│                        │           │                            │
│ Central Logging        │◀──logs────│  Fluentbit / OTel agent    │
│ (Log Archive account)  │           │                            │
└────────────────────────┘           └────────────────────────────┘

Key principle: Clusters are CATTLE, not pets.
The IaC in Shared Services can recreate any cluster from scratch.
```

The critical decision is whether each team manages their own cluster infrastructure or whether a platform team provisions clusters centrally. Most organizations that scale past five teams find that central provisioning with team-owned workload deployment strikes the best balance.
Centralized Logging & Audit
In a multi-account world, logging becomes both more important and more complex. You need a single pane of glass for security events, but you also need to ensure that no individual account can tamper with its own logs.
The Immutable Log Archive Pattern
```text
CENTRALIZED LOGGING ARCHITECTURE
════════════════════════════════════════════════════════════════

Workload Account A           Workload Account B
┌───────────────────┐        ┌───────────────────┐
│ CloudTrail ───────────────────────────────────────┐
│ VPC Flow Logs ────────────────────────────────────┤
│ EKS Audit Logs ───────────────────────────────────┤
└───────────────────┘        └───────────────────┘  │
                                                    │
                                                    ▼
                     Log Archive Account (Security OU)
                     ┌──────────────────────────────┐
                     │ S3 Bucket (Object Lock)      │
                     │ - Compliance mode: 1 year    │
                     │ - No delete, even by root    │
                     │                              │
                     │ Athena / OpenSearch for      │
                     │ query and investigation      │
                     │                              │
                     │ SCP prevents:                │
                     │ - Disabling CloudTrail       │
                     │ - Deleting log buckets       │
                     │ - Modifying Object Lock      │
                     └──────────────────────────────┘
```

Note that the "no delete, even by root" guarantee requires Object Lock compliance mode. Governance mode can be bypassed by principals that hold the s3:BypassGovernanceRetention permission.

AWS: Organization-Wide CloudTrail
```bash
# Create organization trail (from management account)
aws cloudtrail create-trail \
  --name org-trail \
  --s3-bucket-name company-org-cloudtrail-logs \
  --is-organization-trail \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --kms-key-id arn:aws:kms:us-east-1:999888777666:key/mrk-abc123

aws cloudtrail start-logging --name org-trail

# SCP to prevent member accounts from disabling CloudTrail
cat <<'EOF' > deny-cloudtrail-changes.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ProtectCloudTrail",
      "Effect": "Deny",
      "Action": [
        "cloudtrail:StopLogging",
        "cloudtrail:DeleteTrail",
        "cloudtrail:UpdateTrail"
      ],
      "Resource": "arn:aws:cloudtrail:*:*:trail/org-trail"
    }
  ]
}
EOF

aws organizations create-policy \
  --name "ProtectCloudTrail" \
  --description "Prevent member accounts from disabling org CloudTrail" \
  --type SERVICE_CONTROL_POLICY \
  --content file://deny-cloudtrail-changes.json

# Attach SCP to the root (applies to ALL member accounts).
# Replace the policy ID with the one returned by create-policy.
aws organizations attach-policy \
  --policy-id p-1234567890 \
  --target-id "$ROOT_ID"
```

Pause and predict: An attacker gains full administrative access to a workload account and discovers they cannot disable CloudTrail due to an organizational SCP. Given that they still control the local compute resources, what alternative tactics might they employ to obscure their malicious activities or degrade the central logging system without ever touching the CloudTrail configuration?
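The SCP above protects the trail, but the archive bucket needs its own protection against deletion. A minimal sketch of a default Object Lock retention rule follows. It assumes the bucket from the trail example was created with Object Lock (and therefore versioning) enabled, and the one-year retention mirrors the diagram; the file name is a hypothetical choice. Compliance mode is used here because governance-mode retention can be bypassed by principals with the s3:BypassGovernanceRetention permission.

```bash
# Sketch: default Object Lock retention for the log archive bucket.
# LOG_BUCKET reuses the bucket name from the trail example; LOCK_CONFIG_FILE is hypothetical.
LOG_BUCKET="company-org-cloudtrail-logs"
LOCK_CONFIG_FILE="object-lock-config.json"

cat > "$LOCK_CONFIG_FILE" <<'EOF'
{
  "ObjectLockEnabled": "Enabled",
  "Rule": {
    "DefaultRetention": {
      "Mode": "COMPLIANCE",
      "Days": 365
    }
  }
}
EOF

echo "Wrote $LOCK_CONFIG_FILE for bucket $LOG_BUCKET"

# Apply from the Log Archive account:
#   aws s3api put-object-lock-configuration \
#     --bucket "$LOG_BUCKET" \
#     --object-lock-configuration "file://$LOCK_CONFIG_FILE"
```

Compliance-mode retention cannot be shortened once applied, so it is worth trying the rule on a scratch bucket with a short retention period first.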
GCP: Organization-Level Log Sinks
```bash
# Create organization-level log sink
gcloud logging sinks create org-audit-sink \
  storage.googleapis.com/company-org-audit-logs \
  --organization="$ORG_ID" \
  --include-children \
  --log-filter='logName:"cloudaudit.googleapis.com"'

# Grant the sink's service account write access to the bucket
# (The sink creates a unique service account automatically)
SINK_SA=$(gcloud logging sinks describe org-audit-sink \
  --organization="$ORG_ID" \
  --format="value(writerIdentity)")

gsutil iam ch "$SINK_SA:objectCreator" gs://company-org-audit-logs
```

EKS Audit Logs to Central Logging
Kubernetes audit logs are separate from cloud-level audit trails. You need both.

```yaml
# Fluent Bit ConfigMap to ship Kubernetes audit logs to the central account.
# NOTE: the tail input below assumes audit log files exist on the node, as on a
# self-managed control plane. On EKS the control plane is AWS-managed and audit
# logs are delivered to CloudWatch Logs once enabled, so they must be exported
# from there instead.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Log_Level     info
        Parsers_File  parsers.conf

    [INPUT]
        Name              tail
        Tag               kube.audit.*
        Path              /var/log/kubernetes/audit/*.log
        Parser            json
        Refresh_Interval  10
        Mem_Buf_Limit     50MB

    [OUTPUT]
        Name              s3
        Match             kube.audit.*
        bucket            central-audit-logs-cross-account
        region            us-east-1
        role_arn          arn:aws:iam::999888777666:role/audit-log-writer
        total_file_size   50M
        upload_timeout    60s
        s3_key_format     /eks-audit/$TAG/%Y/%m/%d/%H/$UUID.gz
        compression       gzip
```

Shared Services: What to Centralize
Not everything should be isolated. Some resources are natural shared services that benefit from centralization. The challenge is identifying which ones and building the right access patterns.
Centralize vs. Distribute Decision Framework
| Resource | Centralize | Distribute | Reasoning |
|---|---|---|---|
| Container registry | Yes | | One source of truth for images, scan once |
| CI/CD pipelines | Yes | | Consistent build process, shared runners |
| DNS management | Yes | | Single delegation, avoid split-brain |
| Secrets management | Hybrid | Hybrid | Central vault, local caching (ESO pattern) |
| Service mesh control | Depends | Depends | Centralize if cross-cluster, distribute if single |
| Monitoring stack | Yes | | Unified dashboards, correlation across clusters |
| Cluster provisioning (IaC) | Yes | | Consistent configs, version control |
| Application deployment | | Yes | Teams own their deploy cadence |
Stop and think: Centralizing CI/CD pipelines in a shared services account establishes a single source of truth, but it also means the deployment runners require highly privileged cross-account access to modify production resources. How must you design the IAM trust boundaries so that a compromised runner cannot arbitrarily pivot and destroy resources across the entire organization?
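One possible starting point for the trust-boundary question above is to give each workload account its own deploy role that trusts exactly one runner role, rather than the whole shared services account. The sketch below generates such a trust policy. Every identifier in it (account ID, role names, organization ID, file name) is a hypothetical placeholder, and a real design would also scope the permissions policy attached to the deploy role.

```bash
# Sketch: trust policy for a deploy role that lives in one workload account.
# All identifiers below are hypothetical placeholders.
SHARED_SERVICES_ACCOUNT_ID="999999999999"
RUNNER_ROLE_NAME="ci-runner-team-a-prod"
AWS_ORG_ID="o-exampleorgid"
TRUST_POLICY_FILE="deploy-role-trust.json"

# Only the dedicated runner role may assume the deploy role, and only while
# that principal belongs to the organization.
cat > "$TRUST_POLICY_FILE" <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOnlyDedicatedRunnerRole",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::${SHARED_SERVICES_ACCOUNT_ID}:role/${RUNNER_ROLE_NAME}"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalOrgID": "${AWS_ORG_ID}"
        }
      }
    }
  ]
}
EOF

echo "Wrote $TRUST_POLICY_FILE"

# Create the role in the workload account:
#   aws iam create-role \
#     --role-name team-a-prod-deployer \
#     --assume-role-policy-document "file://$TRUST_POLICY_FILE"
```

With one runner role per target account, a compromised runner can reach only the account its role is trusted by, which limits the pivot described in the question.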
The Shared VPC Pattern (GCP)
GCP’s Shared VPC is one of the cleanest implementations of centralized networking. A host project owns the VPC, and service projects attach to it.

```bash
# Enable Shared VPC in the host project (network hub)
gcloud compute shared-vpc enable network-hub-project

# Associate a service project (workload account)
gcloud compute shared-vpc associated-projects add team-a-prod \
  --host-project=network-hub-project

# Grant the service project's GKE service account access to the host project
gcloud projects add-iam-policy-binding network-hub-project \
  --member="serviceAccount:service-TEAM_A_PROJECT_NUM@container-engine-robot.iam.gserviceaccount.com" \
  --role="roles/container.hostServiceAgentUser"

# Create a GKE cluster in the service project using the shared VPC
gcloud container clusters create team-a-prod \
  --project=team-a-prod \
  --network=projects/network-hub-project/global/networks/shared-vpc \
  --subnetwork=projects/network-hub-project/regions/us-central1/subnetworks/team-a-subnet \
  --cluster-secondary-range-name=pods \
  --services-secondary-range-name=services
```

This gives you centralized network management (firewall rules, routes, IP allocation) while letting each team own their cluster and workloads. The networking team sees all traffic flows. The application team sees only their project.
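Cluster creation on a Shared VPC typically also needs subnet-level access for the service project's service agents. The sketch below grants `roles/compute.networkUser` on a single subnet rather than on the whole host project. The helper function names and the project number in the usage line are hypothetical; the project, region, and subnet names reuse the example above. Treat the exact set of service agents as an assumption to verify against current GKE Shared VPC documentation.

```bash
# Sketch: grant subnet-scoped network access in the host project.
grant_subnet_user() {
  local host_project="$1"
  local region="$2"
  local subnet="$3"
  local member="$4"
  gcloud compute networks subnets add-iam-policy-binding "$subnet" \
    --project="$host_project" \
    --region="$region" \
    --member="$member" \
    --role="roles/compute.networkUser"
}

# Grant access to both service agents of a service project, identified by project number.
grant_gke_subnet_access() {
  local host_project="$1"
  local region="$2"
  local subnet="$3"
  local project_number="$4"
  # GKE service agent of the service project
  grant_subnet_user "$host_project" "$region" "$subnet" \
    "serviceAccount:service-${project_number}@container-engine-robot.iam.gserviceaccount.com"
  # Google APIs service agent of the service project
  grant_subnet_user "$host_project" "$region" "$subnet" \
    "serviceAccount:${project_number}@cloudservices.gserviceaccount.com"
}

# Usage (project number is a placeholder):
#   grant_gke_subnet_access network-hub-project us-central1 team-a-subnet 123456789012
```

Scoping the binding to one subnet keeps a team from placing workloads in subnets that belong to other teams or environments.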
Hierarchical Billing & Cost Allocation
Multi-account architecture gives you the most accurate cost attribution possible, because costs are naturally isolated to the account that incurred them.
AWS: Consolidated Billing with Cost Allocation Tags
```bash
# Enable cost allocation tags at the organization level
aws ce update-cost-allocation-tags-status \
  --cost-allocation-tags-status \
    TagKey=Environment,Status=Active \
    TagKey=Team,Status=Active \
    TagKey=CostCenter,Status=Active

# Create a budget per workload account
aws budgets create-budget \
  --account-id 111122223333 \
  --budget '{
    "BudgetName": "team-a-prod-monthly",
    "BudgetLimit": {"Amount": "15000", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {"SubscriptionType": "EMAIL", "Address": "team-a-lead@company.com"},
        {"SubscriptionType": "SNS", "Address": "arn:aws:sns:us-east-1:111122223333:budget-alerts"}
      ]
    }
  ]'
```

Cost Hierarchy Visualization
```text
HIERARCHICAL BILLING STRUCTURE
════════════════════════════════════════════════════════════════

Organization Payer Account
├── Security OU ────────────────── $2,100/month
│   ├── Log Archive ──── $1,400 (S3 storage, Athena queries)
│   ├── Security Tools ── $500 (GuardDuty, SecurityHub)
│   └── Audit ─────────── $200 (read-only access tooling)
│
├── Infrastructure OU ──────────── $8,300/month
│   ├── Network Hub ───── $3,200 (Transit GW, NAT GWs, DNS)
│   ├── Shared Services ─ $4,100 (CI/CD runners, ECR, ArgoCD)
│   └── Identity ──────── $1,000 (SSO, directory sync)
│
├── Workloads OU ───────────────── $63,500/month
│   ├── Production OU ── $48,000
│   │   ├── Team-A ───── $22,000 (3 EKS clusters, RDS, ElastiCache)
│   │   ├── Team-B ───── $18,000 (2 EKS clusters, DynamoDB)
│   │   └── Data ──────── $8,000 (EMR, Redshift)
│   ├── Staging OU ────── $9,500
│   └── Development OU ── $6,000
│
└── Sandbox OU ─────────────────── $1,200/month
                                   ─────────
                                   Total: $75,100/month

With single-account: "$75,100, but we have no idea where it goes"
With multi-account:  Per-team, per-environment breakdown by default
```

Pro tip: Tagging standards across accounts
Even with multi-account, you still need consistent tags for cross-cutting views. Define a tagging policy at the organization level.
```bash
# AWS: Create a tag policy (enforced via Organizations)
cat <<'EOF' > tag-policy.json
{
  "tags": {
    "Environment": {
      "tag_key": {"@@assign": "Environment"},
      "tag_value": {"@@assign": ["production", "staging", "development", "sandbox"]},
      "enforced_for": {"@@assign": ["ec2:instance", "eks:cluster", "rds:db"]}
    },
    "Team": {
      "tag_key": {"@@assign": "Team"},
      "enforced_for": {"@@assign": ["ec2:instance", "eks:cluster"]}
    },
    "CostCenter": {
      "tag_key": {"@@assign": "CostCenter"},
      "enforced_for": {"@@assign": ["ec2:instance", "eks:cluster", "rds:db", "s3:bucket"]}
    }
  }
}
EOF

# create-policy requires a description as well as a name
aws organizations create-policy \
  --name "RequiredTags" \
  --description "Required tags for workload accounts" \
  --type TAG_POLICY \
  --content file://tag-policy.json

# Replace the policy ID with the one returned by create-policy
aws organizations attach-policy \
  --policy-id p-tag12345 \
  --target-id "$WORKLOADS_OU"
```

Did You Know?
- AWS Control Tower can provision a landing zone in under 60 minutes that creates your management account, log archive account, audit account, and baseline SCPs. What used to take weeks of manual setup is now a wizard. GCP has a similar concept called “Fabric FAST” and Azure has “Enterprise-Scale Landing Zones”. All three try to codify the multi-account best practices described in this module.
- GCP has a soft limit of 30 projects per billing account, but this can be raised to thousands. The real constraint is that every project gets its own set of quotas (API calls, resource limits), which means you sometimes need to spread workloads across projects to avoid hitting per-project ceilings. In that case the split is driven by capacity, not by organizational reasons.
- The root user of the AWS management account is the most powerful credential in your organization, and SCPs do not apply to it, because SCPs restrict member accounts only. If someone compromises the root email address of your management account, they control your entire organization. Best practice: use a distribution list email (not a personal inbox), enable MFA with a hardware token, and store the credentials in a physical safe. This is not hyperbole.
- Azure supports “deny assignments” that take precedence over role assignments. A deny assignment at management group scope prevents any user, even an Owner, from performing specific actions on resources below it. This is how Azure enforces compliance for regulated industries: the hierarchy itself blocks non-compliant configurations.
Common Mistakes
| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Organizing OUs by team instead of environment | Feels natural to team ownership | Structure by environment first, team second. Apply environment policies at the OU level. |
| Running everything in the management/payer account | “It’s the first account, might as well use it” | Management account should run NOTHING except billing and organization management. Zero workloads. |
| Not creating a Suspended OU | Forgot about account decommissioning | Create a Suspended OU with SCPs that deny all actions. Move decommissioned accounts here instead of closing them (closing has a 90-day reopen window). |
| Sharing VPCs across environments | Trying to save on NAT Gateway costs | Separate VPCs per environment. The $30/month NAT Gateway savings is not worth the blast radius. |
| Manual account creation | “We only need a few accounts” | Automate with Account Factory (Control Tower), Terraform, or Crossplane from day one. Even if you only have three accounts. |
| Forgetting centralized DNS | Each account creates its own hosted zone | Create a central DNS account with Route53/Cloud DNS. Delegate subdomains to workload accounts via NS records. |
| No SCP/policy guardrails on day one | “We’ll add governance later” | Apply baseline SCPs immediately: deny disabling CloudTrail, deny leaving the organization, restrict regions. |
| One IAM Identity Center permission set for all environments | “Admin is admin” | Create separate permission sets for prod (read-only default, break-glass for write) vs dev (broader access). |
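For the “Manual account creation” row, the core of any scripted vending flow is waiting for the asynchronous `create-account` request to finish before moving the account into its OU. The sketch below shows only that polling step. The function name, polling interval, and attempt limit are hypothetical choices; Control Tower Account Factory or Terraform would normally handle this for you.

```bash
# Sketch: poll an account creation request until it succeeds, fails, or times out.
# Arguments: request ID, seconds between polls (default 10), max attempts (default 60).
wait_for_account() {
  local request_id="$1"
  local interval="${2:-10}"
  local max_attempts="${3:-60}"
  local attempt=1
  local state
  while [ "$attempt" -le "$max_attempts" ]; do
    state=$(aws organizations describe-create-account-status \
      --create-account-request-id "$request_id" \
      --query 'CreateAccountStatus.State' --output text)
    case "$state" in
      SUCCEEDED) return 0 ;;
      FAILED)    echo "account creation failed" >&2; return 1 ;;
      *)         sleep "$interval" ;;
    esac
    attempt=$(( attempt + 1 ))
  done
  echo "timed out waiting for account creation" >&2
  return 1
}

# Usage (email and account name are placeholders):
#   REQUEST_ID=$(aws organizations create-account \
#     --email "team-c-dev@company.com" \
#     --account-name "Team-C-Development" \
#     --query 'CreateAccountStatus.Id' --output text)
#   wait_for_account "$REQUEST_ID" && echo "account ready to move into its OU"
```

Once the request reports SUCCEEDED, the same `describe-create-account-status` call returns the new account ID, which can then be passed to `move-account`.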
1. Scenario: You have inherited an AWS environment where the primary data warehouse runs in the organization's management (payer) account to "save on NAT gateway costs." Your security team wants to apply a new Service Control Policy (SCP) to restrict access to certain regions globally. How does the placement of this data warehouse impact your security posture?
The management account is strictly exempt from Service Control Policies (SCPs), meaning any workload running within it operates completely outside the organization’s automated guardrails. If an SCP is applied to restrict regions, the data warehouse and any other resources in the management account will entirely bypass these restrictions. Furthermore, running workloads in this account needlessly expands the attack surface of your most privileged environment, which inherently possesses access to billing data and org-wide settings. To maintain strict security boundaries, the management account must be reserved solely for billing and organization administration, with zero active workloads.
2. Scenario: Your cloud engineering team is migrating a multi-account architecture from AWS to GCP. In AWS, you relied on an SCP that explicitly denied the `s3:DeleteBucket` action to prevent accidental data loss. A junior engineer proposes creating an identical "deny action" Organization Policy in GCP for Cloud Storage. Why will this approach fail, and what is the fundamental difference in how these two mechanisms operate?
AWS SCPs function as strict IAM permission boundaries that can only explicitly deny API actions, effectively filtering what user policies are allowed to execute regardless of their granted permissions. Conversely, GCP Organization Policies do not evaluate IAM actions; instead, they constrain the actual configuration state of resources, such as enforcing that buckets must have uniform bucket-level access enabled or restricting resource creation to specific regions. The engineer’s approach will fail because GCP Organization Policies cannot deny specific API calls like bucket deletion. You must adapt your strategy to use GCP’s resource configuration constraints combined with proper IAM role scoping to achieve the same data protection goals.
3. Scenario: A fast-growing software company has 8 development teams. Each team manages 2 EKS clusters: one for staging and one for production. The CTO suggests creating 8 AWS accounts (one per team) to keep the billing simple. What critical security and operational risks does this 8-account strategy introduce, and what is the recommended alternative?
Using an 8-account strategy mixes staging and production environments within the same blast radius, meaning a misconfigured IAM role or a runaway process in staging could directly compromise or degrade production resources. Furthermore, this structure makes it impossible to apply environment-wide Service Control Policies (SCPs), such as enforcing strict public access blocks exclusively on all production accounts, without creating complex, per-account exceptions. The recommended alternative is to use 16 accounts—one per team per environment—which establishes a hard boundary that naturally aligns with environment-specific SCPs. The operational overhead of managing 16 accounts instead of 8 is negligible when utilizing automated account vending solutions like AWS Control Tower or Terraform.
4. Scenario: An enterprise currently allows each of its 15 workload accounts to host its own container registry. A recent security audit revealed that 40% of deployed images contain critical vulnerabilities, and patching them requires coordinating with all 15 account owners. How would migrating to a centralized container registry in a shared services account resolve this operational bottleneck?
A centralized container registry establishes a single, authoritative source of truth for all container images across the organization, eliminating image sprawl and version inconsistencies. By consolidating images, security teams can implement a unified scanning pipeline where an image is scanned exactly once upon push, and vulnerabilities are caught before the image is distributed to workload accounts. This architecture also drastically simplifies CI/CD workflows, as pipelines only need to push to one destination while workload accounts securely pull images using cross-account IAM resource policies. Ultimately, this reduces storage costs for duplicate images and shifts the security enforcement point to a single, manageable chokepoint.
5. Scenario: A platform team identifies an orphaned AWS account previously used by a departed contractor. The account contains an active Amazon Route 53 hosted zone serving production DNS records. To immediately stop the billing charges, an administrator clicks "Close Account" without deleting the resources. What are the immediate and long-term consequences of this action?
When an AWS account is closed, it enters a 90-day post-closure period: you can no longer use AWS services in it, but the underlying resources are not immediately destroyed, and the account can still be reopened by contacting AWS Support. During this period, the Route 53 hosted zone may continue to answer queries, but you will be unable to modify or manage those DNS records if an emergency arises. After the 90-day window, AWS permanently closes the account and deletes its remaining resources on no published schedule, potentially causing a catastrophic outage when the DNS records are eventually purged. To avoid this, move the account to a Suspended OU with a “deny all” SCP to stop activity, so administrators can safely identify, migrate, or manually delete critical resources before initiating the final closure.
6. Scenario: A multi-national corporation is designing a hub-and-spoke network topology to connect 50 regional workload environments. The AWS architecture team plans to use VPC Peering between every account, while the GCP team proposes a Shared VPC model. What architectural scaling challenges will the AWS team face with their approach compared to the GCP team's strategy?
The AWS team’s VPC Peering approach relies on a decentralized, point-to-point model. Because peering is not transitive, connecting 50 environments in a full mesh requires n(n-1)/2 = 1,225 peering connections, a count that grows quadratically, and every connection needs its own route table entries and security group rules. This rapidly becomes an operational nightmare to maintain, audit, and troubleshoot at scale, which is why AWS Transit Gateway is typically recommended for this volume. In contrast, GCP’s Shared VPC uses a centralized model where a single host project manages one unified VPC, its subnets, and its firewall rules, while service projects simply deploy resources into those shared subnets. This allows the GCP networking team to maintain centralized visibility and control over all traffic flows without the compounding complexity of point-to-point peering.
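The scaling gap between the two topologies can be checked with a few lines of arithmetic. The sketch below assumes a full mesh (every environment peered with every other) and, for the hub-and-spoke case, one attachment per environment; the function names are illustrative and not part of any AWS or GCP tooling.

```python
def full_mesh_connections(n: int) -> int:
    """Peering connections needed so each of n VPCs reaches every other one directly."""
    return n * (n - 1) // 2


def hub_attachments(n: int) -> int:
    """Attachments needed when each of n VPCs connects once to a central hub."""
    return n


# Compare how the two models grow as environments are added.
for n in (5, 10, 50):
    print(f"{n} environments: mesh={full_mesh_connections(n)}, hub={hub_attachments(n)}")
```

The mesh count grows quadratically while the hub count grows linearly, which is the practical reason a central hub is preferred once the number of environments reaches the dozens.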
7. Scenario: To boost developer velocity, an organization provides 50 engineers with their own personal AWS sandbox accounts, granting them full administrative access to experiment. After three months, the monthly cloud bill spikes by $15,000, primarily driven by forgotten GPU instances and unattached EBS volumes. How does implementing automated resource cleanup solve this issue beyond just reducing costs?
While the immediate benefit of automated resource cleanup is stopping the financial bleed caused by abandoned infrastructure, its deeper value lies in enforcing an ephemeral mindset and reducing the attack surface. Forgotten resources like unpatched EC2 instances or exposed load balancers inevitably become critical security vulnerabilities over time, providing attackers with easy footholds into the organization. By automatically wiping resources older than 48 to 72 hours, you proactively eliminate these lingering security risks while simultaneously removing the “guilt barrier” for developers, allowing them to freely experiment knowing the system will clean up after them. This practice ensures sandbox environments remain safe, cost-effective scratchpads rather than permanent, unmanaged technical debt.
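The core of an age-based cleanup job is a simple cutoff comparison. The sketch below is a minimal illustration, assuming a hypothetical inventory of resource records with an `id` and a timezone-aware `created_at`; a real job would build that inventory from the cloud provider's APIs and would then call the relevant delete operations.

```python
from datetime import datetime, timedelta, timezone


def expired_resources(resources, now, max_age_hours=72):
    """Return the IDs of resources created more than max_age_hours before now."""
    cutoff = now - timedelta(hours=max_age_hours)
    return [r["id"] for r in resources if r["created_at"] < cutoff]


# Hypothetical sandbox inventory, fixed in time so the result is reproducible.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
inventory = [
    {"id": "gpu-instance-1", "created_at": now - timedelta(hours=100)},
    {"id": "ebs-volume-7", "created_at": now - timedelta(hours=73)},
    {"id": "test-cluster", "created_at": now - timedelta(hours=5)},
]

print(expired_resources(inventory, now))
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the selection logic easy to test before it is wired to anything destructive.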
Hands-On Exercise: Design a Multi-Account Architecture
In this exercise, you will design and partially implement a multi-account architecture for a fictional company.
Scenario
Company: CloudBrew (a SaaS analytics platform)
- 6 engineering teams
- 3 environments: production, staging, development
- Compliance requirement: SOC2 Type II (audit logs must be immutable for 1 year)
- Budget: team-level cost attribution required for quarterly planning
- Kubernetes: each team runs 1-2 EKS clusters per environment
Task 1: Draw the OU Structure
Design the OU hierarchy for CloudBrew. Include all accounts, organized by OU.
Solution
```text
Root
├── Security OU
│   ├── log-archive (immutable S3, 1-year retention for SOC2)
│   ├── security-tooling (GuardDuty, SecurityHub, Inspector)
│   └── audit-readonly (auditor access, no write permissions)
│
├── Infrastructure OU
│   ├── network-hub (Transit Gateway, central DNS, NAT)
│   ├── shared-services (CI/CD, ECR, ArgoCD management)
│   └── identity (IAM Identity Center, Okta/Entra connector)
│
├── Workloads OU
│   ├── Production OU (SCP: no public S3, no IMDSv1, no large instances w/o approval)
│   │   ├── analytics-prod
│   │   ├── ingestion-prod
│   │   ├── api-prod
│   │   ├── ml-prod
│   │   ├── frontend-prod
│   │   └── data-prod
│   ├── Staging OU (SCP: relaxed, but still no public S3)
│   │   └── (mirror of prod accounts)
│   └── Development OU (SCP: region-restricted, instance-size limited)
│       └── (mirror of prod accounts)
│
├── Sandbox OU (SCP: 72hr auto-nuke, $100/month budget, restricted regions)
│   └── (one per developer, auto-provisioned)
│
└── Suspended OU (SCP: deny all)
```

Total accounts: 3 (security) + 3 (infra) + 18 (workloads: 6 teams x 3 envs) + N (sandboxes) = 24 + sandboxes
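The account total can be reproduced by enumerating the accounts rather than counting them by hand, which is also roughly what an account vending pipeline would iterate over. This is a sketch only: the `-staging` and `-dev` suffixes are assumed naming conventions (the diagram only says those OUs mirror production), and sandbox accounts are left out because their number depends on headcount.

```python
TEAMS = ["analytics", "ingestion", "api", "ml", "frontend", "data"]
ENVIRONMENTS = ["prod", "staging", "dev"]  # "staging" and "dev" suffixes are assumed
SECURITY = ["log-archive", "security-tooling", "audit-readonly"]
INFRASTRUCTURE = ["network-hub", "shared-services", "identity"]


def workload_accounts(teams, environments):
    """Build one account name per team per environment."""
    return [f"{team}-{env}" for env in environments for team in teams]


accounts = SECURITY + INFRASTRUCTURE + workload_accounts(TEAMS, ENVIRONMENTS)
print(len(accounts))
```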
Task 2: Write the Baseline SCP
Write an SCP that should apply to ALL member accounts (attached at the root).
Solution
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyCloudTrailTampering",
      "Effect": "Deny",
      "Action": [
        "cloudtrail:StopLogging",
        "cloudtrail:DeleteTrail",
        "cloudtrail:UpdateTrail"
      ],
      "Resource": "arn:aws:cloudtrail:*:*:trail/org-trail"
    },
    {
      "Sid": "DenyLeavingOrganization",
      "Effect": "Deny",
      "Action": "organizations:LeaveOrganization",
      "Resource": "*"
    },
    {
      "Sid": "DenyDisablingGuardDuty",
      "Effect": "Deny",
      "Action": [
        "guardduty:DeleteDetector",
        "guardduty:DisassociateFromMasterAccount",
        "guardduty:UpdateDetector"
      ],
      "Resource": "*"
    },
    {
      "Sid": "RestrictToAllowedRegions",
      "Effect": "Deny",
      "NotAction": [
        "iam:*",
        "organizations:*",
        "sts:*",
        "support:*",
        "billing:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": [
            "us-east-1",
            "us-west-2",
            "eu-west-1"
          ]
        }
      }
    }
  ]
}
```

Task 3: Configure Cross-Account ECR Access
Write the ECR repository policy that allows all workload accounts to pull images from the shared services account’s ECR.
Solution
```bash
# In the shared-services account, set the ECR repository policy
aws ecr set-repository-policy \
  --repository-name company/api-service \
  --policy-text '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "AllowWorkloadAccountsPull",
        "Effect": "Allow",
        "Principal": {
          "AWS": [
            "arn:aws:iam::111111111111:root",
            "arn:aws:iam::222222222222:root",
            "arn:aws:iam::333333333333:root"
          ]
        },
        "Action": [
          "ecr:GetDownloadUrlForLayer",
          "ecr:BatchGetImage",
          "ecr:BatchCheckLayerAvailability"
        ]
      }
    ]
  }'

# Better approach: use an organization condition
aws ecr set-repository-policy \
  --repository-name company/api-service \
  --policy-text '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "AllowOrgPull",
        "Effect": "Allow",
        "Principal": "*",
        "Action": [
          "ecr:GetDownloadUrlForLayer",
          "ecr:BatchGetImage",
          "ecr:BatchCheckLayerAvailability"
        ],
        "Condition": {
          "StringEquals": {
            "aws:PrincipalOrgID": "o-abc1234567"
          }
        }
      }
    ]
  }'
```

The organization condition is superior because you don’t need to update the policy every time a new account is created.
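Inline JSON inside a shell command is easy to break with a stray quote, so one option is to generate the policy text from code. The sketch below builds the same organization-scoped pull policy as a dictionary and serializes it; `org_pull_policy` is a hypothetical helper name, and the organization ID is the placeholder used above.

```python
import json

PULL_ACTIONS = [
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
    "ecr:BatchCheckLayerAvailability",
]


def org_pull_policy(org_id: str) -> dict:
    """Build a repository policy that lets any principal in the organization pull images."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowOrgPull",
                "Effect": "Allow",
                "Principal": "*",
                "Action": PULL_ACTIONS,
                "Condition": {"StringEquals": {"aws:PrincipalOrgID": org_id}},
            }
        ],
    }


policy_text = json.dumps(org_pull_policy("o-abc1234567"), indent=2)
print(policy_text)
```

The printed JSON could be saved to a file and passed to the CLI with `--policy-text file://policy.json`, which keeps the document out of the shell's quoting rules entirely.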
Task 4: Implement Centralized Logging for EKS Audit
Write a Terraform snippet that creates the centralized logging infrastructure: an S3 bucket with Object Lock in the log archive account and a cross-account role that workload accounts can assume to write logs.
Solution
```hcl
# In the log-archive account
resource "aws_s3_bucket" "audit_logs" {
  bucket = "cloudbrew-org-audit-logs"

  object_lock_enabled = true

  tags = {
    Environment = "security"
    Purpose     = "immutable-audit-logs"
    CostCenter  = "security-ops"
  }
}

resource "aws_s3_bucket_object_lock_configuration" "audit_logs" {
  bucket = aws_s3_bucket.audit_logs.id

  rule {
    default_retention {
      # GOVERNANCE can be bypassed by principals holding
      # s3:BypassGovernanceRetention; COMPLIANCE is the stricter mode.
      mode = "GOVERNANCE"
      days = 365
    }
  }
}

# Object Lock requires versioning on the bucket
resource "aws_s3_bucket_versioning" "audit_logs" {
  bucket = aws_s3_bucket.audit_logs.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_policy" "audit_logs" {
  bucket = aws_s3_bucket.audit_logs.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AllowOrgWrite"
        Effect = "Allow"
        # Principal ARNs cannot contain wildcards, so the role name is
        # matched with the aws:PrincipalArn condition key instead.
        Principal = "*"
        Action    = ["s3:PutObject"]
        Resource  = "${aws_s3_bucket.audit_logs.arn}/*"
        Condition = {
          StringEquals = {
            "aws:PrincipalOrgID" = "o-abc1234567"
          }
          ArnLike = {
            "aws:PrincipalArn" = "arn:aws:iam::*:role/audit-log-writer"
          }
        }
      },
      {
        Sid       = "DenyDeleteForEveryone"
        Effect    = "Deny"
        Principal = "*"
        Action    = ["s3:DeleteObject", "s3:DeleteObjectVersion"]
        Resource  = "${aws_s3_bucket.audit_logs.arn}/*"
      }
    ]
  })
}

# IAM role in each workload account (deployed via StackSets)
resource "aws_iam_role" "audit_log_writer" {
  name = "audit-log-writer"

  # EKS control plane logs land in CloudWatch Logs first, so adjust this
  # principal to whichever service actually ships them to S3.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Service = "eks.amazonaws.com"
        }
        Action = "sts:AssumeRole"
      }
    ]
  })
}

resource "aws_iam_role_policy" "audit_log_writer" {
  name = "write-to-central-bucket"
  role = aws_iam_role.audit_log_writer.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:PutObject"]
        Resource = "arn:aws:s3:::cloudbrew-org-audit-logs/*"
      }
    ]
  })
}
```

Task 5: Calculate Cost Allocation
Given this simplified monthly cost data, calculate per-team costs including shared infrastructure allocation.
| Account | Monthly Cost |
|---|---|
| Network Hub | $3,200 |
| Shared Services | $4,100 |
| Team-Alpha Prod | $12,000 |
| Team-Alpha Staging | $2,400 |
| Team-Beta Prod | $8,000 |
| Team-Beta Staging | $1,600 |
Solution
Shared infrastructure: $3,200 + $4,100 = $7,300
Allocation method: proportional to direct workload spend.
Total workload spend: $12,000 + $2,400 + $8,000 + $1,600 = $24,000
Team Alpha direct: $12,000 + $2,400 = $14,400 (60% of workload spend)
Team Beta direct: $8,000 + $1,600 = $9,600 (40% of workload spend)
Team Alpha shared allocation: $7,300 x 0.60 = $4,380
Team Beta shared allocation: $7,300 x 0.40 = $2,920
Team Alpha total: $14,400 + $4,380 = $18,780/month
Team Beta total: $9,600 + $2,920 = $12,520/month
Grand total: $18,780 + $12,520 = $31,300 (matches $24,000 + $7,300)
This proportional model is the most common approach. Alternatives include equal split (unfair if teams have different scale) or usage-based (accurate but complex to measure).
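The proportional method generalizes to any number of teams. The following sketch reproduces the calculation with the figures from the table; `allocate_shared_costs` and the dictionary layout are illustrative choices, not part of any billing tool.

```python
def allocate_shared_costs(direct_costs, shared_total):
    """Split shared_total across teams in proportion to their direct spend."""
    total_direct = sum(direct_costs.values())
    result = {}
    for team, direct in direct_costs.items():
        share = shared_total * direct / total_direct
        result[team] = {"direct": direct, "shared": share, "total": direct + share}
    return result


# Direct spend per team (prod + staging) and shared spend (network hub + shared services).
direct = {"alpha": 12_000 + 2_400, "beta": 8_000 + 1_600}
shared = 3_200 + 4_100

allocation = allocate_shared_costs(direct, shared)
for team, row in allocation.items():
    print(team, row)
```

Because each share is a fraction of the same shared total, the per-team totals always sum back to direct plus shared spend, which is a useful sanity check when the input comes from a real billing export.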
Success Criteria
- OU structure includes Security, Infrastructure, Workloads (with sub-OUs), Sandbox, and Suspended
- Baseline SCP protects CloudTrail, prevents leaving the org, and restricts regions
- ECR cross-account policy uses organization condition (not hardcoded account IDs)
- Centralized logging uses S3 Object Lock for immutability
- Cost allocation includes proportional shared infrastructure distribution
Next Module
Module 8.2: Advanced Cloud Networking & Transit Hubs. Learn how to connect all these accounts without creating a networking nightmare. Hub-and-spoke, transit gateways, and the art of routing traffic across organizational boundaries.