
Module 1.3: Cluster Topology Planning

Complexity: [COMPLEX] | Time: 60 minutes

Prerequisites: Module 1.2: Server Sizing, CKA Part 1: Cluster Architecture


After completing this module, you will be able to:

  1. Design multi-cluster topologies that balance blast radius isolation against operational complexity
  2. Evaluate single-cluster vs. multi-cluster architectures based on team structure, compliance boundaries, and failure domains
  3. Plan control plane placement across racks and availability zones for high availability
  4. Implement cluster segmentation strategies that align with business domains and security requirements

In 2021, a European insurance company ran a single 400-node Kubernetes cluster in their on-premises datacenter. Everything — customer portal, claims processing, actuarial calculations, internal tooling — ran on one cluster. When they upgraded from Kubernetes 1.24 to 1.25, the removal of the PodSecurityPolicy API caused 60% of their workloads to fail admission. The entire company was down for 4 hours. Their postmortem identified the root cause as “catastrophic blast radius” — a single cluster meant a single failure domain for 200+ applications across 15 business units.

They spent the next six months splitting into 7 clusters: one per business domain plus a shared platform cluster. The migration cost $800K in engineering time. The CTO’s lesson: “The most expensive architecture decision is the one you make on day one and have to undo on day 300.”

How many clusters should you run? Where should the control planes live? Should clusters span racks or stay within one? These topology decisions are hard to change later and have cascading implications for networking, storage, security, and operations.

The City Planning Analogy

Cluster topology is like city planning. One massive city (monocluster) has traffic congestion, single points of failure, and one mayor who controls everything. Multiple smaller cities (multi-cluster) have independent governance, isolated failures, and clear boundaries — but need highways (networking) and trade agreements (service mesh) between them. The right answer depends on your population size and how much autonomy each district needs.


This module covers:

  • Single cluster vs multi-cluster decision framework
  • Control plane placement strategies for on-premises
  • Rack-aware topology and failure domain design
  • etcd topology patterns (stacked vs external)
  • Namespace-based vs cluster-based isolation trade-offs
  • How to plan for cluster lifecycle (creation, upgrade, decommission)

┌─────────────────────────────────────────────────────────────┐
│ SINGLE CLUSTER │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Kubernetes Cluster │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Team A │ │ Team B │ │ Team C │ │ │
│ │ │ ns: app-a│ │ ns: app-b│ │ ns: app-c│ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │
│ │ Pros: │ │
│ │ ✓ Simple operations (1 cluster to manage) │ │
│ │ ✓ Easy service discovery (DNS within cluster) │ │
│ │ ✓ Shared resources (better utilization) │ │
│ │ ✓ Single control plane cost │ │
│ │ │ │
│ │ Cons: │ │
│ │ ✗ Blast radius = everything │ │
│ │ ✗ Noisy neighbors (one team's load spike hits all) │ │
│ │ ✗ Upgrade = upgrade everything at once │ │
│ │ ✗ RBAC complexity scales with teams │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Best for: < 100 nodes, < 5 teams, homogeneous workloads │
│ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ MULTI-CLUSTER │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Platform │ │ Prod │ │ Staging │ │ Dev │ │
│ │ Cluster │ │ Cluster │ │ Cluster │ │ Cluster │ │
│ │ │ │ │ │ │ │ │ │
│ │ Shared │ │ Customer │ │ Pre-prod │ │ Sandbox │ │
│ │ services │ │ facing │ │ testing │ │ for devs │ │
│ │ (CI/CD, │ │ workloads│ │ │ │ │ │
│ │ observ) │ │ │ │ │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ Best for: │
│ - Strict environment isolation (prod vs non-prod) │
│ - Multi-tenant platforms (cluster per tenant/BU) │
│ - Different K8s versions per environment │
│ - Regulatory boundaries (PCI scope isolation) │
│ - > 200 nodes (split for operational sanity) │
│ │
└─────────────────────────────────────────────────────────────┘

Pause and predict: Your company has 120 nodes, 6 teams, and a mix of PCI-scoped payment processing and general web applications. Before reading the decision matrix, would you recommend a single cluster or multiple clusters? What is the single biggest factor driving your decision?

| Factor | Single Cluster | Multi-Cluster |
| --- | --- | --- |
| Teams | < 5 | 5+ or strict isolation needed |
| Nodes | < 100-200 | 200+ or split by purpose |
| Environments | Namespace separation OK | Need hard isolation (prod/staging/dev) |
| Compliance | No PCI/HIPAA scope concerns | Need regulatory boundary isolation |
| K8s versions | All teams on same version | Teams need different versions |
| Blast radius tolerance | High (startup mentality) | Low (enterprise, regulated) |
| Operational team size | 2-3 engineers | 4+ engineers |

On-premises, you decide where control plane nodes physically live. This decision determines your failure tolerance.

In the stacked topology, control plane components and etcd run on the same nodes:

┌─────────────────────────────────────────────────────────────┐
│ STACKED CONTROL PLANE │
│ │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐│
│ │ CP Node 1 │ │ CP Node 2 │ │ CP Node 3 ││
│ │ │ │ │ │ ││
│ │ API Server │ │ API Server │ │ API Server ││
│ │ Controller Mgr │ │ Controller Mgr │ │ Controller Mgr ││
│ │ Scheduler │ │ Scheduler │ │ Scheduler ││
│ │ etcd │ │ etcd │ │ etcd ││
│ │ │ │ │ │ ││
│ │ 8 cores, 32GB │ │ 8 cores, 32GB │ │ 8 cores, 32GB ││
│ │ 200GB NVMe │ │ 200GB NVMe │ │ 200GB NVMe ││
│ └────────────────┘ └────────────────┘ └────────────────┘│
│ │
│ ✓ Simple: fewer servers to manage │
│ ✓ kubeadm default: easy to set up │
│ ✗ etcd failure = CP node failure (coupled) │
│ ✗ Cannot scale etcd independently │
│ │
│ Best for: clusters < 200 nodes │
│ │
└─────────────────────────────────────────────────────────────┘
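kubeadm produces this stacked layout by default. Below is a minimal configuration sketch, assuming kubeadm's v1beta3 config API; the Kubernetes version and the load-balanced endpoint (`cp-vip.example.internal`) are placeholders, not values from this module:

```yaml
# Sketch: kubeadm ClusterConfiguration for a stacked HA control plane
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0                             # placeholder version
controlPlaneEndpoint: "cp-vip.example.internal:6443"   # VIP/LB in front of all 3 API servers (placeholder)
etcd:
  local:                    # "local" = stacked: kubeadm runs etcd on each CP node
    dataDir: /var/lib/etcd
```

The second and third control plane nodes then join with `kubeadm join --control-plane`, and kubeadm adds their etcd members to the stacked cluster as part of the join.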

In the external etcd topology, etcd runs on dedicated servers with NVMe, separate from the API servers:

┌─────────────────────────────────────────────────────────────┐
│ EXTERNAL ETCD TOPOLOGY │
│ │
│ API Server Nodes: etcd Nodes: │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ API Server 1 │──────────────│ etcd Node 1 │ │
│ │ Ctrl Mgr │ │ 4 cores, 16GB│ │
│ │ Scheduler │ │ 200GB NVMe │ │
│ │ 8 cores, 16GB│ └──────────────┘ │
│ └──────────────┘ ┌──────────────┐ │
│ ┌──────────────┐ │ etcd Node 2 │ │
│ │ API Server 2 │──────────────│ 4 cores, 16GB│ │
│ │ 8 cores, 16GB│ │ 200GB NVMe │ │
│ └──────────────┘ └──────────────┘ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ API Server 3 │──────────────│ etcd Node 3 │ │
│ │ 8 cores, 16GB│ │ 4 cores, 16GB│ │
│ └──────────────┘ │ 200GB NVMe │ │
│ └──────────────┘ │
│ │
│ ✓ etcd on dedicated NVMe (no resource contention) │
│ ✓ Scale API servers independently from etcd │
│ ✓ etcd failures don't take down API server process │
│ ✗ More servers (6 instead of 3) │
│ ✗ More complex setup │
│ │
│ Best for: clusters > 200 nodes, high-throughput workloads │
│ │
└─────────────────────────────────────────────────────────────┘
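To point kubeadm at the dedicated etcd tier instead of managing local members, the cluster configuration references external endpoints. A sketch using the addresses from the diagram; the client-certificate paths follow kubeadm's usual conventions, but verify them against your own PKI layout:

```yaml
# Sketch: kubeadm ClusterConfiguration for external etcd
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
etcd:
  external:
    endpoints:
      - https://10.0.1.10:2379
      - https://10.0.1.11:2379
      - https://10.0.1.12:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
```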

In the management cluster pattern, one management cluster hosts the control planes for multiple tenant clusters:

┌─────────────────────────────────────────────────────────────┐
│ MANAGEMENT CLUSTER PATTERN │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ Management Cluster (3 nodes) │ │
│ │ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Cluster │ │ Cluster │ │ Cluster │ │ │
│ │ │ API: Dev│ │API: Stg │ │API: Prod│ │ │
│ │ │ etcd │ │ etcd │ │ etcd │ │ │
│ │ └────┬────┘ └────┬────┘ └────┬────┘ │ │
│ └───────┼─────────────┼───────────┼─────────┘ │
│ │ │ │ │
│ ┌────▼────┐ ┌────▼────┐ ┌────▼────┐ │
│ │Workers │ │Workers │ │Workers │ │
│ │Dev (5) │ │Stg (10) │ │Prod (50)│ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Technologies: vCluster, Kamaji, Cluster API │
│ Savings: 3 servers for N cluster control planes │
│ │
└─────────────────────────────────────────────────────────────┘

On-premises, you must design for physical failure domains that cloud abstracts away.

┌─────────────────────────────────────────────────────────────┐
│ FAILURE DOMAIN HIERARCHY │
│ │
│ Datacenter ──── Entire site power/cooling failure │
│ │ │
│ ├── Room ──── Fire suppression, cooling zone │
│ │ │ │
│ │ ├── Row ──── PDU circuit, top-of-row switch │
│ │ │ │ │
│ │ │ ├── Rack ──── PDU, ToR switch, single UPS │
│ │ │ │ │ │
│ │ │ │ └── Server ──── PSU, disk, NIC, CPU │
│ │ │ │ │
│ │ │ └── Rack │
│ │ │ │
│ │ └── Row │
│ │ │
│ └── Room │
│ │
│ Rule: Spread control plane across failure domains │
│ Minimum: 1 CP node per rack (survive rack failure) │
│ Ideal: CP nodes across rows or rooms │
│ │
└─────────────────────────────────────────────────────────────┘

Stop and think: You have 3 racks, each with its own PDU and ToR switch. Your cluster has 3 control plane nodes. If you put all 3 CP nodes in rack A (to simplify cabling), what happens when rack A loses power? Now consider: what happens if you spread them one per rack and rack A loses power?

Kubernetes topology labels map physical datacenter layout into the scheduling system. By labeling nodes with their rack, row, and room, you enable the scheduler to spread replicas across failure domains — so a single rack failure does not take down all instances of a critical service:

```shell
# Label nodes with physical topology
kubectl label node worker-01 \
  topology.kubernetes.io/zone=rack-a \
  topology.kubernetes.io/region=dc-east \
  node.kubernetes.io/room=server-room-1 \
  node.kubernetes.io/row=row-3
```

```yaml
# Use topology spread constraints to distribute pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone   # = rack
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: payment-api
      containers:
        - name: payment-api
          image: registry.example.internal/payment-api:latest   # placeholder image

# This ensures: max 1 pod difference between racks
# With 6 replicas across 3 racks: 2-2-2 distribution
# If rack-a fails: 0-2-2 (4 replicas survive immediately)
#
# WARNING: With DoNotSchedule, the 2 replacement pods will stay
# Pending while rack-a's nodes are still registered — placing them
# elsewhere would create a 0-3-3 spread (skew=3), violating
# maxSkew: 1. For HA-critical workloads, use ScheduleAnyway, which
# treats the constraint as a preference rather than a hard
# requirement during scheduling.
```
┌─────────────────────────────────────────────────────────────┐
│ RACK LAYOUT (per rack) │
│ │
│ 42U Rack │
│ ┌────────────────────────────────────┐ │
│ │ U42: Patch panel (fiber/copper) │ │
│ │ U41: ToR Switch 1 (25GbE) │ │
│ │ U40: ToR Switch 2 (25GbE, redundant)│ │
│ │ U39: ── empty (airflow) ── │ │
│ │ U38: Management switch (1GbE) │ │
│ │ U37: ── empty ── │ │
│ │ U36-U35: Control plane node │ 2U │
│ │ U34-U33: Worker node 1 │ 2U │
│ │ U32-U31: Worker node 2 │ 2U │
│ │ U30-U29: Worker node 3 │ 2U │
│ │ U28-U27: Worker node 4 │ 2U │
│ │ U26-U25: Worker node 5 │ 2U │
│ │ U24-U23: Worker node 6 │ 2U │
│ │ U22-U21: Storage node (Ceph OSD) │ 2U │
│ │ U20-U01: ── expansion space ── │ 20U spare │
│ │ U00: PDU (2x redundant, A+B feed) │ │
│ └────────────────────────────────────┘ │
│ │
│ Power budget: ~8-12 kW per rack (check PDU rating) │
│ Cooling: 1 ton per 3.5 kW of IT load (rule of thumb) │
│ Weight: ~1,200 lbs fully loaded (check floor rating) │
│ │
└─────────────────────────────────────────────────────────────┘
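The cooling rule of thumb above is easy to sanity-check with quick arithmetic. A sketch, assuming a 10 kW IT load per rack (the midpoint of the stated 8-12 kW power budget):

```shell
# 1 ton of cooling per 3.5 kW of IT load (rule of thumb from the rack layout)
it_load_kw=10
awk -v kw="$it_load_kw" 'BEGIN { printf "Cooling needed: %.1f tons\n", kw / 3.5 }'
# → Cooling needed: 2.9 tons
```

Repeat with your measured per-rack draw; a fully loaded 12 kW rack needs roughly 3.4 tons.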

etcd is a distributed consensus system. Its topology determines your cluster’s durability and performance.

Pause and predict: Your CTO wants to deploy 6 etcd members “for extra safety.” Based on how Raft consensus works, would 6 members be more or less resilient than 5? What is the downside of even-numbered membership?

| etcd Members | Quorum | Tolerates Failures | Recommended For |
| --- | --- | --- | --- |
| 1 | 1 | 0 | Dev/test only |
| 3 | 2 | 1 | Standard production |
| 5 | 3 | 2 | Mission-critical |
| 7 | 4 | 3 | Rarely needed (higher write latency) |
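The quorum column is just Raft's majority rule: quorum = floor(n/2) + 1, and tolerated failures = n - quorum. A quick shell check, which also shows why even member counts add nothing:

```shell
# Raft majority math: note that 6 members tolerate no more failures than 5
for n in 1 2 3 4 5 6 7; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerates=$(( n - quorum ))"
done
# Excerpt of output:
#   members=5 quorum=3 tolerates=2
#   members=6 quorum=4 tolerates=2   <- no gain over 5, more write latency
```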

Critical: etcd latency between members must be < 10ms RTT. Do not stretch etcd across datacenters unless they have dedicated low-latency links (< 2ms RTT).

```shell
# Check etcd member health and latency
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.0.1.10:2379,https://10.0.1.11:2379,https://10.0.1.12:2379 \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  endpoint health --write-out=table

# Check etcd performance
ETCDCTL_API=3 etcdctl check perf --endpoints=https://10.0.1.10:2379 \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt
```

  • Google runs approximately 15,000 Kubernetes clusters internally (Borg/GKE hybrid). They do not run one giant cluster — they use the multi-cluster pattern with automated lifecycle management. Even at Google’s scale, the operational overhead of one massive cluster is worse than many smaller ones.

  • The maximum tested cluster size for Kubernetes is 5,000 nodes. Beyond that, the API server’s watch cache, etcd’s storage, and the scheduler’s throughput become bottlenecks. Most production clusters stay under 500 nodes and split beyond that.

  • etcd’s Raft consensus requires a majority quorum for every write. With 5 members, every write must be acknowledged by 3 before it is committed. Adding a 6th member does not improve fault tolerance (still tolerates 2 failures) but increases write latency. Always use odd numbers.

  • Spotify runs 150+ Kubernetes clusters across their infrastructure, each scoped to a team or service domain. They invest heavily in automation via Backstage (which they created) to manage the lifecycle of all these clusters.


| Mistake | Problem | Solution |
| --- | --- | --- |
| One giant cluster | Blast radius = entire company | Split by environment, then by domain |
| Too many clusters | Operational overhead exceeds team capacity | 1 engineer can manage ~5-10 clusters with good automation |
| CP nodes in same rack | Rack failure = cluster down | Spread CP across racks or rows |
| Stretching etcd across DCs | Latency kills consensus performance | etcd in one DC; use federation for multi-DC |
| No lifecycle automation | Manual cluster creation takes days | Cluster API + GitOps for declarative lifecycle |
| Namespace isolation only | Namespaces don't provide hard security boundaries | Use clusters for trust boundaries, namespaces for organization |
| Not labeling nodes | Cannot use topology spread constraints | Label every node with rack, row, room, DC |
| Even number of etcd members | No added fault tolerance; a 50/50 partition leaves no side with quorum | Always odd: 3, 5, or 7 |
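The lifecycle-automation row deserves a concrete shape. With Cluster API, a cluster becomes a declarative object you can store in Git and reconcile like any other resource. A heavily hedged sketch: the `cluster.x-k8s.io/v1beta1` kinds are real, but the infrastructure kind varies by provider (`Metal3Cluster` here assumes the Metal3 bare-metal provider) and all names are placeholders:

```yaml
# Sketch: a Cluster API object describing one cluster, suitable for GitOps
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-east            # placeholder name
  namespace: clusters
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: prod-east-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: Metal3Cluster      # provider-specific; depends on your bare-metal provider
    name: prod-east-infra
```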

Your company has 300 nodes, 8 teams, and a regulatory requirement to isolate PCI-scoped workloads. How many clusters would you recommend?

Answer

Minimum 3 clusters, recommended 4-5:

  1. PCI cluster — Dedicated to payment processing workloads. Hard isolation boundary for audit scope. Minimal node count (only what payment services need). Separate control plane, separate network segment.

  2. Production cluster — Non-PCI production workloads. The bulk of your 300 nodes.

  3. Non-production cluster — Dev, staging, QA. Can share one cluster with namespace isolation.

  4. Platform cluster (optional) — CI/CD, monitoring, logging, GitOps controllers. Separates platform tooling from application workloads.

  5. Management cluster (optional) — Hosts Cluster API controllers, manages lifecycle of other clusters.

The PCI cluster is non-negotiable — regulatory scope isolation requires a hard boundary. The prod/non-prod split prevents staging incidents from affecting production. The platform cluster is a maturity decision.

You have 3 racks in one datacenter. Where do you place your 3 control plane nodes?

Answer

One control plane node per rack. This ensures that a rack failure (PDU, ToR switch, or cooling issue) takes down at most 1 of 3 CP nodes. The remaining 2 maintain quorum (2/3 majority).

Rack A: CP-1 + Workers
Rack B: CP-2 + Workers
Rack C: CP-3 + Workers

If you only have 2 racks, place 2 CP nodes in one rack and 1 in the other. A failure in the 2-node rack will lose quorum, but a failure in the 1-node rack will not. This is not ideal — 3 racks is the minimum for proper CP distribution.

Label nodes with topology.kubernetes.io/zone=rack-{a,b,c} and use topology spread constraints to distribute application pods across racks.

When should you use external etcd instead of stacked?

Answer

Use external etcd when:

  1. Cluster size > 200 nodes — etcd write throughput becomes critical; dedicated NVMe servers with no resource contention are essential.

  2. etcd performance is paramount — Financial services, real-time systems where API latency matters. External etcd eliminates CPU/memory contention with API server.

  3. You need to scale API servers independently — If API server load is high (many controllers, webhooks, CRDs) but etcd is not the bottleneck, you can add more API server nodes without adding etcd members.

  4. You want independent failure domains — API server crash should not affect etcd data integrity and vice versa.

Use stacked when:

  • Cluster < 200 nodes
  • Simplicity is valued over maximum performance
  • You have limited server count (3 servers = 3 stacked CP nodes)

What is the maximum recommended RTT latency between etcd members, and why?

Answer

< 10ms round-trip time, ideally < 2ms.

etcd uses the Raft consensus protocol, which requires a leader to replicate log entries to a majority of members on every write. The default heartbeat interval is 100ms, and the election timeout is 1,000ms (10x heartbeat).

If network latency between etcd members exceeds ~10ms, the time for a write to be committed (leader → majority acknowledgment) approaches the heartbeat interval. Under load, this causes:

  1. Slow API server responses (every kubectl command waits for etcd)
  2. Leader election instability (heartbeats arrive too late)
  3. Write throughput collapse (Raft serializes writes through the leader)

This is why etcd should never be stretched across datacenters unless they have a dedicated low-latency link (dark fiber, < 2ms RTT). For multi-DC, use separate clusters with federation or replication at the application layer.
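When inter-member RTT creeps toward the limit, etcd's timing flags can be raised as a mitigation, not a fix. This is a config fragment, not a full invocation; the flag names are etcd's own, and the values shown are its documented defaults in milliseconds:

```shell
# Sketch (fragment): etcd timing flags with default values.
# Guideline: heartbeat-interval ~ max RTT between members (round up);
# election-timeout ~ 10x heartbeat-interval.
etcd --heartbeat-interval=100 \
     --election-timeout=1000
```

Raising these values makes the cluster more tolerant of latency but slows failure detection and leader election.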


Hands-On Exercise: Design a Cluster Topology

Section titled “Hands-On Exercise: Design a Cluster Topology”

Task: Given an organization’s requirements, design a complete cluster topology with physical placement.

A manufacturing company is deploying Kubernetes on-premises:

  • 2 datacenters (DC-East and DC-West, 50km apart, 5ms RTT)
  • 150 total nodes needed
  • 4 teams: Platform, Product, Data Science, QA
  • PCI compliance for payment processing (20 nodes)
  • GPU workloads for quality inspection ML models (10 nodes)
  • Need to survive a full datacenter failure
  1. Determine cluster count:

    • PCI cluster (dedicated, DC-East): 15 worker nodes + 3 CP = 18
    • Production cluster (DC-East primary): 50 worker nodes + 10 GPU nodes + 3 CP = 63
    • DR/Standby cluster (DC-West): 35 worker nodes + 3 CP = 38
    • Non-prod cluster (DC-West): 28 worker nodes + 3 CP = 31
    • Total: 18 + 63 + 38 + 31 = 150 nodes
  2. Place control planes:

```
# DC-East (3 racks) — 81 nodes total (PCI: 18, Prod: 63)
# Rack A: PCI CP-1, Prod CP-1, 5 PCI workers, 20 Prod workers (27 nodes)
# Rack B: PCI CP-2, Prod CP-2, 5 PCI workers, 20 Prod workers (27 nodes)
# Rack C: PCI CP-3, Prod CP-3, 5 PCI workers, 10 Prod workers + 10 GPU (27 nodes)

# DC-West (3 racks) — 69 nodes total (DR: 38, NonProd: 31)
# Rack D: DR CP-1, NonProd CP-1, 12 DR workers, 9 NonProd workers (23 nodes)
# Rack E: DR CP-2, NonProd CP-2, 12 DR workers, 9 NonProd workers (23 nodes)
# Rack F: DR CP-3, NonProd CP-3, 11 DR workers, 10 NonProd workers (23 nodes)
```
  3. Label nodes:
```shell
# DC-East nodes
kubectl label node east-rack-a-01 \
  topology.kubernetes.io/region=dc-east \
  topology.kubernetes.io/zone=rack-a \
  node.kubernetes.io/purpose=worker

# GPU nodes — label and taint to isolate
kubectl label node east-rack-c-gpu-01 \
  topology.kubernetes.io/region=dc-east \
  topology.kubernetes.io/zone=rack-c \
  node.kubernetes.io/gpu=nvidia-a100 \
  node.kubernetes.io/purpose=gpu

kubectl taint nodes east-rack-c-gpu-01 \
  node.kubernetes.io/gpu=nvidia-a100:NoSchedule
```
  4. Define topology spread:

```yaml
# Production deployment spread across racks
# Using ScheduleAnyway so replacements can schedule after a rack failure
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
```
Success criteria:

  • Cluster count justified with reasoning
  • PCI workloads in dedicated cluster
  • Control planes spread across failure domains (racks)
  • etcd not stretched across DCs (< 10ms RTT within cluster)
  • DR strategy handles full DC failure
  • Node labels defined for topology-aware scheduling
  • GPU nodes isolated with labels and taints

Continue to Module 1.4: TCO & Budget Planning to learn how to build a comprehensive cost model for your on-premises Kubernetes platform.