Module 1.4: TCO & Budget Planning
Цей контент ще не доступний вашою мовою.
Complexity:
[MEDIUM]| Time: 45 minutesPrerequisites: Module 1.2: Server Sizing, FinOps Fundamentals
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Design a comprehensive TCO model that captures hardware, staffing, facilities, networking, and opportunity costs
- Evaluate cloud vs. on-premises economics with accurate 3-5 year projections including hidden costs
- Plan budget proposals that account for colocation, power, cooling, maintenance contracts, and staffing overhead
- Diagnose common budgeting blind spots that cause on-premises projects to exceed estimates by 2-3x
Why This Module Matters
Section titled “Why This Module Matters”In 2023, a logistics company presented their board with a plan to move from AWS to on-premises Kubernetes. The infrastructure team calculated hardware costs: $380,000 for servers, networking, and storage. The CFO approved. Twelve months later, the actual spend was $1.1 million. The team had not budgeted for colocation fees ($72K/year), redundant internet circuits ($36K/year), a part-time facilities manager ($45K/year), two additional engineers ($350K/year), hardware maintenance contracts ($38K/year), or the $180K in professional services for the initial buildout. The “savings” over cloud disappeared entirely.
The infrastructure team was not incompetent — they were engineers, not financial modelers. They calculated what they knew (hardware) and missed what they did not (everything else). This module teaches you to build a complete TCO model so your budget reflects reality.
The Iceberg Analogy
Hardware cost is the tip of the iceberg — visible and easy to calculate. Below the waterline: staffing, power, cooling, colocation, networking, maintenance contracts, licenses, insurance, and opportunity cost. The submerged portion is typically 3-5x the hardware cost.
What You’ll Learn
Section titled “What You’ll Learn”- The complete cost taxonomy for on-premises Kubernetes
- How to build a 3-year and 5-year TCO model
- CapEx vs OpEx trade-offs and financial implications
- Power and cooling calculations
- Staffing models and the cost of expertise
- Cloud breakeven analysis methodology
Cost Taxonomy
Section titled “Cost Taxonomy”Capital Expenditure (CapEx)
Section titled “Capital Expenditure (CapEx)”One-time costs that are depreciated over the asset’s life (typically 3-5 years):
┌─────────────────────────────────────────────────────────────┐│ CAPEX COMPONENTS ││ ││ COMPUTE ││ ├── Servers (CPU, RAM, NVMe) $8K-25K per node ││ ├── GPU accelerators (if needed) $10K-40K per GPU ││ └── Spare parts (drives, PSUs, RAM) 5-10% of server cost││ ││ NETWORKING ││ ├── Spine switches (100GbE) $15K-40K each ││ ├── ToR switches (25GbE) $3K-8K each ││ ├── Management switches (1GbE) $500-2K each ││ ├── Cabling (fiber + copper) $2K-5K per rack ││ ├── Firewalls / load balancers $5K-50K ││ └── Out-of-band management network $1K-3K ││ ││ STORAGE (if not using server-local) ││ ├── SAN/NAS (enterprise) $50K-500K ││ └── Dedicated Ceph nodes $10K-20K each ││ ││ INFRASTRUCTURE ││ ├── Racks (42U) $1K-3K each ││ ├── PDUs (redundant A+B) $500-2K per rack ││ ├── UPS (if not colo-provided) $5K-20K ││ └── KVM / console access $1K-5K ││ │└─────────────────────────────────────────────────────────────┘Pause and predict: A team presents their on-premises budget to the CFO showing only the CapEx line items above. The CFO approves. What categories are missing from this budget that will cause the project to exceed estimates? List at least four before reading the OpEx section below.
Operating Expenditure (OpEx)
Section titled “Operating Expenditure (OpEx)”Recurring monthly or annual costs:
┌─────────────────────────────────────────────────────────────┐│ OPEX COMPONENTS ││ ││ FACILITY (monthly) ││ ├── Colocation (per rack) $800-2,500/mo ││ ├── Power (per kW) $80-150/kW/mo ││ ├── Cooling (included in colo or ~40% of power cost ││ │ separate HVAC cost) ││ ├── Internet circuits (redundant) $500-3,000/mo ││ └── Cross-connects to other cages $200-500/mo each ││ ││ STAFFING (annual, fully loaded) ││ ├── Infrastructure engineer $150K-220K ││ ├── Network engineer (shared) $70K-110K (0.5 FTE) ││ ├── Security engineer (shared) $80K-120K (0.5 FTE) ││ └── On-call compensation $15K-30K/person/yr ││ ││ MAINTENANCE (annual) ││ ├── Hardware support contract 3-5% of hardware ││ │ (Dell ProSupport, HPE Pointnext) cost per year ││ ├── Software licenses $0-50K (RHEL, etc.) ││ └── Professional services $20K-100K (one-time) ││ ││ HIDDEN COSTS ││ ├── Opportunity cost (eng not Hard to quantify ││ │ building product) ││ ├── Insurance (equipment) $2K-10K/yr ││ ├── Compliance/audit $10K-50K/yr ││ └── Training & conferences $5K-15K/yr ││ │└─────────────────────────────────────────────────────────────┘Power and Cooling Calculations
Section titled “Power and Cooling Calculations”Per-Server Power
Section titled “Per-Server Power”┌─────────────────────────────────────────────────────────────┐│ POWER ESTIMATION ││ ││ Component Idle Power Full Load ││ ───────────────── ────────── ───────── ││ Dual Xeon/EPYC 120W 500-600W ││ RAM (512GB DDR5) 20W 40W ││ NVMe (4x 3.84TB) 8W 20W ││ NICs (2x 25GbE) 5W 15W ││ Fans + PSU loss 30W 60W ││ ───────────────── ────────── ───────── ││ Total per server ~180W ~700W ││ ││ GPU server (4x A100): add 1,200W at full load ││ ││ Typical K8s utilization: 40-60% = ~400W average ││ ││ Per rack (8 servers): ││ Average: 8 x 400W = 3.2 kW ││ Peak: 8 x 700W = 5.6 kW ││ ││ Monthly power cost: ││ 3.2 kW x 730 hours x $0.12/kWh = $280/month ││ Plus cooling (PUE 1.4): $280 x 1.4 = $392/month ││ │└─────────────────────────────────────────────────────────────┘PUE (Power Usage Effectiveness)
Section titled “PUE (Power Usage Effectiveness)”Total facility powerPUE = ───────────────── IT equipment power
PUE 1.0 = perfect (impossible)PUE 1.2 = excellent (hyperscaler DCs)PUE 1.4 = good (modern colo)PUE 1.6 = average (older facility)PUE 2.0 = poor (needs upgrade)
Your effective power cost = IT power cost × PUEStaffing Models
Section titled “Staffing Models”Stop and think: Your organization currently manages 80 nodes with 2 engineers using Ansible and shell scripts. You want to grow to 200 nodes. Should you hire 2 more engineers, or invest in Cluster API and GitOps automation? What factors beyond cost would influence this decision?
The Nodes-Per-Engineer Ratio
Section titled “The Nodes-Per-Engineer Ratio”| Automation Level | Nodes per Engineer | Tools |
|---|---|---|
| Manual (kubectl + scripts) | 20-30 | Shell scripts, manual provisioning |
| Partially automated | 50-80 | Ansible, Terraform, basic CI/CD |
| Fully automated | 100-200 | Cluster API, GitOps, machine health checks |
| Platform team at scale | 200-500 | Custom controllers, self-service portal |
┌─────────────────────────────────────────────────────────────┐│ STAFFING COST MODEL ││ ││ Cluster Size Engineers Annual Staff Cost ││ ──────────── ───────── ───────────────── ││ 20 nodes 2 (shared) $200K (0.5 FTE dedicated) ││ 50 nodes 2 $350K ││ 100 nodes 3 $525K ││ 200 nodes 4 $700K ││ 500 nodes 6-8 $1.2M ││ 1000+ nodes 10-15 $2M+ ││ ││ Note: "shared" means engineers also do other work. ││ Fully loaded cost = salary × 1.3-1.5 (benefits, tax, etc.)││ │└─────────────────────────────────────────────────────────────┘Pause and predict: Before looking at the TCO model below, estimate what percentage of total 3-year on-premises cost is hardware vs. staffing for a 100-node cluster. Write down your guess, then check it against the model.
3-Year TCO Model
Section titled “3-Year TCO Model”Template
Section titled “Template”┌─────────────────────────────────────────────────────────────┐│ 3-YEAR TCO MODEL (100 K8s nodes on 13 bare-metal servers) ││ (Each server: 64 cores, 256GB → ~8 K8s nodes per server) ││ ││ YEAR 0 (SETUP) ││ ──────────────────────────────────────── ││ Servers (13 x $15K) $195,000 ││ Networking (switches + cabling) $45,000 ││ Storage (Ceph nodes, 3 x $12K) $36,000 ││ Racks + PDUs $8,000 ││ Professional services (buildout) $60,000 ││ ──────────────────────────────────────── ││ CapEx subtotal $344,000 ││ ││ ANNUAL OPEX ││ ──────────────────────────────────────── ││ Colocation (3 racks) $54,000 ││ Power + cooling (7kW IT x PUE 1.4 $10,300 ││ x $0.12/kWh x 8,760 hrs) ││ Internet (2x circuits) $24,000 ││ Staffing (3 engineers) $525,000 ││ Maintenance contracts $16,500 ││ Software licenses $12,000 ││ Insurance + compliance $15,000 ││ ──────────────────────────────────────── ││ OpEx subtotal (per year) $656,800 ││ ││ 3-YEAR TOTAL ││ ──────────────────────────────────────── ││ CapEx $344,000 ││ OpEx (3 years) $1,970,400 ││ ──────────────────────────────────────── ││ TOTAL $2,314,400 ││ Per month $64,289 ││ ││ CLOUD COMPARISON (100 x m6i.2xlarge @ $0.384/hr) ││ ──────────────────────────────────────── ││ On-demand: 100 x $280/mo $1,008,000 (3 yr) ││ 1-year RI (no upfront, ~35% off) $655,000 (3 yr) ││ 3-year RI (all upfront, ~55% off) $454,000 (3 yr) ││ + EKS fee + storage + data xfer ~$200,000 (3 yr) ││ ││ VERDICT at 100 nodes (13 servers): ││ Cloud (3yr RI + extras) ≈ $650K vs On-prem ≈ $2.3M ││ → Cloud wins decisively (staffing dominates on-prem) ││ ││ Breakeven: ~300-500 nodes where staffing amortizes ││ and cloud egress/storage costs grow linearly ││ │└─────────────────────────────────────────────────────────────┘Did You Know?
Section titled “Did You Know?”-
Colocation pricing varies 3-4x by market. A full rack in Northern Virginia (Ashburn) costs ~$800-1,200/month. The same rack in London costs ~$1,500-2,500/month. In Singapore, it can exceed $3,000/month. Location choice has a massive TCO impact.
-
Power costs are the fastest-growing component of on-prem TCO. Electricity prices have increased 20-40% in many markets since 2020. GPU servers make this worse — a rack of 8 GPU servers can draw 15-20 kW, costing $1,500-3,000/month in power alone.
-
Hardware refresh cycles are getting longer. In 2010, servers were refreshed every 3 years. Today, 5-year cycles are common because modern servers (DDR5, PCIe Gen5) depreciate more slowly and firmware updates extend useful life. This improves the TCO case for on-prem.
-
The biggest hidden cost is not hardware, it is hiring. Finding engineers who can operate bare metal Kubernetes (Linux, networking, storage, hardware troubleshooting) takes 3-6 months and costs $20K-40K in recruiting fees per hire. Many organizations underestimate this lead time.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Problem | Solution |
|---|---|---|
| Hardware-only budgeting | Actual cost is 3-5x hardware | Use the full cost taxonomy |
| Forgetting PUE | Power bill is 1.2-1.6x raw IT power | Multiply power by facility PUE |
| Linear staffing model | Assuming 1 engineer per 50 nodes always | Automation reduces the ratio dramatically |
| Ignoring opportunity cost | Engineers maintaining infra not building product | Include in decision, even if hard to quantify |
| Not negotiating colo | Taking the first quote from the colo provider | Get 3+ quotes; negotiate multi-year commits for 15-25% discounts |
| Comparing list prices | Cloud list vs on-prem actual | Use your real cloud bill; apply RI/CUD discounts |
| No hardware buffer budget | First year always has surprise needs | Add 15-20% contingency to CapEx |
| Forgetting decommission costs | Old hardware needs disposal, data wiping | Budget for certified data destruction and e-waste recycling |
Question 1
Section titled “Question 1”Your on-prem cluster uses 5 kW of IT power in a colocation facility with PUE 1.5. Electricity costs $0.12/kWh. What is the total monthly power cost?
Answer
$657/month.
Calculation:
- IT power: 5 kW
- Total facility power: 5 kW × PUE 1.5 = 7.5 kW
- Hours per month: 730 (24 × 365/12)
- Monthly cost: 7.5 kW × 730 hours × $0.12/kWh = $657
The PUE multiplier accounts for cooling, lighting, UPS losses, and other facility overhead. A PUE of 1.5 means for every watt of IT power, you pay for 1.5 watts total.
Question 2
Section titled “Question 2”You need 3 infrastructure engineers ($175K fully loaded each) to manage 100 nodes. At what node count does this staffing model break even against cloud at $500/node/month?
Answer
Staffing alone costs $525K/year = $43,750/month. At $500/node/month cloud cost:
- Staffing per node: $43,750 / N nodes per month
- At 100 nodes: $437.50/node/month in staffing alone (88% of cloud cost before adding hardware, colo, power)
- At 200 nodes: $218.75/node/month in staffing (with same 3 engineers + automation)
- At 300 nodes: $145.83/node/month in staffing
With automation, 3 engineers can manage 200-300 nodes. The staffing cost per node drops to $145-218, leaving room for hardware ($100-200/node/month amortized) and facility ($50-100/node/month) while still being cheaper than $500/node/month cloud.
Breakeven: approximately 200 nodes where staffing + hardware + facility < cloud cost.
Question 3
Section titled “Question 3”A vendor offers “free” Kubernetes distribution but charges $2,000/node/year for enterprise support. Is this a good deal for a 100-node cluster?
Answer
Probably not. $2,000/node/year × 100 nodes = $200,000/year — almost the cost of an additional engineer.
Compare:
- Vanilla Kubernetes (free) + 1 additional engineer ($175K/year): more flexible, builds internal expertise
- Enterprise K8s ($200K/year): less operational burden but vendor lock-in, annual renewal pressure
The calculation changes if:
- You have < 2 K8s engineers (vendor support covers the gap)
- You need SLA-backed support for regulatory compliance
- The distribution includes features you would otherwise build (multi-tenancy, security hardening)
For most organizations > 50 nodes, investing in team capability is better long-term than paying per-node license fees. The exception is highly regulated environments where vendor support contracts are a compliance requirement.
Question 4
Section titled “Question 4”Should you own your datacenter or use colocation?
Answer
Colocation for almost everyone. Building a datacenter costs $10-50 million and takes 12-24 months. It only makes sense at massive scale (1,000+ servers) or for specific regulatory requirements (government, defense).
Colocation advantages:
- No construction/permitting costs
- Shared cooling, power, physical security
- Redundant utility feeds and generators included
- Flexible scaling (add/remove racks)
- Multiple colo providers per market (avoid vendor lock)
Own datacenter advantages:
- Full physical control (government/classified requirements)
- Lower per-kW cost at massive scale
- Custom cooling (liquid cooling for high-density GPU racks)
- No monthly rent
Rule of thumb: if you need < 50 racks, colocation wins. If you need 200+ racks and have a 10+ year horizon, evaluate building.
Hands-On Exercise: Build a Complete TCO Model
Section titled “Hands-On Exercise: Build a Complete TCO Model”Task: Create a 3-year TCO model for your organization’s on-premises Kubernetes deployment and compare to cloud.
-
Define your requirements:
- Number of worker nodes
- CPU and RAM per node
- Storage requirements (persistent + ephemeral)
- GPU requirements (if any)
- Number of clusters
- Number of datacenters/sites
-
Calculate CapEx using the cost taxonomy above
-
Calculate annual OpEx including all hidden costs
-
Calculate 3-year cloud alternative using:
- Your cloud provider’s pricing calculator
- Apply realistic discounts (RI/CUD: 30-60%)
- Include data transfer, managed K8s fees, storage
-
Compare and document the breakeven point
Success Criteria
Section titled “Success Criteria”- All CapEx categories populated
- All OpEx categories populated (including staffing)
- Power calculated with PUE
- Cloud comparison uses discounted pricing (not list)
- Breakeven point identified
- Sensitivity analysis: what if staffing costs +20%? What if cloud pricing -30%?
Key Takeaways
Section titled “Key Takeaways”- Hardware is 20-30% of total cost — staffing, facility, and operations dominate
- PUE multiplies your power bill — always ask for the facility’s PUE
- Automation determines your staffing ratio — invest in Cluster API and GitOps early
- Breakeven is typically 150-250 nodes — below that, cloud is simpler and often cheaper
- Get 3 quotes — for colo, hardware, and support contracts. Never take the first offer.
Next Module
Section titled “Next Module”Continue to Module 2.1: Datacenter Fundamentals to learn the physical infrastructure that supports your Kubernetes cluster.