Перейти до вмісту

Module 1.5: Scaling Platform Organizations

Цей контент ще не доступний вашою мовою.

Discipline Module | Complexity: [ADVANCED] | Time: 55-65 min

Before starting this module:


After completing this module, you will be able to:

  • Design organizational structures that scale platform teams from 5 to 50+ engineers
  • Implement governance models that balance platform consistency with team autonomy at scale
  • Build platform operating models that work across business units, regions, and cloud environments
  • Evaluate when to centralize versus federate platform capabilities as the organization grows

A payments company had one of the best platform teams in fintech. Five engineers who built and maintained a deployment platform, a monitoring stack, and a self-service database provisioning system. Developers loved them. Adoption was 95%. The team was a case study in doing platform engineering right.

Then the company grew from 80 developers to 350 in 18 months.

The platform team went from heroes to bottleneck overnight. Every new team needed onboarding. Every product launch required platform support. The backlog grew from 15 items to 120. Response time for support requests went from hours to weeks. The five engineers were working 60-hour weeks and still falling behind.

The company’s response was to hire more platform engineers. They went from 5 to 25 in a year. But 25 people cannot operate as one team. Communication broke down. Three different engineers built overlapping solutions to the same problem. Standards diverged. The platform that was elegant at 5 people became chaotic at 25.

Scaling a platform organization is not just hiring. It requires deliberate structural decisions about governance, ownership, cost allocation, and team boundaries. Get these wrong and you create a bureaucratic mess that is worse than having no platform at all. This module teaches you how to scale without losing what made your platform team effective in the first place.


Dunbar’s number (roughly 150) is the cognitive limit to the number of people with whom one can maintain stable social relationships. When a platform organization exceeds ~8-12 people, informal coordination breaks down and you need explicit processes. Most platform scaling problems happen at exactly this inflection point.

Amazon’s “two-pizza team” rule was not about team size — it was about ownership scope. Jeff Bezos’ insight was that small teams with clear ownership make better decisions and move faster than large teams with shared ownership. The full case study lives in Building Platform Teams — this principle applies directly to platform sub-teams.

According to the 2024 Puppet State of Platform Engineering report, organizations with mature platform governance (clear SLOs, documented standards, explicit decision-making processes) scale their platform teams 2.4x more efficiently than organizations that scale by headcount alone.

Google’s infrastructure division employs thousands of engineers, yet maintains consistency through a combination of automated enforcement, design reviews, and a strong culture of documentation. They do not achieve consistency through management oversight — they achieve it through systems.


Stop and think: Think about the last time your organization added headcount to solve a delivery bottleneck. Did the new hires proportionally increase delivery speed, or did communication overhead eat into the gains?

flowchart TD
subgraph "PLATFORM TEAM"
direction LR
A["Eng A"]
B["Eng B"]
C["Eng C"]
end
Note["Everyone does everything. <br/> Communication is informal. <br/> Knowledge is shared by sitting next to each other."]
style Note fill:transparent,stroke:none

Characteristics:

  • Everyone knows everything
  • No formal processes needed
  • Fast decision-making
  • High bus-factor risk
  • Works for 20-80 developers

What to focus on: Build the core platform. Validate product-market fit with internal users. Do not worry about governance or structure.

Stage 2: The Specialized Team (5-12 people)

Section titled “Stage 2: The Specialized Team (5-12 people)”
flowchart TD
subgraph "PLATFORM TEAM"
subgraph "Infra/K8s Engineers"
A["Eng A"]
B["Eng B"]
end
subgraph "CI/CD & Developer Experience"
C["Eng C"]
D["Eng D"]
end
TL["+ Tech Lead / Manager"]
PM["+ Product Manager (part-time)"]
end
Note["Informal specialization emerges. <br/> Communication requires effort. <br/> Documentation becomes necessary."]
style Note fill:transparent,stroke:none

Characteristics:

  • Engineers develop specializations
  • You need a tech lead or manager
  • Documentation becomes necessary (not everyone knows everything)
  • Weekly team meetings become important
  • Works for 80-200 developers

What to focus on: Formalize ownership areas. Start documenting architecture decisions. Introduce lightweight processes (standups, architecture reviews). Hire or designate a product manager.

Stage 3: The Multi-Team Organization (12-30 people)

Section titled “Stage 3: The Multi-Team Organization (12-30 people)”
flowchart TD
subgraph "PLATFORM ORGANIZATION"
subgraph T1 ["Infrastructure Team (5-7 people)"]
I1["• Kubernetes <br> • Networking <br> • Storage"]
end
subgraph T2 ["Developer Experience Team (5-7 people)"]
D1["• CI/CD <br> • Portal <br> • Templates"]
end
subgraph T3 ["Data Platform Team (4-6 people)"]
DP1["• Databases <br> • Streaming <br> • Analytics"]
end
subgraph T4 ["Security & Compliance Team (3-5 people)"]
S1["• IAM <br> • Policies <br> • Scanning"]
end
Roles["+ Director of Platform Engineering <br> + Platform Product Manager <br> + Platform Architecture (shared role)"]
end

Characteristics:

  • Multiple teams with distinct ownership
  • You need a director (manages managers, sets strategy)
  • Cross-team coordination becomes a challenge
  • Standards and governance are essential
  • Works for 200-500 developers

What to focus on: Clear team charters and ownership boundaries. Explicit API contracts between platform teams. Architecture reviews for cross-cutting concerns. Shared on-call and incident response.

Stage 4: The Platform Division (30+ people)

Section titled “Stage 4: The Platform Division (30+ people)”
flowchart TD
VP["VP of Platform Engineering"]
PL["Platform Product Lead"]
PA["Platform Architecture Team"]
VP --- PL
VP --- PA
subgraph "PLATFORM DIVISION"
subgraph IG ["Infrastructure Group"]
direction TB
K8sT["K8s Team"]
NetT["Net Team"]
CloudT["Cloud Team"]
end
subgraph DXG ["Developer Experience Group"]
direction TB
CICDT["CI/CD Team"]
PortalT["Portal Team"]
DXT["DX Team"]
end
subgraph SPG ["Specialized Platforms Group"]
direction TB
DataT["Data Team"]
MLT["ML Team"]
SecT["Sec Team"]
end
end

Characteristics:

  • Multiple layers of management
  • Dedicated architecture and product functions
  • Significant coordination overhead
  • Governance and standards are critical
  • Works for 500+ developers

What to focus on: Federated governance (see below). Clear escalation paths. Investment in platform-for-the-platform (internal tooling for your own team). Preventing bureaucracy.


Federated vs Centralized Platform Governance

Section titled “Federated vs Centralized Platform Governance”

Pause and predict: If a centralized board has to approve every infrastructure change, what happens to the lead time for new applications? How might developers attempt to bypass this process?

ModelDescriptionProsCons
CentralizedOne team sets all standards and makes all decisionsConsistent, efficientSlow, disconnected from users
FederatedStandards set centrally, decisions made locallyBalanced speed and consistencyRequires coordination
DecentralizedEach team makes its own decisionsFast, autonomousInconsistent, duplicated effort
Section titled “The Federated Model (Recommended at Scale)”
flowchart TD
subgraph "CENTRAL GOVERNANCE"
direction TB
C1["Owns:<br>• Architecture standards (RFCs, ADRs)<br>• Security policies (mandatory guardrails)<br>• Technology radar (approved, trial, hold)<br>• Shared infrastructure (K8s clusters, networking)<br>• Cost governance (budgets, chargeback rules)<br>• Platform SLOs (availability targets)"]
C2["Decides: 'What are the boundaries?'"]
end
C2 --> TA
C2 --> TB
C2 --> TC
subgraph TA ["Team A"]
A1["Owns:<br>• Their domain<br>• Their roadmap<br>• Their tooling choices (within bounds)"]
A2["Decides: 'How do we build within boundaries?'"]
end
subgraph TB ["Team B"]
B1["Owns:<br>• Their domain<br>• Their roadmap<br>• Their tooling choices (within bounds)"]
B2["Decides: 'How do we build within boundaries?'"]
end
subgraph TC ["Team C"]
C1_["Owns:<br>• Their domain<br>• Their roadmap<br>• Their tooling choices (within bounds)"]
C2_["Decides: 'How do we build within boundaries?'"]
end
DecisionCentralizeFederateWhy
Kubernetes versionYesSecurity, compatibility
Monitoring stackYesConsistent observability
CI/CD toolYesShared investment
Programming language for platform toolsYesTeams know their domain best
API design standardsYesInteroperability
Feature prioritizationYesTeams know their users best
Cost budgetsYes (total)Yes (per-team allocation)Central budget, local spending
Incident response processYesConsistency in crisis
Hiring criteriaYes (framework)Yes (specific skills)Consistent bar, diverse needs

A technology radar classifies tools and technologies into four categories:

flowchart TD
subgraph "TECHNOLOGY RADAR"
direction TB
subgraph Adopt ["ADOPT — Use in production"]
A1["• ArgoCD &nbsp;&nbsp; • Kyverno <br> • Terraform &nbsp;&nbsp; • Backstage <br> • Prometheus &nbsp;&nbsp; • GitHub Actions"]
end
subgraph Trial ["TRIAL — Evaluate in pilot"]
T1["• Crossplane &nbsp;&nbsp; • OpenTofu <br> • Cilium &nbsp;&nbsp; • Score"]
end
subgraph Assess ["ASSESS — Research, no production"]
AS1["• Wasm on K8s &nbsp;&nbsp; • Radius <br> • eBPF mesh &nbsp;&nbsp; • KCP"]
end
subgraph Hold ["HOLD — Do not adopt"]
H1["• Jenkins &nbsp;&nbsp; • Helm v2 <br> • Custom CRDs (without review) <br> • Docker Swarm"]
end
Info["Updated: Quarterly <br> Decided by: Architecture review board"]
end

Without SLOs, platform reliability is invisible until it is terrible. With SLOs, you can:

  • Prove your platform is reliable (data, not feelings)
  • Prioritize reliability work based on budget burn
  • Set expectations with development teams
  • Justify investment in reliability engineering
Platform ServiceSLI (what you measure)SLO (target)
Kubernetes APIAPI request success rate99.95%
CI/CD pipelinesPipeline success rate (infrastructure-caused failures)99.9%
Container registryImage pull success rate99.95%
Deployment systemDeployment success rate99.9%
Developer portalPortal availability99.9%
Secret managementVault API availability99.99%
DNSDNS resolution success rate99.99%
Monitoring stackMetric ingestion success rate99.9%
SLO (Service Level Objective)SLA (Service Level Agreement)
WhatInternal reliability targetFormal commitment (sometimes contractual)
AudiencePlatform teamDevelopment teams, leadership
Consequence of missPrioritize reliability workRemediation plan, possible escalation
TrackingAutomated dashboardsMonthly or quarterly reviews
FlexibilityCan adjust based on learningChanges require negotiation

If your platform is large enough (serving 100+ developers), formalize an internal SLA:

PLATFORM TEAM INTERNAL SLA

Coverage: Monday-Friday, 9 AM - 6 PM [timezone] Emergency: 24/7 for P1 incidents

Response Times:

  • P1 (Platform outage): 15 minutes
  • P2 (Degraded performance): 2 hours
  • P3 (Feature request): 5 business days
  • P4 (General inquiry): 10 business days

Availability Targets:

  • Core platform (K8s, CI/CD, monitoring): 99.9%
  • Developer portal: 99.5%
  • Non-production environments: 99%

Maintenance Windows:

  • Planned: Tuesday 2-4 AM [timezone], with 48h notice
  • Emergency: As needed, with best-effort notice

Escalation:

  • Level 1: #platform-support Slack channel
  • Level 2: Platform on-call (PagerDuty)
  • Level 3: Platform engineering manager

Stop and think: If the platform is entirely free for developers, but highly available and infinitely scalable, what incentive do they have to write efficient code or clean up abandoned environments?

Without cost allocation:

  • Platform costs are a black hole in the budget
  • No incentive for teams to use resources efficiently
  • Platform team cannot prove ROI
  • Leadership sees platform as “just expense”

With cost allocation:

  • Teams understand the cost of their infrastructure
  • Natural incentive to optimize
  • Platform team can demonstrate value vs cost
  • Budget conversations are data-driven
ModelDescriptionProsCons
Central fundingPlatform costs in one budgetSimple, no per-team trackingNo cost awareness, hard to justify
ShowbackShow teams their costs, don’t charge themAwareness without frictionNo financial incentive to optimize
ChargebackCharge teams for their actual usageStrong optimization incentiveComplex to implement, creates friction
HybridBase platform centrally funded, usage-based extras chargedBalanced incentivesMedium complexity
flowchart TD
subgraph "COST ALLOCATION MODEL"
direction TB
subgraph Central ["CENTRALLY FUNDED (platform tax)"]
C1["• Kubernetes control plane <br> • Shared monitoring and logging <br> • CI/CD infrastructure <br> • Developer portal <br> • Platform team salaries <br> • Base security and compliance tooling"]
end
subgraph Charged ["CHARGED TO TEAMS (usage-based)"]
C2["• Compute resources (CPU, memory) <br> • Storage (persistent volumes, object storage) <br> • Data transfer (cross-region, egress) <br> • Specialized services (GPU instances, databases) <br> • Non-production environments beyond standard"]
end
subgraph Visibility ["COST VISIBILITY"]
C3["Every team gets a monthly cost report: <br> • Total cost <br> • Cost per service <br> • Month-over-month trend <br> • Optimization recommendations <br> • Comparison with similar teams"]
end
end

Cost allocation requires tagging resources to teams. Enforce this with automated policies:

# Example: Kyverno policy to enforce labels
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-cost-labels
spec:
validationFailureAction: Enforce
rules:
- name: check-labels
match:
any:
- resources:
kinds:
- Deployment
- StatefulSet
validate:
message: "All workloads must have team, environment, service, and cost-center labels."
pattern:
metadata:
labels:
team: "?*"
environment: "?*"
service: "?*"
cost-center: "?*"

Track these metrics monthly or quarterly:

CategoryMetricTargetCurrent
Adoption% of teams on platform> 80%___
Adoption% of deploys via platform> 90%___
SpeedTime to first deploy (new service)< 2 hours___
SpeedLead time for changes< 1 day___
ReliabilityPlatform availability> 99.9%___
ReliabilityPlatform-caused incidents< 2/month___
SatisfactionDeveloper NPS> 40___
SatisfactionSupport satisfaction> 4/5___
EfficiencySelf-service ratio> 90%___
EfficiencySupport ticket volume (trend)Decreasing___
CostPlatform cost per developer< $X/month___
CostInfrastructure cost savings> $Y/quarter___

In addition to user-facing metrics, track team health:

MetricHealthyWarningCritical
Sprint velocityStable or increasingDecliningErratic
On-call load< 2 pages/week3-5 pages/week> 5 pages/week
Toil ratio< 30%30-50%> 50%
Attrition< 10%/year10-20%/year> 20%/year
Bus factor3+ per system2 per system1 per system
Tech debt ratio< 20% of work20-40%> 40%

Pause and predict: If your platform team decides to build a custom internal CI/CD runner to perfectly match your environment, what are the hidden long-term costs that will emerge two years from now?

Every platform capability faces this question: do we build it ourselves, buy a commercial product, or partner with an open-source community?

FactorBuildBuyPartner (Open Source)
ControlFullLimitedMedium (contribute upstream)
Cost (upfront)High (engineering time)Medium (license fee)Low (free + integration time)
Cost (ongoing)High (maintenance)Medium (subscription)Medium (upgrades, customization)
Time to valueMonthsWeeksWeeks to months
CustomizationUnlimitedVendor roadmap dependentFork-and-extend (risky)
TalentMust hire specialistsVendor provides supportCommunity knowledge
RiskMaintenance burden, bus factorVendor lock-in, price increasesProject abandonment, complexity

Build when:

  • Your requirements are truly unique (rare — most infrastructure is not unique)
  • The capability is a core differentiator for your platform
  • You have the talent to build AND maintain it long-term
  • The commercial and open-source options genuinely do not fit

Warning signs you should NOT build:

  • “We could build a better version” (you probably cannot maintain it)
  • “The vendor is too expensive” (compare total cost including engineering time)
  • “We need full control” (you rarely actually do)
  • “It’ll be a fun project” (this is not a valid business reason)

Buy when:

  • The capability is well-served by commercial products
  • Your team should focus on higher-value work
  • You need support and SLAs
  • Time to value is critical

Warning signs you should NOT buy:

  • The vendor requires deep integration that creates lock-in
  • The product requires extensive customization to fit your needs
  • The vendor’s roadmap does not align with your direction
  • The total cost (license + integration + customization) exceeds building

Partner with open source when:

  • Active community with multiple corporate backers
  • Extensible architecture that supports your customization
  • You can contribute upstream (gives you influence over direction)
  • You have talent to integrate and maintain

Warning signs:

  • Single-company-backed project (risk of abandonment or relicensing)
  • Small community (limited support, slow bug fixes)
  • Frequent breaking changes (high maintenance burden)
  • License changes (recent examples: HashiCorp BSL, Redis SSPL)

For each capability, score these factors (1-5):

Capability: _______________
Build Score:
Team has expertise: ___/5
Requirements are unique: ___/5
Long-term maintenance ok: ___/5
Time to build acceptable: ___/5
Total: ___/20
Buy Score:
Good product exists: ___/5
Budget available: ___/5
Integration effort low: ___/5
Vendor reliable: ___/5
Total: ___/20
Open Source Score:
Good project exists: ___/5
Active community: ___/5
License acceptable: ___/5
Team can contribute: ___/5
Total: ___/20
Recommendation: Highest score → [ ] Build [ ] Buy [ ] Open Source

Exercise 1: Platform Organization Design (45 min)

Section titled “Exercise 1: Platform Organization Design (45 min)”

Design the structure for a platform organization at different scales:

Scenario A: 100 developers, 5 platform engineers

Team structure:
Specializations:
Communication cadence:
Key risk:

Scenario B: 300 developers, 15 platform engineers

Team structure:
Number of sub-teams:
Team charters:
Governance model:
Key risk:

Scenario C: 800 developers, 40 platform engineers

Organizational structure:
Groups and teams:
Governance bodies:
Central vs federated decisions:
Key risk:

For each scenario, answer:

  1. What is the biggest organizational risk at this scale?
  2. What governance mechanisms are needed?
  3. How do teams communicate across boundaries?
  4. What is the role of the platform leader?

Exercise 2: Platform SLO Workshop (40 min)

Section titled “Exercise 2: Platform SLO Workshop (40 min)”

Define SLOs for your platform services:

Step 1: List your platform services

1. _______________
2. _______________
3. _______________
4. _______________
5. _______________

Step 2: For each service, define SLIs and SLOs

Service: _______________
SLI (what you measure): _______________
SLO (target): _______________
Measurement method: _______________
Error budget (per month): _______________
Error budget policy: _______________
If < 50% budget remaining: _______________
If budget exhausted: _______________

Step 3: Create an SLO dashboard layout

flowchart TD
subgraph "PLATFORM SLO DASHBOARD"
direction TB
S1["Service 1: _______________<br>SLO: ____% Current: ____% Budget: ___<br>[────────────────█─────] 30-day view"]
S2["Service 2: _______________<br>SLO: ____% Current: ____% Budget: ___<br>[────────────────█─────] 30-day view"]
end

Exercise 3: Build vs Buy Analysis (30 min)

Section titled “Exercise 3: Build vs Buy Analysis (30 min)”

Evaluate a platform capability using the decision matrix:

Choose one: Service mesh, developer portal, CI/CD, monitoring, secrets management, database provisioning

Capability: _______________
Build:
Effort: ___ person-months
Maintenance: ___ people ongoing
Risk: _______________
Score: ___/20
Buy:
Product: _______________
Cost: $___ /year
Integration: ___ weeks
Risk: _______________
Score: ___/20
Open Source:
Project: _______________
Integration: ___ weeks
Customization: ___ weeks
Risk: _______________
Score: ___/20
Recommendation: _______________
Justification: _______________

Exercise 4: Cost Allocation Model Design (30 min)

Section titled “Exercise 4: Cost Allocation Model Design (30 min)”

Design a cost allocation model for your platform:

COST ALLOCATION MODEL
---------------------
Centrally funded items:
1. _______________ ($___/month)
2. _______________ ($___/month)
3. _______________ ($___/month)
Total central: $___/month
Charged to teams:
1. _______________ (per ___)
2. _______________ (per ___)
3. _______________ (per ___)
Monthly team report includes:
[ ] Total cost
[ ] Cost breakdown by service
[ ] Trend vs previous month
[ ] Optimization recommendations
[ ] Comparison with similar teams
Implementation plan:
Month 1: _______________
Month 2: _______________
Month 3: _______________

War Story: Scaling From 5 to 50 Platform Engineers

Section titled “War Story: Scaling From 5 to 50 Platform Engineers”

Company: Large SaaS company, ~1,200 engineers, publicly traded

Situation: The original platform team of 5 engineers was legendary within the company. They had built a deployment platform that developers loved, achieving 92% voluntary adoption in 2 years. They were fast, responsive, and had deep relationships with every development team.

Then the company decided to expand the platform to cover data infrastructure, ML pipelines, and security tooling. The plan: grow from 5 to 50 platform engineers over 18 months.

Timeline:

Month 1-3: Rapid hiring (5 to 15)

  • Hired 10 engineers in 3 months
  • All joined the “platform team” as one group
  • Slack channel went from 5 people to 15
  • Standup went from 10 minutes to 35 minutes
  • Decision-making slowed dramatically

Month 4-6: First split (15 to 25)

  • Split into 3 teams: Infrastructure, Developer Experience, Data Platform
  • Each team had a tech lead but shared a manager
  • Immediately hit coordination problems: Infrastructure team changed a Kubernetes configuration that broke Data Platform’s pipelines
  • No API contracts between teams

Month 7-9: Growing pains (25 to 35)

  • Created “Platform Architecture” role to coordinate across teams
  • Introduced bi-weekly architecture reviews
  • Established a technology radar (Adopt, Trial, Assess, Hold)
  • Hired a platform product manager
  • Defined SLOs for all platform services
  • Added a Security team

Month 10-12: Governance (35 to 45)

  • Created a “Platform RFC” process for major changes
  • Introduced cost allocation (showback model)
  • Published an internal SLA
  • Created a “Platform Council” — monthly meeting of all team leads
  • Started measuring platform scorecard

Month 13-18: Maturity (45 to 50)

  • Promoted the original tech lead to Director of Platform Engineering
  • Each team had its own manager, tech lead, and roadmap
  • Federated governance: central architecture standards, local team decisions
  • Platform scorecard reviewed monthly with VP of Engineering

What worked:

  • Splitting teams early (at 12-15 people, not 25)
  • Hiring a platform architect to own cross-cutting concerns
  • Technology radar prevented tool sprawl
  • SLOs made reliability a measurable commitment
  • Cost allocation made platform costs visible and defensible

What they would do differently:

  • Split teams at 8-10, not 15 (the first split was too late)
  • Establish API contracts between teams before the first split
  • Hire a product manager earlier (should have been hire #6, not #20)
  • Start with federated governance, not centralized (the transition was painful)

Business impact: The platform organization reduced time-to-production from 3 weeks to 2 days, reduced infrastructure incidents by 65%, and saved an estimated 4M/yearinengineeringproductivity.Thecostoftheplatformorg:4M/year in engineering productivity. The cost of the platform org: 8M/year in salaries. ROI: positive within the first year.

Lessons:

  1. Split early: 8-10 people is the inflection point, not 15-20
  2. Define boundaries before you need them: API contracts, ownership, and governance should precede team growth
  3. Product management is not optional: A 50-person engineering org without product management builds impressive things nobody asked for
  4. Governance scales with headcount: What works at 5 people (informal, trust-based) fails at 50 (need explicit processes)
  5. Measure ruthlessly: Without the platform scorecard, leadership would have questioned a $8M/year investment

Scenario: Your company’s internal platform team recently grew from 6 to 11 engineers. Lately, developers have complained about inconsistent answers from different platform engineers, and a recent incident occurred because two engineers made conflicting changes to the Kubernetes cluster without realizing it. What organizational threshold has this team crossed, and why are these specific issues emerging now?

Show Answer

The team has crossed the 8-12 person threshold, where informal coordination mechanisms typically break down. At this size, it is no longer possible for everyone to know what everyone else is doing simply by sitting in the same room or sharing a single Slack channel. Because informal communication is failing, engineers are making local decisions that conflict with others, leading to incidents and inconsistent support. To resolve this, the team must introduce explicit team charters, formal documentation, and regular cross-team coordination processes to replace reliance on informal knowledge sharing. By establishing these formal boundaries and practices, the team can scale its operations without sacrificing reliability or velocity.

Scenario: You are the Director of Platform Engineering for an organization with 400 developers. Currently, your architecture review board must approve every tool choice and implementation detail for the four platform sub-teams. This is causing massive bottlenecks, yet you fear that relaxing these rules will lead to a chaotic, fragmented platform. Which governance model should you adopt to resolve this tension, and how does it function?

Show Answer

You should adopt a federated governance model. Federated governance works by centralizing the definition of broad standards, security policies, and boundaries, while delegating the actual implementation decisions to individual teams. This model is preferred at scale because a centralized board cannot make decisions fast enough for multiple teams without becoming a severe bottleneck. By giving teams autonomy within strict, centrally defined boundaries, you maintain overall platform consistency while preserving the speed and ownership that engineers need to remain effective.

Scenario: The CFO has mandated that the platform organization (now serving 500 developers) must implement a strict, 100% chargeback model for all platform services, including CI/CD usage, developer portal access, and the platform team’s salaries. As the platform lead, you are concerned this will drive teams away from the platform. What alternative cost model should you propose, and why is it superior for platform adoption?

Show Answer

You should propose a hybrid cost allocation model. In a hybrid model, base platform capabilities (like the Kubernetes control plane, CI/CD, and platform team salaries) are centrally funded as a “platform tax,” while only variable, usage-based resources (like compute, storage, and specialized databases) are charged back to individual teams. This is recommended over pure chargeback because pure chargeback penalizes teams for adopting shared infrastructure, creating perverse incentives where teams might try to build their own cheaper, less secure solutions. The hybrid model aligns financial incentives correctly: it encourages efficient resource usage through variable chargebacks while removing financial friction from adopting the core platform standards.

Scenario: Your developer experience team wants to build a custom internal developer portal from scratch. They argue that the open-source Backstage project is too complex to configure, and commercial offerings are expensive. They estimate it will take three engineers about four months to build a “perfectly tailored” solution. Based on the build vs. buy framework, why is this proposal likely a mistake, and what criteria should truly justify building a capability internally?

Show Answer

This proposal is likely a mistake because the team is underestimating the long-term maintenance burden and the opportunity cost of dedicating three engineers to a non-differentiating tool. You should only build a capability internally when your requirements are genuinely unique, it serves as a core competitive differentiator, and you have the dedicated talent to maintain it indefinitely. Most organizations’ needs for developer portals, CI/CD, or monitoring are not unique enough to justify the massive ongoing cost of internal development. Unless the total cost of ownership—including years of future maintenance and lost productivity on other projects—is demonstrably lower, the organization should buy or adopt an existing solution.

Scenario: Your platform organization recently reorganized into four specialized sub-teams (Infrastructure, DX, Data, Security) with a total of 30 engineers. Three months later, you discover that the Infrastructure team and the Security team have both spent weeks building separate, incompatible systems for managing Kubernetes secrets. What organizational failure caused this duplication, and what specific steps must you take to resolve and prevent it?

Show Answer

This duplication occurred because the organization lacks explicit team charters and clear ownership boundaries. When teams do not have strictly defined domains, they naturally gravitate toward interesting problems that overlap with other teams’ implicit responsibilities. Because infrastructure and security domains frequently intersect—especially around secrets management—the absence of a clear boundary guaranteed overlapping work. To fix this, leadership must immediately define and document explicit boundaries for each team’s scope. Furthermore, you must implement an RFC (Request for Comments) process or a regular cross-team architecture review so that all new capabilities are proposed, discussed, and assigned to the correct team before any engineering work begins.

Scenario: The platform team recently adopted the same SLO framework used by the customer-facing product teams. However, when the shared Kubernetes cluster experienced a 10-minute API outage, the platform team declared it “within budget” because their SLO is 99.5%. Meanwhile, three different product teams missed their own 99.9% SLOs because they couldn’t deploy critical hotfixes during that window. Why did this happen, and how must platform SLOs be designed differently from standard application SLOs?

Show Answer

This incident occurred because the platform team set their reliability targets lower than the targets of the applications depending on them. Platform SLOs differ from application SLOs because platform services are foundational dependencies; their blast radius affects multiple teams simultaneously. Therefore, a platform’s SLO must be mathematically stricter (e.g., 99.95% or higher) than the tightest SLO of any application running on top of it. Additionally, platform SLOs measure the infrastructure’s internal availability to developers (like deployment success rates or Kubernetes API uptime) rather than the end-user experience. Setting appropriately high reliability targets ensures that product teams always have the dependable foundation they need to meet their own customer commitments.

Scenario: Your infrastructure team wants to adopt a highly capable open-source infrastructure-as-code tool. The tool is wildly popular, but you notice that 95% of the commits come from employees of a single startup that holds the trademark. Given recent trends in the open-source ecosystem, what specific risks does this dependency introduce, and how should you evaluate whether to proceed?

Show Answer

Relying on a project with a single corporate backer introduces severe risks regarding license changes and project abandonment. The backing company could unilaterally change the license (e.g., from an open-source license to a Business Source License or SSPL) to protect their commercial interests, potentially forcing your organization into unexpected licensing fees or complex migrations. Furthermore, if the startup pivots or is acquired, the project may be abruptly abandoned, leaving your infrastructure stranded without security updates. To mitigate this, you must evaluate the project’s governance model, looking for neutral foundation backing (like the CNCF or Apache) and a diverse maintainer base. You must also weigh the cost of potentially having to fork and maintain the code yourself if the corporate backer changes direction.

Scenario: Following a rough financial quarter, the VP of Engineering approaches you. They note that the platform division now costs $5 million annually in salaries and infrastructure. They ask, “Can we cut this team in half? Product teams can just manage their own deployments.” How do you systematically build a business case to defend the platform team’s ROI using both direct savings and productivity metrics?

Show Answer

To defend the platform team’s ROI, you must build a business case grounded in measurable impact rather than abstract activity. First, quantify direct savings by highlighting infrastructure cost optimizations the platform enforces, as well as the eliminated costs of duplicate licenses and redundant tools across the organization. Next, calculate productivity gains by demonstrating the reduction in onboarding time, the decrease in time-to-production for new services, and the hours saved by developers not having to firefight infrastructure incidents. You can also point to the avoidance of costly compliance and security incidents that a standardized platform prevents. Finally, format this argument as a clear financial equation: show how the $5 million investment directly enables millions more in recovered developer hours, avoided security breaches, and faster time-to-market, proving a net-positive return on investment.


Scaling a platform organization requires deliberate structural design at every growth stage. What works at 5 people — informal communication, shared ownership, trust-based coordination — breaks at 15. What works at 15 breaks at 50.

Key principles:

  • Split teams early: 8-10 people, not 15-20
  • Federate governance: Central standards, local decisions
  • Define SLOs: Make reliability measurable and visible
  • Allocate costs transparently: Hybrid model balances incentives
  • Measure everything: Platform scorecard proves value to leadership
  • Build vs buy honestly: Most infrastructure is not unique enough to build
  • Invest in governance before you need it: Team charters, RFCs, technology radar

Congratulations on completing the Platform Leadership discipline. You now have frameworks for building teams, designing developer experiences, running your platform as a product, driving adoption, and scaling your organization.

Recommended next steps:


“A platform organization’s job is to be invisible. When developers don’t think about infrastructure, you’ve succeeded.”