Module 1.4: Toil and Automation
Discipline Module | Complexity: [MEDIUM] | Time: 30-35 min
Prerequisites
Before starting this module:
- Required: Module 1.1: What is SRE? — Understanding SRE fundamentals
- Recommended: Systems Thinking Track — Understanding system leverage
- Helpful: Some experience with scripting or automation
What You’ll Be Able to Do
After completing this module, you will be able to:
- Evaluate operational tasks against the SRE toil taxonomy to prioritize automation
- Design automation strategies that eliminate the highest-impact toil first
- Implement self-healing systems that resolve common incidents without human intervention
- Measure toil reduction over time and build a business case for continued automation investment
Why This Module Matters
You’re drowning in repetitive work. Every day:
- Reset user passwords (again)
- Restart that flaky service (again)
- Run the same diagnostic commands (again)
- Provision resources manually (again)
This work keeps the lights on, but it’s eating your time. You rarely get to the projects that would make things better.
This is toil. And SRE has a systematic approach to eliminating it.
This module teaches you to identify toil, measure it, and eliminate it through automation — freeing you to work on things that actually matter.
What Is Toil?
Stop and think: Recall the most annoying task you performed this week. Did it require you to make a creative decision, or were you just acting as a human script runner? If it’s the latter, you were performing toil.
In the SRE sense, toil is operational work that tends to be:
- Manual: Requires human hands
- Repetitive: Done over and over
- Automatable: Could be done by machines
- Tactical: Reactive, not strategic
- Devoid of enduring value: Doesn’t leave the system permanently better
- Scales with load: More traffic = more toil
The Toil Test
Ask these questions about any task:
| Question | Does “Yes” point to toil? |
|---|---|
| Could a script do this? | ✓ |
| Do I do this frequently? | ✓ |
| Does it require human judgment? | ✗ (not toil) |
| Does it permanently improve things? | ✗ (not toil) |
| Does it scale linearly with growth? | ✓ |
| Is it the same every time? | ✓ |
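The checklist above can be turned into a rough score. The sketch below is one hypothetical way to do it: the `Task` fields, the `toil_score` function, and the equal weighting of the six questions are assumptions made for illustration, not part of any SRE standard.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Answers to the six Toil Test questions for one task (hypothetical model)."""
    name: str
    scriptable: bool             # Could a script do this?
    frequent: bool               # Do I do this frequently?
    needs_judgment: bool         # Does it require human judgment?
    permanent_improvement: bool  # Does it permanently improve things?
    scales_with_growth: bool     # Does it scale linearly with growth?
    same_every_time: bool        # Is it the same every time?


def toil_score(task: Task) -> int:
    """Count how many of the six answers point to toil (0 to 6)."""
    signals = [
        task.scriptable,
        task.frequent,
        not task.needs_judgment,         # "yes" here points away from toil
        not task.permanent_improvement,  # "yes" here points away from toil
        task.scales_with_growth,
        task.same_every_time,
    ]
    return sum(signals)


restart = Task("Restart flaky service", True, True, False, False, True, True)
postmortem = Task("Write postmortem", False, False, True, True, False, False)

print(restart.name, toil_score(restart))        # 6 of 6 signals
print(postmortem.name, toil_score(postmortem))  # 0 of 6 signals
```

A score near 6 suggests a strong automation candidate; a score near 0 suggests engineering or judgment work. Where to draw the line in between is a team decision.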
Toil vs. Not Toil
| Task | Is It Toil? | Why |
|---|---|---|
| Restarting pods manually | Yes | Repetitive, automatable |
| Responding to pages | Depends | Investigation isn’t, remediation might be |
| Writing postmortems | No | Requires judgment, creates permanent value |
| Provisioning users | Yes | Same steps each time, automatable |
| Capacity planning | No | Requires analysis and judgment |
| Running backups manually | Yes | Repetitive, should be automated |
| Designing new system | No | Creative, strategic work |
Overhead vs. Toil
Not all operational work is toil:
Overhead (necessary, not toil):
- Team meetings
- Code reviews
- Planning sessions
- Training and learning
- Interviews
Toil (should be eliminated):
- Manual deployments
- Repetitive tickets
- Manual scaling
- Routine restarts
- Manual monitoring checks
The 50% Rule
Pause and predict: If a team spends 90% of their time resolving tickets manually, what happens to the reliability of the system over the next year?
Google’s SRE teams have a hard rule: no more than 50% of each engineer’s time should go to toil.
The other 50%+ goes to engineering projects that:
- Reduce future toil
- Improve system reliability
- Build better tools
- Automate repetitive tasks
Why 50%?
- Less than 50%: Leaves enough engineering time for toil to be eliminated, not just managed
- More than 50%: The team becomes glorified ops, not engineers
- Way more: Indicates a systemic problem, either too much toil or too few people
When Toil Exceeds 50%
If your team’s toil exceeds 50%, something must change:
- Automate aggressively: Invest in eliminating top toil sources
- Push back on service: Service may need reliability improvements
- Create temporary capacity: Free time for the team to reduce operational load and automation debt
- Rebalance ownership: Revisit service ownership if the current arrangement is unsustainably toil-heavy
The 50% rule is a forcing function that ensures SRE remains an engineering discipline.
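As a minimal sketch of checking the 50% cap per engineer, the snippet below flags anyone whose logged toil exceeds the budget. The names, the 40-hour week, and the sample hours are all hypothetical; a real team would pull these numbers from whatever time tracking it already uses.

```python
TOIL_BUDGET = 0.50  # the 50% cap described above


def engineers_over_budget(toil_hours, weekly_hours=40.0, budget=TOIL_BUDGET):
    """Return {name: toil fraction} for engineers whose toil exceeds the budget."""
    if weekly_hours <= 0:
        raise ValueError("weekly_hours must be positive")
    flagged = {}
    for name, hours in toil_hours.items():
        fraction = hours / weekly_hours
        if fraction > budget:
            flagged[name] = fraction
    return flagged


# Hypothetical toil hours logged by three engineers in one week.
logged = {"alice": 28.0, "bob": 14.0, "carol": 22.0}
flagged = engineers_over_budget(logged)

for name, fraction in flagged.items():
    print(f"{name}: {fraction:.0%} toil, over the {TOIL_BUDGET:.0%} budget")
```

Checking per person rather than only per team helps surface the case where one engineer quietly absorbs most of the toil.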
Try This: Toil Audit
List your top 5 repetitive tasks this week:
Task 1: ________________ Time spent: ___ hours Frequency: ___/week Could be automated? Y/N
Task 2: ________________ Time spent: ___ hours Frequency: ___/week Could be automated? Y/N
Task 3: ________________ Time spent: ___ hours Frequency: ___/week Could be automated? Y/N
Task 4: ________________ Time spent: ___ hours Frequency: ___/week Could be automated? Y/N
Task 5: ________________ Time spent: ___ hours Frequency: ___/week Could be automated? Y/N
Total toil time: ___ hours/week
Percentage of work week: ___%
Measuring Toil
You can’t reduce what you don’t measure.
Toil Measurement Framework
```yaml
toil_tracking:
  categories:
    - name: "User management"
      tasks:
        - "Password resets"
        - "Account provisioning"
        - "Access revocations"

    - name: "Incident response"
      tasks:
        - "Alert investigation"
        - "Service restarts"
        - "Failover execution"

    - name: "Deployments"
      tasks:
        - "Manual deployment steps"
        - "Rollback execution"
        - "Config updates"

    - name: "Maintenance"
      tasks:
        - "Certificate renewals"
        - "Capacity adjustments"
        - "Backup verification"

  tracking_method:
    - Tool: Time tracking software
    - Cadence: Weekly review
    - Metrics:
        - Hours per category
        - Trend over time
        - Percentage of work
```
Key Metrics
| Metric | What It Tells You |
|---|---|
| Toil percentage | How much time goes to repetitive work |
| Toil per team member | Individual burden distribution |
| Toil trend | Is it getting better or worse? |
| Toil per incident | How much manual work per incident? |
| Time to automate | ROI of automation efforts |
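To make the "Toil trend" metric concrete, one option is a least-squares slope over the weekly toil percentages. The sketch below assumes you already have one percentage per week; the `trend_slope` helper and the five sample values are made up for illustration.

```python
from typing import Sequence


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of the values against their week index.

    The result is in percentage points per week: negative means toil
    is falling, positive means it is rising.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two weeks of data")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical toil percentages for five consecutive weeks.
weekly_toil_pct = [62.0, 58.0, 55.0, 49.0, 46.0]
slope = trend_slope(weekly_toil_pct)

direction = "falling" if slope < 0 else "rising or flat"
print(f"Toil is {direction}: {slope:+.1f} points per week")
```

A slope is less jumpy than comparing only the first and last week, which matters when a single incident-heavy week would otherwise dominate the picture.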
Simple Tracking
Start simple with a shared spreadsheet:
| Week | Category | Task | Time (hrs) | Automatable? |
|---|---|---|---|---|
| 1 | Users | Password resets | 2 | Yes |
| 1 | Deploy | Manual deploy | 4 | Yes |
| 1 | Incident | Service restart | 1 | Yes |
| 1 | Meetings | Team sync | 3 | No |
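If the spreadsheet is exported as CSV, a few lines of Python can roll it up. The sketch below reuses the four rows from the table above; the column names and the choice to count every "Automatable? = Yes" row as toil are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict

# The four sample rows from the table above, as a CSV export might look.
SHEET = """week,category,task,hours,automatable
1,Users,Password resets,2,Yes
1,Deploy,Manual deploy,4,Yes
1,Incident,Service restart,1,Yes
1,Meetings,Team sync,3,No
"""


def summarize(csv_text):
    """Return (hours per category, total hours marked automatable)."""
    hours_by_category = defaultdict(float)
    automatable_hours = 0.0
    for row in csv.DictReader(io.StringIO(csv_text)):
        hours = float(row["hours"])
        hours_by_category[row["category"]] += hours
        if row["automatable"].strip().lower() == "yes":
            automatable_hours += hours
    return dict(hours_by_category), automatable_hours


by_category, automatable = summarize(SHEET)
tracked = sum(by_category.values())

print(by_category)
print(f"Automatable share of tracked hours: {automatable / tracked:.0%}")
```

Note that the percentage here is relative to tracked hours only; dividing by the full work week gives the toil percentage used elsewhere in this module.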
Automation Strategies
Stop and think: Is it possible to over-automate? Think about a scenario where automating a task might actually be a bad idea.
The Automation Hierarchy
Not everything should be automated the same way:
```mermaid
flowchart TD
    L0["<b>Level 0: Manual</b><br/>Every execution requires human action<br/><i>Example: SSH in and restart service</i>"]
    L1["<b>Level 1: Documented</b><br/>Written procedure, still manual<br/><i>Example: Runbook with exact commands</i>"]
    L2["<b>Level 2: Semi-automated</b><br/>Script exists, human triggers it<br/><i>Example: ./restart_service.sh</i>"]
    L3["<b>Level 3: Auto-triggered</b><br/>System detects need, asks permission<br/><i>Example: 'Service unhealthy. Restart? [Y/n]'</i>"]
    L4["<b>Level 4: Fully automated</b><br/>System handles automatically<br/><i>Example: Kubernetes self-healing</i>"]
    L5["<b>Level 5: Self-optimizing</b><br/>System learns and improves<br/><i>Example: Auto-scaling based on patterns</i>"]

    L0 --> L1 --> L2 --> L3 --> L4 --> L5
```
When to Automate
The ROI calculation:
```text
Automation ROI = (Time saved per occurrence × Occurrences) - Development time

Example:
  Task: Manual deployment
  Time per deployment: 30 minutes
  Deployments per month: 40
  Automation development time: 20 hours

  Monthly savings: 40 × 0.5 hours = 20 hours
  Payback period: 20 hours / 20 hours per month = 1 month

Verdict: Definitely automate
```
XKCD’s classic chart helps decide:
| How often? | Time saved | Automation worth it if it takes… (savings summed over 5 years) |
|---|---|---|
| 50x/day | 5 min | Up to 9 months |
| Daily | 5 min | Up to 6 days |
| Weekly | 5 min | Up to 21 hours |
| Monthly | 30 min | Up to 1 day |
| Yearly | 1 hour | Up to 5 hours |
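The ROI formula and the manual-deployment example can be sketched in a few lines. The function names are hypothetical, and the model is deliberately simple: it ignores the ongoing cost of maintaining the automation, which a real estimate should include.

```python
def payback_months(hours_per_occurrence, occurrences_per_month, development_hours):
    """Months until the time saved equals the time spent building the automation."""
    monthly_savings = hours_per_occurrence * occurrences_per_month
    if monthly_savings <= 0:
        raise ValueError("the task must cost some time each month")
    return development_hours / monthly_savings


def net_hours_saved(hours_per_occurrence, occurrences_per_month,
                    development_hours, horizon_months):
    """Automation ROI over a horizon: total hours saved minus development time."""
    saved = hours_per_occurrence * occurrences_per_month * horizon_months
    return saved - development_hours


# Manual deployment example: 30 minutes each, 40 per month, 20 hours to automate.
print(payback_months(0.5, 40, 20))       # 1.0 month
print(net_hours_saved(0.5, 40, 20, 12))  # 220.0 hours over the first year
```

The horizon matters: the same automation looks far better over five years than over one quarter, so state the horizon whenever you quote an ROI figure.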
What to Automate First
Prioritize by combining frequency and complexity:
- High frequency, low complexity: Quick wins (Do these first)
- High frequency, high complexity: Big impact (Major engineering projects)
- Low frequency, low complexity: Steady progress (Good for onboarding or slow days)
- Low frequency, high complexity: Maybe don’t automate (Negative ROI)
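The four combinations above map directly onto a lookup table. In this sketch the task names and their high/low labels are invented examples; deciding what counts as "high" frequency or complexity is left to the team.

```python
# (high_frequency, high_complexity) -> (rank, label), following the list above.
PRIORITY = {
    (True, False): (1, "Quick win"),
    (True, True): (2, "Big impact"),
    (False, False): (3, "Steady progress"),
    (False, True): (4, "Maybe don't automate"),
}


def classify(high_frequency: bool, high_complexity: bool):
    """Return the (rank, label) pair for a task."""
    return PRIORITY[(high_frequency, high_complexity)]


# Hypothetical backlog: name -> (high_frequency, high_complexity).
tasks = {
    "Password resets": (True, False),
    "Multi-region failover": (False, True),
    "Manual deploys": (True, True),
    "Quarterly log cleanup": (False, False),
}

ranked = sorted(tasks, key=lambda name: classify(*tasks[name])[0])
for name in ranked:
    rank, label = classify(*tasks[name])
    print(f"{rank}. {name}: {label}")
```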
```mermaid
quadrantChart
    title Automation Priority
    x-axis "Low Frequency" --> "High Frequency"
    y-axis "Low Complexity" --> "High Complexity"
    quadrant-1 "2. Big impact"
    quadrant-2 "4. Maybe don't automate"
    quadrant-3 "3. Steady progress"
    quadrant-4 "1. Quick wins"
```
Did You Know?
- If incident remediation repeatedly requires manual steps, that’s a sign more automation is likely needed.
- In Google SRE writing, “toil” refers to repetitive operational work, not all operational work.
- Teams often track toil trends over time. An upward trend is a warning sign that current automation is not keeping up.
- XKCD’s “Is It Worth the Time?” chart offers a back-of-the-envelope way to think about automation ROI.
War Story: The Automation That Saved the Team
A team I worked with had a toil problem:
The Situation:
- A small SRE team
- Multiple critical services
- Most of the team’s time consumed by toil
- No time for improvements
- Team morale: terrible
The Toil Audit:
| Category | Hours/week | % of Time |
|---|---|---|
| Manual deploys | 40 | 28% |
| Incident response | 35 | 24% |
| User provisioning | 20 | 14% |
| Log investigation | 15 | 10% |
| Certificate mgmt | 8 | 5% |
| Total toil | 118 | 81% |
The 90-Day Plan:
Week 1-4: Automate deploys
- Built CI/CD pipeline
- Significant weekly time savings
Week 5-8: Auto-remediation
- Kubernetes for self-healing
- Auto-scaling policies
- Significant weekly time savings
Week 9-12: Self-service
- User provisioning portal
- Self-service log access
- Significant weekly time savings
The Result:
```text
Before:
  Toil: 81%
  Project time: 19%
  Team morale: 2/10

After 90 days:
  Toil: 28%
  Project time: 72%
  Team morale: 8/10
```
Key lesson: The team wasn’t bad at their jobs; they were drowning in toil. Automation gave them their time back.
Automation Patterns
Pattern 1: Runbook to Script
Transform documentation into code:
```bash
# Before: Runbook
# "To restart the payment service:
#  1. SSH to payment-prod
#  2. Run: sudo systemctl restart payment
#  3. Verify: curl localhost:8080/health
#  4. Check logs: tail -f /var/log/payment/app.log"
```

```bash
#!/bin/bash
# After: Script
set -euo pipefail  # stop on the first failed command

echo "Restarting payment service..."
kubectl rollout restart deployment/payment -n production

echo "Waiting for rollout..."
kubectl rollout status deployment/payment -n production

echo "Verifying health..."
# -f makes curl exit non-zero on an HTTP error, so a failed health check stops the script
kubectl exec -n production deploy/payment -- curl -sf localhost:8080/health

echo "Service restarted successfully"
```
Pattern 2: Chatbot Operations
Let chat trigger safe operations:
```text
User: /restart payment-service production
Bot:  ⚠️ Restart payment-service in production?
      This will cause ~30s of degraded service.
      [Confirm] [Cancel]

User: [Confirm]
Bot:  ✅ Restarting payment-service...
      - Old pods terminating
      - New pods starting
      - Health check passed
      Restart complete! (took 45s)
```
Pattern 3: Self-Healing
Let the system fix itself:
```yaml
# Kubernetes liveness probe
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3  # Restart after 3 consecutive failures

# Result: Kubernetes automatically restarts
# unhealthy containers; no human needed
```
Pattern 4: Policy-Based Automation
Define policies, let systems enforce them:
```rego
# OPA policy for auto-scaling
package autoscaling

import rego.v1

default scale_up := false

scale_up if {
    input.cpu_utilization > 80
    input.pending_requests > 100
    input.current_replicas < input.max_replicas
}

# Result: scale_up evaluates to true when all three
# conditions hold; the system that queries OPA acts on it
```
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Automate everything immediately | Wastes time on low-value automation | Prioritize by frequency × time |
| Automate before understanding | Script breaks in edge cases | Document first, understand, then automate |
| No monitoring of automation | Automated tasks fail silently | Add alerting to automation |
| Over-complex automation | Hard to maintain, breaks often | Simple, readable automation |
| Not tracking toil | Can’t prove improvements | Measure before and after |
| Hero culture | Individual carries toil burden | Distribute and automate together |
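For the "No monitoring of automation" row, one lightweight approach is to wrap each automated step so that a failure always reaches a human. This is a minimal sketch: `run_with_alert` and the `alert` callback are hypothetical names, and in practice the callback would post to whatever paging or chat system the team already uses.

```python
import logging
from typing import Callable

logger = logging.getLogger("automation")


def run_with_alert(name: str,
                   step: Callable[[], None],
                   alert: Callable[[str], None]) -> bool:
    """Run one automation step; on failure, log it and call the alert hook.

    Returns True if the step succeeded and False if it raised.
    """
    try:
        step()
    except Exception as exc:
        logger.exception("automation step %r failed", name)
        alert(f"{name} failed: {exc}")
        return False
    logger.info("automation step %r succeeded", name)
    return True


# Demo with a step that always fails and an in-memory "alert channel".
sent = []


def failing_step() -> None:
    raise RuntimeError("disk full")


ok = run_with_alert("backup-verify", failing_step, sent.append)
print(ok, sent)
```

Returning a boolean instead of re-raising lets a scheduler continue with independent steps, while the alert hook ensures the failure is not silent.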
Quiz: Check Your Understanding
Question 1
Scenario: You are an SRE reviewing the weekly task list. One task involves writing a detailed postmortem for a recent database outage, which takes 4 hours. Another task is manually running a database backup verification script every morning, which takes 15 minutes. How do you classify these tasks, and why?
Show Answer
The backup verification is toil, while the postmortem is not.
Toil is defined by work that is manual, repetitive, automatable, and devoid of enduring value, scaling linearly with the system’s growth. Verifying backups every morning fits this definition perfectly because it could be automated and requires no special human judgment. In contrast, writing a postmortem requires deep analysis, human judgment, and produces lasting value by improving system reliability. Therefore, the postmortem is considered engineering or operational overhead, not toil.
Question 2
Scenario: Your SRE team has been tracking its time and discovers that over the last quarter, the team spent 70% of its hours executing manual user provisioning, responding to routine paging alerts, and manually scaling infrastructure. The team is proud they kept the system running, but management is concerned. How should this situation be handled according to SRE principles?
Show Answer
The team has exceeded the 50% limit for toil and must take immediate action to reduce it.
Google’s SRE model strictly limits toil to 50% of an engineer’s time to ensure that at least half of their time is dedicated to engineering projects that permanently improve the system. When toil exceeds 50%, the team is trapped in a reactive operations mode and cannot build the automation needed to scale. To fix this, the team should push back on service levels, heavily prioritize automation projects, request additional staffing, or temporarily hand operational responsibilities back to the development team until the toil is manageable.
Question 3
Scenario: You manage a legacy reporting service that requires a manual cache clearing process every Friday afternoon. The process takes you 30 minutes to perform safely. You estimate it would take you 20 hours to write, test, and safely deploy a fully automated script to handle this. Using ROI principles, should you prioritize automating this task right now?
Show Answer
Yes, you should prioritize automating this task because the payback period is reasonable and it eliminates a recurring interruption.
The manual task costs you about 2 hours per month (30 minutes × roughly 4 weeks). If you invest 20 hours to automate it, the automation will pay for itself in about 10 months (20 hours / 2 hours per month). In SRE, calculating the Return on Investment (ROI) for automation helps prioritize engineering effort. Since this task is highly repetitive and the payback period is less than a year, automating it is a sound engineering decision that will permanently free up your Friday afternoons for higher-value work.
Question 4
Scenario: Your team currently handles high CPU alerts by manually SSHing into a server and running a script named ./scale_up.sh. You want to improve this process. First, you configure an alert system that prompts you in Slack: “High CPU detected. Run scale_up.sh? [Y/n]”. Later, you replace this entirely with a Kubernetes HorizontalPodAutoscaler that automatically adds pods when CPU hits 80%. Describe the transitions in automation levels that occurred here.
Show Answer
The process transitioned from Level 2 (Semi-automated) to Level 3 (Auto-triggered), and finally to Level 4 (Fully automated).
Initially, the human had to decide when to run the script and trigger it manually, which corresponds to Level 2. By moving the trigger to Slack, where the system detects the issue and asks for human permission, the automation reached Level 3. Finally, implementing the HorizontalPodAutoscaler removed the human from the loop entirely, allowing the system to detect and remediate the issue on its own, achieving Level 4. This progression demonstrates the ideal path for eliminating toil, moving from human-initiated actions to true self-healing systems.
Hands-On Exercise: Toil Reduction Plan
Create a 30-day toil reduction plan.
Part 1: Toil Audit (15 min)
List all repetitive tasks from the past week:
| Task | Time/occurrence | Frequency | Weekly Hours | Automatable? |
|---|---|---|---|---|
| 1. | | | | |
| 2. | | | | |
| 3. | | | | |
| 4. | | | | |
| 5. | | | | |
Total weekly toil: ___ hours
Percentage of work week (40h): ___%
Part 2: Prioritization (10 min)
Score each task:
| Task | Frequency Score (1-5, 5=daily) | Time Score (1-5, 5=long) | Complexity Score (1-5, 5=simple) | Total |
|---|---|---|---|---|
| 1. | | | | |
| 2. | | | | |
| 3. | | | | |
Priority order (highest total first):
Part 3: 30-Day Plan (15 min)
For your top priority item:
```markdown
## Automation Plan: [Task Name]

### Current State
- Time per occurrence:
- Frequency:
- Total monthly time:
- Current automation level:

### Target State
- Automation level after:
- Expected time savings:
- Monitoring added:

### Implementation
Week 1:
  - [ ] Document current process
  - [ ] Identify edge cases

Week 2:
  - [ ] Write automation script/config
  - [ ] Test in non-production

Week 3:
  - [ ] Deploy with monitoring
  - [ ] Run in parallel with manual process

Week 4:
  - [ ] Remove manual process
  - [ ] Measure actual savings

### Success Metrics
- Time savings achieved: ___
- Errors reduced: ___
- Team satisfaction: ___
```
Success Criteria
- Audited at least 5 tasks
- Calculated total toil percentage
- Prioritized using scoring
- Created detailed plan for #1 priority
- Defined success metrics
Key Takeaways
- Toil is repetitive work that doesn’t add lasting value; identify it
- The 50% rule ensures time for engineering, not just operations
- Measure toil before and after — you can’t improve what you don’t track
- Prioritize automation by frequency × time × simplicity
- Progress through automation levels — from manual to self-healing
Further Reading
Books:
- “Site Reliability Engineering” — Chapter 5: Eliminating Toil
- “The Site Reliability Workbook” — Chapter 6: Eliminating Toil
Articles:
- “Identifying and Tracking Toil” — Google Cloud blog
- “Toil: A Word Every Engineer Should Know” — Medium
Tools:
- Ansible: Automation platform
- Terraform: Infrastructure as code
- Kubernetes Operators: Custom automation
Summary
Toil is the repetitive, automatable work that eats your time without making things better. Left unchecked, it consumes all your time and prevents improvement.
SRE’s approach:
- Measure toil systematically
- Limit toil to 50% of time
- Automate strategically (ROI-based)
- Progress through automation levels
The goal isn’t to eliminate all operational work — it’s to eliminate the grinding, repetitive parts so you can focus on work that matters.
Next Module
Continue to Module 1.5: Incident Management to learn how to respond effectively when things go wrong.
“Alerts should be actionable, and routine manual intervention is a sign the system or alerting design needs improvement.”
Sources
- Identifying and Tracking Toil Using SRE Principles (cloud.google.com): The Google Cloud blog reproduces the SRE Book definition and its key characteristics directly.
- Meeting Reliability Challenges with SRE Principles — Connects toil reduction to broader SRE reliability practice and operational load management.
- Kubernetes Self-Healing — Useful background for the module’s automation and self-healing examples in Kubernetes environments.