Module 1.4: Toil and Automation

Discipline Module | Complexity: [MEDIUM] | Time: 30-35 min

Before starting this module:


After completing this module, you will be able to:

  • Evaluate operational tasks against the SRE toil taxonomy to prioritize automation
  • Design automation strategies that eliminate the highest-impact toil first
  • Implement self-healing systems that resolve common incidents without human intervention
  • Measure toil reduction over time and build a business case for continued automation investment

You’re drowning in repetitive work. Every day:

  • Reset user passwords (again)
  • Restart that flaky service (again)
  • Run the same diagnostic commands (again)
  • Provision resources manually (again)

This work keeps the lights on, but it’s eating your time. You rarely get to the projects that would make things better.

This is toil. And SRE has a systematic approach to eliminating it.

This module teaches you to identify toil, measure it, and eliminate it through automation — freeing you to work on things that actually matter.


Stop and think: Think about the most annoying task you performed this week. Did it require you to make a creative decision, or were you just acting as a human script runner? If it’s the latter, you were performing toil.

Toil is work that is:

  • Manual: Requires human hands
  • Repetitive: Done over and over
  • Automatable: Could be done by machines
  • Tactical: Reactive, not strategic
  • Devoid of value: Doesn’t improve the system
  • Scales with load: More traffic = more toil

Ask these questions about any task:

| Question | “Yes” Points to Toil |
| --- | --- |
| Could a script do this? | ✓ |
| Do I do this frequently? | ✓ |
| Does it require human judgment? | ✗ (not toil) |
| Does it permanently improve things? | ✗ (not toil) |
| Does it scale linearly with growth? | ✓ |
| Is it the same every time? | ✓ |

| Task | Is It Toil? | Why |
| --- | --- | --- |
| Restarting pods manually | Yes | Repetitive, automatable |
| Responding to pages | Depends | Investigation isn’t, remediation might be |
| Writing postmortems | No | Requires judgment, creates permanent value |
| Provisioning users | Yes | Same steps each time, automatable |
| Capacity planning | No | Requires analysis and judgment |
| Running backups manually | Yes | Repetitive, should be automated |
| Designing new system | No | Creative, strategic work |
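
The checklist above can be turned into a rough score. The following is a minimal sketch, not an official SRE formula: the `Task` fields mirror the six questions, and the cut-off of four signals is a hypothetical threshold chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Answers to the six checklist questions for one operational task."""
    name: str
    scriptable: bool             # Could a script do this?
    frequent: bool               # Do I do this frequently?
    needs_judgment: bool         # Does it require human judgment?
    permanent_improvement: bool  # Does it permanently improve things?
    scales_with_growth: bool     # Does it scale linearly with growth?
    same_every_time: bool        # Is it the same every time?


def toil_score(task: Task) -> int:
    """Count how many of the six answers point to toil (0 to 6)."""
    signals = [
        task.scriptable,
        task.frequent,
        not task.needs_judgment,         # "yes" here means NOT toil
        not task.permanent_improvement,  # "yes" here means NOT toil
        task.scales_with_growth,
        task.same_every_time,
    ]
    return sum(signals)


def looks_like_toil(task: Task, threshold: int = 4) -> bool:
    """Hypothetical cut-off: treat four or more toil signals as toil."""
    return toil_score(task) >= threshold


restart = Task("Restarting pods manually", True, True, False, False, True, True)
postmortem = Task("Writing postmortems", False, False, True, True, False, False)

for task in (restart, postmortem):
    print(task.name, toil_score(task), looks_like_toil(task))
```

A score like this is only a conversation starter; borderline tasks such as responding to pages still need a human call on which part is investigation and which part is rote remediation.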

Not all operational work is toil:

Overhead (necessary, not toil):

  • Team meetings
  • Code reviews
  • Planning sessions
  • Training and learning
  • Interviews

Toil (should be eliminated):

  • Manual deployments
  • Repetitive tickets
  • Manual scaling
  • Routine restarts
  • Manual monitoring checks

Pause and predict: If a team spends 90% of their time resolving tickets manually, what happens to the reliability of the system over the next year?

Google’s SRE organization sets an explicit ceiling on toil:

No more than 50% of SRE time should be spent on toil.

The other 50%+ goes to engineering projects that:

  • Reduce future toil
  • Improve system reliability
  • Build better tools
  • Automate repetitive tasks

Why 50%?

  • Less than 50%: Allows toil to be addressed, not just managed
  • More than 50%: The team becomes glorified ops, not engineers
  • Far more than 50%: Indicates systemic problems, either too much toil or too few people

If your team’s toil exceeds 50%, something must change:

  1. Automate aggressively: Invest in eliminating top toil sources
  2. Push back on the service: The service itself may need reliability improvements before the team can run it sustainably
  3. Create temporary capacity: Free time for the team to reduce operational load and automation debt
  4. Rebalance ownership: Revisit service ownership if the current arrangement is unsustainably toil-heavy

The 50% rule is a forcing function that ensures SRE remains an engineering discipline.
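
A minimal sketch of checking a team against the 50% ceiling. The function names and the example numbers (a four-person team logging 112 hours of toil in a 160-hour week) are illustrative assumptions, not data from a real team.

```python
TOIL_LIMIT = 0.50  # the 50% ceiling described above


def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of total working time that went to toil."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def over_toil_budget(toil_hours: float, total_hours: float,
                     limit: float = TOIL_LIMIT) -> bool:
    """True when the toil share exceeds the limit."""
    return toil_fraction(toil_hours, total_hours) > limit


# Illustrative numbers only: 4 engineers x 40 h = 160 h, of which 112 h was toil.
fraction = toil_fraction(112, 160)
print(f"Toil: {fraction:.0%}, over budget: {over_toil_budget(112, 160)}")
```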


List your top 5 repetitive tasks this week:

Task 1: ________________
  Time spent: ___ hours
  Frequency: ___/week
  Could be automated? Y/N

Task 2: ________________
  Time spent: ___ hours
  Frequency: ___/week
  Could be automated? Y/N

Task 3: ________________
  Time spent: ___ hours
  Frequency: ___/week
  Could be automated? Y/N

Task 4: ________________
  Time spent: ___ hours
  Frequency: ___/week
  Could be automated? Y/N

Task 5: ________________
  Time spent: ___ hours
  Frequency: ___/week
  Could be automated? Y/N

Total toil time: ___ hours/week
Percentage of work week: ___%

You can’t reduce what you don’t measure.

toil_tracking:
  categories:
    - name: "User management"
      tasks:
        - "Password resets"
        - "Account provisioning"
        - "Access revocations"
    - name: "Incident response"
      tasks:
        - "Alert investigation"
        - "Service restarts"
        - "Failover execution"
    - name: "Deployments"
      tasks:
        - "Manual deployment steps"
        - "Rollback execution"
        - "Config updates"
    - name: "Maintenance"
      tasks:
        - "Certificate renewals"
        - "Capacity adjustments"
        - "Backup verification"
  tracking_method:
    - Tool: Time tracking software
    - Cadence: Weekly review
    - Metrics:
        - Hours per category
        - Trend over time
        - Percentage of work

| Metric | What It Tells You |
| --- | --- |
| Toil percentage | How much time goes to repetitive work |
| Toil per team member | Individual burden distribution |
| Toil trend | Is it getting better or worse? |
| Toil per incident | How much manual work per incident? |
| Time to automate | ROI of automation efforts |

Start simple — a shared spreadsheet:

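| Week | Category | Task | Time (hrs) | Automatable? |
| --- | --- | --- | --- | --- |
| 1 | Users | Password resets | 2 | Yes |
| 1 | Deploy | Manual deploy | 4 | Yes |
| 1 | Incident | Service restart | 1 | Yes |
| 1 | Meetings | Team sync | 3 | No |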
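
Once the spreadsheet exists, the weekly review can be a few lines of code. This sketch assumes the sheet is exported as CSV with the column names shown in `LOG`, which are an assumption; the rows are the four example entries from the table.

```python
import csv
import io
from collections import defaultdict

# Assumed CSV export of the tracking spreadsheet (column names are hypothetical).
LOG = """week,category,task,hours,automatable
1,Users,Password resets,2,Yes
1,Deploy,Manual deploy,4,Yes
1,Incident,Service restart,1,Yes
1,Meetings,Team sync,3,No
"""


def hours_by_category(csv_text: str) -> dict[str, float]:
    """Sum logged hours per category."""
    totals: dict[str, float] = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["category"]] += float(row["hours"])
    return dict(totals)


def automatable_hours(csv_text: str) -> float:
    """Sum hours for rows marked as automatable."""
    return sum(
        float(row["hours"])
        for row in csv.DictReader(io.StringIO(csv_text))
        if row["automatable"].strip().lower() == "yes"
    )


print(hours_by_category(LOG))
print("Automatable hours:", automatable_hours(LOG))
```

Running the same aggregation every week gives the "hours per category" and "trend over time" metrics without any extra tooling.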

Stop and think: Is it possible to over-automate? Think about a scenario where automating a task might actually be a bad idea.

Not everything should be automated the same way:

flowchart TD
L0["<b>Level 0: Manual</b><br/>Every execution requires human action<br/><i>Example: SSH in and restart service</i>"]
L1["<b>Level 1: Documented</b><br/>Written procedure, still manual<br/><i>Example: Runbook with exact commands</i>"]
L2["<b>Level 2: Semi-automated</b><br/>Script exists, human triggers it<br/><i>Example: ./restart_service.sh</i>"]
L3["<b>Level 3: Auto-triggered</b><br/>System detects need, asks permission<br/><i>Example: 'Service unhealthy. Restart? [Y/n]'</i>"]
L4["<b>Level 4: Fully automated</b><br/>System handles automatically<br/><i>Example: Kubernetes self-healing</i>"]
L5["<b>Level 5: Self-optimizing</b><br/>System learns and improves<br/><i>Example: Auto-scaling based on patterns</i>"]
L0 --> L1 --> L2 --> L3 --> L4 --> L5

The ROI calculation:

Automation ROI = (Time saved per occurrence × Occurrences) - Development time
Example:
Task: Manual deployment
Time per deployment: 30 minutes
Deployments per month: 40
Automation development time: 20 hours
Monthly savings: 40 × 0.5 hours = 20 hours
Payback period: 20 hours / 20 hours = 1 month
Verdict: Definitely automate
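
The same calculation as a small helper, so it can be rerun for each candidate task. This is a minimal sketch; the function names are illustrative, and the 12-month evaluation window in the example is an assumption.

```python
def monthly_savings_hours(minutes_per_occurrence: float,
                          occurrences_per_month: float) -> float:
    """Hours of manual work removed per month once the task is automated."""
    return minutes_per_occurrence / 60 * occurrences_per_month


def payback_months(dev_hours: float, minutes_per_occurrence: float,
                   occurrences_per_month: float) -> float:
    """Months until the saved time equals the development time."""
    return dev_hours / monthly_savings_hours(minutes_per_occurrence,
                                             occurrences_per_month)


def roi_hours(dev_hours: float, minutes_per_occurrence: float,
              occurrences_per_month: float, months: float) -> float:
    """Net hours gained over the given window (negative means a loss)."""
    saved = monthly_savings_hours(minutes_per_occurrence,
                                  occurrences_per_month) * months
    return saved - dev_hours


# The manual-deployment example: 30 min each, 40 per month, 20 h to automate.
print("Monthly savings (h):", monthly_savings_hours(30, 40))
print("Payback (months):", payback_months(20, 30, 40))
print("Net gain over 12 months (h):", roi_hours(20, 30, 40, 12))
```

The calculation ignores ongoing maintenance of the automation itself, which is worth adding as a recurring cost for anything non-trivial.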

XKCD’s classic chart helps decide:

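| How often? | Time saved | Automation worth it if it takes… |
| --- | --- | --- |
| 50x/day | 5 min | Up to 9 months |
| Daily | 5 min | Up to 6 days |
| Weekly | 5 min | Up to 21 hours |
| Monthly | 30 min | Up to 1 day |
| Yearly | 1 hour | Up to 5 hours |

These limits assume the task keeps recurring for five years, the horizon the XKCD chart uses.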
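
The arithmetic behind a break-even chart like this is simply time saved per run, multiplied by the number of runs over the period you expect the task to exist. A minimal sketch, assuming a five-year horizon by default (the one the XKCD chart uses); the function name is illustrative.

```python
def max_worthwhile_hours(minutes_saved: float, occurrences_per_year: float,
                         horizon_years: float = 5) -> float:
    """Upper bound on automation effort (in hours) before it costs more than it saves."""
    return minutes_saved * occurrences_per_year * horizon_years / 60


candidates = {
    "5 min, daily": (5, 365),
    "5 min, weekly": (5, 52),
    "30 min, monthly": (30, 12),
    "1 hour, yearly": (60, 1),
}

for label, (minutes, per_year) in candidates.items():
    print(f"{label}: up to {max_worthwhile_hours(minutes, per_year):.1f} hours")
```

Shortening `horizon_years` is a cheap way to stay honest about tasks or services that are unlikely to survive five years.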

Prioritize by combining frequency and complexity:

  1. High frequency, low complexity: Quick wins (Do these first)
  2. High frequency, high complexity: Big impact (Major engineering projects)
  3. Low frequency, low complexity: Steady progress (Good for onboarding or slow days)
  4. Low frequency, high complexity: Maybe don’t automate (Negative ROI)
quadrantChart
title Automation Priority
x-axis "Low Frequency" --> "High Frequency"
y-axis "Low Complexity" --> "High Complexity"
quadrant-1 "2. Big impact"
quadrant-2 "4. Maybe don't automate"
quadrant-3 "3. Steady progress"
quadrant-4 "1. Quick wins"
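
The quadrant can also be expressed as a lookup. In this sketch the numeric thresholds (four runs per month counts as "high frequency", 40 build hours counts as "high complexity") are hypothetical and should be tuned per team.

```python
def automation_priority(high_frequency: bool, high_complexity: bool) -> str:
    """Map a task onto the four quadrants of the priority chart."""
    if high_frequency:
        return "2. Big impact" if high_complexity else "1. Quick wins"
    return "4. Maybe don't automate" if high_complexity else "3. Steady progress"


def classify(runs_per_month: float, build_hours: float,
             frequency_threshold: float = 4,       # hypothetical cut-off
             complexity_threshold: float = 40) -> str:  # hypothetical cut-off
    """Classify a task from rough numbers instead of yes/no judgments."""
    return automation_priority(
        runs_per_month >= frequency_threshold,
        build_hours >= complexity_threshold,
    )


print(classify(runs_per_month=40, build_hours=20))
print(classify(runs_per_month=1, build_hours=80))
```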

  1. If incident remediation repeatedly requires manual steps, that’s a sign more automation is likely needed.

  2. In Google SRE writing, “toil” refers to repetitive operational work, not all operational work.

  3. Teams often track toil trends over time. An upward trend is a warning sign that current automation is not keeping up.

  4. XKCD’s “Is It Worth the Time?” chart offers a back-of-the-envelope way to think about automation ROI.


War Story: The Automation That Saved the Team

A team I worked with had a toil problem:

The Situation:

  • A small SRE team
  • Multiple critical services
  • Most of the team’s time consumed by toil
  • No time for improvements
  • Team morale: terrible

The Toil Audit:

| Category | Hours/week | % of Time |
| --- | --- | --- |
| Manual deploys | 40 | 28% |
| Incident response | 35 | 24% |
| User provisioning | 20 | 14% |
| Log investigation | 15 | 10% |
| Certificate mgmt | 8 | 5% |
| Total toil | 118 | 81% |

The 90-Day Plan:

Week 1-4: Automate deploys

  • Built CI/CD pipeline
  • Significant weekly time savings

Week 5-8: Auto-remediation

  • Kubernetes for self-healing
  • Auto-scaling policies
  • Significant weekly time savings

Week 9-12: Self-service

  • User provisioning portal
  • Self-service log access
  • Significant weekly time savings

The Result:

Before:
Toil: 81%
Project time: 19%
Team morale: 2/10
After 90 days:
Toil: 28%
Project time: 72%
Team morale: 8/10

Key lesson: The team wasn’t bad at their jobs — they were drowning in toil. Automation gave them their time back.


Transform documentation into code:

#!/bin/bash
# Before: Runbook
# "To restart the payment service:
# 1. SSH to payment-prod
# 2. Run: sudo systemctl restart payment
# 3. Verify: curl localhost:8080/health
# 4. Check logs: tail -f /var/log/payment/app.log"

# After: Script (assumes the service now runs as a Kubernetes Deployment)
set -euo pipefail

echo "Restarting payment service..."
kubectl rollout restart deployment/payment -n production

echo "Waiting for rollout..."
# Fail instead of waiting forever if the rollout stalls.
kubectl rollout status deployment/payment -n production --timeout=5m

echo "Verifying health..."
# -f makes curl exit non-zero on HTTP errors, so `set -e` stops the script.
# Assumes curl is installed in the container image.
kubectl exec -n production deploy/payment -- curl -fsS localhost:8080/health

echo "Service restarted successfully"

Let chat trigger safe operations:

User: /restart payment-service production
Bot:  ⚠️ Restart payment-service in production?
      This will cause ~30s of degraded service.
      [Confirm] [Cancel]
User: [Confirm]
Bot:  ✅ Restarting payment-service...
      - Old pods terminating
      - New pods starting
      - Health check passed
      Restart complete! (took 45s)
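
The important part of the exchange above is the two-step flow: nothing runs until the same user confirms. A minimal sketch of that flow follows. It is not tied to any real chat platform; `ChatOpsBot`, `SAFE_COMMANDS`, and the recording executor are all hypothetical names introduced for illustration.

```python
from typing import Callable

# Hypothetical allow-list: only commands named here can be triggered from chat.
SAFE_COMMANDS = {"restart"}


class ChatOpsBot:
    """Two-step chat command: request first, execute only after confirmation."""

    def __init__(self, run_restart: Callable[[str, str], None]) -> None:
        self._run_restart = run_restart
        self._pending: dict[str, tuple[str, str]] = {}  # user -> (service, env)

    def request(self, user: str, text: str) -> str:
        parts = text.split()
        if len(parts) != 3 or parts[0].lstrip("/") not in SAFE_COMMANDS:
            return "Unknown or disallowed command."
        _, service, environment = parts
        self._pending[user] = (service, environment)
        return f"Restart {service} in {environment}? [Confirm] [Cancel]"

    def confirm(self, user: str) -> str:
        action = self._pending.pop(user, None)
        if action is None:
            return "Nothing to confirm."
        service, environment = action
        self._run_restart(service, environment)
        return f"Restarted {service} in {environment}."

    def cancel(self, user: str) -> str:
        self._pending.pop(user, None)
        return "Cancelled."


# Stand-in executor that only records the call; a real one would invoke
# whatever restart mechanism the team already trusts.
calls: list[tuple[str, str]] = []
bot = ChatOpsBot(lambda service, env: calls.append((service, env)))
print(bot.request("alice", "/restart payment-service production"))
print(bot.confirm("alice"))
```

Keeping the executor as an injected callable means the confirmation logic can be tested without touching a cluster.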

Let the system fix itself:

# Kubernetes liveness probe
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3  # Restart after 3 consecutive failures

# Result: Kubernetes automatically restarts
# unhealthy containers, no human needed
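
To make the `failureThreshold` behaviour concrete, here is a small simulation of the counting rule: the restart fires only after that many failures in a row, and any success resets the count. This models the rule only; it is not how the kubelet is implemented and it ignores timeouts and the initial delay.

```python
from typing import Optional, Sequence


def first_restart_index(probe_results: Sequence[bool],
                        failure_threshold: int = 3) -> Optional[int]:
    """Return the index of the probe that would trigger a restart, or None.

    probe_results holds one boolean per probe run: True = healthy.
    """
    consecutive_failures = 0
    for index, healthy in enumerate(probe_results):
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            return index
    return None


# Two failures, a recovery, then three failures in a row.
history = [True, False, False, True, False, False, False]
print(first_restart_index(history))
```

With `periodSeconds: 10`, three consecutive failures means a broken container is typically restarted within roughly half a minute of going unhealthy, which is the trade-off between fast recovery and tolerance for brief blips.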

Define policies, let systems enforce them:

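# OPA policy for auto-scaling
package autoscaling

import rego.v1

default scale_up := false

# All three conditions must hold (statements in a rule body are ANDed)
scale_up if {
    input.cpu_utilization > 80
    input.pending_requests > 100
    input.current_replicas < input.max_replicas
}

# Result: OPA returns scale_up = true when the conditions are met;
# the controller that queries the policy performs the actual scale-up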

| Mistake | Problem | Solution |
| --- | --- | --- |
| Automate everything immediately | Wastes time on low-value automation | Prioritize by frequency × time |
| Automate before understanding | Script breaks in edge cases | Document first, understand, then automate |
| No monitoring of automation | Automated tasks fail silently | Add alerting to automation |
| Over-complex automation | Hard to maintain, breaks often | Simple, readable automation |
| Not tracking toil | Can’t prove improvements | Measure before and after |
| Hero culture | Individual carries toil burden | Distribute and automate together |

Scenario: You are an SRE reviewing the weekly task list. One task involves writing a detailed postmortem for a recent database outage, which takes 4 hours. Another task is manually running a database backup verification script every morning, which takes 15 minutes. How do you classify these tasks, and why?

Answer:

The backup verification is toil, while the postmortem is not.

Toil is defined by work that is manual, repetitive, automatable, and devoid of enduring value, scaling linearly with the system’s growth. Verifying backups every morning fits this definition perfectly because it could be automated and requires no special human judgment. In contrast, writing a postmortem requires deep analysis, human judgment, and produces lasting value by improving system reliability. Therefore, the postmortem is considered engineering or operational overhead, not toil.

Scenario: Your SRE team has been tracking its time and discovers that over the last quarter, the team spent 70% of its hours executing manual user provisioning, responding to routine paging alerts, and manually scaling infrastructure. The team is proud they kept the system running, but management is concerned. How should this situation be handled according to SRE principles?

Answer:

The team has exceeded the 50% limit for toil and must take immediate action to reduce it.

Google’s SRE model strictly limits toil to 50% of an engineer’s time to ensure that at least half of their time is dedicated to engineering projects that permanently improve the system. When toil exceeds 50%, the team is trapped in a reactive operations mode and cannot build the automation needed to scale. To fix this, the team should push back on service levels, heavily prioritize automation projects, request additional staffing, or temporarily hand operational responsibilities back to the development team until the toil is manageable.

Scenario: You manage a legacy reporting service that requires a manual cache clearing process every Friday afternoon. The process takes you 30 minutes to perform safely. You estimate it would take you 20 hours to write, test, and safely deploy a fully automated script to handle this. Using ROI principles, should you prioritize automating this task right now?

Answer:

Yes, you should prioritize automating this task because the payback period is reasonable and it eliminates a recurring interruption.

The manual task costs you roughly 2 hours per month (30 minutes × about 4 Fridays). If you invest 20 hours to automate it, the automation will pay for itself in about 10 months (20 hours / 2 hours per month). In SRE, calculating the Return on Investment (ROI) for automation helps prioritize engineering effort. Since this task recurs every week and the payback period is less than a year, automating it is a sound engineering decision that will permanently free up your Friday afternoons for higher-value work.

Scenario: Your team currently handles high CPU alerts by manually SSHing into a server and running a script named ./scale_up.sh. You want to improve this process. First, you configure an alert system that prompts you in Slack: “High CPU detected. Run scale_up.sh? [Y/n]”. Later, you replace this entirely with a Kubernetes HorizontalPodAutoscaler that automatically adds pods when CPU hits 80%. Describe the transitions in automation levels that occurred here.

Answer:

The process transitioned from Level 2 (Semi-automated) to Level 3 (Auto-triggered), and finally to Level 4 (Fully automated).

Initially, the human had to decide when to run the script and trigger it manually, which corresponds to Level 2. By moving the trigger to Slack, where the system detects the issue and asks for human permission, the automation reached Level 3. Finally, implementing the HorizontalPodAutoscaler removed the human from the loop entirely, allowing the system to detect and remediate the issue on its own, achieving Level 4. This progression demonstrates the ideal path for eliminating toil, moving from human-initiated actions to true self-healing systems.


Create a 30-day toil reduction plan.

List all repetitive tasks from the past week:

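| Task | Time/occurrence | Frequency | Weekly Hours | Automatable? |
| --- | --- | --- | --- | --- |
| 1. | | | | |
| 2. | | | | |
| 3. | | | | |
| 4. | | | | |
| 5. | | | | |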

Total weekly toil: ___ hours
Percentage of work week (40h): ___%

Score each task:

| Task | Frequency Score (1-5, 5=daily) | Time Score (1-5, 5=long) | Complexity Score (1-5, 5=simple) | Total |
| --- | --- | --- | --- | --- |
| 1. | | | | |
| 2. | | | | |
| 3. | | | | |

Priority order (highest total first):
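
If you prefer to compute the ranking, the sketch below adds the three scores and sorts by the total. Summing is an assumption based on the "Total" column; multiplying the scores would also be a reasonable reading. The three example tasks and their scores are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredTask:
    name: str
    frequency: int   # 1-5, 5 = daily
    time: int        # 1-5, 5 = long
    simplicity: int  # 1-5, 5 = simple to automate

    @property
    def total(self) -> int:
        """Sum of the three scores (assumed meaning of the Total column)."""
        return self.frequency + self.time + self.simplicity


def prioritize(tasks: list[ScoredTask]) -> list[ScoredTask]:
    """Highest total first; sorted() is stable, so ties keep their input order."""
    return sorted(tasks, key=lambda task: task.total, reverse=True)


# Hypothetical scores for illustration only.
tasks = [
    ScoredTask("Certificate renewal", frequency=1, time=3, simplicity=2),
    ScoredTask("Password resets", frequency=5, time=2, simplicity=5),
    ScoredTask("Manual deploy", frequency=4, time=4, simplicity=3),
]

for rank, task in enumerate(prioritize(tasks), start=1):
    print(f"{rank}. {task.name} (total {task.total})")
```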




For your top priority item:

## Automation Plan: [Task Name]
### Current State
- Time per occurrence:
- Frequency:
- Total monthly time:
- Current automation level:
### Target State
- Automation level after:
- Expected time savings:
- Monitoring added:
### Implementation
Week 1:
- [ ] Document current process
- [ ] Identify edge cases
Week 2:
- [ ] Write automation script/config
- [ ] Test in non-production
Week 3:
- [ ] Deploy with monitoring
- [ ] Run in parallel with manual process
Week 4:
- [ ] Remove manual process
- [ ] Measure actual savings
### Success Metrics
- Time savings achieved: ___
- Errors reduced: ___
- Team satisfaction: ___

You have completed this exercise when you have:

  • Audited at least 5 tasks
  • Calculated total toil percentage
  • Prioritized using scoring
  • Created detailed plan for #1 priority
  • Defined success metrics

  1. Toil is repetitive work that doesn’t add lasting value — identify it
  2. The 50% rule ensures time for engineering, not just operations
  3. Measure toil before and after — you can’t improve what you don’t track
  4. Prioritize automation by frequency × time × simplicity
  5. Progress through automation levels — from manual to self-healing

Books:

  • “Site Reliability Engineering” — Chapter 5: Eliminating Toil
  • “The Site Reliability Workbook” — Chapter 6: Eliminating Toil

Articles:

  • “Identifying and Tracking Toil” — Google Cloud blog
  • “Toil: A Word Every Engineer Should Know” — Medium

Tools:

  • Ansible: Automation platform
  • Terraform: Infrastructure as code
  • Kubernetes Operators: Custom automation

Toil is the repetitive, automatable work that eats your time without making things better. Left unchecked, it consumes all your time and prevents improvement.

SRE’s approach:

  • Measure toil systematically
  • Limit toil to 50% of time
  • Automate strategically (ROI-based)
  • Progress through automation levels

The goal isn’t to eliminate all operational work — it’s to eliminate the grinding, repetitive parts so you can focus on work that matters.


Continue to Module 1.5: Incident Management to learn how to respond effectively when things go wrong.


“Alerts should be actionable, and routine manual intervention is a sign the system or alerting design needs improvement.”