Module 6.4: Root Cause Analysis
Цей контент ще не доступний вашою мовою.
Discipline Track | Complexity:
[COMPLEX]| Time: 40-45 min
Prerequisites
Section titled “Prerequisites”Before starting this module:
- Module 6.3: Event Correlation — Grouping related alerts
- Understanding of directed graphs and traversal
- Basic causal reasoning concepts
- Familiarity with distributed systems
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Implement automated root cause analysis workflows that correlate symptoms across metrics, logs, and traces
- Design dependency graph analysis that traces failures from user-facing symptoms to infrastructure causes
- Build RCA playbooks that combine automated analysis with structured human investigation
- Evaluate causal inference techniques that distinguish root causes from correlated symptoms
Why This Module Matters
Section titled “Why This Module Matters”Correlation tells you alerts are related. Root Cause Analysis (RCA) tells you why. When 47 services alert simultaneously, which one actually failed? Is it the database, the network, or a misconfigured deployment?
Manual RCA is detective work—following logs, traces, and metrics across systems. AIOps automates this investigation, identifying probable causes in seconds instead of hours. For teams drowning in incidents, automated RCA is the difference between firefighting and actual engineering.
Did You Know?
Section titled “Did You Know?”- Average MTTR for complex incidents is 4+ hours, with most time spent on investigation rather than remediation
- Google’s Monarch system performs millions of RCA queries per second using precomputed dependency graphs
- Causal inference in ML (Pearl’s do-calculus) is increasingly applied to systems—answering “what would happen if we did X”
- Change correlation identifies deployment-related issues 60% faster than traditional methods
The RCA Challenge
Section titled “The RCA Challenge”Symptoms vs Root Causes
Section titled “Symptoms vs Root Causes”THE RCA CHALLENGE─────────────────────────────────────────────────────────────────
WHAT YOU SEE (Symptoms) WHAT YOU NEED (Root Cause)─────────────────────────────────────────────────────────────────
[ALERT] Frontend: Slow ┌─────────────────────────────┐[ALERT] API: High latency │ │[ALERT] API: Timeout errors │ Which one caused the │[ALERT] UserService: DB timeout │ others? │[ALERT] OrderService: DB timeout │ │[ALERT] Cache: Miss rate high │ Was it: │[ALERT] Database: Slow queries │ - Database? │[ALERT] Database: Connection pool │ - Network? │ │ - A deployment? │ │ - Traffic spike? │ │ │ └─────────────────────────────┘
Without RCA: Check each alert, correlate manually (hours)With RCA: "Database slow queries caused cascade" (seconds)Root Cause Categories
Section titled “Root Cause Categories”| Category | Examples | Detection Approach |
|---|---|---|
| Infrastructure | Server down, disk full, network partition | Health checks, resource metrics |
| Application | Memory leak, infinite loop, deadlock | APM, profiling, error rates |
| Data | Corrupted data, schema change, volume spike | Data quality checks, query analysis |
| Configuration | Bad deployment, misconfigured service | Change correlation, config diff |
| External | Third-party outage, DNS issue, certificate expiry | External monitoring, dependency checks |
| Capacity | Traffic spike, resource exhaustion | Capacity metrics, forecasting |
RCA Approaches
Section titled “RCA Approaches”1. Dependency Graph Analysis
Section titled “1. Dependency Graph Analysis”Follow the dependency chain to find the deepest failure:
DEPENDENCY-BASED RCA─────────────────────────────────────────────────────────────────
SERVICE DEPENDENCY GRAPH ALERT ┌─────────┐ │ │ Frontend│ ◀───────────────┤ └────┬────┘ │ │ │ ┌────▼────┐ │ │ API │ ◀───────────────┤ └────┬────┘ │ ┌──────────┼──────────┐ │ │ │ │ │ ┌────▼────┐┌────▼────┐┌────▼────┐ │ │ User ││ Order ││ Product │ ◀───┤ │ Service ││ Service ││ Service │ │ └────┬────┘└────┬────┘└────┬────┘ │ │ │ │ │ └──────────┼──────────┘ │ │ │ ┌────▼────┐ │ │Database │ ◀───────────────┘ └─────────┘ ▲ │ ROOT CAUSE (Deepest alerting node)from collections import defaultdict, deque
class DependencyRCA: """ Root cause analysis using dependency graph traversal.
Principle: Root cause is the deepest alerting service in the dependency chain (failures propagate upward). """ def __init__(self, dependency_graph): """ dependency_graph: dict mapping service -> list of dependencies """ self.graph = dependency_graph self._build_reverse_graph()
def _build_reverse_graph(self): """Build reverse graph for upstream traversal.""" self.reverse_graph = defaultdict(list) for service, deps in self.graph.items(): for dep in deps: self.reverse_graph[dep].append(service)
def _depth_in_graph(self, service, visited=None): """Calculate depth (distance from leaf nodes).""" if visited is None: visited = set() if service in visited: return 0 visited.add(service)
deps = self.graph.get(service, []) if not deps: return 0 # Leaf node return 1 + max(self._depth_in_graph(d, visited.copy()) for d in deps)
def _get_blast_radius(self, service): """Find all services affected by this service failing.""" affected = set() queue = deque([service])
while queue: current = queue.popleft() if current in affected: continue affected.add(current)
# Add all services that depend on current for dependent in self.reverse_graph.get(current, []): queue.append(dependent)
return affected
def find_root_cause(self, alerting_services): """ Find probable root cause among alerting services.
Algorithm: 1. Score each alerting service by depth 2. Higher depth = more likely root cause 3. Verify blast radius explains other alerts """ if not alerting_services: return None, 0, set()
candidates = [] for service in alerting_services: depth = self._depth_in_graph(service) blast_radius = self._get_blast_radius(service) explained = alerting_services & blast_radius
candidates.append({ 'service': service, 'depth': depth, 'blast_radius': blast_radius, 'explained': explained, 'explanation_ratio': len(explained) / len(alerting_services) })
# Sort by: explanation_ratio desc, then depth desc candidates.sort( key=lambda c: (c['explanation_ratio'], c['depth']), reverse=True )
best = candidates[0] return best['service'], best['explanation_ratio'], best['blast_radius']
# Usagegraph = { 'frontend': ['api'], 'api': ['user-svc', 'order-svc'], 'user-svc': ['database', 'cache'], 'order-svc': ['database', 'kafka'], 'database': [], 'cache': [], 'kafka': []}
rca = DependencyRCA(graph)alerting = {'frontend', 'api', 'user-svc', 'order-svc', 'database'}root, confidence, blast = rca.find_root_cause(alerting)# root = 'database', confidence = 1.0 (explains all alerts)2. Change Correlation
Section titled “2. Change Correlation”Most incidents follow changes (deployments, config updates, traffic shifts):
CHANGE CORRELATION TIMELINE─────────────────────────────────────────────────────────────────
Timeline │10:00 ├── Deploy: user-service v2.3.1 │10:15 ├── Config change: database connection pool size │10:30 ├── [ALERT] user-service: High error rate │ │ │ └──▶ Correlate with changes in last 30 min │ ┌─────────────────────────────────────┐ │ │ Probable cause: Deploy v2.3.1 │ │ │ Confidence: HIGH (time proximity) │ │ │ Suggestion: Check v2.3.1 changelog │ │ │ or rollback │ │ └─────────────────────────────────────┘from datetime import timedelta
class ChangeCorrelationRCA: """ Correlate incidents with recent changes.
Changes include: - Deployments - Config changes - Infrastructure changes - Traffic pattern shifts """ def __init__(self, lookback_minutes=60): self.lookback = timedelta(minutes=lookback_minutes) self.changes = [] # List of change events
def record_change(self, change): """ Record a change event.
change = { 'timestamp': datetime, 'type': 'deployment' | 'config' | 'infrastructure', 'service': str, 'description': str, 'user': str, 'reversible': bool } """ self.changes.append(change)
def find_related_changes(self, incident_time, affected_services): """ Find changes that might have caused the incident. """ cutoff = incident_time - self.lookback
related = [] for change in self.changes: if change['timestamp'] < cutoff: continue if change['timestamp'] > incident_time: continue
# Check if change affects any incident service service_match = change['service'] in affected_services time_proximity = (incident_time - change['timestamp']).total_seconds()
# Score the change score = 0 reasons = []
if service_match: score += 50 reasons.append(f"Affects {change['service']}")
# Closer to incident = higher score if time_proximity < 300: # 5 min score += 30 reasons.append("Within 5 minutes of incident") elif time_proximity < 900: # 15 min score += 20 reasons.append("Within 15 minutes of incident") elif time_proximity < 1800: # 30 min score += 10 reasons.append("Within 30 minutes of incident")
# Deployment changes are higher risk if change['type'] == 'deployment': score += 20 reasons.append("Deployment change") elif change['type'] == 'config': score += 15 reasons.append("Config change")
if score > 0: related.append({ 'change': change, 'score': score, 'reasons': reasons, 'suggestion': self._get_suggestion(change) })
# Sort by score related.sort(key=lambda r: r['score'], reverse=True) return related
def _get_suggestion(self, change): """Generate remediation suggestion for change.""" if change['type'] == 'deployment' and change.get('reversible'): return f"Consider rollback: {change['description']}" elif change['type'] == 'config': return f"Review config change: {change['description']}" else: return f"Investigate change: {change['description']}"3. Metric-Based Causal Analysis
Section titled “3. Metric-Based Causal Analysis”Find metrics that changed before the incident:
METRIC CAUSALITY ANALYSIS─────────────────────────────────────────────────────────────────
INCIDENT: API latency spike at 10:30
METRIC ANALYSIS:
Metric | 10:00 | 10:15 | 10:30 | Verdict────────────────────────────────────────────────────────────────API latency (symptom) | 50ms | 55ms | 500ms | ← EFFECTDatabase query time | 10ms | 50ms | 400ms | ← CAUSE?Database connections | 50 | 80 | 100 | ← CAUSE?Request rate | 1000 | 1000 | 1000 | ← StableMemory usage | 60% | 61% | 62% | ← Stable
FINDING: Database metrics degraded BEFORE API latency suggesting database is root causeimport numpy as npfrom datetime import timedelta
class MetricCausalityAnalyzer: """ Analyze metrics to find causal relationships.
Key principle: Causes precede effects. If metric A changes before metric B, A may cause B. """ def __init__(self, lookback_minutes=30): self.lookback = timedelta(minutes=lookback_minutes)
def analyze_causality(self, incident_time, metrics_data, symptom_metric): """ Find metrics that changed before the symptom.
metrics_data: dict of metric_name -> list of (timestamp, value) symptom_metric: the metric that alerted (effect) """ # Find when symptom started degrading symptom_change_time = self._find_change_point( metrics_data[symptom_metric], incident_time )
if symptom_change_time is None: return []
# Find metrics that changed BEFORE symptom potential_causes = [] for metric_name, data in metrics_data.items(): if metric_name == symptom_metric: continue
change_time = self._find_change_point(data, incident_time) if change_time is None: continue
# Did this metric change before the symptom? if change_time < symptom_change_time: lead_time = (symptom_change_time - change_time).total_seconds() potential_causes.append({ 'metric': metric_name, 'change_time': change_time, 'lead_time_seconds': lead_time, 'confidence': self._calculate_confidence(lead_time) })
# Sort by confidence potential_causes.sort(key=lambda c: c['confidence'], reverse=True) return potential_causes
def _find_change_point(self, data, reference_time): """ Find when a metric started changing significantly.
Uses simple rolling average comparison. """ if len(data) < 10: return None
# Filter to relevant time window cutoff = reference_time - self.lookback relevant = [(ts, val) for ts, val in data if ts >= cutoff]
if len(relevant) < 5: return None
values = [val for _, val in relevant] timestamps = [ts for ts, _ in relevant]
# Calculate baseline (first 30%) baseline_end = len(values) // 3 baseline_mean = np.mean(values[:baseline_end]) baseline_std = np.std(values[:baseline_end])
if baseline_std == 0: baseline_std = 0.01
# Find first significant deviation for i in range(baseline_end, len(values)): z_score = abs(values[i] - baseline_mean) / baseline_std if z_score > 3: # 3 sigma deviation return timestamps[i]
return None
def _calculate_confidence(self, lead_time_seconds): """ Calculate confidence that this is a cause.
Shorter lead times = higher confidence (more correlated). """ if lead_time_seconds < 60: return 0.9 elif lead_time_seconds < 300: return 0.7 elif lead_time_seconds < 900: return 0.5 else: return 0.34. Log-Based Analysis
Section titled “4. Log-Based Analysis”Extract signals from log patterns:
import refrom collections import Counter
class LogBasedRCA: """ Analyze logs to identify root cause patterns. """ # Common error patterns ERROR_PATTERNS = { 'connection': [ r'connection refused', r'connection timed out', r'no route to host', r'connection reset' ], 'resource': [ r'out of memory', r'disk full', r'too many open files', r'resource exhausted' ], 'database': [ r'deadlock detected', r'lock wait timeout', r'too many connections', r'query timeout' ], 'authentication': [ r'authentication failed', r'invalid credentials', r'token expired', r'unauthorized' ] }
def analyze_logs(self, logs, incident_time, window_minutes=30): """ Analyze logs around incident time for root cause signals.
logs: list of {'timestamp': datetime, 'service': str, 'message': str} """ cutoff_start = incident_time - timedelta(minutes=window_minutes) cutoff_end = incident_time + timedelta(minutes=5)
relevant_logs = [ log for log in logs if cutoff_start <= log['timestamp'] <= cutoff_end ]
# Categorize errors error_categories = Counter() first_occurrences = {} service_errors = Counter()
for log in relevant_logs: message = log['message'].lower() service = log['service']
for category, patterns in self.ERROR_PATTERNS.items(): for pattern in patterns: if re.search(pattern, message, re.IGNORECASE): error_categories[category] += 1 service_errors[service] += 1
# Track first occurrence key = (service, category) if key not in first_occurrences: first_occurrences[key] = log['timestamp'] break
# Find earliest error signals sorted_firsts = sorted( first_occurrences.items(), key=lambda x: x[1] )
return { 'error_categories': dict(error_categories), 'service_errors': dict(service_errors), 'timeline': [ { 'service': key[0], 'category': key[1], 'first_seen': ts } for key, ts in sorted_firsts[:10] ], 'probable_root': sorted_firsts[0] if sorted_firsts else None }Combining RCA Strategies
Section titled “Combining RCA Strategies”┌─────────────────────────────────────────────────────────────────┐│ COMPREHENSIVE RCA PIPELINE ││ ││ INPUT: Correlated Incident ││ ┌──────────────────────────────────────────────────────────┐ ││ │ - Alerting services: [A, B, C, D] │ ││ │ - Time range: 10:30 - 10:45 │ ││ │ - Severity: Critical │ ││ └──────────────────────────────────────────────────────────┘ ││ │ ││ PARALLEL ANALYSIS ▼ ││ ┌──────────┬──────────┬──────────┬──────────┐ ││ │ Topology │ Change │ Metric │ Log │ ││ │ RCA │ Correlat.│ Causality│ Analysis │ ││ │ │ │ │ │ ││ │Root: D │Deploy at │DB query │First err:│ ││ │Conf: 90% │10:25 │degraded │D connref │ ││ │ │Score: 85 │at 10:28 │at 10:29 │ ││ └────┬─────┴────┬─────┴────┬─────┴────┬─────┘ ││ │ │ │ │ ││ └──────────┴──────────┴──────────┘ ││ │ ││ SYNTHESIS ▼ ││ ┌──────────────────────────────────────────────────────────┐ ││ │ Root Cause Synthesizer │ ││ │ │ ││ │ Weighted voting: │ ││ │ - Service D: 4 signals │ ││ │ - Deployment: 3 signals │ ││ │ │ ││ │ CONCLUSION: Service D failure after 10:25 deployment │ ││ │ Suggest: Rollback to v2.3.0 │ ││ └──────────────────────────────────────────────────────────┘ ││ │└─────────────────────────────────────────────────────────────────┘class ComprehensiveRCA: """ Combine multiple RCA strategies for higher accuracy. """ def __init__(self, dependency_graph, change_tracker): self.topo_rca = DependencyRCA(dependency_graph) self.change_rca = ChangeCorrelationRCA() self.metric_rca = MetricCausalityAnalyzer() self.log_rca = LogBasedRCA()
def analyze(self, incident): """ Run all RCA strategies and synthesize results.
incident = { 'services': set of alerting services, 'start_time': datetime, 'metrics': dict of metrics data, 'logs': list of log entries } """ results = {}
# 1. Topology analysis topo_root, topo_conf, _ = self.topo_rca.find_root_cause( incident['services'] ) results['topology'] = { 'root': topo_root, 'confidence': topo_conf }
# 2. Change correlation changes = self.change_rca.find_related_changes( incident['start_time'], incident['services'] ) results['changes'] = changes[:3] # Top 3 changes
# 3. Metric causality if 'symptom_metric' in incident: causes = self.metric_rca.analyze_causality( incident['start_time'], incident['metrics'], incident['symptom_metric'] ) results['metrics'] = causes[:3]
# 4. Log analysis if 'logs' in incident: log_results = self.log_rca.analyze_logs( incident['logs'], incident['start_time'] ) results['logs'] = log_results
# Synthesize return self._synthesize(results)
def _synthesize(self, results): """ Combine results from all strategies.
Uses voting + confidence weighting. """ votes = Counter()
# Topology vote (high weight) if results.get('topology', {}).get('root'): root = results['topology']['root'] conf = results['topology']['confidence'] votes[root] += conf * 3 # Weight 3x
# Change correlation (if service specific) for change in results.get('changes', []): service = change['change']['service'] score = change['score'] / 100 # Normalize to 0-1 votes[service] += score * 2 # Weight 2x
# Log analysis log_results = results.get('logs', {}) if log_results.get('probable_root'): service = log_results['probable_root'][0][0] votes[service] += 1
# Get winner if votes: root_cause = votes.most_common(1)[0][0] total_votes = sum(votes.values()) confidence = votes[root_cause] / total_votes else: root_cause = None confidence = 0
return { 'root_cause': root_cause, 'confidence': confidence, 'evidence': results, 'suggestion': self._get_suggestion(root_cause, results) }
def _get_suggestion(self, root_cause, results): """Generate actionable suggestion.""" suggestions = []
# Check if deployment related for change in results.get('changes', []): if change['change']['type'] == 'deployment': suggestions.append(change['suggestion'])
if not suggestions: suggestions.append(f"Investigate {root_cause} for root cause")
return suggestionsCommon Mistakes
Section titled “Common Mistakes”| Mistake | Problem | Solution |
|---|---|---|
| Only using topology | Misses non-dependency causes | Add change correlation, metric analysis |
| Ignoring time ordering | Effects labeled as causes | Causes must precede effects |
| Stale dependency graph | Wrong root cause identification | Auto-update from traces |
| No change tracking | Can’t correlate with deployments | Track all changes, including config |
| Single strategy | Lower accuracy | Combine multiple strategies with voting |
| Overfitting to patterns | Missing new failure modes | Include log/metric analysis for unknowns |
1. Why is the "deepest alerting service" heuristic effective for topology-based RCA?
Answer: In service dependency graphs:
- Failures propagate upward (from dependencies to dependents)
- A database failure causes API failures, not vice versa
- The deepest service is furthest from leaf nodes
- If it’s alerting, it likely caused alerts in all services above it
This heuristic works because it follows the direction of causality in distributed systems.
2. How does change correlation complement topology-based RCA?
Answer: Topology-based RCA finds what failed. Change correlation finds why it failed:
- Topology: “Database is root cause”
- Change correlation: “Deployment 10 minutes ago changed connection pool settings”
Together they provide:
- Root cause identification (topology)
- Probable trigger (change)
- Remediation path (rollback deployment)
3. What's the key principle behind metric-based causal analysis?
Answer: Causes precede effects. If metric A changes before metric B:
- A might cause B
- Or they share a common cause
The analysis:
- Find when the symptom metric started degrading
- Find metrics that changed before the symptom
- Rank by lead time (shorter = more correlated)
This temporal ordering helps distinguish causes from symptoms.
4. Why should RCA systems combine multiple strategies rather than using one?
Answer: Different strategies catch different failure modes:
| Strategy | Best At | Misses |
|---|---|---|
| Topology | Cascading failures | Config issues |
| Change correlation | Deployment issues | Capacity failures |
| Metric analysis | Gradual degradation | Instant failures |
| Log analysis | Application errors | Infrastructure issues |
Combining with voting:
- Increases accuracy when strategies agree
- Provides confidence scores
- Handles cases where one strategy has incomplete data
Hands-On Exercise: Build an RCA System
Section titled “Hands-On Exercise: Build an RCA System”mkdir rca-system && cd rca-systempython -m venv venvsource venv/bin/activatepip install numpy pandasStep 1: Create Test Scenario
Section titled “Step 1: Create Test Scenario”from datetime import datetime, timedeltaimport random
def create_database_failure_scenario(): """ Simulate a database failure scenario with all data. """ incident_time = datetime(2024, 1, 15, 10, 30, 0)
# Dependency graph graph = { 'frontend': ['api-gateway'], 'api-gateway': ['user-service', 'order-service', 'product-service'], 'user-service': ['postgres', 'redis'], 'order-service': ['postgres', 'kafka'], 'product-service': ['postgres', 'elasticsearch'], 'postgres': [], 'redis': [], 'kafka': [], 'elasticsearch': [] }
# Alerting services (cascade from postgres) alerting_services = { 'postgres', 'user-service', 'order-service', 'product-service', 'api-gateway', 'frontend' }
# Changes (deployment 5 min before incident) changes = [ { 'timestamp': incident_time - timedelta(minutes=5), 'type': 'deployment', 'service': 'postgres', 'description': 'postgres: Update to version 15.2', 'user': 'deploy-bot', 'reversible': True }, { 'timestamp': incident_time - timedelta(hours=2), 'type': 'config', 'service': 'api-gateway', 'description': 'Increase timeout to 30s', 'user': 'jane@example.com', 'reversible': True } ]
# Logs showing progression logs = [] base_time = incident_time - timedelta(minutes=2)
# First signal: postgres logs.append({ 'timestamp': base_time, 'service': 'postgres', 'message': 'FATAL: connection limit exceeded for non-superuser' })
# Cascade for service in ['user-service', 'order-service', 'product-service']: logs.append({ 'timestamp': base_time + timedelta(seconds=30), 'service': service, 'message': 'Connection to postgres refused: too many connections' })
logs.append({ 'timestamp': base_time + timedelta(seconds=60), 'service': 'api-gateway', 'message': 'Upstream service timeout: user-service' })
logs.append({ 'timestamp': base_time + timedelta(seconds=90), 'service': 'frontend', 'message': 'API request failed: 504 Gateway Timeout' })
return { 'incident_time': incident_time, 'graph': graph, 'alerting_services': alerting_services, 'changes': changes, 'logs': logs, 'expected_root_cause': 'postgres', 'expected_change': 'postgres deployment' }Step 2: Implement RCA
Section titled “Step 2: Implement RCA”Use the classes from this module to implement a complete RCA:
from scenario import create_database_failure_scenariofrom datetime import timedeltafrom collections import Counter, defaultdict, deque
# Include the class implementations from this module here# (DependencyRCA, ChangeCorrelationRCA, LogBasedRCA)
class SimpleRCA: """Simplified comprehensive RCA for exercise."""
def __init__(self, graph): self.graph = graph self.reverse_graph = defaultdict(list) for svc, deps in graph.items(): for dep in deps: self.reverse_graph[dep].append(svc)
def find_root_by_topology(self, alerting): """Find deepest alerting service.""" def depth(service, visited=None): if visited is None: visited = set() if service in visited: return 0 visited.add(service) deps = self.graph.get(service, []) if not deps: return 0 return 1 + max(depth(d, visited.copy()) for d in deps)
if not alerting: return None return max(alerting, key=depth)
def correlate_changes(self, changes, incident_time, services): """Find changes related to incident.""" relevant = [] for change in changes: if change['timestamp'] > incident_time: continue if (incident_time - change['timestamp']) > timedelta(hours=1): continue
score = 0 if change['service'] in services: score += 50 if (incident_time - change['timestamp']) < timedelta(minutes=15): score += 30 if change['type'] == 'deployment': score += 20
if score > 0: relevant.append({'change': change, 'score': score})
return sorted(relevant, key=lambda x: x['score'], reverse=True)
def analyze_logs(self, logs, incident_time): """Find first error signals in logs.""" cutoff = incident_time - timedelta(minutes=10) relevant = [l for l in logs if l['timestamp'] >= cutoff] relevant.sort(key=lambda x: x['timestamp'])
if relevant: first = relevant[0] return { 'first_service': first['service'], 'first_message': first['message'], 'timestamp': first['timestamp'] } return None
def run_rca(self, scenario): """Run complete RCA.""" results = {}
# Topology root = self.find_root_by_topology(scenario['alerting_services']) results['topology_root'] = root
# Changes changes = self.correlate_changes( scenario['changes'], scenario['incident_time'], scenario['alerting_services'] ) results['related_changes'] = changes
# Logs log_analysis = self.analyze_logs( scenario['logs'], scenario['incident_time'] ) results['log_analysis'] = log_analysis
# Synthesize votes = Counter() if root: votes[root] += 3
for c in changes: votes[c['change']['service']] += c['score'] / 100 * 2
if log_analysis: votes[log_analysis['first_service']] += 2
if votes: winner = votes.most_common(1)[0] results['final_root_cause'] = winner[0] results['confidence'] = winner[1] / sum(votes.values()) else: results['final_root_cause'] = None results['confidence'] = 0
return results
def main(): scenario = create_database_failure_scenario()
rca = SimpleRCA(scenario['graph']) results = rca.run_rca(scenario)
print("=== RCA Results ===") print(f"Topology root cause: {results['topology_root']}") print()
print("Related changes:") for c in results['related_changes']: print(f" - {c['change']['description']} (score: {c['score']})") print()
if results['log_analysis']: print(f"First log signal: {results['log_analysis']['first_service']}") print(f" Message: {results['log_analysis']['first_message']}") print()
print(f"FINAL ROOT CAUSE: {results['final_root_cause']}") print(f"Confidence: {results['confidence']:.0%}") print()
# Verify expected = scenario['expected_root_cause'] if results['final_root_cause'] == expected: print(f"SUCCESS: Correctly identified {expected}") else: print(f"MISS: Expected {expected}, got {results['final_root_cause']}")
if __name__ == '__main__': main()Success Criteria
Section titled “Success Criteria”You’ve completed this exercise when:
- Created realistic failure scenario with all data types
- Implemented topology-based RCA
- Implemented change correlation
- Implemented log analysis
- Combined strategies with voting
- Correctly identified root cause in test scenario
Key Takeaways
Section titled “Key Takeaways”- Causes precede effects: Use temporal ordering in all analysis
- Deepest service wins: In topology analysis, root cause is deepest alerting node
- Change correlation is key: Most incidents follow changes—track everything
- Combine strategies: Multiple signals increase confidence
- Blast radius explains scope: Root cause should explain all affected services
- Automate the detective work: What humans do in hours, systems can do in seconds
Further Reading
Section titled “Further Reading”- Pearl’s Causal Inference — Theoretical foundation
- Google’s RCA at Scale — Practical approaches
- Uber’s Root Cause Analysis — Scale challenges
- Observability Engineering (Book) — Debug methodology
Summary
Section titled “Summary”Root Cause Analysis automates the detective work of incident response. By combining dependency graph analysis, change correlation, metric causality, and log analysis, AIOps systems can identify probable causes in seconds instead of hours.
The key insight: different strategies catch different failure modes. Combine them with confidence-weighted voting for higher accuracy. And remember—always verify the blast radius explains all symptoms.
Next Module
Section titled “Next Module”Continue to Module 6.5: Predictive Operations to learn how to forecast problems before they impact users.