Module 6.2: Anomaly Detection
Discipline Track | Complexity: [COMPLEX] | Time: 40-45 min
Prerequisites
Before starting this module:
- Module 6.1: AIOps Foundations — Core AIOps concepts
- Basic statistics (mean, standard deviation, percentiles)
- Understanding of time series data
- Python basics (for exercises)
What You’ll Be Able to Do
After completing this module, you will be able to:
- Implement anomaly detection models that identify unusual patterns in metrics, logs, and traces
- Design baseline learning algorithms that adapt to seasonal and trend-based operational patterns
- Configure alert thresholds using statistical methods that reduce false positives without missing real issues
- Evaluate anomaly detection approaches — statistical, ML-based, deep learning — against your data characteristics
Why This Module Matters
Traditional monitoring relies on static thresholds: “Alert if CPU > 80%.” But what’s normal? A 60% CPU spike at 3AM is concerning; the same spike during peak traffic is expected. Static thresholds generate noise during normal variations and miss gradual degradation.
Anomaly detection learns what “normal” looks like and alerts on deviations. It handles seasonality, trends, and the messy reality of production systems. This is the foundation of intelligent alerting—without it, you’re either drowning in false positives or missing real problems.
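To make the contrast concrete, here is a minimal sketch of a static rule next to a baseline learned per hour of day. The `history` values, the 3-sigma cutoff, and the function names are illustrative assumptions, not taken from any particular monitoring tool.

```python
from statistics import mean, pstdev

# Hypothetical CPU history (%), grouped by hour of day; illustrative values only.
history = {
    3: [12, 15, 11, 14, 13, 12, 16],    # 03:00 is normally quiet
    14: [58, 63, 61, 57, 65, 60, 62],   # 14:00 is peak traffic
}

STATIC_THRESHOLD = 80  # "Alert if CPU > 80%"


def static_alert(cpu):
    """Static rule: ignores when the reading was taken."""
    return cpu > STATIC_THRESHOLD


def baseline_alert(hour, cpu, k=3.0):
    """Learned rule: compare the reading with the baseline for that hour."""
    samples = history[hour]
    mu, sigma = mean(samples), pstdev(samples)
    return abs(cpu - mu) > k * sigma


for hour in (3, 14):
    print(hour, static_alert(60), baseline_alert(hour, 60))
```

The static rule stays silent for a 60% reading at both hours, while the per-hour baseline flags it at 03:00 and accepts it at 14:00.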
Did You Know?
- Netflix’s anomaly detection system analyzes over 2 billion data points per minute across their streaming infrastructure
- Classical outlier tests such as Grubbs’ test date back to the mid-20th century, but modern ML approaches enable real-time detection at scale
- Facebook’s Prophet was open-sourced in 2017 and became the go-to tool for time series forecasting with seasonality
- Concept drift (when “normal” changes over time) is one of the hardest problems in anomaly detection and a common reason detectors degrade in production
Understanding Anomalies
Types of Anomalies
[Diagrams: three value-over-time plots, summarized below.]
- POINT ANOMALY: a single outlier in an otherwise steady series.
- CONTEXTUAL ANOMALY: over a weekly cycle (Mon to Sun), a value that would be normal at peak appears in the wrong context.
- COLLECTIVE ANOMALY: the individual points are normal, but the sequence forms an unusual pattern.
When Static Thresholds Fail
SCENARIO: API LATENCY
Static threshold: Alert if latency > 500ms
PROBLEM 1: False Positives
[Chart: latency samples scattered around the threshold line.] Normal for this service during peak hours!
PROBLEM 2: Missed Degradation
[Chart: latency creeping upward from the normal baseline over Week 1 to Week 4 without crossing the threshold.] The slow creep never alerts until too late.
Statistical Approaches
Z-Score (Standard Deviation)
The simplest approach: how many standard deviations from the mean?

    import numpy as np

    def zscore_anomaly(values, threshold=3):
        """
        Detect anomalies using Z-score.

        Z = (x - mean) / std

        Rule of thumb:
        - |Z| > 2: Unusual (5% of normal data)
        - |Z| > 3: Anomaly (0.3% of normal data)
        """
        mean = np.mean(values)
        std = np.std(values)
        if std == 0:
            return []  # All values identical: nothing deviates

        anomalies = []
        for i, x in enumerate(values):
            z = (x - mean) / std
            if abs(z) > threshold:
                anomalies.append({
                    'index': i,
                    'value': x,
                    'z_score': z
                })
        return anomalies

    # Example
    latencies = [100, 105, 98, 102, 95, 103, 500, 101, 99, 104]
    anomalies = zscore_anomaly(latencies, threshold=2.5)
    # Detects: 500ms as anomaly (Z ≈ 3.0)
    # With threshold=3 this point is missed: the outlier inflates the mean and
    # std, and in a 10-point sample |Z| cannot exceed sqrt(n - 1) = 3.

Limitations: Assumes normal distribution, sensitive to outliers affecting mean/std.
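One common way to reduce that sensitivity is the modified Z-score, which replaces the mean and standard deviation with the median and the median absolute deviation (MAD). The sketch below is an illustration, not part of the module's own code; the function name is hypothetical, and the 0.6745 constant and 3.5 cutoff follow the usual Iglewicz-Hoaglin convention.

```python
from statistics import median


def modified_zscore_anomaly(values, threshold=3.5):
    """Flag points using the median and MAD instead of mean and std."""
    med = median(values)
    mad = median([abs(x - med) for x in values])
    if mad == 0:
        return []  # No spread around the median; the score is undefined

    anomalies = []
    for i, x in enumerate(values):
        score = 0.6745 * (x - med) / mad
        if abs(score) > threshold:
            anomalies.append({'index': i, 'value': x, 'score': score})
    return anomalies


latencies = [100, 105, 98, 102, 95, 103, 500, 101, 99, 104]
result = modified_zscore_anomaly(latencies)
print(result)
```

Because the median and MAD barely move when one extreme value is present, the 500 ms point gets a very large score while the ordinary readings stay well under the cutoff.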
Moving Average & Standard Deviation
Adapts to recent trends:

    import numpy as np
    from collections import deque

    class MovingAnomalyDetector:
        """
        Detect anomalies using moving statistics.

        Adapts to changing baselines while detecting sudden deviations.
        """
        def __init__(self, window_size=100, threshold=3):
            self.window = deque(maxlen=window_size)
            self.threshold = threshold

        def is_anomaly(self, value):
            if len(self.window) < 10:  # Need minimum data
                self.window.append(value)
                return False, 0

            mean = np.mean(self.window)
            std = np.std(self.window)

            # Prevent division by zero
            if std == 0:
                std = 0.001

            z = (value - mean) / std
            is_anomaly = abs(z) > self.threshold

            # Only add non-anomalies to window. This keeps spikes out of the
            # baseline, but a lasting level shift will keep alerting.
            if not is_anomaly:
                self.window.append(value)

            return is_anomaly, z

    # Usage
    # stream_of_latencies and alert() stand in for your metric source and notifier
    detector = MovingAnomalyDetector(window_size=100, threshold=3)
    for latency in stream_of_latencies:
        is_anomaly, score = detector.is_anomaly(latency)
        if is_anomaly:
            alert(f"Anomaly detected: {latency}ms (score: {score})")

Interquartile Range (IQR)
More robust to outliers than Z-score:

    import numpy as np

    def iqr_anomaly(values, k=1.5):
        """
        Detect anomalies using IQR method.

        Bounds: [Q1 - k*IQR, Q3 + k*IQR]

        k=1.5: Standard outlier detection
        k=3.0: Extreme outlier detection
        """
        q1 = np.percentile(values, 25)
        q3 = np.percentile(values, 75)
        iqr = q3 - q1

        lower_bound = q1 - k * iqr
        upper_bound = q3 + k * iqr

        anomalies = []
        for i, x in enumerate(values):
            if x < lower_bound or x > upper_bound:
                anomalies.append({
                    'index': i,
                    'value': x,
                    'bounds': (lower_bound, upper_bound)
                })
        return anomalies

Handling Seasonality
Section titled “Handling Seasonality”Real systems have patterns: daily cycles, weekly cycles, monthly variations.
DAILY SEASONALITY
[Chart: requests/sec over one day, 00:00 to 00:00, lowest at night and highest during peak hours.]
With seasonality awareness:
- 3AM traffic = normal
- Peak hour traffic = normal
- 3AM traffic at peak levels = ANOMALY!
Seasonal Decomposition

    from statsmodels.tsa.seasonal import seasonal_decompose
    import pandas as pd
    import numpy as np

    def detect_with_seasonality(timeseries, period=24, threshold=3):
        """
        Detect anomalies accounting for seasonality.

        1. Decompose into trend + seasonal + residual
        2. Anomaly detection on residuals only
        """
        # Decompose
        decomposition = seasonal_decompose(
            timeseries,
            model='additive',
            period=period
        )

        # Detect anomalies in residuals
        residuals = decomposition.resid.dropna()
        mean = residuals.mean()
        std = residuals.std()

        z_scores = (residuals - mean) / std
        anomalies = abs(z_scores) > threshold

        return anomalies

    # Example usage
    # Hourly data with daily seasonality (period=24).
    # daily_pattern_with_anomalies is a placeholder for 168 hourly values.
    df = pd.DataFrame({
        'timestamp': pd.date_range('2024-01-01', periods=168, freq='H'),
        'requests': daily_pattern_with_anomalies
    })
    df.set_index('timestamp', inplace=True)

    anomalies = detect_with_seasonality(df['requests'], period=24)

Prophet for Forecasting
Facebook’s Prophet handles multiple seasonalities automatically:

    from prophet import Prophet
    import pandas as pd

    def prophet_anomaly_detection(df, sensitivity=0.95):
        """
        Use Prophet to detect anomalies.

        Prophet models:
        - Trend
        - Weekly seasonality
        - Daily seasonality
        - Holiday effects

        Anomalies = points outside confidence interval
        """
        # Prophet requires 'ds' (datestamp) and 'y' (value) columns
        model = Prophet(
            interval_width=sensitivity,  # Confidence interval
            daily_seasonality=True,
            weekly_seasonality=True
        )

        model.fit(df)

        # Predict on same data to get expected ranges
        forecast = model.predict(df)

        # Anomaly = actual outside confidence bounds
        df['yhat'] = forecast['yhat']
        df['yhat_lower'] = forecast['yhat_lower']
        df['yhat_upper'] = forecast['yhat_upper']

        df['anomaly'] = (df['y'] < df['yhat_lower']) | (df['y'] > df['yhat_upper'])

        return df

    # Usage
    df = pd.DataFrame({
        'ds': timestamps,
        'y': metric_values
    })

    results = prophet_anomaly_detection(df, sensitivity=0.99)
    anomalies = results[results['anomaly']]

Machine Learning Approaches
Isolation Forest
Efficient for high-dimensional data:

    from sklearn.ensemble import IsolationForest
    import numpy as np

    def isolation_forest_detection(data, contamination=0.01):
        """
        Isolation Forest anomaly detection.

        Key insight: Anomalies are easier to isolate.
        - Normal points need many splits to isolate
        - Anomalies need few splits

        contamination: Expected fraction of anomalies
        """
        model = IsolationForest(
            contamination=contamination,
            random_state=42,
            n_estimators=100
        )

        # Fit and predict (-1 = anomaly, 1 = normal)
        predictions = model.fit_predict(data)

        # Get anomaly scores (lower = more anomalous)
        scores = model.score_samples(data)

        return predictions, scores

    # Multi-dimensional example
    # Detect anomalies considering latency AND error_rate together
    data = np.column_stack([latencies, error_rates])
    predictions, scores = isolation_forest_detection(data)

    anomalies = data[predictions == -1]

LSTM Autoencoders
For sequence patterns (time series):

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense, RepeatVector, TimeDistributed

    def create_lstm_autoencoder(sequence_length, n_features):
        """
        LSTM Autoencoder for time series anomaly detection.

        Architecture:
        1. Encoder: Compress sequence to latent representation
        2. Decoder: Reconstruct sequence from latent
        3. Anomaly = High reconstruction error
        """
        model = Sequential([
            # Encoder
            LSTM(64, activation='relu', input_shape=(sequence_length, n_features),
                 return_sequences=True),
            LSTM(32, activation='relu', return_sequences=False),

            # Latent space
            RepeatVector(sequence_length),

            # Decoder
            LSTM(32, activation='relu', return_sequences=True),
            LSTM(64, activation='relu', return_sequences=True),
            TimeDistributed(Dense(n_features))
        ])

        model.compile(optimizer='adam', loss='mse')
        return model

    def detect_anomalies(model, sequences, train_sequences, threshold_percentile=99):
        """
        Detect anomalies based on reconstruction error.
        """
        def reconstruction_error(batch):
            # Mean squared reconstruction error per sequence
            reconstructions = model.predict(batch)
            return np.mean(np.power(batch - reconstructions, 2), axis=(1, 2))

        mse = reconstruction_error(sequences)

        # Threshold based on training error distribution
        train_mse = reconstruction_error(train_sequences)
        threshold = np.percentile(train_mse, threshold_percentile)

        anomalies = mse > threshold
        return anomalies, mse

    # Usage
    # Prepare sequences (sliding window); create_sequences is a helper you
    # provide, and the result is split into sequences_train / sequences_test
    sequence_length = 24  # e.g., 24 hours
    sequences = create_sequences(timeseries_data, sequence_length)

    # Train on normal data
    model = create_lstm_autoencoder(sequence_length, n_features=1)
    model.fit(sequences_train, sequences_train, epochs=50, batch_size=32)

    # Detect
    anomalies, scores = detect_anomalies(model, sequences_test, sequences_train)

Handling Concept Drift
“Normal” changes over time. Traffic grows, code changes, user behavior evolves.
CONCEPT DRIFT
[Chart: latency from Jan to Jun, drifting gradually from the original normal level up to a higher new normal.]
WITHOUT drift handling: May traffic flagged as anomalous
WITH drift handling: Model adapts, only sudden changes alert
Adaptive Detection

    import numpy as np

    class AdaptiveAnomalyDetector:
        """
        Anomaly detector that adapts to concept drift.

        Strategies:
        1. Sliding window - only recent data matters
        2. Exponential decay - older data weighted less
        3. Explicit retraining - periodic model updates

        Timestamps are expected in seconds (e.g., Unix epoch).
        """
        def __init__(self, window_days=7, retrain_interval_hours=24,
                     threshold=3, min_samples=30):
            self.window_days = window_days
            self.retrain_interval = retrain_interval_hours * 3600
            self.threshold = threshold
            self.min_samples = min_samples  # History needed before training
            self.last_retrain = None
            self.mean = None
            self.std = None
            self.data_buffer = []

        def maybe_retrain(self, current_time):
            """Retrain if enough time has passed."""
            if self.last_retrain is None:
                return True
            return (current_time - self.last_retrain) > self.retrain_interval

        def train(self, data, timestamp):
            """Train on recent data only."""
            # Filter to window
            cutoff = timestamp - (self.window_days * 86400)
            values = [d['value'] for d in data if d['timestamp'] > cutoff]
            if len(values) < self.min_samples:
                return  # Not enough history yet; retry on the next call

            self.mean = np.mean(values)
            self.std = np.std(values)
            self.last_retrain = timestamp

        def is_anomaly(self, value, timestamp):
            """Check if value is anomalous, retraining if needed."""
            if self.maybe_retrain(timestamp):
                self.train(self.data_buffer, timestamp)

            # Without a usable baseline, report "normal" but keep collecting
            if self.std is None or self.std == 0:
                is_anomaly, z = False, 0
            else:
                z = (value - self.mean) / self.std
                is_anomaly = abs(z) > self.threshold

            # Store for future training
            self.data_buffer.append({
                'value': value,
                'timestamp': timestamp,
                'anomaly': is_anomaly
            })

            # Trim old data
            cutoff = timestamp - (self.window_days * 2 * 86400)
            self.data_buffer = [
                d for d in self.data_buffer
                if d['timestamp'] > cutoff
            ]

            return is_anomaly, z

Real-Time Detection Architecture
REAL-TIME ANOMALY DETECTION
[Architecture diagram: four stages, top to bottom.]
- DATA INGESTION: Kafka / Kinesis / Pulsar, carrying Metrics, Logs, Traces, and Events
- STREAM PROCESSING: Flink / Kafka Streams / Apache Beam, with Windowing (tumbling, sliding), then Feature Extraction, then Detection Models
- MODEL SERVING: Model Registry holding Statistical models (Z-score, IQR) and ML Models (Isolation Forest, LSTM)
- OUTPUT: Alerts (PagerDuty), Dashboard (Grafana), Storage (Archive)
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Ignoring seasonality | False positives during normal patterns | Use Prophet or seasonal decomposition |
| Training on anomalies | Model learns anomalies as normal | Filter anomalies from training data |
| Single threshold for all | Different metrics have different profiles | Per-metric or per-service thresholds |
| No concept drift handling | Model becomes stale, accuracy drops | Periodic retraining, sliding windows |
| Alerting on all anomalies | Not all anomalies matter | Correlation, severity scoring |
| Univariate only | Misses multi-variate patterns | Isolation Forest, multi-variate LSTM |
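As an illustration of the "per-metric thresholds" row, the sketch below learns a separate upper threshold for each metric from its own recent history. The metric names, sample values, and the mean + 3 * std rule are assumptions chosen for the example.

```python
from statistics import mean, pstdev

# Hypothetical recent history per metric; names and values are illustrative.
history = {
    "api_latency_ms": [100, 105, 98, 102, 95, 103, 101, 99, 104, 100],
    "error_rate_pct": [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.2, 0.1, 0.2],
}


def learn_thresholds(history, k=3.0):
    """Return an upper alert threshold (mean + k * std) for each metric."""
    return {
        name: mean(values) + k * pstdev(values)
        for name, values in history.items()
    }


def is_anomalous(name, value, thresholds):
    """Check a reading against the threshold learned for that metric."""
    return value > thresholds[name]


thresholds = learn_thresholds(history)
print(thresholds)
print(is_anomalous("api_latency_ms", 5, thresholds))  # 5 ms latency
print(is_anomalous("error_rate_pct", 5, thresholds))  # 5% error rate
```

The same numeric value, 5, is unremarkable for latency but far above the learned threshold for error rate, which is why one shared threshold does not work across metrics.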
1. Why do static thresholds fail for modern systems?
Answer: Static thresholds fail because:
- No seasonality awareness: 3AM traffic differs from noon traffic
- Miss gradual degradation: Slow creep never crosses threshold until it’s too late
- Can’t adapt: Traffic grows, systems change, thresholds become stale
- One size doesn’t fit: Different services have different profiles
ML-based detection learns “normal” dynamically and alerts on deviations.
2. When would you use Isolation Forest vs. LSTM Autoencoder?
Answer:
Isolation Forest:
- High-dimensional data (many metrics)
- Point anomalies
- Faster training and inference
- No temporal dependencies
LSTM Autoencoder:
- Sequential patterns matter (time series)
- Detecting unusual sequences, not just points
- Complex temporal dependencies
- More data and compute needed
Rule of thumb: Start with Isolation Forest; upgrade to LSTM if sequence patterns matter.
3. What is concept drift and why does it matter?
Answer: Concept drift is when “normal” changes over time:
- Traffic patterns evolve
- Code deployments change behavior
- User behavior shifts seasonally
Why it matters: Models trained on old data become stale. What was anomalous becomes normal; what was normal becomes anomalous.
Solutions: Sliding windows, periodic retraining, adaptive algorithms that update continuously.
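A minimal sketch of a continuously updating detector, assuming an exponentially weighted mean and variance as the baseline. The class name, `alpha`, and the warm-up length are illustrative choices, not values from the module.

```python
class EwmaDetector:
    """Exponentially weighted baseline: older observations count less."""

    def __init__(self, alpha=0.2, threshold=3.0, warmup=10):
        self.alpha = alpha          # Weight given to the newest observation
        self.threshold = threshold  # Allowed deviation in standard deviations
        self.warmup = warmup        # Observations to absorb before scoring
        self.mean = None
        self.var = 0.0
        self.count = 0

    def update(self, value):
        """Score the value against the current baseline, then fold it in."""
        if self.mean is None:
            self.mean = float(value)
            self.count = 1
            return False

        diff = value - self.mean
        std = self.var ** 0.5
        is_anomaly = (
            self.count >= self.warmup
            and std > 0
            and abs(diff) > self.threshold * std
        )

        # Incremental update of the exponentially weighted mean and variance
        incr = self.alpha * diff
        self.mean += incr
        self.var = (1 - self.alpha) * (self.var + diff * incr)
        self.count += 1
        return is_anomaly


# Steady traffic around 101, then a lasting shift to 150 (a "new normal")
stream = [100, 102] * 10 + [150] * 60
detector = EwmaDetector()
flags = [detector.update(v) for v in stream]
print(flags.index(True), sum(flags))
```

In this run only the first point after the shift is flagged. Every value is folded into the baseline, so the detector treats 150 as normal soon afterwards. The cost of that design is that a sustained incident is also absorbed quickly, so `alpha` needs tuning per metric.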
4. How do you handle seasonality in anomaly detection?
Answer: Three approaches:
- Seasonal decomposition: Separate trend + seasonal + residual; detect anomalies in residuals only
- Prophet: Automatically handles multiple seasonalities (daily, weekly, yearly)
- Comparison windows: Compare to same hour last week, same day last month
Without seasonality handling, you’ll alert on every Monday morning traffic spike.
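The comparison-window approach can be sketched in a few lines, assuming hourly data and a relative-change tolerance. The function name, the 50% tolerance, and the synthetic traffic are illustrative assumptions.

```python
def week_over_week_anomalies(values, period=168, tolerance=0.5):
    """
    Compare each hourly value with the value one period earlier.

    period: points per cycle (168 = hours in a week)
    tolerance: allowed relative change before flagging (0.5 = 50%)
    """
    anomalies = []
    for i in range(period, len(values)):
        previous = values[i - period]
        if previous == 0:
            continue  # Relative change is undefined against a zero baseline
        change = (values[i] - previous) / previous
        if abs(change) > tolerance:
            anomalies.append({'index': i, 'value': values[i], 'change': change})
    return anomalies


# Two weeks of hypothetical hourly traffic: quiet nights, busy days
week = [50 if hour % 24 < 6 else 200 for hour in range(168)]
traffic = week + week
traffic[168 + 3] = 200  # Peak-level traffic at 03:00 in the second week

result = week_over_week_anomalies(traffic)
print(result)
```

The regular daily rise from 50 to 200 is never flagged, because each hour is compared only with the same hour a week earlier. One caveat: if last week's value was itself anomalous, the comparison is misleading, so in practice a median over several previous weeks may be a safer reference.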
Hands-On Exercise: Build an Anomaly Detector
Build a production-ready anomaly detector with seasonality awareness:

    mkdir anomaly-detector && cd anomaly-detector
    python -m venv venv
    source venv/bin/activate
    pip install numpy pandas scikit-learn matplotlib

Step 1: Generate Synthetic Data with Seasonality

    import numpy as np
    import pandas as pd
    from datetime import datetime, timedelta

    def generate_realistic_metrics(days=30, points_per_day=24):
        """
        Generate realistic server metrics with:
        - Daily seasonality (business hours)
        - Weekly seasonality (weekdays vs weekends)
        - Trend (gradual growth)
        - Noise
        - Injected anomalies
        """
        np.random.seed(42)

        total_points = days * points_per_day
        timestamps = [
            datetime(2024, 1, 1) + timedelta(hours=i)
            for i in range(total_points)
        ]

        values = []
        anomaly_labels = []

        for i, ts in enumerate(timestamps):
            # Base value
            base = 100

            # Daily seasonality (peak at 12:00, low at 00:00)
            hour = ts.hour
            daily_factor = 50 * np.sin((hour - 6) * np.pi / 12)

            # Weekly seasonality (lower on weekends)
            weekday = ts.weekday()
            weekly_factor = -30 if weekday >= 5 else 0

            # Trend (+0.01 per hour, roughly 1.7 units per week)
            trend = i * 0.01

            # Noise
            noise = np.random.normal(0, 5)

            value = base + daily_factor + weekly_factor + trend + noise
            is_anomaly = False

            # Inject anomalies (2% of points)
            if np.random.random() < 0.02:
                # Spike anomaly
                value += np.random.choice([-1, 1]) * np.random.uniform(50, 100)
                is_anomaly = True

            values.append(max(0, value))  # No negative values
            anomaly_labels.append(is_anomaly)

        return pd.DataFrame({
            'timestamp': timestamps,
            'value': values,
            'is_anomaly': anomaly_labels
        })

    if __name__ == "__main__":
        df = generate_realistic_metrics()
        df.to_csv('metrics.csv', index=False)
        print(f"Generated {len(df)} data points with {df['is_anomaly'].sum()} anomalies")

Step 2: Build the Detector

    import numpy as np
    import pandas as pd
    from collections import deque

    class SeasonalAnomalyDetector:
        """
        Anomaly detector with:
        - Hourly seasonality awareness
        - Adaptive baseline (sliding window)
        - Multiple detection methods
        """

        def __init__(self,
                     window_hours=168,  # 1 week
                     zscore_threshold=3,
                     iqr_multiplier=1.5):
            self.window_hours = window_hours
            self.zscore_threshold = zscore_threshold
            self.iqr_multiplier = iqr_multiplier

            # Store data by hour for seasonality
            self.hourly_data = {h: deque(maxlen=window_hours // 24) for h in range(24)}

        def _get_hourly_stats(self, hour):
            """Get statistics for specific hour."""
            data = list(self.hourly_data[hour])
            if len(data) < 3:
                return None, None
            return np.mean(data), np.std(data)

        def _zscore(self, value, mean, std):
            """Calculate Z-score."""
            if std == 0:
                return 0
            return (value - mean) / std

        def _iqr_bounds(self, hour):
            """Calculate IQR bounds for hour."""
            data = list(self.hourly_data[hour])
            if len(data) < 4:
                return None, None
            q1 = np.percentile(data, 25)
            q3 = np.percentile(data, 75)
            iqr = q3 - q1
            return q1 - self.iqr_multiplier * iqr, q3 + self.iqr_multiplier * iqr

        def detect(self, timestamp, value):
            """
            Detect if value is anomalous.

            Returns: (is_anomaly, score, reason)
            """
            hour = timestamp.hour

            # Get seasonal statistics
            mean, std = self._get_hourly_stats(hour)
            lower, upper = self._iqr_bounds(hour)

            is_anomaly = False
            score = 0
            reasons = []

            # Z-score check (if enough data)
            if mean is not None:
                z = self._zscore(value, mean, std)
                score = abs(z)
                if abs(z) > self.zscore_threshold:
                    is_anomaly = True
                    reasons.append(f"Z-score: {z:.2f}")

            # IQR check (if enough data)
            if lower is not None:
                if value < lower or value > upper:
                    is_anomaly = True
                    reasons.append(f"Outside IQR: [{lower:.1f}, {upper:.1f}]")

            # Update model (only with non-anomalies to avoid pollution)
            if not is_anomaly:
                self.hourly_data[hour].append(value)

            return is_anomaly, score, "; ".join(reasons) if reasons else "Normal"

        def warmup(self, df):
            """Warm up detector with historical data."""
            for _, row in df.iterrows():
                hour = row['timestamp'].hour
                self.hourly_data[hour].append(row['value'])

    # Run detection
    if __name__ == "__main__":
        df = pd.read_csv('metrics.csv', parse_dates=['timestamp'])

        # Split: first 2 weeks for warmup, rest for testing
        warmup_df = df[df['timestamp'] < df['timestamp'].min() + pd.Timedelta(days=14)]
        test_df = df[df['timestamp'] >= df['timestamp'].min() + pd.Timedelta(days=14)]

        detector = SeasonalAnomalyDetector()
        detector.warmup(warmup_df)

        # Test
        results = []
        for _, row in test_df.iterrows():
            is_anomaly, score, reason = detector.detect(row['timestamp'], row['value'])
            results.append({
                'timestamp': row['timestamp'],
                'value': row['value'],
                'actual_anomaly': row['is_anomaly'],
                'detected_anomaly': is_anomaly,
                'score': score,
                'reason': reason
            })

        results_df = pd.DataFrame(results)

        # Calculate metrics
        true_positives = ((results_df['actual_anomaly'] == True) &
                          (results_df['detected_anomaly'] == True)).sum()
        false_positives = ((results_df['actual_anomaly'] == False) &
                           (results_df['detected_anomaly'] == True)).sum()
        false_negatives = ((results_df['actual_anomaly'] == True) &
                           (results_df['detected_anomaly'] == False)).sum()

        precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
        recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0

        print(f"Results:")
        print(f"  True Positives: {true_positives}")
        print(f"  False Positives: {false_positives}")
        print(f"  False Negatives: {false_negatives}")
        print(f"  Precision: {precision:.2%}")
        print(f"  Recall: {recall:.2%}")

        results_df.to_csv('detection_results.csv', index=False)

Success Criteria
You’ve completed this exercise when:
- Generated synthetic metrics with seasonality
- Built detector with hourly baseline awareness
- Achieved > 50% precision (reduce false positives)
- Achieved > 60% recall (catch real anomalies)
- Understood trade-offs between precision and recall
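To see the precision/recall trade-off directly, it can help to sweep the detection threshold over a set of scored points. The sketch below uses hypothetical scores and labels rather than the exercise output, and the helper name is an assumption.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every score above `threshold` is flagged."""
    detected = [s > threshold for s in scores]
    tp = sum(d and l for d, l in zip(detected, labels))
    fp = sum(d and not l for d, l in zip(detected, labels))
    fn = sum(l and not d for d, l in zip(detected, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical anomaly scores (|Z|) with ground-truth labels
scores = [0.5, 1.2, 2.4, 2.8, 3.1, 3.6, 4.5, 6.0]
labels = [False, False, False, True, False, True, True, True]

for threshold in (2.0, 3.0, 4.0):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold from 2.0 to 4.0 lifts precision from about 0.67 to 1.00 while recall falls from 1.00 to 0.50. The same kind of sweep can be run over the `score` column in `detection_results.csv` when tuning `zscore_threshold`.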
Key Takeaways
- Static thresholds don’t scale: Learn dynamic baselines instead
- Seasonality matters: Compare to same hour/day, not global average
- Multiple methods: Combine Z-score, IQR for robustness
- Adapt to drift: Sliding windows, periodic retraining
- Start simple: Statistical methods before ML; ML only when needed
- Precision vs. recall: Optimize for your use case (alert fatigue vs. missed incidents)
Further Reading
- Facebook Prophet Paper — Time series forecasting
- Isolation Forest Paper — Original algorithm
- Netflix Anomaly Detection — Scale challenges
- Luminaire — Zillow’s anomaly detection library
Summary
Anomaly detection is the foundation of intelligent alerting. By learning what “normal” looks like—accounting for seasonality, trends, and drift—you can detect problems without predefined thresholds. Start with statistical methods (Z-score, IQR), add seasonality awareness, and only graduate to ML when patterns require it.
The goal isn’t zero false positives—it’s actionable alerts that surface real problems early.
Next Module
Continue to Module 6.3: Event Correlation to learn how to group thousands of alerts into actionable incidents.