XGBoost & Gradient Boosting
AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6
What You’ll Be Able to Do
By the end of this module, you will:
- Understand time series fundamentals (stationarity, seasonality, trends)
- Master classical forecasting methods (ARIMA, Prophet)
- Build deep learning time series models (LSTM, Transformers)
- Implement temporal feature engineering
- Build anomaly detection systems for time series data
Why Time Series Matters
Imagine you’re a weather forecaster in ancient Egypt, watching the Nile’s water levels. Each year, the river floods, bringing life-giving water to crops. But when will it flood? How high will it rise? Your entire civilization depends on predicting a pattern that repeats over time.
This is time series analysis - finding patterns in sequential data to predict the future. And it’s everywhere:
Where Time Series Lives:
├── Finance
│   ├── Stock prices (hourly, daily)
│   ├── Trading volumes
│   └── Economic indicators
├── Operations
│   ├── Server load prediction
│   ├── Inventory forecasting
│   └── Energy demand
├── IoT & Sensors
│   ├── Temperature monitoring
│   ├── Equipment vibration
│   └── Network traffic
└── Business
    ├── Sales forecasting
    ├── Customer demand
    └── Resource planning

Did You Know? Amazon’s demand forecasting system reportedly processes over 300 million time series daily. Each product in each warehouse is a separate time series. Improving forecast accuracy by just 1% reportedly saved them over $100 million annually in inventory costs!
The Anatomy of Time Series
What Makes Time Series Special?
Unlike regular tabular data where rows are independent, time series has a crucial property: temporal dependency. Today’s value depends on yesterday’s value, which depends on the day before, and so on.
Regular Data vs Time Series:
REGULAR TABULAR DATA:
┌─────────────────────────────────────┐
│ Row 1: [features] → [label]         │  Each row is independent
│ Row 2: [features] → [label]         │  Order doesn't matter
│ Row 3: [features] → [label]         │  Shuffle = same model
└─────────────────────────────────────┘

TIME SERIES DATA:
┌─────────────────────────────────────┐
│ t=1: value₁ ────────────┐           │
│ t=2: value₂ ←───────────┤           │  Each depends on past
│ t=3: value₃ ←───────────┤           │  Order is EVERYTHING
│ t=4: value₄ ←───────────┘           │  Shuffle = destroy data
└─────────────────────────────────────┘

The Three Components of Time Series
Every time series can be decomposed into three fundamental components:
TIME SERIES = TREND + SEASONALITY + RESIDUAL
            Original Signal
                  │
  ┌───────────────┼───────────────┐
  ▼               ▼               ▼
TREND        SEASONALITY       RESIDUAL
(long-term)  (repeating)    (random noise)

Example: Retail Sales
─────────────────────
TREND: Sales growing 5% per year (business expansion)
SEASONALITY: Spikes every December (holiday shopping)
RESIDUAL: Random day-to-day variation

The Grocery Store Analogy: Imagine tracking daily milk sales:
- Trend: Sales slowly increasing as neighborhood population grows
- Seasonality: Higher on weekends (families cook more), lower mid-week
- Payday pattern: People buy more on payday (biweekly), a second seasonal cycle on top of the weekly one
- Residual: Random - maybe a recipe went viral on TikTok today
Stationarity: The Foundation of Forecasting
What is Stationarity?
A time series is stationary if its statistical properties don’t change over time. Think of it like a calm lake versus a river flowing to the ocean.

STATIONARY (Like a Calm Lake):
─────────────────────────────
Statistical properties stay constant over time.
Mean: μ ≈ constant
Variance: σ² ≈ constant
Autocorrelation: same pattern

[Diagram: a series oscillating around a constant level]

NON-STATIONARY (Like a River to Ocean):
───────────────────────────────────────
Properties change over time.
Mean: drifting up or down
Variance: expanding or contracting

[Diagram: a series oscillating around a drifting level]

Why Stationarity Matters
Most classical forecasting methods require stationarity. Here’s why:
# Non-stationary: Yesterday's patterns don't apply to tomorrow
# Because the underlying process is changing!

# Example: Stock price in 2020 vs 2024
# - Different economic conditions
# - Different company size
# - Different market sentiment
# Can't just extrapolate!

# Solution: Make it stationary through DIFFERENCING
# Instead of predicting price, predict CHANGE in price
#
# price_t → change_t = price_t - price_{t-1}

Testing for Stationarity: The ADF Test
The Augmented Dickey-Fuller (ADF) test tells us if a series is stationary:
ADF TEST INTERPRETATION:
────────────────────────

H₀ (Null): Series has a unit root (NON-stationary)
H₁ (Alt): Series is stationary

p-value < 0.05 → Reject H₀ → Evidence that the series IS stationary
p-value ≥ 0.05 → Fail to reject H₀ → No evidence of stationarity; treat the series as NON-stationary

Example Results:
────────────────
Raw stock prices: p = 0.87 → Non-stationary (expected!)
First difference: p = 0.001 → Stationary (good for ARIMA)
Log returns: p = 0.001 → Stationary (finance standard)

Did You Know? The unit root concept comes from the characteristic equation of an AR process. If the root equals 1 (a “unit root”), shocks to the system persist forever rather than dying out. This is why stock prices are non-stationary - a $10 increase today doesn’t mean it’ll drop $10 tomorrow. The change is permanent!
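The two transformations named in the example results, first differences and log returns, can be sketched with pandas. This is a minimal illustration on a simulated price series (the data, seed, and variable names are assumptions, not from the module); a formal check would run an ADF test such as `statsmodels.tsa.stattools.adfuller` on each series.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical price series: a geometric random walk (illustrative data only)
log_returns_true = rng.normal(loc=0.0005, scale=0.01, size=1000)
prices = pd.Series(100 * np.exp(np.cumsum(log_returns_true)), name="price")

# First difference: change_t = price_t - price_{t-1}
first_diff = prices.diff().dropna()

# Log returns: log(price_t) - log(price_{t-1})
log_returns = np.log(prices).diff().dropna()

def half_means(series):
    """Mean of the first and second half: a crude check for a drifting level."""
    mid = len(series) // 2
    return series.iloc[:mid].mean(), series.iloc[mid:].mean()

# A stationary series should show similar means in both halves;
# the raw price level usually does not.
print("prices     :", half_means(prices))
print("first diff :", half_means(first_diff))
print("log returns:", half_means(log_returns))
```

Comparing half-sample means is only an eyeball check. It shows why the differenced series is the one handed to a model such as ARIMA, but it does not replace a unit root test.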
Classical Methods: ARIMA
The ARIMA Family
ARIMA stands for AutoRegressive Integrated Moving Average. It’s the Swiss Army knife of time series forecasting.
ARIMA COMPONENTS:
─────────────────

AR (AutoRegressive) - p:
  "Today depends on recent past values"
  y_t = c + φ₁·y_{t-1} + φ₂·y_{t-2} + ... + ε_t

  Like: "Temperature today ≈ 0.9 × temperature yesterday"

I (Integrated) - d:
  "How many times we difference to make stationary"
  d=0: Already stationary
  d=1: First difference (y_t - y_{t-1})
  d=2: Second difference (rarely needed)

  Like: "Don't predict price, predict price CHANGE"

MA (Moving Average) - q:
  "Today depends on recent forecast errors"
  y_t = c + ε_t + θ₁·ε_{t-1} + θ₂·ε_{t-2} + ...

  Like: "I was wrong by X yesterday, adjust for that"

ARIMA(p, d, q) notation:
────────────────────────
ARIMA(1, 0, 0) = AR(1) - Simple autoregressive
ARIMA(0, 1, 1) = IMA(1,1) - Random walk with MA
ARIMA(1, 1, 1) = Common balanced model
ARIMA(5, 1, 0) = AR with 5 lags, differenced once

Seasonal ARIMA: SARIMA
For data with seasonality (most real data!), we use SARIMA:
SARIMA(p, d, q)(P, D, Q, s)
  (p, d, q)    → Non-seasonal components
  (P, D, Q, s) → Seasonal components

s = seasonal period (12 for monthly, 7 for daily with weekly pattern)

Example: Monthly sales with yearly seasonality
SARIMA(1, 1, 1)(1, 1, 1, 12)
  p = 1  → Non-seasonal AR(1)
  d = 1  → Non-seasonal difference
  q = 1  → Non-seasonal MA(1)
  P = 1  → Seasonal AR(1)
  D = 1  → Seasonal difference (year-over-year)
  Q = 1  → Seasonal MA(1)
  s = 12 → 12-month seasonal period

How to Choose ARIMA Parameters
The traditional approach uses ACF and PACF plots:
ACF (Autocorrelation Function):
───────────────────────────────
Correlation between y_t and y_{t-k} for all lags k

Interpretation:
- Slow decay → Non-stationary (need differencing)
- Cuts off after lag q → MA(q) model
- Decays exponentially → AR model

PACF (Partial Autocorrelation Function):
────────────────────────────────────────
Correlation between y_t and y_{t-k} AFTER removing
effect of intermediate lags

Interpretation:
- Cuts off after lag p → AR(p) model
- Decays exponentially → MA model

CHOOSING p AND q:
─────────────────
┌─────────────┬──────────────┬──────────────┐
│ Pattern     │ ACF          │ PACF         │
├─────────────┼──────────────┼──────────────┤
│ AR(p)       │ Exponential  │ Cuts off     │
│             │ decay        │ after lag p  │
├─────────────┼──────────────┼──────────────┤
│ MA(q)       │ Cuts off     │ Exponential  │
│             │ after lag q  │ decay        │
├─────────────┼──────────────┼──────────────┤
│ ARMA(p,q)   │ Exponential  │ Exponential  │
│             │ decay        │ decay        │
└─────────────┴──────────────┴──────────────┘

Did You Know? Box and Jenkins developed the ARIMA methodology in 1970, and it remained the gold standard for forecasting for over 40 years. Their book “Time Series Analysis: Forecasting and Control” has been cited over 60,000 times. George Box famously said, “All models are wrong, but some are useful.”
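To see the "exponential decay" row of the table in numbers, the sketch below simulates an AR(1) process and computes its sample ACF with plain NumPy. The helper names `simulate_ar1` and `sample_acf`, the coefficient 0.8, and the seed are illustrative assumptions; in practice a library routine such as the ACF/PACF plots in statsmodels would be used.

```python
import numpy as np

def simulate_ar1(phi, n, seed=0):
    """Simulate y_t = phi * y_{t-1} + eps_t with standard normal noise."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + eps[t]
    return y

def sample_acf(y, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = np.dot(y, y)
    acf = [1.0]
    for k in range(1, max_lag + 1):
        acf.append(np.dot(y[:-k], y[k:]) / denom)
    return np.array(acf)

y = simulate_ar1(phi=0.8, n=5000)
acf = sample_acf(y, max_lag=5)

# For AR(1) the theoretical ACF at lag k is phi**k, so it decays geometrically
for k, value in enumerate(acf):
    print(f"lag {k}: sample={value:.3f}  theory={0.8 ** k:.3f}")
```

The sample values track phi**k closely at short lags, which is the signature that points toward an AR model rather than an MA model.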
Facebook Prophet: Democratizing Forecasting
Why Prophet Changed Everything
In 2017, Facebook released Prophet, making forecasting accessible to analysts without deep statistical expertise:
TRADITIONAL ARIMA WORKFLOW:
───────────────────────────
1. Check stationarity (ADF test)
2. Apply differencing if needed
3. Examine ACF/PACF plots
4. Choose p, d, q parameters
5. Fit model, check residuals
6. If residuals bad, go back to step 3
7. Handle seasonality separately
8. Add external regressors manually
9. Deal with missing data
10. Hope it works...

PROPHET WORKFLOW:
─────────────────
1. prophet.fit(df)
2. prophet.predict(future)
3. Done!

How Prophet Works
Prophet uses a decomposable model with three components plus an error term:
y(t) = g(t) + s(t) + h(t) + ε_t

Where:
g(t) = Trend (growth)
s(t) = Seasonality (Fourier series)
h(t) = Holidays/events
ε_t = Error term

TREND MODEL:
────────────
Linear: g(t) = k·t + m
Logistic: g(t) = C / (1 + exp(-k(t - m)))

Prophet automatically detects "changepoints" where
the growth rate k changes!

Before Facebook's Algorithm Change
──────╱   (flat trend, then growth)
      └── Changepoint detected!

SEASONALITY (Fourier Series):
─────────────────────────────
s(t) = Σ [aₙ·cos(2πnt/P) + bₙ·sin(2πnt/P)]

For yearly seasonality (P=365.25):
  - 10 Fourier terms by default
  - Captures complex patterns

For weekly seasonality (P=7):
  - 3 Fourier terms by default
  - Captures day-of-week effects

Prophet’s Secret Weapons
PROPHET ADVANTAGES:
───────────────────
1. HANDLES MISSING DATA
   ───────────────────
   No need to interpolate! Prophet just ignores gaps.

2. ROBUST TO OUTLIERS
   ───────────────────
   Typically handles moderate outliers well; extreme ones
   are best removed before fitting.

3. CHANGEPOINT DETECTION
   ──────────────────────
   Automatically finds where trends change.

4. HOLIDAY EFFECTS
   ────────────────
   Built-in support for irregular events.
   holidays = pd.DataFrame({
       'holiday': ['superbowl', 'thanksgiving'],
       'ds': ['2024-02-11', '2024-11-28']
   })

5. INTERPRETABLE COMPONENTS
   ────────────────────────
   See exactly what each component contributes.

6. UNCERTAINTY INTERVALS
   ──────────────────────
   Automatic prediction intervals!

Did You Know? Prophet was developed by Sean Taylor and Ben Letham at Facebook to forecast daily active users and ad revenue. They needed something that “worked out of the box” for thousands of time series with minimal human intervention. The name “Prophet” reflects their goal: to make accurate predictions (prophecies) about the future.
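The Fourier seasonality term s(t) described above can be illustrated by building the cosine and sine columns by hand and fitting their coefficients. This is a sketch of the idea only: it uses ordinary least squares on a made-up weekly pattern, whereas Prophet estimates the coefficients inside its own model. The function name `fourier_features` and the data are assumptions.

```python
import numpy as np

def fourier_features(t, period, n_terms):
    """Columns cos(2*pi*n*t/P) and sin(2*pi*n*t/P) for n = 1..n_terms."""
    t = np.asarray(t, dtype=float)
    cols = []
    for n in range(1, n_terms + 1):
        angle = 2.0 * np.pi * n * t / period
        cols.append(np.cos(angle))
        cols.append(np.sin(angle))
    return np.column_stack(cols)

# Four weeks of daily time steps, weekly period P=7, 3 Fourier terms
t = np.arange(28)
X = fourier_features(t, period=7, n_terms=3)

# Hypothetical day-of-week effect (centered so it has zero mean)
weekly_pattern = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 8.0, 9.0])
y = np.tile(weekly_pattern - weekly_pattern.mean(), 4)

# Least-squares estimate of the a_n, b_n coefficients
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
s_hat = X @ coef

print("max abs fit error:", np.max(np.abs(s_hat - y)))
```

With a period of 7, three cosine/sine pairs are enough to reproduce any zero-mean day-of-week pattern exactly. That is one way to read the default of 3 terms for weekly seasonality.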
Deep Learning for Time Series
When to Use Deep Learning
CLASSICAL vs DEEP LEARNING DECISION:
────────────────────────────────────

Use CLASSICAL (ARIMA, Prophet) when:
├── Single time series
├── Clear seasonality patterns
├── Limited data (<1000 points)
├── Interpretability needed
├── Fast training required
└── Simple relationships

Use DEEP LEARNING when:
├── Multiple related time series
├── Complex, non-linear patterns
├── Lots of data (>10,000 points)
├── Multiple input variables
├── State-of-the-art accuracy needed
└── Willing to trade interpretability

Recurrent Neural Networks (RNNs)
RNNs were designed specifically for sequential data:
VANILLA RNN:
────────────
Each timestep, the hidden state carries information forward.

h_t = tanh(W_h · h_{t-1} + W_x · x_t + b)
y_t = W_y · h_t + b_y

 x_1        x_2        x_3        x_4
  │          │          │          │
  ▼          ▼          ▼          ▼
[RNN]─h_1─►[RNN]─h_2─►[RNN]─h_3─►[RNN]─► h_4
  │          │          │          │
  ▼          ▼          ▼          ▼
 y_1        y_2        y_3        y_4

PROBLEM: Vanishing Gradients!
─────────────────────────────
As we backpropagate through many timesteps,
gradients get multiplied repeatedly.

0.9 × 0.9 × 0.9 × ... × 0.9 (100 times) ≈ 0.000027

The gradient vanishes! Can't learn long-term dependencies.

LSTM: Long Short-Term Memory
LSTMs mitigate the vanishing gradient problem with gates:
LSTM CELL ARCHITECTURE:
───────────────────────

[Diagram: LSTM cell. The previous cell state c_{t-1} is scaled by the forget gate f_t (×), the input gate i_t scales the new memory c̃_t, and the two are added (+) to give c_t. The output gate o_t then produces h_t from c_t. All gates take h_{t-1} and x_t as input.]

GATE FUNCTIONS:
───────────────
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)      # Forget gate
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)      # Input gate
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)      # Output gate
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)   # New memory

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t          # Cell state update
h_t = o_t ⊙ tanh(c_t)                    # Hidden state

The cell state c_t acts like a "conveyor belt" -
information can flow unchanged through time!

Did You Know? LSTMs were invented by Sepp Hochreiter and Jürgen Schmidhuber in 1997. For years, the paper was largely ignored because computing power wasn’t sufficient. It wasn’t until 2014-2015 that LSTMs became practical, winning competitions and powering Google Translate. Schmidhuber often jokes that deep learning’s success came 20 years late!
GRU: A Simpler Alternative
Gated Recurrent Units simplify LSTMs while keeping most benefits:

GRU vs LSTM:
────────────

LSTM: 3 gates (forget, input, output) + cell state
GRU:  2 gates (reset, update) + no cell state

GRU EQUATIONS:
z_t = σ(W_z · [h_{t-1}, x_t])              # Update gate
r_t = σ(W_r · [h_{t-1}, x_t])              # Reset gate
h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t])       # Candidate
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t      # New state

COMPARISON:
───────────
┌──────────────┬───────────┬───────────┐
│ Aspect       │ LSTM      │ GRU       │
├──────────────┼───────────┼───────────┤
│ Parameters   │ More      │ Fewer     │
│ Training     │ Slower    │ Faster    │
│ Performance  │ ≈ Same    │ ≈ Same    │
│ Long deps    │ Slightly  │ Slightly  │
│              │ better    │ worse     │
└──────────────┴───────────┴───────────┘

Rule of thumb: Try GRU first (faster), switch to
LSTM if you need longer memory.

Transformers for Time Series
Why Transformers Work for Time Series
The same attention mechanism that revolutionized NLP works for time series:
ATTENTION IN TIME SERIES:
─────────────────────────

Traditional RNN: Sequential processing
  t=1 → t=2 → t=3 → t=4 → ... → t=100

Problem: Information from t=1 might not reach t=100!

Transformer: Direct connections to ALL timesteps

  t=1  t=2  t=3  t=4  ...  t=100
   │    │    │    │          │
   └────┴────┴────┴──────────┘
        Attention can directly access any timestep!

Example: Forecasting energy demand
─────────────────────────────────
To predict Monday 8am demand, attention can:
- Look at last Monday 8am (7 days ago)
- Look at yesterday 8am
- Look at same day last year
- Ignore irrelevant midnight data

It LEARNS which past times are relevant!

Temporal Fusion Transformer (TFT)
Google’s Temporal Fusion Transformer (TFT) is a leading deep learning architecture for multi-horizon forecasting:
TFT ARCHITECTURE (output at top, input at bottom):
─────────────────

OUTPUT LAYER (Quantile predictions)
        ▲
TEMPORAL SELF-ATTENTION (Which past times matter most?)
        ▲
LSTM ENCODER-DECODER (Sequential processing)
        ▲
VARIABLE SELECTION NETWORK (Which input features are important?)
        ▲
INPUT EMBEDDING
  Static vars (store ID) | Past observed (past sales) | Known future (promotions)

TFT INNOVATIONS:
────────────────
1. Variable Selection: Learns which features matter
2. Static Enrichment: Uses metadata (store type, etc.)
3. Interpretable Attention: See which past times mattered
4. Multi-horizon: Predicts multiple future steps at once
5. Quantile Output: Uncertainty estimates built-in

Did You Know? The 2020 M5 Forecasting Competition on Kaggle (42,840 time series of Walmart sales) was won by teams using LightGBM, not deep learning! This surprised many researchers. The lesson: for many real-world problems, gradient boosting with good feature engineering still beats complex neural networks. Deep learning shines when you have millions of related time series.
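The "direct access to any past timestep" idea behind attention can be sketched as single-head scaled dot-product self-attention with a causal mask, so each timestep only attends to itself and earlier steps. This is a generic illustration, not the TFT's actual attention layer. The random projection matrices stand in for learned weights, and all names and sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, W_q, W_k, W_v):
    """Single-head attention over timesteps; row t cannot see steps after t."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (T, T) similarity scores
    T = X.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)    # mask out future timesteps
    weights = softmax(scores, axis=-1)            # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 5, 3                                       # 5 timesteps, 3 features each
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

context, weights = causal_self_attention(X, W_q, W_k, W_v)
print(np.round(weights, 3))                       # lower-triangular weight matrix
```

Row t of `weights` shows how much timestep t draws on each earlier step. Inspecting such weights in a trained model is how attention-based forecasters can show which past times mattered.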
Model Selection Guide
Choosing the right time series model is more art than science, but here’s a framework that works in practice.
The Decision Framework
Think of choosing a time series model like choosing a transportation method. You wouldn’t take a bicycle for a cross-country trip, and you wouldn’t rent a private jet to go to the grocery store. The right choice depends on your journey, or in our case, your data and requirements.
Simple Exponential Smoothing is like walking: basic, reliable, works for short distances, requires no special equipment. Use it for stable data without trends or seasonality.
ARIMA is like driving a car: more powerful, handles highways (trends) and turns (some patterns), requires some skill to operate. Use it for data with clear autocorrelation patterns.
Prophet is like using Uber: easy to call, handles most common routes automatically, good default choices. Use it for daily business data where you want quick results.
Deep Learning (LSTM/Transformer) is like operating a commercial aircraft: powerful, efficient at scale, requires significant training and infrastructure, overkill for short trips. Use it for thousands of related time series with abundant data.
When Each Model Shines
Choose Exponential Smoothing when:
- You have a single stable time series
- No strong trend or seasonality exists
- You need a quick baseline
- Interpretability is critical (stock levels, simple demand)
Choose ARIMA/SARIMA when:
- Clear autocorrelation exists in your data (check ACF plot)
- You need statistical rigor and hypothesis testing
- The series has regular seasonality (monthly, quarterly, yearly)
- You want to understand the underlying statistical process
- Example: Economic indicators, utility demand, financial returns
Choose Prophet when:
- You have daily/weekly data with multiple seasonality patterns
- Missing values are common (Prophet handles gaps gracefully)
- Holidays and special events affect your series
- You need quick deployment without extensive tuning
- Business stakeholders need interpretable trend/seasonality decomposition
- Example: Website traffic, retail sales, social media metrics
Choose Gradient Boosting + Features when:
- You have strong exogenous variables (weather, promotions, events)
- Multiple related series share patterns
- You have moderate data (1,000-100,000 points)
- Feature engineering is your strength
- Example: Retail demand, energy forecasting, transportation planning
Choose Deep Learning when:
- You have millions of related time series
- Patterns are complex and non-linear
- Abundant data exists (100,000+ points per series, or cross-series learning)
- You have GPU infrastructure and ML expertise
- The business value justifies the complexity investment
- Example: Global e-commerce (Amazon), ride-sharing (Uber), cloud infrastructure (AWS)
The Ensemble Approach
In practice, the best production systems don’t choose one model; they combine multiple approaches. This is similar to how you might check both Google Maps and Waze before a long drive, trusting the consensus route more than either alone.
A simple but effective ensemble strategy:
def ensemble_forecast(series, forecast_horizon):
    """
    Simple ensemble combining ARIMA, Prophet, and naive baseline.

    Assumes `series` is a pandas Series with a daily DatetimeIndex.
    """
    from statsmodels.tsa.arima.model import ARIMA
    from prophet import Prophet
    import pandas as pd
    import numpy as np

    forecasts = {}

    # 1. Naive baseline (last value repeated)
    forecasts['naive'] = np.full(forecast_horizon, series.iloc[-1])

    # 2. ARIMA (falls back to the naive forecast if fitting fails)
    try:
        arima = ARIMA(series, order=(5, 1, 2))
        arima_fit = arima.fit()
        forecasts['arima'] = np.asarray(arima_fit.forecast(steps=forecast_horizon))
    except Exception:
        forecasts['arima'] = forecasts['naive']

    # 3. Prophet (falls back to the naive forecast if fitting fails)
    try:
        prophet_df = pd.DataFrame({'ds': series.index, 'y': series.values})
        model = Prophet(yearly_seasonality=True, weekly_seasonality=True)
        model.fit(prophet_df)
        future = model.make_future_dataframe(periods=forecast_horizon)
        forecasts['prophet'] = model.predict(future)['yhat'].iloc[-forecast_horizon:].values
    except Exception:
        forecasts['prophet'] = forecasts['naive']

    # Simple average ensemble
    ensemble = np.mean(
        [forecasts['naive'], forecasts['arima'], forecasts['prophet']], axis=0
    )

    return ensemble, forecasts

Why ensembles work: Different models capture different patterns. ARIMA might excel at autoregressive patterns but miss holiday effects. Prophet handles holidays but might overfit changepoints. By averaging, we reduce the risk of any single model’s weaknesses dominating.
Common Pitfalls in Model Selection
Pitfall 1: Always choosing the most complex model. A simple exponential smoothing model that’s well-tuned often beats a misconfigured LSTM. Start simple, add complexity only when simpler models fail.

Pitfall 2: Ignoring baseline comparisons. Always compare to naive forecasts (y_t = y_{t-1} or y_t = y_{t-365}). If your fancy model can’t beat naive, something is wrong with your data or evaluation.

Pitfall 3: Optimizing for the wrong metric. Different business problems need different metrics. Overage costs differ from shortage costs. Ensure your optimization metric matches business impact.

Pitfall 4: Forgetting about inference time. A model that takes 10 seconds to forecast might work in batch, but if you need real-time predictions for 10,000 series, that’s 28 hours of compute per cycle. Consider latency requirements early.
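The two naive baselines from Pitfall 2 take only a few lines, which is part of why skipping them is hard to justify. The sketch below is one possible implementation; the function names and the toy history are assumptions.

```python
import numpy as np

def naive_forecast(history, horizon):
    """Repeat the last observed value: y_hat_{t+h} = y_t."""
    history = np.asarray(history, dtype=float)
    return np.full(horizon, history[-1])

def seasonal_naive_forecast(history, horizon, season_length):
    """Repeat the last full season: y_hat_{t+h} = y_{t+h-season_length}."""
    history = np.asarray(history, dtype=float)
    last_season = history[-season_length:]
    repeats = int(np.ceil(horizon / season_length))
    return np.tile(last_season, repeats)[:horizon]

# Two weeks of hypothetical daily values
history = np.arange(1, 15)

print(naive_forecast(history, horizon=3))
print(seasonal_naive_forecast(history, horizon=3, season_length=7))
```

Any candidate model should be scored against both outputs on the same held-out period before its extra complexity is accepted.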
Temporal Feature Engineering
Creating Features from Time
Raw timestamps hide valuable information:
FROM A SINGLE TIMESTAMP, EXTRACT:
─────────────────────────────────

datetime: 2024-11-28 14:30:00 (Thanksgiving Thursday)

Calendar Features:
├── year: 2024
├── month: 11
├── day: 28
├── hour: 14
├── minute: 30
├── day_of_week: 3 (Thursday, with Monday = 0)
├── day_of_year: 333
├── week_of_year: 48
├── quarter: 4
├── is_weekend: False
├── is_month_start: False
├── is_month_end: False
└── is_year_end: False

Cyclical Encoding (for neural networks):
├── hour_sin: sin(2π × 14/24) = -0.5
├── hour_cos: cos(2π × 14/24) = -0.866
├── day_sin: sin(2π × 3/7) = 0.434
├── day_cos: cos(2π × 3/7) = -0.901
└── month_sin/cos: ...

Holiday Features:
├── is_holiday: True (Thanksgiving)
├── days_to_holiday: 0
├── days_since_holiday: 0
└── holiday_type: "thanksgiving"

Lag Features: Looking Back in Time
LAG FEATURES:
─────────────
The most powerful time series features!

Original data:
─────────────
│ Date       │ Sales │
├────────────┼───────┤
│ 2024-11-25 │ 100   │
│ 2024-11-26 │ 120   │
│ 2024-11-27 │ 110   │
│ 2024-11-28 │ ???   │ ← Predict this

With lag features:
─────────────────
│ Date       │ Sales │ lag_1 │ lag_2 │ lag_7 │
├────────────┼───────┼───────┼───────┼───────┤
│ 2024-11-25 │ 100   │ 95    │ 90    │ 98    │
│ 2024-11-26 │ 120   │ 100   │ 95    │ 115   │
│ 2024-11-27 │ 110   │ 120   │ 100   │ 105   │
│ 2024-11-28 │ ???   │ 110   │ 120   │ 102   │

lag_1 = Yesterday's sales
lag_2 = 2 days ago
lag_7 = Same day last week

Now the model can learn:
"Sales ≈ 0.3×lag_1 + 0.1×lag_2 + 0.5×lag_7"

Rolling Statistics
ROLLING WINDOW FEATURES:
────────────────────────

│ Date       │ Sales │ roll_mean_7 │ roll_std_7 │ roll_max_7 │
├────────────┼───────┼─────────────┼────────────┼────────────┤
│ 2024-11-28 │ ???   │ 107.5       │ 8.2        │ 120        │

roll_mean_7 = mean of last 7 days' sales
roll_std_7 = std dev of last 7 days (volatility!)
roll_max_7 = max of last 7 days (recent peak)

EXPANDING WINDOW (cumulative):
─────────────────────────────
│ Date       │ Sales │ expanding_mean │ days_since_start │
├────────────┼───────┼────────────────┼──────────────────┤
│ 2024-11-28 │ ???   │ 98.5           │ 333              │

expanding_mean = mean of ALL historical data
Useful for detecting regime changes!

EXPONENTIAL MOVING AVERAGE:
───────────────────────────
EMA gives more weight to recent observations.

EMA_t = α × value_t + (1-α) × EMA_{t-1}

α = 0.1: Slow EMA (long memory)
α = 0.5: Fast EMA (recent focus)

Did You Know? The most important feature in many time series competitions is simply “same day last year” (lag_365 or lag_364 depending on day-of-week alignment). In the M5 competition, this single feature provided more predictive power than dozens of other engineered features combined!
Anomaly Detection in Time Series
What Makes an Anomaly?
TYPES OF ANOMALIES:
───────────────────

1. POINT ANOMALY
   A single value that's unusual

   Normal: 100, 102, 98, 105, [500], 101, 99
                              ^^^^^
                              Point anomaly!

2. CONTEXTUAL ANOMALY
   Normal in one context, anomalous in another

   Summer: 85°F normal
   Winter: 85°F ANOMALY! (should be ~40°F)

3. COLLECTIVE ANOMALY
   A sequence that's unusual as a group

   Normal:  ~100, ~100, ~100
   Anomaly: 50, 50, 50, 50, 50 (individually ok, but...)
            ^^^^^^^^^^^^^^^^^^
            Five consecutive lows is suspicious!

REAL-WORLD EXAMPLES:
────────────────────
├── Fraud Detection: Unusual spending pattern
├── Server Monitoring: CPU spike at 3am
├── Manufacturing: Machine vibration change
├── Healthcare: Heart rate irregularity
└── Finance: Flash crash in stock price

Statistical Methods
Z-SCORE METHOD:
───────────────
z = (x - μ) / σ

If |z| > 3, it's an anomaly (3-sigma rule)

[Diagram: bell curve centered on μ; 99.7% of data lies within ±3σ, and values beyond -3σ or +3σ fall in the anomaly zone]

IQR METHOD (Robust to outliers):
────────────────────────────────
Q1 = 25th percentile
Q3 = 75th percentile
IQR = Q3 - Q1

Lower bound = Q1 - 1.5 × IQR
Upper bound = Q3 + 1.5 × IQR

Values outside bounds = anomalies

Machine Learning Methods
ISOLATION FOREST:
─────────────────
Idea: Anomalies are easier to isolate!

Normal points: Need many splits to isolate
Anomalies: Few splits to isolate (they're "far" from others)

[Diagram: isolation tree in which the anomaly is isolated in 1 split, while normal points need more splits]

Isolation score = average path length to isolate point
Low path length = likely anomaly

AUTOENCODERS FOR ANOMALY DETECTION:
───────────────────────────────────

Train autoencoder on NORMAL data only:
  Input → [Encoder] → Latent → [Decoder] → Reconstruction

For new data:
  reconstruction_error = ||input - reconstructed||

  Normal data: Low error (autoencoder learned these patterns)
  Anomalies: High error (never seen before, can't reconstruct!)

[Diagram: reconstruction error per data point; points above the threshold line are flagged as anomalies]

Did You Know? Netflix uses time series anomaly detection to monitor their 200+ microservices. They process millions of metrics per second and need to detect issues within seconds. Their system “Telltale” uses a combination of statistical methods and machine learning, automatically learning what’s “normal” for each service without human labeling!
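The two statistical detectors from earlier in this section, the z-score rule and the IQR rule, can be compared on the point-anomaly example (100, 102, 98, 105, 500, 101, 99). The sketch below uses NumPy only; the function names are assumptions and the thresholds are the ones stated in the text.

```python
import numpy as np

def zscore_anomalies(values, threshold=3.0):
    """Flag points whose |z| exceeds the threshold (3-sigma rule)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

def iqr_anomalies(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

data = [100, 102, 98, 105, 500, 101, 99]
z_flags = zscore_anomalies(data)
iqr_flags = iqr_anomalies(data)

print("z-score flags:", z_flags)
print("IQR flags    :", iqr_flags)
```

On a sample this small, the value 500 inflates the mean and standard deviation so much that its own z-score stays below 3, so the 3-sigma rule misses it while the IQR rule flags it. This masking effect is the practical meaning of "robust to outliers", and it is why z-scores are usually computed against a longer or cleaner reference window.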
Forecasting at Scale
The Multiple Time Series Problem
SINGLE vs MULTIPLE TIME SERIES:
───────────────────────────────

SINGLE TIME SERIES (Traditional):
─────────────────────────────────
One model per series. Works for 1-100 series.

Product A → ARIMA_A
Product B → ARIMA_B
Product C → ARIMA_C
...
Time: O(n) models to train

MULTIPLE TIME SERIES (Modern):
──────────────────────────────
One model for ALL series. Essential for 1000+ series.

Product A ──┐
Product B ──┼──→ [Global Model] ──→ All forecasts
Product C ──┤
...       ──┘

Benefits:
- Learns patterns across series
- Handles cold start (new products)
- Much faster (1 model, not n)
- Often more accurate!

Hierarchical Forecasting
HIERARCHY EXAMPLE (Retail):
───────────────────────────

Total Company
├── Region A: Store 1, Store 2, Store 3
├── Region B: Store 4, Store 5, Store 6
└── Region C: Store 7, Store 8, Store 9

RECONCILIATION PROBLEM:
───────────────────────
If you forecast each level independently:
  Total forecast: $1,000,000
  Sum of regions: $950,000 ← Doesn't match!

Solutions:
1. Top-down: Forecast total, split proportionally
2. Bottom-up: Forecast stores, sum up
3. Optimal reconciliation: Combine all levels optimally

Practical Considerations
Handling Missing Data
STRATEGIES FOR MISSING VALUES:
──────────────────────────────

1. FORWARD FILL (LOCF)
   Last Observation Carried Forward
   [10, 20, NaN, NaN, 50] → [10, 20, 20, 20, 50]
   Good for: Slow-changing data (prices, states)

2. BACKWARD FILL
   [10, NaN, NaN, 40, 50] → [10, 40, 40, 40, 50]
   Good for: When future is more relevant

3. LINEAR INTERPOLATION
   [10, NaN, NaN, 40, 50] → [10, 20, 30, 40, 50]
   Good for: Smooth continuous data

4. SEASONAL INTERPOLATION
   Use same time from previous cycle
   Good for: Strongly seasonal data

5. MODEL-BASED IMPUTATION
   Train model on non-missing data, predict missing
   Good for: Complex patterns

Evaluation Metrics
FORECASTING METRICS:
────────────────────

MAE (Mean Absolute Error):
  MAE = mean(|actual - predicted|)
  Interpretable: "Average error is $X"

RMSE (Root Mean Square Error):
  RMSE = sqrt(mean((actual - predicted)²))
  Penalizes large errors more

MAPE (Mean Absolute Percentage Error):
  MAPE = mean(|actual - predicted| / |actual|) × 100%
  Scale-independent
  Problem: undefined when actual = 0!

SMAPE (Symmetric MAPE):
  SMAPE = mean(2|A - P| / (|A| + |P|)) × 100%
  Handles zeros better

MASE (Mean Absolute Scaled Error):
  MASE = MAE / MAE_of_naive_forecast
  < 1 means better than naive
  The gold standard for academics!
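The five metrics above are short enough to write out directly. The sketch below follows the formulas as stated. For MASE it assumes the common convention of scaling by the in-sample MAE of the one-step naive forecast on the training data, which the text does not spell out. All names and numbers are illustrative.

```python
import numpy as np

def mae(actual, predicted):
    return np.mean(np.abs(actual - predicted))

def rmse(actual, predicted):
    return np.sqrt(np.mean((actual - predicted) ** 2))

def mape(actual, predicted):
    # Undefined when any actual value is 0
    return np.mean(np.abs(actual - predicted) / np.abs(actual)) * 100

def smape(actual, predicted):
    return np.mean(2 * np.abs(actual - predicted)
                   / (np.abs(actual) + np.abs(predicted))) * 100

def mase(actual, predicted, train):
    # Assumption: scale by the MAE of the naive forecast y_t = y_{t-1} on train
    naive_mae = np.mean(np.abs(train[1:] - train[:-1]))
    return mae(actual, predicted) / naive_mae

train = np.array([50.0, 60.0, 80.0, 100.0])
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 310.0, 390.0])

print("MAE  :", mae(actual, predicted))
print("RMSE :", rmse(actual, predicted))
print("MAPE :", round(mape(actual, predicted), 3))
print("SMAPE:", round(smape(actual, predicted), 3))
print("MASE :", round(mase(actual, predicted, train), 3))
```

Because every error in this toy example has the same magnitude, MAE and RMSE coincide. They diverge as soon as a few errors are much larger than the rest.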
WHICH TO USE?─────────────├── Business stakeholders: MAE (easy to explain)├── Scale comparison: MAPE/SMAPE├── Academic: MASE└── Optimization: Usually RMSE (differentiable)Avoiding Data Leakage
Section titled “Avoiding Data Leakage”TIME SERIES CROSS-VALIDATION:─────────────────────────────
WRONG (standard k-fold):────────────────────────Randomly split data - FUTURE leaks into PAST! Train: [▓▓▓░░▓▓░▓▓] (random mix) Test: [░░░▓▓░░▓░░]
Model might see Dec 2024 in training, then "predict" Nov 2024 in test. CHEATING!
CORRECT (time-based):─────────────────────
Walk-forward validation: Fold 1: Train [▓▓▓░░░░░░░] Test [░▓░░░░░░░░] Fold 2: Train [▓▓▓▓░░░░░░] Test [░░▓░░░░░░░] Fold 3: Train [▓▓▓▓▓░░░░░] Test [░░░▓░░░░░░] Fold 4: Train [▓▓▓▓▓▓░░░░] Test [░░░░▓░░░░░] ─────────────────────────────────────→ time
Always train on PAST, test on FUTURE!
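A walk-forward splitter is short enough to write by hand. The sketch below is an assumption-laden illustration (the function name and parameters are mine): it uses an expanding training window and can drop a `gap` of rows between train and test, matching the forecast horizon. scikit-learn's `TimeSeriesSplit` offers the same idea through its `gap` argument.

```python
def walk_forward_splits(n_samples, n_splits, test_size, gap=0):
    """Yield (train_indices, test_indices) with an expanding training window."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start - gap <= 0:
        raise ValueError("Not enough samples for the requested splits and gap")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_end = test_start - gap  # exclusive; rows inside the gap are never used
        yield list(range(train_end)), list(range(test_start, test_start + test_size))

# 20 daily observations, 3 folds, 2-day test blocks, 7-day gap
for train_idx, test_idx in walk_forward_splits(n_samples=20, n_splits=3, test_size=2, gap=7):
    print(f"train 0..{train_idx[-1]}  test {test_idx}")
```

Every training index is at least `gap` steps older than the earliest test index, so a lag feature shorter than the gap cannot carry test-period values into training.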
GAP BETWEEN TRAIN AND TEST:
───────────────────────────
If forecasting 7 days ahead, leave 7-day gap:
  Train: [▓▓▓▓▓▓░░░░░░░░]  Gap [░░░░░░░]  Test [▓▓▓]

Prevents target leakage through lagged features!

Production War Stories

The $50 Million Inventory Mistake

March 2022, Major Retailer
A retail forecasting team deployed a new demand forecasting model that improved accuracy by 3% on historical data. What they didn’t realize: the model was trained on 2019-2021 data, which included the COVID anomaly period. The model had learned that “March = panic buying spike.”
When March 2022 arrived and no pandemic panic occurred, the model predicted massive demand. The company overstocked by $50 million worth of inventory. Perishables spoiled. Warehouses overflowed. Discounting destroyed margins.
The Root Cause: No regime change detection. The model assumed the future would look like the past, including a once-in-a-century pandemic.

    # The Fix: Regime detection before forecasting
    def detect_regime_change(series, window=30, threshold=2.0):
        """Detect when the underlying pattern has fundamentally changed."""
        rolling_mean = series.rolling(window).mean()
        rolling_std = series.rolling(window).std()

        # Check if recent data is wildly different from historical patterns:
        # compare the latest window against the window just before it
        baseline_mean = rolling_mean.iloc[-window*2:-window].mean()
        baseline_std = rolling_std.iloc[-window*2:-window].mean()
        recent_zscore = (series.iloc[-window:].mean() - baseline_mean) / baseline_std

        if abs(recent_zscore) > threshold:
            print(f"REGIME CHANGE DETECTED: z-score = {recent_zscore:.2f}")
            print("   Consider retraining on post-change data only!")
            return True
        return False

    # Usage: Run before every forecast cycle
    # (retrain_on_recent_data and use_full_historical_model stand in for your own training code)
    if detect_regime_change(sales_data):
        model = retrain_on_recent_data(sales_data, months=3)
    else:
        model = use_full_historical_model(sales_data)

Lesson: Always monitor for regime changes. A model that was perfect yesterday might be worthless today.
The 0.01% That Cost Millions
September 2019, Financial Services Company
A quant team built an algorithmic trading model that forecasted stock prices with 99.99% directional accuracy in backtesting. They deployed it with $100 million in capital.
In the first week, the model lost $3 million. The “99.99% accuracy” was meaningless because they had committed the cardinal sin of time series: data leakage through look-ahead bias.
Their feature engineering calculated technical indicators using the full dataset (including future prices), then “predicted” the past. The model was essentially memorizing, not forecasting.
    import pandas as pd
    import talib

    # WRONG: Look-ahead bias (what they did)
    def calculate_rsi_wrong(df):
        """Row t of the indicator already contains close[t], the value being predicted."""
        # RSI itself only looks backward, but it includes the current close.
        # Pairing this column with a same-row target leaks the answer, as does
        # any statistic (scaling, normalization) fit on the full dataset.
        df['RSI'] = talib.RSI(df['close'], timeperiod=14)
        return df

    # CORRECT: Point-in-time calculation
    def calculate_rsi_correct(df, current_idx):
        """Only use data available before the time of prediction."""
        # .loc slicing is inclusive, so drop the row at current_idx itself
        historical_data = df.loc[:current_idx, 'close'].iloc[:-1]
        return talib.RSI(historical_data, timeperiod=14).iloc[-1]

    # Or build the whole column and shift it by one step
    def create_features_safely(df):
        """Create features using only past data at each point."""
        features = pd.DataFrame(index=df.index)
        # After the shift, the value at time i depends only on prices 0 to i-1
        features['RSI'] = talib.RSI(df['close'], timeperiod=14).shift(1)
        return features

The Fix: Implement strict point-in-time feature engineering. Every feature at time t must only use data from times < t.
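One way to audit a feature pipeline for look-ahead is a truncation test: recompute each feature with all later data removed and confirm the value does not change. The sketch below is illustrative only. `has_lookahead`, `trailing_mean`, and `global_zscore` are hypothetical helpers working on plain lists, not part of any library.

```python
def has_lookahead(feature_fn, series, min_history=5):
    """Return True if any feature value changes once data after that time is removed."""
    full = feature_fn(series)
    for i in range(min_history, len(series)):
        truncated = feature_fn(series[: i + 1])  # pretend time i is "now"
        if truncated[i] != full[i]:
            return True
    return False

def trailing_mean(series, window=3):
    """Mean of the `window` values strictly before each point (None until enough history)."""
    return [None if i < window else sum(series[i - window:i]) / window
            for i in range(len(series))]

def global_zscore(series):
    """Leaky on purpose: mean and std are computed over the whole series, future included."""
    mean = sum(series) / len(series)
    std = (sum((x - mean) ** 2 for x in series) / len(series)) ** 0.5
    return [(x - mean) / std for x in series]

prices = [100, 102, 101, 105, 107, 106, 110, 108, 112, 115]
print("trailing_mean leaks:", has_lookahead(trailing_mean, prices))
print("global_zscore leaks:", has_lookahead(global_zscore, prices))
```

The test is slow (it recomputes the feature once per time step), so it suits a CI check on a small sample rather than the production pipeline.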
The Anomaly That Wasn’t
July 2023, Cloud Infrastructure Provider
An ML team built an anomaly detection system to monitor 50,000 server metrics. It was sensitive and fast, flagging anomalies within seconds. Operations loved it—at first.
Within a month, the team was overwhelmed. The system generated 10,000+ alerts daily. Most were false positives from normal daily patterns (high CPU at 9 AM when employees log in) or weekly patterns (backup jobs on Sunday nights).
The model treated every deviation from a static baseline as anomalous, ignoring the fact that servers have predictable cycles.
    # WRONG: Static threshold (what they did)
    def detect_anomaly_naive(value, historical_mean, historical_std):
        z_score = (value - historical_mean) / historical_std
        return abs(z_score) > 3  # Static threshold

    # CORRECT: Seasonally-adjusted detection
    def detect_anomaly_seasonal(value, timestamp, historical_data):
        """Account for hour-of-day and day-of-week patterns."""
        hour = timestamp.hour
        day = timestamp.dayofweek

        # Get historical values for same hour and day
        similar_times = historical_data[
            (historical_data.index.hour == hour) &
            (historical_data.index.dayofweek == day)
        ]

        if len(similar_times) < 10:
            # Not enough history for this time slot
            return False, "Insufficient history"

        seasonal_mean = similar_times.mean()
        seasonal_std = similar_times.std()

        z_score = (value - seasonal_mean) / (seasonal_std + 1e-6)

        if abs(z_score) > 3:
            return True, f"Anomaly: z={z_score:.2f} vs typical for {hour}:00 on {day}"
        return False, "Normal"

After implementing seasonal baselines, false positives dropped by 90%, and the team could actually respond to real issues.
Lesson: Time series anomaly detection must account for temporal patterns. 9 AM behavior is different from 3 AM behavior.
️ Common Mistakes
Mistake 1: Shuffling Time Series Data

    # WRONG: Random train/test split destroys temporal structure
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    # Future data can leak into training!

    # CORRECT: Time-based split preserves temporal order
    split_idx = int(len(X) * 0.8)
    X_train, X_test = X[:split_idx], X[split_idx:]
    y_train, y_test = y[:split_idx], y[split_idx:]
    # Training only sees past, testing only sees future

Why it matters: Random shuffling lets your model “peek” at the future. It will show amazing validation scores but fail completely in production.
Mistake 2: Ignoring Seasonality When Differencing
    # WRONG: Only first-order differencing for seasonal data
    diff_wrong = series.diff(1)  # Removes trend but not seasonality

    # CORRECT: Seasonal differencing for seasonal data
    # For monthly data with yearly seasonality:
    diff_correct = series.diff(12)       # Remove yearly pattern
    diff_correct = diff_correct.diff(1)  # Then remove trend

The symptom: Residuals still show periodic patterns in the ACF plot. The model captures the trend but completely misses that December is always different from July.
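A small worked example makes the effect of each differencing step visible. The series below is synthetic (a linear trend plus a repeating 12-month pattern, both invented for illustration), and `difference` is a plain-list stand-in for pandas' `Series.diff`.

```python
def difference(values, lag=1):
    """Return values[t] - values[t - lag] for every t >= lag."""
    return [values[t] - values[t - lag] for t in range(lag, len(values))]

# Synthetic monthly series: +2 per month trend plus a fixed 12-month seasonal pattern
seasonal_pattern = [0, 1, 3, 6, 8, 9, 8, 6, 3, 1, 0, -2]
series = [2 * t + seasonal_pattern[t % 12] for t in range(48)]

first_only = difference(series, lag=1)   # trend gone, seasonal swings remain
seasonal = difference(series, lag=12)    # seasonality gone, constant yearly growth remains
both = difference(seasonal, lag=1)       # seasonal then first difference

print("lag-1 only, distinct values:  ", sorted(set(first_only)))
print("lag-12 only, distinct values: ", sorted(set(seasonal)))
print("lag-12 then lag-1, distinct:  ", sorted(set(both)))
```

With real data the final series will be noise rather than a constant, but the same ordering of results holds: lag-1 differencing alone leaves the seasonal swing in place.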
Mistake 3: Using Future Information in Features
    # WRONG: Rolling features that look into the future
    df['moving_avg'] = df['sales'].rolling(window=7, center=True).mean()
    #                                                ^^^^^^^^^^^ Uses 3 future days!

    # CORRECT: Only use past data
    df['moving_avg'] = df['sales'].rolling(window=7, center=False).mean().shift(1)
    #                                                ^^^^^^^^^^^^ Only past    ^^^^^^^^ Exclude current

Why it’s subtle: center=True looks like a harmless smoothing option, but a centered 7-day window averages 3 past days, the current day, and 3 future days. pandas defaults to center=False, yet even a trailing window includes the current row, which is why the .shift(1) is still needed when the current value is the target.
Mistake 4: Overfitting to Recent Data
    # WRONG: Training only on last 3 months
    model = Prophet()
    model.fit(df[df['ds'] > '2024-09-01'])  # Misses yearly seasonality!

    # CORRECT: Include at least 2 full cycles of your longest seasonality
    # For yearly seasonality, use 2+ years of data
    model = Prophet()
    model.fit(df[df['ds'] > '2022-09-01'])  # Captures 2 full years

The trap: Recent data feels most relevant, but short windows can’t capture long seasonal patterns. Your model won’t know that January always differs from July.
Mistake 5: Ignoring Scale When Comparing Forecasts
    # WRONG: Comparing MAPE across very different series
    mape_product_a = 5.2   # percent; sales: $1M/month
    mape_product_b = 45.3  # percent; sales: $100/month (!)

    # Product B seems terrible, but 45% of $100 is only $45 error!
    # Product A at 5% of $1M is $50,000 error!

    # CORRECT: Use absolute errors or baseline-scaled metrics
    # Option 1: Compare MAE in dollars
    mae_a = 50000  # Much larger actual impact
    mae_b = 45     # Tiny actual impact

    # Option 2: Use MASE (Mean Absolute Scaled Error)
    # Compares your model to a naive baseline on the same series
    mase_a = 0.8  # 20% better than naive
    mase_b = 1.2  # 20% worse than naive (worry about this one!)

Lesson: MAPE punishes forecasting low-volume items unfairly. Use MASE or weighted metrics for portfolio forecasting.
Economics of Time Series Forecasting
ROI Breakdown by Industry

| Industry | Use Case | 1% Accuracy Gain Value | Typical Investment |
|---|---|---|---|
| Retail | Demand forecasting | $50M+ (inventory costs) | $500K |
| Energy | Load forecasting | $10M+ (grid balance) | $1M |
| Finance | Trading signals | $100M+ (alpha capture) | $5M |
| Logistics | Capacity planning | $20M+ (fleet optimization) | $300K |
| Healthcare | Patient volume | $5M+ (staffing costs) | $200K |
Cost Structure of Production Forecasting Systems
TOTAL ANNUAL COST: $500K - $2M (enterprise scale)
──────────────────────────────────────────────────

Infrastructure (40%):
├── Compute for training: $5K-50K/month
├── Real-time inference: $2K-20K/month
├── Data storage: $1K-10K/month
└── Monitoring systems: $1K-5K/month

Personnel (50%):
├── Data scientists: 2-5 FTEs
├── ML engineers: 1-3 FTEs
└── Domain experts: 0.5-1 FTE

Tools & Services (10%):
├── Cloud ML platforms
├── Feature stores
└── Experiment tracking

When to Build vs Buy

| Factor | Build Custom | Use Prophet/Auto-ARIMA | Buy Platform |
|---|---|---|---|
| Time series count | >10,000 | <100 | 100-10,000 |
| Customization need | High | Low | Medium |
| Team expertise | Deep ML | Basic stats | Varies |
| Time to value | 6-12 months | 1-2 weeks | 1-3 months |
| Annual cost | $500K+ | $50K | $100K-300K |
Did You Know? Walmart’s forecasting system handles 500 million time series (every item × every store × every hour). Improving accuracy by 1 point on their largest 1% of series has the same dollar impact as improving the bottom 50% by 20 points. Focus your effort where the money is!
Debugging and Troubleshooting
Problem: Model Predicts Flat Line

Symptom: Your forecast is a straight horizontal line near the mean.
Diagnosis: This usually means the model can’t find patterns and defaults to predicting the average. Common causes:
- Data is truly random: Some series (like daily stock returns) have very weak autocorrelation, so the price itself behaves like a random walk. The best predictor of tomorrow’s price is today’s price.

- Wrong differencing: Over-differencing removes all signal. If you’ve differenced twice and predictions are flat, try d=1 or d=0.

- Feature scaling issues: If your features are on vastly different scales, the model might ignore small-scale but important features.
    # Check if your series has predictable patterns
    from statsmodels.graphics.tsaplots import plot_acf
    import matplotlib.pyplot as plt

    def diagnose_flat_predictions(series):
        """Diagnose why model predicts flat line."""
        # Check autocorrelation
        fig, ax = plt.subplots(figsize=(10, 4))
        plot_acf(series.dropna(), lags=40, ax=ax)
        plt.title("Autocorrelation - Strong patterns = tall bars")
        plt.show()

        # Check variance
        print(f"Mean: {series.mean():.4f}")
        print(f"Std Dev: {series.std():.4f}")
        print(f"Coefficient of Variation: {series.std()/series.mean()*100:.1f}%")

        # Check if naive forecast is better
        naive_mae = abs(series - series.shift(1)).mean()
        mean_mae = abs(series - series.mean()).mean()
        # Braces are doubled so the f-string prints y_{t-1} literally
        print(f"\nNaive (y_t = y_{{t-1}}) MAE: {naive_mae:.4f}")
        print(f"Mean prediction MAE: {mean_mae:.4f}")

        if naive_mae < mean_mae:
            print("Series has predictable structure (naive beats mean)")
        else:
            print("Series might be unpredictable (mean beats naive)")

Problem: Predictions Are Always One Step Behind

Symptom: Your forecast perfectly tracks actuals but shifted by one timestep.
Diagnosis: The model has learned to copy the most recent value it can see. With a lag-1 target feature and a series close to a random walk, this is often the best forecast available, so it is not necessarily a bug, but it means the model adds nothing over the naive baseline. It becomes genuine leakage when that lag would not be known at prediction time, for example using the value at t-1 while forecasting several steps ahead.

    import numpy as np

    # The model learns: y_t ≈ y_{t-1} (just copy previous value)

    # Check for this:
    def check_for_lag_leak(predictions, actuals):
        """Detect if predictions are just lagged actuals."""
        # Correlation with lag-1 actual
        corr_lag1 = np.corrcoef(predictions[1:], actuals[:-1])[0, 1]
        # Correlation with actual
        corr_actual = np.corrcoef(predictions, actuals)[0, 1]

        print(f"Correlation with lag-1 actual: {corr_lag1:.4f}")
        print(f"Correlation with actual: {corr_actual:.4f}")

        if corr_lag1 > corr_actual:
            print("LAG DETECTED: Predictions track lagged values!")
            print("   Check your lag features and compare against the naive baseline")

Fix: Make every lag feature at least as long as the forecast horizon (for an h-step-ahead forecast, use lags of h or more), and always benchmark against the naive forecast.
Problem: Great Training Metrics, Terrible Production Performance
Symptom: Model shows 95%+ accuracy in development but fails in production.
Diagnosis: Almost always one of these issues:
- Train/test contamination: Used random split instead of time split
- Look-ahead bias: Features computed on full dataset
- Target leakage: Features contain target information
- Regime change: Production data is fundamentally different
    def validate_training_integrity(model_metrics, production_metrics):
        """Check if training performance is realistic."""
        train_mae = model_metrics['train_mae']
        val_mae = model_metrics['val_mae']
        prod_mae = production_metrics['mae']

        print(f"Training MAE: {train_mae:.4f}")
        print(f"Validation MAE: {val_mae:.4f}")
        print(f"Production MAE: {prod_mae:.4f}")

        # Warning signs
        if train_mae < val_mae * 0.5:
            print("Training much better than validation - likely overfitting")

        if prod_mae > val_mae * 2:
            print("Production 2x worse than validation - check for:")
            print("   1. Data leakage in validation")
            print("   2. Regime change in production")
            print("   3. Feature pipeline differences")

        ratio = prod_mae / val_mae
        if 0.8 < ratio < 1.2:
            print("Production performance matches expectations")

Problem: ARIMA Convergence Warnings

Symptom: statsmodels throws convergence warnings or optimization failures.
Diagnosis: ARIMA optimization can be finicky. Common fixes:
    from statsmodels.tsa.arima.model import ARIMA
    import numpy as np
    import warnings

    def fit_arima_robust(series, order, max_attempts=5):
        """Fit ARIMA with multiple optimization attempts."""
        # Optimizers accepted by the state-space likelihood fitter
        methods = ['lbfgs', 'bfgs', 'powell', 'nm', 'cg']
        best_model = None
        best_aic = np.inf

        for method in methods[:max_attempts]:
            try:
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore")

                    model = ARIMA(series, order=order)
                    # ARIMA.fit's own `method` argument picks the estimator;
                    # the optimizer is forwarded through method_kwargs
                    fitted = model.fit(method_kwargs={'method': method})

                if fitted.aic < best_aic:
                    best_aic = fitted.aic
                    best_model = fitted
                print(f"Method {method}: AIC={fitted.aic:.2f}")

            except Exception as e:
                print(f"Method {method}: Failed ({str(e)[:50]})")

        if best_model is None:
            print("All methods failed. Try:")
            print("1. Check for missing values")
            print("2. Try simpler order (lower p, q)")
            print("3. Ensure data is numeric")

        return best_model

Problem: Prophet Is Too Slow

Symptom: Prophet takes minutes to fit on large datasets.
Diagnosis: By default Prophet fits with fast MAP optimization, so slow fits usually come from very long histories, many changepoints or seasonal terms, full MCMC sampling (mcmc_samples > 0), or the uncertainty simulation at predict time. Speed-up options:

    from prophet import Prophet

    def fast_prophet_fit(df, quick_mode=True):
        """Configure Prophet for faster training."""

        if quick_mode:
            model = Prophet(
                # MAP estimation (the default); values > 0 switch to slow MCMC sampling
                mcmc_samples=0,

                # Reduce changepoint detection
                n_changepoints=10,  # Default is 25

                # Simplify seasonality
                yearly_seasonality=5,  # Fewer Fourier terms (default 10)
                weekly_seasonality=3,  # Default is 3

                # Disable uncertainty intervals for speed (default is 1000 samples)
                uncertainty_samples=0
            )
        else:
            model = Prophet()  # Full accuracy mode

        model.fit(df)
        return model

    # Also consider: sample your data for initial exploration
    # Full data for final model only

Real-World Success Stories

Uber: Dynamic Pricing at Scale

Uber’s pricing engine processes millions of time series forecasts in real-time. Every region, every hour, every day needs a demand prediction to set prices that balance supply and demand. Their forecasting system combines:
- Hierarchical forecasting: City → Zone → Grid cell → Time bucket
- External features: Weather, events, holidays, historical demand
- Ensemble approach: Gradient boosting for baseline, LSTM for complex patterns
The impact? Uber reduced driver idle time by 15% and increased rider satisfaction by ensuring cars are positioned where demand will appear. At their scale, a 1% improvement in forecast accuracy translates to $200 million annually in better resource allocation.
Technical insight: Uber discovered that simple lag features (same time yesterday, same day last week) provided 60% of their model’s predictive power. Deep learning added 5-10% on top, but the marginal cost of complexity often wasn’t worth it for new markets with limited data.
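The lag features described above ("same time yesterday", "same day last week") take only a few lines to build. This is a hedged sketch on invented hourly data, not Uber's pipeline; `add_lag_features` and the `lag_<k>` column names are hypothetical.

```python
def add_lag_features(values, lags):
    """Return one row per time step: the target plus the value k steps earlier for each lag."""
    rows = []
    for t, y in enumerate(values):
        row = {"y": y}
        for k in lags:
            # None marks rows where the lag would reach before the start of the data
            row[f"lag_{k}"] = values[t - k] if t >= k else None
        rows.append(row)
    return rows

# Hypothetical hourly demand: two weeks of data with a repeating daily cycle
hourly_demand = [100 + 10 * (h % 24) for h in range(24 * 14)]

# For hourly data: 24 steps back = same hour yesterday, 168 = same hour last week
rows = add_lag_features(hourly_demand, lags=(24, 168))
trainable = [r for r in rows if None not in r.values()]

print(rows[168])
print(f"{len(trainable)} of {len(rows)} rows have complete lag history")
```

Rows without full lag history are usually dropped before training, which is why the longest lag sets the minimum amount of history a new market needs.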
Amazon: 300 Million Forecasts Daily
Amazon’s forecasting system is perhaps the world’s largest production time series application. Every SKU in every warehouse needs daily demand predictions to drive:
- Inventory ordering: When and how much to order
- Warehouse placement: Which warehouse should hold stock
- Shipping optimization: Pre-positioning for anticipated demand
Amazon’s open-source AutoGluon-TimeSeries library, described in a 2023 paper, illustrates the AutoML approach, which:
- Automatically tries multiple models (ARIMA, Prophet, DeepAR, simple baselines)
- Selects the best model per series
- Combines predictions through weighted ensembling
The result: Amazon reduced inventory carrying costs by 15% while maintaining 99%+ in-stock rates. For a company with billions in inventory, this represents hundreds of millions in savings.
Key lesson: Amazon found that model selection matters more than model complexity. A simple exponential smoothing model often beats LSTM for stable products, while deep learning excels for products with complex promotional patterns.
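Per-series model selection can be illustrated with three simple baselines and a holdout window. This sketch is not Amazon's system: the candidate set, the `select_model` helper, and both example series are assumptions chosen to keep the idea visible.

```python
def naive(train, horizon):
    return [train[-1]] * horizon

def seasonal_naive(train, horizon, period=7):
    # Repeat the last full cycle of the training data
    return [train[-period + (h % period)] for h in range(horizon)]

def historical_mean(train, horizon):
    return [sum(train) / len(train)] * horizon

CANDIDATES = {"naive": naive, "seasonal_naive": seasonal_naive, "mean": historical_mean}

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def select_model(series, horizon=7):
    """Hold out the last `horizon` points and return (best_name, scores) by validation MAE."""
    train, valid = series[:-horizon], series[-horizon:]
    scores = {name: mae(valid, fn(train, horizon)) for name, fn in CANDIDATES.items()}
    return min(scores, key=scores.get), scores

weekly_product = [5, 6, 7, 8, 12, 20, 18] * 6   # strong weekly cycle
noisy_product = [8, 12] * 10 + [8]              # stable level, no weekly cycle

for name, series in [("weekly_product", weekly_product), ("noisy_product", noisy_product)]:
    best, scores = select_model(series)
    print(name, "->", best, {k: round(v, 2) for k, v in scores.items()})
```

Different series pick different winners, which is the point: the selection step matters more than any single model. A production version would score with walk-forward validation rather than one holdout window.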
Capital One: Fraud Detection in Milliseconds
Credit card fraud detection is fundamentally a time series problem. Capital One processes 50,000+ transactions per second, each needing a fraud score within 100 milliseconds. Their system combines:
- Customer behavioral sequences: What’s normal for this cardholder?
- Merchant risk patterns: How does this merchant’s transaction flow look?
- Temporal anomaly detection: Is this timing unusual?
They reduced fraud losses by 25% while decreasing false positive rates by 15%—meaning fewer annoyed customers getting their legitimate purchases declined.
Technical architecture: The system uses a streaming approach where LSTM models pre-compute customer embeddings hourly, while real-time scoring uses lightweight models that compare new transactions against these embeddings. This hybrid architecture achieves the speed requirements while maintaining accuracy.
Netflix: Understanding Viewing Patterns
Netflix uses time series forecasting for capacity planning, knowing that a big premiere can spike traffic 10x. Their approach:
- Seasonal decomposition: Identify weekly patterns (peak on weekends) and yearly patterns (holidays)
- Event modeling: Account for show premieres, sporting events, and cultural moments
- Geographic cascading: A show premieres at midnight in each timezone, creating predictable traffic waves
By accurately forecasting viewership, Netflix reduced their infrastructure costs by 30% while improving streaming quality. Over-provisioning wastes money; under-provisioning causes buffering.
Did You Know? Netflix found that their forecasting accuracy improved by 12% simply by adding “day since last release of similar content” as a feature. Viewers binge-watch, so demand patterns after a new season release follow predictable decay curves that vary by genre.
Instacart: Predicting Grocery Demand
Instacart faces a unique forecasting challenge: predicting which products customers will order, when, and in what quantities, with the added complexity of perishable goods.
Their 2023 ML system processes 50,000+ time series (products × stores) and learned several critical lessons:
- Hierarchical helps: Forecasting “dairy products” first, then drilling down to “2% milk,” then specific brands improves accuracy for low-volume items
- Weather is king: Temperature and precipitation are the two most important external features for grocery demand
- Substitution modeling: When a product is out of stock, demand shifts to alternatives—ignoring this creates systematic bias
The business impact: Instacart reduced food waste by 20% (better predictions mean less over-ordering) while improving customer satisfaction through higher in-stock rates.
Interview Preparation
Question 1: Explain stationarity and why it matters

Answer: Stationarity means the statistical properties of a time series (mean, variance, autocorrelation) don’t change over time. It matters because most classical forecasting methods assume stationarity. If a series is non-stationary, yesterday’s patterns don’t reliably predict tomorrow, because the underlying process is changing.
To test stationarity, I use the Augmented Dickey-Fuller (ADF) test. If the p-value is below 0.05, we reject the null hypothesis of a unit root, meaning the series is stationary. If it’s non-stationary, we apply differencing (subtracting the previous value) until it becomes stationary. The number of differences needed is the ‘d’ parameter in ARIMA(p,d,q).
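For reference, the regression behind the ADF test can be written out. This is the standard textbook form with a constant and linear trend; the symbols are notation introduced here, and $p$ is the number of lagged differences included.

```latex
\Delta y_t = \alpha + \beta t + \gamma\, y_{t-1}
           + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t,
\qquad \Delta y_t = y_t - y_{t-1}

H_0:\ \gamma = 0 \ \text{(unit root, non-stationary)}
\qquad
H_1:\ \gamma < 0 \ \text{(stationary)}
```

A strongly negative estimate of $\gamma$ means the series is pulled back toward its mean or trend, which is the evidence the test uses to reject the unit root. First differencing replaces $y_t$ with $\Delta y_t$, and each application adds one to the d in ARIMA(p,d,q).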
Question 2: How do you prevent data leakage in time series problems?
Answer: Data leakage in time series comes from three main sources:

- Random train/test splits: Instead, always use time-based splits where training data is strictly before test data.

- Look-ahead bias in features: Every feature at time t must only use information from times before t. Be especially careful with rolling calculations: use center=False and .shift(1).

- Target leakage: Features that directly encode the target or are caused by the target. For example, using “total monthly sales” to predict daily sales within that month.
To validate, I always leave a gap between training and test sets equal to the forecast horizon. If I’m predicting 7 days ahead, there should be at least a 7-day gap to simulate realistic conditions.
Question 3: When would you use Prophet vs ARIMA vs deep learning?
Answer: The choice depends on several factors:
Prophet when:
- You have daily data with strong weekly/yearly seasonality
- Missing values or outliers are common
- You need to handle holidays and special events
- Business stakeholders need interpretable components
- You have limited statistical expertise
ARIMA/SARIMA when:
- You need a statistical baseline
- Data is hourly or has unusual seasonality
- You want explicit control over model parameters
- Residual analysis for model diagnostics is important
Deep Learning (LSTM, Transformer) when:
- You have multiple related time series (>1000)
- Patterns are complex and non-linear
- You have abundant data (>10,000 points per series)
- Multiple exogenous variables affect the forecast
- You’re willing to sacrifice interpretability for accuracy
In practice, I often start with Prophet or ARIMA as a baseline, then try gradient boosting with lag features (which wins many competitions), and only move to deep learning if those approaches plateau.
Question 4: How do you handle multiple time series at scale?
Answer: At scale (1000+ time series), training individual models becomes impractical. I use global models that learn across all series:

- Feature engineering: Create a unified feature set including series-specific static features (store type, product category), temporal features (day of week, holidays), and lag features (same time last week, last year).

- Global model: Train a single model (typically LightGBM or a Transformer) on all series together. The model learns patterns that transfer across series.

- Hierarchical reconciliation: If series have a hierarchy (stores → regions → total), forecast at each level and reconcile using optimal combination methods to ensure consistency.

- Cold start handling: For new series with no history, the global model can still forecast using static features and patterns learned from similar series.
This approach is more accurate than individual models because rare patterns in one series might be common across the portfolio.
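To make the reconciliation step concrete, here is the simplest scheme, bottom-up aggregation. Note that this is deliberately not the optimal-combination method mentioned above; it just sums the lowest-level forecasts so every level agrees. The store names, numbers, and `bottom_up` helper are all invented for illustration.

```python
def bottom_up(store_forecasts, hierarchy):
    """Sum store-level forecasts into region and total forecasts that are consistent by construction."""
    horizon = len(next(iter(store_forecasts.values())))
    regions = {
        region: [sum(store_forecasts[store][h] for store in stores) for h in range(horizon)]
        for region, stores in hierarchy.items()
    }
    total = [sum(region_fc[h] for region_fc in regions.values()) for h in range(horizon)]
    return regions, total

# Hypothetical two-step-ahead forecasts for three stores in two regions
hierarchy = {"north": ["store_a", "store_b"], "south": ["store_c"]}
store_forecasts = {"store_a": [10, 12], "store_b": [5, 6], "store_c": [20, 18]}

regions, total = bottom_up(store_forecasts, hierarchy)
print("regions:", regions)
print("total:  ", total)
```

Bottom-up keeps all store-level detail but inherits store-level noise. Optimal combination methods instead forecast every level independently and then adjust all of them toward consistency.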
Question 5: Design a real-time anomaly detection system for server metrics
System Design Answer:

Requirements clarification:

- Volume: 50,000 servers × 100 metrics × per-minute = 5M data points/minute
- Latency: Detect anomalies within 60 seconds
- False positive tolerance: <1% to avoid alert fatigue

Architecture:

┌────────────────────────────────────────────────────────────┐
│                        ARCHITECTURE                        │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Metrics ─→ Kafka ─→ Flink Processor ─→ Anomaly Scorer     │
│                             │                 │            │
│                             ▼                 ▼            │
│                       Feature Store     Alert Router       │
│                       (Redis + S3)            │            │
│                                               ▼            │
│                                        PagerDuty/Slack     │
└────────────────────────────────────────────────────────────┘

Anomaly detection approach:
- Seasonal baseline: For each metric × hour × day-of-week, maintain rolling mean and std
- Ensemble scoring: Combine Z-score with Isolation Forest for robustness
- Suppression rules: Group related alerts (same server, same root cause)
- Adaptive thresholds: Tighten thresholds for critical services, relax for dev environments
Key design decisions:
- Use streaming (Flink) rather than batch for low latency
- Store baselines in Redis for sub-millisecond lookups
- Implement alert correlation to reduce noise (100 metrics spiking on one server = 1 alert)
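The seasonal-baseline component of this design can be sketched with Welford's online algorithm, which updates a mean and variance one value at a time. This is an in-memory illustration under stated assumptions: a Python dict stands in for Redis, the `SeasonalBaseline` class and its thresholds are hypothetical, and flagged values are kept out of the baseline so a single spike does not inflate it.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta

class SeasonalBaseline:
    """Running mean/std per (metric, hour, day-of-week) slot via Welford's algorithm."""

    def __init__(self, z_threshold=3.0, min_samples=10):
        self.z_threshold = z_threshold
        self.min_samples = min_samples
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])  # count, mean, M2

    def score_and_update(self, metric, timestamp, value):
        """Return True if value is anomalous for its slot; otherwise fold it into the baseline."""
        slot = self.stats[(metric, timestamp.hour, timestamp.weekday())]
        count, mean, m2 = slot
        is_anomaly = False
        if count >= self.min_samples:  # too little history: never alert
            std = math.sqrt(m2 / (count - 1))
            is_anomaly = abs(value - mean) / (std + 1e-6) > self.z_threshold
        if not is_anomaly:
            count += 1
            delta = value - mean
            mean += delta / count
            m2 += delta * (value - mean)
            slot[:] = [count, mean, m2]
        return is_anomaly

baseline = SeasonalBaseline()
monday_9am = datetime(2024, 1, 1, 9)  # 2024-01-01 is a Monday
for week in range(12):
    baseline.score_and_update("cpu", monday_9am + timedelta(weeks=week), 79 + week % 3)

next_monday = monday_9am + timedelta(weeks=12)
normal = baseline.score_and_update("cpu", next_monday, 81)
drop = baseline.score_and_update("cpu", next_monday, 15)
night = baseline.score_and_update("cpu", next_monday.replace(hour=3), 15)
print(f"81% at Mon 9 AM: {normal}, 15% at Mon 9 AM: {drop}, 15% at Mon 3 AM (no history): {night}")
```

Each slot needs only three numbers, so the full state for 5M metric-slots fits comfortably in a key-value store and can be updated in constant time per data point.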
Key Takeaways
- Time series data is fundamentally different: Temporal dependency means order matters. Never shuffle time series data, and always use time-based train/test splits.

- Stationarity is the foundation: Most classical methods require stationary data. Use the ADF test to check, and apply differencing to transform non-stationary series.

- Decompose to understand: Most time series can be usefully broken into trend, seasonality, and residual components. Understanding these components guides model selection and feature engineering.

- ARIMA is the workhorse: The (p,d,q) parameters capture autoregression, differencing, and moving average effects. Use ACF/PACF plots to guide parameter selection, or let auto-ARIMA search for you.

- Prophet democratized forecasting: Facebook’s Prophet handles holidays, missing data, and changepoints automatically. It’s often the fastest “just works” solution for daily business data.

- Deep learning needs scale: LSTMs and Transformers shine when you have thousands of related time series and abundant data. For single series, classical methods often win.

- Feature engineering wins competitions: Lag features (yesterday, last week, last year) and rolling statistics often matter more than model complexity. In the M5 competition, the top accuracy entries were built on gradient boosting (LightGBM) with well-engineered features.

- Data leakage is the silent killer: Point-in-time feature engineering is critical. One leaked feature can make a useless model look perfect in backtesting.

- Anomaly detection needs context: A value that’s normal at 9 AM might be anomalous at 3 AM. Always build seasonally-aware baselines to reduce false positives.

- Monitor for regime change: Models trained on historical data assume the future resembles the past. Detect when underlying patterns shift and retrain accordingly.
Hands-On Exercises
Exercise 1: Build Complete ARIMA Pipeline

    """Complete ARIMA forecasting pipeline with proper evaluation."""
    import pandas as pd
    import numpy as np
    from statsmodels.tsa.stattools import adfuller
    from statsmodels.tsa.arima.model import ARIMA
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    import matplotlib.pyplot as plt
    from sklearn.metrics import mean_absolute_error, mean_squared_error

    def check_stationarity(series, significance=0.05):
        """
        Test for stationarity using Augmented Dickey-Fuller test.

        Returns True if series is stationary (p-value < significance).
        """
        result = adfuller(series.dropna())
        adf_stat = result[0]
        p_value = result[1]

        print(f"ADF Statistic: {adf_stat:.4f}")
        print(f"p-value: {p_value:.4f}")

        if p_value < significance:
            print("Series IS stationary (reject null hypothesis)")
            return True
        else:
            print("Series is NOT stationary (fail to reject null)")
            return False

    def make_stationary(series, max_diff=2):
        """
        Apply differencing until series is stationary.

        Returns (transformed_series, number_of_differences).
        """
        current = series.copy()
        d = 0

        while not check_stationarity(current) and d < max_diff:
            d += 1
            current = current.diff().dropna()
            print(f"\nAfter {d} difference(s):")

        return current, d

    def select_arima_order(series, max_p=5, max_q=5):
        """
        Use information criteria to select best ARIMA(p,d,q) parameters.
        """
        # Make stationary first (only d is needed; ARIMA differences internally)
        stationary, d = make_stationary(series)

        # Grid search over p and q
        best_aic = np.inf
        best_order = None

        for p in range(max_p + 1):
            for q in range(max_q + 1):
                try:
                    model = ARIMA(series, order=(p, d, q))
                    fitted = model.fit()

                    if fitted.aic < best_aic:
                        best_aic = fitted.aic
                        best_order = (p, d, q)
                    print(f"ARIMA({p},{d},{q}): AIC={fitted.aic:.2f}")
                except Exception:
                    # Skip orders that fail to fit
                    continue

        print(f"\nBest order: ARIMA{best_order} with AIC={best_aic:.2f}")
        return best_order

    def walk_forward_validation(series, order, test_size=30):
        """
        Evaluate ARIMA using walk-forward validation (time-respecting CV).

        At each step:
        1. Train on all data up to time t
        2. Predict time t+1
        3. Move forward and repeat
        """
        history = list(series[:-test_size])
        predictions = []
        actuals = list(series[-test_size:])

        for i in range(test_size):
            # Fit model on history
            model = ARIMA(history, order=order)
            fitted = model.fit()

            # Predict next value
            forecast = fitted.forecast(steps=1)[0]
            predictions.append(forecast)

            # Add actual to history (simulates getting new data)
            history.append(actuals[i])

            if (i + 1) % 10 == 0:
                print(f"Progress: {i + 1}/{test_size} predictions")

        # Calculate metrics
        mae = mean_absolute_error(actuals, predictions)
        rmse = np.sqrt(mean_squared_error(actuals, predictions))

        print("\nWalk-Forward Validation Results:")
        print(f"MAE: {mae:.4f}")
        print(f"RMSE: {rmse:.4f}")

        return predictions, actuals, mae, rmse

    # Example usage:
    # df = pd.read_csv('sales.csv', parse_dates=['date'], index_col='date')
    # series = df['sales']
    # order = select_arima_order(series)
    # predictions, actuals, mae, rmse = walk_forward_validation(series, order)

Exercise 2: Prophet vs ARIMA Comparison
Section titled “Exercise 2: Prophet vs ARIMA Comparison”"""Head-to-head comparison of Prophet and ARIMA on the same dataset."""import pandas as pdimport numpy as npfrom prophet import Prophetfrom statsmodels.tsa.arima.model import ARIMAfrom sklearn.metrics import mean_absolute_errorimport matplotlib.pyplot as plt
def prepare_prophet_data(series):
    """Convert pandas Series to Prophet format (ds, y columns)."""
    df = pd.DataFrame({
        'ds': series.index,
        'y': series.values
    })
    return df


def compare_forecasters(series, forecast_horizon=30):
    """
    Compare Prophet vs ARIMA on the same train/test split.
    """
    # Split data
    train = series[:-forecast_horizon]
    test = series[-forecast_horizon:]

    results = {}

    # --- PROPHET ---
    print("Training Prophet...")
    prophet_df = prepare_prophet_data(train)
    prophet_model = Prophet(
        yearly_seasonality=True,
        weekly_seasonality=True,
        daily_seasonality=False
    )
    prophet_model.fit(prophet_df)

    # Make future dataframe (make_future_dataframe defaults to daily frequency)
    future = prophet_model.make_future_dataframe(periods=forecast_horizon)
    prophet_forecast = prophet_model.predict(future)
    prophet_preds = prophet_forecast['yhat'].iloc[-forecast_horizon:].values

    results['Prophet'] = {
        'predictions': prophet_preds,
        'mae': mean_absolute_error(test.values, prophet_preds)
    }

    # --- ARIMA ---
    print("Training ARIMA...")
    # Fixed illustrative order; in practice, choose it with select_arima_order
    arima_model = ARIMA(train, order=(5, 1, 2))
    arima_fitted = arima_model.fit()
    arima_preds = arima_fitted.forecast(steps=forecast_horizon)

    results['ARIMA'] = {
        'predictions': arima_preds,
        'mae': mean_absolute_error(test.values, arima_preds)
    }

    # --- COMPARISON ---
    print("\n" + "=" * 50)
    print("COMPARISON RESULTS")
    print("=" * 50)
    for name, data in results.items():
        print(f"{name:15} MAE: {data['mae']:.4f}")

    # Determine winner
    winner = min(results.keys(), key=lambda k: results[k]['mae'])
    print(f"\n Winner: {winner}")

    # Plot comparison
    plt.figure(figsize=(12, 6))
    plt.plot(test.index, test.values, 'k-', label='Actual', linewidth=2)
    plt.plot(test.index, results['Prophet']['predictions'], 'b--', label='Prophet')
    plt.plot(test.index, results['ARIMA']['predictions'], 'r--', label='ARIMA')
    plt.legend()
    plt.title('Prophet vs ARIMA Forecast Comparison')
    plt.xlabel('Date')
    plt.ylabel('Value')
    plt.tight_layout()
    plt.savefig('forecast_comparison.png')

    return results

# Analyze Prophet components
def analyze_prophet_components(model, forecast):
    """Visualize what Prophet learned about trend and seasonality."""
    fig = model.plot_components(forecast)
    fig.tight_layout()
    fig.savefig('prophet_components.png')

    # Extract component strengths
    # (the 'yearly' column exists only when yearly seasonality is enabled)
    trend_range = forecast['trend'].max() - forecast['trend'].min()
    yearly_range = forecast['yearly'].max() - forecast['yearly'].min()

    print("\nComponent Analysis:")
    print(f"Trend range: {trend_range:.2f}")
    print(f"Yearly seasonality range: {yearly_range:.2f}")
    # Guard against a flat trend, which would make the ratio undefined
    if trend_range > 0:
        print(f"Ratio (seasonality/trend): {yearly_range/trend_range:.2%}")

Exercise 3: LSTM Time Series Model

"""LSTM model for time series forecasting with proper sequence creation."""
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error

class LSTMForecaster(nn.Module):
    """
    LSTM architecture for time series prediction.
    """
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, dropout=0.2):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout
        )

        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x shape: (batch, sequence_length, input_size)
        lstm_out, _ = self.lstm(x)

        # Take the last output
        last_output = lstm_out[:, -1, :]

        # Predict
        prediction = self.fc(last_output)
        return prediction

def create_sequences(data, look_back=30):
    """
    Create input sequences and targets for LSTM training.

    Given [1, 2, 3, 4, 5] with look_back=3:
    X = [[1,2,3], [2,3,4]]
    y = [4, 5]
    """
    X, y = [], []
    for i in range(len(data) - look_back):
        X.append(data[i:(i + look_back)])
        y.append(data[i + look_back])
    return np.array(X), np.array(y)

def train_lstm_forecaster(series, look_back=30, epochs=100, batch_size=32):
    """
    Complete LSTM training pipeline with proper time split.
    """
    values = series.values.reshape(-1, 1)

    # Time-based split (80% train, 20% test), measured in sequences
    n_sequences = len(values) - look_back
    split_idx = int(n_sequences * 0.8)

    # Scale data to [0, 1]. Fit the scaler on the training portion only so that
    # test-set statistics do not leak into training. Training windows and their
    # targets cover raw indices [0, split_idx + look_back).
    scaler = MinMaxScaler()
    scaler.fit(values[:split_idx + look_back])
    scaled_data = scaler.transform(values)

    # Create sequences
    X, y = create_sequences(scaled_data.flatten(), look_back)

    X_train, X_test = X[:split_idx], X[split_idx:]
    y_train, y_test = y[:split_idx], y[split_idx:]

    print(f"Training samples: {len(X_train)}")
    print(f"Test samples: {len(X_test)}")

    # Convert to PyTorch tensors
    X_train = torch.FloatTensor(X_train).unsqueeze(-1)
    y_train = torch.FloatTensor(y_train).unsqueeze(-1)
    X_test = torch.FloatTensor(X_test).unsqueeze(-1)
    y_test = torch.FloatTensor(y_test).unsqueeze(-1)

    # Create DataLoader
    train_dataset = TensorDataset(X_train, y_train)
    # Note: shuffle=False keeps batches in chronological order. The leakage
    # protection comes from the time-based split above, not from this flag.
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)

    # Initialize model
    model = LSTMForecaster()
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    # Training loop
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        if (epoch + 1) % 20 == 0:
            print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(train_loader):.6f}")

    # Evaluate
    model.eval()
    with torch.no_grad():
        predictions = model(X_test)

    # Inverse transform to original scale
    preds_original = scaler.inverse_transform(predictions.numpy())
    actuals_original = scaler.inverse_transform(y_test.numpy())

    mae = mean_absolute_error(actuals_original, preds_original)
    print(f"\nTest MAE: {mae:.4f}")

    return model, scaler, mae

# Usage:
# df = pd.read_csv('data.csv', index_col='date', parse_dates=True)
# model, scaler, mae = train_lstm_forecaster(df['value'], look_back=30)

Exercise 4: Build Anomaly Detection System

"""Production-ready time series anomaly detection with seasonal baselines."""
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from collections import defaultdict

class SeasonalAnomalyDetector:
    """
    Anomaly detector that learns hour-of-day and day-of-week patterns.
    """

    def __init__(self, z_threshold=3.0, min_samples=20):
        self.z_threshold = z_threshold
        self.min_samples = min_samples
        # Store baselines by (hour, day_of_week)
        self.baselines = defaultdict(lambda: {'values': [], 'mean': None, 'std': None})

    def fit(self, series):
        """
        Learn normal patterns from historical data.

        series: pd.Series with DatetimeIndex
        """
        for timestamp, value in series.items():
            key = (timestamp.hour, timestamp.dayofweek)
            self.baselines[key]['values'].append(value)

        # Calculate statistics for each time slot
        for key, data in self.baselines.items():
            values = np.array(data['values'])
            if len(values) >= self.min_samples:
                # Use robust statistics (median and MAD) for outlier resistance.
                # 'mean' holds the median and 'std' the scaled MAD.
                data['mean'] = np.median(values)
                mad = np.median(np.abs(values - data['mean']))
                data['std'] = 1.4826 * mad  # Scale MAD to approximate std

                # Fallback to regular std if MAD is 0
                if data['std'] < 1e-6:
                    data['std'] = np.std(values)
            else:
                data['mean'] = None
                data['std'] = None

        print(f"Fitted on {len(series)} points")
        print(f"Unique time slots learned: {len(self.baselines)}")

        return self

    def detect(self, timestamp, value):
        """
        Check if a value is anomalous given its timestamp.

        Returns (is_anomaly, details_dict)
        """
        key = (timestamp.hour, timestamp.dayofweek)
        # Use .get so that looking up an unseen slot does not create an
        # empty entry in the defaultdict
        baseline = self.baselines.get(key)

        if baseline is None or baseline['mean'] is None:
            return False, {
                'status': 'insufficient_history',
                'message': f'Not enough data for {timestamp.hour}:00 on day {timestamp.dayofweek}'
            }

        z_score = (value - baseline['mean']) / (baseline['std'] + 1e-10)

        is_anomaly = abs(z_score) > self.z_threshold

        return is_anomaly, {
            'status': 'anomaly' if is_anomaly else 'normal',
            'value': value,
            'expected': baseline['mean'],
            'z_score': z_score,
            'threshold': self.z_threshold,
            'timestamp': timestamp
        }

    def detect_batch(self, series):
        """
        Run detection on a batch of data, returning all anomalies.
        """
        anomalies = []
        for timestamp, value in series.items():
            is_anomaly, details = self.detect(timestamp, value)
            if is_anomaly:
                anomalies.append(details)

        print(f"Found {len(anomalies)} anomalies in {len(series)} points")
        print(f"Anomaly rate: {len(anomalies)/len(series)*100:.2f}%")

        return anomalies

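To see why `fit` relies on the median and the scaled MAD rather than the mean and standard deviation, here is a minimal standalone sketch using only the standard library. The sample values and the helper names `robust_z` and `classical_z` are illustrative assumptions, not part of the detector above; the history deliberately contains one contaminating spike.

```python
import statistics


def robust_z(value, history):
    """Z-score based on the median and the scaled MAD (as in fit above)."""
    med = statistics.median(history)
    mad = statistics.median([abs(v - med) for v in history])
    return (value - med) / (1.4826 * mad + 1e-10)


def classical_z(value, history):
    """Z-score based on the mean and the population standard deviation."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return (value - mean) / (std + 1e-10)


# Hypothetical history for one (hour, day_of_week) slot, including one spike
history = [10, 11, 9, 10, 12, 10, 11, 50]
spike = 50

print(f"robust z:    {robust_z(spike, history):.2f}")
print(f"classical z: {classical_z(spike, history):.2f}")
```

With a threshold of 3.0, the robust score (about 53) flags the spike, while the classical score (about 2.6) does not, because the spike itself inflates the mean and the standard deviation it is measured against.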
# Usage example:
# train_data = df['metric']['2024-01-01':'2024-06-30']
# detector = SeasonalAnomalyDetector(z_threshold=3.0)
# detector.fit(train_data)
#
# test_data = df['metric']['2024-07-01':'2024-07-31']
# anomalies = detector.detect_batch(test_data)
# for a in anomalies[:5]:
#     print(f"{a['timestamp']}: value={a['value']:.2f}, expected={a['expected']:.2f}, z={a['z_score']:.2f}")

Summary

TIME SERIES FORECASTING TOOLKIT:
────────────────────────────────

┌───────────────────────────────────────────────────────────┐
│                     CLASSICAL METHODS                     │
├────────────────┬──────────────────────────────────────────┤
│ ARIMA/SARIMA   │ Statistical, interpretable, good baseline│
│ Prophet        │ Easy to use, handles holidays, robust    │
│ Exponential    │ Simple, fast, good for benchmarking      │
│ Smoothing      │                                          │
└────────────────┴──────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────┐
│                   DEEP LEARNING METHODS                   │
├────────────────┬──────────────────────────────────────────┤
│ LSTM/GRU       │ Sequential, good for medium-length deps  │
│ Transformer    │ Parallel, great for long dependencies    │
│ TFT            │ State-of-art, interpretable attention    │
│ N-BEATS        │ Pure DL, no hand-crafted features        │
└────────────────┴──────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────┐
│                     ENSEMBLE / HYBRID                     │
├────────────────┬──────────────────────────────────────────┤
│ LightGBM+Lags  │ Often wins competitions! Simple & fast   │
│ Stacking       │ Combine multiple model predictions       │
│ Weighted Avg   │ Average classical + DL forecasts         │
└────────────────┴──────────────────────────────────────────┘

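The "LightGBM+Lags" row relies on turning a series into a supervised table whose features are past values. A minimal sketch of that transformation in plain Python follows; the function name `make_lag_features` and the chosen lags are illustrative assumptions, and any tabular regressor (LightGBM included) could then be fit on the resulting rows.

```python
def make_lag_features(values, lags=(1, 2, 7)):
    """Build (features, target) rows where each feature is a past value.

    The row for time t contains values[t - lag] for every lag, and the
    target is values[t]. Rows start at t = max(lags) so that every lag
    is available.
    """
    max_lag = max(lags)
    X, y = [], []
    for t in range(max_lag, len(values)):
        X.append([values[t - lag] for lag in lags])
        y.append(values[t])
    return X, y


# Toy series 0..9 so the lag relationships are easy to read off
series = list(range(10))
X, y = make_lag_features(series, lags=(1, 2, 7))

for features, target in zip(X, y):
    print(features, "->", target)
```

Each row uses only values strictly before its target, so the features themselves introduce no look-ahead; the train/test split should still be chronological, as in the exercises above.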
DECISION FLOWCHART:
───────────────────

Start
└── How many series?
    ├── <10: ARIMA / Prophet
    │   └── Seasonality?
    │       ├── Yes: SARIMA or Prophet
    │       └── No: AR
    └── >100: Global Model
        └── >10k pts?
            ├── Yes: Transformer or TFT
            └── No: LightGBM + Lags

Historical Context

Understanding where time series methods came from helps appreciate their design decisions and limitations.
The Classical Era (1920s-1970s)
Time series forecasting began with simple moving averages in the 1920s, used primarily for economic forecasting and quality control in manufacturing. The field was transformed in the 1950s when Robert Brown developed exponential smoothing while working at the U.S. Navy’s Office of Operations Research. Brown needed to forecast demand for submarine spare parts, a problem where recent data should matter more than old data. His exponentially weighted moving average became the foundation for modern forecasting.
The next revolution came in 1970 when George Box and Gwilym Jenkins published their seminal book on ARIMA models. Box was a statistician at the University of Wisconsin, and Jenkins worked at the University of Lancaster. Their methodology—identify, estimate, diagnose, forecast—remained the dominant paradigm for three decades. Box famously noted, “All models are wrong, but some are useful,” capturing the pragmatic philosophy that still guides forecasting today.
Did You Know? One of the best-known worked examples in the Box-Jenkins book is a gas furnace data set, which relates the input gas feed rate to the CO2 concentration of the output gas. The autocorrelation techniques they refined on industrial problems like this became the foundation for forecasting everything from stock prices to weather patterns.
The Machine Learning Era (2000s-2010s)
The rise of machine learning brought new approaches to time series. Recurrent Neural Networks (RNNs) were proposed as early as 1986 by David Rumelhart and colleagues, but the vanishing gradient problem limited their practical use. Sepp Hochreiter and Jürgen Schmidhuber addressed this in 1997 with Long Short-Term Memory (LSTM) networks, but computing power wasn’t sufficient to train them effectively until the 2010s.
Meanwhile, practical forecasters discovered that gradient boosting with hand-crafted features often outperformed neural networks. The M Competitions (Makridakis Competitions), running since 1982, provided rigorous benchmarks. In the 2018 M4 competition, a hybrid approach combining exponential smoothing with neural networks won—showing that classical and modern methods could complement each other.
The Transformer Era (2017-Present)
The introduction of Transformers in 2017 (Vaswani et al.’s “Attention Is All You Need”) revolutionized natural language processing and eventually time series. The attention mechanism solved the fundamental problem that plagued RNNs: how to directly connect distant timesteps without information degrading through sequential processing.
Google’s Temporal Fusion Transformer (2020) adapted these ideas specifically for time series, adding variable selection networks to handle the many exogenous variables common in forecasting problems. Amazon’s DeepAR and Facebook’s Prophet (2017) democratized sophisticated forecasting, making it accessible to practitioners without deep statistical training.
Today, we’re in an exciting period where classical methods, gradient boosting, and deep learning each have their place. The key insight from decades of research: no single method dominates. The best practitioners understand the strengths of each approach and choose based on their specific problem, data, and constraints.
Further Reading
Papers
- “Time Series Analysis: Forecasting and Control” (Box & Jenkins, 1970) - The foundational text that defined modern time series analysis
- “Forecasting at Scale” (Taylor & Letham, 2017) - The paper introducing Prophet, Facebook’s accessible forecasting framework
- “Temporal Fusion Transformers” (Lim et al., 2020) - State-of-the-art deep learning for interpretable forecasting
- “N-BEATS: Neural Basis Expansion Analysis” (Oreshkin et al., 2020) - Pure deep learning without hand-crafted features
- “Time-series forecasting with deep learning: a survey” (Lim & Zohren, 2021) - Comprehensive survey of modern methods
- “The M5 Accuracy Competition: Results, Findings and Conclusions” (Makridakis et al., 2022) - Empirical insights from the largest forecasting competition
Libraries
- statsmodels: ARIMA, exponential smoothing, and classical statistical methods
- Prophet: Facebook’s forecasting library, excellent for daily data with seasonality
- GluonTS: Amazon’s deep learning time series toolkit with DeepAR and other models
- Darts: Unified interface for classical, ML, and deep learning methods
- sktime: scikit-learn compatible time series with consistent API
- pytorch-forecasting: PyTorch-based deep learning models including TFT
- NeuralProphet: Prophet-like interface with neural network backends
Next Steps
You now understand time series forecasting from classical ARIMA to modern transformers!
Up Next: Module 39 - AutoML & Feature Stores
Module 38 Complete! “The best forecast is the one that’s useful, not the one that’s most complex.”