
Time Series Forecasting

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 6-8

Illustrative AutoML comparison.

A junior practitioner can sometimes surface an AutoML result that challenges a carefully hand-built baseline.

A team had invested substantial time in a hand-built churn model before comparing it with an AutoML baseline.

Sarah asked a naive question: “Could I try AutoGluon on the same data?”

The senior engineers exchanged knowing glances. “Sure, but don’t expect it to beat a model built by experienced engineers.”

A short AutoML run can sometimes produce a stronger baseline than a manually tuned model, depending on the dataset and evaluation setup.

“How is that possible?” asked the team lead.

“It evaluated multiple model families, combined strong candidates, and surfaced patterns the manual workflow had not captured,” the teammate explained.

The operational lesson is that AutoML can provide a fast, credible baseline before a team commits to more manual modeling work.

“AutoML doesn’t make data scientists obsolete—it makes their time more valuable. Now you can spend six months on problems that actually need human creativity.” — Common AutoML perspective

This module teaches you how to use AutoML tools effectively—and why they’re not magic, but a powerful force multiplier for any ML practitioner.


By the end of this module, you will:

  • Understand AutoML and when to use it
  • Master AutoGluon for automated machine learning
  • Learn feature store concepts and architecture
  • Implement automated feature engineering
  • Build end-to-end ML pipelines

Imagine you’re a chef at a restaurant. Traditional ML is like cooking everything from scratch - you select ingredients (features), decide cooking methods (algorithms), adjust seasoning (hyperparameters), and taste-test constantly (validation). It requires years of expertise.

AutoML is like having a robot chef that tries hundreds of recipes automatically, learns from each attempt, and presents you with the best dish. You just need to provide the ingredients and describe what you want.

TRADITIONAL ML WORKFLOW:
────────────────────────
Data → [Manual Feature Eng] → [Choose Algorithm] → [Tune Hyperparameters] → Model
↑ ↑ ↑
│ │ │
Requires expertise Requires expertise Takes days/weeks
│ │ │
└─────────────────────────┴──────────────────────┘
TIME: Days to Weeks
AUTOML WORKFLOW:
────────────────
Data → [AutoML System] → Best Model
├── Tries many model families and configurations
├── Engineers features automatically
├── Tunes hyperparameters
└── Ensembles best models
TIME: Minutes to Hours

Did You Know? AutoML systems have reported strong benchmark and competition results under fixed compute budgets, but the exact ranking and runtime depend on the benchmark and configuration.


REALITY OF ML IN INDUSTRY:
──────────────────────────
Companies with ML needs: ████████████████████ 1,000,000+
Companies with ML experts: ███ ~50,000
Expert ML engineers: █ ~300,000
Gap: 95%+ of companies can't hire ML experts!
AutoML bridges this gap:
- Democratizes ML (anyone can use it)
- Accelerates expert productivity (10x faster)
- Establishes strong baselines automatically
USE AUTOML:
───────────
Establishing baselines quickly
Tabular data problems
Time-constrained projects
Non-expert teams
Comparing many algorithms
Hyperparameter optimization
DON'T USE AUTOML (alone):
─────────────────────────
Custom architectures needed
Domain-specific constraints
Real-time inference requirements
Highly specialized problems
When interpretability is critical

AUTOML FRAMEWORK COMPARISON:
────────────────────────────
┌──────────────────────────────────────────────────────────────────┐
│ FRAMEWORK │ BEST FOR │ KEY STRENGTH │
├──────────────────────────────────────────────────────────────────┤
│ AutoGluon │ Tabular, general │ Best accuracy, ensembles │
│ (Amazon) │ │ │
├──────────────────────────────────────────────────────────────────┤
│ auto-sklearn │ scikit-learn users │ Meta-learning, warm-start │
│ (Freiburg) │ │ │
├──────────────────────────────────────────────────────────────────┤
│ H2O AutoML │ Enterprise │ Production-ready, scaling │
│ (H2O.ai) │ │ │
├──────────────────────────────────────────────────────────────────┤
│ FLAML │ Fast experiments │ Low compute, fast │
│ (Microsoft) │ │ │
├──────────────────────────────────────────────────────────────────┤
│ PyCaret │ Low-code ML │ Simple API, visualization │
│ (Open Source) │ │ │
└──────────────────────────────────────────────────────────────────┘

AutoGluon (from Amazon) has reported strong results on tabular benchmarks and competitions:

AUTOGLUON ARCHITECTURE:
───────────────────────
Input Data
┌────────────┼────────────┐
│ │ │
▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐
│NN │ │GBM │ │Linear │
│Models │ │Models │ │Models │
└───┬───┘ └───┬───┘ └───┬───┘
│ │ │
│ ┌──────┼──────┐ │
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────────────────────────┐
│ MULTI-LAYER STACKING │
│ (Ensemble of ensembles) │
└──────────────┬──────────────┘
Best Model
MODELS AUTOGLUON TRIES:
───────────────────────
Neural Networks:
- TabularNN (custom for tabular)
- FastAI neural network
Gradient Boosting:
- LightGBM
- CatBoost
- XGBoost
Linear Models:
- Linear/logistic regression (regularized)
Tree Ensembles and Neighbors:
- Random Forest, Extra Trees
- K-Nearest Neighbors
Ensemble:
- Weighted ensemble
- Multi-layer stacking
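
The weighted-ensemble idea in the list above can be illustrated with a toy sketch. It blends two sets of validation predictions and picks the blend weight on a coarse grid. The predictions, model names, and grid are made up for illustration, and this is not AutoGluon's actual ensembling code.

```python
# Toy weighted ensemble: choose the blend weight that minimizes validation MSE.
# All numbers below are hypothetical illustration data.

def mse(preds, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def blend(preds_a, preds_b, weight_a):
    """Weighted average of two prediction lists."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(preds_a, preds_b)]


def best_blend_weight(preds_a, preds_b, targets, steps=10):
    """Try weights 0.0, 0.1, ..., 1.0 and keep the one with the lowest MSE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: mse(blend(preds_a, preds_b, w), targets))


y_val = [1.0, 0.0, 1.0, 1.0, 0.0]
gbm_preds = [0.9, 0.3, 0.6, 0.8, 0.1]   # hypothetical gradient-boosting outputs
nn_preds = [0.7, 0.1, 0.9, 0.6, 0.4]    # hypothetical neural-network outputs

weight = best_blend_weight(gbm_preds, nn_preds, y_val)
ensemble_mse = mse(blend(gbm_preds, nn_preds, weight), y_val)
print(f"weight on GBM: {weight:.1f}, ensemble MSE: {ensemble_mse:.4f}")
```

Because the grid includes weights 0 and 1, the blend can never score worse than either model on the data used to pick the weight. That is also why the weight should be chosen on held-out or out-of-fold predictions rather than on training predictions.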

Did You Know? AutoGluon uses stacked ensembles that train later models on earlier model predictions. This can improve accuracy, but the exact gain depends on the dataset, time budget, and evaluation setup.


How does AutoML choose which algorithms to try?

ALGORITHM SELECTION STRATEGIES:
───────────────────────────────
1. EXHAUSTIVE SEARCH
Try ALL algorithms, all hyperparameters
Problem: Computationally infeasible!
10 algorithms × 100 hyperparameter combos = 1,000 models
If each takes 1 minute = 16+ hours
2. META-LEARNING (auto-sklearn approach)
Learn from past datasets which algorithms work best
"This dataset looks like Dataset #4,523 from our database.
Random Forest worked best there, let's try that first!"
Steps:
a) Extract meta-features from dataset
b) Find similar historical datasets
c) Start with algorithms that worked on those
3. BAYESIAN OPTIMIZATION
Smart search that learns as it goes
┌──────────────────────────────────────────────┐
│ Iteration 1: Try random config → Score: 0.75 │
│ Iteration 2: Try another → Score: 0.82 │
│ Iteration 3: Try similar to best → 0.85 │
│ ...learns that high learning_rate is bad... │
│ Iteration 50: Optimal found → 0.91 │
└──────────────────────────────────────────────┘
4. BANDIT-BASED (Hyperband/ASHA)
Give more resources to promising configs
Start: 100 configs with 1 epoch each
Keep: Top 25 configs, train for 4 epochs
Keep: Top 6 configs, train for 16 epochs
Keep: Top 2 configs, train to completion
Result: Find best config with 10x less compute!
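
The bandit-based strategy above can be sketched as plain successive halving. This is a minimal sketch under stated assumptions: `toy_score` is a hypothetical stand-in for "train this config for `budget` epochs and return a validation score", and the schedule keeps the top quarter each round, so it ends with one survivor rather than the two shown in the diagram.

```python
import random


def successive_halving(configs, evaluate, min_budget=1, eta=4):
    """Repeatedly keep the top 1/eta configs and give survivors eta times more budget."""
    survivors = list(configs)
    budget = min_budget
    schedule = []  # (configs evaluated, budget per config) for each round
    while len(survivors) > 1:
        schedule.append((len(survivors), budget))
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0], schedule


def toy_score(config, budget):
    """Hypothetical objective: best near learning_rate=0.1, less noisy with more budget."""
    rng = random.Random(config["id"] * 1000 + budget)
    return -abs(config["learning_rate"] - 0.1) + rng.gauss(0, 0.05 / budget)


sampler = random.Random(0)
configs = [{"id": i, "learning_rate": 10 ** sampler.uniform(-3, 0)} for i in range(100)]

best, schedule = successive_halving(configs, toy_score)
total_budget = sum(n * b for n, b in schedule)
print(schedule, total_budget, best)
```

With 100 starting configs the rounds are 100 configs at 1 epoch, 25 at 4 epochs, and 6 at 16 epochs, for 296 epoch-units in total. Training all 100 configs for 16 epochs would cost 1,600 units, which is where the savings come from; the exact ratio depends on the schedule chosen.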

Think of hyperparameter optimization like tuning a guitar. Each hyperparameter is a string that affects the sound. Turn the learning rate too high and you get noise; too low and you barely hear anything. The problem? A neural network has dozens of “strings,” and they all interact with each other.

Traditional approach: try every combination. With 10 hyperparameters and 5 values each, that’s 5^10 ≈ 9.8 million combinations. Even at 1 minute per trial, that’s more than 18 years of compute.

Smart approach: use Bayesian optimization, which learns from each trial. “High learning rate made things worse? Let’s try lower values.” It can often find good configurations in far fewer trials than exhaustive search.

HYPERPARAMETER SEARCH SPACE:
────────────────────────────
LightGBM example:
┌─────────────────────────────────────────────────┐
│ Parameter │ Search Space │
├─────────────────────────────────────────────────┤
│ n_estimators │ [100, 200, 500, 1000] │
│ learning_rate │ [0.01, 0.05, 0.1, 0.3] │
│ max_depth │ [3, 5, 7, 10, -1] │
│ num_leaves │ [15, 31, 63, 127] │
│ min_child_weight │ [1e-3, 1e-2, 0.1, 1] │
│ subsample │ [0.5, 0.7, 0.9, 1.0] │
│ colsample_bytree │ [0.5, 0.7, 0.9, 1.0] │
└─────────────────────────────────────────────────┘
Total combinations: 4×4×5×4×4×4×4 = 20,480 configs!
Smart search finds good config in ~50 trials
Exhaustive search needs 20,480 trials
Speedup: 400x
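
The grid-size arithmetic above can be checked in a few lines. The sketch also draws 50 random configurations from the same space. Random sampling is used here as a simple stand-in; it is not Bayesian optimization, which would additionally use earlier scores to decide what to try next.

```python
import math
import random

# Search space copied from the LightGBM table above.
search_space = {
    "n_estimators": [100, 200, 500, 1000],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "max_depth": [3, 5, 7, 10, -1],
    "num_leaves": [15, 31, 63, 127],
    "min_child_weight": [1e-3, 1e-2, 0.1, 1],
    "subsample": [0.5, 0.7, 0.9, 1.0],
    "colsample_bytree": [0.5, 0.7, 0.9, 1.0],
}

# Exhaustive grid size is the product of the number of choices per parameter.
grid_size = math.prod(len(values) for values in search_space.values())


def sample_config(space, rng):
    """Draw one configuration uniformly at random from the grid."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(0)
trials = [sample_config(search_space, rng) for _ in range(50)]

print(f"grid size: {grid_size}")
print(f"trials run: {len(trials)} ({grid_size / len(trials):.1f}x fewer than the grid)")
```

The ratio comes out to about 410, which the diagram rounds to 400x. Whether 50 trials are enough to find a good configuration depends on the dataset and the search method.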

FEATURE ENGINEERING REALITY:
────────────────────────────
Time spent on ML projects:
┌──────────────────────────────────────────────────────┐
│ Data Collection ████████████ 25% │
│ Data Cleaning ████████████████ 35% │
│ Feature Engineering ██████████████ 30% │
│ Model Training ████ 10% │
└──────────────────────────────────────────────────────┘
Feature engineering is:
- Time-consuming (30% of project time)
- Requires domain expertise
- Often repetitive across projects
- Critical for model performance
"Give me better features, and I'll give you a better model."
- Every ML Engineer
AUTO-FEATURE TECHNIQUES:
────────────────────────
1. AGGREGATION (for relational data)
─────────────────────────────────
customer_id → orders table
Auto-generated features:
- count(orders)
- sum(order_amount)
- avg(order_amount)
- max(order_amount)
- days_since_last_order
2. TRANSFORMATION
───────────────
Original: [price, quantity]
Auto-generated:
- log(price)
- sqrt(quantity)
- price * quantity (interaction)
- price / quantity (ratio)
- price ** 2 (polynomial)
3. TIME-BASED (from timestamps)
────────────────────────────
Original: purchase_datetime
Auto-generated:
- hour_of_day
- day_of_week
- is_weekend
- month
- quarter
- days_since_signup
4. ENCODING (for categoricals)
───────────────────────────
Original: category = ["A", "B", "C"]
Auto-generated:
- One-hot encoding
- Target encoding
- Frequency encoding
- Label encoding
DEEP FEATURE SYNTHESIS (DFS):
─────────────────────────────
Given relational tables, automatically generate features:
TABLES:
customers(id, signup_date, country)
orders(id, customer_id, date, amount)
products(id, order_id, product_type, price)
DFS GENERATES:
──────────────
Depth 1: Simple aggregations
- COUNT(orders)
- SUM(orders.amount)
- AVG(orders.amount)
Depth 2: Stacked aggregations
- COUNT(orders.products)
- AVG(orders.SUM(products.price))
- MODE(orders.MODE(products.product_type))
Depth 3: Triple-stacked!
- STD(orders.AVG(products.price))
Result: 100s of features from 3 tables!
PRIMITIVES USED:
────────────────
Aggregation: sum, mean, count, max, min, std, mode
Transform: year, month, weekday, cum_sum, diff
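
The depth-1 and depth-2 aggregations above can be reproduced by hand on a toy dataset. This sketch uses plain Python rather than Featuretools, and the rows are hypothetical; it only shows what feature names such as AVG(orders.SUM(products.price)) actually compute.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows for the orders and products tables described above.
orders = [
    {"id": 1, "customer_id": "a", "amount": 30.0},
    {"id": 2, "customer_id": "a", "amount": 50.0},
    {"id": 3, "customer_id": "b", "amount": 20.0},
]
products = [
    {"order_id": 1, "price": 10.0},
    {"order_id": 1, "price": 20.0},
    {"order_id": 2, "price": 50.0},
    {"order_id": 3, "price": 20.0},
]

# Inner aggregation: SUM(products.price) for each order.
price_sum_by_order = defaultdict(float)
for product in products:
    price_sum_by_order[product["order_id"]] += product["price"]

# Group orders by their parent customer.
orders_by_customer = defaultdict(list)
for order in orders:
    orders_by_customer[order["customer_id"]].append(order)

features = {}
for customer_id, rows in orders_by_customer.items():
    amounts = [order["amount"] for order in rows]
    features[customer_id] = {
        # Depth 1: aggregate the orders table directly.
        "COUNT(orders)": len(rows),
        "SUM(orders.amount)": sum(amounts),
        "AVG(orders.amount)": mean(amounts),
        # Depth 2: aggregate a per-order aggregate of the products table.
        "AVG(orders.SUM(products.price))": mean(
            price_sum_by_order[order["id"]] for order in rows
        ),
    }

print(features)
```

Each extra level of depth multiplies the number of candidate features, which is why generated features still need selection and validation before use.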

Did You Know? Automated feature engineering can generate large numbers of relational features quickly, but those features still need regularization, validation, and task-specific judgment to be useful.


Think of a feature store as a “data warehouse for ML features” - a centralized repository where teams can share, discover, and reuse features.

Imagine a restaurant where every chef prepares their own spice blends. Chef A makes curry powder. Chef B makes the same curry powder differently. Chef C needs curry powder but doesn’t know it already exists, so they make a third version. Different dishes taste inconsistent, ingredients are wasted, and no one knows which recipe is “official.”

Now imagine a central spice cabinet with standardized, labeled blends. Every chef uses the same curry powder. New chefs can see what’s available. If the curry powder recipe improves, all dishes improve automatically.

That’s what a feature store does for ML features—centralizes, standardizes, and shares them across teams.

WITHOUT FEATURE STORE:
──────────────────────
Team A: Builds "customer_lifetime_value" feature
├── Writes SQL query
├── Schedules daily job
└── Stores in their own table
Team B: Needs same feature
├── Doesn't know Team A has it
├── Builds their own version
└── Gets slightly different results!
Team C: Needs feature for real-time inference
├── Can't use batch SQL
└── Builds third version!
Result: 3 versions of the same feature, inconsistent!
WITH FEATURE STORE:
───────────────────
┌─────────────────────────────┐
│ FEATURE STORE │
│ ┌─────────────────────┐ │
│ │ customer_lifetime_ │ │
│ │ value │ │
│ │ - Batch: Daily SQL │ │
│ │ - Online: Redis │ │
│ │ - Owner: Team A │ │
│ │ - Version: 2.3 │ │
│ └─────────────────────┘ │
└──────────────┬──────────────┘
┌──────────────────┼──────────────────┐
│ │ │
Team A Team B Team C
(training) (training) (inference)
All teams use the SAME feature definition!
FEATURE STORE COMPONENTS:
─────────────────────────
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE STORE │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ FEATURE REGISTRY │ │
│ │ - Feature definitions (schema, transformations) │ │
│ │ - Metadata (owner, description, data lineage) │ │
│ │ - Versioning │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┴────────────────────┐ │
│ │ │ │
│ ┌──────▼──────┐ ┌───────▼───────┐ │
│ │ OFFLINE │ │ ONLINE │ │
│ │ STORE │ │ STORE │ │
│ │ │ │ │ │
│ │ - Historical│ Sync │ - Latest │ │
│ │ - Training │ ◄──────────────────────►│ - Real-time │ │
│ │ - BigQuery/ │ │ - Redis/ │ │
│ │ S3/HDFS │ │ DynamoDB │ │
│ └─────────────┘ └───────────────┘ │
│ │ │ │
│ │ │ │
│ ┌──────▼──────────────────────────────────────────▼──────┐ │
│ │ SERVING LAYER │ │
│ │ - Batch serving (training) │ │
│ │ - Online serving (inference) │ │
│ │ - Point-in-time correctness │ │
│ └────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

This is the MOST important concept in feature stores:

POINT-IN-TIME PROBLEM:
──────────────────────
Training data preparation:
"What was the customer's purchase_count on 2024-01-15?"
WRONG APPROACH (data leakage!):
SELECT purchase_count FROM current_features
WHERE customer_id = 123
Problem: Returns TODAY's count, not 2024-01-15's!
The model trains on future information = cheating!
CORRECT APPROACH (point-in-time join):
SELECT purchase_count FROM feature_history
WHERE customer_id = 123
AND feature_timestamp <= '2024-01-15'
ORDER BY feature_timestamp DESC
LIMIT 1
Returns: Value as it was on 2024-01-15
TIMELINE:
─────────
2024-01-15 Today
│ │
▼ ▼
──────────●─────────────────●──────►
│ │
Training event Don't use
Use features these values!
from HERE
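
The point-in-time lookup above can be sketched without a database. The history below is hypothetical and assumed to be sorted by timestamp; the function returns the latest value recorded at or before the training event, which matches the intent of the correct SQL query.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical purchase_count history for one customer, sorted by timestamp.
feature_history = [
    (date(2024, 1, 1), 3),
    (date(2024, 1, 10), 5),
    (date(2024, 2, 1), 9),
]


def value_as_of(history, as_of):
    """Return the latest value with timestamp <= as_of, or None if none exists."""
    timestamps = [timestamp for timestamp, _ in history]
    index = bisect_right(timestamps, as_of)  # first position strictly after as_of
    return history[index - 1][1] if index > 0 else None


print(value_as_of(feature_history, date(2024, 1, 15)))  # value known on 2024-01-15
```

Reading the most recent row instead would return the February value, which is information the model could not have had at the training event.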

Did You Know? Feature stores centralize feature definitions, can support online and offline retrieval, and help reduce training-serving inconsistencies.


FEAST COMPONENTS:
─────────────────
┌─────────────────────────────────────────────────────────────┐
│ FEAST │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ FEATURE REPO │ │
│ │ feature_store.yaml # Configuration │ │
│ │ features.py # Feature definitions │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ │ feast apply │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ REGISTRY │ │
│ │ - Feature views │ │
│ │ - Entities │ │
│ │ - Data sources │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Offline │ │ Online │ │ Serving │ │
│ │ Store │ │ Store │ │ API │ │
│ │ (Parquet) │ │ (Redis) │ │ (gRPC) │ │
│ └───────────┘ └───────────┘ └───────────┘ │
└─────────────────────────────────────────────────────────────┘
# features.py
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64
# Define entity (the thing we're building features for)
customer = Entity(
    name="customer_id",
    join_keys=["customer_id"],
    description="Customer identifier",
)
# Define data source
customer_stats_source = FileSource(
    path="data/customer_stats.parquet",
    timestamp_field="event_timestamp",
)
# Define feature view
customer_stats = FeatureView(
    name="customer_stats",
    entities=[customer],
    ttl=timedelta(days=90),  # How long features are valid
    schema=[
        Field(name="total_purchases", dtype=Int64),
        Field(name="avg_order_value", dtype=Float32),
        Field(name="days_since_last_order", dtype=Int64),
    ],
    online=True,  # Serve from online store
    source=customer_stats_source,
)
# Training: Get historical features
import pandas as pd
from feast import FeatureStore
store = FeatureStore(repo_path=".")
# Entity dataframe (what we want features for)
entity_df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "event_timestamp": pd.to_datetime(["2024-01-15"] * 5),
})
# Get training data with point-in-time correctness!
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_stats:total_purchases",
        "customer_stats:avg_order_value",
        "customer_stats:days_since_last_order",
    ],
).to_df()
# Inference: Get online features
feature_vector = store.get_online_features(
    features=[
        "customer_stats:total_purchases",
        "customer_stats:avg_order_value",
    ],
    entity_rows=[{"customer_id": 123}],
).to_dict()
# Returns: {"customer_id": [123], "total_purchases": [47], ...}

AUTOMATED ML PIPELINE:
──────────────────────
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Data │──►│ Feature │──►│ Model │──►│ Model │──►│ Deploy │
│ Ingest │ │ Eng │ │ Train │ │ Eval │ │ │
└─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
Scheduled Feature AutoML Metrics A/B Test
Jobs Store Trains Tracked or Canary
ORCHESTRATION:
- Airflow / Dagster / Prefect
- MLflow for experiment tracking
- Feature store for features
- Model registry for models
- CI/CD for deployment
MLFLOW EXPERIMENT TRACKING:
───────────────────────────
import mlflow
import mlflow.sklearn
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("model_type", "LightGBM")
    mlflow.log_param("n_estimators", 100)
    # Train model (train_model is a placeholder for your own training code)
    model = train_model(...)
    # Log metrics
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_metric("f1_score", 0.93)
    # Log model
    mlflow.sklearn.log_model(model, "model")
MLFLOW UI shows:
┌────────────────────────────────────────────────────────────┐
│ Run ID │ Model │ n_estimators │ Accuracy │ F1 │
├────────────────────────────────────────────────────────────┤
│ abc123 │ LightGBM │ 100 │ 0.95 │ 0.93 │
│ def456 │ XGBoost │ 200 │ 0.94 │ 0.92 │
│ ghi789 │ RF │ 150 │ 0.92 │ 0.90 │
└────────────────────────────────────────────────────────────┘
Easy comparison across experiments!

PRODUCTION AUTOML WORKFLOW:
───────────────────────────
1. DATA PREPARATION
├── Load data
├── Basic cleaning (handle missing, remove duplicates)
├── Define target variable
└── Split train/validation/test
2. QUICK AUTOML RUN (1 hour)
├── Use AutoGluon with time_limit=3600
├── Get baseline performance
└── Identify promising models
3. ANALYZE RESULTS
├── Check feature importance
├── Look for data leakage
└── Understand model behavior
4. EXTENDED RUN (optional)
├── Run AutoGluon with more time
├── Try different presets (e.g. medium_quality vs best_quality)
└── Experiment with hyperparameter ranges
5. MODEL SELECTION
├── Compare accuracy vs inference time
├── Consider interpretability needs
└── Check model size for deployment
6. PRODUCTION PREPARATION
├── Export best model
├── Create feature pipeline
├── Set up monitoring
└── Deploy with gradual rollout
AUTOGLUON PRESETS:
──────────────────
PRESET: "best_quality"
- Maximum accuracy
- Uses all algorithms
- Deep stacking ensembles
- Time: 4-8x longer
- Use for: Final production models
PRESET: "high_quality"
- Near-optimal accuracy
- Good ensemble
- Reasonable time
- Use for: Most use cases
PRESET: "good_quality"
- Good accuracy
- Faster training
- Use for: Prototyping
PRESET: "medium_quality"
- Decent accuracy
- Much faster
- Use for: Quick baselines
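PRESET: "optimize_for_deployment"
- Deletes unused models and training artifacts after fit
- Keeps the best model and anything it depends on
- Smaller footprint on disk; accuracy is unchanged
- Use for: Shrinking a finished predictor (combine with a quality preset)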
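
Step 5 of the workflow above (compare accuracy vs inference time) can be sketched as a filter over leaderboard-style rows. The model names, scores, and latencies are hypothetical; in practice they would come from measurements on your own serving hardware.

```python
# Hypothetical leaderboard rows: validation score plus measured per-request latency.
candidates = [
    {"model": "stacked_ensemble", "val_auc": 0.912, "latency_ms": 200.0},
    {"model": "lightgbm", "val_auc": 0.901, "latency_ms": 6.0},
    {"model": "linear", "val_auc": 0.861, "latency_ms": 0.5},
]


def pick_model(rows, max_latency_ms):
    """Return the highest-scoring model that fits within the latency budget."""
    eligible = [row for row in rows if row["latency_ms"] <= max_latency_ms]
    if not eligible:
        raise ValueError(f"no model meets the {max_latency_ms} ms latency budget")
    return max(eligible, key=lambda row: row["val_auc"])


print(pick_model(candidates, max_latency_ms=10.0)["model"])
```

The design choice is to treat latency as a hard constraint and accuracy as the objective, so a slightly weaker model wins when the best one cannot be served in time.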

COMMON AUTOML MISTAKES:
───────────────────────
1. DATA LEAKAGE
───────────────
Problem: Target information leaks into features
Example: Predicting "will customer churn?"
Bad feature: "cancellation_date" (directly reveals answer!)
Solution: Review feature importance, suspicious features
that are too predictive are often leaky
2. OVERFITTING TO VALIDATION
──────────────────────────
Problem: Running AutoML many times, picking best
Each run: Accuracy = 0.91, 0.92, 0.93, 0.94, 0.90...
Pick best: 0.94!
Test set: 0.88 (overfit to validation)
Solution: Hold out a TRUE test set, evaluate once at end
3. IGNORING BUSINESS CONSTRAINTS
───────────────────────────────
Problem: Best model has 200ms latency, need < 10ms
AutoGluon best model: Stacked ensemble
Inference time: 200ms
Solution: Constrain model types or stacking depth, set a
latency budget (AutoGluon exposes infer_limit), and
measure inference time before deploying
4. FEATURE STORE SKEW
────────────────────
Problem: Training features differ from serving features
Training: feature_v1 (old transformation)
Serving: feature_v2 (new transformation)
Solution: Use feature store with versioning
AUTOML BEST PRACTICES:
──────────────────────
1. START SIMPLE
- Run quick AutoML first (30 min - 1 hour)
- Get baseline before investing more time
- Understand what's possible
2. FEATURE ENGINEERING STILL MATTERS
- AutoML optimizes models, not features
- Domain features often help
- Combine AutoML + manual feature eng
3. USE PROPER VALIDATION
- Time-based splits for time series
- Stratified for imbalanced classes
- Group splits for related samples
4. MONITOR IN PRODUCTION
- Track feature distributions
- Monitor model performance
- Set up drift detection
5. DOCUMENT EVERYTHING
- Which AutoML settings?
- What features used?
- Business metrics impact?
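
The "overfitting to validation" mistake above can be demonstrated with a small simulation. It assumes every run has the same hypothetical true accuracy, so any difference between runs is pure sampling noise on a finite validation set.

```python
import random

TRUE_ACCURACY = 0.90  # hypothetical: every run is really this good
VAL_SIZE = 500        # validation examples per run
N_RUNS = 20           # number of repeated AutoML runs


def noisy_validation_accuracy(rng):
    """Simulate measuring accuracy on a finite validation set."""
    correct = sum(rng.random() < TRUE_ACCURACY for _ in range(VAL_SIZE))
    return correct / VAL_SIZE


rng = random.Random(42)
val_scores = [noisy_validation_accuracy(rng) for _ in range(N_RUNS)]
best_val = max(val_scores)
mean_val = sum(val_scores) / N_RUNS

print(f"mean validation accuracy: {mean_val:.3f}")
print(f"best-of-{N_RUNS} accuracy:     {best_val:.3f}")
```

The best-of-N score is biased upward even though no run is actually better than the others, which is why the final number should come from a test set that was never used for selection.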

Did You Know? AutoML systems typically include built-in preprocessing and feature handling for tabular data, but the exact transformations and benchmark outcomes depend on the product and evaluation setup.


KEY CONCEPTS RECAP:
───────────────────
AUTOML:
- Automates algorithm selection + hyperparameter tuning
- AutoGluon: Best accuracy, multi-layer stacking
- Use presets based on needs (quality vs speed)
FEATURE STORES:
- Centralized feature management
- Point-in-time correctness for training
- Online/offline serving
- Feast: Open source, easy to start
AUTOMATED FEATURE ENGINEERING:
- Aggregations, transformations, time features
- Featuretools for deep feature synthesis
- Still combine with domain knowledge
ML PIPELINES:
- End-to-end automation
- MLflow for experiment tracking
- Orchestration (Airflow, Dagster)
WHEN TO USE WHAT:
─────────────────
┌─────────────────────────────────────────────────────────────┐
│ SCENARIO │ RECOMMENDATION │
├─────────────────────────────────────────────────────────────┤
│ Quick baseline │ AutoGluon, medium_quality │
│ Production model │ AutoGluon, best_quality │
│ Real-time serving │ AutoGluon, lighter models + latency check │
│ Feature reuse │ Feast feature store │
│ Relational data │ Featuretools + AutoML │
│ Team collaboration │ Feature store + MLflow │
└─────────────────────────────────────────────────────────────┘

Production War Stories: AutoML and Feature Store Lessons


Illustrative fintech scenario.

A team can sometimes see a large offline-metric jump from an AutoML run, but that does not guarantee the features are valid for production decisions.

After deployment, a model can fail badly if it learned from leaked or otherwise invalid features.

The forensic analysis revealed the issue: a highly ranked feature depended on information that was not actually available at prediction time, which created data leakage.

The AutoML system had found a perfect predictor: a feature that leaked the outcome. It’s like trying to predict who will win a race by looking at the finish photo—technically accurate, but useless for making predictions before the race.

Financial impact: data leakage can cause severe financial and organizational damage before a team detects it.

The fix implemented:

# Before: feature computed with no check on when its inputs became known
days_until_first_payment = payment_date - approval_date
# After: Strict point-in-time feature validation
class DataLeakageError(Exception):
    """Raised when a feature value postdates the prediction it feeds."""
def validate_feature_timing(feature_name, feature_timestamp, prediction_timestamp):
    """Ensure feature was available BEFORE prediction was needed."""
    if feature_timestamp > prediction_timestamp:
        raise DataLeakageError(
            f"Feature '{feature_name}' has timestamp {feature_timestamp} "
            f"but prediction was needed at {prediction_timestamp}. "
            f"This is data leakage!"
        )

Lesson: AutoML will find ANY signal, including signals from the future. Point-in-time validation isn’t optional—it’s essential.

Did You Know? Data leakage is common in production ML, can take time to detect, and point-in-time-correct feature retrieval helps reduce that risk.


Illustrative retail scenario.

A high-traffic retail system can fail if feature computation becomes the latency bottleneck during peak demand.

This year, they’d implemented Feast as their feature store. The architecture was different:

LAST YEAR (crashed):
────────────────────
Request → Compute Features On-Demand → Model → Response
└── SQL query per request
└── 200ms latency
└── Can't scale past 500 RPS
THIS YEAR (Feast):
─────────────────
Pre-computed features → Redis (Online Store)
Request → Redis lookup (1ms) → Model → Response
└── 50,000+ RPS
└── 5ms total latency

With precomputed online features, a recommendation stack can handle much higher traffic with lower latency than on-demand feature computation.

The key insight: in this scenario, feature computation was the bottleneck, not model inference. Pre-computing features and serving them from an online store removed the per-request SQL query from the hot path.
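
The precompute-then-lookup pattern can be sketched with a dictionary standing in for the online store. The order rows and feature names are hypothetical, and a real deployment would write to a store such as Redis on a schedule rather than at import time.

```python
from collections import defaultdict

# Hypothetical raw events that an on-demand path would re-query on every request.
ORDERS = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 15.0},
]


def compute_customer_features(orders):
    """Batch job: aggregate raw orders into one feature row per customer."""
    totals = defaultdict(lambda: {"total_purchases": 0, "total_spend": 0.0})
    for order in orders:
        row = totals[order["customer_id"]]
        row["total_purchases"] += 1
        row["total_spend"] += order["amount"]
    return {
        customer_id: {
            "total_purchases": row["total_purchases"],
            "avg_order_value": row["total_spend"] / row["total_purchases"],
        }
        for customer_id, row in totals.items()
    }


# Stand-in for the online store: a key-value map refreshed by the batch job.
online_store = compute_customer_features(ORDERS)


def handle_request(customer_id):
    """Request path: a single key lookup, with no aggregation at request time."""
    return online_store.get(customer_id)  # None for unknown customers


print(handle_request(1))
```

The trade-off is freshness: features are only as current as the last batch run, so the refresh interval has to match how quickly the underlying data changes.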

Financial impact: improving feature serving can materially affect both revenue and operating cost, especially during peak traffic.


Illustrative insurance scenario.

The compliance team found that model decisions were uneven across location-based groups, raising a fairness concern.

Investigation suggested that the model had learned from proxy signals tied to historically biased outcomes, so the issue was not just accuracy but fairness.

The team’s response:

  1. Removed proxy features: location-based and other features that acted as strong proxies for protected characteristics
  2. Added fairness constraints: introduced explicit checks so model behavior would be reviewed against agreed fairness criteria across groups
  3. Implemented explainability: Required human review for any denial with unusual feature weights
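# Fairness-aware AutoML configuration (illustrative pseudocode)
# NOTE: the three fairness keys below are hypothetical. They are not real
# LightGBM or AutoGluon hyperparameters, so this block documents the intent
# rather than a working constraint. In practice, drop proxy columns before
# fit() and audit approval rates per group on held-out data.
from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(
    label='claim_approved',
    eval_metric='roc_auc'
).fit(
    train_data,
    # Add fairness constraint (hypothetical keys, see note above)
    hyperparameters={
        'GBM': {
            'constraint_type': 'demographic_parity',
            'fairness_target': 'race_proxy',
            'fairness_threshold': 0.05
        }
    }
)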

Regulatory outcome: addressing fairness issues early can reduce legal and regulatory risk.

Lesson: AutoML optimizes what you tell it to optimize. If you only optimize for accuracy, it will happily learn discriminatory patterns. Fairness must be an explicit constraint.
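
A fairness check of the kind described can be sketched as a demographic parity audit on held-out decisions. The decisions and group labels below are hypothetical, and the 0.05 threshold is taken from the configuration above as an example, not as a recommended standard.

```python
def approval_rates(decisions, groups):
    """Approval rate (share of 1s) for each group."""
    by_group = {}
    for decision, group in zip(decisions, groups):
        by_group.setdefault(group, []).append(decision)
    return {group: sum(values) / len(values) for group, values in by_group.items()}


def demographic_parity_difference(decisions, groups):
    """Gap between the highest and lowest group approval rates."""
    rates = approval_rates(decisions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical held-out model decisions (1 = approved) and location-based groups.
decisions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["north"] * 4 + ["south"] * 4

gap = demographic_parity_difference(decisions, groups)
FAIRNESS_THRESHOLD = 0.05  # example threshold only

print(approval_rates(decisions, groups))
print(f"parity gap: {gap:.2f}, needs review: {gap > FAIRNESS_THRESHOLD}")
```

Demographic parity is only one of several fairness criteria; which one applies is a policy decision that the team and its compliance function have to make before any metric is automated.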

Did You Know? Amazon scrapped an experimental AI recruiting tool after discovering it systematically downgraded women’s resumes, as Reuters reported in 2018. The model, trained on about 10 years of hiring data, had learned the pattern from a historically male-dominated applicant pool. The case is widely cited as a motivation for fairness-aware ML.


Mistake 1: Trusting AutoML Feature Importance Blindly


Wrong:

# AutoML found these are the most important features
# Great, let's use them!
top_features = model.feature_importance()[:10]
production_model = train_on(data[top_features]) # Dangerous!

Problem: Feature importance from AutoML can be misleading. A leaky feature will show high importance. A feature that’s important for one model type might be useless for another.

Right:

class LeakageWarning(Exception):
    """Raised when a feature looks like it leaks the target."""
def validate_feature_importance(feature_name, importance_score, data):
    """Sanity check for suspiciously important features.
    Assumes `data` is a pandas DataFrame with a numeric 'target' column and
    that is_available_at_prediction_time() is a project-specific helper.
    """
    # Check for data leakage
    if importance_score > 0.3:  # Suspiciously high (heuristic threshold)
        print(f"WARNING: {feature_name} has importance {importance_score}")
        print("Checking for potential leakage...")
        # Check whether the feature is close to a copy of the target
        correlation_with_target = data[feature_name].corr(data['target'])
        if abs(correlation_with_target) > 0.8:
            raise LeakageWarning(
                f"{feature_name} has {correlation_with_target:.2f} correlation "
                f"with target. Likely data leakage!"
            )
    # Check if feature is available at prediction time
    if not is_available_at_prediction_time(feature_name):
        raise LeakageWarning(
            f"{feature_name} is not available at prediction time!"
        )
    return True

Mistake 2: Not Setting Time Limits on AutoML


Wrong:

# "Just let it run until it's done"
predictor = TabularPredictor(label='target').fit(train_data)
# 3 days later: still running, $2,000 in cloud costs

Problem: AutoML can keep exploring additional models and configurations unless you bound the run. Without explicit time limits, training can take longer and cost more than expected.

Right:

# Always set explicit time limits
predictor = TabularPredictor(
    label='target',
    eval_metric='roc_auc'
).fit(
    train_data,
    time_limit=3600,  # 1 hour max
    presets='best_quality',  # Will do its best within time limit
    # AutoGluon automatically prioritizes promising models
)
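
How a time limit bounds a search can be sketched with a wall-clock deadline around a trial loop. This is a minimal sketch, not how AutoGluon schedules its models internally; `sample` and `evaluate` are hypothetical stand-ins for proposing and training a configuration.

```python
import random
import time


def budgeted_random_search(sample, evaluate, time_limit_s):
    """Run trials until the wall-clock budget is spent; return the best result seen."""
    deadline = time.monotonic() + time_limit_s
    best_config, best_score, trials = None, float("-inf"), 0
    # Always run at least one trial, then stop once the deadline has passed.
    while trials == 0 or time.monotonic() < deadline:
        config = sample()
        score = evaluate(config)
        trials += 1
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score, trials


rng = random.Random(0)


def sample():
    """Hypothetical search space: a single learning rate."""
    return {"learning_rate": rng.uniform(0.001, 0.3)}


def evaluate(config):
    """Hypothetical objective that peaks at learning_rate = 0.1."""
    return -abs(config["learning_rate"] - 0.1)


best_config, best_score, trials = budgeted_random_search(sample, evaluate, time_limit_s=0.05)
print(f"{trials} trials, best learning_rate = {best_config['learning_rate']:.4f}")
```

The deadline is only checked between trials, so a single slow trial can still overrun the budget. Real systems usually also pass the remaining time into each training job.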

Mistake 3: Computing Features Differently for Training and Serving

Wrong:

# Training time
training_features = compute_features_from_warehouse(training_data)
model.fit(training_features, labels)
# Serving time (different code path!)
serving_features = compute_features_from_api(request_data) # Different!
prediction = model.predict(serving_features)

Problem: Subtle differences in feature computation between training and serving cause silent model degradation. Think of it as using different thermometers that are calibrated differently—your predictions will be systematically off.

Right:

# Use feature store for BOTH training and serving
from feast import FeatureStore
store = FeatureStore(repo_path=".")
# Training time
training_features = store.get_historical_features(
    entity_df=training_entities,
    features=['customer_stats:total_purchases', 'customer_stats:avg_order_value']
).to_df()
# Serving time (SAME feature definitions!)
serving_features = store.get_online_features(
    features=['customer_stats:total_purchases', 'customer_stats:avg_order_value'],
    entity_rows=[{"customer_id": request.customer_id}]
).to_dict()
# Both paths read the same registered feature definitions, which removes
# the duplicated transformation code that usually causes skew
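
The underlying principle can also be shown without a feature store: keep one transformation function and call it from both code paths. The feature names and input fields below are hypothetical.

```python
import math


def customer_features(total_purchases, total_spend):
    """Single source of truth for the transformation, shared by both paths."""
    avg_order_value = total_spend / total_purchases if total_purchases else 0.0
    return {
        "total_purchases": total_purchases,
        "avg_order_value": avg_order_value,
        "log_total_spend": math.log1p(total_spend),
    }


def build_training_rows(warehouse_rows):
    """Training path: apply the shared transformation to historical rows."""
    return [
        customer_features(row["total_purchases"], row["total_spend"])
        for row in warehouse_rows
    ]


def build_serving_row(request):
    """Serving path: apply the same transformation to one live request."""
    return customer_features(request["total_purchases"], request["total_spend"])


warehouse_rows = [{"total_purchases": 4, "total_spend": 100.0}]
request = {"total_purchases": 4, "total_spend": 100.0}

training_row = build_training_rows(warehouse_rows)[0]
serving_row = build_serving_row(request)
print(training_row == serving_row)
```

A comparison like the last line makes a useful regression test: feed identical raw inputs through both paths and fail the build if the outputs differ.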

Mistake 4: Changing Feature Definitions Without Versioning

Wrong:

# Features.py - Modified directly in production
customer_value = total_purchases * avg_order_value # Changed from sum to product
# Now training data has old definition, production has new...

Problem: Changing feature definitions without versioning creates chaos. Models trained on v1 features serving with v2 features will produce garbage.

Right:

# features_v2.py - Explicit versioning
# NOTE: schematic pseudocode, not Feast's FeatureView API. The attributes
# version, deprecates, requires_retraining and the compute() method are
# illustrative names for the metadata worth recording with each feature.
class CustomerValueV2(FeatureView):
    """
    Customer lifetime value calculation.
    v1 -> v2 changes:
    - Changed from sum to product formula
    - Added recency weighting
    - Breaking change: requires model retraining
    Migration: Models must be retrained before using v2
    """
    name = "customer_value_v2"
    version = "2.0.0"
    deprecates = "customer_value_v1"  # Mark old version
    requires_retraining = True
    def compute(self, data):
        return (data.total_purchases * data.avg_order_value *
                self.recency_weight(data.days_since_last_order))
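
One way to enforce the versioning idea is to record, next to each model, which feature versions it was trained on and refuse to serve anything else. The registry, model metadata, and exception below are all hypothetical; the v1 and v2 formulas mirror the sum-to-product change described above, without the recency weighting.

```python
class FeatureVersionMismatch(Exception):
    """Raised when serving features differ from the versions a model was trained on."""


# Hypothetical registry: (feature name, version) -> computation.
FEATURE_REGISTRY = {
    ("customer_value", "1.0.0"): lambda row: row["total_purchases"] + row["avg_order_value"],
    ("customer_value", "2.0.0"): lambda row: row["total_purchases"] * row["avg_order_value"],
}


def compute_for_model(model_metadata, feature_name, serving_version, row):
    """Compute a feature only if its version matches what the model was trained on."""
    trained_version = model_metadata["feature_versions"][feature_name]
    if trained_version != serving_version:
        raise FeatureVersionMismatch(
            f"{model_metadata['name']} was trained on {feature_name} "
            f"{trained_version} but serving requested {serving_version}"
        )
    return FEATURE_REGISTRY[(feature_name, serving_version)](row)


model_v1 = {"name": "churn_model", "feature_versions": {"customer_value": "1.0.0"}}
row = {"total_purchases": 3, "avg_order_value": 20.0}

print(compute_for_model(model_v1, "customer_value", "1.0.0", row))
```

Failing loudly on a mismatch turns silent model degradation into an error that shows up at deploy time.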

Mistake 5: Using AutoML for Every Data Type

Wrong:

# "AutoML is magic, let's use it everywhere!"
image_model = AutoGluon.fit(image_data) # Works but suboptimal
text_model = AutoGluon.fit(text_data) # Works but suboptimal
time_series = AutoGluon.fit(ts_data) # Works but suboptimal

Problem: AutoML is often strongest on tabular data, while other modalities frequently benefit from more specialized methods.

Right:

WHEN TO USE AUTOML vs SPECIALIZED TOOLS:
────────────────────────────────────────
Data Type    │ AutoML Good? │ Better Alternative
─────────────┼──────────────┼────────────────────────────────
Tabular      │ Excellent    │ Usually none; AutoML is a strong default
Images       │ OK           │ Transfer learning (ResNet, ViT)
Text         │ OK           │ Fine-tuned LLMs, BERT
Time Series  │ OK           │ Prophet, NeuralProphet, DeepAR
Graph        │ Poor         │ PyTorch Geometric, DGL
Audio        │ Poor         │ Whisper, wav2vec

Scenario                        │ Manual ML                                 │ AutoML                                     │ Savings
────────────────────────────────┼───────────────────────────────────────────┼────────────────────────────────────────────┼─────────────────────────
Initial Model Development
Data scientist time             │ Multiple weeks for a manual first model   │ About a week for a bounded AutoML baseline │ Meaningful time savings
Compute costs                   │ Higher from longer manual experimentation │ Lower for a short bounded AutoML run       │ Modest compute savings
Time to production              │ Several weeks                             │ Roughly days to a couple of weeks          │ Faster delivery
Ongoing Maintenance
Monthly retraining time         │ Days of manual work                       │ Hours for a more automated workflow        │ Lower recurring effort
Model iteration cost            │ Higher per iteration                      │ Lower per iteration                        │ Lower iteration cost
Annual Savings
Initial + 12 months maintenance │ Higher manual cost profile                │ Lower with more automation                 │ Potential annual savings

Metric                          │ Without Feature Store                       │ With Feature Store
────────────────────────────────┼─────────────────────────────────────────────┼───────────────────────────────────────────────
Feature development time        │ Often measured in weeks                     │ Often faster with reuse
Feature duplication             │ Multiple teams may rebuild similar features │ Shared definitions reduce duplication
Training-serving skew incidents │ Can happen repeatedly                       │ Can be reduced with shared feature definitions
Revenue lost to skew            │ Can be material                             │ Can be reduced when skew is reduced
Engineer time on debugging      │ Significant                                 │ Lower with shared infrastructure
Total Annual Impact             │                                             │ Potential operational savings
REAL COSTS OF MANUAL ML:
────────────────────────
┌────────────────────────────────────────────────────────────┐
│ Problem │ Typical Cost │
├────────────────────────────────────────────────────────────┤
│ Data leakage in production │ $1M-10M (depending on │
│ │ time to detect) │
├────────────────────────────────────────────────────────────┤
│ Training-serving skew │ $100K-500K per incident │
│ (silent model degradation) │ (lost revenue + debug) │
├────────────────────────────────────────────────────────────┤
│ Duplicate feature work │ $50K-200K/year │
│ (multiple teams building │ (wasted engineer time) │
│ same features) │ │
├────────────────────────────────────────────────────────────┤
│ Slow model iteration │ $500K-2M/year │
│ (opportunity cost of delayed │ (competitors move │
│ improvements) │ faster) │
└────────────────────────────────────────────────────────────┘

Did You Know? Teams often adopt feature stores to speed up model delivery and reduce operational rework, but the exact ROI depends heavily on the organization and workload.


Interview Preparation: AutoML & Feature Stores

Q1: “When would you use AutoML vs. hand-crafted models?”

Strong Answer: “I use AutoML in three main scenarios. First, for establishing baselines quickly—before investing weeks in manual model development, I run AutoML to understand what’s achievable. If AutoML gets 0.75 AUC, I know my hand-crafted model should aim for at least 0.78 to justify the extra effort.

Second, for tabular data problems with clear evaluation metrics—this is a strong use case for AutoML, which can often provide a competitive baseline with far less manual effort.

Third, when the ML isn’t the core differentiator—if we’re building a feature where ML is a small component, I’d rather spend engineering time on the product, not model tuning.

I wouldn’t use AutoML when I need specific architectures like transformers for NLP, when there are strict latency requirements that need optimized single models, or when interpretability is critical and I need to explain every decision.”

Q2: “Explain point-in-time correctness in feature stores.”

Strong Answer: “Point-in-time correctness ensures that when training a model, we only use feature values that were available at the time the prediction would have been made. It prevents data leakage from the future.

Imagine training a model to predict customer churn. If a customer churned on March 15, their training features should reflect their state on March 14 or earlier—not their state after they churned. Without point-in-time correctness, we might accidentally include features like ‘days_since_last_login’ that jumped to 30+ after they stopped using the product—a clear signal they’ve churned that wouldn’t be available when making a real prediction.

Feature stores implement this by maintaining timestamped feature values and performing point-in-time joins. When you request historical features for training, you provide entity timestamps, and the feature store returns the most recent feature value that existed at or before each timestamp.

This is critical because models trained with leaked features show fantastic offline metrics but fail dramatically in production—a pattern I’ve seen called ‘suspiciously good AUC syndrome.’”
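The point-in-time join described in this answer can be sketched with pandas `merge_asof`. This is a minimal illustration rather than the API of any particular feature store; the column names and sample values are assumptions made up for the example.

```python
import pandas as pd

# Timestamped feature values (illustrative data, not from a real system)
features = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "event_timestamp": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-03-01", "2024-01-15"]
    ),
    "days_since_last_login": [2, 5, 31, 1],
})

# Training examples: one row per (entity, prediction time)
labels = pd.DataFrame({
    "customer_id": [1, 2],
    "event_timestamp": pd.to_datetime(["2024-02-15", "2024-01-10"]),
    "churned": [1, 0],
})

# merge_asof requires both frames to be sorted by the time key.
# direction="backward" picks the latest feature row at or before each label time.
training = pd.merge_asof(
    labels.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="customer_id",
    direction="backward",
)
print(training)
```

Customer 1 receives the February 1 value rather than the later March 1 value, and customer 2 receives a null because no feature row existed yet at the prediction time. A null is the honest answer here; back-filling it from a later row would reintroduce the leak.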

Q3: “How does multi-layer stacking work in AutoGluon?”

Strong Answer: “Multi-layer stacking is AutoGluon’s ensemble technique, and it often outperforms simple averaging or voting.

In the first layer, AutoGluon trains diverse base models—gradient boosting (LightGBM, XGBoost, CatBoost), neural networks, and linear models. Each model makes predictions on out-of-fold validation data to avoid leakage.

These first-layer predictions become features for the second layer, which trains new models to combine them. Think of it as learning ‘when to trust which model.’ If LightGBM is great on numerical features but weak on categoricals, while CatBoost is the opposite, the second layer learns to weight them appropriately based on the input.

AutoGluon can stack multiple layers, but the useful depth depends on the dataset, models, and time budget.

The key insight is that this can outperform simple ensembling because it learns non-linear combinations of model predictions. A weighted average says ‘LightGBM gets 40% weight.’ Multi-layer stacking says ‘LightGBM gets 80% weight when feature X is high, but only 20% when feature Y is low.’

In practice, stacked ensembles can outperform simpler combinations, but the size of that gain depends on the task, dataset, and evaluation setup.”
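The two-layer idea in this answer can be sketched with scikit-learn. This is a simplified illustration under stated assumptions, not AutoGluon's internal implementation: the synthetic dataset, the three base models, and the logistic-regression stacker are all choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic data stands in for a real tabular problem
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [
    GradientBoostingClassifier(random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]

# Layer 1: out-of-fold probabilities, so each training row is scored by a
# model that never saw that row during fitting (avoids leaking the label).
oof = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Layer 2: the stacker sees the original features plus the layer-1 predictions,
# which lets it weight base models differently in different input regions.
stacker = LogisticRegression(max_iter=1000)
stacker.fit(np.hstack([X_train, oof]), y_train)

# For inference, refit each base model on all training rows, then stack.
test_preds = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models
])
stacked_proba = stacker.predict_proba(np.hstack([X_test, test_preds]))[:, 1]
print(stacked_proba[:5])
```

The out-of-fold step is the part that matters most: fitting the stacker on in-sample base predictions would teach it to trust whichever base model overfits hardest.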

Q4: “What’s the difference between online and offline feature stores?”

Strong Answer: “They serve different use cases with different latency and storage requirements.

The offline store is optimized for training workloads. It stores historical feature values with timestamps, typically in data warehouses like BigQuery, Snowflake, or object storage like S3. Latency is seconds to minutes, but it can handle huge volumes—millions of feature vectors. I use it when creating training datasets with point-in-time correctness.

The online store is optimized for inference. It stores only the latest feature values in low-latency databases like Redis, DynamoDB, or Bigtable. Latency is single-digit milliseconds. I use it when serving predictions in real-time.

The key architectural insight is that they share the same feature definitions but different storage backends. The feature store materializes features from the offline store to the online store periodically—typically every few minutes to daily, depending on freshness requirements.

This dual architecture solves the training-serving skew problem. My training code uses get_historical_features from the offline store. My serving code uses get_online_features from the online store. Both use identical feature definitions, just different storage optimized for their use case.”
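The materialization step mentioned in this answer can be sketched as follows. It is a toy version under stated assumptions: the `materialize` helper is hypothetical, a plain dict stands in for Redis or DynamoDB, and the data is invented.

```python
import pandas as pd

# Offline store: full timestamped history (illustrative data)
offline = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "event_timestamp": pd.to_datetime(
        ["2024-01-01", "2024-03-01", "2024-02-15", "2024-01-15"]
    ),
    "total_purchases": [10, 20, 8, 5],
})


def materialize(offline_df, entity_col, timestamp_col):
    """Return {entity_id: latest feature row} for low-latency lookups."""
    latest = (
        offline_df.sort_values(timestamp_col)
        .groupby(entity_col)
        .tail(1)  # newest row per entity
    )
    return latest.set_index(entity_col).to_dict(orient="index")


# The online store keeps one row per entity; history stays offline
online = materialize(offline, "customer_id", "event_timestamp")
print(online[1]["total_purchases"], online[2]["total_purchases"])
```

Because the online view is derived from the offline table rather than computed separately, both stores reflect the same feature definition, which is the property the answer relies on.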

System Design: Design an ML Platform with AutoML and Feature Store

Prompt: “Design an ML platform for a fintech company that needs to make real-time credit decisions at 10,000 requests per second.”

Strong Answer:

“I’d design this with five key components:

1. Feature Store Architecture:

Online Store: Redis Cluster (6 nodes)
  - Sharded by customer_id
  - 500K features, 1ms p99 latency
  - TTL: 24 hours, refreshed hourly
Offline Store: BigQuery
  - Historical features for training
  - 2 years of feature history
  - Point-in-time query support
Feature Computation: Spark on Dataproc
  - Batch features: daily runs at 2 AM
  - Real-time features: Kafka Streams
  - Features: credit_score, payment_history,
    debt_to_income, account_age, etc.

2. AutoML Pipeline:

# Monthly retraining pipeline
def train_credit_model():
    # Get training data with point-in-time correctness
    training_data = feature_store.get_historical_features(
        entity_df=approved_applications_last_6_months,
        features=CREDIT_FEATURES,
        label='defaulted_within_90_days'
    )
    # Time-bounded AutoML run (fairness is audited after training, below)
    predictor = TabularPredictor(
        label='defaulted_within_90_days',  # Must match the label column above
        eval_metric='roc_auc'
    ).fit(
        training_data,
        time_limit=14400,  # 4 hours
        presets='optimize_for_deployment',  # Drops unused models and training artifacts
        # Skip neural nets for latency (model keys vary by AutoGluon version)
        excluded_model_types=['NN_TORCH', 'FASTAI']
    )
    # Validate fairness before promotion
    if passes_fairness_audit(predictor):
        deploy_to_production(predictor)

3. Serving Architecture for 10K RPS:

Load Balancer (GCP GLB)
├── Kubernetes Cluster (GKE)
│   └── Model Serving Pods (50 replicas)
│       - CPU-optimized (LightGBM)
│       - 10ms p99 latency per request
│       - gRPC for low overhead
└── Feature Store (Redis)
    - 1ms feature fetch
    - Pre-computed features only

4. Monitoring & Safety:

  • Feature drift detection: alert if distributions shift >10%
  • Model performance monitoring: daily AUC calculation on holdout
  • Fairness monitoring: automated demographic parity checks
  • Circuit breaker: fall back to rules-based model if ML fails
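One way to make the feature drift alert concrete is the Population Stability Index (PSI). The sketch below is an assumption about how such a check could look, not part of the design itself: the function name, the quantile binning, and the 0.1 alert threshold (a commonly quoted rule of thumb, used here as a stand-in for the ">10%" rule) are all illustrative.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10):
    """PSI between a reference sample and a current sample of one feature."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges from reference quantiles; the outer bins are open-ended
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    ref_counts = np.bincount(
        np.searchsorted(edges, reference, side="right"), minlength=n_bins
    )
    cur_counts = np.bincount(
        np.searchsorted(edges, current, side="right"), minlength=n_bins
    )
    # A small floor avoids log(0) when a bin is empty
    ref_pct = np.clip(ref_counts / ref_counts.sum(), 1e-6, None)
    cur_pct = np.clip(cur_counts / cur_counts.sum(), 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


rng = np.random.default_rng(0)
training_sample = rng.normal(0.0, 1.0, 10_000)  # distribution seen at training time
same_dist = rng.normal(0.0, 1.0, 10_000)        # serving data, no drift
shifted = rng.normal(1.0, 1.0, 10_000)          # serving data, mean shifted

ALERT_THRESHOLD = 0.1  # assumed cutoff; tune per feature
psi_same = population_stability_index(training_sample, same_dist)
psi_shifted = population_stability_index(training_sample, shifted)
print(f"no drift: {psi_same:.4f}, alert={psi_same > ALERT_THRESHOLD}")
print(f"shifted:  {psi_shifted:.4f}, alert={psi_shifted > ALERT_THRESHOLD}")
```

Binning on reference quantiles keeps every bin populated in the training sample, so the score reacts to movement in the serving data rather than to sparse bins.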

5. Cost Estimate:

  • Feature Store (Redis): $15K/month
  • Compute (GKE): $25K/month
  • BigQuery: $5K/month
  • Total: $45K/month = $540K/year

Expected value: At 10K RPS, serving 864M decisions/day. Even 0.1% improvement in precision saves millions in bad debt.

This architecture handles 10K RPS with 15ms end-to-end latency while maintaining feature consistency and enabling rapid model iteration through AutoML.”
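The throughput and cost figures in this answer can be checked with a few lines of arithmetic. The inputs come from the design above; the 30-day month and the assumption of sustained peak traffic are simplifications added for the example, so the per-decision cost is a lower bound.

```python
# Figures taken from the design above
requests_per_second = 10_000
serving_replicas = 50
monthly_costs_usd = {"redis": 15_000, "gke": 25_000, "bigquery": 5_000}

decisions_per_day = requests_per_second * 60 * 60 * 24
rps_per_replica = requests_per_second / serving_replicas
monthly_total = sum(monthly_costs_usd.values())
annual_total = monthly_total * 12

# Assumes a 30-day month at sustained peak traffic (upper bound on volume)
decisions_per_month = decisions_per_day * 30
cost_per_million = monthly_total / (decisions_per_month / 1_000_000)

print(f"Decisions/day:       {decisions_per_day:,}")
print(f"RPS per replica:     {rps_per_replica:.0f}")
print(f"Monthly cost:        ${monthly_total:,}")
print(f"Annual cost:         ${annual_total:,}")
print(f"Cost per 1M decisions: ${cost_per_million:.2f}")
```

Each replica must sustain about 200 requests per second, which is a useful number to load-test against before committing to the 50-replica figure.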


Build an AutoML baseline and compare it to a hand-crafted model:

"""
AutoML vs Manual Model Comparison
Dataset: Credit Card Fraud Detection (Kaggle)
Goal: Compare development time and accuracy
"""
import pandas as pd
from autogluon.tabular import TabularPredictor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import time
# Load data
data = pd.read_csv('creditcard.csv')
X = data.drop('Class', axis=1)
y = data['Class']
# Stratify so the rare fraud class keeps the same ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# MANUAL APPROACH
print("=" * 50)
print("MANUAL RANDOM FOREST")
print("=" * 50)
manual_start = time.time()
rf = RandomForestClassifier(
n_estimators=100,
max_depth=10,
min_samples_leaf=5,
random_state=42,
n_jobs=-1
)
rf.fit(X_train, y_train)
rf_predictions = rf.predict_proba(X_test)[:, 1]
rf_auc = roc_auc_score(y_test, rf_predictions)
manual_time = time.time() - manual_start
print(f"Time: {manual_time:.1f}s")
print(f"AUC: {rf_auc:.4f}")
# AUTOML APPROACH
print("\n" + "=" * 50)
print("AUTOGLUON")
print("=" * 50)
automl_start = time.time()
# Prepare data for AutoGluon
train_df = X_train.copy()
train_df['Class'] = y_train.values
predictor = TabularPredictor(
label='Class',
eval_metric='roc_auc',
verbosity=1
).fit(
train_df,
time_limit=300, # 5 minutes
presets='medium_quality'
)
test_df = X_test.copy()
automl_predictions = predictor.predict_proba(test_df)
automl_auc = roc_auc_score(y_test, automl_predictions.iloc[:, 1])
automl_time = time.time() - automl_start
print(f"Time: {automl_time:.1f}s")
print(f"AUC: {automl_auc:.4f}")
# COMPARISON
print("\n" + "=" * 50)
print("COMPARISON")
print("=" * 50)
print(f"Manual RF: {rf_auc:.4f} AUC in {manual_time:.1f}s")
print(f"AutoGluon: {automl_auc:.4f} AUC in {automl_time:.1f}s")
print(f"Improvement: {(automl_auc - rf_auc)*100:.2f} percentage points")
# See what AutoGluon tried
print("\n" + "=" * 50)
print("AUTOGLUON LEADERBOARD")
print("=" * 50)
print(predictor.leaderboard())

Expected Learning: AutoML can outperform a basic manually specified baseline, showing the value of automated model selection and ensembling.


Build a simple feature store with point-in-time correctness:

"""
Simple Feature Store Implementation
This exercise teaches the core concepts of feature stores:
- Point-in-time correctness
- Online vs offline serving
- Feature versioning
"""
from datetime import datetime
from typing import Dict, List

import pandas as pd


class SimpleFeatureStore:
    """
    A minimal feature store demonstrating core concepts.
    Production feature stores like Feast add:
    - Distributed storage
    - Automatic materialization
    - Schema validation
    - Access control
    """

    def __init__(self):
        # Offline store: Historical features with timestamps
        self.offline_store: Dict[str, pd.DataFrame] = {}
        # Online store: Latest features only (simulated with a dict;
        # production systems typically use a low-latency store such as Redis)
        self.online_store: Dict[str, Dict] = {}
        # Feature registry: Metadata about features
        self.registry: Dict[str, dict] = {}

    def register_feature(
        self,
        name: str,
        entity: str,
        description: str,
        version: str = "1.0.0"
    ):
        """Register a feature in the registry."""
        self.registry[name] = {
            'entity': entity,
            'description': description,
            'version': version,
            'created_at': datetime.now().isoformat()
        }
        print(f"Registered feature: {name} v{version}")

    def write_features(
        self,
        feature_name: str,
        data: pd.DataFrame,
        timestamp_col: str = 'event_timestamp'
    ):
        """Write features to both offline and online stores."""
        if feature_name not in self.registry:
            raise ValueError(f"Feature {feature_name} not registered!")
        # Write to offline store (full history)
        if feature_name not in self.offline_store:
            self.offline_store[feature_name] = data.copy()
        else:
            self.offline_store[feature_name] = pd.concat([
                self.offline_store[feature_name],
                data
            ]).drop_duplicates()
        # Write to online store (latest values only).
        # Derive the newest row per entity from the full history so that
        # out-of-order or late-arriving writes never overwrite fresher values.
        entity_col = self.registry[feature_name]['entity']
        latest = (
            self.offline_store[feature_name]
            .sort_values(timestamp_col)
            .groupby(entity_col)
            .tail(1)
        )
        for record in latest.to_dict(orient='records'):
            entity_id = str(record[entity_col])
            self.online_store[f"{feature_name}:{entity_id}"] = record
        print(f"Wrote {len(data)} rows to {feature_name}")

    def get_historical_features(
        self,
        feature_name: str,
        entity_df: pd.DataFrame,
        timestamp_col: str = 'event_timestamp'
    ) -> pd.DataFrame:
        """
        Get historical features with point-in-time correctness.
        This is the CRITICAL function that prevents data leakage.
        """
        if feature_name not in self.offline_store:
            raise ValueError(f"Feature {feature_name} not in offline store!")
        feature_data = self.offline_store[feature_name]
        entity_col = self.registry[feature_name]['entity']
        results = []
        for _, entity_row in entity_df.iterrows():
            entity_id = entity_row[entity_col]
            query_time = entity_row[timestamp_col]
            # Point-in-time filter: only use features from AT OR BEFORE query time
            valid_features = feature_data[
                (feature_data[entity_col] == entity_id) &
                (feature_data[timestamp_col] <= query_time)
            ]
            if len(valid_features) > 0:
                # Get the most recent valid feature
                latest = valid_features.sort_values(timestamp_col).iloc[-1]
                results.append(latest.to_dict())
            else:
                # No valid features - use nulls
                results.append({entity_col: entity_id})
        return pd.DataFrame(results)

    def get_online_features(
        self,
        feature_name: str,
        entity_ids: List[str]
    ) -> List[Dict]:
        """Get latest features for real-time inference."""
        results = []
        for entity_id in entity_ids:
            key = f"{feature_name}:{entity_id}"
            if key in self.online_store:
                results.append(self.online_store[key])
            else:
                results.append({'error': 'not_found'})
        return results


# Demo usage
if __name__ == "__main__":
    store = SimpleFeatureStore()
    # Register features
    store.register_feature(
        name='customer_stats',
        entity='customer_id',
        description='Aggregated customer purchase statistics'
    )
    # Create some historical feature data
    feature_data = pd.DataFrame({
        'customer_id': [1, 1, 1, 2, 2],
        'total_purchases': [10, 15, 20, 5, 8],
        'avg_order_value': [50.0, 52.0, 55.0, 30.0, 35.0],
        'event_timestamp': pd.to_datetime([
            '2024-01-01', '2024-02-01', '2024-03-01',
            '2024-01-15', '2024-02-15'
        ])
    })
    store.write_features('customer_stats', feature_data)
    # Point-in-time retrieval for training
    training_entities = pd.DataFrame({
        'customer_id': [1, 1, 2],
        'event_timestamp': pd.to_datetime([
            '2024-01-15',  # Should get Jan 1 features
            '2024-02-15',  # Should get Feb 1 features
            '2024-02-01'   # Should get Jan 15 features
        ])
    })
    historical = store.get_historical_features(
        'customer_stats',
        training_entities
    )
    print("\nHistorical Features (point-in-time correct):")
    print(historical)
    # Online retrieval for inference
    online = store.get_online_features('customer_stats', ['1', '2'])
    print("\nOnline Features (latest values):")
    for f in online:
        print(f)

Expected Learning: Understanding how point-in-time correctness prevents data leakage and why online/offline stores serve different purposes.


Build a tool to detect potential data leakage in AutoML results:

"""
Data Leakage Detection Tool
Detects common patterns that indicate data leakage in AutoML models.
"""
from typing import List, Optional, Tuple

import numpy as np
import pandas as pd


class LeakageDetector:
    """Detects potential data leakage in ML features."""

    def __init__(self, model, X: pd.DataFrame, y: pd.Series):
        self.model = model
        self.X = X
        self.y = y
        self.warnings = []

    def check_feature_importance(
        self,
        importance_threshold: float = 0.3
    ) -> List[Tuple[str, float, str]]:
        """
        Flag features with suspiciously high importance.
        Leaky features often have unusually high importance because
        they directly encode the target.
        """
        try:
            # Expects a mapping of feature name -> importance score
            importances = self.model.feature_importance()
        except AttributeError:
            # For models without built-in feature importance
            return []
        suspicious = []
        for feature, importance in importances.items():
            if importance > importance_threshold:
                suspicious.append((
                    feature,
                    importance,
                    f"Unusually high importance ({importance:.3f}). "
                    f"Check for data leakage!"
                ))
                self.warnings.append(f"HIGH_IMPORTANCE: {feature}")
        return suspicious

    def check_correlation_with_target(
        self,
        correlation_threshold: float = 0.9
    ) -> List[Tuple[str, float, str]]:
        """
        Flag features with very high correlation to target.
        Perfect or near-perfect correlation often indicates leakage.
        """
        suspicious = []
        for col in self.X.columns:
            # Any numeric dtype, not only int64/float64
            if pd.api.types.is_numeric_dtype(self.X[col]):
                corr = abs(self.X[col].corr(self.y))
                if corr > correlation_threshold:
                    suspicious.append((
                        col,
                        corr,
                        f"Very high correlation ({corr:.3f}). "
                        f"Likely data leakage!"
                    ))
                    self.warnings.append(f"HIGH_CORRELATION: {col}")
        return suspicious

    def check_perfect_prediction_features(self, min_samples: int = 50) -> List[str]:
        """
        Flag low-cardinality features where one value maps to a single target class.
        A value that appears often yet always carries the same label is usually
        leakage. min_samples guards against small groups that are pure by chance.
        """
        suspicious = []
        for col in self.X.columns:
            if self.X[col].nunique() < 100:  # Categorical-ish
                # Check if any value perfectly predicts target
                for value in self.X[col].unique():
                    mask = self.X[col] == value
                    if mask.sum() >= min_samples:  # Enough samples
                        target_values = self.y[mask].unique()
                        if len(target_values) == 1:
                            suspicious.append(col)
                            self.warnings.append(
                                f"PERFECT_PREDICTOR: {col}={value} -> {target_values[0]}"
                            )
                            break
        return suspicious

    def check_temporal_leakage(
        self,
        date_columns: List[str],
        target_date_column: Optional[str] = None
    ) -> List[str]:
        """
        Flag features that might be computed from future data.
        """
        suspicious = []
        # Check if any features have dates AFTER the target date
        if target_date_column and target_date_column in self.X.columns:
            target_date = pd.to_datetime(self.X[target_date_column])
            for col in date_columns:
                if col in self.X.columns and col != target_date_column:
                    feature_date = pd.to_datetime(self.X[col])
                    future_rows = (feature_date > target_date).sum()
                    if future_rows > 0:
                        suspicious.append(col)
                        self.warnings.append(
                            f"TEMPORAL_LEAKAGE: {col} has {future_rows} "
                            f"rows with dates after target"
                        )
        return suspicious

    def generate_report(self) -> str:
        """Generate a comprehensive leakage report."""
        self.warnings = []  # Reset so repeated calls do not double-count
        report = []
        report.append("=" * 60)
        report.append("DATA LEAKAGE DETECTION REPORT")
        report.append("=" * 60)
        # Run all checks
        importance_issues = self.check_feature_importance()
        correlation_issues = self.check_correlation_with_target()
        perfect_predictors = self.check_perfect_prediction_features()
        # Format report
        if importance_issues:
            report.append("\nHIGH IMPORTANCE FEATURES:")
            for feature, importance, msg in importance_issues:
                report.append(f"  - {feature}: {msg}")
        if correlation_issues:
            report.append("\nHIGH CORRELATION FEATURES:")
            for feature, corr, msg in correlation_issues:
                report.append(f"  - {feature}: {msg}")
        if perfect_predictors:
            report.append("\nPERFECT PREDICTOR FEATURES:")
            for feature in perfect_predictors:
                report.append(f"  - {feature}")
        if not any([importance_issues, correlation_issues, perfect_predictors]):
            report.append("\nNo obvious leakage detected")
            report.append("(Note: Some leakage types require domain knowledge to detect)")
        report.append("\n" + "=" * 60)
        report.append(f"Total warnings: {len(self.warnings)}")
        return "\n".join(report)


# Usage example
if __name__ == "__main__":
    # Create sample data with intentional leakage
    np.random.seed(42)
    n = 1000
    # Normal features
    X = pd.DataFrame({
        'age': np.random.randint(18, 80, n),
        'income': np.random.normal(50000, 20000, n),
        'credit_score': np.random.randint(300, 850, n),
    })
    # Target: will customer default?
    y = pd.Series(np.random.binomial(1, 0.2, n))
    # Add LEAKY feature (computed from outcome!)
    # This simulates a feature that includes future information
    X['days_until_default'] = np.where(y == 1, np.random.randint(1, 90, n), -1)
    # This feature is suspicious - high correlation
    X['default_indicator'] = y * 0.99 + np.random.normal(0, 0.01, n)

    # Create mock model
    class MockModel:
        def feature_importance(self):
            return {
                'age': 0.05,
                'income': 0.08,
                'credit_score': 0.12,
                'days_until_default': 0.45,  # Suspiciously high!
                'default_indicator': 0.30
            }

    # Run detection
    detector = LeakageDetector(MockModel(), X, y)
    print(detector.generate_report())

Expected Learning: Understanding common leakage patterns and how to systematically detect them before they cause production failures.


  1. AutoML is a force multiplier, not a replacement — It doesn’t replace ML engineers; it amplifies their productivity by automating tedious parts (algorithm selection, hyperparameter tuning) so they can focus on harder problems.

  2. AutoGluon is a strong tabular baseline — For structured data problems, its stacking approach often performs well and is worth testing before investing in heavier manual tuning.

  3. Feature stores help reduce training-serving skew — By using the same feature definitions for training and serving, they reduce a major source of production ML failures.

  4. Point-in-time correctness is essential — Future information can leak into ML training data, and timestamp-aware joins help reduce that risk.

  5. Feature reuse compounds over time — Every feature you add to the store can be used by multiple models. After a year, you have a powerful feature library that accelerates all new projects.

  6. Set time limits on AutoML — Without constraints, an AutoML run can consume far more time and compute than intended. Always specify time_limit and use appropriate presets for your use case.

  7. Validate AutoML feature importance — Suspiciously high importance scores often indicate data leakage. Always sanity-check before trusting AutoML’s discoveries.

  8. Online vs offline stores serve different needs — Offline for training (historical, high volume), online for serving (latest values, low latency). Both share feature definitions.

  9. AutoML presets matter — ‘best_quality’ for maximum accuracy, ‘optimize_for_deployment’ for production serving. Choose based on your constraints.

  10. The economics can be compelling — feature stores and AutoML can reduce repetitive work, shorten iteration cycles, and lower some categories of operational error.


  • “AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data” (2020) - The paper behind AutoGluon’s design, explains multi-layer stacking
  • “Feast: Feature Store for Machine Learning” (2021) - Architecture and design decisions for Feast
  • “Auto-sklearn 2.0” (2020) - Meta-learning approach to AutoML
  • “Michelangelo: Uber’s Machine Learning Platform” (2017) - Original feature store architecture at scale

Q1. Your team has a new tabular churn dataset and only one afternoon to produce a credible baseline before leadership decides whether to fund a larger ML effort. A senior engineer suggests spending two weeks hand-tuning gradient boosting models first. What is the better first step, and why?

Answer Run a quick AutoML baseline first, such as AutoGluon with a bounded `time_limit` and a faster preset like `medium_quality` or `good_quality`.

This matches the module’s recommended workflow: start simple, get a baseline in 30-60 minutes, and use that result to judge whether more manual effort is justified. AutoML is especially strong for tabular data and time-constrained projects.

Q2. Your fraud team lets AutoGluon run without a time limit on a large dataset because they want “the best possible model.” Two days later, the job is still running and cloud costs have spiked. What should they have done differently?

Answer They should have set an explicit `time_limit` before training.

The module warns that AutoML will keep trying more models and configurations unless you constrain it. A practical approach is to start with a limited run, such as one hour, inspect the leaderboard and feature importance, and only extend the run if the extra compute is justified.

Q3. A lending platform needs credit decisions in under 10 ms, but AutoGluon’s top-scoring model is a deep stacked ensemble with much slower inference. Which preset is the best fit for deployment, and why?

Answer Use `optimize_for_deployment`.

The module explains that deployment-oriented settings can reduce artifact size and simplify deployment, but you still need to validate actual inference latency and serving behavior for your chosen model. best_quality may win on offline accuracy, but it often relies on ensembles that can be harder to serve under strict latency constraints.
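Validating inference latency can be as simple as timing single-request calls and reading off percentiles. The sketch below is a hypothetical helper, not an AutoGluon feature: `measure_latency_ms` and `fake_predict` are illustrative names, and `fake_predict` stands in for a real call such as scoring a one-row DataFrame with the trained predictor.

```python
import time


def measure_latency_ms(predict_fn, payload, n_calls=200, warmup=20):
    """Return (p50, p99) latency in milliseconds for single-request calls."""
    for _ in range(warmup):  # warm caches before timing
        predict_fn(payload)
    samples = []
    for _ in range(n_calls):
        start = time.perf_counter()
        predict_fn(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p50 = samples[int(0.50 * (n_calls - 1))]
    p99 = samples[int(0.99 * (n_calls - 1))]
    return p50, p99


# Stand-in for a real model call (assumption for this example)
def fake_predict(row):
    return sum(value * 0.1 for value in row.values())


LATENCY_BUDGET_MS = 10.0  # budget from the scenario in the question
p50, p99 = measure_latency_ms(
    fake_predict, {"credit_score": 700, "debt_to_income": 0.3}
)
print(f"p50={p50:.4f} ms, p99={p99:.4f} ms, within budget: {p99 <= LATENCY_BUDGET_MS}")
```

Measure on the serving hardware with realistic payloads, and compare p99 rather than the mean against the budget, since tail latency is what breaks a strict per-request limit.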

Q4. Your team is training a churn model using customer features from a central store. For an event dated 2024-01-15, one engineer joins against the latest available customer record because it is simpler. Another engineer objects. Who is right, and what is the key concept involved?

Answer The second engineer is right: training must use point-in-time correct features, not the latest record.

The module emphasizes that using current values for past training events causes data leakage because the model sees future information. The correct approach is to retrieve the most recent feature value that existed on or before 2024-01-15.

Q5. Two teams independently build their own versions of customer_lifetime_value, and a third team creates yet another version for online inference. After deployment, the same customer receives inconsistent scores across systems. What platform component would have prevented this, and how?

Answer A feature store would have prevented this by creating a single, versioned definition of the feature shared across teams.

The module describes feature stores as a centralized system with a registry, metadata, and online/offline serving. That reduces duplication, keeps feature definitions consistent, and helps avoid training-serving skew.

Q6. An AutoML model for loan defaults suddenly shows a feature with extremely high importance: days_until_first_payment. Offline metrics look amazing, but production defaults rise after launch. Based on the module, what is the most likely problem and what should the team check first?

Answer The most likely problem is data leakage.

The team should first verify whether days_until_first_payment was actually available at prediction time. The module warns that AutoML will exploit any strong signal, including leaked future information, and that suspiciously important features must be sanity-checked before trusting them.

Q7. Your ML platform supports both model training on months of historical data and real-time recommendations at 50,000 requests per second. A new engineer wants to use the same low-latency online store for both tasks because it is fast. Why is that a bad design choice?

Answer It is a bad choice because training and inference need different storage behavior.

The module explains that the offline store is for historical, timestamped data used in training and point-in-time joins, while the online store is for the latest feature values used in low-latency inference. Using only the online store would make historical training data incomplete and increase the risk of leakage or skew.

You’ve completed Phase 8: Classical ML! You now understand:

  • Gradient boosting (XGBoost, LightGBM)
  • Time series forecasting (ARIMA, Prophet)
  • AutoML and feature stores

Up Next: Phase 9 - AI Safety & Evaluation


Module 39 Complete! “AutoML doesn’t replace ML engineers - it multiplies their productivity.”