Time Series Forecasting

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 6-8

The Intern Who Beat the Team

Illustrative AutoML comparison.

A junior practitioner can sometimes surface an AutoML result that challenges a carefully hand-built baseline.

A team had invested substantial time in a hand-built churn model before comparing it with an AutoML baseline.

Sarah, the team's intern, asked a naive question: “Could I try AutoGluon on the same data?”

The senior engineers exchanged knowing glances. “Sure, but don’t expect it to beat a model built by experienced engineers.”

A short AutoML run can sometimes produce a stronger baseline than a manually tuned model, depending on the dataset and evaluation setup.

“How is that possible?” asked the team lead.

“It evaluated multiple model families, combined strong candidates, and surfaced patterns the manual workflow had not captured,” Sarah explained.

The operational lesson is that AutoML can provide a fast, credible baseline before a team commits to more manual modeling work.

“AutoML doesn’t make data scientists obsolete—it makes their time more valuable. Now you can spend six months on problems that actually need human creativity.” — Common AutoML perspective

This module teaches you how to use AutoML tools effectively—and why they’re not magic, but a powerful force multiplier for any ML practitioner.

What You’ll Be Able to Do

By the end of this module, you will:

Understand AutoML and when to use it
Master AutoGluon for automated machine learning
Learn feature store concepts and architecture
Implement automated feature engineering
Build end-to-end ML pipelines

The AutoML Revolution

Imagine you’re a chef at a restaurant. Traditional ML is like cooking everything from scratch - you select ingredients (features), decide cooking methods (algorithms), adjust seasoning (hyperparameters), and taste-test constantly (validation). It requires years of expertise.

AutoML is like having a robot chef that tries hundreds of recipes automatically, learns from each attempt, and presents you with the best dish. You just need to provide the ingredients and describe what you want.

TRADITIONAL ML WORKFLOW:
────────────────────────
Data → [Manual Feature Eng] → [Choose Algorithm] → [Tune Hyperparameters] → Model
         ↑                         ↑                      ↑
         │                         │                      │
    Requires expertise       Requires expertise     Takes days/weeks
         │                         │                      │
         └─────────────────────────┴──────────────────────┘
                     TIME: Days to Weeks


AUTOML WORKFLOW:
────────────────
Data → [AutoML System] → Best Model
            │
            ├── Tries many model families and configurations
            ├── Engineers features automatically
            ├── Tunes hyperparameters
            └── Ensembles best models

       TIME: Minutes to Hours

Did You Know? AutoML systems have reported strong benchmark and competition results under fixed compute budgets, but the exact ranking and runtime depend on the benchmark and configuration.

Why AutoML Matters

The ML Expertise Gap

REALITY OF ML IN INDUSTRY:
──────────────────────────

Companies with ML needs:    ████████████████████ 1,000,000+
Companies with ML experts:  ███                  ~50,000
Expert ML engineers:        █                    ~300,000

Gap: 95%+ of companies can't hire ML experts!

AutoML bridges this gap:
- Democratizes ML (anyone can use it)
- Accelerates expert productivity (10x faster)
- Establishes strong baselines automatically

When to Use AutoML

USE AUTOML:
───────────
- Establishing baselines quickly
- Tabular data problems
- Time-constrained projects
- Non-expert teams
- Comparing many algorithms
- Hyperparameter optimization

DON'T USE AUTOML (alone):
─────────────────────────
- Custom architectures needed
- Domain-specific constraints
- Real-time inference requirements
- Highly specialized problems
- When interpretability is critical

AutoML Landscape

Major AutoML Frameworks

AUTOML FRAMEWORK COMPARISON:
────────────────────────────

┌──────────────────────────────────────────────────────────────────┐
│  FRAMEWORK      │ BEST FOR           │ KEY STRENGTH              │
├──────────────────────────────────────────────────────────────────┤
│  AutoGluon      │ Tabular, general   │ Best accuracy, ensembles  │
│  (Amazon)       │                    │                           │
├──────────────────────────────────────────────────────────────────┤
│  auto-sklearn   │ scikit-learn users │ Meta-learning, warm-start │
│  (Freiburg)     │                    │                           │
├──────────────────────────────────────────────────────────────────┤
│  H2O AutoML     │ Enterprise         │ Production-ready, scaling │
│  (H2O.ai)       │                    │                           │
├──────────────────────────────────────────────────────────────────┤
│  FLAML          │ Fast experiments   │ Low compute, fast         │
│  (Microsoft)    │                    │                           │
├──────────────────────────────────────────────────────────────────┤
│  PyCaret        │ Low-code ML        │ Simple API, visualization │
│  (Open Source)  │                    │                           │
└──────────────────────────────────────────────────────────────────┘

AutoGluon Deep Dive

AutoGluon (from Amazon) has reported strong results on tabular benchmarks and competitions:

AUTOGLUON ARCHITECTURE:
───────────────────────

                    Input Data
                         │
            ┌────────────┼────────────┐
            │            │            │
            ▼            ▼            ▼
        ┌───────┐   ┌───────┐   ┌───────┐
        │NN     │   │GBM    │   │Linear │
        │Models │   │Models │   │Models │
        └───┬───┘   └───┬───┘   └───┬───┘
            │           │           │
            │    ┌──────┼──────┐    │
            │    │      │      │    │
            ▼    ▼      ▼      ▼    ▼
        ┌─────────────────────────────┐
        │     MULTI-LAYER STACKING    │
        │  (Ensemble of ensembles)    │
        └──────────────┬──────────────┘
                       │
                       ▼
                  Best Model


MODELS AUTOGLUON TRIES:
───────────────────────
Neural Networks:
  - TabularNN (custom for tabular)
  - FastAI neural network

Gradient Boosting:
  - LightGBM
  - CatBoost
  - XGBoost

Linear Models:
  - Regularized linear / logistic regression

Other:
  - Random Forest, Extra Trees
  - K-nearest neighbors

Ensemble:
  - Weighted ensemble
  - Multi-layer stacking

Did You Know? AutoGluon uses stacked ensembles that train later models on earlier model predictions. This can improve accuracy, but the exact gain depends on the dataset, time budget, and evaluation setup.
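The simplest form of the ensembling idea can be sketched in plain Python. This is a minimal illustration, not AutoGluon's implementation: two hand-written base models (a least-squares line and a 5-nearest-neighbor average, both chosen here only for illustration) are blended with a single weight picked on held-out data. The synthetic data and all names are assumptions.

```python
import random
from statistics import fmean

# Synthetic 1-D regression problem (assumed for illustration): y = x^2 + noise.
rng = random.Random(0)
xs = [rng.uniform(0, 3) for _ in range(200)]
ys = [x ** 2 + rng.gauss(0, 0.5) for x in xs]
train = list(zip(xs[:100], ys[:100]))
holdout = list(zip(xs[100:], ys[100:]))

# Base model 1: least-squares line fitted on the training half.
mean_x = fmean(x for x, _ in train)
mean_y = fmean(y for _, y in train)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x


def linear_model(x):
    return slope * x + intercept


# Base model 2: average target of the 5 nearest training points.
def knn_model(x, k=5):
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return fmean(y for _, y in nearest)


def holdout_mse(predict):
    return fmean((predict(x) - y) ** 2 for x, y in holdout)


def blend(weight):
    """Second layer: a weighted average of the two base predictions."""
    return lambda x: weight * linear_model(x) + (1 - weight) * knn_model(x)


# Pick the blend weight on held-out data; weights 0.0 and 1.0 recover the base models.
weights = [i / 20 for i in range(21)]
best_weight = min(weights, key=lambda w: holdout_mse(blend(w)))

print("linear MSE:", holdout_mse(linear_model))
print("kNN MSE:   ", holdout_mse(knn_model))
print("blend MSE: ", holdout_mse(blend(best_weight)), "at weight", best_weight)
```

Because the weight is tuned and scored on the same held-out rows, the blended score is optimistic. A real stacking system trains the second layer on out-of-fold predictions and reports results on a separate test set.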

AutoML Under the Hood

Algorithm Selection

How does AutoML choose which algorithms to try?

ALGORITHM SELECTION STRATEGIES:
───────────────────────────────

1. EXHAUSTIVE SEARCH
   Try ALL algorithms, all hyperparameters
   Problem: Computationally infeasible!

   10 algorithms × 100 hyperparameter combos = 1,000 models
   If each takes 1 minute = 16+ hours


2. META-LEARNING (auto-sklearn approach)
   Learn from past datasets which algorithms work best

   "This dataset looks like Dataset #4,523 from our database.
    Random Forest worked best there, let's try that first!"

   Steps:
   a) Extract meta-features from dataset
   b) Find similar historical datasets
   c) Start with algorithms that worked on those


3. BAYESIAN OPTIMIZATION
   Smart search that learns as it goes

   ┌──────────────────────────────────────────────┐
   │ Iteration 1: Try random config → Score: 0.75 │
   │ Iteration 2: Try another → Score: 0.82      │
   │ Iteration 3: Try similar to best → 0.85     │
   │ ...learns that high learning_rate is bad... │
   │ Iteration 50: Optimal found → 0.91          │
   └──────────────────────────────────────────────┘


4. BANDIT-BASED (Hyperband/ASHA)
   Give more resources to promising configs

   Start: 100 configs with 1 epoch each
   Keep:  Top 25 configs, train for 4 epochs
   Keep:  Top 6 configs, train for 16 epochs
   Keep:  Top 2 configs, train to completion

   Result: Find best config with 10x less compute!
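The bandit-based schedule above (100 → 25 → 6 → 2 configs) can be sketched as a small successive-halving loop. This is a toy under stated assumptions: `evaluate` is a made-up scoring function standing in for real training, 64 epochs stands in for "completion", and each rung is counted as if it retrained from scratch.

```python
import math
import random


def evaluate(config, epochs):
    """Toy validation score (hypothetical): best near lr=0.05, improves with epochs."""
    closeness = 1.0 / (1.0 + abs(math.log10(config["lr"]) - math.log10(0.05)))
    return closeness * (1.0 - math.exp(-epochs / 8.0))


def successive_halving(configs, rungs):
    """rungs is a list of (configs_to_train, epochs_each), widest rung first."""
    survivors = list(configs)
    epochs_spent = 0
    for n_configs, epochs in rungs:
        # Keep only the top performers from the previous rung.
        survivors = survivors[:n_configs]
        # "Train" each survivor for this rung's budget and rank best-first.
        survivors.sort(key=lambda c: evaluate(c, epochs), reverse=True)
        epochs_spent += len(survivors) * epochs
    return survivors[0], epochs_spent


rng = random.Random(0)
configs = [{"lr": 10 ** rng.uniform(-3, 0)} for _ in range(100)]
rungs = [(100, 1), (25, 4), (6, 16), (2, 64)]

best, epochs_spent = successive_halving(configs, rungs)
full_cost = len(configs) * 64  # training every config to completion

print("best config:", best)
print("epochs spent:", epochs_spent, "vs full training:", full_cost)
```

With these toy numbers the schedule spends 424 epochs instead of 6,400. The actual saving in practice depends on the rung sizes, the final budget, and how well early scores predict final scores.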

Hyperparameter Optimization

Think of hyperparameter optimization like tuning a guitar. Each hyperparameter is a string that affects the sound. Turn the learning rate too high and you get noise; too low and you barely hear anything. The problem? A neural network has dozens of “strings,” and they all interact with each other.

Traditional approach: try every combination. With 10 hyperparameters and 5 values each, that’s 5^10 = 9.7 million combinations. Even at 1 minute per trial, that’s 18 years.

Smart approach: use Bayesian optimization, which learns from each trial. “High learning rate made things worse? Let’s try lower values.” It can often find good configurations in far fewer trials than exhaustive search.

HYPERPARAMETER SEARCH SPACE:
────────────────────────────

LightGBM example:
┌─────────────────────────────────────────────────┐
│  Parameter        │  Search Space              │
├─────────────────────────────────────────────────┤
│  n_estimators     │  [100, 200, 500, 1000]     │
│  learning_rate    │  [0.01, 0.05, 0.1, 0.3]    │
│  max_depth        │  [3, 5, 7, 10, -1]         │
│  num_leaves       │  [15, 31, 63, 127]         │
│  min_child_weight │  [1e-3, 1e-2, 0.1, 1]      │
│  subsample        │  [0.5, 0.7, 0.9, 1.0]      │
│  colsample_bytree │  [0.5, 0.7, 0.9, 1.0]      │
└─────────────────────────────────────────────────┘

Total combinations: 4×4×5×4×4×4×4 = 20,480 configs!

Smart search finds good config in ~50 trials
Exhaustive search needs 20,480 trials
Speedup: 400x
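The grid-size arithmetic above is easy to check in code. The sketch below also draws 50 configurations at random from the same space. Random sampling is used here only as a simple stand-in: a Bayesian optimizer would choose each new trial based on the scores of earlier ones, which this sketch does not do.

```python
import math
import random

# Search space copied from the LightGBM table above.
search_space = {
    "n_estimators": [100, 200, 500, 1000],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "max_depth": [3, 5, 7, 10, -1],
    "num_leaves": [15, 31, 63, 127],
    "min_child_weight": [1e-3, 1e-2, 0.1, 1],
    "subsample": [0.5, 0.7, 0.9, 1.0],
    "colsample_bytree": [0.5, 0.7, 0.9, 1.0],
}

# Exhaustive grid size: product of the number of options per parameter.
grid_size = math.prod(len(values) for values in search_space.values())


def sample_config(rng):
    """Draw one configuration uniformly at random from the grid."""
    return {name: rng.choice(values) for name, values in search_space.items()}


rng = random.Random(42)
trials = [sample_config(rng) for _ in range(50)]

print("grid size:", grid_size)
print("trials:", len(trials), "ratio:", grid_size / len(trials))
print("first trial:", trials[0])
```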

Automated Feature Engineering

Why Feature Engineering Matters

FEATURE ENGINEERING REALITY:
────────────────────────────

Time spent on ML projects:
┌──────────────────────────────────────────────────────┐
│  Data Collection      ████████████         25%       │
│  Data Cleaning        ████████████████     35%       │
│  Feature Engineering  ██████████████       30%       │
│  Model Training       ████                 10%       │
└──────────────────────────────────────────────────────┘

Feature engineering is:
- Time-consuming (30% of project time)
- Requires domain expertise
- Often repetitive across projects
- Critical for model performance

"Give me better features, and I'll give you a better model."
                                    - Every ML Engineer

Automated Feature Engineering Techniques

AUTO-FEATURE TECHNIQUES:
────────────────────────

1. AGGREGATION (for relational data)
   ─────────────────────────────────
   customer_id → orders table

   Auto-generated features:
   - count(orders)
   - sum(order_amount)
   - avg(order_amount)
   - max(order_amount)
   - days_since_last_order


2. TRANSFORMATION
   ───────────────
   Original: [price, quantity]

   Auto-generated:
   - log(price)
   - sqrt(quantity)
   - price * quantity  (interaction)
   - price / quantity  (ratio)
   - price ** 2        (polynomial)


3. TIME-BASED (from timestamps)
   ────────────────────────────
   Original: purchase_datetime

   Auto-generated:
   - hour_of_day
   - day_of_week
   - is_weekend
   - month
   - quarter
   - days_since_signup


4. ENCODING (for categoricals)
   ───────────────────────────
   Original: category = ["A", "B", "C"]

   Auto-generated:
   - One-hot encoding
   - Target encoding
   - Frequency encoding
   - Label encoding
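A few of the techniques above can be sketched with the standard library alone. The order records and column names below are made up for illustration; a real pipeline would typically use pandas or a dedicated library rather than hand-written loops.

```python
import math
from collections import Counter
from datetime import datetime

# Hypothetical order records (illustrative data only).
orders = [
    {"customer_id": 1, "amount": 20.0, "category": "A", "purchased_at": datetime(2024, 1, 6, 14, 30)},
    {"customer_id": 1, "amount": 35.0, "category": "B", "purchased_at": datetime(2024, 1, 9, 9, 0)},
    {"customer_id": 2, "amount": 80.0, "category": "A", "purchased_at": datetime(2024, 1, 7, 20, 15)},
]


def aggregate(orders, customer_id):
    """1. Aggregation: summarize a customer's orders."""
    amounts = [o["amount"] for o in orders if o["customer_id"] == customer_id]
    return {
        "order_count": len(amounts),
        "order_sum": sum(amounts),
        "order_avg": sum(amounts) / len(amounts),
        "order_max": max(amounts),
    }


def transform(amount):
    """2. Transformation: simple numeric transforms of one value."""
    return {"log_amount": math.log(amount), "amount_squared": amount ** 2}


def time_features(ts):
    """3. Time-based: calendar features from a timestamp."""
    return {
        "hour_of_day": ts.hour,
        "day_of_week": ts.weekday(),          # Monday = 0
        "is_weekend": ts.weekday() >= 5,
        "month": ts.month,
        "quarter": (ts.month - 1) // 3 + 1,
    }


def frequency_encode(values):
    """4. Encoding: replace each category with its relative frequency."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]


print(aggregate(orders, customer_id=1))
print(transform(orders[0]["amount"]))
print(time_features(orders[0]["purchased_at"]))
print(frequency_encode([o["category"] for o in orders]))
```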

Featuretools: Deep Feature Synthesis

DEEP FEATURE SYNTHESIS (DFS):
─────────────────────────────

Given relational tables, automatically generate features:

TABLES:
  customers(id, signup_date, country)
  orders(id, customer_id, date, amount)
  products(id, order_id, product_type, price)


DFS GENERATES:
──────────────
Depth 1: Simple aggregations
  - COUNT(orders)
  - SUM(orders.amount)
  - AVG(orders.amount)

Depth 2: Stacked aggregations
  - COUNT(orders.products)
  - AVG(orders.SUM(products.price))
  - MODE(orders.MODE(products.product_type))

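Depth 2 (another example): a different outer aggregation
  - STD(orders.AVG(products.price))

Depth 3 would need a third stacked primitive, for example a
transform applied to products.price before the two aggregations.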

Result: 100s of features from 3 tables!


PRIMITIVES USED:
────────────────
Aggregation: sum, mean, count, max, min, std, mode
Transform: year, month, weekday, cum_sum, diff
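To make "stacked aggregation" concrete, here is a hand-rolled version of one depth-2 feature, the average order total per customer (written above as AVG(orders.SUM(products.price))). It is a sketch of what DFS automates, not Featuretools code, and the rows are invented for illustration.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical rows for the orders and products tables described above.
orders = [
    {"id": 10, "customer_id": 1},
    {"id": 11, "customer_id": 1},
    {"id": 12, "customer_id": 2},
]
products = [
    {"order_id": 10, "price": 5.0},
    {"order_id": 10, "price": 15.0},
    {"order_id": 11, "price": 40.0},
    {"order_id": 12, "price": 7.0},
    {"order_id": 12, "price": 3.0},
]

# Inner aggregation (per order): SUM(products.price)
order_totals = defaultdict(float)
for product in products:
    order_totals[product["order_id"]] += product["price"]

# Outer aggregation (per customer): AVG over that customer's order totals
totals_by_customer = defaultdict(list)
for order in orders:
    totals_by_customer[order["customer_id"]].append(order_totals[order["id"]])

avg_order_total = {cid: fmean(totals) for cid, totals in totals_by_customer.items()}
print(avg_order_total)
```

DFS generates many such features by enumerating valid combinations of primitives along the table relationships, which is why the feature count grows quickly with depth.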

Did You Know? Automated feature engineering can generate large numbers of relational features quickly, but those features still need regularization, validation, and task-specific judgment to be useful.

Feature Stores

What is a Feature Store?

Think of a feature store as a “data warehouse for ML features” - a centralized repository where teams can share, discover, and reuse features.

Imagine a restaurant where every chef prepares their own spice blends. Chef A makes curry powder. Chef B makes the same curry powder differently. Chef C needs curry powder but doesn’t know it already exists, so they make a third version. Different dishes taste inconsistent, ingredients are wasted, and no one knows which recipe is “official.”

Now imagine a central spice cabinet with standardized, labeled blends. Every chef uses the same curry powder. New chefs can see what’s available. If the curry powder recipe improves, all dishes improve automatically.

That’s what a feature store does for ML features—centralizes, standardizes, and shares them across teams.

WITHOUT FEATURE STORE:
──────────────────────

Team A: Builds "customer_lifetime_value" feature
        ├── Writes SQL query
        ├── Schedules daily job
        └── Stores in their own table

Team B: Needs same feature
        ├── Doesn't know Team A has it
        ├── Builds their own version
        └── Gets slightly different results!

Team C: Needs feature for real-time inference
        ├── Can't use batch SQL
        └── Builds third version!

Result: 3 versions of the same feature, inconsistent!


WITH FEATURE STORE:
───────────────────

               ┌─────────────────────────────┐
               │      FEATURE STORE          │
               │  ┌─────────────────────┐    │
               │  │ customer_lifetime_  │    │
               │  │ value               │    │
               │  │ - Batch: Daily SQL  │    │
               │  │ - Online: Redis     │    │
               │  │ - Owner: Team A     │    │
               │  │ - Version: 2.3      │    │
               │  └─────────────────────┘    │
               └──────────────┬──────────────┘
                              │
           ┌──────────────────┼──────────────────┐
           │                  │                  │
        Team A             Team B             Team C
     (training)          (training)        (inference)

All teams use the SAME feature definition!

Feature Store Architecture

FEATURE STORE COMPONENTS:
─────────────────────────

┌─────────────────────────────────────────────────────────────────┐
│                     FEATURE STORE                                │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                  FEATURE REGISTRY                          │ │
│  │  - Feature definitions (schema, transformations)           │ │
│  │  - Metadata (owner, description, data lineage)            │ │
│  │  - Versioning                                              │ │
│  └────────────────────────────────────────────────────────────┘ │
│                              │                                   │
│         ┌────────────────────┴────────────────────┐             │
│         │                                         │              │
│  ┌──────▼──────┐                         ┌───────▼───────┐      │
│  │ OFFLINE     │                         │ ONLINE        │      │
│  │ STORE       │                         │ STORE         │      │
│  │             │                         │               │      │
│  │ - Historical│         Sync            │ - Latest      │      │
│  │ - Training  │ ◄──────────────────────►│ - Real-time   │      │
│  │ - BigQuery/ │                         │ - Redis/      │      │
│  │   S3/HDFS   │                         │   DynamoDB    │      │
│  └─────────────┘                         └───────────────┘      │
│         │                                         │              │
│         │                                         │              │
│  ┌──────▼──────────────────────────────────────────▼──────┐     │
│  │                    SERVING LAYER                        │     │
│  │   - Batch serving (training)                           │     │
│  │   - Online serving (inference)                         │     │
│  │   - Point-in-time correctness                          │     │
│  └────────────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────────┘

Point-in-Time Correctness

This is the MOST important concept in feature stores:

POINT-IN-TIME PROBLEM:
──────────────────────

Training data preparation:
  "What was the customer's purchase_count on 2024-01-15?"

WRONG APPROACH (data leakage!):
  SELECT purchase_count FROM current_features
  WHERE customer_id = 123

  Problem: Returns TODAY's count, not 2024-01-15's!
  The model trains on future information = cheating!


CORRECT APPROACH (point-in-time join):
  SELECT purchase_count FROM feature_history
  WHERE customer_id = 123
  AND feature_timestamp <= '2024-01-15'
  ORDER BY feature_timestamp DESC
  LIMIT 1

  Returns: Value as it was on 2024-01-15


TIMELINE:
─────────
          2024-01-15          Today
              │                 │
              ▼                 ▼
    ──────────●─────────────────●──────►
              │                 │
        Training event     Don't use
         Use features      these values!
         from HERE
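The two queries above can be run against a small in-memory SQLite table to see the difference. The table contents are invented for illustration, and the "latest value" function stands in for the leaky current_features lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature_history ("
    "customer_id INTEGER, feature_timestamp TEXT, purchase_count INTEGER)"
)
conn.executemany(
    "INSERT INTO feature_history VALUES (?, ?, ?)",
    [
        (123, "2024-01-01", 3),
        (123, "2024-01-10", 5),
        (123, "2024-02-20", 9),  # recorded AFTER the training event
    ],
)


def purchase_count_as_of(conn, customer_id, as_of):
    """Point-in-time lookup: latest value recorded on or before `as_of`."""
    row = conn.execute(
        "SELECT purchase_count FROM feature_history "
        "WHERE customer_id = ? AND feature_timestamp <= ? "
        "ORDER BY feature_timestamp DESC LIMIT 1",
        (customer_id, as_of),
    ).fetchone()
    return row[0] if row else None


def latest_purchase_count(conn, customer_id):
    """Leaky lookup: always returns the most recent value, ignoring event time."""
    row = conn.execute(
        "SELECT purchase_count FROM feature_history "
        "WHERE customer_id = ? ORDER BY feature_timestamp DESC LIMIT 1",
        (customer_id,),
    ).fetchone()
    return row[0] if row else None


print("as of 2024-01-15:", purchase_count_as_of(conn, 123, "2024-01-15"))
print("latest (leaky):  ", latest_purchase_count(conn, 123))
```

ISO-8601 date strings sort in chronological order, which is why a plain text comparison works here. A feature store performs the same kind of lookup for every row of the entity dataframe at once.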

Did You Know? Feature stores centralize feature definitions, can support online and offline retrieval, and help reduce training-serving inconsistencies.

Feast: Open Source Feature Store

Feast Architecture

FEAST COMPONENTS:
─────────────────

┌─────────────────────────────────────────────────────────────┐
│                         FEAST                                │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                   FEATURE REPO                        │   │
│  │   feature_store.yaml   # Configuration                │   │
│  │   features.py          # Feature definitions          │   │
│  └──────────────────────────────────────────────────────┘   │
│                            │                                 │
│                            │ feast apply                     │
│                            ▼                                 │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                   REGISTRY                            │   │
│  │   - Feature views                                     │   │
│  │   - Entities                                          │   │
│  │   - Data sources                                      │   │
│  └──────────────────────────────────────────────────────┘   │
│                            │                                 │
│            ┌───────────────┼───────────────┐                │
│            │               │               │                 │
│            ▼               ▼               ▼                 │
│     ┌───────────┐   ┌───────────┐   ┌───────────┐          │
│     │ Offline   │   │ Online    │   │ Serving   │          │
│     │ Store     │   │ Store     │   │ API       │          │
│     │ (Parquet) │   │ (Redis)   │   │ (gRPC)    │          │
│     └───────────┘   └───────────┘   └───────────┘          │
└─────────────────────────────────────────────────────────────┘

Defining Features in Feast

from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Define entity (the thing we're building features for)
customer = Entity(
    name="customer_id",
    join_keys=["customer_id"],
    description="Customer identifier"
)

# Define data source
customer_stats_source = FileSource(
    path="data/customer_stats.parquet",
    timestamp_field="event_timestamp"
)

# Define feature view
customer_stats = FeatureView(
    name="customer_stats",
    entities=[customer],
    ttl=timedelta(days=90),  # How long features are valid
    schema=[
        Field(name="total_purchases", dtype=Int64),
        Field(name="avg_order_value", dtype=Float32),
        Field(name="days_since_last_order", dtype=Int64),
    ],
    online=True,   # Serve from online store
    source=customer_stats_source,
)

Using Feast

# Training: Get historical features
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Entity dataframe (what we want features for)
entity_df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "event_timestamp": pd.to_datetime(["2024-01-15"] * 5)
})

# Get training data with point-in-time correctness!
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_stats:total_purchases",
        "customer_stats:avg_order_value",
        "customer_stats:days_since_last_order"
    ]
).to_df()


# Inference: Get online features
feature_vector = store.get_online_features(
    features=[
        "customer_stats:total_purchases",
        "customer_stats:avg_order_value",
    ],
    entity_rows=[{"customer_id": 123}]
).to_dict()

# Returns: {"customer_id": [123], "total_purchases": [47], ...}

ML Pipeline Automation

End-to-End ML Pipeline

AUTOMATED ML PIPELINE:
──────────────────────

┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│  Data   │──►│ Feature │──►│ Model   │──►│ Model   │──►│ Deploy  │
│ Ingest  │   │   Eng   │   │ Train   │   │  Eval   │   │         │
└─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘
     │             │             │             │             │
     ▼             ▼             ▼             ▼             ▼
 Scheduled    Feature       AutoML       Metrics      A/B Test
   Jobs       Store        Trains       Tracked      or Canary

ORCHESTRATION:
  - Airflow / Dagster / Prefect
  - MLflow for experiment tracking
  - Feature store for features
  - Model registry for models
  - CI/CD for deployment

MLflow Integration

MLFLOW EXPERIMENT TRACKING:
───────────────────────────

import mlflow
import mlflow.sklearn

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("model_type", "LightGBM")
    mlflow.log_param("n_estimators", 100)

    # Train model (train_model is a placeholder for your own training code)
    model = train_model(...)

    # Log metrics
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_metric("f1_score", 0.93)

    # Log model
    mlflow.sklearn.log_model(model, "model")


MLFLOW UI shows:
┌────────────────────────────────────────────────────────────┐
│  Run ID    │ Model     │ n_estimators │ Accuracy │ F1     │
├────────────────────────────────────────────────────────────┤
│  abc123    │ LightGBM  │ 100          │ 0.95     │ 0.93   │
│  def456    │ XGBoost   │ 200          │ 0.94     │ 0.92   │
│  ghi789    │ RF        │ 150          │ 0.92     │ 0.90   │
└────────────────────────────────────────────────────────────┘

Easy comparison across experiments!

Practical AutoML Workflow

Step-by-Step AutoML Process

PRODUCTION AUTOML WORKFLOW:
───────────────────────────

1. DATA PREPARATION
   ├── Load data
   ├── Basic cleaning (handle missing, remove duplicates)
   ├── Define target variable
   └── Split train/validation/test

2. QUICK AUTOML RUN (1 hour)
   ├── Use AutoGluon with time_limit=3600
   ├── Get baseline performance
   └── Identify promising models

3. ANALYZE RESULTS
   ├── Check feature importance
   ├── Look for data leakage
   └── Understand model behavior

4. EXTENDED RUN (optional)
   ├── Run AutoGluon with more time
   ├── Try different presets (best_quality vs optimize_for_deployment)
   └── Experiment with hyperparameter ranges

5. MODEL SELECTION
   ├── Compare accuracy vs inference time
   ├── Consider interpretability needs
   └── Check model size for deployment

6. PRODUCTION PREPARATION
   ├── Export best model
   ├── Create feature pipeline
   ├── Set up monitoring
   └── Deploy with gradual rollout

AutoGluon Presets

AUTOGLUON PRESETS:
──────────────────

PRESET: "best_quality"
  - Maximum accuracy
  - Uses all algorithms
  - Deep stacking ensembles
  - Time: 4-8x longer
  - Use for: Final production models

PRESET: "high_quality"
  - Near-optimal accuracy
  - Good ensemble
  - Reasonable time
  - Use for: Most use cases

PRESET: "good_quality"
  - Good accuracy
  - Faster training
  - Use for: Prototyping

PRESET: "medium_quality"
  - Decent accuracy
  - Much faster
  - Use for: Quick baselines

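PRESET: "optimize_for_deployment"
  - Deletes unused models and training artifacts after fit
  - Smaller footprint on disk
  - Does not by itself change accuracy or inference speed
  - Use for: Shipping a compact artifact, usually combined with a quality preset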

Common Pitfalls and Best Practices

AutoML Pitfalls

COMMON AUTOML MISTAKES:
───────────────────────

1. DATA LEAKAGE
   ───────────────
   Problem: Target information leaks into features

   Example: Predicting "will customer churn?"
   Bad feature: "cancellation_date" (directly reveals answer!)

   Solution: Review feature importance, suspicious features
             that are too predictive are often leaky


2. OVERFITTING TO VALIDATION
   ──────────────────────────
   Problem: Running AutoML many times, picking best

   Each run: Accuracy = 0.91, 0.92, 0.93, 0.94, 0.90...
   Pick best: 0.94!
   Test set:  0.88  (overfit to validation)

   Solution: Hold out a TRUE test set, evaluate once at end


3. IGNORING BUSINESS CONSTRAINTS
   ───────────────────────────────
   Problem: Best model has 200ms latency, need < 10ms

   AutoGluon best model: Stacked ensemble
   Inference time: 200ms

   Solution: Use a lighter preset (e.g. good_quality), constrain
             model types, or deploy a single fast model


4. FEATURE STORE SKEW
   ────────────────────
   Problem: Training features differ from serving features

   Training: feature_v1 (old transformation)
   Serving:  feature_v2 (new transformation)

   Solution: Use feature store with versioning
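Pitfall 2 can be reproduced with a short simulation. The sketch below assumes every candidate model has the same true accuracy (0.90, an arbitrary choice) and that validation scores differ only through sampling noise. Picking the best of 20 runs then looks better than the truth, while a fresh test set does not.

```python
import random

TRUE_ACCURACY = 0.90   # assumed true accuracy of every candidate model
N_VALIDATION = 500     # rows in the validation set
N_TEST = 5000          # rows in the untouched test set
N_RUNS = 20            # number of AutoML runs compared on validation


def measured_accuracy(rng, n_rows):
    """Accuracy observed on n_rows when each prediction is right with p = TRUE_ACCURACY."""
    correct = sum(rng.random() < TRUE_ACCURACY for _ in range(n_rows))
    return correct / n_rows


rng = random.Random(7)
validation_scores = [measured_accuracy(rng, N_VALIDATION) for _ in range(N_RUNS)]
best_validation = max(validation_scores)

# The "winning" run is no better than the others, so its test score
# falls back toward the true accuracy.
test_score = measured_accuracy(rng, N_TEST)

print("best validation score:", best_validation)
print("score on held-out test set:", test_score)
```

The gap between the two numbers is pure selection effect. This is why the test set should be evaluated once, after all model selection is finished.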

Best Practices

AUTOML BEST PRACTICES:
──────────────────────

1. START SIMPLE
   - Run quick AutoML first (30 min - 1 hour)
   - Get baseline before investing more time
   - Understand what's possible

2. FEATURE ENGINEERING STILL MATTERS
   - AutoML optimizes models, not features
   - Domain features often help
   - Combine AutoML + manual feature eng

3. USE PROPER VALIDATION
   - Time-based splits for time series
   - Stratified for imbalanced classes
   - Group splits for related samples

4. MONITOR IN PRODUCTION
   - Track feature distributions
   - Monitor model performance
   - Set up drift detection

5. DOCUMENT EVERYTHING
   - Which AutoML settings?
   - What features used?
   - Business metrics impact?
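Two of the validation strategies from practice 3 can be sketched in a few lines. The rows and field names are hypothetical. In a real project, scikit-learn's splitters or the AutoML tool's own options would usually do this work.

```python
from datetime import date

# Hypothetical event rows: several customers, events spread over time.
rows = [
    {"customer_id": 1, "day": date(2024, 1, 5)},
    {"customer_id": 2, "day": date(2024, 1, 20)},
    {"customer_id": 1, "day": date(2024, 2, 3)},
    {"customer_id": 3, "day": date(2024, 2, 18)},
    {"customer_id": 2, "day": date(2024, 3, 1)},
    {"customer_id": 4, "day": date(2024, 3, 9)},
]


def time_based_split(rows, cutoff):
    """Train on everything before the cutoff, validate on everything after."""
    train = [r for r in rows if r["day"] < cutoff]
    valid = [r for r in rows if r["day"] >= cutoff]
    return train, valid


def group_split(rows, validation_groups):
    """Keep all rows of a customer on the same side of the split."""
    train = [r for r in rows if r["customer_id"] not in validation_groups]
    valid = [r for r in rows if r["customer_id"] in validation_groups]
    return train, valid


time_train, time_valid = time_based_split(rows, cutoff=date(2024, 2, 15))
group_train, group_valid = group_split(rows, validation_groups={2, 4})

print(len(time_train), len(time_valid))
print(len(group_train), len(group_valid))
```

The time-based split prevents the model from training on events that happen after the ones it is validated on. The group split prevents the same customer from appearing on both sides.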

Did You Know? AutoML systems typically include built-in preprocessing and feature handling for tabular data, but the exact transformations and benchmark outcomes depend on the product and evaluation setup.

Summary

KEY CONCEPTS RECAP:
───────────────────

AUTOML:
  - Automates algorithm selection + hyperparameter tuning
  - AutoGluon: Best accuracy, multi-layer stacking
  - Use presets based on needs (quality vs speed)

FEATURE STORES:
  - Centralized feature management
  - Point-in-time correctness for training
  - Online/offline serving
  - Feast: Open source, easy to start

AUTOMATED FEATURE ENGINEERING:
  - Aggregations, transformations, time features
  - Featuretools for deep feature synthesis
  - Still combine with domain knowledge

ML PIPELINES:
  - End-to-end automation
  - MLflow for experiment tracking
  - Orchestration (Airflow, Dagster)


WHEN TO USE WHAT:
─────────────────

┌─────────────────────────────────────────────────────────────┐
│  SCENARIO                  │  RECOMMENDATION                │
├─────────────────────────────────────────────────────────────┤
│  Quick baseline            │  AutoGluon, medium_quality     │
│  Production model          │  AutoGluon, best_quality       │
│  Real-time serving         │  AutoGluon, good_quality       │
│  Feature reuse             │  Feast feature store           │
│  Relational data           │  Featuretools + AutoML         │
│  Team collaboration        │  Feature store + MLflow        │
└─────────────────────────────────────────────────────────────┘

Production War Stories: AutoML and Feature Store Lessons

A Costly Feature Leak

Illustrative fintech scenario.

A team can sometimes see a large offline-metric jump from an AutoML run, but that does not guarantee the features are valid for production decisions.

After deployment, a model can fail badly if it learned from leaked or otherwise invalid features.

The forensic analysis revealed the issue: a highly ranked feature depended on information that was not actually available at prediction time, which created data leakage.

The AutoML system had found a perfect predictor: a feature that leaked the outcome. It’s like trying to predict who will win a race by looking at the finish photo—technically accurate, but useless for making predictions before the race.

Financial impact: data leakage can cause severe financial and organizational damage before a team detects it.

The fix implemented:

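# Before: feature computed without checking when its inputs became available.
# payment_date is typically not known yet at approval time, so this leaks the future.
days_until_first_payment = payment_date - approval_date

# After: Strict point-in-time feature validation
class DataLeakageError(Exception):
    """Raised when a feature postdates the prediction it would feed."""

def validate_feature_timing(feature_name, feature_timestamp, prediction_timestamp):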
    """Ensure feature was available BEFORE prediction was needed."""
    if feature_timestamp > prediction_timestamp:
        raise DataLeakageError(
            f"Feature '{feature_name}' has timestamp {feature_timestamp} "
            f"but prediction was needed at {prediction_timestamp}. "
            f"This is data leakage!"
        )

Lesson: AutoML will find ANY signal, including signals from the future. Point-in-time validation isn’t optional—it’s essential.

Did You Know? Data leakage is common in production ML, can take time to detect, and point-in-time-correct feature retrieval helps reduce that risk.

The Feature Store That Saved Black Friday

Illustrative retail scenario.

A high-traffic retail system can fail if feature computation becomes the latency bottleneck during peak demand.

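In this scenario, the previous year's system computed features on demand and crashed under peak load. This year, the team had implemented Feast as its feature store, and the architecture was different: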

LAST YEAR (crashed):
────────────────────
Request → Compute Features On-Demand → Model → Response
          └── SQL query per request
          └── 200ms latency
          └── Can't scale past 500 RPS

THIS YEAR (Feast):
─────────────────
Pre-computed features → Redis (Online Store)
Request → Redis lookup (1ms) → Model → Response
          └── 50,000+ RPS
          └── 5ms total latency

With precomputed online features, a recommendation stack can handle much higher traffic with lower latency than on-demand feature computation.

The key insight: Feature computation is the bottleneck, not model inference. Pre-computing features and serving from an online store changed everything.
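The precompute-then-lookup pattern can be sketched as follows. A plain dict stands in for the online store (Redis in the scenario above), and the order data and function names are hypothetical. The point is the shape of the two code paths, not the latency figures.

```python
# Hypothetical raw order data that the on-demand path would have to scan.
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 75.0},
]


def compute_features(customer_id, orders):
    """On-demand path: scan raw orders at request time (the slow query)."""
    amounts = [o["amount"] for o in orders if o["customer_id"] == customer_id]
    return {
        "total_purchases": len(amounts),
        "avg_order_value": sum(amounts) / len(amounts) if amounts else 0.0,
    }


def materialize(orders):
    """Batch job: precompute features for every known customer."""
    customer_ids = {o["customer_id"] for o in orders}
    return {cid: compute_features(cid, orders) for cid in customer_ids}


# Run ahead of traffic; the dict plays the role of the online store.
online_store = materialize(orders)


def serve(customer_id):
    """Request path: a single key lookup, no scan of raw data."""
    return online_store.get(customer_id)


print(serve(1))
print(serve(999))  # unknown customer: no precomputed features
```

The trade-off is freshness: served values are only as recent as the last materialization run, so the batch schedule has to match how quickly the features change.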

Financial impact: improving feature serving can materially affect both revenue and operating cost, especially during peak traffic.

The AutoML Model That Discriminated

Illustrative insurance scenario.

The compliance team found that model decisions were uneven across location-based groups, raising a fairness concern.

Investigation suggested that the model had learned from proxy signals tied to historically biased outcomes, so the issue was not just accuracy but fairness.

The team’s response:

Removed proxy features: location-based and other features that acted as strong proxies for protected characteristics
Added fairness constraints: introduced explicit checks so model behavior would be reviewed against agreed fairness criteria across groups
Implemented explainability: Required human review for any denial with unusual feature weights

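# Fairness-aware AutoML workflow (a sketch, not a built-in AutoGluon feature).
# TabularPredictor has no fairness-constraint hyperparameters, so the checks
# wrap the fit: drop proxy features first, then audit predictions per group.
from autogluon.tabular import TabularPredictor

PROXY_FEATURES = ['zip_code']   # hypothetical proxy columns to exclude
GROUP_COLUMN = 'race_proxy'     # used only for auditing, never as a model input
EXCLUDED = PROXY_FEATURES + [GROUP_COLUMN]

predictor = TabularPredictor(
    label='claim_approved',     # assumed to be encoded as 0/1
    eval_metric='roc_auc'
).fit(train_data.drop(columns=EXCLUDED))

# Demographic parity audit on held-out data (val_data is assumed to exist):
# approval rates across groups should differ by at most 0.05.
predictions = predictor.predict(val_data.drop(columns=EXCLUDED))
approval_rates = predictions.groupby(val_data[GROUP_COLUMN]).mean()
parity_gap = approval_rates.max() - approval_rates.min()
if parity_gap > 0.05:
    raise ValueError(f"Demographic parity gap {parity_gap:.3f} exceeds 0.05")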

Regulatory outcome: addressing fairness issues early can reduce legal and regulatory risk.

Lesson: AutoML optimizes what you tell it to optimize. If you only optimize for accuracy, it will happily learn discriminatory patterns. Fairness must be an explicit constraint.

Did You Know? In 2018, Reuters reported that Amazon had scrapped an experimental AI recruiting tool after finding that it downgraded resumes containing terms such as "women's." The model was trained on roughly 10 years of submitted resumes, most of which came from men, and it learned to prefer male candidates. The incident is widely cited in discussions of fairness-aware ML.

Common Mistakes and How to Avoid Them

Mistake 1: Trusting AutoML Feature Importance Blindly

Wrong:

# AutoML found these are the most important features
# Great, let's use them!
top_features = model.feature_importance()[:10]
production_model = train_on(data[top_features])  # Dangerous!

Problem: Feature importance from AutoML can be misleading. A leaky feature will show high importance. A feature that’s important for one model type might be useless for another.

Right:

class LeakageWarning(Exception):
    """Raised when a feature looks like it leaks target information."""

# Hypothetical allow-list of features known to exist when a prediction is made
FEATURES_AVAILABLE_AT_PREDICTION_TIME = {'age', 'income', 'credit_score'}

def is_available_at_prediction_time(feature_name):
    return feature_name in FEATURES_AVAILABLE_AT_PREDICTION_TIME

def validate_feature_importance(feature_name, importance_score, data):
    """Sanity check for suspiciously important features."""

    # Check for data leakage
    if importance_score > 0.3:  # Suspiciously high
        print(f"WARNING: {feature_name} has importance {importance_score}")
        print("Checking for potential leakage...")

        # Check whether the feature is almost a copy of the target
        correlation_with_target = data[feature_name].corr(data['target'])
        if abs(correlation_with_target) > 0.8:
            raise LeakageWarning(
                f"{feature_name} has {correlation_with_target:.2f} correlation "
                f"with target. Likely data leakage!"
            )

    # Check if feature is available at prediction time
    if not is_available_at_prediction_time(feature_name):
        raise LeakageWarning(
            f"{feature_name} is not available at prediction time!"
        )

    return True

Mistake 2: Not Setting Time Limits on AutoML

Wrong:

# "Just let it run until it's done"
predictor = TabularPredictor(label='target').fit(train_data)
# 3 days later: still running, $2,000 in cloud costs

Problem: AutoML can keep exploring additional models and configurations unless you bound the run. Without explicit time limits, training can take longer and cost more than expected.

Right:

# Always set explicit time limits
predictor = TabularPredictor(
    label='target',
    eval_metric='roc_auc'
).fit(
    train_data,
    time_limit=3600,  # 1 hour max
    presets='best_quality',  # Will do its best within time limit
    # AutoGluon automatically prioritizes promising models
)

Mistake 3: Ignoring Training-Serving Skew

Wrong:

# Training time
training_features = compute_features_from_warehouse(training_data)
model.fit(training_features, labels)

# Serving time (different code path!)
serving_features = compute_features_from_api(request_data)  # Different!
prediction = model.predict(serving_features)

Problem: Subtle differences in feature computation between training and serving cause silent model degradation. Think of it as using different thermometers that are calibrated differently—your predictions will be systematically off.

Right:

# Use feature store for BOTH training and serving
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Training time
training_features = store.get_historical_features(
    entity_df=training_entities,
    features=['customer:total_purchases', 'customer:avg_order_value']
).to_df()

# Serving time (SAME feature definitions!)
serving_features = store.get_online_features(
    features=['customer:total_purchases', 'customer:avg_order_value'],
    entity_rows=[{"customer_id": request.customer_id}]
).to_dict()

# Both paths read values produced by the same feature definitions, which reduces skew
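Even with shared definitions, it can help to spot-check that both paths return the same values for the same entity. The sketch below is a hypothetical helper (the `find_skew` name and the tolerance are assumptions, not part of Feast) that compares one training-side feature row against one serving-side row.

```python
import math

def find_skew(training_row: dict, serving_row: dict, rel_tol: float = 1e-6) -> list:
    """Return the names of features whose values differ between the two paths."""
    mismatched = []
    for name in sorted(set(training_row) | set(serving_row)):
        a, b = training_row.get(name), serving_row.get(name)
        if a is None or b is None:
            mismatched.append(name)  # Feature missing from one path
        elif not math.isclose(a, b, rel_tol=rel_tol):
            mismatched.append(name)  # Same feature, different value
    return mismatched

# Toy rows for one customer; the serving value drifted slightly.
train_row = {"total_purchases": 20, "avg_order_value": 55.0}
serve_row = {"total_purchases": 20, "avg_order_value": 54.0}
print(find_skew(train_row, serve_row))
```

Running a check like this on a sample of live requests is one way to catch skew before it shows up as degraded model metrics.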

Mistake 4: Not Versioning Features

Wrong:

# Features.py - Modified directly in production
customer_value = total_purchases * avg_order_value  # Changed from sum to product
# Now training data has old definition, production has new...

Problem: Changing feature definitions without versioning creates chaos. Models trained on v1 features serving with v2 features will produce garbage.

Right:

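# features_v2.py - Explicit versioning
# Illustrative plain class; this is not the Feast FeatureView API.
import numpy as np

class CustomerValueV2:
    """
    Customer lifetime value calculation.

    v1 -> v2 changes:
    - Changed from sum to product formula
    - Added recency weighting
    - Breaking change: requires model retraining

    Migration: Models must be retrained before using v2
    """
    name = "customer_value_v2"
    version = "2.0.0"
    deprecates = "customer_value_v1"  # Mark old version
    requires_retraining = True

    @staticmethod
    def recency_weight(days_since_last_order, half_life_days=30.0):
        # Hypothetical decay: the weight halves every half_life_days
        return 0.5 ** (np.asarray(days_since_last_order) / half_life_days)

    def compute(self, data):
        return (data.total_purchases * data.avg_order_value *
                self.recency_weight(data.days_since_last_order))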

Mistake 5: Using AutoML for Everything

Wrong:

# "AutoML is magic, let's use it everywhere!"
image_model = AutoGluon.fit(image_data)  # Works but suboptimal
text_model = AutoGluon.fit(text_data)    # Works but suboptimal
time_series = AutoGluon.fit(ts_data)      # Works but suboptimal

Problem: AutoML is often strongest on tabular data, while other modalities frequently benefit from more specialized methods.

Right:

WHEN TO USE AUTOML vs SPECIALIZED TOOLS:
────────────────────────────────────────

Data Type    │ AutoML Good? │ Better Alternative
─────────────┼──────────────┼─────────────────────────────────
Tabular      │ Excellent    │ N/A - AutoML is a strong default
Images       │ OK           │ Transfer learning (ResNet, ViT)
Text         │ OK           │ Fine-tuned LLMs, BERT
Time Series  │ OK           │ Prophet, NeuralProphet, DeepAR
Graph        │ Poor         │ PyTorch Geometric, DGL
Audio        │ Poor         │ Whisper, wav2vec

Economics of AutoML and Feature Stores

AutoML ROI Calculation

Scenario	Manual ML	AutoML	Savings
Initial Model Development
Data scientist time	Multiple weeks for a manual first model	About a week for a bounded AutoML baseline	Meaningful time savings
Compute costs	Higher from longer manual experimentation	Lower for a short bounded AutoML run	Modest compute savings
Time to production	Several weeks	Roughly days to a couple of weeks	Faster delivery
Ongoing Maintenance
Monthly retraining time	Days of manual work	Hours for a more automated workflow	Lower recurring effort
Model iteration cost	Higher per iteration	Lower per iteration	Lower iteration cost
Annual Savings
Initial + 12 months maintenance	Higher manual cost profile	Lower with more automation	Potential annual savings

Feature Store ROI Calculation

Metric	Without Feature Store	With Feature Store
Feature development time	Often measured in weeks	Often faster with reuse
Feature duplication	Multiple teams may rebuild similar features	Shared definitions reduce duplication
Training-serving skew incidents	Can happen repeatedly	Can be reduced with shared feature definitions
Revenue lost to skew	Can be material	Can be reduced when skew is reduced
Engineer time on debugging	Significant	Lower with shared infrastructure
Total Annual Impact		Potential operational savings

Cost of NOT Using These Tools

ILLUSTRATIVE COSTS OF MANUAL ML:
────────────────────────────────

┌────────────────────────────────────────────────────────────┐
│  Problem                      │  Illustrative Cost        │
├────────────────────────────────────────────────────────────┤
│  Data leakage in production   │  $1M-10M (depending on    │
│                               │  time to detect)          │
├────────────────────────────────────────────────────────────┤
│  Training-serving skew        │  $100K-500K per incident  │
│  (silent model degradation)   │  (lost revenue + debug)   │
├────────────────────────────────────────────────────────────┤
│  Duplicate feature work       │  $50K-200K/year           │
│  (multiple teams building     │  (wasted engineer time)   │
│  same features)               │                           │
├────────────────────────────────────────────────────────────┤
│  Slow model iteration         │  $500K-2M/year            │
│  (opportunity cost of delayed │  (competitors move        │
│  improvements)                │  faster)                  │
└────────────────────────────────────────────────────────────┘

Did You Know? Teams often adopt feature stores to speed up model delivery and reduce operational rework, but the exact ROI depends heavily on the organization and workload.

Interview Preparation: AutoML & Feature Stores

Q1: “When would you use AutoML vs. hand-crafted models?”

Strong Answer: “I use AutoML in three main scenarios. First, for establishing baselines quickly—before investing weeks in manual model development, I run AutoML to understand what’s achievable. If AutoML gets 0.75 AUC, I know my hand-crafted model should aim for at least 0.78 to justify the extra effort.

Second, for tabular data problems with clear evaluation metrics—this is a strong use case for AutoML, which can often provide a competitive baseline with far less manual effort.

Third, when the ML isn’t the core differentiator—if we’re building a feature where ML is a small component, I’d rather spend engineering time on the product, not model tuning.

I wouldn’t use AutoML when I need specific architectures like transformers for NLP, when there are strict latency requirements that need optimized single models, or when interpretability is critical and I need to explain every decision.”

Q2: “Explain point-in-time correctness in feature stores.”

Strong Answer: “Point-in-time correctness ensures that when training a model, we only use feature values that were available at the time the prediction would have been made. It prevents data leakage from the future.

Imagine training a model to predict customer churn. If a customer churned on March 15, their training features should reflect their state on March 14 or earlier—not their state after they churned. Without point-in-time correctness, we might accidentally include features like ‘days_since_last_login’ that jumped to 30+ after they stopped using the product—a clear signal they’ve churned that wouldn’t be available when making a real prediction.

Feature stores implement this by maintaining timestamped feature values and performing point-in-time joins. When you request historical features for training, you provide entity timestamps, and the feature store returns, for each row, the most recent feature value recorded at or before that timestamp.

This is critical because models trained with leaked features show fantastic offline metrics but fail dramatically in production—a pattern I’ve seen called ‘suspiciously good AUC syndrome.’”
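A point-in-time join of the kind described in this answer can be sketched with `pandas.merge_asof`. The column names and values below are illustrative, and this shows the join logic only, not how any particular feature store implements it.

```python
import pandas as pd

# Timestamped feature values (what an offline store would hold).
features = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "total_purchases": [10, 15, 20, 5, 8],
    "event_timestamp": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-03-01", "2024-01-15", "2024-02-15"]
    ),
})

# Entity rows: who we are predicting for, and as of when.
labels = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2024-01-15", "2024-02-15", "2024-02-01"]),
})

# merge_asof requires both frames to be sorted by the time key.
training = pd.merge_asof(
    labels.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="customer_id",
    direction="backward",  # latest feature row at or before each label time
)
print(training)
```

Each label row picks up only the feature value that was already known at its timestamp, so the March value for customer 1 never appears in a January or February training row.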

Q3: “How does multi-layer stacking work in AutoGluon?”

Strong Answer: “Multi-layer stacking is AutoGluon’s ensemble technique, and it often outperforms simple averaging or voting.

In the first layer, AutoGluon trains diverse base models—gradient boosting (LightGBM, XGBoost, CatBoost), neural networks, and linear models. Each model makes predictions on out-of-fold validation data to avoid leakage.

These first-layer predictions become features for the second layer, which trains new models to combine them. Think of it as learning ‘when to trust which model.’ If LightGBM is great on numerical features but weak on categoricals, while CatBoost is the opposite, the second layer learns to weight them appropriately based on the input.

AutoGluon can stack multiple layers, but the useful depth depends on the dataset, models, and time budget.

The key insight is that stacking can outperform simple ensembling because it can learn non-linear combinations of model predictions. A weighted average says ‘LightGBM gets 40% weight.’ Multi-layer stacking can say ‘LightGBM gets 80% weight when feature X is high, but only 20% when feature Y is low.’

In practice, stacked ensembles can outperform simpler combinations, but the size of that gain depends on the task, dataset, and evaluation setup.”
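The core mechanism, base models producing out-of-fold predictions that feed a second-level model, can be illustrated with scikit-learn's `StackingClassifier`. This is a single stacking layer on synthetic data, swapped in for illustration; it is not AutoGluon's implementation, and the choice of base models is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # the final estimator is trained on out-of-fold base predictions
)
stack.fit(X_train, y_train)
print(f"Stacked accuracy: {stack.score(X_test, y_test):.3f}")
```

The `cv` argument is what prevents leakage here: the second-level model never sees a base prediction made on a row that the base model was trained on.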

Q4: “What’s the difference between online and offline feature stores?”

Strong Answer: “They serve different use cases with different latency and storage requirements.

The offline store is optimized for training workloads. It stores historical feature values with timestamps, typically in data warehouses like BigQuery, Snowflake, or object storage like S3. Latency is seconds to minutes, but it can handle huge volumes—millions of feature vectors. I use it when creating training datasets with point-in-time correctness.

The online store is optimized for inference. It stores only the latest feature values in low-latency databases like Redis, DynamoDB, or Bigtable. Latency is single-digit milliseconds. I use it when serving predictions in real-time.

The key architectural insight is that they share the same feature definitions but different storage backends. The feature store materializes features from the offline store to the online store periodically—typically every few minutes to daily, depending on freshness requirements.

This dual architecture solves the training-serving skew problem. My training code uses get_historical_features from the offline store. My serving code uses get_online_features from the online store. Both use identical feature definitions, just different storage optimized for their use case.”
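Materialization, as described in this answer, amounts to copying the latest value per entity from the offline history into the online store. The sketch below assumes a pandas DataFrame as the offline store and a dict as the online store; the `materialize` helper is hypothetical.

```python
import pandas as pd

# Offline store: full timestamped history.
offline = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "total_purchases": [10, 20, 5, 8],
    "event_timestamp": pd.to_datetime(
        ["2024-01-01", "2024-03-01", "2024-01-15", "2024-02-15"]
    ),
})

def materialize(offline_df: pd.DataFrame, entity_col: str, ts_col: str) -> dict:
    """Keep only the most recent row per entity, keyed for fast lookup."""
    latest = (
        offline_df.sort_values(ts_col)
        .drop_duplicates(subset=entity_col, keep="last")
    )
    return {row[entity_col]: row for row in latest.to_dict("records")}

# Online store: one row per entity, latest values only.
online = materialize(offline, "customer_id", "event_timestamp")
print(online[1]["total_purchases"], online[2]["total_purchases"])
```

How often this job runs determines feature freshness, which is the trade-off behind the "every few minutes to daily" range mentioned above.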

System Design: Design an ML Platform with AutoML and Feature Store

Prompt: “Design an ML platform for a fintech company that needs to make real-time credit decisions at 10,000 requests per second.”

Strong Answer:

“I’d design this with five key components:

1. Feature Store Architecture:

Online Store: Redis Cluster (6 nodes)
  - Sharded by customer_id
  - 500K features, 1ms p99 latency
  - TTL: 24 hours, refreshed hourly

Offline Store: BigQuery
  - Historical features for training
  - 2 years of feature history
  - Point-in-time query support

Feature Computation: Spark on Dataproc
  - Batch features: daily runs at 2 AM
  - Real-time features: Kafka Streams
  - Features: credit_score, payment_history,
    debt_to_income, account_age, etc.

2. AutoML Pipeline:

# Monthly retraining pipeline
def train_credit_model():
    # Get training data with point-in-time correctness.
    # The entity dataframe carries the label column alongside the entity
    # keys and timestamps; features are joined onto it.
    training_data = feature_store.get_historical_features(
        entity_df=approved_applications_last_6_months,
        features=CREDIT_FEATURES
    ).to_df()

    # Time-bounded AutoML; fairness is audited after training
    predictor = TabularPredictor(
        label='defaulted_within_90_days',
        eval_metric='roc_auc'
    ).fit(
        training_data,
        time_limit=14400,  # 4 hours
        presets='optimize_for_deployment',  # Drops unused models to shrink the artifact
        excluded_model_types=['NN_TORCH', 'FASTAI']  # Skip neural nets for latency
    )

    # Validate fairness before promotion
    if passes_fairness_audit(predictor):
        deploy_to_production(predictor)

3. Serving Architecture for 10K RPS:

Load Balancer (GCP GLB)
        │
        ├── Kubernetes Cluster (GKE)
        │   └── Model Serving Pods (50 replicas)
        │       - CPU-optimized (LightGBM)
        │       - 10ms p99 latency per request
        │       - gRPC for low overhead
        │
        └── Feature Store (Redis)
            - 1ms feature fetch
            - Pre-computed features only

4. Monitoring & Safety:

Feature drift detection: alert if distributions shift >10%
Model performance monitoring: daily AUC calculation on holdout
Fairness monitoring: automated demographic parity checks
Circuit breaker: fall back to rules-based model if ML fails
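The drift alert in this list needs a concrete measure of "shift." One common choice is the Population Stability Index (PSI), sketched below with NumPy. The 0.1 alert threshold is an assumption for illustration, not a rule from this module, and the data is synthetic.

```python
import numpy as np

def population_stability_index(expected, actual, bins: int = 10) -> float:
    """PSI between a reference sample and a current sample of one feature."""
    # Bin edges come from quantiles of the reference distribution.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip so values outside the reference range land in the edge bins.
    actual = np.clip(actual, edges[0], edges[-1])
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid log(0) and division by zero for empty bins.
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)   # training-time distribution
stable = rng.normal(0.0, 1.0, 5000)      # same distribution, new sample
shifted = rng.normal(0.5, 1.0, 5000)     # mean moved by half a standard deviation

ALERT_THRESHOLD = 0.1  # assumed threshold for this sketch
for name, sample in [("stable", stable), ("shifted", shifted)]:
    psi = population_stability_index(reference, sample)
    print(f"{name}: PSI={psi:.3f} alert={psi > ALERT_THRESHOLD}")
```

A per-feature score like this can be computed on a schedule and wired to the same alerting path as the daily AUC check.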

5. Cost Estimate:

Feature Store (Redis): $15K/month
Compute (GKE): $25K/month
BigQuery: $5K/month
Total: $45K/month = $540K/year

Expected value: At 10K RPS, serving 864M decisions/day. Even 0.1% improvement in precision saves millions in bad debt.

This architecture handles 10K RPS with 15ms end-to-end latency while maintaining feature consistency and enabling rapid model iteration through AutoML.”
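The figures in this answer can be checked with a few lines of arithmetic. The inputs below are taken from the answer itself; the 30-day month and the assumption of sustained peak load are simplifications added for the per-decision cost.

```python
RPS = 10_000
SECONDS_PER_DAY = 24 * 60 * 60
REPLICAS = 50
MONTHLY_COST_USD = 15_000 + 25_000 + 5_000  # Redis + GKE + BigQuery

decisions_per_day = RPS * SECONDS_PER_DAY
rps_per_replica = RPS / REPLICAS
annual_cost_usd = MONTHLY_COST_USD * 12

# Latency budget: feature fetch + model inference, compared with the 15 ms target
feature_fetch_ms = 1
inference_ms = 10
headroom_ms = 15 - (feature_fetch_ms + inference_ms)

# Cost per million decisions, assuming sustained peak load for a 30-day month
decisions_per_month = decisions_per_day * 30
cost_per_million_usd = MONTHLY_COST_USD / (decisions_per_month / 1_000_000)

print(f"Decisions/day:        {decisions_per_day:,}")
print(f"RPS per replica:      {rps_per_replica:.0f}")
print(f"Annual cost:          ${annual_cost_usd:,}")
print(f"Latency headroom:     {headroom_ms} ms for network and serialization")
print(f"Cost per 1M decisions: ${cost_per_million_usd:.2f}")
```

Real traffic is rarely at peak all day, so the per-decision cost in practice would be higher than this lower bound.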

Hands-On Exercises

Exercise 1: AutoML Baseline Challenge

Build an AutoML baseline and compare it to a hand-crafted model:

"""
AutoML vs Manual Model Comparison

Dataset: Credit Card Fraud Detection (Kaggle)
Goal: Compare development time and accuracy
"""
import pandas as pd
from autogluon.tabular import TabularPredictor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import time

# Load data
data = pd.read_csv('creditcard.csv')
X = data.drop('Class', axis=1)
y = data['Class']
# Stratify so the rare fraud class appears in both splits at the same rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# MANUAL APPROACH
print("=" * 50)
print("MANUAL RANDOM FOREST")
print("=" * 50)
manual_start = time.time()

rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
rf_predictions = rf.predict_proba(X_test)[:, 1]
rf_auc = roc_auc_score(y_test, rf_predictions)

manual_time = time.time() - manual_start
print(f"Time: {manual_time:.1f}s")
print(f"AUC: {rf_auc:.4f}")

# AUTOML APPROACH
print("\n" + "=" * 50)
print("AUTOGLUON")
print("=" * 50)
automl_start = time.time()

# Prepare data for AutoGluon
train_df = X_train.copy()
train_df['Class'] = y_train.values

predictor = TabularPredictor(
    label='Class',
    eval_metric='roc_auc',
    verbosity=1
).fit(
    train_df,
    time_limit=300,  # 5 minutes
    presets='medium_quality'
)

test_df = X_test.copy()
automl_predictions = predictor.predict_proba(test_df)
automl_auc = roc_auc_score(y_test, automl_predictions.iloc[:, 1])

automl_time = time.time() - automl_start
print(f"Time: {automl_time:.1f}s")
print(f"AUC: {automl_auc:.4f}")

# COMPARISON
print("\n" + "=" * 50)
print("COMPARISON")
print("=" * 50)
print(f"Manual RF:  {rf_auc:.4f} AUC in {manual_time:.1f}s")
print(f"AutoGluon:  {automl_auc:.4f} AUC in {automl_time:.1f}s")
print(f"Improvement: {(automl_auc - rf_auc)*100:.2f} percentage points")

# See what AutoGluon tried
print("\n" + "=" * 50)
print("AUTOGLUON LEADERBOARD")
print("=" * 50)
print(predictor.leaderboard())

Expected Learning: AutoML can outperform a basic manually specified baseline, showing the value of automated model selection and ensembling.

Exercise 2: Feature Store Implementation

Build a simple feature store with point-in-time correctness:

"""
Simple Feature Store Implementation

This exercise teaches the core concepts of feature stores:
- Point-in-time correctness
- Online vs offline serving
- Feature versioning
"""
import pandas as pd
from datetime import datetime
from typing import Dict, List

class SimpleFeatureStore:
    """
    A minimal feature store demonstrating core concepts.

    Production feature stores like Feast add:
    - Distributed storage
    - Automatic materialization
    - Schema validation
    - Access control
    """

    def __init__(self):
        # Offline store: Historical features with timestamps
        self.offline_store: Dict[str, pd.DataFrame] = {}

        # Online store: Latest features only (simulated with dict)
        self.online_store: Dict[str, Dict] = {}

        # Feature registry: Metadata about features
        self.registry: Dict[str, dict] = {}

    def register_feature(
        self,
        name: str,
        entity: str,
        description: str,
        version: str = "1.0.0"
    ):
        """Register a feature in the registry."""
        self.registry[name] = {
            'entity': entity,
            'description': description,
            'version': version,
            'created_at': datetime.now().isoformat()
        }
        print(f"Registered feature: {name} v{version}")

    def write_features(
        self,
        feature_name: str,
        data: pd.DataFrame,
        timestamp_col: str = 'event_timestamp'
    ):
        """Write features to both offline and online stores."""
        if feature_name not in self.registry:
            raise ValueError(f"Feature {feature_name} not registered!")

        # Write to offline store (full history)
        if feature_name not in self.offline_store:
            self.offline_store[feature_name] = data.copy()
        else:
            self.offline_store[feature_name] = pd.concat([
                self.offline_store[feature_name],
                data
            ]).drop_duplicates()

        # Write to online store (latest values only).
        # Sort by timestamp so the last write per entity is the most recent row.
        entity_col = self.registry[feature_name]['entity']
        for record in data.sort_values(timestamp_col).to_dict('records'):
            entity_id = str(record[entity_col])
            self.online_store[f"{feature_name}:{entity_id}"] = record

        print(f"Wrote {len(data)} rows to {feature_name}")

    def get_historical_features(
        self,
        feature_name: str,
        entity_df: pd.DataFrame,
        timestamp_col: str = 'event_timestamp'
    ) -> pd.DataFrame:
        """
        Get historical features with point-in-time correctness.

        This is the CRITICAL function that prevents data leakage.
        """
        if feature_name not in self.offline_store:
            raise ValueError(f"Feature {feature_name} not in offline store!")

        feature_data = self.offline_store[feature_name]
        entity_col = self.registry[feature_name]['entity']

        results = []
        for _, entity_row in entity_df.iterrows():
            entity_id = entity_row[entity_col]
            query_time = entity_row[timestamp_col]

            # Point-in-time filter: only use features recorded AT OR BEFORE query time
            valid_features = feature_data[
                (feature_data[entity_col] == entity_id) &
                (feature_data[timestamp_col] <= query_time)
            ]

            if len(valid_features) > 0:
                # Get the most recent valid feature
                latest = valid_features.sort_values(timestamp_col).iloc[-1]
                results.append(latest.to_dict())
            else:
                # No valid features - use nulls
                results.append({entity_col: entity_id})

        return pd.DataFrame(results)

    def get_online_features(
        self,
        feature_name: str,
        entity_ids: List[str]
    ) -> List[Dict]:
        """Get latest features for real-time inference."""
        results = []
        for entity_id in entity_ids:
            key = f"{feature_name}:{entity_id}"
            if key in self.online_store:
                results.append(self.online_store[key])
            else:
                results.append({'error': 'not_found'})
        return results


# Demo usage
if __name__ == "__main__":
    store = SimpleFeatureStore()

    # Register features
    store.register_feature(
        name='customer_stats',
        entity='customer_id',
        description='Aggregated customer purchase statistics'
    )

    # Create some historical feature data
    feature_data = pd.DataFrame({
        'customer_id': [1, 1, 1, 2, 2],
        'total_purchases': [10, 15, 20, 5, 8],
        'avg_order_value': [50.0, 52.0, 55.0, 30.0, 35.0],
        'event_timestamp': pd.to_datetime([
            '2024-01-01', '2024-02-01', '2024-03-01',
            '2024-01-15', '2024-02-15'
        ])
    })

    store.write_features('customer_stats', feature_data)

    # Point-in-time retrieval for training
    training_entities = pd.DataFrame({
        'customer_id': [1, 1, 2],
        'event_timestamp': pd.to_datetime([
            '2024-01-15',  # Should get Jan 1 features
            '2024-02-15',  # Should get Feb 1 features
            '2024-02-01'   # Should get Jan 15 features
        ])
    })

    historical = store.get_historical_features(
        'customer_stats',
        training_entities
    )
    print("\nHistorical Features (point-in-time correct):")
    print(historical)

    # Online retrieval for inference
    online = store.get_online_features('customer_stats', ['1', '2'])
    print("\nOnline Features (latest values):")
    for f in online:
        print(f)

Expected Learning: Understanding how point-in-time correctness prevents data leakage and why online/offline stores serve different purposes.

Exercise 3: Data Leakage Detection

Build a tool to detect potential data leakage in AutoML results:

"""
Data Leakage Detection Tool

Detects common patterns that indicate data leakage in AutoML models.
"""
import pandas as pd
import numpy as np
from typing import List, Tuple

class LeakageDetector:
    """Detects potential data leakage in ML features."""

    def __init__(self, model, X: pd.DataFrame, y: pd.Series):
        self.model = model
        self.X = X
        self.y = y
        self.warnings = []

    def check_feature_importance(
        self,
        importance_threshold: float = 0.3
    ) -> List[Tuple[str, float, str]]:
        """
        Flag features with suspiciously high importance.

        Leaky features often have unusually high importance because
        they directly encode the target.
        """
        try:
            importances = self.model.feature_importance()
        except AttributeError:
            # For models without built-in feature importance
            return []

        suspicious = []
        for feature, importance in importances.items():
            if importance > importance_threshold:
                suspicious.append((
                    feature,
                    importance,
                    f"Unusually high importance ({importance:.3f}). "
                    f"Check for data leakage!"
                ))
                self.warnings.append(f"HIGH_IMPORTANCE: {feature}")

        return suspicious

    def check_correlation_with_target(
        self,
        correlation_threshold: float = 0.9
    ) -> List[Tuple[str, float, str]]:
        """
        Flag features with very high correlation to target.

        Perfect or near-perfect correlation often indicates leakage.
        """
        suspicious = []
        for col in self.X.columns:
            # Use pandas' dtype check so int32, float32, etc. are also covered
            if pd.api.types.is_numeric_dtype(self.X[col]):
                corr = abs(self.X[col].corr(self.y))
                if corr > correlation_threshold:
                    suspicious.append((
                        col,
                        corr,
                        f"Very high correlation ({corr:.3f}). "
                        f"Likely data leakage!"
                    ))
                    self.warnings.append(f"HIGH_CORRELATION: {col}")

        return suspicious

    def check_perfect_prediction_features(self) -> List[str]:
        """
        Flag features that perfectly predict the target alone.

        If a single feature achieves >99% accuracy, it's usually leakage.
        """
        suspicious = []
        for col in self.X.columns:
            if self.X[col].nunique() < 100:  # Categorical-ish
                # Check if any value perfectly predicts target
                for value in self.X[col].unique():
                    mask = self.X[col] == value
                    if mask.sum() >= 30:  # Enough samples that purity is unlikely by chance
                        target_values = self.y[mask].unique()
                        if len(target_values) == 1:
                            suspicious.append(col)
                            self.warnings.append(
                                f"PERFECT_PREDICTOR: {col}={value} -> {target_values[0]}"
                            )
                            break

        return suspicious

    def check_temporal_leakage(
        self,
        date_columns: List[str],
        target_date_column: str = None
    ) -> List[str]:
        """
        Flag features that might be computed from future data.
        """
        suspicious = []

        # Check if any features have dates AFTER the target date
        if target_date_column and target_date_column in self.X.columns:
            target_date = pd.to_datetime(self.X[target_date_column])
            for col in date_columns:
                if col in self.X.columns and col != target_date_column:
                    feature_date = pd.to_datetime(self.X[col])
                    future_rows = (feature_date > target_date).sum()
                    if future_rows > 0:
                        suspicious.append(col)
                        self.warnings.append(
                            f"TEMPORAL_LEAKAGE: {col} has {future_rows} "
                            f"rows with dates after target"
                        )

        return suspicious

    def generate_report(self) -> str:
        """Generate a comprehensive leakage report."""
        report = []
        report.append("=" * 60)
        report.append("DATA LEAKAGE DETECTION REPORT")
        report.append("=" * 60)

        # Run all checks
        importance_issues = self.check_feature_importance()
        correlation_issues = self.check_correlation_with_target()
        perfect_predictors = self.check_perfect_prediction_features()

        # Format report
        if importance_issues:
            report.append("\nHIGH IMPORTANCE FEATURES:")
            for feature, importance, msg in importance_issues:
                report.append(f"  - {feature}: {msg}")

        if correlation_issues:
            report.append("\nHIGH CORRELATION FEATURES:")
            for feature, corr, msg in correlation_issues:
                report.append(f"  - {feature}: {msg}")

        if perfect_predictors:
            report.append("\nPERFECT PREDICTOR FEATURES:")
            for feature in perfect_predictors:
                report.append(f"  - {feature}")

        if not any([importance_issues, correlation_issues, perfect_predictors]):
            report.append("\nNo obvious leakage detected")
            report.append("   (Note: Some leakage types require domain knowledge to detect)")

        report.append("\n" + "=" * 60)
        report.append(f"Total warnings: {len(self.warnings)}")

        return "\n".join(report)


# Usage example
if __name__ == "__main__":
    # Create sample data with intentional leakage
    np.random.seed(42)
    n = 1000

    # Normal features
    X = pd.DataFrame({
        'age': np.random.randint(18, 80, n),
        'income': np.random.normal(50000, 20000, n),
        'credit_score': np.random.randint(300, 850, n),
    })

    # Target: will customer default?
    y = pd.Series(np.random.binomial(1, 0.2, n))

    # Add LEAKY feature (computed from outcome!)
    # This simulates a feature that includes future information
    X['days_until_default'] = np.where(y == 1, np.random.randint(1, 90, n), -1)

    # This feature is suspicious - high correlation
    X['default_indicator'] = y * 0.99 + np.random.normal(0, 0.01, n)

    # Create mock model
    class MockModel:
        def feature_importance(self):
            return {
                'age': 0.05,
                'income': 0.08,
                'credit_score': 0.12,
                'days_until_default': 0.45,  # Suspiciously high!
                'default_indicator': 0.30
            }

    # Run detection
    detector = LeakageDetector(MockModel(), X, y)
    print(detector.generate_report())

Expected Learning: Understanding common leakage patterns and how to systematically detect them before they cause production failures.

Key Takeaways

AutoML is a force multiplier, not a replacement — It doesn’t replace ML engineers; it amplifies their productivity by automating tedious parts (algorithm selection, hyperparameter tuning) so they can focus on harder problems.
AutoGluon is a strong tabular baseline — For structured data problems, its stacking approach often performs well and is worth testing before investing in heavier manual tuning.
Feature stores help reduce training-serving skew — By using the same feature definitions for training and serving, they reduce a major source of production ML failures.
Point-in-time correctness is essential — Future information can leak into ML training data, and timestamp-aware joins help reduce that risk.
Feature reuse compounds over time — Every feature you add to the store can be used by multiple models. After a year, you have a powerful feature library that accelerates all new projects.
Set time limits on AutoML — Without an explicit budget, an AutoML run can take far longer and cost far more than expected. Always specify `time_limit` and choose a preset that fits your use case.
Validate AutoML feature importance — Suspiciously high importance scores often indicate data leakage. Always sanity-check before trusting AutoML’s discoveries.
Online vs offline stores serve different needs — Offline for training (historical, high volume), online for serving (latest values, low latency). Both share feature definitions.
AutoML presets matter — ‘best_quality’ for maximum accuracy, ‘optimize_for_deployment’ for production serving. Choose based on your constraints.
The economics can be compelling — feature stores and AutoML can reduce repetitive work, shorten iteration cycles, and lower some categories of operational error.

Quiz

Q1. Your team has a new tabular churn dataset and only one afternoon to produce a credible baseline before leadership decides whether to fund a larger ML effort. A senior engineer suggests spending two weeks hand-tuning gradient boosting models first. What is the better first step, and why?

Answer

Run a quick AutoML baseline first, such as AutoGluon with a bounded `time_limit` and a faster preset like `medium_quality` or `good_quality`.

This matches the module’s recommended workflow: start simple, get a baseline in 30-60 minutes, and use that result to judge whether more manual effort is justified. AutoML is especially strong for tabular data and time-constrained projects.
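
A time-boxed baseline of this kind might look like the sketch below. The helper name `fit_quick_baseline`, the `churned` label column, and the 30-minute default are illustrative assumptions, not values taken from the module. The AutoGluon import is deferred into the function so the definition loads even where the library is not installed.

```python
def fit_quick_baseline(train_df, label, time_limit_s=1800, preset="medium_quality"):
    """Fit a time-boxed AutoGluon baseline (hypothetical helper name)."""
    # Refuse unbounded runs up front so the budget is always explicit.
    if time_limit_s is None or time_limit_s <= 0:
        raise ValueError("set a positive time_limit_s so the run stays bounded")

    # Deferred import: AutoGluon is a heavy optional dependency.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label).fit(
        train_df,
        time_limit=time_limit_s,  # wall-clock budget in seconds
        presets=preset,           # e.g. "medium_quality" or "good_quality"
    )
    return predictor


# Example call, assuming train_df is a DataFrame with a "churned" column:
# predictor = fit_quick_baseline(train_df, label="churned")
# print(predictor.leaderboard())
```

Inspecting the leaderboard after a short run is usually enough to decide whether a longer run or manual tuning is worth the extra time.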

Q2. Your fraud team lets AutoGluon run without a time limit on a large dataset because they want “the best possible model.” Two days later, the job is still running and cloud costs have spiked. What should they have done differently?

Answer

They should have set an explicit `time_limit` before training.

The module warns that AutoML will keep trying more models and configurations unless you constrain it. A practical approach is to start with a limited run, such as one hour, inspect the leaderboard and feature importance, and only extend the run if the extra compute is justified.

Q3. A lending platform needs credit decisions in under 10 ms, but AutoGluon’s top-scoring model is a deep stacked ensemble with much slower inference. Which preset is the best fit for deployment, and why?

Answer

Use `optimize_for_deployment`.

The module explains that deployment-oriented settings can reduce artifact size and simplify deployment, but you still need to validate actual inference latency and serving behavior for your chosen model. `best_quality` may win on offline accuracy, but it often relies on ensembles that can be harder to serve under strict latency constraints.

Q4. Your team is training a churn model using customer features from a central store. For an event dated 2024-01-15, one engineer joins against the latest available customer record because it is simpler. Another engineer objects. Who is right, and what is the key concept involved?

Answer

The second engineer is right: training must use point-in-time correct features, not the latest record.

The module emphasizes that using current values for past training events causes data leakage because the model sees future information. The correct approach is to retrieve the most recent feature value that existed on or before 2024-01-15.
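
One way to sketch a point-in-time join outside a feature store is pandas' `merge_asof`, which picks the most recent feature row at or before each event time. The column names and values below (`support_tickets_30d`, the dates, the counts) are made up for illustration.

```python
import pandas as pd

# Label events: one row per (customer, event time) we want to train on.
events = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-15", "2024-03-01", "2024-01-15"]),
})

# Feature history: each row holds the value as it was recorded at feature_time.
features = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "feature_time": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-01-10", "2024-01-20"]
    ),
    "support_tickets_30d": [0, 4, 1, 7],
})

# merge_asof requires both frames to be sorted by their time keys.
events = events.sort_values("event_time")
features = features.sort_values("feature_time")

training = pd.merge_asof(
    events,
    features,
    left_on="event_time",
    right_on="feature_time",
    by="customer_id",      # match within the same customer only
    direction="backward",  # latest feature row at or before the event
)
print(training[["customer_id", "event_time", "feature_time", "support_tickets_30d"]])
```

For customer 2 on 2024-01-15 the join returns the 2024-01-10 value (1) rather than the later 2024-01-20 value (7), which is exactly the future information a "latest record" join would have leaked.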

Q5. Two teams independently build their own versions of customer_lifetime_value, and a third team creates yet another version for online inference. After deployment, the same customer receives inconsistent scores across systems. What platform component would have prevented this, and how?

Answer

A feature store would have prevented this by creating a single, versioned definition of the feature shared across teams.

The module describes feature stores as a centralized system with a registry, metadata, and online/offline serving. That reduces duplication, keeps feature definitions consistent, and helps avoid training-serving skew.

Q6. An AutoML model for loan defaults suddenly shows a feature with extremely high importance: days_until_first_payment. Offline metrics look amazing, but production defaults rise after launch. Based on the module, what is the most likely problem and what should the team check first?

Answer

The most likely problem is data leakage.

The team should first verify whether days_until_first_payment was actually available at prediction time. The module warns that AutoML will exploit any strong signal, including leaked future information, and that suspiciously important features must be sanity-checked before trusting them.
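
The availability check can be sketched as a simple audit: record when each feature value becomes known and flag anything that only exists after the prediction is made. The function name and all timestamps below are hypothetical; in practice the availability times would come from data lineage or the feature pipeline's own metadata.

```python
from datetime import datetime


def find_unavailable_features(prediction_time, feature_available_at):
    """Return names of features whose values only exist after prediction_time."""
    return sorted(
        name
        for name, available_at in feature_available_at.items()
        if available_at > prediction_time
    )


# Hypothetical audit for a single loan application.
prediction_time = datetime(2024, 1, 15, 9, 0)
feature_available_at = {
    "credit_score": datetime(2024, 1, 10),
    "income": datetime(2024, 1, 12),
    # Assumed here to be known only once the loan is already issued.
    "days_until_first_payment": datetime(2024, 2, 20),
}

leaky = find_unavailable_features(prediction_time, feature_available_at)
print(leaky)  # ['days_until_first_payment']
```

Any feature this audit flags should be dropped or recomputed from data that exists at prediction time before the model is retrained.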

Q7. Your ML platform supports both model training on months of historical data and real-time recommendations at 50,000 requests per second. A new engineer wants to use the same low-latency online store for both tasks because it is fast. Why is that a bad design choice?

Answer

It is a bad choice because training and inference need different storage behavior.

The module explains that the offline store is for historical, timestamped data used in training and point-in-time joins, while the online store is for the latest feature values used in low-latency inference. Using only the online store would make historical training data incomplete and increase the risk of leakage or skew.
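
The difference in storage behavior can be shown with a toy in-memory store. This is an illustration only, under the assumption of ISO date strings as timestamps; `ToyFeatureStore` is a made-up class and does not reflect how Feast or any production store is implemented.

```python
from collections import defaultdict


class ToyFeatureStore:
    """In-memory illustration of offline (history) vs online (latest) storage."""

    def __init__(self):
        self.offline = defaultdict(list)  # entity -> [(timestamp, value), ...]
        self.online = {}                  # entity -> (timestamp, latest value)

    def write(self, entity_id, timestamp, value):
        # The offline store keeps every version; the online store keeps one.
        self.offline[entity_id].append((timestamp, value))
        current = self.online.get(entity_id)
        if current is None or timestamp >= current[0]:
            self.online[entity_id] = (timestamp, value)

    def get_online(self, entity_id):
        """Serving path: a single key lookup for the latest value."""
        return self.online[entity_id][1]

    def get_historical(self, entity_id, as_of):
        """Training path: latest value at or before as_of, or None."""
        history = self.offline.get(entity_id, [])
        candidates = [(ts, value) for ts, value in history if ts <= as_of]
        if not candidates:
            return None
        return max(candidates, key=lambda pair: pair[0])[1]


store = ToyFeatureStore()
store.write("cust_1", "2024-01-10", 3)  # ISO dates compare correctly as strings
store.write("cust_1", "2024-02-05", 9)

print(store.get_online("cust_1"))                          # 9
print(store.get_historical("cust_1", as_of="2024-01-15"))  # 3
```

The online side cannot answer the `as_of` question at all because it has already overwritten the January value, which is why training on it alone would lose history.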

Next Steps

You’ve completed Phase 8: Classical ML! You now understand:

Gradient boosting (XGBoost, LightGBM)
Time series forecasting (ARIMA, Prophet)
AutoML and feature stores

Up Next: Phase 9 - AI Safety & Evaluation

Module 39 Complete! “AutoML doesn’t replace ML engineers - it multiplies their productivity.”

Sources

AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data — Primary source for AutoGluon’s stacking approach, benchmark framing, and competition results.
Feast GitHub Repository — Official upstream overview for online/offline feature serving and point-in-time-correct retrieval.
Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning — Primary source for meta-learning and budget-aware AutoML design in a widely cited framework.
