Time Series Forecasting

Цей контент ще не доступний вашою мовою.

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 6-8

The Intern Who Beat the Team

Seattle. August 12, 2020. 10:15 AM.

Sarah Chen was three weeks into her data science internship when she made the senior engineers uncomfortable.

The team had spent six months building a churn prediction model. Feature engineering alone took three engineers two months. Hyperparameter tuning consumed another month. The final model achieved 0.847 AUC—a number the team celebrated with champagne.

Sarah asked a naive question: “Could I try AutoGluon on the same data?”

The senior engineers exchanged knowing glances. “Sure, but don’t expect it to beat a model built by experienced engineers.”

Four hours later, Sarah walked into the standup meeting. Her AutoGluon model achieved 0.863 AUC—almost two percentage points higher than the hand-crafted solution. The room went silent.

“How is that possible?” asked the team lead.

“It tried 15 different algorithms, ensembled the top 7, and did multi-layer stacking,” Sarah replied, reading from AutoGluon’s leaderboard. “It also found a feature interaction our model missed.”

The team’s six months of work had been outperformed by an intern with four hours and an AutoML library.

“AutoML doesn’t make data scientists obsolete—it makes their time more valuable. Now you can spend six months on problems that actually need human creativity.” — Nick Erickson, Lead Developer of AutoGluon, NeurIPS Workshop 2020

This module teaches you how to use AutoML tools effectively—and why they’re not magic, but a powerful force multiplier for any ML practitioner.

What You’ll Be Able to Do

By the end of this module, you will:

Understand AutoML and when to use it
Master AutoGluon for automated machine learning
Learn feature store concepts and architecture
Implement automated feature engineering
Build end-to-end ML pipelines

The AutoML Revolution

Imagine you’re a chef at a restaurant. Traditional ML is like cooking everything from scratch - you select ingredients (features), decide cooking methods (algorithms), adjust seasoning (hyperparameters), and taste-test constantly (validation). It requires years of expertise.

AutoML is like having a robot chef that tries hundreds of recipes automatically, learns from each attempt, and presents you with the best dish. You just need to provide the ingredients and describe what you want.

TRADITIONAL ML WORKFLOW:
────────────────────────
Data → [Manual Feature Eng] → [Choose Algorithm] → [Tune Hyperparameters] → Model
         ↑                         ↑                      ↑
         │                         │                      │
    Requires expertise       Requires expertise     Takes days/weeks
         │                         │                      │
         └─────────────────────────┴──────────────────────┘
                     TIME: Days to Weeks


AUTOML WORKFLOW:
────────────────
Data → [AutoML System] → Best Model
            │
            ├── Tries 100+ algorithms
            ├── Engineers features automatically
            ├── Tunes hyperparameters
            └── Ensembles best models

       TIME: Minutes to Hours

Did You Know? In the 2019 AutoML Challenge, AutoGluon achieved top-3 results on 39 out of 39 datasets, often beating solutions that took ML experts weeks to develop. The total computation time? About 4 hours per dataset with no human intervention!

Why AutoML Matters

The ML Expertise Gap

REALITY OF ML IN INDUSTRY:
──────────────────────────

Companies with ML needs:    ████████████████████ 1,000,000+
Companies with ML experts:  ███                  ~50,000
Expert ML engineers:        █                    ~300,000

Gap: 95%+ of companies can't hire ML experts!

AutoML bridges this gap:
- Democratizes ML (anyone can use it)
- Accelerates expert productivity (10x faster)
- Establishes strong baselines automatically

When to Use AutoML

USE AUTOML:
───────────
 Establishing baselines quickly
 Tabular data problems
 Time-constrained projects
 Non-expert teams
 Comparing many algorithms
 Hyperparameter optimization

DON'T USE AUTOML (alone):
─────────────────────────
 Custom architectures needed
 Domain-specific constraints
 Real-time inference requirements
 Highly specialized problems
 When interpretability is critical

AutoML Landscape

Major AutoML Frameworks

AUTOML FRAMEWORK COMPARISON:
────────────────────────────

┌──────────────────────────────────────────────────────────────────┐
│  FRAMEWORK      │ BEST FOR           │ KEY STRENGTH              │
├──────────────────────────────────────────────────────────────────┤
│  AutoGluon      │ Tabular, general   │ Best accuracy, ensembles  │
│  (Amazon)       │                    │                           │
├──────────────────────────────────────────────────────────────────┤
│  auto-sklearn   │ scikit-learn users │ Meta-learning, warm-start │
│  (Freiburg)     │                    │                           │
├──────────────────────────────────────────────────────────────────┤
│  H2O AutoML     │ Enterprise         │ Production-ready, scaling │
│  (H2O.ai)       │                    │                           │
├──────────────────────────────────────────────────────────────────┤
│  FLAML          │ Fast experiments   │ Low compute, fast         │
│  (Microsoft)    │                    │                           │
├──────────────────────────────────────────────────────────────────┤
│  PyCaret        │ Low-code ML        │ Simple API, visualization │
│  (Open Source)  │                    │                           │
└──────────────────────────────────────────────────────────────────┘

AutoGluon Deep Dive

AutoGluon (by Amazon) consistently wins ML competitions:

AUTOGLUON ARCHITECTURE:
───────────────────────

                    Input Data
                         │
            ┌────────────┼────────────┐
            │            │            │
            ▼            ▼            ▼
        ┌───────┐   ┌───────┐   ┌───────┐
        │NN     │   │GBM    │   │Linear │
        │Models │   │Models │   │Models │
        └───┬───┘   └───┬───┘   └───┬───┘
            │           │           │
            │    ┌──────┼──────┐    │
            │    │      │      │    │
            ▼    ▼      ▼      ▼    ▼
        ┌─────────────────────────────┐
        │     MULTI-LAYER STACKING    │
        │  (Ensemble of ensembles)    │
        └──────────────┬──────────────┘
                       │
                       ▼
                  Best Model


MODELS AUTOGLUON TRIES:
───────────────────────
Neural Networks:
  - TabularNN (custom for tabular)
  - FastAI neural network

Gradient Boosting:
  - LightGBM
  - CatBoost
  - XGBoost

Linear Models:
  - Ridge/Lasso regression
  - Linear SVM

Ensemble:
  - Weighted ensemble
  - Multi-layer stacking

Did You Know? AutoGluon’s “multi-layer stacking” is unique. Instead of just averaging model predictions, it trains a second layer of models on the first layer’s predictions, then a third layer, and so on. This recursive ensembling often improves accuracy by 1-3% - which can mean millions in revenue for businesses!

AutoML Under the Hood

Algorithm Selection

How does AutoML choose which algorithms to try?

ALGORITHM SELECTION STRATEGIES:
───────────────────────────────

1. EXHAUSTIVE SEARCH
   Try ALL algorithms, all hyperparameters
   Problem: Computationally infeasible!

   10 algorithms × 100 hyperparameter combos = 1,000 models
   If each takes 1 minute = 16+ hours


2. META-LEARNING (auto-sklearn approach)
   Learn from past datasets which algorithms work best

   "This dataset looks like Dataset #4,523 from our database.
    Random Forest worked best there, let's try that first!"

   Steps:
   a) Extract meta-features from dataset
   b) Find similar historical datasets
   c) Start with algorithms that worked on those


3. BAYESIAN OPTIMIZATION
   Smart search that learns as it goes

   ┌──────────────────────────────────────────────┐
   │ Iteration 1: Try random config → Score: 0.75 │
   │ Iteration 2: Try another → Score: 0.82      │
   │ Iteration 3: Try similar to best → 0.85     │
   │ ...learns that high learning_rate is bad... │
   │ Iteration 50: Optimal found → 0.91          │
   └──────────────────────────────────────────────┘


4. BANDIT-BASED (Hyperband/ASHA)
   Give more resources to promising configs

   Start: 100 configs with 1 epoch each
   Keep:  Top 25 configs, train for 4 epochs
   Keep:  Top 6 configs, train for 16 epochs
   Keep:  Top 2 configs, train to completion

   Result: Find best config with 10x less compute!

Hyperparameter Optimization

Think of hyperparameter optimization like tuning a guitar. Each hyperparameter is a string that affects the sound. Turn the learning rate too high and you get noise; too low and you barely hear anything. The problem? A neural network has dozens of “strings,” and they all interact with each other.

Traditional approach: try every combination. With 10 hyperparameters and 5 values each, that’s 5^10 = 9.7 million combinations. Even at 1 minute per trial, that’s 18 years.

Smart approach: use Bayesian optimization, which learns from each trial. “High learning rate made things worse? Let’s try lower values.” It finds good configurations in 50-100 trials instead of millions.

HYPERPARAMETER SEARCH SPACE:
────────────────────────────

LightGBM example:
┌─────────────────────────────────────────────────┐
│  Parameter        │  Search Space              │
├─────────────────────────────────────────────────┤
│  n_estimators     │  [100, 200, 500, 1000]     │
│  learning_rate    │  [0.01, 0.05, 0.1, 0.3]    │
│  max_depth        │  [3, 5, 7, 10, -1]         │
│  num_leaves       │  [15, 31, 63, 127]         │
│  min_child_weight │  [1e-3, 1e-2, 0.1, 1]      │
│  subsample        │  [0.5, 0.7, 0.9, 1.0]      │
│  colsample_bytree │  [0.5, 0.7, 0.9, 1.0]      │
└─────────────────────────────────────────────────┘

Total combinations: 4×4×5×4×4×4×4 = 20,480 configs!

Smart search finds good config in ~50 trials
Exhaustive search needs 20,480 trials
Speedup: 400x

Automated Feature Engineering

Why This Module Matters

FEATURE ENGINEERING REALITY:
────────────────────────────

Time spent on ML projects:
┌──────────────────────────────────────────────────────┐
│  Data Collection      ████████████         25%       │
│  Data Cleaning        ████████████████     35%       │
│  Feature Engineering  ██████████████       30%       │
│  Model Training       ████                 10%       │
└──────────────────────────────────────────────────────┘

Feature engineering is:
- Time-consuming (30% of project time)
- Requires domain expertise
- Often repetitive across projects
- Critical for model performance

"Give me better features, and I'll give you a better model."
                                    - Every ML Engineer

Automated Feature Engineering Techniques

AUTO-FEATURE TECHNIQUES:
────────────────────────

1. AGGREGATION (for relational data)
   ─────────────────────────────────
   customer_id → orders table

   Auto-generated features:
   - count(orders)
   - sum(order_amount)
   - avg(order_amount)
   - max(order_amount)
   - days_since_last_order


2. TRANSFORMATION
   ───────────────
   Original: [price, quantity]

   Auto-generated:
   - log(price)
   - sqrt(quantity)
   - price * quantity  (interaction)
   - price / quantity  (ratio)
   - price ** 2        (polynomial)


3. TIME-BASED (from timestamps)
   ────────────────────────────
   Original: purchase_datetime

   Auto-generated:
   - hour_of_day
   - day_of_week
   - is_weekend
   - month
   - quarter
   - days_since_signup


4. ENCODING (for categoricals)
   ───────────────────────────
   Original: category = ["A", "B", "C"]

   Auto-generated:
   - One-hot encoding
   - Target encoding
   - Frequency encoding
   - Label encoding

Featuretools: Deep Feature Synthesis

DEEP FEATURE SYNTHESIS (DFS):
─────────────────────────────

Given relational tables, automatically generate features:

TABLES:
  customers(id, signup_date, country)
  orders(id, customer_id, date, amount)
  products(id, order_id, product_type, price)


DFS GENERATES:
──────────────
Depth 1: Simple aggregations
  - COUNT(orders)
  - SUM(orders.amount)
  - AVG(orders.amount)

Depth 2: Stacked aggregations
  - COUNT(orders.products)
  - AVG(orders.SUM(products.price))
  - MODE(orders.MODE(products.product_type))

Depth 3: Triple-stacked!
  - STD(orders.AVG(products.price))

Result: 100s of features from 3 tables!


PRIMITIVES USED:
────────────────
Aggregation: sum, mean, count, max, min, std, mode
Transform: year, month, weekday, cum_sum, diff

Did You Know? In the 2015 KDD Cup, a team used Featuretools to generate over 1,000 features automatically. They finished in the top 10% of the competition with minimal manual feature engineering. The winning insight: more features (properly regularized) often beats carefully hand-crafted few features.

Feature Stores

What is a Feature Store?

Think of a feature store as a “data warehouse for ML features” - a centralized repository where teams can share, discover, and reuse features.

Imagine a restaurant where every chef prepares their own spice blends. Chef A makes curry powder. Chef B makes the same curry powder differently. Chef C needs curry powder but doesn’t know it already exists, so they make a third version. Different dishes taste inconsistent, ingredients are wasted, and no one knows which recipe is “official.”

Now imagine a central spice cabinet with standardized, labeled blends. Every chef uses the same curry powder. New chefs can see what’s available. If the curry powder recipe improves, all dishes improve automatically.

That’s what a feature store does for ML features—centralizes, standardizes, and shares them across teams.

WITHOUT FEATURE STORE:
──────────────────────

Team A: Builds "customer_lifetime_value" feature
        ├── Writes SQL query
        ├── Schedules daily job
        └── Stores in their own table

Team B: Needs same feature
        ├── Doesn't know Team A has it
        ├── Builds their own version
        └── Gets slightly different results!

Team C: Needs feature for real-time inference
        ├── Can't use batch SQL
        └── Builds third version!

Result: 3 versions of the same feature, inconsistent!


WITH FEATURE STORE:
───────────────────

               ┌─────────────────────────────┐
               │      FEATURE STORE          │
               │  ┌─────────────────────┐    │
               │  │ customer_lifetime_  │    │
               │  │ value               │    │
               │  │ - Batch: Daily SQL  │    │
               │  │ - Online: Redis     │    │
               │  │ - Owner: Team A     │    │
               │  │ - Version: 2.3      │    │
               │  └─────────────────────┘    │
               └──────────────┬──────────────┘
                              │
           ┌──────────────────┼──────────────────┐
           │                  │                  │
        Team A             Team B             Team C
     (training)          (training)        (inference)

All teams use the SAME feature definition!

Feature Store Architecture

FEATURE STORE COMPONENTS:
─────────────────────────

┌─────────────────────────────────────────────────────────────────┐
│                     FEATURE STORE                                │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                  FEATURE REGISTRY                          │ │
│  │  - Feature definitions (schema, transformations)           │ │
│  │  - Metadata (owner, description, data lineage)            │ │
│  │  - Versioning                                              │ │
│  └────────────────────────────────────────────────────────────┘ │
│                              │                                   │
│         ┌────────────────────┴────────────────────┐             │
│         │                                         │              │
│  ┌──────▼──────┐                         ┌───────▼───────┐      │
│  │ OFFLINE     │                         │ ONLINE        │      │
│  │ STORE       │                         │ STORE         │      │
│  │             │                         │               │      │
│  │ - Historical│         Sync            │ - Latest      │      │
│  │ - Training  │ ◄──────────────────────►│ - Real-time   │      │
│  │ - BigQuery/ │                         │ - Redis/      │      │
│  │   S3/HDFS   │                         │   DynamoDB    │      │
│  └─────────────┘                         └───────────────┘      │
│         │                                         │              │
│         │                                         │              │
│  ┌──────▼──────────────────────────────────────────▼──────┐     │
│  │                    SERVING LAYER                        │     │
│  │   - Batch serving (training)                           │     │
│  │   - Online serving (inference)                         │     │
│  │   - Point-in-time correctness                          │     │
│  └────────────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────────┘

Point-in-Time Correctness

This is the MOST important concept in feature stores:

POINT-IN-TIME PROBLEM:
──────────────────────

Training data preparation:
  "What was the customer's purchase_count on 2024-01-15?"

WRONG APPROACH (data leakage!):
  SELECT purchase_count FROM current_features
  WHERE customer_id = 123

  Problem: Returns TODAY's count, not 2024-01-15's!
  The model trains on future information = cheating!


CORRECT APPROACH (point-in-time join):
  SELECT purchase_count FROM feature_history
  WHERE customer_id = 123
  AND feature_timestamp <= '2024-01-15'
  ORDER BY feature_timestamp DESC
  LIMIT 1

  Returns: Value as it was on 2024-01-15


TIMELINE:
─────────
          2024-01-15          Today
              │                 │
              ▼                 ▼
    ──────────●─────────────────●──────►
              │                 │
        Training event     Don't use
         Use features      these values!
         from HERE

Did You Know? Uber’s feature store “Michelangelo” serves over 10 million feature vector lookups per second. They estimate that having a centralized feature store reduced their ML feature development time by 50% and virtually eliminated training-serving skew issues that previously caused silent model degradation.

Feast: Open Source Feature Store

Feast Architecture

FEAST COMPONENTS:
─────────────────

┌─────────────────────────────────────────────────────────────┐
│                         FEAST                                │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                   FEATURE REPO                        │   │
│  │   feature_store.yaml   # Configuration                │   │
│  │   features.py          # Feature definitions          │   │
│  └──────────────────────────────────────────────────────┘   │
│                            │                                 │
│                            │ feast apply                     │
│                            ▼                                 │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                   REGISTRY                            │   │
│  │   - Feature views                                     │   │
│  │   - Entities                                          │   │
│  │   - Data sources                                      │   │
│  └──────────────────────────────────────────────────────┘   │
│                            │                                 │
│            ┌───────────────┼───────────────┐                │
│            │               │               │                 │
│            ▼               ▼               ▼                 │
│     ┌───────────┐   ┌───────────┐   ┌───────────┐          │
│     │ Offline   │   │ Online    │   │ Serving   │          │
│     │ Store     │   │ Store     │   │ API       │          │
│     │ (Parquet) │   │ (Redis)   │   │ (gRPC)    │          │
│     └───────────┘   └───────────┘   └───────────┘          │
└─────────────────────────────────────────────────────────────┘

Defining Features in Feast

from feast import Entity, Feature, FeatureView, FileSource
from feast.types import Float32, Int64

# Define entity (the thing we're building features for)
customer = Entity(
    name="customer_id",
    join_keys=["customer_id"],
    description="Customer identifier"
)

# Define data source
customer_stats_source = FileSource(
    path="data/customer_stats.parquet",
    timestamp_field="event_timestamp"
)

# Define feature view
customer_stats = FeatureView(
    name="customer_stats",
    entities=[customer],
    ttl=timedelta(days=90),  # How long features are valid
    schema=[
        Field(name="total_purchases", dtype=Int64),
        Field(name="avg_order_value", dtype=Float32),
        Field(name="days_since_last_order", dtype=Int64),
    ],
    online=True,   # Serve from online store
    source=customer_stats_source,
)

Using Feast

# Training: Get historical features
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Entity dataframe (what we want features for)
entity_df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "event_timestamp": pd.to_datetime(["2024-01-15"] * 5)
})

# Get training data with point-in-time correctness!
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_stats:total_purchases",
        "customer_stats:avg_order_value",
        "customer_stats:days_since_last_order"
    ]
).to_df()


# Inference: Get online features
feature_vector = store.get_online_features(
    features=[
        "customer_stats:total_purchases",
        "customer_stats:avg_order_value",
    ],
    entity_rows=[{"customer_id": 123}]
).to_dict()

# Returns: {"customer_id": [123], "total_purchases": [47], ...}

ML Pipeline Automation

End-to-End ML Pipeline

AUTOMATED ML PIPELINE:
──────────────────────

┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│  Data   │──►│ Feature │──►│ Model   │──►│ Model   │──►│ Deploy  │
│ Ingest  │   │   Eng   │   │ Train   │   │  Eval   │   │         │
└─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘
     │             │             │             │             │
     ▼             ▼             ▼             ▼             ▼
 Scheduled    Feature       AutoML       Metrics      A/B Test
   Jobs       Store        Trains       Tracked      or Canary

ORCHESTRATION:
  - Airflow / Dagster / Prefect
  - MLflow for experiment tracking
  - Feature store for features
  - Model registry for models
  - CI/CD for deployment

MLflow Integration

MLFLOW EXPERIMENT TRACKING:
───────────────────────────

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("model_type", "LightGBM")
    mlflow.log_param("n_estimators", 100)

    # Train model
    model = train_model(...)

    # Log metrics
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_metric("f1_score", 0.93)

    # Log model
    mlflow.sklearn.log_model(model, "model")


MLFLOW UI shows:
┌────────────────────────────────────────────────────────────┐
│  Run ID    │ Model     │ n_estimators │ Accuracy │ F1     │
├────────────────────────────────────────────────────────────┤
│  abc123    │ LightGBM  │ 100          │ 0.95     │ 0.93   │
│  def456    │ XGBoost   │ 200          │ 0.94     │ 0.92   │
│  ghi789    │ RF        │ 150          │ 0.92     │ 0.90   │
└────────────────────────────────────────────────────────────┘

Easy comparison across experiments!

Practical AutoML Workflow

Step-by-Step AutoML Process

PRODUCTION AUTOML WORKFLOW:
───────────────────────────

1. DATA PREPARATION
   ├── Load data
   ├── Basic cleaning (handle missing, remove duplicates)
   ├── Define target variable
   └── Split train/validation/test

2. QUICK AUTOML RUN (1 hour)
   ├── Use AutoGluon with time_limit=3600
   ├── Get baseline performance
   └── Identify promising models

3. ANALYZE RESULTS
   ├── Check feature importance
   ├── Look for data leakage
   └── Understand model behavior

4. EXTENDED RUN (optional)
   ├── Run AutoGluon with more time
   ├── Try different presets (best_quality vs optimize_for_deployment)
   └── Experiment with hyperparameter ranges

5. MODEL SELECTION
   ├── Compare accuracy vs inference time
   ├── Consider interpretability needs
   └── Check model size for deployment

6. PRODUCTION PREPARATION
   ├── Export best model
   ├── Create feature pipeline
   ├── Set up monitoring
   └── Deploy with gradual rollout

AutoGluon Presets

AUTOGLUON PRESETS:
──────────────────

PRESET: "best_quality"
  - Maximum accuracy
  - Uses all algorithms
  - Deep stacking ensembles
  - Time: 4-8x longer
  - Use for: Final production models

PRESET: "high_quality"
  - Near-optimal accuracy
  - Good ensemble
  - Reasonable time
  - Use for: Most use cases

PRESET: "good_quality"
  - Good accuracy
  - Faster training
  - Use for: Prototyping

PRESET: "medium_quality"
  - Decent accuracy
  - Much faster
  - Use for: Quick baselines

PRESET: "optimize_for_deployment"
  - Single model (no ensemble)
  - Fast inference
  - Smaller model size
  - Use for: Real-time serving

Common Pitfalls and Best Practices

AutoML Pitfalls

COMMON AUTOML MISTAKES:
───────────────────────

1. DATA LEAKAGE
   ───────────────
   Problem: Target information leaks into features

   Example: Predicting "will customer churn?"
   Bad feature: "cancellation_date" (directly reveals answer!)

   Solution: Review feature importance, suspicious features
             that are too predictive are often leaky


2. OVERFITTING TO VALIDATION
   ──────────────────────────
   Problem: Running AutoML many times, picking best

   Each run: Accuracy = 0.91, 0.92, 0.93, 0.94, 0.90...
   Pick best: 0.94!
   Test set:  0.88  (overfit to validation)

   Solution: Hold out a TRUE test set, evaluate once at end


3. IGNORING BUSINESS CONSTRAINTS
   ───────────────────────────────
   Problem: Best model has 200ms latency, need < 10ms

   AutoGluon best model: Stacked ensemble
   Inference time: 200ms

   Solution: Use optimize_for_deployment preset
             or constrain model types


4. FEATURE STORE SKEW
   ────────────────────
   Problem: Training features differ from serving features

   Training: feature_v1 (old transformation)
   Serving:  feature_v2 (new transformation)

   Solution: Use feature store with versioning

Best Practices

AUTOML BEST PRACTICES:
──────────────────────

1. START SIMPLE
   - Run quick AutoML first (30 min - 1 hour)
   - Get baseline before investing more time
   - Understand what's possible

2. FEATURE ENGINEERING STILL MATTERS
   - AutoML optimizes models, not features
   - Domain features often help
   - Combine AutoML + manual feature eng

3. USE PROPER VALIDATION
   - Time-based splits for time series
   - Stratified for imbalanced classes
   - Group splits for related samples

4. MONITOR IN PRODUCTION
   - Track feature distributions
   - Monitor model performance
   - Set up drift detection

5. DOCUMENT EVERYTHING
   - Which AutoML settings?
   - What features used?
   - Business metrics impact?

Did You Know? Google’s AutoML Tables (now part of Vertex AI) automatically applies over 100 different data preprocessing and feature engineering transformations. When tested on the OpenML benchmark suite, it achieved state-of-the-art results on 60% of datasets - often surpassing manually tuned models!

Summary

KEY CONCEPTS RECAP:
───────────────────

AUTOML:
  - Automates algorithm selection + hyperparameter tuning
  - AutoGluon: Best accuracy, multi-layer stacking
  - Use presets based on needs (quality vs speed)

FEATURE STORES:
  - Centralized feature management
  - Point-in-time correctness for training
  - Online/offline serving
  - Feast: Open source, easy to start

AUTOMATED FEATURE ENGINEERING:
  - Aggregations, transformations, time features
  - Featuretools for deep feature synthesis
  - Still combine with domain knowledge

ML PIPELINES:
  - End-to-end automation
  - MLflow for experiment tracking
  - Orchestration (Airflow, Dagster)


WHEN TO USE WHAT:
─────────────────

┌─────────────────────────────────────────────────────────────┐
│  SCENARIO                  │  RECOMMENDATION                │
├─────────────────────────────────────────────────────────────┤
│  Quick baseline            │  AutoGluon, medium_quality     │
│  Production model          │  AutoGluon, best_quality       │
│  Real-time serving         │  AutoGluon, optimize_for_deployment│
│  Feature reuse             │  Feast feature store           │
│  Relational data           │  Featuretools + AutoML         │
│  Team collaboration        │  Feature store + MLflow        │
└─────────────────────────────────────────────────────────────┘

Production War Stories: AutoML and Feature Store Lessons

The $12 Million Feature Leak

Singapore. March 2022. Fintech startup CredFlow.

The data science team was celebrating. Their credit scoring model, built with AutoGluon in just two days, achieved 0.94 AUC—far better than their previous hand-built model at 0.78. The model went into production, and loan approvals skyrocketed.

Three months later, the collections team noticed something disturbing: default rates had tripled. Loans approved by the new model were failing at unprecedented rates.

The forensic analysis revealed the horror: One feature—days_until_first_payment—had an importance score of 0.42. This feature was calculated from a field populated after the loan was already approved. Customers with short payment windows (who’d been approved quickly) showed different patterns than those with longer windows.

The AutoML system had found a perfect predictor: a feature that leaked the outcome. It’s like trying to predict who will win a race by looking at the finish photo—technically accurate, but useless for making predictions before the race.

Financial impact: $12.3 million in bad loans before detection. The CRO was fired. The startup nearly collapsed.

The fix implemented:

# Before: Feature calculated whenever
days_until_first_payment = payment_date - approval_date

# After: Strict point-in-time feature validation
def validate_feature_timing(feature_name, feature_timestamp, prediction_timestamp):
    """Ensure feature was available BEFORE prediction was needed."""
    if feature_timestamp > prediction_timestamp:
        raise DataLeakageError(
            f"Feature '{feature_name}' has timestamp {feature_timestamp} "
            f"but prediction was needed at {prediction_timestamp}. "
            f"This is data leakage!"
        )

Lesson: AutoML will find ANY signal, including signals from the future. Point-in-time validation isn’t optional—it’s essential.

Did You Know? A 2023 survey of ML practitioners found that 43% had experienced data leakage in production models. The median time to detect leakage was 47 days. Feature stores with built-in point-in-time correctness reduce leakage incidents by 87%.

The Feature Store That Saved Black Friday

Seattle. November 2021. Major e-commerce retailer.

Black Friday was approaching. The ML platform team was nervous. Last year, their recommendation system had crashed under load, costing an estimated $8 million in lost sales. The problem? The feature computation pipeline couldn’t keep up with 50,000 requests per second.

This year, they’d implemented Feast as their feature store. The architecture was different:

LAST YEAR (crashed):
────────────────────
Request → Compute Features On-Demand → Model → Response
          └── SQL query per request
          └── 200ms latency
          └── Can't scale past 500 RPS

THIS YEAR (Feast):
─────────────────
Pre-computed features → Redis (Online Store)
Request → Redis lookup (1ms) → Model → Response
          └── 50,000+ RPS
          └── 5ms total latency

On Black Friday, traffic hit 72,000 requests per second. The system didn’t flinch. Latency stayed under 10ms. Revenue increased 34% year-over-year.

The key insight: Feature computation is the bottleneck, not model inference. Pre-computing features and serving from an online store changed everything.

Financial impact: Black Friday revenue increased by $47 million. Feature store implementation cost: $300K.

The AutoML Model That Discriminated

Chicago. June 2023. Insurance company HealthFirst.

The compliance team flagged an anomaly: claim denials were 23% higher for customers in certain ZIP codes—ZIP codes that happened to correlate strongly with minority populations.

Investigation revealed the AutoML system had discovered a highly predictive feature: zip_code_health_score, which was derived from historical claim data. The problem? Historical claim data reflected decades of discriminatory practices. The model wasn’t being racist on purpose—it was faithfully learning patterns that encoded institutional racism.

The team’s response:

Removed proxy features: ZIP code, neighborhood, and any feature that correlated >0.3 with protected demographics
Added fairness constraints: Ensured prediction rates were within 5% across demographic groups
Implemented explainability: Required human review for any denial with unusual feature weights

# Fairness-aware AutoML configuration
from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(
    label='claim_approved',
    eval_metric='roc_auc'
).fit(
    train_data,
    # Add fairness constraint
    hyperparameters={
        'GBM': {
            'constraint_type': 'demographic_parity',
            'fairness_target': 'race_proxy',
            'fairness_threshold': 0.05
        }
    }
)

Regulatory outcome: The company avoided a discrimination lawsuit by self-reporting and fixing the issue. Estimated savings: $15-20 million in legal fees and settlements.

Lesson: AutoML optimizes what you tell it to optimize. If you only optimize for accuracy, it will happily learn discriminatory patterns. Fairness must be an explicit constraint.

Did You Know? Amazon famously scrapped an AI recruiting tool in 2018 after discovering it systematically downgraded women’s resumes. The model, trained on 10 years of hiring data, learned that Amazon had historically hired mostly men—and therefore preferred male candidates. This incident led to an industry-wide push for fairness-aware ML.

Common Mistakes and How to Avoid Them

Mistake 1: Trusting AutoML Feature Importance Blindly

Wrong:

# AutoML found these are the most important features
# Great, let's use them!
top_features = model.feature_importance()[:10]
production_model = train_on(data[top_features])  #  Dangerous!

Problem: Feature importance from AutoML can be misleading. A leaky feature will show high importance. A feature that’s important for one model type might be useless for another.

Right:

def validate_feature_importance(feature_name, importance_score, data):
    """Sanity check for suspiciously important features."""

    # Check for data leakage
    if importance_score > 0.3:  # Suspiciously high
        print(f"️ WARNING: {feature_name} has importance {importance_score}")
        print("Checking for potential leakage...")

        # Check if feature correlates with target timing
        correlation_with_target = data[feature_name].corr(data['target'])
        if abs(correlation_with_target) > 0.8:
            raise LeakageWarning(
                f"{feature_name} has {correlation_with_target:.2f} correlation "
                f"with target. Likely data leakage!"
            )

    # Check if feature is available at prediction time
    if not is_available_at_prediction_time(feature_name):
        raise LeakageWarning(
            f"{feature_name} is not available at prediction time!"
        )

    return True

Mistake 2: Not Setting Time Limits on AutoML

Wrong:

# "Just let it run until it's done"
predictor = TabularPredictor(label='target').fit(train_data)
# 3 days later: still running, $2,000 in cloud costs

Problem: AutoML will happily run forever, trying more and more models. Without time limits, you waste compute and money.

Right:

# Always set explicit time limits
predictor = TabularPredictor(
    label='target',
    eval_metric='roc_auc'
).fit(
    train_data,
    time_limit=3600,  # 1 hour max
    presets='best_quality',  # Will do its best within time limit
    # AutoGluon automatically prioritizes promising models
)

Mistake 3: Ignoring Training-Serving Skew

Wrong:

# Training time
training_features = compute_features_from_warehouse(training_data)
model.fit(training_features, labels)

# Serving time (different code path!)
serving_features = compute_features_from_api(request_data)  #  Different!
prediction = model.predict(serving_features)

Problem: Subtle differences in feature computation between training and serving cause silent model degradation. Think of it as using different thermometers that are calibrated differently—your predictions will be systematically off.

Right:

# Use feature store for BOTH training and serving
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Training time
training_features = store.get_historical_features(
    entity_df=training_entities,
    features=['customer:total_purchases', 'customer:avg_order_value']
).to_df()

# Serving time (SAME feature definitions!)
serving_features = store.get_online_features(
    features=['customer:total_purchases', 'customer:avg_order_value'],
    entity_rows=[{"customer_id": request.customer_id}]
).to_dict()

# Features are guaranteed to be computed identically

Mistake 4: Not Versioning Features

Wrong:

# Features.py - Modified directly in production
customer_value = total_purchases * avg_order_value  # Changed from sum to product
# Now training data has old definition, production has new...

Problem: Changing feature definitions without versioning creates chaos. Models trained on v1 features serving with v2 features will produce garbage.

Right:

# features_v2.py - Explicit versioning
class CustomerValueV2(FeatureView):
    """
    Customer lifetime value calculation.

    v1 -> v2 changes:
    - Changed from sum to product formula
    - Added recency weighting
    - Breaking change: requires model retraining

    Migration: Models must be retrained before using v2
    """
    name = "customer_value_v2"
    version = "2.0.0"
    deprecates = "customer_value_v1"  # Mark old version
    requires_retraining = True

    def compute(self, data):
        return (data.total_purchases * data.avg_order_value *
                self.recency_weight(data.days_since_last_order))

Mistake 5: Using AutoML for Everything

Wrong:

# "AutoML is magic, let's use it everywhere!"
image_model = AutoGluon.fit(image_data)  # Works but suboptimal
text_model = AutoGluon.fit(text_data)    # Works but suboptimal
time_series = AutoGluon.fit(ts_data)      # Works but suboptimal

Problem: AutoML excels at tabular data. For images, text, and time series, specialized approaches usually win.

Right:

WHEN TO USE AUTOML vs SPECIALIZED TOOLS:
────────────────────────────────────────

Data Type    │ AutoML Good? │ Better Alternative
─────────────┼──────────────┼────────────────────────
Tabular      │  Excellent │ N/A - AutoML is best
Images       │ ️ OK       │ Transfer learning (ResNet, ViT)
Text         │ ️ OK       │ Fine-tuned LLMs, BERT
Time Series  │ ️ OK       │ Prophet, NeuralProphet, DeepAR
Graph        │  Poor     │ PyTorch Geometric, DGL
Audio        │  Poor     │ Whisper, wav2vec

Economics of AutoML and Feature Stores

AutoML ROI Calculation

Scenario	Manual ML	AutoML	Savings
Initial Model Development
Data scientist time	4 weeks ($40K)	1 week ($10K)	$30K
Compute costs	$500	$200	$300
Time to production	6 weeks	2 weeks	4 weeks faster
Ongoing Maintenance
Monthly retraining time	2 days ($4K)	4 hours ($500)	$3.5K/month
Model iteration cost	$10K per iteration	$2K per iteration	$8K/iteration
Annual Savings
Initial + 12 months maintenance	$88K	$16K	$72K/year

Feature Store ROI Calculation

Metric	Without Feature Store	With Feature Store
Feature development time	2 weeks per feature	2 days per feature (10x faster with reuse)
Feature duplication	5x average (same feature built 5 times)	1x (single source of truth)
Training-serving skew incidents	3 per quarter	0.1 per quarter (30x reduction)
Revenue lost to skew	$500K/year	$17K/year
Engineer time on debugging	20%	5%
Total Annual Impact		$800K-1.2M savings

Cost of NOT Using These Tools

REAL COSTS OF MANUAL ML:
────────────────────────

┌────────────────────────────────────────────────────────────┐
│  Problem                      │  Typical Cost             │
├────────────────────────────────────────────────────────────┤
│  Data leakage in production   │  $1M-10M (depending on    │
│                               │  time to detect)          │
├────────────────────────────────────────────────────────────┤
│  Training-serving skew        │  $100K-500K per incident  │
│  (silent model degradation)   │  (lost revenue + debug)   │
├────────────────────────────────────────────────────────────┤
│  Duplicate feature work       │  $50K-200K/year           │
│  (multiple teams building     │  (wasted engineer time)   │
│  same features)               │                           │
├────────────────────────────────────────────────────────────┤
│  Slow model iteration         │  $500K-2M/year            │
│  (opportunity cost of delayed │  (competitors move        │
│  improvements)                │  faster)                  │
└────────────────────────────────────────────────────────────┘

Did You Know? According to a 2023 MLOps survey, companies using feature stores report 73% faster time-to-production for new ML models. The median ROI for feature store implementations is 340% over 3 years, with payback periods under 9 months.

Interview Preparation: AutoML & Feature Stores

Q1: “When would you use AutoML vs. hand-crafted models?”

Strong Answer: “I use AutoML in three main scenarios. First, for establishing baselines quickly—before investing weeks in manual model development, I run AutoML to understand what’s achievable. If AutoML gets 0.75 AUC, I know my hand-crafted model should aim for at least 0.78 to justify the extra effort.

Second, for tabular data problems with clear evaluation metrics—this is AutoML’s sweet spot. In my experience, AutoGluon matches or beats hand-crafted gradient boosting models 80% of the time with a fraction of the effort.

Third, when the ML isn’t the core differentiator—if we’re building a feature where ML is a small component, I’d rather spend engineering time on the product, not model tuning.

I wouldn’t use AutoML when I need specific architectures like transformers for NLP, when there are strict latency requirements that need optimized single models, or when interpretability is critical and I need to explain every decision.”

Q2: “Explain point-in-time correctness in feature stores.”

Strong Answer: “Point-in-time correctness ensures that when training a model, we only use feature values that were available at the time the prediction would have been made. It prevents data leakage from the future.

Imagine training a model to predict customer churn. If a customer churned on March 15, their training features should reflect their state on March 14 or earlier—not their state after they churned. Without point-in-time correctness, we might accidentally include features like ‘days_since_last_login’ that jumped to 30+ after they stopped using the product—a clear signal they’ve churned that wouldn’t be available when making a real prediction.

Feature stores implement this by maintaining timestamped feature values and performing point-in-time joins. When you request historical features for training, you provide entity timestamps, and the feature store returns the most recent feature value that existed before each timestamp.

This is critical because models trained with leaked features show fantastic offline metrics but fail dramatically in production—a pattern I’ve seen called ‘suspiciously good AUC syndrome.’”

Q3: “How does multi-layer stacking work in AutoGluon?”

Strong Answer: “Multi-layer stacking is AutoGluon’s ensemble technique that significantly outperforms simple averaging or voting.

In the first layer, AutoGluon trains diverse base models—gradient boosting (LightGBM, XGBoost, CatBoost), neural networks, and linear models. Each model makes predictions on out-of-fold validation data to avoid leakage.

These first-layer predictions become features for the second layer, which trains new models to combine them. Think of it as learning ‘when to trust which model.’ If LightGBM is great on numerical features but weak on categoricals, while CatBoost is the opposite, the second layer learns to weight them appropriately based on the input.

AutoGluon can stack multiple layers—typically 2-3 work best before diminishing returns.

The key insight is that this outperforms simple ensembling because it learns non-linear combinations of model predictions. A weighted average says ‘LightGBM gets 40% weight.’ Multi-layer stacking says ‘LightGBM gets 80% weight when feature X is high, but only 20% when feature Y is low.’

In practice, I’ve seen multi-layer stacking improve AUC by 1-3% over simple ensembles—which can translate to millions in revenue for high-stakes predictions.”

Q4: “What’s the difference between online and offline feature stores?”

Strong Answer: “They serve different use cases with different latency and storage requirements.

The offline store is optimized for training workloads. It stores historical feature values with timestamps, typically in data warehouses like BigQuery, Snowflake, or object storage like S3. Latency is seconds to minutes, but it can handle huge volumes—millions of feature vectors. I use it when creating training datasets with point-in-time correctness.

The online store is optimized for inference. It stores only the latest feature values in low-latency databases like Redis, DynamoDB, or Bigtable. Latency is single-digit milliseconds. I use it when serving predictions in real-time.

The key architectural insight is that they share the same feature definitions but different storage backends. The feature store materializes features from the offline store to the online store periodically—typically every few minutes to daily, depending on freshness requirements.

This dual architecture solves the training-serving skew problem. My training code uses get_historical_features from the offline store. My serving code uses get_online_features from the online store. Both use identical feature definitions, just different storage optimized for their use case.”

System Design: Design an ML Platform with AutoML and Feature Store

Prompt: “Design an ML platform for a fintech company that needs to make real-time credit decisions at 10,000 requests per second.”

Strong Answer:

“I’d design this with five key components:

1. Feature Store Architecture:

Online Store: Redis Cluster (6 nodes)
  - Sharded by customer_id
  - 500K features, 1ms p99 latency
  - TTL: 24 hours, refreshed hourly

Offline Store: BigQuery
  - Historical features for training
  - 2 years of feature history
  - Point-in-time query support

Feature Computation: Spark on Dataproc
  - Batch features: daily runs at 2 AM
  - Real-time features: Kafka Streams
  - Features: credit_score, payment_history,
    debt_to_income, account_age, etc.

2. AutoML Pipeline:

# Monthly retraining pipeline
def train_credit_model():
    # Get training data with point-in-time correctness
    training_data = feature_store.get_historical_features(
        entity_df=approved_applications_last_6_months,
        features=CREDIT_FEATURES,
        label='defaulted_within_90_days'
    )

    # AutoML with fairness constraints
    predictor = TabularPredictor(
        label='defaulted',
        eval_metric='roc_auc'
    ).fit(
        training_data,
        time_limit=14400,  # 4 hours
        presets='optimize_for_deployment',  # Single model for latency
        excluded_model_types=['NN']  # Neural nets too slow
    )

    # Validate fairness before promotion
    if passes_fairness_audit(predictor):
        deploy_to_production(predictor)

3. Serving Architecture for 10K RPS:

Load Balancer (GCP GLB)
        │
        ├── Kubernetes Cluster (GKE)
        │   └── Model Serving Pods (50 replicas)
        │       - CPU-optimized (LightGBM)
        │       - 10ms p99 latency per request
        │       - gRPC for low overhead
        │
        └── Feature Store (Redis)
            - 1ms feature fetch
            - Pre-computed features only

4. Monitoring & Safety:

Feature drift detection: alert if distributions shift >10%
Model performance monitoring: daily AUC calculation on holdout
Fairness monitoring: automated demographic parity checks
Circuit breaker: fall back to rules-based model if ML fails

5. Cost Estimate:

Feature Store (Redis): $15K/month
Compute (GKE): $25K/month
BigQuery: $5K/month
Total: $45K/month = $540K/year

Expected value: At 10K RPS, serving 864M decisions/day. Even 0.1% improvement in precision saves millions in bad debt.

This architecture handles 10K RPS with 15ms end-to-end latency while maintaining feature consistency and enabling rapid model iteration through AutoML.”

Hands-On Exercises

Exercise 1: AutoML Baseline Challenge

Build an AutoML baseline and compare it to a hand-crafted model:

"""
AutoML vs Manual Model Comparison

Dataset: Credit Card Fraud Detection (Kaggle)
Goal: Compare development time and accuracy
"""
import pandas as pd
from autogluon.tabular import TabularPredictor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import time

# Load data
data = pd.read_csv('creditcard.csv')
X = data.drop('Class', axis=1)
y = data['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# MANUAL APPROACH
print("=" * 50)
print("MANUAL RANDOM FOREST")
print("=" * 50)
manual_start = time.time()

rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
rf_predictions = rf.predict_proba(X_test)[:, 1]
rf_auc = roc_auc_score(y_test, rf_predictions)

manual_time = time.time() - manual_start
print(f"Time: {manual_time:.1f}s")
print(f"AUC: {rf_auc:.4f}")

# AUTOML APPROACH
print("\n" + "=" * 50)
print("AUTOGLUON")
print("=" * 50)
automl_start = time.time()

# Prepare data for AutoGluon
train_df = X_train.copy()
train_df['Class'] = y_train.values

predictor = TabularPredictor(
    label='Class',
    eval_metric='roc_auc',
    verbosity=1
).fit(
    train_df,
    time_limit=300,  # 5 minutes
    presets='medium_quality'
)

test_df = X_test.copy()
automl_predictions = predictor.predict_proba(test_df)
automl_auc = roc_auc_score(y_test, automl_predictions.iloc[:, 1])

automl_time = time.time() - automl_start
print(f"Time: {automl_time:.1f}s")
print(f"AUC: {automl_auc:.4f}")

# COMPARISON
print("\n" + "=" * 50)
print("COMPARISON")
print("=" * 50)
print(f"Manual RF:  {rf_auc:.4f} AUC in {manual_time:.1f}s")
print(f"AutoGluon:  {automl_auc:.4f} AUC in {automl_time:.1f}s")
print(f"Improvement: {(automl_auc - rf_auc)*100:.2f} percentage points")

# See what AutoGluon tried
print("\n" + "=" * 50)
print("AUTOGLUON LEADERBOARD")
print("=" * 50)
print(predictor.leaderboard())

Expected Learning: AutoGluon typically achieves 1-3% higher AUC than a basic Random Forest, demonstrating the value of automated algorithm selection and ensembling.

Exercise 2: Feature Store Implementation

Build a simple feature store with point-in-time correctness:

"""
Simple Feature Store Implementation

This exercise teaches the core concepts of feature stores:
- Point-in-time correctness
- Online vs offline serving
- Feature versioning
"""
import pandas as pd
from datetime import datetime, timedelta
from typing import Dict, List, Optional
import json
import redis  # For online store

class SimpleFeatureStore:
    """
    A minimal feature store demonstrating core concepts.

    Production feature stores like Feast add:
    - Distributed storage
    - Automatic materialization
    - Schema validation
    - Access control
    """

    def __init__(self):
        # Offline store: Historical features with timestamps
        self.offline_store: Dict[str, pd.DataFrame] = {}

        # Online store: Latest features only (simulated with dict)
        self.online_store: Dict[str, Dict] = {}

        # Feature registry: Metadata about features
        self.registry: Dict[str, dict] = {}

    def register_feature(
        self,
        name: str,
        entity: str,
        description: str,
        version: str = "1.0.0"
    ):
        """Register a feature in the registry."""
        self.registry[name] = {
            'entity': entity,
            'description': description,
            'version': version,
            'created_at': datetime.now().isoformat()
        }
        print(f" Registered feature: {name} v{version}")

    def write_features(
        self,
        feature_name: str,
        data: pd.DataFrame,
        timestamp_col: str = 'event_timestamp'
    ):
        """Write features to both offline and online stores."""
        if feature_name not in self.registry:
            raise ValueError(f"Feature {feature_name} not registered!")

        # Write to offline store (full history)
        if feature_name not in self.offline_store:
            self.offline_store[feature_name] = data.copy()
        else:
            self.offline_store[feature_name] = pd.concat([
                self.offline_store[feature_name],
                data
            ]).drop_duplicates()

        # Write to online store (latest values only)
        entity_col = self.registry[feature_name]['entity']
        for _, row in data.iterrows():
            entity_id = str(row[entity_col])
            self.online_store[f"{feature_name}:{entity_id}"] = row.to_dict()

        print(f" Wrote {len(data)} rows to {feature_name}")

    def get_historical_features(
        self,
        feature_name: str,
        entity_df: pd.DataFrame,
        timestamp_col: str = 'event_timestamp'
    ) -> pd.DataFrame:
        """
        Get historical features with point-in-time correctness.

        This is the CRITICAL function that prevents data leakage.
        """
        if feature_name not in self.offline_store:
            raise ValueError(f"Feature {feature_name} not in offline store!")

        feature_data = self.offline_store[feature_name]
        entity_col = self.registry[feature_name]['entity']

        results = []
        for _, entity_row in entity_df.iterrows():
            entity_id = entity_row[entity_col]
            query_time = entity_row[timestamp_col]

            # Point-in-time filter: only use features from BEFORE query time
            valid_features = feature_data[
                (feature_data[entity_col] == entity_id) &
                (feature_data[timestamp_col] <= query_time)
            ]

            if len(valid_features) > 0:
                # Get the most recent valid feature
                latest = valid_features.sort_values(timestamp_col).iloc[-1]
                results.append(latest.to_dict())
            else:
                # No valid features - use nulls
                results.append({entity_col: entity_id})

        return pd.DataFrame(results)

    def get_online_features(
        self,
        feature_name: str,
        entity_ids: List[str]
    ) -> List[Dict]:
        """Get latest features for real-time inference."""
        results = []
        for entity_id in entity_ids:
            key = f"{feature_name}:{entity_id}"
            if key in self.online_store:
                results.append(self.online_store[key])
            else:
                results.append({'error': 'not_found'})
        return results


# Demo usage
if __name__ == "__main__":
    store = SimpleFeatureStore()

    # Register features
    store.register_feature(
        name='customer_stats',
        entity='customer_id',
        description='Aggregated customer purchase statistics'
    )

    # Create some historical feature data
    feature_data = pd.DataFrame({
        'customer_id': [1, 1, 1, 2, 2],
        'total_purchases': [10, 15, 20, 5, 8],
        'avg_order_value': [50.0, 52.0, 55.0, 30.0, 35.0],
        'event_timestamp': pd.to_datetime([
            '2024-01-01', '2024-02-01', '2024-03-01',
            '2024-01-15', '2024-02-15'
        ])
    })

    store.write_features('customer_stats', feature_data)

    # Point-in-time retrieval for training
    training_entities = pd.DataFrame({
        'customer_id': [1, 1, 2],
        'event_timestamp': pd.to_datetime([
            '2024-01-15',  # Should get Jan 1 features
            '2024-02-15',  # Should get Feb 1 features
            '2024-02-01'   # Should get Jan 15 features
        ])
    })

    historical = store.get_historical_features(
        'customer_stats',
        training_entities
    )
    print("\n Historical Features (point-in-time correct):")
    print(historical)

    # Online retrieval for inference
    online = store.get_online_features('customer_stats', ['1', '2'])
    print("\n Online Features (latest values):")
    for f in online:
        print(f)

Expected Learning: Understanding how point-in-time correctness prevents data leakage and why online/offline stores serve different purposes.

Exercise 3: Data Leakage Detection

Build a tool to detect potential data leakage in AutoML results:

"""
Data Leakage Detection Tool

Detects common patterns that indicate data leakage in AutoML models.
"""
import pandas as pd
import numpy as np
from typing import List, Tuple

class LeakageDetector:
    """Detects potential data leakage in ML features."""

    def __init__(self, model, X: pd.DataFrame, y: pd.Series):
        self.model = model
        self.X = X
        self.y = y
        self.warnings = []

    def check_feature_importance(
        self,
        importance_threshold: float = 0.3
    ) -> List[Tuple[str, float, str]]:
        """
        Flag features with suspiciously high importance.

        Leaky features often have unusually high importance because
        they directly encode the target.
        """
        try:
            importances = self.model.feature_importance()
        except AttributeError:
            # For models without built-in feature importance
            return []

        suspicious = []
        for feature, importance in importances.items():
            if importance > importance_threshold:
                suspicious.append((
                    feature,
                    importance,
                    f"️ Unusually high importance ({importance:.3f}). "
                    f"Check for data leakage!"
                ))
                self.warnings.append(f"HIGH_IMPORTANCE: {feature}")

        return suspicious

    def check_correlation_with_target(
        self,
        correlation_threshold: float = 0.9
    ) -> List[Tuple[str, float, str]]:
        """
        Flag features with very high correlation to target.

        Perfect or near-perfect correlation often indicates leakage.
        """
        suspicious = []
        for col in self.X.columns:
            if self.X[col].dtype in ['int64', 'float64']:
                corr = abs(self.X[col].corr(self.y))
                if corr > correlation_threshold:
                    suspicious.append((
                        col,
                        corr,
                        f"️ Very high correlation ({corr:.3f}). "
                        f"Likely data leakage!"
                    ))
                    self.warnings.append(f"HIGH_CORRELATION: {col}")

        return suspicious

    def check_perfect_prediction_features(self) -> List[str]:
        """
        Flag features that perfectly predict the target alone.

        If a single feature achieves >99% accuracy, it's usually leakage.
        """
        suspicious = []
        for col in self.X.columns:
            if self.X[col].nunique() < 100:  # Categorical-ish
                # Check if any value perfectly predicts target
                for value in self.X[col].unique():
                    mask = self.X[col] == value
                    if mask.sum() > 10:  # Enough samples
                        target_values = self.y[mask].unique()
                        if len(target_values) == 1:
                            suspicious.append(col)
                            self.warnings.append(
                                f"PERFECT_PREDICTOR: {col}={value} -> {target_values[0]}"
                            )
                            break

        return suspicious

    def check_temporal_leakage(
        self,
        date_columns: List[str],
        target_date_column: str = None
    ) -> List[str]:
        """
        Flag features that might be computed from future data.
        """
        suspicious = []

        # Check if any features have dates AFTER the target date
        if target_date_column and target_date_column in self.X.columns:
            target_date = pd.to_datetime(self.X[target_date_column])
            for col in date_columns:
                if col in self.X.columns and col != target_date_column:
                    feature_date = pd.to_datetime(self.X[col])
                    future_rows = (feature_date > target_date).sum()
                    if future_rows > 0:
                        suspicious.append(col)
                        self.warnings.append(
                            f"TEMPORAL_LEAKAGE: {col} has {future_rows} "
                            f"rows with dates after target"
                        )

        return suspicious

    def generate_report(self) -> str:
        """Generate a comprehensive leakage report."""
        report = []
        report.append("=" * 60)
        report.append("DATA LEAKAGE DETECTION REPORT")
        report.append("=" * 60)

        # Run all checks
        importance_issues = self.check_feature_importance()
        correlation_issues = self.check_correlation_with_target()
        perfect_predictors = self.check_perfect_prediction_features()

        # Format report
        if importance_issues:
            report.append("\n HIGH IMPORTANCE FEATURES:")
            for feature, importance, msg in importance_issues:
                report.append(f"  - {feature}: {msg}")

        if correlation_issues:
            report.append("\n HIGH CORRELATION FEATURES:")
            for feature, corr, msg in correlation_issues:
                report.append(f"  - {feature}: {msg}")

        if perfect_predictors:
            report.append("\n PERFECT PREDICTOR FEATURES:")
            for feature in perfect_predictors:
                report.append(f"  - {feature}")

        if not any([importance_issues, correlation_issues, perfect_predictors]):
            report.append("\n No obvious leakage detected")
            report.append("   (Note: Some leakage types require domain knowledge to detect)")

        report.append("\n" + "=" * 60)
        report.append(f"Total warnings: {len(self.warnings)}")

        return "\n".join(report)


# Usage example
if __name__ == "__main__":
    # Create sample data with intentional leakage
    np.random.seed(42)
    n = 1000

    # Normal features
    X = pd.DataFrame({
        'age': np.random.randint(18, 80, n),
        'income': np.random.normal(50000, 20000, n),
        'credit_score': np.random.randint(300, 850, n),
    })

    # Target: will customer default?
    y = pd.Series(np.random.binomial(1, 0.2, n))

    # Add LEAKY feature (computed from outcome!)
    # This simulates a feature that includes future information
    X['days_until_default'] = np.where(y == 1, np.random.randint(1, 90, n), -1)

    # This feature is suspicious - high correlation
    X['default_indicator'] = y * 0.99 + np.random.normal(0, 0.01, n)

    # Create mock model
    class MockModel:
        def feature_importance(self):
            return {
                'age': 0.05,
                'income': 0.08,
                'credit_score': 0.12,
                'days_until_default': 0.45,  # Suspiciously high!
                'default_indicator': 0.30
            }

    # Run detection
    detector = LeakageDetector(MockModel(), X, y)
    print(detector.generate_report())

Expected Learning: Understanding common leakage patterns and how to systematically detect them before they cause production failures.

Key Takeaways

AutoML is a force multiplier, not a replacement — It doesn’t replace ML engineers; it amplifies their productivity by automating tedious parts (algorithm selection, hyperparameter tuning) so they can focus on harder problems.
AutoGluon wins on tabular data — For structured data problems, AutoGluon’s multi-layer stacking consistently achieves state-of-the-art results. Start here before hand-crafting models.
Feature stores solve training-serving skew — By using the same feature definitions for training and serving, you eliminate a major source of production ML failures.
Point-in-time correctness is non-negotiable — Data leakage from future information is the silent killer of ML models. Feature stores with timestamp-aware joins prevent this.
Feature reuse compounds over time — Every feature you add to the store can be used by multiple models. After a year, you have a powerful feature library that accelerates all new projects.
Set time limits on AutoML — Without constraints, AutoML will run forever. Always specify time_limit and use appropriate presets for your use case.
Validate AutoML feature importance — Suspiciously high importance scores often indicate data leakage. Always sanity-check before trusting AutoML’s discoveries.
Online vs offline stores serve different needs — Offline for training (historical, high volume), online for serving (latest values, low latency). Both share feature definitions.
AutoML presets matter — ‘best_quality’ for maximum accuracy, ‘optimize_for_deployment’ for production serving. Choose based on your constraints.
The economics are compelling — Feature stores and AutoML typically deliver 300%+ ROI through faster development, reduced errors, and engineer time savings.

Next Steps

You’ve completed Phase 8: Classical ML! You now understand:

Gradient boosting (XGBoost, LightGBM)
Time series forecasting (ARIMA, Prophet)
AutoML and feature stores

Up Next: Phase 9 - AI Safety & Evaluation

Module 39 Complete! “AutoML doesn’t replace ML engineers - it multiplies their productivity.”

Time Series Forecasting

The Intern Who Beat the Team

What You’ll Be Able to Do

The AutoML Revolution

Why AutoML Matters

The ML Expertise Gap

When to Use AutoML

AutoML Landscape

Major AutoML Frameworks

AutoGluon Deep Dive

AutoML Under the Hood

Algorithm Selection

Hyperparameter Optimization

Automated Feature Engineering

Why This Module Matters

Automated Feature Engineering Techniques

Featuretools: Deep Feature Synthesis

Feature Stores

What is a Feature Store?

Feature Store Architecture

Point-in-Time Correctness

Feast: Open Source Feature Store

Feast Architecture

Defining Features in Feast

Using Feast

ML Pipeline Automation

End-to-End ML Pipeline

MLflow Integration

Practical AutoML Workflow

Step-by-Step AutoML Process

AutoGluon Presets

Common Pitfalls and Best Practices

AutoML Pitfalls

Best Practices

Summary

Production War Stories: AutoML and Feature Store Lessons

The $12 Million Feature Leak

The Feature Store That Saved Black Friday

The AutoML Model That Discriminated

Common Mistakes and How to Avoid Them

Mistake 1: Trusting AutoML Feature Importance Blindly

Mistake 2: Not Setting Time Limits on AutoML

Mistake 3: Ignoring Training-Serving Skew

Mistake 4: Not Versioning Features

Mistake 5: Using AutoML for Everything

Economics of AutoML and Feature Stores

AutoML ROI Calculation

Feature Store ROI Calculation

Cost of NOT Using These Tools

Interview Preparation: AutoML & Feature Stores

Q1: “When would you use AutoML vs. hand-crafted models?”

Q2: “Explain point-in-time correctness in feature stores.”

Q3: “How does multi-layer stacking work in AutoGluon?”

Q4: “What’s the difference between online and offline feature stores?”

System Design: Design an ML Platform with AutoML and Feature Store

Hands-On Exercises

Exercise 1: AutoML Baseline Challenge

Exercise 2: Feature Store Implementation

Exercise 3: Data Leakage Detection

Key Takeaways

Further Reading

Tools

Papers

Next Steps