Time Series Forecasting
Цей контент ще не доступний вашою мовою.
AI/ML Engineering Track | Complexity:
[COMPLEX]| Time: 6-8
The Intern Who Beat the Team
Section titled “The Intern Who Beat the Team”Seattle. August 12, 2020. 10:15 AM.
Sarah Chen was three weeks into her data science internship when she made the senior engineers uncomfortable.
The team had spent six months building a churn prediction model. Feature engineering alone took three engineers two months. Hyperparameter tuning consumed another month. The final model achieved 0.847 AUC—a number the team celebrated with champagne.
Sarah asked a naive question: “Could I try AutoGluon on the same data?”
The senior engineers exchanged knowing glances. “Sure, but don’t expect it to beat a model built by experienced engineers.”
Four hours later, Sarah walked into the standup meeting. Her AutoGluon model achieved 0.863 AUC—almost two percentage points higher than the hand-crafted solution. The room went silent.
“How is that possible?” asked the team lead.
“It tried 15 different algorithms, ensembled the top 7, and did multi-layer stacking,” Sarah replied, reading from AutoGluon’s leaderboard. “It also found a feature interaction our model missed.”
The team’s six months of work had been outperformed by an intern with four hours and an AutoML library.
“AutoML doesn’t make data scientists obsolete—it makes their time more valuable. Now you can spend six months on problems that actually need human creativity.” — Nick Erickson, Lead Developer of AutoGluon, NeurIPS Workshop 2020
This module teaches you how to use AutoML tools effectively—and why they’re not magic, but a powerful force multiplier for any ML practitioner.
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”By the end of this module, you will:
- Understand AutoML and when to use it
- Master AutoGluon for automated machine learning
- Learn feature store concepts and architecture
- Implement automated feature engineering
- Build end-to-end ML pipelines
The AutoML Revolution
Section titled “The AutoML Revolution”Imagine you’re a chef at a restaurant. Traditional ML is like cooking everything from scratch - you select ingredients (features), decide cooking methods (algorithms), adjust seasoning (hyperparameters), and taste-test constantly (validation). It requires years of expertise.
AutoML is like having a robot chef that tries hundreds of recipes automatically, learns from each attempt, and presents you with the best dish. You just need to provide the ingredients and describe what you want.
TRADITIONAL ML WORKFLOW:────────────────────────Data → [Manual Feature Eng] → [Choose Algorithm] → [Tune Hyperparameters] → Model ↑ ↑ ↑ │ │ │ Requires expertise Requires expertise Takes days/weeks │ │ │ └─────────────────────────┴──────────────────────┘ TIME: Days to Weeks
AUTOML WORKFLOW:────────────────Data → [AutoML System] → Best Model │ ├── Tries 100+ algorithms ├── Engineers features automatically ├── Tunes hyperparameters └── Ensembles best models
TIME: Minutes to HoursDid You Know? In the 2019 AutoML Challenge, AutoGluon achieved top-3 results on 39 out of 39 datasets, often beating solutions that took ML experts weeks to develop. The total computation time? About 4 hours per dataset with no human intervention!
Why AutoML Matters
Section titled “Why AutoML Matters”The ML Expertise Gap
Section titled “The ML Expertise Gap”REALITY OF ML IN INDUSTRY:──────────────────────────
Companies with ML needs: ████████████████████ 1,000,000+Companies with ML experts: ███ ~50,000Expert ML engineers: █ ~300,000
Gap: 95%+ of companies can't hire ML experts!
AutoML bridges this gap:- Democratizes ML (anyone can use it)- Accelerates expert productivity (10x faster)- Establishes strong baselines automaticallyWhen to Use AutoML
Section titled “When to Use AutoML”USE AUTOML:─────────── Establishing baselines quickly Tabular data problems Time-constrained projects Non-expert teams Comparing many algorithms Hyperparameter optimization
DON'T USE AUTOML (alone):───────────────────────── Custom architectures needed Domain-specific constraints Real-time inference requirements Highly specialized problems When interpretability is criticalAutoML Landscape
Section titled “AutoML Landscape”Major AutoML Frameworks
Section titled “Major AutoML Frameworks”AUTOML FRAMEWORK COMPARISON:────────────────────────────
┌──────────────────────────────────────────────────────────────────┐│ FRAMEWORK │ BEST FOR │ KEY STRENGTH │├──────────────────────────────────────────────────────────────────┤│ AutoGluon │ Tabular, general │ Best accuracy, ensembles ││ (Amazon) │ │ │├──────────────────────────────────────────────────────────────────┤│ auto-sklearn │ scikit-learn users │ Meta-learning, warm-start ││ (Freiburg) │ │ │├──────────────────────────────────────────────────────────────────┤│ H2O AutoML │ Enterprise │ Production-ready, scaling ││ (H2O.ai) │ │ │├──────────────────────────────────────────────────────────────────┤│ FLAML │ Fast experiments │ Low compute, fast ││ (Microsoft) │ │ │├──────────────────────────────────────────────────────────────────┤│ PyCaret │ Low-code ML │ Simple API, visualization ││ (Open Source) │ │ │└──────────────────────────────────────────────────────────────────┘AutoGluon Deep Dive
Section titled “AutoGluon Deep Dive”AutoGluon (by Amazon) consistently wins ML competitions:
AUTOGLUON ARCHITECTURE:───────────────────────
Input Data │ ┌────────────┼────────────┐ │ │ │ ▼ ▼ ▼ ┌───────┐ ┌───────┐ ┌───────┐ │NN │ │GBM │ │Linear │ │Models │ │Models │ │Models │ └───┬───┘ └───┬───┘ └───┬───┘ │ │ │ │ ┌──────┼──────┐ │ │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ ┌─────────────────────────────┐ │ MULTI-LAYER STACKING │ │ (Ensemble of ensembles) │ └──────────────┬──────────────┘ │ ▼ Best Model
MODELS AUTOGLUON TRIES:───────────────────────Neural Networks: - TabularNN (custom for tabular) - FastAI neural network
Gradient Boosting: - LightGBM - CatBoost - XGBoost
Linear Models: - Ridge/Lasso regression - Linear SVM
Ensemble: - Weighted ensemble - Multi-layer stackingDid You Know? AutoGluon’s “multi-layer stacking” is unique. Instead of just averaging model predictions, it trains a second layer of models on the first layer’s predictions, then a third layer, and so on. This recursive ensembling often improves accuracy by 1-3% - which can mean millions in revenue for businesses!
AutoML Under the Hood
Section titled “AutoML Under the Hood”Algorithm Selection
Section titled “Algorithm Selection”How does AutoML choose which algorithms to try?
ALGORITHM SELECTION STRATEGIES:───────────────────────────────
1. EXHAUSTIVE SEARCH Try ALL algorithms, all hyperparameters Problem: Computationally infeasible!
10 algorithms × 100 hyperparameter combos = 1,000 models If each takes 1 minute = 16+ hours
2. META-LEARNING (auto-sklearn approach) Learn from past datasets which algorithms work best
"This dataset looks like Dataset #4,523 from our database. Random Forest worked best there, let's try that first!"
Steps: a) Extract meta-features from dataset b) Find similar historical datasets c) Start with algorithms that worked on those
3. BAYESIAN OPTIMIZATION Smart search that learns as it goes
┌──────────────────────────────────────────────┐ │ Iteration 1: Try random config → Score: 0.75 │ │ Iteration 2: Try another → Score: 0.82 │ │ Iteration 3: Try similar to best → 0.85 │ │ ...learns that high learning_rate is bad... │ │ Iteration 50: Optimal found → 0.91 │ └──────────────────────────────────────────────┘
4. BANDIT-BASED (Hyperband/ASHA) Give more resources to promising configs
Start: 100 configs with 1 epoch each Keep: Top 25 configs, train for 4 epochs Keep: Top 6 configs, train for 16 epochs Keep: Top 2 configs, train to completion
Result: Find best config with 10x less compute!Hyperparameter Optimization
Section titled “Hyperparameter Optimization”Think of hyperparameter optimization like tuning a guitar. Each hyperparameter is a string that affects the sound. Turn the learning rate too high and you get noise; too low and you barely hear anything. The problem? A neural network has dozens of “strings,” and they all interact with each other.
Traditional approach: try every combination. With 10 hyperparameters and 5 values each, that’s 5^10 = 9.7 million combinations. Even at 1 minute per trial, that’s 18 years.
Smart approach: use Bayesian optimization, which learns from each trial. “High learning rate made things worse? Let’s try lower values.” It finds good configurations in 50-100 trials instead of millions.
HYPERPARAMETER SEARCH SPACE:────────────────────────────
LightGBM example:┌─────────────────────────────────────────────────┐│ Parameter │ Search Space │├─────────────────────────────────────────────────┤│ n_estimators │ [100, 200, 500, 1000] ││ learning_rate │ [0.01, 0.05, 0.1, 0.3] ││ max_depth │ [3, 5, 7, 10, -1] ││ num_leaves │ [15, 31, 63, 127] ││ min_child_weight │ [1e-3, 1e-2, 0.1, 1] ││ subsample │ [0.5, 0.7, 0.9, 1.0] ││ colsample_bytree │ [0.5, 0.7, 0.9, 1.0] │└─────────────────────────────────────────────────┘
Total combinations: 4×4×5×4×4×4×4 = 20,480 configs!
Smart search finds good config in ~50 trialsExhaustive search needs 20,480 trialsSpeedup: 400xAutomated Feature Engineering
Section titled “Automated Feature Engineering”Why This Module Matters
Section titled “Why This Module Matters”FEATURE ENGINEERING REALITY:────────────────────────────
Time spent on ML projects:┌──────────────────────────────────────────────────────┐│ Data Collection ████████████ 25% ││ Data Cleaning ████████████████ 35% ││ Feature Engineering ██████████████ 30% ││ Model Training ████ 10% │└──────────────────────────────────────────────────────┘
Feature engineering is:- Time-consuming (30% of project time)- Requires domain expertise- Often repetitive across projects- Critical for model performance
"Give me better features, and I'll give you a better model." - Every ML EngineerAutomated Feature Engineering Techniques
Section titled “Automated Feature Engineering Techniques”AUTO-FEATURE TECHNIQUES:────────────────────────
1. AGGREGATION (for relational data) ───────────────────────────────── customer_id → orders table
Auto-generated features: - count(orders) - sum(order_amount) - avg(order_amount) - max(order_amount) - days_since_last_order
2. TRANSFORMATION ─────────────── Original: [price, quantity]
Auto-generated: - log(price) - sqrt(quantity) - price * quantity (interaction) - price / quantity (ratio) - price ** 2 (polynomial)
3. TIME-BASED (from timestamps) ──────────────────────────── Original: purchase_datetime
Auto-generated: - hour_of_day - day_of_week - is_weekend - month - quarter - days_since_signup
4. ENCODING (for categoricals) ─────────────────────────── Original: category = ["A", "B", "C"]
Auto-generated: - One-hot encoding - Target encoding - Frequency encoding - Label encodingFeaturetools: Deep Feature Synthesis
Section titled “Featuretools: Deep Feature Synthesis”DEEP FEATURE SYNTHESIS (DFS):─────────────────────────────
Given relational tables, automatically generate features:
TABLES: customers(id, signup_date, country) orders(id, customer_id, date, amount) products(id, order_id, product_type, price)
DFS GENERATES:──────────────Depth 1: Simple aggregations - COUNT(orders) - SUM(orders.amount) - AVG(orders.amount)
Depth 2: Stacked aggregations - COUNT(orders.products) - AVG(orders.SUM(products.price)) - MODE(orders.MODE(products.product_type))
Depth 3: Triple-stacked! - STD(orders.AVG(products.price))
Result: 100s of features from 3 tables!
PRIMITIVES USED:────────────────Aggregation: sum, mean, count, max, min, std, modeTransform: year, month, weekday, cum_sum, diffDid You Know? In the 2015 KDD Cup, a team used Featuretools to generate over 1,000 features automatically. They finished in the top 10% of the competition with minimal manual feature engineering. The winning insight: more features (properly regularized) often beats carefully hand-crafted few features.
Feature Stores
Section titled “Feature Stores”What is a Feature Store?
Section titled “What is a Feature Store?”Think of a feature store as a “data warehouse for ML features” - a centralized repository where teams can share, discover, and reuse features.
Imagine a restaurant where every chef prepares their own spice blends. Chef A makes curry powder. Chef B makes the same curry powder differently. Chef C needs curry powder but doesn’t know it already exists, so they make a third version. Different dishes taste inconsistent, ingredients are wasted, and no one knows which recipe is “official.”
Now imagine a central spice cabinet with standardized, labeled blends. Every chef uses the same curry powder. New chefs can see what’s available. If the curry powder recipe improves, all dishes improve automatically.
That’s what a feature store does for ML features—centralizes, standardizes, and shares them across teams.
WITHOUT FEATURE STORE:──────────────────────
Team A: Builds "customer_lifetime_value" feature ├── Writes SQL query ├── Schedules daily job └── Stores in their own table
Team B: Needs same feature ├── Doesn't know Team A has it ├── Builds their own version └── Gets slightly different results!
Team C: Needs feature for real-time inference ├── Can't use batch SQL └── Builds third version!
Result: 3 versions of the same feature, inconsistent!
WITH FEATURE STORE:───────────────────
┌─────────────────────────────┐ │ FEATURE STORE │ │ ┌─────────────────────┐ │ │ │ customer_lifetime_ │ │ │ │ value │ │ │ │ - Batch: Daily SQL │ │ │ │ - Online: Redis │ │ │ │ - Owner: Team A │ │ │ │ - Version: 2.3 │ │ │ └─────────────────────┘ │ └──────────────┬──────────────┘ │ ┌──────────────────┼──────────────────┐ │ │ │ Team A Team B Team C (training) (training) (inference)
All teams use the SAME feature definition!Feature Store Architecture
Section titled “Feature Store Architecture”FEATURE STORE COMPONENTS:─────────────────────────
┌─────────────────────────────────────────────────────────────────┐│ FEATURE STORE ││ ││ ┌────────────────────────────────────────────────────────────┐ ││ │ FEATURE REGISTRY │ ││ │ - Feature definitions (schema, transformations) │ ││ │ - Metadata (owner, description, data lineage) │ ││ │ - Versioning │ ││ └────────────────────────────────────────────────────────────┘ ││ │ ││ ┌────────────────────┴────────────────────┐ ││ │ │ ││ ┌──────▼──────┐ ┌───────▼───────┐ ││ │ OFFLINE │ │ ONLINE │ ││ │ STORE │ │ STORE │ ││ │ │ │ │ ││ │ - Historical│ Sync │ - Latest │ ││ │ - Training │ ◄──────────────────────►│ - Real-time │ ││ │ - BigQuery/ │ │ - Redis/ │ ││ │ S3/HDFS │ │ DynamoDB │ ││ └─────────────┘ └───────────────┘ ││ │ │ ││ │ │ ││ ┌──────▼──────────────────────────────────────────▼──────┐ ││ │ SERVING LAYER │ ││ │ - Batch serving (training) │ ││ │ - Online serving (inference) │ ││ │ - Point-in-time correctness │ ││ └────────────────────────────────────────────────────────┘ │└─────────────────────────────────────────────────────────────────┘Point-in-Time Correctness
Section titled “Point-in-Time Correctness”This is the MOST important concept in feature stores:
POINT-IN-TIME PROBLEM:──────────────────────
Training data preparation: "What was the customer's purchase_count on 2024-01-15?"
WRONG APPROACH (data leakage!): SELECT purchase_count FROM current_features WHERE customer_id = 123
Problem: Returns TODAY's count, not 2024-01-15's! The model trains on future information = cheating!
CORRECT APPROACH (point-in-time join): SELECT purchase_count FROM feature_history WHERE customer_id = 123 AND feature_timestamp <= '2024-01-15' ORDER BY feature_timestamp DESC LIMIT 1
Returns: Value as it was on 2024-01-15
TIMELINE:───────── 2024-01-15 Today │ │ ▼ ▼ ──────────●─────────────────●──────► │ │ Training event Don't use Use features these values! from HEREDid You Know? Uber’s feature store “Michelangelo” serves over 10 million feature vector lookups per second. They estimate that having a centralized feature store reduced their ML feature development time by 50% and virtually eliminated training-serving skew issues that previously caused silent model degradation.
Feast: Open Source Feature Store
Section titled “Feast: Open Source Feature Store”Feast Architecture
Section titled “Feast Architecture”FEAST COMPONENTS:─────────────────
┌─────────────────────────────────────────────────────────────┐│ FEAST ││ ││ ┌──────────────────────────────────────────────────────┐ ││ │ FEATURE REPO │ ││ │ feature_store.yaml # Configuration │ ││ │ features.py # Feature definitions │ ││ └──────────────────────────────────────────────────────┘ ││ │ ││ │ feast apply ││ ▼ ││ ┌──────────────────────────────────────────────────────┐ ││ │ REGISTRY │ ││ │ - Feature views │ ││ │ - Entities │ ││ │ - Data sources │ ││ └──────────────────────────────────────────────────────┘ ││ │ ││ ┌───────────────┼───────────────┐ ││ │ │ │ ││ ▼ ▼ ▼ ││ ┌───────────┐ ┌───────────┐ ┌───────────┐ ││ │ Offline │ │ Online │ │ Serving │ ││ │ Store │ │ Store │ │ API │ ││ │ (Parquet) │ │ (Redis) │ │ (gRPC) │ ││ └───────────┘ └───────────┘ └───────────┘ │└─────────────────────────────────────────────────────────────┘Defining Features in Feast
Section titled “Defining Features in Feast”from feast import Entity, Feature, FeatureView, FileSourcefrom feast.types import Float32, Int64
# Define entity (the thing we're building features for)customer = Entity( name="customer_id", join_keys=["customer_id"], description="Customer identifier")
# Define data sourcecustomer_stats_source = FileSource( path="data/customer_stats.parquet", timestamp_field="event_timestamp")
# Define feature viewcustomer_stats = FeatureView( name="customer_stats", entities=[customer], ttl=timedelta(days=90), # How long features are valid schema=[ Field(name="total_purchases", dtype=Int64), Field(name="avg_order_value", dtype=Float32), Field(name="days_since_last_order", dtype=Int64), ], online=True, # Serve from online store source=customer_stats_source,)Using Feast
Section titled “Using Feast”# Training: Get historical featuresfrom feast import FeatureStore
store = FeatureStore(repo_path=".")
# Entity dataframe (what we want features for)entity_df = pd.DataFrame({ "customer_id": [1, 2, 3, 4, 5], "event_timestamp": pd.to_datetime(["2024-01-15"] * 5)})
# Get training data with point-in-time correctness!training_df = store.get_historical_features( entity_df=entity_df, features=[ "customer_stats:total_purchases", "customer_stats:avg_order_value", "customer_stats:days_since_last_order" ]).to_df()
# Inference: Get online featuresfeature_vector = store.get_online_features( features=[ "customer_stats:total_purchases", "customer_stats:avg_order_value", ], entity_rows=[{"customer_id": 123}]).to_dict()
# Returns: {"customer_id": [123], "total_purchases": [47], ...}ML Pipeline Automation
Section titled “ML Pipeline Automation”End-to-End ML Pipeline
Section titled “End-to-End ML Pipeline”AUTOMATED ML PIPELINE:──────────────────────
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐│ Data │──►│ Feature │──►│ Model │──►│ Model │──►│ Deploy ││ Ingest │ │ Eng │ │ Train │ │ Eval │ │ │└─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ Scheduled Feature AutoML Metrics A/B Test Jobs Store Trains Tracked or Canary
ORCHESTRATION: - Airflow / Dagster / Prefect - MLflow for experiment tracking - Feature store for features - Model registry for models - CI/CD for deploymentMLflow Integration
Section titled “MLflow Integration”MLFLOW EXPERIMENT TRACKING:───────────────────────────
with mlflow.start_run(): # Log parameters mlflow.log_param("model_type", "LightGBM") mlflow.log_param("n_estimators", 100)
# Train model model = train_model(...)
# Log metrics mlflow.log_metric("accuracy", 0.95) mlflow.log_metric("f1_score", 0.93)
# Log model mlflow.sklearn.log_model(model, "model")
MLFLOW UI shows:┌────────────────────────────────────────────────────────────┐│ Run ID │ Model │ n_estimators │ Accuracy │ F1 │├────────────────────────────────────────────────────────────┤│ abc123 │ LightGBM │ 100 │ 0.95 │ 0.93 ││ def456 │ XGBoost │ 200 │ 0.94 │ 0.92 ││ ghi789 │ RF │ 150 │ 0.92 │ 0.90 │└────────────────────────────────────────────────────────────┘
Easy comparison across experiments!Practical AutoML Workflow
Section titled “Practical AutoML Workflow”Step-by-Step AutoML Process
Section titled “Step-by-Step AutoML Process”PRODUCTION AUTOML WORKFLOW:───────────────────────────
1. DATA PREPARATION ├── Load data ├── Basic cleaning (handle missing, remove duplicates) ├── Define target variable └── Split train/validation/test
2. QUICK AUTOML RUN (1 hour) ├── Use AutoGluon with time_limit=3600 ├── Get baseline performance └── Identify promising models
3. ANALYZE RESULTS ├── Check feature importance ├── Look for data leakage └── Understand model behavior
4. EXTENDED RUN (optional) ├── Run AutoGluon with more time ├── Try different presets (best_quality vs optimize_for_deployment) └── Experiment with hyperparameter ranges
5. MODEL SELECTION ├── Compare accuracy vs inference time ├── Consider interpretability needs └── Check model size for deployment
6. PRODUCTION PREPARATION ├── Export best model ├── Create feature pipeline ├── Set up monitoring └── Deploy with gradual rolloutAutoGluon Presets
Section titled “AutoGluon Presets”AUTOGLUON PRESETS:──────────────────
PRESET: "best_quality" - Maximum accuracy - Uses all algorithms - Deep stacking ensembles - Time: 4-8x longer - Use for: Final production models
PRESET: "high_quality" - Near-optimal accuracy - Good ensemble - Reasonable time - Use for: Most use cases
PRESET: "good_quality" - Good accuracy - Faster training - Use for: Prototyping
PRESET: "medium_quality" - Decent accuracy - Much faster - Use for: Quick baselines
PRESET: "optimize_for_deployment" - Single model (no ensemble) - Fast inference - Smaller model size - Use for: Real-time servingCommon Pitfalls and Best Practices
Section titled “Common Pitfalls and Best Practices”AutoML Pitfalls
Section titled “AutoML Pitfalls”COMMON AUTOML MISTAKES:───────────────────────
1. DATA LEAKAGE ─────────────── Problem: Target information leaks into features
Example: Predicting "will customer churn?" Bad feature: "cancellation_date" (directly reveals answer!)
Solution: Review feature importance, suspicious features that are too predictive are often leaky
2. OVERFITTING TO VALIDATION ────────────────────────── Problem: Running AutoML many times, picking best
Each run: Accuracy = 0.91, 0.92, 0.93, 0.94, 0.90... Pick best: 0.94! Test set: 0.88 (overfit to validation)
Solution: Hold out a TRUE test set, evaluate once at end
3. IGNORING BUSINESS CONSTRAINTS ─────────────────────────────── Problem: Best model has 200ms latency, need < 10ms
AutoGluon best model: Stacked ensemble Inference time: 200ms
Solution: Use optimize_for_deployment preset or constrain model types
4. FEATURE STORE SKEW ──────────────────── Problem: Training features differ from serving features
Training: feature_v1 (old transformation) Serving: feature_v2 (new transformation)
Solution: Use feature store with versioningBest Practices
Section titled “Best Practices”AUTOML BEST PRACTICES:──────────────────────
1. START SIMPLE - Run quick AutoML first (30 min - 1 hour) - Get baseline before investing more time - Understand what's possible
2. FEATURE ENGINEERING STILL MATTERS - AutoML optimizes models, not features - Domain features often help - Combine AutoML + manual feature eng
3. USE PROPER VALIDATION - Time-based splits for time series - Stratified for imbalanced classes - Group splits for related samples
4. MONITOR IN PRODUCTION - Track feature distributions - Monitor model performance - Set up drift detection
5. DOCUMENT EVERYTHING - Which AutoML settings? - What features used? - Business metrics impact?Did You Know? Google’s AutoML Tables (now part of Vertex AI) automatically applies over 100 different data preprocessing and feature engineering transformations. When tested on the OpenML benchmark suite, it achieved state-of-the-art results on 60% of datasets - often surpassing manually tuned models!
Summary
Section titled “Summary”KEY CONCEPTS RECAP:───────────────────
AUTOML: - Automates algorithm selection + hyperparameter tuning - AutoGluon: Best accuracy, multi-layer stacking - Use presets based on needs (quality vs speed)
FEATURE STORES: - Centralized feature management - Point-in-time correctness for training - Online/offline serving - Feast: Open source, easy to start
AUTOMATED FEATURE ENGINEERING: - Aggregations, transformations, time features - Featuretools for deep feature synthesis - Still combine with domain knowledge
ML PIPELINES: - End-to-end automation - MLflow for experiment tracking - Orchestration (Airflow, Dagster)
WHEN TO USE WHAT:─────────────────
┌─────────────────────────────────────────────────────────────┐│ SCENARIO │ RECOMMENDATION │├─────────────────────────────────────────────────────────────┤│ Quick baseline │ AutoGluon, medium_quality ││ Production model │ AutoGluon, best_quality ││ Real-time serving │ AutoGluon, optimize_for_deployment││ Feature reuse │ Feast feature store ││ Relational data │ Featuretools + AutoML ││ Team collaboration │ Feature store + MLflow │└─────────────────────────────────────────────────────────────┘Production War Stories: AutoML and Feature Store Lessons
Section titled “Production War Stories: AutoML and Feature Store Lessons”The $12 Million Feature Leak
Section titled “The $12 Million Feature Leak”Singapore. March 2022. Fintech startup CredFlow.
The data science team was celebrating. Their credit scoring model, built with AutoGluon in just two days, achieved 0.94 AUC—far better than their previous hand-built model at 0.78. The model went into production, and loan approvals skyrocketed.
Three months later, the collections team noticed something disturbing: default rates had tripled. Loans approved by the new model were failing at unprecedented rates.
The forensic analysis revealed the horror: One feature—days_until_first_payment—had an importance score of 0.42. This feature was calculated from a field populated after the loan was already approved. Customers with short payment windows (who’d been approved quickly) showed different patterns than those with longer windows.
The AutoML system had found a perfect predictor: a feature that leaked the outcome. It’s like trying to predict who will win a race by looking at the finish photo—technically accurate, but useless for making predictions before the race.
Financial impact: $12.3 million in bad loans before detection. The CRO was fired. The startup nearly collapsed.
The fix implemented:
# Before: Feature calculated wheneverdays_until_first_payment = payment_date - approval_date
# After: Strict point-in-time feature validationdef validate_feature_timing(feature_name, feature_timestamp, prediction_timestamp): """Ensure feature was available BEFORE prediction was needed.""" if feature_timestamp > prediction_timestamp: raise DataLeakageError( f"Feature '{feature_name}' has timestamp {feature_timestamp} " f"but prediction was needed at {prediction_timestamp}. " f"This is data leakage!" )Lesson: AutoML will find ANY signal, including signals from the future. Point-in-time validation isn’t optional—it’s essential.
Did You Know? A 2023 survey of ML practitioners found that 43% had experienced data leakage in production models. The median time to detect leakage was 47 days. Feature stores with built-in point-in-time correctness reduce leakage incidents by 87%.
The Feature Store That Saved Black Friday
Section titled “The Feature Store That Saved Black Friday”Seattle. November 2021. Major e-commerce retailer.
Black Friday was approaching. The ML platform team was nervous. Last year, their recommendation system had crashed under load, costing an estimated $8 million in lost sales. The problem? The feature computation pipeline couldn’t keep up with 50,000 requests per second.
This year, they’d implemented Feast as their feature store. The architecture was different:
LAST YEAR (crashed):────────────────────Request → Compute Features On-Demand → Model → Response └── SQL query per request └── 200ms latency └── Can't scale past 500 RPS
THIS YEAR (Feast):─────────────────Pre-computed features → Redis (Online Store)Request → Redis lookup (1ms) → Model → Response └── 50,000+ RPS └── 5ms total latencyOn Black Friday, traffic hit 72,000 requests per second. The system didn’t flinch. Latency stayed under 10ms. Revenue increased 34% year-over-year.
The key insight: Feature computation is the bottleneck, not model inference. Pre-computing features and serving from an online store changed everything.
Financial impact: Black Friday revenue increased by $47 million. Feature store implementation cost: $300K.
The AutoML Model That Discriminated
Section titled “The AutoML Model That Discriminated”Chicago. June 2023. Insurance company HealthFirst.
The compliance team flagged an anomaly: claim denials were 23% higher for customers in certain ZIP codes—ZIP codes that happened to correlate strongly with minority populations.
Investigation revealed the AutoML system had discovered a highly predictive feature: zip_code_health_score, which was derived from historical claim data. The problem? Historical claim data reflected decades of discriminatory practices. The model wasn’t being racist on purpose—it was faithfully learning patterns that encoded institutional racism.
The team’s response:
- Removed proxy features: ZIP code, neighborhood, and any feature that correlated >0.3 with protected demographics
- Added fairness constraints: Ensured prediction rates were within 5% across demographic groups
- Implemented explainability: Required human review for any denial with unusual feature weights
# Fairness-aware AutoML configurationfrom autogluon.tabular import TabularPredictor
predictor = TabularPredictor( label='claim_approved', eval_metric='roc_auc').fit( train_data, # Add fairness constraint hyperparameters={ 'GBM': { 'constraint_type': 'demographic_parity', 'fairness_target': 'race_proxy', 'fairness_threshold': 0.05 } })Regulatory outcome: The company avoided a discrimination lawsuit by self-reporting and fixing the issue. Estimated savings: $15-20 million in legal fees and settlements.
Lesson: AutoML optimizes what you tell it to optimize. If you only optimize for accuracy, it will happily learn discriminatory patterns. Fairness must be an explicit constraint.
Did You Know? Amazon famously scrapped an AI recruiting tool in 2018 after discovering it systematically downgraded women’s resumes. The model, trained on 10 years of hiring data, learned that Amazon had historically hired mostly men—and therefore preferred male candidates. This incident led to an industry-wide push for fairness-aware ML.
Common Mistakes and How to Avoid Them
Section titled “Common Mistakes and How to Avoid Them”Mistake 1: Trusting AutoML Feature Importance Blindly
Section titled “Mistake 1: Trusting AutoML Feature Importance Blindly”Wrong:
# AutoML found these are the most important features# Great, let's use them!top_features = model.feature_importance()[:10]production_model = train_on(data[top_features]) # Dangerous!Problem: Feature importance from AutoML can be misleading. A leaky feature will show high importance. A feature that’s important for one model type might be useless for another.
Right:
def validate_feature_importance(feature_name, importance_score, data): """Sanity check for suspiciously important features."""
# Check for data leakage if importance_score > 0.3: # Suspiciously high print(f"️ WARNING: {feature_name} has importance {importance_score}") print("Checking for potential leakage...")
# Check if feature correlates with target timing correlation_with_target = data[feature_name].corr(data['target']) if abs(correlation_with_target) > 0.8: raise LeakageWarning( f"{feature_name} has {correlation_with_target:.2f} correlation " f"with target. Likely data leakage!" )
# Check if feature is available at prediction time if not is_available_at_prediction_time(feature_name): raise LeakageWarning( f"{feature_name} is not available at prediction time!" )
return TrueMistake 2: Not Setting Time Limits on AutoML
Section titled “Mistake 2: Not Setting Time Limits on AutoML”Wrong:
# "Just let it run until it's done"predictor = TabularPredictor(label='target').fit(train_data)# 3 days later: still running, $2,000 in cloud costsProblem: AutoML will happily run forever, trying more and more models. Without time limits, you waste compute and money.
Right:
# Always set explicit time limitspredictor = TabularPredictor( label='target', eval_metric='roc_auc').fit( train_data, time_limit=3600, # 1 hour max presets='best_quality', # Will do its best within time limit # AutoGluon automatically prioritizes promising models)Mistake 3: Ignoring Training-Serving Skew
Section titled “Mistake 3: Ignoring Training-Serving Skew”Wrong:
# Training timetraining_features = compute_features_from_warehouse(training_data)model.fit(training_features, labels)
# Serving time (different code path!)serving_features = compute_features_from_api(request_data) # Different!prediction = model.predict(serving_features)Problem: Subtle differences in feature computation between training and serving cause silent model degradation. Think of it as using different thermometers that are calibrated differently—your predictions will be systematically off.
Right:
# Use feature store for BOTH training and servingfrom feast import FeatureStore
store = FeatureStore(repo_path=".")
# Training timetraining_features = store.get_historical_features( entity_df=training_entities, features=['customer:total_purchases', 'customer:avg_order_value']).to_df()
# Serving time (SAME feature definitions!)serving_features = store.get_online_features( features=['customer:total_purchases', 'customer:avg_order_value'], entity_rows=[{"customer_id": request.customer_id}]).to_dict()
# Features are guaranteed to be computed identicallyMistake 4: Not Versioning Features
Section titled “Mistake 4: Not Versioning Features”Wrong:
# Features.py - Modified directly in productioncustomer_value = total_purchases * avg_order_value # Changed from sum to product# Now training data has old definition, production has new...Problem: Changing feature definitions without versioning creates chaos. Models trained on v1 features serving with v2 features will produce garbage.
Right:
# features_v2.py - Explicit versioningclass CustomerValueV2(FeatureView): """ Customer lifetime value calculation.
v1 -> v2 changes: - Changed from sum to product formula - Added recency weighting - Breaking change: requires model retraining
Migration: Models must be retrained before using v2 """ name = "customer_value_v2" version = "2.0.0" deprecates = "customer_value_v1" # Mark old version requires_retraining = True
def compute(self, data): return (data.total_purchases * data.avg_order_value * self.recency_weight(data.days_since_last_order))Mistake 5: Using AutoML for Everything
Section titled “Mistake 5: Using AutoML for Everything”Wrong:
# "AutoML is magic, let's use it everywhere!"image_model = AutoGluon.fit(image_data) # Works but suboptimaltext_model = AutoGluon.fit(text_data) # Works but suboptimaltime_series = AutoGluon.fit(ts_data) # Works but suboptimalProblem: AutoML excels at tabular data. For images, text, and time series, specialized approaches usually win.
Right:
WHEN TO USE AUTOML vs SPECIALIZED TOOLS:────────────────────────────────────────
Data Type │ AutoML Good? │ Better Alternative─────────────┼──────────────┼────────────────────────Tabular │ Excellent │ N/A - AutoML is bestImages │ ️ OK │ Transfer learning (ResNet, ViT)Text │ ️ OK │ Fine-tuned LLMs, BERTTime Series │ ️ OK │ Prophet, NeuralProphet, DeepARGraph │ Poor │ PyTorch Geometric, DGLAudio │ Poor │ Whisper, wav2vecEconomics of AutoML and Feature Stores
Section titled “Economics of AutoML and Feature Stores”AutoML ROI Calculation
Section titled “AutoML ROI Calculation”| Scenario | Manual ML | AutoML | Savings |
|---|---|---|---|
| Initial Model Development | |||
| Data scientist time | 4 weeks ($40K) | 1 week ($10K) | $30K |
| Compute costs | $500 | $200 | $300 |
| Time to production | 6 weeks | 2 weeks | 4 weeks faster |
| Ongoing Maintenance | |||
| Monthly retraining time | 2 days ($4K) | 4 hours ($500) | $3.5K/month |
| Model iteration cost | $10K per iteration | $2K per iteration | $8K/iteration |
| Annual Savings | |||
| Initial + 12 months maintenance | $88K | $16K | $72K/year |
Feature Store ROI Calculation
Section titled “Feature Store ROI Calculation”| Metric | Without Feature Store | With Feature Store |
|---|---|---|
| Feature development time | 2 weeks per feature | 2 days per feature (10x faster with reuse) |
| Feature duplication | 5x average (same feature built 5 times) | 1x (single source of truth) |
| Training-serving skew incidents | 3 per quarter | 0.1 per quarter (30x reduction) |
| Revenue lost to skew | $500K/year | $17K/year |
| Engineer time on debugging | 20% | 5% |
| Total Annual Impact | $800K-1.2M savings |
Cost of NOT Using These Tools
Section titled “Cost of NOT Using These Tools”REAL COSTS OF MANUAL ML:────────────────────────
┌────────────────────────────────────────────────────────────┐│ Problem │ Typical Cost │├────────────────────────────────────────────────────────────┤│ Data leakage in production │ $1M-10M (depending on ││ │ time to detect) │├────────────────────────────────────────────────────────────┤│ Training-serving skew │ $100K-500K per incident ││ (silent model degradation) │ (lost revenue + debug) │├────────────────────────────────────────────────────────────┤│ Duplicate feature work │ $50K-200K/year ││ (multiple teams building │ (wasted engineer time) ││ same features) │ │├────────────────────────────────────────────────────────────┤│ Slow model iteration │ $500K-2M/year ││ (opportunity cost of delayed │ (competitors move ││ improvements) │ faster) │└────────────────────────────────────────────────────────────┘Did You Know? According to a 2023 MLOps survey, companies using feature stores report 73% faster time-to-production for new ML models. The median ROI for feature store implementations is 340% over 3 years, with payback periods under 9 months.
Interview Preparation: AutoML & Feature Stores
Section titled “Interview Preparation: AutoML & Feature Stores”Q1: “When would you use AutoML vs. hand-crafted models?”
Section titled “Q1: “When would you use AutoML vs. hand-crafted models?””Strong Answer: “I use AutoML in three main scenarios. First, for establishing baselines quickly—before investing weeks in manual model development, I run AutoML to understand what’s achievable. If AutoML gets 0.75 AUC, I know my hand-crafted model should aim for at least 0.78 to justify the extra effort.
Second, for tabular data problems with clear evaluation metrics—this is AutoML’s sweet spot. In my experience, AutoGluon matches or beats hand-crafted gradient boosting models 80% of the time with a fraction of the effort.
Third, when the ML isn’t the core differentiator—if we’re building a feature where ML is a small component, I’d rather spend engineering time on the product, not model tuning.
I wouldn’t use AutoML when I need specific architectures like transformers for NLP, when there are strict latency requirements that need optimized single models, or when interpretability is critical and I need to explain every decision.”
Q2: “Explain point-in-time correctness in feature stores.”
Section titled “Q2: “Explain point-in-time correctness in feature stores.””Strong Answer: “Point-in-time correctness ensures that when training a model, we only use feature values that were available at the time the prediction would have been made. It prevents data leakage from the future.
Imagine training a model to predict customer churn. If a customer churned on March 15, their training features should reflect their state on March 14 or earlier—not their state after they churned. Without point-in-time correctness, we might accidentally include features like ‘days_since_last_login’ that jumped to 30+ after they stopped using the product—a clear signal they’ve churned that wouldn’t be available when making a real prediction.
Feature stores implement this by maintaining timestamped feature values and performing point-in-time joins. When you request historical features for training, you provide entity timestamps, and the feature store returns the most recent feature value that existed before each timestamp.
This is critical because models trained with leaked features show fantastic offline metrics but fail dramatically in production—a pattern I’ve seen called ‘suspiciously good AUC syndrome.’”
Q3: “How does multi-layer stacking work in AutoGluon?”
Section titled “Q3: “How does multi-layer stacking work in AutoGluon?””Strong Answer: “Multi-layer stacking is AutoGluon’s ensemble technique that significantly outperforms simple averaging or voting.
In the first layer, AutoGluon trains diverse base models—gradient boosting (LightGBM, XGBoost, CatBoost), neural networks, and linear models. Each model makes predictions on out-of-fold validation data to avoid leakage.
These first-layer predictions become features for the second layer, which trains new models to combine them. Think of it as learning ‘when to trust which model.’ If LightGBM is great on numerical features but weak on categoricals, while CatBoost is the opposite, the second layer learns to weight them appropriately based on the input.
AutoGluon can stack multiple layers—typically 2-3 work best before diminishing returns.
The key insight is that this outperforms simple ensembling because it learns non-linear combinations of model predictions. A weighted average says ‘LightGBM gets 40% weight.’ Multi-layer stacking says ‘LightGBM gets 80% weight when feature X is high, but only 20% when feature Y is low.’
In practice, I’ve seen multi-layer stacking improve AUC by 1-3% over simple ensembles—which can translate to millions in revenue for high-stakes predictions.”
Q4: “What’s the difference between online and offline feature stores?”
Section titled “Q4: “What’s the difference between online and offline feature stores?””Strong Answer: “They serve different use cases with different latency and storage requirements.
The offline store is optimized for training workloads. It stores historical feature values with timestamps, typically in data warehouses like BigQuery, Snowflake, or object storage like S3. Latency is seconds to minutes, but it can handle huge volumes—millions of feature vectors. I use it when creating training datasets with point-in-time correctness.
The online store is optimized for inference. It stores only the latest feature values in low-latency databases like Redis, DynamoDB, or Bigtable. Latency is single-digit milliseconds. I use it when serving predictions in real-time.
The key architectural insight is that they share the same feature definitions but different storage backends. The feature store materializes features from the offline store to the online store periodically—typically every few minutes to daily, depending on freshness requirements.
This dual architecture solves the training-serving skew problem. My training code uses get_historical_features from the offline store. My serving code uses get_online_features from the online store. Both use identical feature definitions, just different storage optimized for their use case.”
System Design: Design an ML Platform with AutoML and Feature Store
Section titled “System Design: Design an ML Platform with AutoML and Feature Store”Prompt: “Design an ML platform for a fintech company that needs to make real-time credit decisions at 10,000 requests per second.”
Strong Answer:
“I’d design this with five key components:
1. Feature Store Architecture:
Online Store: Redis Cluster (6 nodes) - Sharded by customer_id - 500K features, 1ms p99 latency - TTL: 24 hours, refreshed hourly
Offline Store: BigQuery - Historical features for training - 2 years of feature history - Point-in-time query support
Feature Computation: Spark on Dataproc - Batch features: daily runs at 2 AM - Real-time features: Kafka Streams - Features: credit_score, payment_history, debt_to_income, account_age, etc.2. AutoML Pipeline:
# Monthly retraining pipelinedef train_credit_model(): # Get training data with point-in-time correctness training_data = feature_store.get_historical_features( entity_df=approved_applications_last_6_months, features=CREDIT_FEATURES, label='defaulted_within_90_days' )
# AutoML with fairness constraints predictor = TabularPredictor( label='defaulted', eval_metric='roc_auc' ).fit( training_data, time_limit=14400, # 4 hours presets='optimize_for_deployment', # Single model for latency excluded_model_types=['NN'] # Neural nets too slow )
# Validate fairness before promotion if passes_fairness_audit(predictor): deploy_to_production(predictor)3. Serving Architecture for 10K RPS:
Load Balancer (GCP GLB) │ ├── Kubernetes Cluster (GKE) │ └── Model Serving Pods (50 replicas) │ - CPU-optimized (LightGBM) │ - 10ms p99 latency per request │ - gRPC for low overhead │ └── Feature Store (Redis) - 1ms feature fetch - Pre-computed features only4. Monitoring & Safety:
- Feature drift detection: alert if distributions shift >10%
- Model performance monitoring: daily AUC calculation on holdout
- Fairness monitoring: automated demographic parity checks
- Circuit breaker: fall back to rules-based model if ML fails
5. Cost Estimate:
- Feature Store (Redis): $15K/month
- Compute (GKE): $25K/month
- BigQuery: $5K/month
- Total: $45K/month = $540K/year
Expected value: At 10K RPS, serving 864M decisions/day. Even 0.1% improvement in precision saves millions in bad debt.
This architecture handles 10K RPS with 15ms end-to-end latency while maintaining feature consistency and enabling rapid model iteration through AutoML.”
Hands-On Exercises
Section titled “Hands-On Exercises”Exercise 1: AutoML Baseline Challenge
Section titled “Exercise 1: AutoML Baseline Challenge”Build an AutoML baseline and compare it to a hand-crafted model:
"""AutoML vs Manual Model Comparison
Dataset: Credit Card Fraud Detection (Kaggle)Goal: Compare development time and accuracy"""import pandas as pdfrom autogluon.tabular import TabularPredictorfrom sklearn.ensemble import RandomForestClassifierfrom sklearn.model_selection import train_test_splitfrom sklearn.metrics import roc_auc_scoreimport time
# Load datadata = pd.read_csv('creditcard.csv')X = data.drop('Class', axis=1)y = data['Class']X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# MANUAL APPROACHprint("=" * 50)print("MANUAL RANDOM FOREST")print("=" * 50)manual_start = time.time()
rf = RandomForestClassifier( n_estimators=100, max_depth=10, min_samples_leaf=5, random_state=42, n_jobs=-1)rf.fit(X_train, y_train)rf_predictions = rf.predict_proba(X_test)[:, 1]rf_auc = roc_auc_score(y_test, rf_predictions)
manual_time = time.time() - manual_startprint(f"Time: {manual_time:.1f}s")print(f"AUC: {rf_auc:.4f}")
# AUTOML APPROACHprint("\n" + "=" * 50)print("AUTOGLUON")print("=" * 50)automl_start = time.time()
# Prepare data for AutoGluontrain_df = X_train.copy()train_df['Class'] = y_train.values
predictor = TabularPredictor( label='Class', eval_metric='roc_auc', verbosity=1).fit( train_df, time_limit=300, # 5 minutes presets='medium_quality')
test_df = X_test.copy()automl_predictions = predictor.predict_proba(test_df)automl_auc = roc_auc_score(y_test, automl_predictions.iloc[:, 1])
automl_time = time.time() - automl_startprint(f"Time: {automl_time:.1f}s")print(f"AUC: {automl_auc:.4f}")
# COMPARISONprint("\n" + "=" * 50)print("COMPARISON")print("=" * 50)print(f"Manual RF: {rf_auc:.4f} AUC in {manual_time:.1f}s")print(f"AutoGluon: {automl_auc:.4f} AUC in {automl_time:.1f}s")print(f"Improvement: {(automl_auc - rf_auc)*100:.2f} percentage points")
# See what AutoGluon triedprint("\n" + "=" * 50)print("AUTOGLUON LEADERBOARD")print("=" * 50)print(predictor.leaderboard())Expected Learning: AutoGluon typically achieves 1-3% higher AUC than a basic Random Forest, demonstrating the value of automated algorithm selection and ensembling.
Exercise 2: Feature Store Implementation
Section titled “Exercise 2: Feature Store Implementation”Build a simple feature store with point-in-time correctness:
"""Simple Feature Store Implementation
This exercise teaches the core concepts of feature stores:- Point-in-time correctness- Online vs offline serving- Feature versioning"""import pandas as pdfrom datetime import datetime, timedeltafrom typing import Dict, List, Optionalimport jsonimport redis # For online store
class SimpleFeatureStore: """ A minimal feature store demonstrating core concepts.
Production feature stores like Feast add: - Distributed storage - Automatic materialization - Schema validation - Access control """
def __init__(self): # Offline store: Historical features with timestamps self.offline_store: Dict[str, pd.DataFrame] = {}
# Online store: Latest features only (simulated with dict) self.online_store: Dict[str, Dict] = {}
# Feature registry: Metadata about features self.registry: Dict[str, dict] = {}
def register_feature( self, name: str, entity: str, description: str, version: str = "1.0.0" ): """Register a feature in the registry.""" self.registry[name] = { 'entity': entity, 'description': description, 'version': version, 'created_at': datetime.now().isoformat() } print(f" Registered feature: {name} v{version}")
def write_features( self, feature_name: str, data: pd.DataFrame, timestamp_col: str = 'event_timestamp' ): """Write features to both offline and online stores.""" if feature_name not in self.registry: raise ValueError(f"Feature {feature_name} not registered!")
# Write to offline store (full history) if feature_name not in self.offline_store: self.offline_store[feature_name] = data.copy() else: self.offline_store[feature_name] = pd.concat([ self.offline_store[feature_name], data ]).drop_duplicates()
# Write to online store (latest values only) entity_col = self.registry[feature_name]['entity'] for _, row in data.iterrows(): entity_id = str(row[entity_col]) self.online_store[f"{feature_name}:{entity_id}"] = row.to_dict()
print(f" Wrote {len(data)} rows to {feature_name}")
def get_historical_features( self, feature_name: str, entity_df: pd.DataFrame, timestamp_col: str = 'event_timestamp' ) -> pd.DataFrame: """ Get historical features with point-in-time correctness.
This is the CRITICAL function that prevents data leakage. """ if feature_name not in self.offline_store: raise ValueError(f"Feature {feature_name} not in offline store!")
feature_data = self.offline_store[feature_name] entity_col = self.registry[feature_name]['entity']
results = [] for _, entity_row in entity_df.iterrows(): entity_id = entity_row[entity_col] query_time = entity_row[timestamp_col]
# Point-in-time filter: only use features from BEFORE query time valid_features = feature_data[ (feature_data[entity_col] == entity_id) & (feature_data[timestamp_col] <= query_time) ]
if len(valid_features) > 0: # Get the most recent valid feature latest = valid_features.sort_values(timestamp_col).iloc[-1] results.append(latest.to_dict()) else: # No valid features - use nulls results.append({entity_col: entity_id})
return pd.DataFrame(results)
def get_online_features( self, feature_name: str, entity_ids: List[str] ) -> List[Dict]: """Get latest features for real-time inference.""" results = [] for entity_id in entity_ids: key = f"{feature_name}:{entity_id}" if key in self.online_store: results.append(self.online_store[key]) else: results.append({'error': 'not_found'}) return results
# Demo usageif __name__ == "__main__": store = SimpleFeatureStore()
# Register features store.register_feature( name='customer_stats', entity='customer_id', description='Aggregated customer purchase statistics' )
# Create some historical feature data feature_data = pd.DataFrame({ 'customer_id': [1, 1, 1, 2, 2], 'total_purchases': [10, 15, 20, 5, 8], 'avg_order_value': [50.0, 52.0, 55.0, 30.0, 35.0], 'event_timestamp': pd.to_datetime([ '2024-01-01', '2024-02-01', '2024-03-01', '2024-01-15', '2024-02-15' ]) })
store.write_features('customer_stats', feature_data)
# Point-in-time retrieval for training training_entities = pd.DataFrame({ 'customer_id': [1, 1, 2], 'event_timestamp': pd.to_datetime([ '2024-01-15', # Should get Jan 1 features '2024-02-15', # Should get Feb 1 features '2024-02-01' # Should get Jan 15 features ]) })
historical = store.get_historical_features( 'customer_stats', training_entities ) print("\n Historical Features (point-in-time correct):") print(historical)
# Online retrieval for inference online = store.get_online_features('customer_stats', ['1', '2']) print("\n Online Features (latest values):") for f in online: print(f)Expected Learning: Understanding how point-in-time correctness prevents data leakage and why online/offline stores serve different purposes.
Exercise 3: Data Leakage Detection
Section titled “Exercise 3: Data Leakage Detection”Build a tool to detect potential data leakage in AutoML results:
"""Data Leakage Detection Tool
Detects common patterns that indicate data leakage in AutoML models."""import pandas as pdimport numpy as npfrom typing import List, Tuple
class LeakageDetector: """Detects potential data leakage in ML features."""
def __init__(self, model, X: pd.DataFrame, y: pd.Series): self.model = model self.X = X self.y = y self.warnings = []
def check_feature_importance( self, importance_threshold: float = 0.3 ) -> List[Tuple[str, float, str]]: """ Flag features with suspiciously high importance.
Leaky features often have unusually high importance because they directly encode the target. """ try: importances = self.model.feature_importance() except AttributeError: # For models without built-in feature importance return []
suspicious = [] for feature, importance in importances.items(): if importance > importance_threshold: suspicious.append(( feature, importance, f"️ Unusually high importance ({importance:.3f}). " f"Check for data leakage!" )) self.warnings.append(f"HIGH_IMPORTANCE: {feature}")
return suspicious
def check_correlation_with_target( self, correlation_threshold: float = 0.9 ) -> List[Tuple[str, float, str]]: """ Flag features with very high correlation to target.
Perfect or near-perfect correlation often indicates leakage. """ suspicious = [] for col in self.X.columns: if self.X[col].dtype in ['int64', 'float64']: corr = abs(self.X[col].corr(self.y)) if corr > correlation_threshold: suspicious.append(( col, corr, f"️ Very high correlation ({corr:.3f}). " f"Likely data leakage!" )) self.warnings.append(f"HIGH_CORRELATION: {col}")
return suspicious
def check_perfect_prediction_features(self) -> List[str]: """ Flag features that perfectly predict the target alone.
If a single feature achieves >99% accuracy, it's usually leakage. """ suspicious = [] for col in self.X.columns: if self.X[col].nunique() < 100: # Categorical-ish # Check if any value perfectly predicts target for value in self.X[col].unique(): mask = self.X[col] == value if mask.sum() > 10: # Enough samples target_values = self.y[mask].unique() if len(target_values) == 1: suspicious.append(col) self.warnings.append( f"PERFECT_PREDICTOR: {col}={value} -> {target_values[0]}" ) break
return suspicious
def check_temporal_leakage( self, date_columns: List[str], target_date_column: str = None ) -> List[str]: """ Flag features that might be computed from future data. """ suspicious = []
# Check if any features have dates AFTER the target date if target_date_column and target_date_column in self.X.columns: target_date = pd.to_datetime(self.X[target_date_column]) for col in date_columns: if col in self.X.columns and col != target_date_column: feature_date = pd.to_datetime(self.X[col]) future_rows = (feature_date > target_date).sum() if future_rows > 0: suspicious.append(col) self.warnings.append( f"TEMPORAL_LEAKAGE: {col} has {future_rows} " f"rows with dates after target" )
return suspicious
def generate_report(self) -> str: """Generate a comprehensive leakage report.""" report = [] report.append("=" * 60) report.append("DATA LEAKAGE DETECTION REPORT") report.append("=" * 60)
# Run all checks importance_issues = self.check_feature_importance() correlation_issues = self.check_correlation_with_target() perfect_predictors = self.check_perfect_prediction_features()
# Format report if importance_issues: report.append("\n HIGH IMPORTANCE FEATURES:") for feature, importance, msg in importance_issues: report.append(f" - {feature}: {msg}")
if correlation_issues: report.append("\n HIGH CORRELATION FEATURES:") for feature, corr, msg in correlation_issues: report.append(f" - {feature}: {msg}")
if perfect_predictors: report.append("\n PERFECT PREDICTOR FEATURES:") for feature in perfect_predictors: report.append(f" - {feature}")
if not any([importance_issues, correlation_issues, perfect_predictors]): report.append("\n No obvious leakage detected") report.append(" (Note: Some leakage types require domain knowledge to detect)")
report.append("\n" + "=" * 60) report.append(f"Total warnings: {len(self.warnings)}")
return "\n".join(report)
# Usage exampleif __name__ == "__main__": # Create sample data with intentional leakage np.random.seed(42) n = 1000
# Normal features X = pd.DataFrame({ 'age': np.random.randint(18, 80, n), 'income': np.random.normal(50000, 20000, n), 'credit_score': np.random.randint(300, 850, n), })
# Target: will customer default? y = pd.Series(np.random.binomial(1, 0.2, n))
# Add LEAKY feature (computed from outcome!) # This simulates a feature that includes future information X['days_until_default'] = np.where(y == 1, np.random.randint(1, 90, n), -1)
# This feature is suspicious - high correlation X['default_indicator'] = y * 0.99 + np.random.normal(0, 0.01, n)
# Create mock model class MockModel: def feature_importance(self): return { 'age': 0.05, 'income': 0.08, 'credit_score': 0.12, 'days_until_default': 0.45, # Suspiciously high! 'default_indicator': 0.30 }
# Run detection detector = LeakageDetector(MockModel(), X, y) print(detector.generate_report())Expected Learning: Understanding common leakage patterns and how to systematically detect them before they cause production failures.
Key Takeaways
Section titled “Key Takeaways”-
AutoML is a force multiplier, not a replacement — It doesn’t replace ML engineers; it amplifies their productivity by automating tedious parts (algorithm selection, hyperparameter tuning) so they can focus on harder problems.
-
AutoGluon wins on tabular data — For structured data problems, AutoGluon’s multi-layer stacking consistently achieves state-of-the-art results. Start here before hand-crafting models.
-
Feature stores solve training-serving skew — By using the same feature definitions for training and serving, you eliminate a major source of production ML failures.
-
Point-in-time correctness is non-negotiable — Data leakage from future information is the silent killer of ML models. Feature stores with timestamp-aware joins prevent this.
-
Feature reuse compounds over time — Every feature you add to the store can be used by multiple models. After a year, you have a powerful feature library that accelerates all new projects.
-
Set time limits on AutoML — Without constraints, AutoML will run forever. Always specify time_limit and use appropriate presets for your use case.
-
Validate AutoML feature importance — Suspiciously high importance scores often indicate data leakage. Always sanity-check before trusting AutoML’s discoveries.
-
Online vs offline stores serve different needs — Offline for training (historical, high volume), online for serving (latest values, low latency). Both share feature definitions.
-
AutoML presets matter — ‘best_quality’ for maximum accuracy, ‘optimize_for_deployment’ for production serving. Choose based on your constraints.
-
The economics are compelling — Feature stores and AutoML typically deliver 300%+ ROI through faster development, reduced errors, and engineer time savings.
Further Reading
Section titled “Further Reading”- AutoGluon: https://auto.gluon.ai/ - Amazon’s state-of-the-art AutoML framework, excels at tabular data
- Feast: https://feast.dev/ - Open-source feature store, great for getting started
- MLflow: https://mlflow.org/ - Experiment tracking and model registry
- Featuretools: https://featuretools.alteryx.com/ - Automated feature engineering for relational data
Papers
Section titled “Papers”- “AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data” (2020) - The paper behind AutoGluon’s design, explains multi-layer stacking
- “Feast: Feature Store for Machine Learning” (2021) - Architecture and design decisions for Feast
- “Auto-sklearn 2.0” (2020) - Meta-learning approach to AutoML
- “Michelangelo: Uber’s Machine Learning Platform” (2017) - Original feature store architecture at scale
Next Steps
Section titled “Next Steps”You’ve completed Phase 8: Classical ML! You now understand:
- Gradient boosting (XGBoost, LightGBM)
- Time series forecasting (ARIMA, Prophet)
- AutoML and feature stores
Up Next: Phase 9 - AI Safety & Evaluation
Module 39 Complete! “AutoML doesn’t replace ML engineers - it multiplies their productivity.”