# Module 9.3: Feature Stores
Toolkit Track | Complexity: Medium | Time: 40-45 minutes
## Overview

Feature stores solve one of the most frustrating problems in ML: feature reuse. Data scientists spend much of their time on feature engineering, and then those features sit in notebooks, unused by anyone else. Feature stores centralize features, ensure training-serving consistency, and make features discoverable. This module covers Feast, the leading open-source feature store, on Kubernetes.
What You’ll Learn:
- Feature store concepts and architecture
- Feast installation and configuration
- Defining and managing features
- Online vs offline serving
- Integration with ML pipelines
Prerequisites:
- Basic ML concepts
- Python familiarity
- Kubernetes fundamentals
- MLOps Discipline recommended
## What You’ll Be Able to Do

After completing this module, you will be able to:
- Deploy Feast as a feature store for serving ML features consistently between training and inference
- Configure feature definitions, data sources, and materialization jobs for online and offline stores
- Implement real-time feature serving with point-in-time joins for model prediction pipelines
- Evaluate feature store architectures for reducing training-serving skew in production ML systems
## Why This Module Matters

Every ML team eventually builds the same features. User click counts, transaction aggregates, text embeddings—they’re reinvented in every project. Without a feature store, you have notebooks full of feature logic that nobody can find. Feature stores make features first-class citizens: versioned, documented, and shared across the organization.
💡 Did You Know? Uber built Michelangelo (their ML platform) in 2017 and discovered that 60% of engineer time was spent on features. They created a feature store, and that 60% dropped to 15%. Other companies noticed: Airbnb built Zipline, LinkedIn built Feathr, and the pattern became a standard. Feast emerged as the open-source solution that anyone could use.
## Feature Store Architecture

```
FEAST ARCHITECTURE
════════════════════════════════════════════════════════════════════

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Feature    │     │    Feast     │     │   Feature    │
│ Definitions  │────▶│   Registry   │◀────│    Views     │
│   (Python)   │     │  (Storage)   │     │  (Queries)   │
└──────────────┘     └──────────────┘     └──────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
        ▼                   ▼                   ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Offline   │     │   Online    │     │   Feature   │
│    Store    │     │    Store    │     │   Server    │
│ (BigQuery,  │     │   (Redis,   │     │   (gRPC)    │
│  Parquet)   │     │  DynamoDB)  │     │             │
└─────────────┘     └─────────────┘     └─────────────┘
        │                   │                   │
        ▼                   ▼                   ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Training   │     │  Real-time  │     │    Batch    │
│   (Batch)   │     │  Inference  │     │  Inference  │
└─────────────┘     └─────────────┘     └─────────────┘

DATA FLOW:
─────────────────────────────────────────────────────────────────
Raw Data ──Transform──▶ Offline Store ──Materialize──▶ Online Store ──Get Features──▶ ML Model
```

## Key Concepts

| Concept | Description | Example |
|---|---|---|
| Feature | Single computed value | user_total_orders |
| Feature View | Group of related features | user_features (orders, spend, tenure) |
| Entity | What features describe | user_id, product_id |
| Data Source | Where raw data comes from | BigQuery, Parquet, Kafka |
| Offline Store | Historical features for training | Data warehouse |
| Online Store | Latest features for inference | Redis, DynamoDB |
💡 Did You Know? The training-serving skew problem was so common that it got its own name. Models trained on batch-computed features would fail in production where features were computed differently. Feature stores guarantee that the exact same feature computation runs in both places—solving a problem that caused countless production incidents.
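The guarantee can be pictured with a toy sketch (the `compute_features` helper below is hypothetical, not a Feast API): the same function builds training rows and serves a live request, so the two paths cannot drift apart.

```python
from datetime import datetime

def compute_features(orders: list) -> dict:
    """One shared definition of the feature logic."""
    total_spend = sum(o["amount"] for o in orders)
    return {
        "total_orders": len(orders),
        "total_spend": total_spend,
        "avg_order_value": total_spend / len(orders) if orders else 0.0,
    }

orders = [
    {"amount": 20.0, "ts": datetime(2024, 1, 3)},
    {"amount": 30.0, "ts": datetime(2024, 1, 9)},
]

# Training path: batch-compute features for historical users
training_row = compute_features(orders)

# Serving path: compute features for one live request
serving_row = compute_features(orders)

# Same code path, so no training-serving skew is possible
assert training_row == serving_row
```

Feature stores enforce exactly this property at platform scale: one registered definition, executed identically for the offline and online paths.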
## Feast Installation

### Local Development

```bash
# Install Feast
pip install feast

# Create a new project
feast init my_feature_repo
cd my_feature_repo
```

```
# Project structure:
my_feature_repo/
├── feature_repo/
│   ├── __init__.py
│   ├── example_repo.py      # Feature definitions
│   └── feature_store.yaml   # Configuration
└── data/
    └── driver_stats.parquet
```

### Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: feast-feature-server
  namespace: feast
spec:
  replicas: 3
  selector:
    matchLabels:
      app: feast-server
  template:
    metadata:
      labels:
        app: feast-server
    spec:
      containers:
        - name: feast
          image: feastdev/feature-server:0.35.0
          command:
            - feast
            - serve
            - --host=0.0.0.0
            - --port=6566
          ports:
            - containerPort: 6566
          env:
            - name: FEAST_REGISTRY
              value: "s3://feast-bucket/registry.pb"
            - name: FEAST_ONLINE_STORE_TYPE
              value: "redis"
            - name: FEAST_REDIS_HOST
              value: "redis.feast:6379"
          volumeMounts:
            - name: feast-config
              mountPath: /app/feature_repo
      volumes:
        - name: feast-config
          configMap:
            name: feast-config
---
apiVersion: v1
kind: Service
metadata:
  name: feast-server
  namespace: feast
spec:
  selector:
    app: feast-server
  ports:
    - port: 6566
      targetPort: 6566
```

### Redis Online Store
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: feast
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7
          ports:
            - containerPort: 6379
          resources:
            requests:
              memory: 256Mi
            limits:
              memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: feast
spec:
  selector:
    app: redis
  ports:
    - port: 6379
```

## Defining Features
### Feature Store Configuration

```yaml
project: my_project
registry: s3://feast-bucket/registry.pb
provider: aws

online_store:
  type: redis
  connection_string: redis.feast:6379

offline_store:
  type: file  # or bigquery, redshift, snowflake

entity_key_serialization_version: 2
```

### Feature Definitions
```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64, String

# Define entities (what features describe)
user = Entity(
    name="user_id",
    description="Unique user identifier",
    join_keys=["user_id"],
)

# Define a data source
user_stats_source = FileSource(
    name="user_stats",
    path="s3://data-lake/user_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp",
)

# Define a feature view
user_features = FeatureView(
    name="user_features",
    entities=[user],
    ttl=timedelta(days=1),  # How long features are valid
    schema=[
        Field(name="total_orders", dtype=Int64),
        Field(name="total_spend", dtype=Float32),
        Field(name="avg_order_value", dtype=Float32),
        Field(name="days_since_last_order", dtype=Int64),
        Field(name="favorite_category", dtype=String),
    ],
    source=user_stats_source,
    online=True,  # Materialize to the online store
    tags={"team": "growth", "version": "v1"},
)
```
```python
# Streaming source example
from feast import KafkaSource
from feast.data_format import AvroFormat

# AVRO_SCHEMA is assumed to be defined elsewhere in the repo
user_activity_source = KafkaSource(
    name="user_activity_stream",
    kafka_bootstrap_servers="kafka.default:9092",
    topic="user-activity",
    timestamp_field="event_timestamp",
    message_format=AvroFormat(schema_json=AVRO_SCHEMA),
)

realtime_user_features = FeatureView(
    name="realtime_user_features",
    entities=[user],
    ttl=timedelta(minutes=5),  # Short TTL for real-time features
    schema=[
        Field(name="clicks_last_5min", dtype=Int64),
        Field(name="session_duration", dtype=Float32),
    ],
    source=user_activity_source,
    online=True,
)
```

### Feature Services
```python
from feast import FeatureService

# Group features for a specific use case
recommendation_features = FeatureService(
    name="recommendation_features",
    features=[
        user_features[["total_orders", "favorite_category"]],
        realtime_user_features[["clicks_last_5min"]],
    ],
    tags={"model": "recommendation-v2"},
)

fraud_detection_features = FeatureService(
    name="fraud_detection_features",
    features=[
        user_features[["total_spend", "days_since_last_order"]],
        transaction_features,  # Another feature view, defined elsewhere
    ],
)
```

💡 Did You Know? Feature Services solved the “which features does my model need?” problem. Before Feature Services, data scientists had to remember which features went with which model. Now they define it once: “This model needs these features.” Anyone deploying the model just references the Feature Service, and Feast handles the rest.
## Materializing and Serving Features

### Apply Feature Definitions

```bash
# Register features with Feast
feast apply

# Output:
# Created entity user_id
# Created feature view user_features
# Created feature service recommendation_features
```

### Materialize to Online Store

```bash
# Materialize historical data to the online store
feast materialize-incremental $(date +%Y-%m-%dT%H:%M:%S)

# Or with a specific date range
feast materialize 2024-01-01T00:00:00 2024-01-15T00:00:00
```

```yaml
# Scheduled materialization (cron job)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: feast-materialize
  namespace: feast
spec:
  schedule: "0 * * * *"  # Every hour
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: feast
              image: feastdev/feature-server:0.35.0
              # Run through a shell so $(date ...) is expanded
              command:
                - /bin/sh
                - -c
                - feast materialize-incremental "$(date -Iseconds)"
          restartPolicy: OnFailure
```

### Getting Features for Training
```python
from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path=".")

# Entity DataFrame (what you want features for)
entity_df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "event_timestamp": pd.to_datetime([
        "2024-01-15 10:00:00",
        "2024-01-15 11:00:00",
        "2024-01-15 12:00:00",
        "2024-01-15 13:00:00",
        "2024-01-15 14:00:00",
    ]),
})

# Get historical features (point-in-time correct)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_features:total_orders",
        "user_features:total_spend",
        "user_features:favorite_category",
    ],
).to_df()

print(training_df)
# user_id | event_timestamp  | total_orders | total_spend | favorite_category
# 1       | 2024-01-15 10:00 | 42           | 1250.00     | electronics
# ...
```

### Getting Features for Inference
```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Get online features (latest values)
feature_vector = store.get_online_features(
    features=[
        "user_features:total_orders",
        "user_features:total_spend",
        "realtime_user_features:clicks_last_5min",
    ],
    entity_rows=[
        {"user_id": 123},
        {"user_id": 456},
    ],
).to_dict()

print(feature_vector)
# {
#     "user_id": [123, 456],
#     "total_orders": [42, 15],
#     "total_spend": [1250.00, 450.00],
#     "clicks_last_5min": [7, 2],
# }
```

### gRPC Feature Server
```python
# Using gRPC for low-latency inference
import grpc

from feast.protos.feast.serving.ServingService_pb2 import GetOnlineFeaturesRequest
from feast.protos.feast.serving.ServingService_pb2_grpc import ServingServiceStub
from feast.protos.feast.types.Value_pb2 import RepeatedValue, Value

channel = grpc.insecure_channel("feast-server.feast:6566")
stub = ServingServiceStub(channel)

# Entity values are wrapped in protobuf Value messages
request = GetOnlineFeaturesRequest(
    feature_service="recommendation_features",
    entities={
        "user_id": RepeatedValue(val=[Value(int64_val=123), Value(int64_val=456)]),
    },
)

response = stub.GetOnlineFeatures(request)
```

## Feature Store Patterns
### Point-in-Time Joins

```
POINT-IN-TIME CORRECTNESS
════════════════════════════════════════════════════════════════════

Problem: Training data must use features AS THEY WERE at prediction time.

Timeline:
─────────────────────────────────────────────────────────────────
Jan 1      Jan 5      Jan 10     Jan 15     Jan 20
  │          │          │          │          │
  ▼          ▼          ▼          ▼          ▼
total=10   total=12   total=15   total=20   total=25

Training example: if the user converted on Jan 10, what was their
feature value?

WRONG: use the latest value (25)
RIGHT: use the value at Jan 10 (15)
```

Feast handles this automatically:

```python
entity_df = pd.DataFrame({
    "user_id": [1],
    "event_timestamp": pd.to_datetime(["2024-01-10"]),  # Training timestamp
})

# Returns total_orders=15 (the value AT that time)
store.get_historical_features(entity_df=entity_df, features=features)
```

### Feature Transformation
```python
# On-demand features (computed at request time)
import pandas as pd

from feast import Field, on_demand_feature_view
from feast.types import Float64

@on_demand_feature_view(
    sources=[user_features],
    schema=[Field(name="spend_per_order", dtype=Float64)],
)
def user_derived_features(inputs: pd.DataFrame) -> pd.DataFrame:
    df = pd.DataFrame()
    df["spend_per_order"] = inputs["total_spend"] / inputs["total_orders"]
    return df
```

### Feature Freshness
| Pattern | Latency | Use Case |
|---|---|---|
| Batch | Hours | Historical aggregates (`total_lifetime_spend`) |
| Streaming | Seconds to minutes | Real-time aggregates (`clicks_last_hour`) |
| On-Demand | Milliseconds | Request-time computation (`distance_to_store`) |

```
ARCHITECTURE:
─────────────────────────────────────────────────────────────────
                ┌─────────────────┐
                │    ML Model     │
                └────────┬────────┘
                         │
     ┌───────────────────┼───────────────────┐
     ▼                   ▼                   ▼
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│    Batch    │   │  Streaming  │   │  On-Demand  │
│  Features   │   │  Features   │   │  Features   │
│  (Nightly)  │   │   (Kafka)   │   │  (Request)  │
└─────────────┘   └─────────────┘   └─────────────┘
```

💡 Did You Know? On-demand features solved the “I need to compute something at request time” problem. Before Feast 0.20, you had two choices: materialize everything (slow) or compute outside Feast (training-serving skew). On-demand features let you define transformations that run during `get_online_features()`, keeping the logic centralized.
## Common Mistakes

| Mistake | Problem | Solution |
|---|---|---|
| No TTL on features | Stale data served | Set appropriate ttl |
| Ignoring point-in-time | Data leakage in training | Always use get_historical_features |
| Too many feature views | Query complexity | Group related features |
| No feature documentation | Features undiscoverable | Use tags and descriptions |
| Skipping materialization | Online store empty | Schedule regular materialization |
| Same features in multiple views | Inconsistency | Create shared feature views |
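The TTL row deserves emphasis: a missing or overly long TTL silently serves stale values. The staleness check itself is simple, as this toy sketch shows (`is_fresh` is a hypothetical helper illustrating the idea, not Feast's internal implementation):

```python
from datetime import datetime, timedelta

def is_fresh(feature_ts: datetime, now: datetime, ttl: timedelta) -> bool:
    """A feature value should be trusted only while younger than its TTL."""
    return now - feature_ts <= ttl

now = datetime(2024, 1, 15, 12, 0)
ttl = timedelta(days=1)

assert is_fresh(datetime(2024, 1, 15, 3, 0), now, ttl)       # 9 hours old: fresh
assert not is_fresh(datetime(2024, 1, 13, 12, 0), now, ttl)  # 2 days old: stale
```

Choosing the `ttl` on a `FeatureView` is effectively choosing how stale a served value is allowed to be before it is dropped.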
## War Story: The Time-Traveling Feature

A fraud detection model had 99% accuracy in training but 60% in production. The team couldn’t figure out why.

What went wrong:

- Training data included the `is_fraud` label
- The feature `transactions_after_fraud_report` was computed over all data
- In training, this used future knowledge (transactions AFTER fraud was reported)
- In production, this feature was always 0 (no future data)

The fix:

```python
# Entity DataFrame with correct timestamps
entity_df = pd.DataFrame({
    "user_id": user_ids,
    "event_timestamp": transaction_timestamps,  # Not report timestamps!
})

# Feast returns features AS OF transaction time
# No future leakage possible
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["transaction_features:amount", "user_features:total_spend"],
).to_df()
```

Lesson: Point-in-time correctness isn’t optional. If you’re not using a feature store for training data, you’re probably leaking future information.
## Question 1

What’s the difference between offline and online stores?
Show Answer
Offline Store:
- Stores historical feature values
- Used for training data generation
- High latency (seconds to minutes)
- High throughput for batch queries
- Examples: BigQuery, Parquet, Redshift
Online Store:
- Stores latest feature values
- Used for real-time inference
- Low latency (milliseconds)
- Key-value lookups by entity
- Examples: Redis, DynamoDB, Bigtable
Workflow:
- Compute features → Offline store
- Materialize → Copy latest values to Online store
- Training → Query Offline store
- Inference → Query Online store
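The materialize step in this workflow can be sketched in a few lines of plain Python (toy in-memory stores, not Feast’s implementation): copy the latest value per entity from the append-only offline log into a key-value online store.

```python
# Offline store: append-only history of (entity, timestamp, value) rows
offline_store = [
    ("user:1", "2024-01-01", 10),
    ("user:1", "2024-01-15", 12),
    ("user:2", "2024-01-01", 5),
]

def materialize(offline: list) -> dict:
    """Copy the latest value per entity into an online key-value store."""
    online = {}
    for entity, ts, value in sorted(offline, key=lambda row: row[1]):
        online[entity] = value  # later timestamps overwrite earlier ones
    return online

online_store = materialize(offline_store)

# Inference reads the latest value with a fast key lookup
assert online_store["user:1"] == 12
assert online_store["user:2"] == 5
```

Training still queries the full offline history; only the latest snapshot is copied online for low-latency lookups.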
## Question 2

Why is point-in-time correctness important?
Show Answer
Point-in-time correctness prevents data leakage:
Without it:

```
Training example: User A, January 15
Feature "total_purchases" = 100 (includes purchases through March!)
Result: Model learns from future data
Production: Only knows purchases up to "now"
Result: Model performs worse than expected
```

With point-in-time correctness:

```
Training example: User A, January 15
Feature "total_purchases" = 42 (only purchases up to January 15)
Result: Model learns from realistic data
Production: Gets features as of prediction time
Result: Model performance matches training
```

Feast’s `get_historical_features` handles this automatically using `event_timestamp`.
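The as-of lookup behind this behavior can be illustrated with a small stdlib sketch (a simplified toy, not Feast’s actual implementation): given a value history sorted by timestamp, return the last value at or before the training timestamp.

```python
from bisect import bisect_right
from datetime import date

def value_as_of(history: list, as_of: date):
    """Return the last known value at or before `as_of` (point-in-time lookup)."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx else None

# total_purchases history for one user, sorted by timestamp
history = [
    (date(2024, 1, 5), 30),
    (date(2024, 1, 12), 42),
    (date(2024, 3, 1), 100),
]

# A training row dated January 15 must see 42, not the later 100
assert value_as_of(history, date(2024, 1, 15)) == 42
assert value_as_of(history, date(2024, 1, 1)) is None  # no data yet
```

Using the latest value (100) instead of the as-of value (42) is exactly the leakage described above.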
## Question 3

When should you use on-demand features vs batch features?
Show Answer
Batch Features when:
- Feature computation is expensive
- Values don’t change frequently
- Historical aggregates (lifetime totals)
- Can tolerate staleness (hours)
On-Demand Features when:
- Depends on request context
- Simple transformations
- Needs to be fresh
- Combines other features
Examples:

```python
# Batch: compute nightly, then materialize
user_lifetime_spend  # Doesn't change fast

# On-demand: compute per request
@on_demand_feature_view(...)  # sources/schema omitted for brevity
def request_features(inputs):
    df = pd.DataFrame()
    # Uses request-time data
    df["distance_to_store"] = haversine(
        inputs["user_lat"], inputs["user_lon"],
        inputs["store_lat"], inputs["store_lon"],
    )
    return df
```

Rule of thumb: if it needs fresh request data, use on-demand; otherwise, batch and materialize.
## Hands-On Exercise

### Objective

Set up Feast and serve features for an ML model.

1. Initialize a Feast project:

   ```bash
   pip install feast
   feast init feast_demo
   cd feast_demo
   ```

2. Create feature definitions (save as `feature_repo/user_features.py`):

   ```python
   from datetime import timedelta

   from feast import Entity, FeatureView, Field, FileSource
   from feast.types import Float32, Int64

   user = Entity(name="user_id", join_keys=["user_id"])

   user_source = FileSource(
       path="data/user_features.parquet",
       timestamp_field="event_timestamp",
   )

   user_features = FeatureView(
       name="user_features",
       entities=[user],
       ttl=timedelta(days=1),
       schema=[
           Field(name="total_purchases", dtype=Int64),
           Field(name="avg_purchase_amount", dtype=Float32),
       ],
       source=user_source,
       online=True,
   )
   ```

3. Create sample data (save as `create_data.py`):

   ```python
   import pandas as pd

   df = pd.DataFrame({
       "user_id": [1, 2, 3, 1, 2, 3],
       "total_purchases": [10, 5, 20, 12, 7, 22],
       "avg_purchase_amount": [50.0, 100.0, 25.0, 55.0, 95.0, 30.0],
       "event_timestamp": pd.to_datetime([
           "2024-01-01", "2024-01-01", "2024-01-01",
           "2024-01-15", "2024-01-15", "2024-01-15",
       ]),
       "created_timestamp": pd.to_datetime(["2024-01-01"] * 6),
   })
   df.to_parquet("feature_repo/data/user_features.parquet")
   print("Data created!")
   ```

   ```bash
   mkdir -p feature_repo/data
   python create_data.py
   ```

4. Apply and materialize:

   ```bash
   cd feature_repo
   feast apply
   feast materialize-incremental $(date +%Y-%m-%dT%H:%M:%S)
   ```

5. Query features (save as `query_features.py`):

   ```python
   from feast import FeatureStore
   import pandas as pd

   store = FeatureStore(repo_path=".")

   # Historical features
   entity_df = pd.DataFrame({
       "user_id": [1, 2],
       "event_timestamp": pd.to_datetime(["2024-01-10", "2024-01-10"]),
   })
   training_df = store.get_historical_features(
       entity_df=entity_df,
       features=[
           "user_features:total_purchases",
           "user_features:avg_purchase_amount",
       ],
   ).to_df()
   print("Training features:")
   print(training_df)

   # Online features
   online_features = store.get_online_features(
       features=[
           "user_features:total_purchases",
           "user_features:avg_purchase_amount",
       ],
       entity_rows=[{"user_id": 1}, {"user_id": 2}],
   ).to_dict()
   print("\nOnline features:")
   print(online_features)
   ```

   ```bash
   python query_features.py
   ```

6. Clean up:

   ```bash
   cd ..
   rm -rf feast_demo
   ```
## Success Criteria

- Feast project initialized
- Feature definitions created
- Sample data generated
- Features applied to registry
- Features materialized to online store
- Can query historical and online features
## Bonus Challenge

Add an on-demand feature that computes `purchases_per_dollar` from the existing features.
## Further Reading

## Toolkit Complete!

Congratulations on completing the ML Platforms Toolkit! You’ve learned:
- Kubeflow for ML workflows and pipelines
- MLflow for experiment tracking and model registry
- Feast for feature storage and serving
These tools form the foundation of a modern MLOps platform on Kubernetes.
“Features are the fuel of machine learning. A feature store is the gas station that keeps your models running—consistent, fresh, and available whenever you need them.”