Skip to content

Model Serving

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6


Prerequisites: Module 50 (ML Pipeline Orchestration)

Seattle. November 2021. The Zillow Offers division, previously hailed as the vanguard of algorithmic real estate, was in systemic freefall. Their sophisticated home price prediction model—the foundational core of their iBuying business—was generating systematically flawed appraisals. The machine learning model failed to account for a rapid, unforeseen cooldown in the housing market, leading the company to dramatically overpay for thousands of properties across the country.

The catastrophic failure wasn’t isolated to the model’s statistical predictive capabilities; it was a fundamental breakdown of model serving, observability, and automated deployment guardrails. Zillow had deployed a powerful machine learning engine but completely lacked the rapid feedback loops and progressive canary deployment strategies required to safely throttle the system when real-world market conditions diverged from the training data distributions. Without a mature serving layer, there was no automated circuit breaker to halt the runaway algorithms.

Because the engineering teams could not quickly identify the severe data drift or safely roll back to a more conservative pricing algorithm without paralyzing their core operations, the financial losses cascaded uncontrollably. By the end of the quarter, Zillow lost over $500 million, shuttered the entire iBuying division, and laid off 2,000 employees. This incident perfectly illustrates that deploying a model to production without an escape hatch is an unacceptable existential risk to the business.

By the end of this module, you will be able to:

  • Design robust, scalable model serving architectures utilizing load balancers, API gateways, and specialized inference serving layers.
  • Evaluate and compare advanced, framework-specific serving solutions (Triton, TorchServe) against general-purpose web frameworks for production workloads.
  • Implement progressive delivery strategies (Canary, Blue-Green) on Kubernetes v1.35+ to mitigate the risk of catastrophic model rollouts.
  • Diagnose production bottlenecks by implementing rigorous request validation, health checks, and graceful shutdown patterns.
  • Debug misconfigured serving infrastructures by analyzing resource saturation and latency telemetry.

Training a highly accurate machine learning model is merely the starting line. Getting that model into a production environment, serving predictions reliably at a massive scale, and maintaining its integrity over time constitutes the vast majority of an ML engineer’s workload. Data scientists often optimize for accuracy in a pristine, static environment, whereas ML engineers must optimize for reliability, latency, and throughput in a chaotic, distributed system.

Think of training a machine learning model like engineering a prototype hypercar in a closed laboratory. It is blindingly fast, powerful, and performs beautifully on a meticulously controlled test track. However, deployment is akin to entering that exact car into a grueling 24-hour endurance race. Suddenly, you require a highly coordinated pit crew (DevOps), comprehensive telemetry (monitoring), race strategy (deployment patterns), and a backup vehicle for when catastrophic failures occur (rollback). Most laboratory prototypes never survive race day without this massive infrastructure supporting them.

The stark contrast between the research phase and the production phase is often referred to as the deployment gap:

RESEARCH PRODUCTION
======== ==========
Jupyter notebooks REST/gRPC APIs
Single GPU Distributed serving
Batch predictions Real-time (<100ms)
"Works on my machine" 99.9% uptime SLA
Manual updates Automated rollouts
No monitoring Full observability

Did You Know? According to a 2022 Gartner report, only 54% of machine learning models ever make it to a production environment, primarily due to extreme deployment complexity and a lack of mature MLOps pipelines within organizations.

2. Serving Architecture and Web Frameworks

Section titled “2. Serving Architecture and Web Frameworks”

A robust model serving architecture decouples the client applications from the raw inference engines. This decoupling allows engineers to route traffic, split loads for A/B testing, and scale the inference hardware independently of the front-end clients. By treating the model as a modular microservice, updates can be rolled out invisibly to the end user.

flowchart TD
subgraph Clients
W[Web App]
M[Mobile App]
B[Backend Service]
end
LB[LOAD BALANCER / API GATEWAY\nNginx, Kong, AWS ALB, Istio]
subgraph Models
M1[Model v1\n90%]
M2[Model v2\n10% Canary / A/B]
end
SL[MODEL SERVING LAYER\nFastAPI / TorchServe / Triton / TF Serving]
MS[MODEL STORAGE\nS3 / GCS / Model Registry]
W --> LB
M --> LB
B --> LB
LB --> M1
LB --> M2
M1 --> SL
M2 --> SL
SL --> MS

FastAPI has emerged as the premier framework for writing custom ML model serving layers in Python. It offers asynchronous execution by default and strict type validation out of the box via Pydantic, ensuring that malformed tensors do not crash your backend inference engines.

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from typing import List, Optional
import numpy as np
app = FastAPI(
title="ML Model API",
description="Production model serving API",
version="1.0.0"
)
# Request/Response Models
class PredictionRequest(BaseModel):
features: List[float] = Field(..., min_items=1, max_items=100)
model_version: Optional[str] = "latest"
class Config:
schema_extra = {
"example": {
"features": [0.5, 0.3, 0.8, 0.2],
"model_version": "v1.2.0"
}
}
class PredictionResponse(BaseModel):
prediction: float
confidence: float
model_version: str
latency_ms: float
# Load model (in production, use proper model registry)
model = None # Your trained model
@app.on_event("startup")
async def load_model():
global model
model = load_trained_model("models/production/model.pkl")
@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
"""
Generate prediction for input features.
"""
import time
start = time.time()
try:
# Preprocess
features = np.array(request.features).reshape(1, -1)
# Predict
prediction = model.predict(features)[0]
confidence = model.predict_proba(features).max()
latency = (time.time() - start) * 1000
return PredictionResponse(
prediction=float(prediction),
confidence=float(confidence),
model_version=request.model_version,
latency_ms=latency
)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@app.get("/health")
async def health():
"""Health check endpoint."""
return {"status": "healthy", "model_loaded": model is not None}
@app.get("/model/info")
async def model_info():
"""Get model metadata."""
return {
"name": "churn_predictor",
"version": "1.2.0",
"framework": "sklearn",
"features": 4,
"classes": ["no_churn", "churn"]
}

Hardware accelerators like GPUs are drastically underutilized when processing single, sequential requests. To maximize throughput and justify the high hardware costs, you must implement batch predictions. FastAPI handles this elegantly using its asynchronous background tasks capabilities.

from fastapi import BackgroundTasks
from typing import List
import asyncio
class BatchRequest(BaseModel):
instances: List[List[float]]
async_mode: bool = False
class BatchResponse(BaseModel):
predictions: List[float]
batch_size: int
total_latency_ms: float
@app.post("/predict/batch", response_model=BatchResponse)
async def predict_batch(request: BatchRequest):
"""
Batch prediction for multiple instances.
More efficient than individual calls.
"""
start = time.time()
features = np.array(request.instances)
predictions = model.predict(features).tolist()
return BatchResponse(
predictions=predictions,
batch_size=len(request.instances),
total_latency_ms=(time.time() - start) * 1000
)
# Async batch processing
@app.post("/predict/async")
async def predict_async(
request: BatchRequest,
background_tasks: BackgroundTasks
):
"""
Submit batch for async processing.
Returns job_id to check status later.
"""
job_id = generate_job_id()
background_tasks.add_task(
process_batch_async,
job_id,
request.instances
)
return {"job_id": job_id, "status": "processing"}

As your service scales to thousands of requests per second, the overhead of parsing JSON text payloads becomes a severe bottleneck. Think of REST and gRPC like sending a hand-written letter versus utilizing an automated binary Morse code transmission. REST sends human-readable JSON, which is excellent for debugging but extremely slow for machines to deserialize.

gRPC uses Protocol Buffers—a strictly typed, compact binary format that allows machines to parse complex multi-dimensional tensors much more efficiently.

graph TD
subgraph REST JSON Payload
R1[JSON Text: ~50 bytes]
R2[\{features: 0.5, 0.3, 0.8, id: 1\}]
R3[Requires Heavy CPU Parsing]
R1 --- R2 --- R3
end
subgraph gRPC Protobuf Payload
G1[Protobuf Binary: ~20 bytes]
G2[0x0a0c0d0000003f15cdcc4c...]
G3[Zero Parsing Overhead]
G1 --- G2 --- G3
end

Pause and predict: If your gRPC service is receiving 10,000 requests per second, how would adding a dynamic batching layer of 50ms alter your P99 latency and overall throughput?

Answer: It would increase the P99 latency by a maximum of 50ms (plus inference time) but would exponentially increase your throughput by allowing the GPU to process thousands of requests simultaneously in a single forward pass, preventing the server from collapsing under the sheer volume of sequential connections.

To utilize gRPC, you define your data structures strictly in a .proto file. This definition acts as an unbreakable binding contract between the client and the serving cluster, eliminating runtime schema errors.

model_service.proto
syntax = "proto3";
package ml_serving;
service ModelService {
// Unary prediction
rpc Predict(PredictRequest) returns (PredictResponse);
// Streaming predictions (for real-time data)
rpc PredictStream(stream PredictRequest) returns (stream PredictResponse);
// Batch prediction
rpc PredictBatch(BatchRequest) returns (BatchResponse);
// Model info
rpc GetModelInfo(Empty) returns (ModelInfo);
}
message PredictRequest {
repeated float features = 1;
string model_version = 2;
}
message PredictResponse {
float prediction = 1;
float confidence = 2;
string model_version = 3;
float latency_ms = 4;
}
message BatchRequest {
repeated PredictRequest instances = 1;
}
message BatchResponse {
repeated PredictResponse predictions = 1;
int32 batch_size = 2;
float total_latency_ms = 3;
}
message ModelInfo {
string name = 1;
string version = 2;
string framework = 3;
int32 num_features = 4;
repeated string classes = 5;
}
message Empty {}

The Python implementation maps these strongly typed objects directly into the runtime context, bypassing the standard HTTP stack entirely for maximum performance.

import grpc
from concurrent import futures
import model_service_pb2
import model_service_pb2_grpc
import numpy as np
import time
class ModelServicer(model_service_pb2_grpc.ModelServiceServicer):
def __init__(self, model):
self.model = model
def Predict(self, request, context):
"""Single prediction."""
start = time.time()
features = np.array(request.features).reshape(1, -1)
prediction = self.model.predict(features)[0]
confidence = self.model.predict_proba(features).max()
return model_service_pb2.PredictResponse(
prediction=float(prediction),
confidence=float(confidence),
model_version=request.model_version or "v1.0",
latency_ms=(time.time() - start) * 1000
)
def PredictStream(self, request_iterator, context):
"""Streaming predictions - process as they arrive."""
for request in request_iterator:
features = np.array(request.features).reshape(1, -1)
prediction = self.model.predict(features)[0]
yield model_service_pb2.PredictResponse(
prediction=float(prediction),
confidence=0.95,
model_version="v1.0",
latency_ms=1.0
)
def PredictBatch(self, request, context):
"""Batch prediction."""
start = time.time()
responses = []
for instance in request.instances:
features = np.array(instance.features).reshape(1, -1)
prediction = self.model.predict(features)[0]
responses.append(model_service_pb2.PredictResponse(
prediction=float(prediction),
confidence=0.95,
model_version="v1.0"
))
return model_service_pb2.BatchResponse(
predictions=responses,
batch_size=len(responses),
total_latency_ms=(time.time() - start) * 1000
)
def serve():
server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
model_service_pb2_grpc.add_ModelServiceServicer_to_server(
ModelServicer(model), server
)
server.add_insecure_port('[::]:50051')
server.start()
server.wait_for_termination()

When your traffic volume outgrows custom FastAPI or gRPC scripts, you must migrate to specialized production frameworks that offer dynamic batching out-of-the-box.

Developed jointly by AWS and Meta, TorchServe provides a robust, standardized environment for serving PyTorch models without writing complex multithreading boilerplate code. It natively supports dynamic batching, rich metrics logging, and multi-model serving on single instances.

# TorchServe handler example
# Save as model_handler.py
from ts.torch_handler.base_handler import BaseHandler
import torch
import json
class ModelHandler(BaseHandler):
def __init__(self):
super().__init__()
self.model = None
def initialize(self, context):
"""Load the model."""
self.manifest = context.manifest
model_dir = context.system_properties.get("model_dir")
# Load model
model_path = f"{model_dir}/model.pt"
self.model = torch.jit.load(model_path)
self.model.eval()
def preprocess(self, data):
"""Preprocess input data."""
inputs = []
for row in data:
input_data = row.get("data") or row.get("body")
if isinstance(input_data, (bytes, bytearray)):
input_data = json.loads(input_data.decode('utf-8'))
inputs.append(torch.tensor(input_data["features"]))
return torch.stack(inputs)
def inference(self, inputs):
"""Run model inference."""
with torch.no_grad():
outputs = self.model(inputs)
return outputs
def postprocess(self, outputs):
"""Format outputs for response."""
predictions = outputs.numpy().tolist()
return [{"prediction": p} for p in predictions]

NVIDIA’s Triton is the pinnacle of high-performance hardware-accelerated serving. It supports concurrent model execution, dynamic batching, and highly efficient ensemble scheduling. Ensemble scheduling is incredibly powerful because it allows you to chain preprocessing models directly to inference models completely within GPU memory, avoiding any costly network hops or memory transfers back to the CPU.

config.pbtxt
# Triton model configuration
name: "ensemble_model"
platform: "ensemble"
max_batch_size: 64
input [
{
name: "INPUT"
data_type: TYPE_FP32
dims: [ -1, 128 ] # Dynamic batch, 128 features
}
]
output [
{
name: "OUTPUT"
data_type: TYPE_FP32
dims: [ -1, 1 ]
}
]
ensemble_scheduling {
step [
{
model_name: "preprocessing"
model_version: -1
input_map {
key: "INPUT"
value: "INPUT"
}
output_map {
key: "PROCESSED"
value: "preprocessed"
}
},
{
model_name: "main_model"
model_version: -1
input_map {
key: "preprocessed"
value: "PROCESSED"
}
output_map {
key: "OUTPUT"
value: "OUTPUT"
}
}
]
}
FeatureFastAPITorchServeTritonTF Serving
FrameworkAnyPyTorchMultiTensorFlow
ProtocolRESTREST/gRPCREST/gRPCREST/gRPC
BatchingManualDynamicDynamicDynamic
GPU SupportManualBuilt-inBuilt-inBuilt-in
Model FormatAnyTorchScriptONNX/TRTSavedModel
ComplexityLowMediumHighMedium
Use CaseSimple APIsPyTorch modelsHigh perf multi-modelTF models

Advanced deployment patterns are non-negotiable for enterprise ML platforms. They provide the safety nets required to deploy models confidently, ensuring you can pull the plug instantly if the new weights behave erratically.

Blue-Green routing provisions an entirely identical production environment alongside your current one. You perform the deployment to the hidden environment, run extensive smoke tests, and then instruct the load balancer to switch traffic instantaneously. The rollback is equally instantaneous, providing massive psychological safety for the operations team.

flowchart TD
subgraph Phase 1: Before Update
LB1[Load Balancer]
B1[BLUE v1.0\n100% traffic]
G1[GREEN v1.0\n0% standby]
LB1 --> B1
LB1 -.-> G1
end
subgraph Phase 2: Deploy & Test
LB2[Load Balancer]
B2[BLUE v1.0\n100% traffic]
G2[GREEN v2.0\nTesting]
LB2 --> B2
LB2 -.-> G2
end
subgraph Phase 3: Switch Traffic
LB3[Load Balancer]
B3[BLUE v1.0\n0% standby]
G3[GREEN v2.0\n100% traffic]
LB3 -.-> B3
LB3 --> G3
end

Canary deployments gradually expose the new model version to a tiny subset of real production traffic, slowly increasing the percentage as confidence grows. Automatic rollback typically occurs quickly if error rates spike, latency breaches P99 thresholds, or downstream systems report anomalies.

flowchart TD
subgraph Progressive Rollout
LB[Load Balancer]
S[STABLE v1.0\n95% traffic]
C[CANARY v2.0\n5% traffic]
LB -->|95%| S
LB -->|5%| C
end

A/B testing. Canary deployment focuses strictly on operational system safety by gradually shifting traffic. A/B testing is explicitly designed to measure long-term business outcomes by splitting traffic deterministically (via hashing) and statistically comparing the resulting engagement metrics.

class ABTestRouter:
"""
Route requests to different model versions for A/B testing.
"""
def __init__(self):
self.experiments = {}
def create_experiment(
self,
experiment_id: str,
control_model: str,
treatment_model: str,
traffic_split: float = 0.5
):
"""Create new A/B experiment."""
self.experiments[experiment_id] = {
"control": control_model,
"treatment": treatment_model,
"split": traffic_split,
"metrics": {"control": [], "treatment": []}
}
def route_request(self, experiment_id: str, user_id: str) -> str:
"""
Deterministically route user to model variant.
Same user always gets same variant (for consistency).
"""
experiment = self.experiments[experiment_id]
# Hash user_id for deterministic assignment
hash_value = hash(f"{experiment_id}:{user_id}") % 100
is_treatment = hash_value < (experiment["split"] * 100)
return experiment["treatment"] if is_treatment else experiment["control"]
def record_outcome(
self,
experiment_id: str,
variant: str,
prediction: float,
actual: float
):
"""Record prediction outcome for analysis."""
self.experiments[experiment_id]["metrics"][variant].append({
"prediction": prediction,
"actual": actual,
"correct": (prediction > 0.5) == (actual > 0.5)
})
def analyze_experiment(self, experiment_id: str) -> dict:
"""
Statistical analysis of A/B test results.
"""
from scipy import stats
exp = self.experiments[experiment_id]
control = exp["metrics"]["control"]
treatment = exp["metrics"]["treatment"]
control_accuracy = sum(m["correct"] for m in control) / len(control)
treatment_accuracy = sum(m["correct"] for m in treatment) / len(treatment)
# Statistical significance test
control_correct = [m["correct"] for m in control]
treatment_correct = [m["correct"] for m in treatment]
t_stat, p_value = stats.ttest_ind(control_correct, treatment_correct)
return {
"control_accuracy": control_accuracy,
"treatment_accuracy": treatment_accuracy,
"improvement": treatment_accuracy - control_accuracy,
"p_value": p_value,
"significant": p_value < 0.05,
"recommendation": "deploy_treatment" if (
treatment_accuracy > control_accuracy and p_value < 0.05
) else "keep_control"
}

Stop and think: When implementing A/B testing for a recommendation engine, why is deterministic routing based on a hash of the user ID critical for the validity of your statistical results?

Answer: If routing isn’t deterministic, a single user refreshing the page might randomly bounce between the control and treatment models. This obliterates user experience and completely invalidates the statistical independence required for an accurate hypothesis test.

To scale massively without burning through capital, models must be heavily optimized post-training.

Did You Know? The ONNX format was announced at the 2017 Neural Information Processing Systems (NeurIPS) conference and within two years, it had support from over 30 frameworks and hardware platforms.

import onnx
import onnxruntime as ort
import torch
# Export PyTorch model to ONNX
def export_to_onnx(model, sample_input, output_path):
"""Export PyTorch model to ONNX format."""
model.eval()
torch.onnx.export(
model,
sample_input,
output_path,
export_params=True,
opset_version=14,
do_constant_folding=True,
input_names=['input'],
output_names=['output'],
dynamic_axes={
'input': {0: 'batch_size'},
'output': {0: 'batch_size'}
}
)
# Verify the model
onnx_model = onnx.load(output_path)
onnx.checker.check_model(onnx_model)
print(f"Model exported to {output_path}")
# Run inference with ONNX Runtime
class ONNXPredictor:
def __init__(self, model_path: str):
# Use GPU if available
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
self.session = ort.InferenceSession(model_path, providers=providers)
# Get input/output names
self.input_name = self.session.get_inputs()[0].name
self.output_name = self.session.get_outputs()[0].name
def predict(self, features: np.ndarray) -> np.ndarray:
"""Run inference."""
return self.session.run(
[self.output_name],
{self.input_name: features.astype(np.float32)}
)[0]
def benchmark(self, features: np.ndarray, iterations: int = 100) -> dict:
"""Benchmark inference performance."""
import time
# Warmup
for _ in range(10):
self.predict(features)
# Benchmark
latencies = []
for _ in range(iterations):
start = time.time()
self.predict(features)
latencies.append((time.time() - start) * 1000)
return {
"mean_ms": np.mean(latencies),
"p50_ms": np.percentile(latencies, 50),
"p95_ms": np.percentile(latencies, 95),
"p99_ms": np.percentile(latencies, 99),
"throughput_qps": 1000 / np.mean(latencies)
}

Converting to TensorRT yields exponential gains on NVIDIA hardware by fusing kernel operations and quantizing weights. It rewrites the computational graph to perfectly match the underlying physical architecture of the specific GPU model.

# TensorRT for NVIDIA GPU optimization
import tensorrt as trt
def optimize_with_tensorrt(onnx_path: str, engine_path: str):
"""
Convert ONNX model to TensorRT engine.
Can provide 2-5x speedup on NVIDIA GPUs.
"""
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
# Parse ONNX
with open(onnx_path, 'rb') as f:
if not parser.parse(f.read()):
for error in range(parser.num_errors):
print(parser.get_error(error))
raise RuntimeError("ONNX parsing failed")
# Configure builder
config = builder.create_builder_config()
config.max_workspace_size = 1 << 30 # 1GB
# Enable FP16 for faster inference (with minimal accuracy loss)
if builder.platform_has_fast_fp16:
config.set_flag(trt.BuilderFlag.FP16)
# Build engine
engine = builder.build_engine(network, config)
# Save
with open(engine_path, 'wb') as f:
f.write(engine.serialize())
print(f"TensorRT engine saved to {engine_path}")

Optimization Impact:

FrameworkLatency (ms)Throughput (QPS)
PyTorch (eager)15.266
PyTorch (compiled)8.5118
ONNX Runtime6.2161
TensorRT FP324.1244
TensorRT FP162.3435
TensorRT INT81.5667

Model serving operates in a hostile environment where networks drop packets and downstream services constantly fail. Defensive programming is strictly required.

@app.get("/health")
async def health():
"""Liveness check."""
return {"status": "alive"}
@app.get("/ready")
async def ready():
"""Readiness check - is the model loaded?"""
if model is None:
raise HTTPException(503, "Model not loaded")
return {"status": "ready", "model_version": model.version}
import signal
class GracefulShutdown:
def __init__(self):
self.shutdown = False
signal.signal(signal.SIGTERM, self._handler)
signal.signal(signal.SIGINT, self._handler)
def _handler(self, signum, frame):
print("Shutdown signal received")
self.shutdown = True
async def wait_for_requests(self, timeout: int = 30):
"""Wait for in-flight requests to complete."""
start = time.time()
while active_requests > 0 and (time.time() - start) < timeout:
await asyncio.sleep(0.1)
from pydantic import validator
class PredictionRequest(BaseModel):
features: List[float]
@validator('features')
def validate_features(cls, v):
if len(v) != 10:
raise ValueError(f"Expected 10 features, got {len(v)}")
if any(not (-100 <= f <= 100) for f in v):
raise ValueError("Features must be in range [-100, 100]")
return v

A centralized model registry serves as the system of record. Reverting to a previous state must be a fast, automated API call, not an archaeological expedition through an S3 bucket.

class ModelRegistry:
"""
Simple model registry for version management.
"""
def __init__(self, storage_path: str):
self.storage_path = Path(storage_path)
self.storage_path.mkdir(parents=True, exist_ok=True)
self.registry_file = self.storage_path / "registry.json"
self.registry = self._load_registry()
def _load_registry(self) -> dict:
if self.registry_file.exists():
return json.loads(self.registry_file.read_text())
return {"models": {}, "production": {}}
def _save_registry(self):
self.registry_file.write_text(json.dumps(self.registry, indent=2))
def register_model(
self,
model_name: str,
version: str,
model_path: str,
metrics: dict,
metadata: dict = None
):
"""Register a new model version."""
if model_name not in self.registry["models"]:
self.registry["models"][model_name] = {}
self.registry["models"][model_name][version] = {
"path": model_path,
"metrics": metrics,
"metadata": metadata or {},
"registered_at": datetime.now().isoformat(),
"stage": "staging"
}
self._save_registry()
return f"Registered {model_name}:{version}"
def promote_to_production(self, model_name: str, version: str):
"""Promote a model version to production."""
if model_name not in self.registry["models"]:
raise ValueError(f"Model {model_name} not found")
if version not in self.registry["models"][model_name]:
raise ValueError(f"Version {version} not found")
# Demote current production
if model_name in self.registry["production"]:
old_version = self.registry["production"][model_name]
self.registry["models"][model_name][old_version]["stage"] = "archived"
# Promote new version
self.registry["models"][model_name][version]["stage"] = "production"
self.registry["production"][model_name] = version
self._save_registry()
return f"Promoted {model_name}:{version} to production"
def rollback(self, model_name: str, to_version: str):
"""Rollback to a previous version."""
return self.promote_to_production(model_name, to_version)
def get_production_model(self, model_name: str) -> dict:
"""Get current production model info."""
if model_name not in self.registry["production"]:
raise ValueError(f"No production model for {model_name}")
version = self.registry["production"][model_name]
return {
"version": version,
**self.registry["models"][model_name][version]
}

8. Common Mistakes and The Economics of Inference

Section titled “8. Common Mistakes and The Economics of Inference”

Did You Know? OpenAI reportedly spends over 700,000perdayoninferencecomputeforChatGPTasofearly2023,whichiswhyamere1700,000 per day on inference compute for ChatGPT as of early 2023, which is why a mere 1% optimization in model serving can save over 2.5 million annually.

Did You Know? In 2012, Amazon found that every 100ms of latency cost them 1% in sales, a metric that remains foundational in modern ML serving architectures where inference targets often sit below 50ms.

If you don’t track your serving costs, your successful model will bankrupt the engineering department.

ComponentTypical CostNotes
Compute (CPU)$0.05-0.10/hour/vCPUFor preprocessing, lightweight models
Compute (GPU)$0.50-4.00/hourA10G: 0.50,A100:0.50, A100: 3-4
Load balancer0.025/hour+0.025/hour + 0.008/GBOften overlooked fixed cost
Storage (model artifacts)$0.02/GB/monthMinimal unless you keep many versions
Network egress$0.09/GBCan add up with high throughput
Model registry$0-500/monthFree tiers available, enterprise costs more
ScaleMonthly CostCost per 1M Predictions
Small (100K/day)$500-2,000$15-60
Medium (10M/day)$5,000-20,000$1.50-6
Large (1B/day)$100,000-500,000$0.10-0.50
MistakeWhy it HappensFix
Deploying Without RollbackTeams prioritize speed over safety, assuming the model is “perfect.”Automate versioned deployments and keep old versions on standby.
Ignoring P99 LatencyP50 looks fine, but tail latency affects your most active/important users.Set alerts based on P95/P99 metrics and optimize tail latency.
Testing with Stale DataProduction data drifts; models overfit to historical distributions.Use shadow deployments or replay recent production traffic.
Missing Load Tests”Works on my machine” mentality; lack of simulated peak traffic.Run soak tests at 2-5x expected peak load.
Hardcoding Model PathsModel paths are baked into the container, requiring full rebuilds to swap.Use dynamic loading from a Model Registry or environmental variables.
Training/Serving SkewRe-implementing feature engineering in serving differently than training.Utilize a centralized Feature Store for both training and inference.

The most inexcusable error is overwriting active files directly in production without a versioned escape hatch:

# BAD: Overwriting the production model
cp new_model.pt /models/production/model.pt
# There is no old model anymore. Rollback = find it on someone's laptop.
# GOOD: Versioned deployment
cp new_model.pt /models/v2.1.0/model.pt
ln -sf /models/v2.1.0 /models/current
# Rollback = ln -sf /models/v2.0.0 /models/current
MODEL DEPLOYMENT PATTERNS
=========================
SERVING OPTIONS:
FastAPI - Simple, flexible, Python-native
gRPC - High performance, strong typing
TorchServe - PyTorch-native, dynamic batching
Triton - Multi-framework, GPU optimized
TF Serving - TensorFlow models
DEPLOYMENT PATTERNS:
Blue-Green - Instant rollback, zero downtime
Canary - Progressive rollout, risk mitigation
A/B Testing - Statistical comparison, data-driven
OPTIMIZATION:
ONNX - Universal format, cross-platform
TensorRT - NVIDIA GPU optimization (2-5x speedup)
Quantization - INT8 for smaller models, faster inference
KEY METRICS:
Latency - P50, P95, P99
Throughput - Queries per second
Availability - 99.9% uptime target

1. Scenario: You are tasked with serving a massive Deep Learning model where requests contain high-dimensional image tensors. The current REST API is saturating network bandwidth and CPU parsing cycles. Which protocol should you migrate to and why? gRPC is the optimal choice. It uses Protocol Buffers, a compact binary format that drastically reduces payload size compared to JSON's text-based serialization. This minimizes network overhead and serialization latency, critical for massive high-dimensional tensor data.
2. Scenario: Your new fraud detection model is ready for production. A false positive blocks a legitimate user's transaction, which is highly disruptive. Which deployment strategy minimizes operational risk while validating the model against real traffic? Canary deployment. By routing only a strictly limited percentage (e.g., 5%) of live traffic to the new model, you can empirically validate its real-world performance. If metrics degrade, you automatically halt the rollout and revert, strictly capping the blast radius.
3. What is the primary architectural purpose of exporting a PyTorch model to the ONNX format? ONNX provides a widely supported, framework-agnostic mathematical format for model weights and execution graphs. Exporting to ONNX decouples the training framework from the serving infrastructure, allowing the model to be optimized and deployed across diverse hardware and high-performance execution engines like TensorRT.
4. Scenario: A recommendation model's initialization takes 15 seconds due to heavy GPU allocation routines. During traffic spikes, Kubernetes autoscales the pods, but users routed to these new pods experience extreme timeouts. How do you resolve this architectural flaw? Implement a pre-warming and readiness strategy. The container should execute a dummy inference payload during startup to comprehensively initialize the GPU memory before the `/ready` Kubernetes probe returns an HTTP 200 OK. This ensures Kubernetes only routes traffic to fully warmed instances.
5. Why does dynamic batching significantly improve GPU utilization during model serving compared to handling requests sequentially? GPUs are specifically designed for massive parallel processing. Processing a batch of 64 requests simultaneously takes only fractionally longer than processing a single request. Dynamic batching accumulates incoming network requests for a brief time window (e.g., 10ms) and processes them in a single forward pass, dramatically increasing aggregate throughput.
6. Scenario: You need to deploy multiple iterations of a pricing model to measure which one generates higher overall user conversion rates over a 2-week period. Should you use Canary deployment or A/B testing? A/B testing. Canary deployment focuses strictly on operational system safety by gradually shifting traffic. A/B testing is explicitly designed to measure long-term business outcomes by splitting traffic deterministically (via hashing) and statistically comparing the resulting engagement metrics.
7. What occurs if a deployed serving container lacks a configured liveness probe in its Kubernetes Pod specification? If the serving application deadlocks or crashes silently without exiting the main process, Kubernetes has no mechanism to detect the failure state. The dead container will continue to receive traffic from the Service, resulting in dropped requests and severely degraded system availability until manual intervention occurs.

Hands-On Exercise: Full-Stack Model Deployment

Section titled “Hands-On Exercise: Full-Stack Model Deployment”

In this progressive lab, you will build, containerize, and safely deploy a machine learning API to a Kubernetes v1.35+ cluster using modern rollout practices.

First, create the model artifact that we will eventually serve.

Solution & Checkpoint

Create a file named train.py:

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Generate dummy churn data
X, y = make_classification(n_samples=1000, n_features=4, random_state=42)
# Train model
clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit(X, y)
# Save artifact
joblib.dump(clf, 'model.joblib')
print("Model saved to model.joblib")

Execution:

Terminal window
pip install scikit-learn joblib
python train.py
  • Checkpoint: Verify model.joblib exists in your working directory.

Create the REST interface that implements liveness and readiness logic.

Solution & Checkpoint

Create a file named server.py:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import numpy as np
app = FastAPI()
model = None
class PredictRequest(BaseModel):
features: list[float]
@app.on_event("startup")
async def load_model():
global model
model = joblib.load('model.joblib')
@app.post("/predict")
async def predict(req: PredictRequest):
if len(req.features) != 4:
raise HTTPException(status_code=400, detail="Require exactly 4 features")
pred = model.predict(np.array([req.features]))
return {"prediction": int(pred[0])}
@app.get("/health")
async def health():
return {"status": "alive"}
@app.get("/ready")
async def ready():
if model is None:
raise HTTPException(status_code=503, detail="Model loading")
return {"status": "ready"}

Execution:

Terminal window
pip install fastapi uvicorn
uvicorn server:app --host 0.0.0.0 --port 8000 &
  • Checkpoint: Run curl http://localhost:8000/health and ensure it returns {"status": "alive"}. Stop the background process (kill %1) before proceeding.

Package the model and API into an immutable, versioned container artifact.

Solution & Checkpoint

Create a file named Dockerfile:

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.joblib .
COPY server.py .
EXPOSE 8000
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]

Create requirements.txt:

fastapi==0.103.1
uvicorn==0.23.2
scikit-learn==1.3.0
joblib==1.3.2

Execution:

Terminal window
docker build -t local-registry/churn-api:v1 .
  • Checkpoint: Run docker image ls | grep churn-api to confirm the image is built and tagged locally.

Deploy the container using standard Kubernetes v1.35 resource definitions, ensuring readiness probes are properly defined.

Solution & Checkpoint

Create a file named deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
name: churn-api-v1
spec:
replicas: 2
selector:
matchLabels:
app: churn-api
version: v1
template:
metadata:
labels:
app: churn-api
version: v1
spec:
containers:
- name: api
image: local-registry/churn-api:v1
imagePullPolicy: Never # For local testing
ports:
- containerPort: 8000
readinessProbe:
httpGet:
path: /ready
port: 8000
initialDelaySeconds: 2
periodSeconds: 5
apiVersion: v1
kind: Service
metadata:
name: churn-api-service-v1
spec:
selector:
app: churn-api
version: v1
ports:
- port: 80
targetPort: 8000

Execution:

Terminal window
kubectl apply -f deployment.yaml
  • Checkpoint: Run kubectl get pods -l app=churn-api and verify both replicas reach the Running state and 1/1 readiness.

Simulate a risky rollout of v2 by routing precisely 10% of traffic to a new canary deployment utilizing standard Ingress annotations.

Solution & Checkpoint

Create a file named ingress-canary.yaml:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: churn-api-ingress-canary
annotations:
nginx.ingress.kubernetes.io/canary: "true"
nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
ingressClassName: nginx
rules:
- host: api.example.com
http:
paths:
- path: /predict
pathType: Prefix
backend:
service:
name: churn-api-service-v2 # Assume v2 service is deployed
port:
number: 80

Execution:

Terminal window
kubectl apply -f ingress-canary.yaml
  • Checkpoint: Use a bash loop for i in {1..100}; do curl -H "Host: api.example.com" http://<ingress-ip>/predict -s; done to validate that exactly ~10% of the traffic routes to the canary pods while the rest hits the stable v1 service.

Now that you have established robust, scalable deployment strategies for serving predictions, you must ensure you have deep visibility into their behavior. Proceed to Module 1.10: ML Monitoring, where we dive into establishing latency telemetry, alert fatigue, and detecting statistical model drift.

  • Triton Inference Server — Covers dynamic batching, ensembles, metrics, and inference protocol options for production model serving.
  • Liveness, Readiness, and Startup Probes — Authoritative reference for probe behavior, readiness gating, and startup handling in Kubernetes deployments.
  • Optimize Models for Inference — Summarizes serving-time optimization tradeoffs around latency, throughput, and cost in a way that fits this module’s optimization section.