# Module 9.2: MLflow

Toolkit Track | Complexity: [MEDIUM] | Time: 40-45 minutes
## Overview

MLflow is the experiment tracking and model registry tool that data scientists actually use. While Kubeflow handles infrastructure, MLflow handles the metadata: what parameters did you try, what metrics did you get, where is the model? This module covers MLflow’s core components and Kubernetes deployment.
What You’ll Learn:
- MLflow Tracking for experiments
- Model Registry for lifecycle management
- MLflow on Kubernetes deployment
- Integration with Kubeflow Pipelines
- When to use MLflow vs alternatives
Prerequisites:
- Python basics
- Basic ML concepts
- Kubernetes fundamentals
- MLOps Discipline recommended
## What You’ll Be Able to Do

After completing this module, you will be able to:
- Deploy MLflow on Kubernetes with tracking server, artifact store, and model registry components
- Implement experiment tracking with MLflow’s API for metrics, parameters, and model versioning
- Configure MLflow’s model registry for staging, production, and archived model lifecycle management
- Integrate MLflow with Kubernetes for model serving using Seldon Core or KServe inference endpoints
## Why This Module Matters

Data scientists run hundreds of experiments. Without tracking, they lose track of which parameters produced which results. MLflow provides a central record of every experiment, making ML reproducible. It’s the Git of machine learning: you wouldn’t code without version control, and you shouldn’t train models without experiment tracking.
💡 Did You Know? MLflow was created by Databricks and open-sourced in 2018. It became the most popular ML experiment tracking tool because it was framework-agnostic and simple. While competitors required specific ML frameworks or complex setups, MLflow worked with any Python code by adding just a few lines. Simplicity won.
## MLflow Architecture

```text
MLFLOW COMPONENTS
════════════════════════════════════════════════════════════════════

            ┌─────────────────────────────────────────┐
            │         MLflow Tracking Server          │
            │                                         │
            │  ┌────────────┐    ┌────────────────┐   │
            │  │  Backend   │    │ Artifact Store │   │
            │  │   Store    │    │    (S3/GCS)    │   │
            │  │ (Postgres) │    │                │   │
            │  └────────────┘    └────────────────┘   │
            └────────────────────┬────────────────────┘
                                 │
 ┌───────────────────────────────┼───────────────────────────────┐
 │                               │                               │
 ▼                               ▼                               ▼
┌─────────────────┐   ┌─────────────────┐   ┌──────────────────┐
│   Experiments   │   │     Models      │   │     Projects     │
│                 │   │                 │   │                  │
│ • Runs          │   │ • Registry      │   │ • Packaging      │
│ • Parameters    │   │ • Versions      │   │ • Dependencies   │
│ • Metrics       │   │ • Stages        │   │ • Reproducibility│
│ • Artifacts     │   │ • Aliases       │   │                  │
└─────────────────┘   └─────────────────┘   └──────────────────┘

DATA FLOW:
─────────────────────────────────────────────────────────────────

Training Code                  MLflow Server             Storage
     │                              │                       │
     │ mlflow.log_param("lr", 0.01) │                       │
     │ ────────────────────────────▶│                       │
     │                              │   Store in Postgres   │
     │                              │ ─────────────────────▶│
     │                              │                       │
     │ mlflow.log_artifact(model)   │                       │
     │ ────────────────────────────▶│                       │
     │                              │   Upload to S3        │
     │                              │ ─────────────────────▶│
```

## Core Components

| Component | Purpose | Key Features |
|---|---|---|
| Tracking | Log experiments | Parameters, metrics, artifacts |
| Registry | Model lifecycle | Versions, stages, aliases |
| Projects | Reproducibility | Package code + dependencies |
| Models | Deployment | Unified model format |
💡 Did You Know? MLflow’s Model Registry introduced the concept of “model stages” (Staging, Production, Archived) to ML. This simple idea transformed how teams think about models—not as files but as artifacts with lifecycles. A model in “Staging” means something different than one in “Production,” and everyone on the team knows it.
## MLflow Tracking

### Basic Tracking

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Set tracking URI (server address)
mlflow.set_tracking_uri("http://mlflow-server:5000")

# Set experiment
mlflow.set_experiment("iris-classification")

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2
)

# Start a run
with mlflow.start_run():
    # Log parameters
    n_estimators = 100
    max_depth = 10
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)

    # Train model
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth
    )
    model.fit(X_train, y_train)

    # Log metrics
    accuracy = model.score(X_test, y_test)
    mlflow.log_metric("accuracy", accuracy)

    # Log model
    mlflow.sklearn.log_model(model, "model")

    print(f"Run ID: {mlflow.active_run().info.run_id}")
    print(f"Accuracy: {accuracy}")
```

### Autologging
```python
import mlflow

# Enable autologging for supported frameworks
mlflow.autolog()

# Or for specific frameworks:
mlflow.sklearn.autolog()
mlflow.pytorch.autolog()
mlflow.tensorflow.autolog()
mlflow.xgboost.autolog()

# Now just train - everything is logged automatically
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
# Parameters, metrics, and model logged automatically!
```

### Tracking Custom Artifacts
```python
import mlflow
import matplotlib.pyplot as plt
import json

with mlflow.start_run():
    # Log parameters and train...

    # Log confusion matrix as an image
    plt.figure()
    # ... create confusion matrix plot ...
    plt.savefig("confusion_matrix.png")
    mlflow.log_artifact("confusion_matrix.png")

    # Log custom JSON
    metadata = {
        "training_date": "2024-01-15",
        "data_version": "v2.3",
        "feature_count": 10
    }
    with open("metadata.json", "w") as f:
        json.dump(metadata, f)
    mlflow.log_artifact("metadata.json")

    # Log an entire directory
    mlflow.log_artifacts("./plots", artifact_path="visualizations")
```

## MLflow on Kubernetes
### Deployment Architecture

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-server
  namespace: mlflow
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
      - name: mlflow
        image: ghcr.io/mlflow/mlflow:v2.9.0
        command:
        - mlflow
        - server
        - --host=0.0.0.0
        - --port=5000
        - --backend-store-uri=postgresql://$(POSTGRES_USER):$(POSTGRES_PASSWORD)@postgres:5432/mlflow
        - --default-artifact-root=s3://mlflow-artifacts/
        - --serve-artifacts
        ports:
        - containerPort: 5000
        env:
        - name: POSTGRES_USER
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: postgres-user
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: postgres-password
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: aws-access-key
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: aws-secret-key
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 5000
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-server
  namespace: mlflow
spec:
  selector:
    app: mlflow
  ports:
  - port: 5000
    targetPort: 5000
  type: ClusterIP
```

### PostgreSQL Backend
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: mlflow
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:15
        ports:
        - containerPort: 5432
        env:
        - name: POSTGRES_DB
          value: mlflow
        - name: POSTGRES_USER
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: postgres-user
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: postgres-password
        volumeMounts:
        - name: postgres-data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: postgres-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```

### Ingress Configuration
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mlflow-ingress
  namespace: mlflow
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: mlflow-basic-auth
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - mlflow.example.com
    secretName: mlflow-tls
  rules:
  - host: mlflow.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: mlflow-server
            port:
              number: 5000
```

💡 Did You Know? MLflow 2.0 introduced “serve-artifacts” mode where the tracking server proxies artifact uploads. This solved a common pain point: clients no longer need direct access to S3/GCS. They just talk to MLflow, which handles storage. This simplified network configurations significantly, especially in Kubernetes.
💡 Did You Know? MLflow’s `search_runs()` API uses a SQL-like syntax that can query across thousands of experiments in milliseconds. You can find “all runs with accuracy > 0.95 and learning_rate < 0.01” instantly. Before this, teams would export CSVs and write pandas queries; now it’s built into the platform. Some organizations have millions of logged runs, all queryable through this single API.
## Model Registry

### Registering Models

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register during training
with mlflow.start_run():
    # Train model...
    mlflow.sklearn.log_model(
        model, "model",
        registered_model_name="iris-classifier"
    )

# Or register from an existing run
client = MlflowClient()
result = client.create_registered_model("iris-classifier")

# Create a version from a run
run_id = "abc123..."
model_uri = f"runs:/{run_id}/model"
client.create_model_version(
    name="iris-classifier",
    source=model_uri,
    run_id=run_id
)
```

### Model Stages and Aliases
```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Transition to staging
client.transition_model_version_stage(
    name="iris-classifier",
    version=1,
    stage="Staging"
)

# Transition to production
client.transition_model_version_stage(
    name="iris-classifier",
    version=1,
    stage="Production"
)

# Using aliases (MLflow 2.3+)
client.set_registered_model_alias(
    name="iris-classifier",
    alias="champion",
    version=1
)

# Load by alias
model = mlflow.pyfunc.load_model("models:/iris-classifier@champion")
```

### Model Registry Workflow
```text
MODEL REGISTRY WORKFLOW
════════════════════════════════════════════════════════════════════

Version 1 ──▶ None ──▶ Staging ──▶ Production ──▶ Archived
                          │             │
                          ▼             ▼
                       Testing       Serving
                      Validation    Inference

Version 2 ──▶ None ──▶ Staging ──▶ Production
                          │             │
                          ▼             ▼
                       Testing    (replaces v1)
                                    A/B Test

ALIAS WORKFLOW (Recommended):
─────────────────────────────────────────────────────────────────

Version 1 ◀── alias: "champion"    ◀── Production uses this
Version 2 ◀── alias: "challenger"
Version 3      (no alias - testing)

# Switch production in one command:
client.set_registered_model_alias("model", "champion", version=2)
```

## Integration with Kubeflow
### MLflow in Kubeflow Pipelines

```python
from kfp import dsl
from kfp.dsl import Dataset, Model, Output, Input

@dsl.component(
    base_image="python:3.10",
    packages_to_install=["mlflow", "scikit-learn", "boto3"]
)
def train_with_mlflow(
    mlflow_uri: str,
    experiment_name: str,
    n_estimators: int,
    output_run_id: Output[str]
):
    import mlflow
    import mlflow.sklearn
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    mlflow.set_tracking_uri(mlflow_uri)
    mlflow.set_experiment(experiment_name)

    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.2
    )

    with mlflow.start_run() as run:
        mlflow.log_param("n_estimators", n_estimators)

        model = RandomForestClassifier(n_estimators=n_estimators)
        model.fit(X_train, y_train)

        accuracy = model.score(X_test, y_test)
        mlflow.log_metric("accuracy", accuracy)

        mlflow.sklearn.log_model(
            model, "model",
            registered_model_name="iris-classifier"
        )

        # Output run ID for downstream components
        with open(output_run_id.path, "w") as f:
            f.write(run.info.run_id)

@dsl.pipeline(name="mlflow-training-pipeline")
def training_pipeline(
    mlflow_uri: str = "http://mlflow-server.mlflow:5000",
    experiment_name: str = "kubeflow-experiments",
    n_estimators: int = 100
):
    train_task = train_with_mlflow(
        mlflow_uri=mlflow_uri,
        experiment_name=experiment_name,
        n_estimators=n_estimators
    )
```

## MLflow Model Serving
### Local Serving

```shell
# Serve a model from the registry
mlflow models serve \
  --model-uri "models:/iris-classifier@champion" \
  --host 0.0.0.0 \
  --port 5001

# Test the endpoint
curl -X POST http://localhost:5001/invocations \
  -H "Content-Type: application/json" \
  -d '{"inputs": [[5.1, 3.5, 1.4, 0.2]]}'
```

### Kubernetes Serving
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iris-model-server
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: iris-model
  template:
    metadata:
      labels:
        app: iris-model
    spec:
      containers:
      - name: model-server
        image: ghcr.io/mlflow/mlflow:v2.9.0
        command:
        - mlflow
        - models
        - serve
        - --model-uri=models:/iris-classifier@champion
        - --host=0.0.0.0
        - --port=5001
        - --no-conda
        ports:
        - containerPort: 5001
        env:
        - name: MLFLOW_TRACKING_URI
          value: "http://mlflow-server.mlflow:5000"
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: aws-access-key
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: aws-secret-key
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
        readinessProbe:
          httpGet:
            path: /health
            port: 5001
---
apiVersion: v1
kind: Service
metadata:
  name: iris-model-server
  namespace: ml-serving
spec:
  selector:
    app: iris-model
  ports:
  - port: 80
    targetPort: 5001
```

## Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| No tracking URI set | Logs to local `./mlruns` | Set `MLFLOW_TRACKING_URI` |
| Missing artifact store | Model files lost | Configure S3/GCS backend |
| No run context | Metrics not logged | Use `with mlflow.start_run():` |
| Logging in loops | Thousands of metrics | Log after training, not per iteration |
| No model signature | Serving issues | Use `infer_signature()` |
| Ignoring run IDs | Can’t reproduce | Always log/save the run ID |
## War Story: The Lost Experiment

A data scientist achieved 98% accuracy. Best model ever! But they forgot which parameters produced it. They spent a week trying to reproduce it.
What went wrong:
- Running experiments in Jupyter without MLflow
- Overwriting variables as they experimented
- No systematic parameter tracking
- “I’ll remember which one was best”
The fix:
```python
# Always track, even in notebooks
import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("notebook-experiments")

# Now every experiment is recorded
with mlflow.start_run(run_name="experiment-v42"):
    mlflow.log_params({
        "learning_rate": 0.01,
        "batch_size": 32,
        "epochs": 100,
        "architecture": "resnet50"
    })
    # Train...
    mlflow.log_metric("accuracy", 0.98)
    mlflow.sklearn.log_model(model, "model")

# Later: "Which run had 98% accuracy?"
# Just filter in the MLflow UI, or:
runs = mlflow.search_runs(filter_string="metrics.accuracy > 0.97")
```

Lesson: If you didn’t log it, it didn’t happen.
## Question 1

What’s the difference between MLflow Tracking and Model Registry?
Show Answer
MLflow Tracking:
- Logs individual training runs
- Captures parameters, metrics, artifacts
- Organized by experiments
- Focus: “What did I try?”
Model Registry:
- Manages model lifecycle
- Versions of registered models
- Stages (Staging, Production)
- Aliases for easy reference
- Focus: “Which model should I use?”
Tracking is for experimentation. Registry is for production. A model logged in tracking can be promoted to the registry when it’s ready for deployment.
## Question 2

Why use MLflow with Kubeflow Pipelines?
Show Answer
They solve different problems:
Kubeflow Pipelines:
- Orchestrates ML workflows
- Manages compute resources
- Handles data flow between steps
- Kubernetes-native execution
MLflow:
- Tracks experiment metadata
- Manages model versions
- Framework-agnostic
- Works outside Kubernetes too
Together:
- Kubeflow runs the pipeline
- MLflow tracks what happened
- Models registered in MLflow can be served by KServe
- Best of both: reproducible workflows + experiment tracking
## Question 3

How do model aliases improve deployment?
Show Answer
Without aliases (stage-based):
# Load "Production" stage modelmodel = mlflow.pyfunc.load_model("models:/mymodel/Production")
# Problem: Only one model per stage# Problem: Stage transitions are destructiveWith aliases:
# Load by aliasmodel = mlflow.pyfunc.load_model("models:/mymodel@champion")
# Benefits:# - Multiple aliases per model# - Switch production with one command# - Keep "champion" and "challenger" simultaneously# - Alias history tracked
# Switch production:client.set_registered_model_alias("mymodel", "champion", version=5)# All services loading @champion now get version 5Aliases provide atomic, rollback-safe deployments.
## Hands-On Exercise

### Objective

Deploy MLflow on Kubernetes and track an ML experiment.
1. Create the MLflow namespace and secrets:

   ```shell
   kubectl create namespace mlflow
   kubectl create secret generic mlflow-secrets \
     --namespace mlflow \
     --from-literal=postgres-user=mlflow \
     --from-literal=postgres-password=mlflow123
   ```

2. Deploy a minimal MLflow server (local storage, no external database or S3):

   ```yaml
   # Save as mlflow-minimal.yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: mlflow-server
     namespace: mlflow
   spec:
     replicas: 1
     selector:
       matchLabels:
         app: mlflow
     template:
       metadata:
         labels:
           app: mlflow
       spec:
         containers:
         - name: mlflow
           image: ghcr.io/mlflow/mlflow:v2.9.0
           command: ["mlflow", "server", "--host=0.0.0.0", "--port=5000"]
           ports:
           - containerPort: 5000
   ---
   apiVersion: v1
   kind: Service
   metadata:
     name: mlflow-server
     namespace: mlflow
   spec:
     selector:
       app: mlflow
     ports:
     - port: 5000
   ```

   ```shell
   kubectl apply -f mlflow-minimal.yaml
   ```

3. Port-forward and access the UI:

   ```shell
   kubectl port-forward -n mlflow svc/mlflow-server 5000:5000
   # Open http://localhost:5000
   ```

4. Run a tracking example:

   ```python
   # tracking_example.py
   import mlflow

   mlflow.set_tracking_uri("http://localhost:5000")
   mlflow.set_experiment("k8s-demo")

   with mlflow.start_run():
       mlflow.log_param("learning_rate", 0.01)
       mlflow.log_param("epochs", 10)
       mlflow.log_metric("accuracy", 0.95)
       mlflow.log_metric("loss", 0.05)
       print(f"Run: {mlflow.active_run().info.run_id}")
   ```

   ```shell
   pip install mlflow
   python tracking_example.py
   ```

5. View in the UI:
   - Refresh http://localhost:5000
   - Click on the “k8s-demo” experiment
   - See logged parameters and metrics

6. Clean up:

   ```shell
   kubectl delete namespace mlflow
   ```
## Success Criteria

- MLflow deployed on Kubernetes
- UI accessible via port-forward
- Experiment created and visible
- Parameters and metrics logged
- Can compare multiple runs
## Bonus Challenge

Register a model to the Model Registry and load it by alias.
## Further Reading

## Next Module

Continue to Module 9.3: Feature Stores to learn about managing ML features at scale.
“Experiment tracking isn’t overhead—it’s insurance. Every hour you spend logging is paid back tenfold when you need to reproduce results or debug a production model.”