Перейти до вмісту

ML DevOps Foundations

Цей контент ще не доступний вашою мовою.

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6 Prerequisites: Phase 9 complete


The $440 Million Commit That Changed Everything

Section titled “The $440 Million Commit That Changed Everything”

Knight Capital Group. August 1, 2012. 9:30 AM Eastern.

The opening bell rings on Wall Street, and Knight Capital’s new trading software goes live. For exactly 45 minutes, their system executes trades at a rate of 40 orders per second—buying high, selling low. By 10:15 AM, the company had lost $440 million. By the end of the week, Knight Capital was sold off in pieces.

The root cause? A technician had deployed new code but forgot to update one of eight servers. The old code, dormant for years, suddenly awoke and started executing a test flag meant for a system that no longer existed. There was no automated deployment. No consistency checks. No rollback mechanism.

Gene Kim, co-author of “The Phoenix Project” and researcher at IT Revolution, spent two years studying the Knight Capital disaster. His conclusion? “This wasn’t a technology failure. It was a DevOps failure. They had no way to ensure all servers had the same code, no automated testing, and no kill switch.”

Knight Capital’s catastrophe became a defining case study in why DevOps matters. But here’s the thing: ML systems are even more complex than the trading system that killed Knight Capital. You’re not just deploying code—you’re deploying code, data, models, and the intricate dependencies between them.

This module will teach you how to never become the next Knight Capital.


By the end of this module, you will:

  • Master Git workflows specifically designed for ML projects (because standard Git Flow doesn’t cut it)
  • Implement version control for code, data, and models as a unified system
  • Build comprehensive testing strategies that catch ML-specific bugs before production
  • Set up pre-commit hooks that prevent common ML disasters
  • Understand why 55% of ML models never reach production—and how to beat those odds

Think about what happens when a traditional software bug sneaks into production. The code behaves unexpectedly, users complain, you roll back, you fix it. Annoying, but manageable.

Now think about what happens when an ML model goes wrong in production. Is it the code? The data? The model weights? A subtle shift in the input distribution? A dependency that was silently updated? The training environment that’s different from production?

Did You Know? In 2015, D. Sculley and his team at Google published a paper called “Hidden Technical Debt in Machine Learning Systems” that sent shockwaves through the industry. They revealed that ML systems accumulate technical debt at an accelerating rate compared to traditional software. The reason? Every component depends on every other component. Change the data, and the model changes. Change the features, and the data pipeline changes. It’s a spider web of dependencies that traditional DevOps wasn’t designed to handle.

TRADITIONAL SOFTWARE vs ML SOFTWARE
====================================
Traditional Software: ML Software:
├── Code changes ├── Code changes
├── Config changes ├── Config changes
└── Dependencies ├── Dependencies
├── DATA changes (huge!)
├── MODEL changes (huge!)
├── Hyperparameters
├── Training environment
├── Feature definitions
├── Label definitions
└── Random seeds (yes, really)
Things that can break your system:
Traditional: ~3 ML: ~10+
This is why ML DevOps is a distinct discipline, not just "DevOps + ML"

Here’s a sobering statistic that should keep you up at night: In a 2019 study, researchers at McGill University found that only 6% of machine learning papers could be fully reproduced. Six percent!

The culprits weren’t malicious or incompetent researchers. They were:

  • Unreported random seeds (34% of cases)
  • Missing hyperparameters (28% of cases)
  • Undocumented preprocessing steps (23% of cases)
  • Environment differences (15% of cases)

Joelle Pineau, a professor at McGill and researcher at Facebook AI, led a crusade for ML reproducibility that eventually resulted in NeurIPS requiring reproducibility checklists for all submitted papers. Her mantra: “If you can’t reproduce it, you can’t trust it. And if you can’t trust it, you can’t deploy it.”

Traditional version control gives you one pillar: code. ML needs three, standing together like a tripod. Remove any one, and the whole thing falls over.

┌─────────────────────────────────────────────────────────────────────────┐
│ THE ML VERSION CONTROL TRIPOD │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. CODE VERSIONING (Git) │
│ ├── Training scripts → "How did we train this?" │
│ ├── Inference code → "How do we use this?" │
│ ├── Data preprocessing → "How did we prepare the data?" │
│ └── Configuration files → "What settings did we use?" │
│ │
│ 2. DATA VERSIONING (DVC, Delta Lake, etc.) │
│ ├── Training datasets → "What did we learn from?" │
│ ├── Validation datasets → "How did we evaluate?" │
│ ├── Feature stores → "What features existed when?" │
│ └── Data transformations → "How did we process it?" │
│ │
│ 3. MODEL VERSIONING (MLflow, W&B, etc.) │
│ ├── Model weights → "What are the learned parameters?" │
│ ├── Hyperparameters → "What knobs did we turn?" │
│ ├── Metrics → "How well did it work?" │
│ └── Artifacts → "What did it produce?" │
│ │
│ Remove ANY pillar = You cannot reproduce or debug your system │
└─────────────────────────────────────────────────────────────────────────┘

Did You Know? Algorithmia’s 2022 State of ML survey found that 55% of companies have never deployed a single ML model to production. The number one reason cited? Lack of MLOps practices and infrastructure. The models work in notebooks. They just never make it to the real world. DevOps fundamentals are the bridge between “works on my machine” and “works in production.”


If you’ve worked in software development, you probably know Git Flow: feature branches, develop branch, release branches, hotfixes. It’s elegant. It’s proven. And it’s completely inadequate for ML.

Here’s why: Git Flow assumes that a feature is either done or not done. You branch, you build, you merge. But ML experiments are exploratory. You might run 50 experiments, and 49 of them teach you something valuable even if they don’t produce a deployable model. Where do those go in Git Flow?

Pete Warden, who led TensorFlow Mobile at Google before joining Apple, described the problem perfectly: “Traditional branching strategies treat code as a series of discrete features. But ML development is more like scientific research—you’re running experiments, most of which ‘fail’ in the sense that they don’t beat your baseline, but all of which generate knowledge.”

Here’s a workflow designed specifically for the exploratory nature of ML development:

ML GIT WORKFLOW
===============
main ─────────────────●─────────────────●────────────────────→
│ │
│ │ (merge after validation)
│ │
staging ──────●───────┼────●────────────┼─────────────────────→
│ │ │ │
│ │ │ (model validated on staging data)
│ │ │
experiment/ │ │ │
exp-001 ────┴───────┘ │
experiment/ │
exp-002 ─────────────────┘
KEY DIFFERENCES FROM STANDARD GIT FLOW:
1. "experiment" branches for ML experiments (can be long-lived)
2. Staging branch for model validation (not just code review)
3. Longer validation cycles (days, not hours)
4. Experiments may never merge—and that's OK!

The crucial insight is that experiment branches are first-class citizens, not just feature branches with a different name. An experiment branch might:

  • Live for weeks while you iterate
  • Never merge to main (but still provide valuable learnings)
  • Spawn multiple sub-experiments
  • Require its own data version (tracked via DVC)

When you’re running 20 experiments in parallel, branch names become critical. A colleague should be able to understand what an experiment is testing just from the branch name.

# ML-specific branch naming conventions
BRANCH_PATTERNS = {
# Standard development
"feature/": "New functionality (feature/add-streaming-inference)",
"fix/": "Bug fixes (fix/data-leak-in-validation)",
# ML-specific branches
"experiment/": "ML experiments (experiment/bert-large-lr-sweep)",
"exp/": "Short form for quick experiments (exp/dropout-0.3)",
"model/": "Model architecture changes (model/transformer-v2)",
"data/": "Dataset changes (data/add-2024-q4-samples)",
"baseline/": "Baseline experiments (baseline/logistic-regression)",
# Emergency
"hotfix/": "Production fixes (hotfix/oom-on-large-batch)",
}
# Good branch names tell a story
GOOD_NAMES = [
"experiment/gpt4-finetune-customer-support-v2", # What, why, version
"data/incorporate-user-feedback-nov-2024", # What, when
"model/switch-attention-to-flash-attention", # What, how
"exp/learning-rate-1e-5-warmup-1000", # Hyperparameters visible
]
# Bad branch names create confusion
BAD_NAMES = [
"test", # Test what?
"my-changes", # What changes?
"experiment1", # Experiment about what?
"final", # Nothing in ML is ever final
"final-v2", # Proof that "final" is a lie
"asdf", # Future you will hate past you
]

Did You Know? Netflix’s ML platform team conducted an internal study and found that descriptive branch names reduced “experiment archaeology” time by 40%. Experiment archaeology is when you’re trying to figure out what a past experiment actually tested—a surprisingly common time sink when branches have names like exp-47 or johns-test.

In traditional software, a commit message like “fix bug” is annoying but survivable. In ML, where you might need to reproduce results from six months ago, vague commit messages are catastrophic.

# Conventional Commits adapted for ML
COMMIT_TYPES = {
"feat": "New feature in the codebase",
"fix": "Bug fix",
"exp": "ML experiment (results included!)", # ML-specific
"data": "Data changes (describe what changed)", # ML-specific
"model": "Model architecture changes", # ML-specific
"perf": "Performance improvement",
"refactor": "Code refactoring (no behavior change)",
"test": "Adding or updating tests",
"docs": "Documentation only",
"chore": "Maintenance tasks",
}
# The secret sauce: Include metrics in experiment commits!
EXPERIMENT_COMMIT_TEMPLATE = """
exp: {short_description}
Experiment: {experiment_name}
Hypothesis: {what_you_expected}
Result: {what_actually_happened}
Metrics (vs baseline):
- Accuracy: {accuracy} ({delta_accuracy:+.2%})
- F1 Score: {f1} ({delta_f1:+.2%})
- Latency: {latency}ms ({delta_latency:+}ms)
Config Changes:
- learning_rate: {baseline_lr}{new_lr}
- batch_size: {baseline_batch}{new_batch}
- epochs: {epochs}
Notes: {any_observations}
"""
# Real example
GOOD_COMMIT = """
exp: Test BERT-large with cosine LR scheduler
Experiment: bert-large-cosine-lr-v3
Hypothesis: Cosine annealing should help with convergence stability
Result: Confirmed - lower variance in final metrics
Metrics (vs baseline):
- Accuracy: 0.892 (+1.2%)
- F1 Score: 0.876 (+0.8%)
- Latency: 45ms (+5ms)
Config Changes:
- learning_rate: 2e-5 → 1e-5 (peak)
- scheduler: linear → cosine
- warmup_steps: 100 → 500
Notes: Training loss curve much smoother. Worth the latency tradeoff.
"""

Did You Know? Google’s ML teams require all experiment commits to include a “hypothesis” field. This practice, borrowed from scientific research methodology, helps teams understand not just what changed but why it was expected to help. An internal study credited this practice with reducing duplicate experiments by 30%—teams stopped accidentally re-running experiments that had already been tried.


Data Version Control (DVC): Git for Your Data

Section titled “Data Version Control (DVC): Git for Your Data”

Let me paint a picture that every ML engineer has experienced:

You train a model. It works great. You commit your code, push to main, celebrate. Three months later, someone tries to reproduce your results. Same code, same hyperparameters, same random seeds. The accuracy is 15% lower. What happened?

The data changed. Someone “cleaned up” the training set. Someone “fixed” some labels. Someone added new samples. And no one versioned any of it.

THE PROBLEM WITH LARGE FILES IN GIT
===================================
Git stores EVERY version of EVERY file.
Your ML project starts innocently:
├── training_data.csv (500 MB)
├── model_v1.pkl (200 MB)
└── embeddings.npy (1 GB)
After 10 model iterations:
├── training_data.csv (500 MB × 1) = 500 MB
├── model_v1.pkl (200 MB)
├── model_v2.pkl (200 MB)
├── model_v3.pkl (200 MB)
├── ... (200 MB × 7)
└── embeddings.npy (1 GB × 3 versions)
Total repository size: 500 + 2000 + 3000 = 5.5 GB
And now you try to push to GitHub...
"Error: File model_v8.pkl is 200 MB; max file size is 100 MB"

DVC (Data Version Control) was created by Dmitry Petrov, a data scientist who got frustrated with exactly this problem while working at Microsoft. His insight: “What if we could use Git’s elegant versioning model, but store the actual files somewhere else?”

DVC creates small “pointer files” that Git tracks, while the actual large files live in remote storage (S3, GCS, Azure, or even a local drive). It’s like Git LFS on steroids, specifically designed for ML workflows.

Terminal window
# Install DVC
pip install dvc
# Initialize DVC in a Git repo
cd my-ml-project
dvc init
# Track a large file
dvc add data/training_data.csv
# What just happened?
# 1. Created: data/training_data.csv.dvc (small pointer file, ~100 bytes)
# 2. Created: data/.gitignore (ignores the actual data file)
# 3. The actual data stays local for now
# Set up remote storage
dvc remote add -d myremote s3://my-bucket/dvc-storage
# Push data to remote
dvc push
# Now your collaborator can:
git clone <repo>
dvc pull # Downloads the actual data files!

Think of DVC and Git as dance partners. Git leads (tracking code and DVC pointer files), and DVC follows (tracking the actual large files). They move together, always in sync.

"""
DVC + Git Workflow: The Complete Picture
"""
# STEP 1: You make data changes
# Added 5000 new labeled samples to training data
# Fixed 200 incorrect labels
# Removed 50 duplicate entries
# STEP 2: Tell DVC about the changes
"""
$ dvc add data/training_data.csv
$ dvc add data/labels.json
DVC updates the .dvc pointer files with new hashes:
data/training_data.csv.dvc → md5: a1b2c3d4...
data/labels.json.dvc → md5: e5f6g7h8...
"""
# STEP 3: Commit the pointer files with Git
"""
$ git add data/training_data.csv.dvc data/labels.json.dvc
$ git commit -m "data: Add 5000 Q4 samples, fix 200 labels, remove dupes"
"""
# STEP 4: Push both
"""
$ dvc push # Uploads actual data to remote storage
$ git push # Uploads code + DVC pointers to Git
"""
# THE MAGIC: Time travel for data!
"""
$ git checkout experiment/bert-large # Go to old experiment
$ dvc checkout # DVC fetches the DATA from that time!
Your data directory now has the EXACT data from when that experiment ran.
You can reproduce results from any point in history.
"""

DVC isn’t just for versioning—it can define entire ML pipelines. This is where it gets really powerful.

# dvc.yaml - Your ML pipeline as code
stages:
prepare:
cmd: python src/prepare_data.py
deps:
- src/prepare_data.py
- data/raw/
outs:
- data/processed/
featurize:
cmd: python src/featurize.py --config configs/features.yaml
deps:
- src/featurize.py
- data/processed/
- configs/features.yaml
outs:
- data/features/
train:
cmd: python src/train.py --config configs/train.yaml
deps:
- src/train.py
- data/features/
- configs/train.yaml
outs:
- models/model.pkl
metrics:
- metrics/train_metrics.json:
cache: false # Always show latest
plots:
- metrics/loss_curve.csv:
x: epoch
y: loss
evaluate:
cmd: python src/evaluate.py
deps:
- src/evaluate.py
- models/model.pkl
- data/test/
metrics:
- metrics/eval_metrics.json:
cache: false
Terminal window
# Run the full pipeline
$ dvc repro
# DVC is smart: it only re-runs stages whose dependencies changed
# Changed training config? Only re-runs train and evaluate
# Changed preprocessing? Re-runs everything
# Compare metrics across experiments
$ dvc metrics diff
Path Metric HEAD workspace
metrics/eval.json accuracy 0.845 0.872
metrics/eval.json f1 0.831 0.859
# Visualize the pipeline
$ dvc dag
+---------+
| prepare |
+---------+
|
v
+-----------+
| featurize |
+-----------+
|
v
+-------+
| train |
+-------+
|
v
+----------+
| evaluate |
+----------+

Did You Know? Iterative (the company behind DVC) conducted a study of their users and found that teams using DVC pipelines reduced “data drift” bugs by 60%. Data drift—when training data changes unexpectedly between experiments—is one of the most insidious causes of ML system failures. With DVC, every experiment is pinned to an exact data version.


You know the traditional testing pyramid: unit tests at the base, integration tests in the middle, end-to-end tests at the top. ML needs an extended pyramid with two additional layers that traditional software doesn’t need.

THE ML TESTING PYRAMID
======================
/│\
/ │ \ END-TO-END TESTS
/ │ \ Full pipeline, production-like data
/ │ \ "Does the whole thing work?"
/────┼────\
/ │ \ MODEL QUALITY TESTS ← ML-specific!
/ │ \ Accuracy, fairness, robustness
/ │ \ "Is the model good enough?"
/────────┼────────\
/ │ \ DATA QUALITY TESTS ← ML-specific!
/ │ \ Schema, distributions, drift
/ │ \ "Is the data valid?"
/────────────┼────────────\
/ │ \ INTEGRATION TESTS
/ │ \Components together
/ │ \"Do pieces work together?"
/────────────────┼────────────────\
/ │ \ UNIT TESTS
/ │ \Individual functions
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔"Do building blocks work?"
MORE TESTS ────────────────────────────────► FEWER TESTS
RUN FASTER ────────────────────────────────► RUN SLOWER

Unit tests in ML are about testing your preprocessing, feature extraction, and utility functions—the code that surrounds your model.

import pytest
import numpy as np
from src.preprocessing import normalize, tokenize, extract_features
class TestPreprocessing:
"""Unit tests for preprocessing functions.
These should be FAST and test edge cases that could silently
corrupt your data pipeline.
"""
def test_normalize_scales_to_unit_range(self):
"""Normalization should scale values to [0, 1]."""
data = np.array([0, 50, 100])
result = normalize(data)
assert result.min() >= 0, "Normalized data has negative values!"
assert result.max() <= 1, "Normalized data exceeds 1!"
assert np.isclose(result[0], 0), "Min value should normalize to 0"
assert np.isclose(result[2], 1), "Max value should normalize to 1"
def test_normalize_handles_constant_values(self):
"""Normalization shouldn't crash on constant input.
This is a sneaky edge case: if all values are the same,
naive normalization divides by zero.
"""
data = np.array([5, 5, 5])
result = normalize(data)
# Should handle gracefully, not produce NaN
assert not np.any(np.isnan(result)), "NaN in output!"
def test_normalize_handles_empty_array(self):
"""Empty arrays shouldn't crash the pipeline."""
data = np.array([])
result = normalize(data)
assert len(result) == 0
def test_tokenize_handles_empty_string(self):
"""Tokenizer should handle empty input gracefully."""
result = tokenize("")
assert result == [] or result == [""], "Unexpected empty string handling"
def test_tokenize_preserves_important_tokens(self):
"""Tokenizer shouldn't drop semantically important words."""
text = "machine learning is transforming artificial intelligence"
tokens = tokenize(text)
# Core concepts should survive tokenization (allow for stemming)
important_stems = ["machin", "learn", "transform", "artificial", "intelligen"]
found_important = [any(stem in t.lower() for t in tokens) for stem in important_stems]
assert all(found_important), f"Lost important tokens. Got: {tokens}"
def test_extract_features_output_shape(self):
"""Feature extraction should produce consistent dimensions."""
text = "sample input text for testing"
features = extract_features(text)
assert features.shape == (768,), f"Expected 768-dim, got {features.shape}"
assert features.dtype == np.float32, f"Expected float32, got {features.dtype}"
class TestModelInference:
"""Unit tests for model inference behavior."""
def test_model_output_shape(self, model):
"""Model output should have correct shape."""
input_data = np.random.randn(1, 768).astype(np.float32)
output = model.predict(input_data)
assert output.shape == (1, 10), f"Expected (1, 10), got {output.shape}"
def test_model_output_is_probability(self, model):
"""For classification, output should be valid probabilities."""
input_data = np.random.randn(1, 768).astype(np.float32)
output = model.predict(input_data)
assert np.all(output >= 0), "Negative probabilities!"
assert np.all(output <= 1), "Probabilities > 1!"
assert np.isclose(output.sum(), 1.0, atol=1e-5), "Probabilities don't sum to 1!"
def test_model_deterministic_inference(self, model):
"""Same input should always produce same output."""
input_data = np.random.randn(1, 768).astype(np.float32)
output1 = model.predict(input_data)
output2 = model.predict(input_data)
np.testing.assert_array_almost_equal(
output1, output2, decimal=6,
err_msg="Model gives different outputs for same input!"
)

This is where ML testing diverges from traditional software. You need to test your data, not just your code.

import pytest
import pandas as pd
import great_expectations as ge
from scipy import stats
class TestDataQuality:
"""Tests for data quality and schema validation.
These tests catch data issues BEFORE they corrupt your model.
"""
@pytest.fixture
def training_data(self):
return pd.read_csv("data/training_data.csv")
def test_no_missing_labels(self, training_data):
"""All samples should have labels.
Missing labels during training silently skews your loss function.
"""
missing = training_data["label"].isna().sum()
assert missing == 0, f"Found {missing} samples with missing labels!"
def test_label_distribution_not_severely_imbalanced(self, training_data):
"""Labels should be reasonably balanced.
Severe imbalance leads to models that predict the majority class.
"""
label_counts = training_data["label"].value_counts()
imbalance_ratio = label_counts.max() / label_counts.min()
assert imbalance_ratio < 10, (
f"Label imbalance ratio is {imbalance_ratio:.1f}:1. "
f"Consider class weighting or resampling."
)
def test_no_data_leakage_between_splits(self, training_data):
"""Training data shouldn't contain test samples.
Data leakage is one of the most common causes of overly
optimistic model evaluations.
"""
test_ids = set(pd.read_csv("data/test_ids.csv")["id"])
train_ids = set(training_data["id"])
overlap = train_ids & test_ids
assert len(overlap) == 0, (
f"DATA LEAKAGE DETECTED! "
f"{len(overlap)} samples appear in both train and test: {list(overlap)[:5]}..."
)
def test_feature_values_in_expected_ranges(self, training_data):
"""Features should be within plausible ranges.
Out-of-range values often indicate data corruption or encoding errors.
"""
ge_df = ge.from_pandas(training_data)
# Age should be reasonable for humans
result = ge_df.expect_column_values_to_be_between(
"age", min_value=0, max_value=120
)
assert result.success, f"Invalid ages found: {result.result}"
# Prices should be positive
result = ge_df.expect_column_values_to_be_between(
"price", min_value=0, max_value=1_000_000
)
assert result.success, f"Invalid prices found: {result.result}"
def test_no_duplicate_samples(self, training_data):
"""No duplicate samples in training data.
Duplicates cause the model to memorize instead of generalize.
"""
duplicate_mask = training_data.duplicated(subset=["text", "label"])
n_duplicates = duplicate_mask.sum()
assert n_duplicates == 0, (
f"Found {n_duplicates} duplicate samples! "
f"First duplicates at indices: {training_data[duplicate_mask].index[:5].tolist()}"
)
class TestDataDrift:
"""Tests for data drift between training and production.
Your model was trained on historical data. If production data
looks different, performance will silently degrade.
"""
def test_feature_distributions_stable(self):
"""Production features should match training distribution."""
train_stats = load_training_statistics()
prod_sample = get_recent_production_sample(n=1000)
drift_detected = []
for feature in ["age", "income", "engagement_score"]:
# Two-sample KS test for distribution shift
train_values = train_stats[feature]["sample_values"]
prod_values = prod_sample[feature].values
statistic, p_value = stats.ks_2samp(train_values, prod_values)
if p_value < 0.01: # Significant drift
drift_detected.append(
f"{feature}: KS statistic={statistic:.3f}, p={p_value:.4f}"
)
assert len(drift_detected) == 0, (
f"DATA DRIFT DETECTED in features:\n" + "\n".join(drift_detected)
)
def test_categorical_distribution_stable(self):
"""Categorical feature distributions shouldn't shift dramatically."""
train_dist = load_training_distributions()
prod_sample = get_recent_production_sample(n=1000)
for feature in ["category", "region", "device_type"]:
prod_dist = prod_sample[feature].value_counts(normalize=True)
for category, train_freq in train_dist[feature].items():
prod_freq = prod_dist.get(category, 0)
# Alert if frequency changed by more than 50%
if abs(prod_freq - train_freq) / train_freq > 0.5:
pytest.fail(
f"Category '{category}' in '{feature}' shifted: "
f"train={train_freq:.2%} → prod={prod_freq:.2%}"
)

These tests determine whether your model is good enough to deploy.

import pytest
from sklearn.metrics import accuracy_score, f1_score, precision_recall_curve
from fairlearn.metrics import demographic_parity_difference, equalized_odds_difference
class TestModelQuality:
"""Tests for model quality and fairness.
These are your last line of defense before deployment.
"""
@pytest.fixture
def model_and_data(self):
model = load_model("models/production_model.pkl")
X_test, y_test = load_test_data()
return model, X_test, y_test
def test_accuracy_meets_business_threshold(self, model_and_data):
"""Model accuracy should meet minimum business requirement."""
model, X_test, y_test = model_and_data
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
threshold = 0.85 # Business requirement
assert accuracy >= threshold, (
f"Accuracy {accuracy:.2%} below required {threshold:.2%}. "
f"DO NOT DEPLOY."
)
def test_no_class_left_behind(self, model_and_data):
"""Every class should have acceptable F1 score.
A model with 95% overall accuracy but 20% F1 on class 3
is dangerous if class 3 matters to your users.
"""
model, X_test, y_test = model_and_data
y_pred = model.predict(X_test)
f1_per_class = f1_score(y_test, y_pred, average=None)
for class_idx, f1 in enumerate(f1_per_class):
assert f1 >= 0.6, (
f"Class {class_idx} has F1 score {f1:.2f}. "
f"This class is being underserved by the model."
)
def test_no_performance_regression(self, model_and_data):
"""New model should not be worse than current production."""
model, X_test, y_test = model_and_data
prod_metrics = load_production_metrics()
y_pred = model.predict(X_test)
new_accuracy = accuracy_score(y_test, y_pred)
# Allow 1% degradation for statistical noise
threshold = prod_metrics["accuracy"] - 0.01
assert new_accuracy >= threshold, (
f"REGRESSION DETECTED! "
f"New: {new_accuracy:.2%}, Prod: {prod_metrics['accuracy']:.2%}. "
f"Rolling back."
)
def test_fairness_demographic_parity(self, model_and_data):
"""Model should have similar outcomes across demographic groups.
Required for compliance in many regulated industries.
"""
model, X_test, y_test = model_and_data
sensitive_features = X_test["gender"]
y_pred = model.predict(X_test)
dpd = demographic_parity_difference(
y_test, y_pred,
sensitive_features=sensitive_features
)
assert abs(dpd) < 0.1, (
f"FAIRNESS VIOLATION: Demographic parity difference = {dpd:.3f}. "
f"Positive outcomes differ by more than 10% between groups."
)
class TestModelRobustness:
"""Tests for edge cases that could crash production."""
def test_handles_missing_values(self, model):
"""Model should handle missing values gracefully."""
input_with_nan = pd.DataFrame({
"feature1": [1.0, np.nan, 3.0],
"feature2": [np.nan, 2.0, 3.0],
})
try:
predictions = model.predict(input_with_nan)
assert len(predictions) == 3
assert not np.any(np.isnan(predictions)), "Model returned NaN predictions!"
except Exception as e:
pytest.fail(f"Model crashed on missing values: {e}")
def test_handles_extreme_values(self, model):
"""Model should handle outliers without crashing or NaN."""
extreme_input = pd.DataFrame({
"feature1": [1e10, -1e10, 0],
"feature2": [0, 0, 1e-10],
})
predictions = model.predict(extreme_input)
assert not np.any(np.isnan(predictions)), "NaN on extreme input!"
assert not np.any(np.isinf(predictions)), "Inf on extreme input!"
assert np.all((predictions >= 0) & (predictions <= 1)), "Invalid probabilities!"

Did You Know? Netflix runs over 500 automated tests on every model before deployment. These include not just accuracy tests but fairness tests across demographic groups, latency tests, and “chaos tests” that simulate production failures (missing features, delayed data, corrupted inputs). This comprehensive testing reduced their model rollback rate by 70%.


Pre-commit Hooks: Your First Line of Defense

Section titled “Pre-commit Hooks: Your First Line of Defense”

Pre-commit hooks are automated checks that run before every commit. Think of them as a bouncer at the door of your repository—they won’t let bad code in.

For ML projects, pre-commit hooks are especially critical because ML has unique failure modes:

  • Accidentally committing model weights (bloating your repo forever)
  • Committing notebooks with outputs (including potentially sensitive data)
  • Config files with hardcoded secrets (API keys in plain text)
  • Invalid model configurations (that will fail silently during training)
Terminal window
# Install pre-commit
pip install pre-commit
# Install hooks in your repo
pre-commit install
# Now pre-commit runs automatically on every commit!
# You can also run manually:
pre-commit run --all-files
.pre-commit-config.yaml
repos:
# ============================================================
# STANDARD CODE QUALITY (Same as any Python project)
# ============================================================
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
hooks:
- id: trailing-whitespace # Remove trailing spaces
- id: end-of-file-fixer # Ensure newline at end
- id: check-yaml # Valid YAML syntax
- id: check-json # Valid JSON syntax
- id: check-added-large-files
args: ['--maxkb=1000'] # CRITICAL: Catch model files!
- id: detect-private-key # Catch committed SSH keys
- id: check-merge-conflict # Catch merge conflict markers
- repo: https://github.com/psf/black
rev: 24.3.0
hooks:
- id: black # Auto-format Python code
language_version: python3.10
- repo: https://github.com/pycqa/isort
rev: 5.13.2
hooks:
- id: isort # Sort imports
args: ["--profile", "black"]
- repo: https://github.com/pycqa/flake8
rev: 7.0.0
hooks:
- id: flake8 # Lint Python code
args: ['--max-line-length=100', '--ignore=E203,W503']
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.8.0
hooks:
- id: mypy # Type checking
additional_dependencies: [types-requests, numpy, pandas-stubs]
# ============================================================
# SECURITY CHECKS (Critical for ML with API keys)
# ============================================================
- repo: https://github.com/PyCQA/bandit
rev: 1.7.7
hooks:
- id: bandit # Security vulnerability scanner
args: ["-r", "src/", "-ll"]
exclude: tests/
- repo: https://github.com/gitleaks/gitleaks
rev: v8.18.1
hooks:
- id: gitleaks # Scan for secrets/API keys
# ============================================================
# ML-SPECIFIC HOOKS (The special sauce)
# ============================================================
# Clean notebook outputs before commit
- repo: https://github.com/kynan/nbstripout
rev: 0.7.1
hooks:
- id: nbstripout # Remove outputs from notebooks
# Why? Outputs can contain:
# - Sensitive data samples
# - Large binary outputs (plots, tensors)
# - Makes diffs unreadable
# Custom ML checks
- repo: local
hooks:
# Prevent committing model files directly
- id: no-large-model-files
name: Check no model files committed
entry: python scripts/hooks/check_no_models.py
language: python
types: [file]
# Validate ML configurations
- id: validate-ml-config
name: Validate ML configurations
entry: python scripts/hooks/validate_ml_config.py
language: python
files: configs/.*\.(yaml|yml)$
# Check for hardcoded secrets in config
- id: no-secrets-in-config
name: Check no secrets in config
entry: python scripts/hooks/check_no_secrets.py
language: python
files: \.(yaml|yml|json|ini|env)$
# Validate DVC files are in sync
- id: dvc-check
name: Check DVC files are valid
entry: dvc status
language: system
pass_filenames: false
always_run: true

Here are the custom scripts referenced in the config above:

scripts/hooks/check_no_models.py
"""Prevent accidentally committing large model files."""
import sys
from pathlib import Path
# File extensions that are typically large model files
MODEL_EXTENSIONS = {
'.pkl', '.pickle', # Pickle files
'.pt', '.pth', # PyTorch
'.h5', '.hdf5', # Keras/TensorFlow
'.onnx', # ONNX
'.bin', # Binary weights
'.safetensors', # Safetensors
'.ckpt', # Checkpoints
}
# Size threshold (10MB)
SIZE_THRESHOLD = 10 * 1024 * 1024
def check_file(filepath: str) -> str | None:
"""Check if a file looks like a model file."""
path = Path(filepath)
# Check extension
if path.suffix.lower() in MODEL_EXTENSIONS:
size = path.stat().st_size
if size > SIZE_THRESHOLD:
return (
f"️ {filepath} ({size / 1024 / 1024:.1f}MB)\n"
f" This looks like a model file. Use DVC to track it:\n"
f" $ dvc add {filepath}"
)
return None
def main():
issues = []
for filepath in sys.argv[1:]:
issue = check_file(filepath)
if issue:
issues.append(issue)
if issues:
print(" Large model files detected!\n")
print("\n".join(issues))
print("\n Model files should be tracked with DVC, not Git.")
sys.exit(1)
sys.exit(0)
if __name__ == "__main__":
main()
scripts/hooks/check_no_secrets.py
"""Check that config files don't contain secrets."""
import re
import sys
from pathlib import Path
SECRET_PATTERNS = [
(r'api[_-]?key\s*[:=]\s*["\']?[a-zA-Z0-9]{20,}', "API key"),
(r'password\s*[:=]\s*["\']?[^\s"\']+', "Password"),
(r'secret\s*[:=]\s*["\']?[a-zA-Z0-9]{20,}', "Secret"),
(r'sk-[a-zA-Z0-9]{48}', "OpenAI API key"),
(r'AKIA[A-Z0-9]{16}', "AWS Access Key"),
(r'ghp_[a-zA-Z0-9]{36}', "GitHub Personal Access Token"),
(r'xox[baprs]-[a-zA-Z0-9-]+', "Slack Token"),
]
def check_file(filepath: str) -> list:
"""Scan file for potential secrets."""
issues = []
content = Path(filepath).read_text()
for i, line in enumerate(content.split('\n'), 1):
# Skip comments
if line.strip().startswith('#'):
continue
for pattern, secret_type in SECRET_PATTERNS:
if re.search(pattern, line, re.IGNORECASE):
# Don't show the actual secret!
issues.append(f"{filepath}:{i}: Potential {secret_type} detected")
return issues
def main():
all_issues = []
for filepath in sys.argv[1:]:
all_issues.extend(check_file(filepath))
if all_issues:
print(" POTENTIAL SECRETS DETECTED!\n")
for issue in all_issues:
print(f" {issue}")
print("\n Use environment variables or a secrets manager instead.")
print(" Example: api_key: ${OPENAI_API_KEY}")
sys.exit(1)
print(" No secrets detected")
sys.exit(0)
if __name__ == "__main__":
main()

Did You Know? GitHub’s secret scanning detected over 700,000 exposed secrets in public repositories in 2022 alone. ML projects are particularly vulnerable because API keys for OpenAI, Anthropic, and cloud providers are expensive—a leaked key can result in thousands of dollars in charges within hours. Pre-commit hooks are your first line of defense.


A well-organized project structure isn’t just about aesthetics—it’s about making your project navigable for your future self, your teammates, and anyone who needs to reproduce your work.

ml-project/
├── .github/
│ └── workflows/
│ ├── ci.yml # CI: tests, linting on every PR
│ ├── train.yml # Training pipeline (triggered manually)
│ └── deploy.yml # Deployment pipeline
├── configs/
│ ├── model/
│ │ ├── base.yaml # Shared model configuration
│ │ ├── small.yaml # Small model (fast iteration)
│ │ └── large.yaml # Large model (production)
│ ├── training/
│ │ ├── default.yaml # Default hyperparameters
│ │ └── fine_tune.yaml # Fine-tuning configuration
│ └── inference/
│ └── production.yaml # Production serving config
├── data/
│ ├── raw/ # Raw data (DVC tracked, never modified)
│ ├── processed/ # Processed data (DVC tracked)
│ ├── features/ # Feature store outputs
│ └── .gitignore # CRITICAL: Ignore actual data files
├── models/
│ ├── checkpoints/ # Training checkpoints (DVC tracked)
│ ├── production/ # Production models (DVC tracked)
│ └── .gitignore # Ignore actual model files
├── notebooks/
│ ├── exploration/ # EDA and data exploration
│ ├── experiments/ # Experiment notebooks
│ └── reports/ # Stakeholder-facing notebooks
├── src/
│ ├── __init__.py
│ ├── data/
│ │ ├── __init__.py
│ │ ├── load.py # Data loading utilities
│ │ ├── preprocess.py # Preprocessing functions
│ │ └── validate.py # Data validation
│ ├── features/
│ │ ├── __init__.py
│ │ └── extract.py # Feature extraction
│ ├── models/
│ │ ├── __init__.py
│ │ ├── train.py # Training loop
│ │ ├── evaluate.py # Evaluation logic
│ │ └── predict.py # Inference logic
│ └── utils/
│ ├── __init__.py
│ ├── config.py # Configuration loading
│ └── logging.py # Logging setup
├── tests/
│ ├── __init__.py
│ ├── unit/ # Fast, isolated tests
│ ├── integration/ # Component integration tests
│ └── data/ # Data quality tests
├── scripts/
│ ├── train.py # Training entry point
│ ├── evaluate.py # Evaluation entry point
│ └── predict.py # Batch prediction
├── .dvc/ # DVC internals
├── .pre-commit-config.yaml # Pre-commit hooks
├── dvc.yaml # DVC pipeline definition
├── dvc.lock # DVC lock file (reproducibility)
├── pyproject.toml # Modern Python project config
├── requirements.txt # Production dependencies
├── requirements-dev.txt # Development dependencies
├── Makefile # Common commands
└── README.md # Project documentation
# Makefile - Common commands for the ML project
# Usage: make <command>
.PHONY: install test lint train evaluate clean help
# Default target: show help
help:
@echo "ML Project Commands:"
@echo " make install - Install all dependencies"
@echo " make test - Run all tests"
@echo " make test-unit - Run unit tests only"
@echo " make test-data - Run data quality tests"
@echo " make lint - Run all linters"
@echo " make train - Run training pipeline"
@echo " make evaluate - Evaluate current model"
@echo " make clean - Remove artifacts"
# Install all dependencies and set up pre-commit
install:
pip install -r requirements.txt
pip install -r requirements-dev.txt
pre-commit install
@echo " Installation complete"
# Run all tests with coverage
test:
pytest tests/ -v --cov=src --cov-report=term-missing --cov-report=html
@echo " Coverage report: htmlcov/index.html"
# Run specific test suites
test-unit:
pytest tests/unit/ -v
test-integration:
pytest tests/integration/ -v
test-data:
pytest tests/data/ -v --tb=short
# Lint and format code
lint:
black src/ tests/ scripts/
isort src/ tests/ scripts/
flake8 src/ tests/ scripts/
mypy src/
@echo " Linting complete"
# Run the full DVC pipeline
train:
dvc repro
@echo " Training complete. Check metrics with 'dvc metrics show'"
# Evaluate the current model
evaluate:
python scripts/evaluate.py --config configs/training/default.yaml
# Clean all artifacts
clean:
find . -type d -name __pycache__ -exec rm -rf {} + 2>/dev/null || true
find . -type f -name "*.pyc" -delete
rm -rf .pytest_cache .mypy_cache .coverage htmlcov/
@echo " Cleaned"
# Data management
data-pull:
dvc pull
@echo " Data pulled from remote"
data-push:
dvc push
@echo " Data pushed to remote"
# Quick sanity check before committing
check: lint test-unit
@echo " Ready to commit"

Everything in this module leads to one goal: reproducibility. If you can’t reproduce your results, you can’t trust them. If you can’t trust them, you can’t deploy them.

Did You Know? Joelle Pineau’s famous 2019 study found that only 6% of ML papers were fully reproducible. The main culprits weren’t complex algorithms—they were mundane issues like unreported random seeds, missing hyperparameters, and undocumented preprocessing. Here’s a checklist to avoid joining the 94%:

# The Complete ML Reproducibility Checklist
REPRODUCIBILITY_CHECKLIST = {
"Random Seeds": {
"python": "import random; random.seed(42)",
"numpy": "import numpy as np; np.random.seed(42)",
"pytorch": """
import torch
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
""",
"tensorflow": "import tensorflow as tf; tf.random.set_seed(42)",
},
"Environment": {
"python_version": "Exact version (e.g., 3.10.12)",
"dependencies": "requirements.txt with pinned versions (pkg==1.2.3)",
"hardware": "Document GPU model, CUDA version, driver version",
"os": "Document OS and version",
},
"Data": {
"version": "DVC hash or explicit version tag",
"preprocessing": "All steps documented AND versioned",
"splits": "Fixed random seed for train/val/test splits",
"augmentation": "If any, document with seeds",
},
"Model": {
"architecture": "Full specification in config file",
"initialization": "Random, pretrained (which checkpoint?)",
"hyperparameters": "ALL of them, in versioned config",
},
"Training": {
"optimizer": "Type, learning rate, momentum, weight decay, etc.",
"scheduler": "Type, parameters, warmup steps",
"batch_size": "Training AND evaluation",
"epochs": "Number and early stopping criteria",
"hardware": "Single GPU? Multi-GPU? Which GPUs?",
},
"Evaluation": {
"metrics": "Exact formulas or library versions",
"test_set": "Versioned with DVC",
"preprocessing": "Same as training? Document any differences",
},
}
def verify_reproducibility(config_path: str) -> list[str]:
"""Check if a configuration covers reproducibility requirements."""
missing = []
config = load_config(config_path)
for category, items in REPRODUCIBILITY_CHECKLIST.items():
for item, description in items.items():
if item not in config.get(category, {}):
missing.append(f"{category}.{item}: {description}")
return missing

Did You Know? Spotify’s ML platform team developed an experiment organization hierarchy that’s now used across the industry. The key insight: experiments have different lifetimes, and your organization structure should reflect that.

EXPERIMENT TRACKING HIERARCHY
==============================
Project (lives: months to years)
│ "Customer Churn Prediction"
└── Experiment Group (lives: weeks)
│ "Feature Engineering v2"
└── Experiment (lives: days)
│ "Add behavioral features"
└── Run (lives: hours)
│ "lr=0.001, batch=32, seed=42"
└── Artifacts (lives: forever)
Model weights, metrics, plots
ORGANIZATION TIPS:
- Name projects after business problems, not techniques
- Name experiment groups after hypotheses, not dates
- Name experiments after what changed
- Let runs be automatically named with hyperparameters

Exercise 1: Set Up a Complete ML Git Workflow

Section titled “Exercise 1: Set Up a Complete ML Git Workflow”

Create a new ML project with:

  • Proper .gitignore for ML (exclude data, models, notebooks with outputs)
  • Branch naming conventions documented in CONTRIBUTING.md
  • Commit message template with experiment format
  • PR template with checklist for ML changes

Exercise 2: Implement DVC for Data Versioning

Section titled “Exercise 2: Implement DVC for Data Versioning”

Set up DVC in an existing project:

  • Initialize DVC and connect to remote storage (S3, GCS, or local)
  • Track your training data and model files
  • Create a simple DVC pipeline (preprocess → train → evaluate)
  • Practice “time travel” by checking out an old experiment

Write tests covering:

  • 5 unit tests for preprocessing functions
  • 3 data quality tests using Great Expectations
  • 2 model quality tests (accuracy threshold, fairness check)
  • 1 robustness test (handling missing values)

Exercise 4: Create Pre-commit Configuration

Section titled “Exercise 4: Create Pre-commit Configuration”

Set up pre-commit with:

  • Standard Python quality checks (black, isort, flake8, mypy)
  • Security scanning (bandit, gitleaks)
  • Notebook output stripping
  • Custom hook to prevent committing model files

  • “Hidden Technical Debt in Machine Learning Systems” (Sculley et al., Google, 2015) - The paper that started the MLOps movement
  • “Machine Learning: The High-Interest Credit Card of Technical Debt” (Google, 2014) - The earlier warning that was largely ignored
  • “Towards Reproducible Machine Learning Research in Natural Language Processing” (Pineau et al., 2019) - The reproducibility crisis exposed
  • “Continuous Delivery for Machine Learning” (ThoughtWorks, 2019) - CD4ML patterns
  • “Rules of Machine Learning: Best Practices for ML Engineering” (Martin Zinkevich, Google) - The definitive checklist

  1. ML DevOps is harder than traditional DevOps - You’re versioning code, data, models, and their complex dependencies. Traditional tools aren’t enough.

  2. DVC is your data’s Git - Large files don’t belong in Git. Use DVC to version data and models while keeping your repository lightweight.

  3. Testing ML is a multi-layer problem - You need unit tests, data quality tests, AND model quality tests. The traditional testing pyramid isn’t enough.

  4. Pre-commit hooks prevent disasters - Catching problems before commit is infinitely cheaper than catching them in production.

  5. Reproducibility is non-negotiable - If you can’t reproduce it, you can’t trust it. If you can’t trust it, you can’t deploy it.

  6. Project structure matters - A well-organized project is navigable by anyone. Use the standard layout—don’t reinvent the wheel.


Before moving on, ensure you have:

  • ML DevOps Toolkit with Git workflow automation
  • Pre-commit configuration covering ML-specific concerns
  • Data quality test suite using Great Expectations
  • Model quality test framework with fairness checks
  • Project template following the recommended structure
  • 5 working demos showing the complete workflow

With DevOps fundamentals in place, you’re ready to containerize your ML applications. Containers provide the reproducible environment that makes all this versioning actually work across different machines.

Up Next: Module 44 - Docker & Containerization for ML


Module 43 Complete! You now have the foundation to build ML systems that actually make it to production.

Remember the Knight Capital story: DevOps isn’t overhead—it’s insurance. And in ML, where more things can go wrong, that insurance is invaluable.

“The best ML model is worthless if you can’t deploy it reliably. The second-best model with solid DevOps beats it every time.”