CI/CD for ML

Цей контент ще не доступний вашою мовою.

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6

Seattle. December 24, 2023. 4:17 PM. Sarah Park was already late to her family’s holiday dinner when her phone buzzed with a PagerDuty alert. The e-commerce recommendation system she’d built had just crashed—on the busiest shopping day of the year.

The root cause was embarrassingly simple: a well-meaning teammate had deployed a “small improvement” to the model. They’d retrained it on last month’s data, saw that accuracy looked good in their Jupyter notebook, and pushed it to production. What they didn’t notice was that the new model was 3x slower than the old one. Under holiday traffic, inference latency caused cascading timeouts across the platform.

“But it worked when I tested it!” her teammate protested.

That sentence haunts every ML engineer who’s heard it. Of course it worked in testing. Everything works in testing. Testing isn’t production. Testing doesn’t have 50,000 concurrent users. Testing doesn’t have the weird edge cases that real traffic surfaces within minutes.

The fix took four hours. Sarah missed her family dinner. The company lost an estimated $2 million in sales. All because they had no automated checks between a developer’s laptop and production.

“Traditional software can break in predictable ways: it compiles or it doesn’t, tests pass or they don’t. ML models break in subtle ways: they pass tests but give bad predictions, or good predictions but slowly, or fast predictions on training data but slow on production data. CI/CD for ML needs to catch all of it.” — Sarah Park, speaking at MLConf 2024

This module teaches you how to build the safety nets Sarah wished she’d had. By the end, you’ll have CI/CD pipelines that catch bugs before production, validate models before deployment, and automatically retrain when data changes.

The Evolution of CI/CD for ML

Understanding where CI/CD for ML came from helps you appreciate why it’s different from traditional software CI/CD and where the field is heading.

Phase 1: Ad-Hoc Deployments (Pre-2015)

In the early days of production ML, deployment was largely manual. Data scientists would train models on their laptops, export weights, and hand them to engineers who would somehow integrate them into production systems. Version control was often a folder named “model_v2_final_FINAL_v3”. Testing meant “it looked good in the notebook.”

This approach was fragile but tolerable when ML was rare and low-stakes. When Google deployed PageRank in 1998, there was no CI/CD pipeline—Larry and Sergey manually updated the ranking algorithm. When Netflix launched its recommendation engine in 2006, model updates were quarterly events requiring extensive manual validation.

Phase 2: Custom Infrastructure (2015-2018)

As ML became critical to business (Uber’s surge pricing, Airbnb’s search ranking, Facebook’s news feed), companies built custom ML infrastructure. Google created TFX, Facebook built FBLearner, Uber developed Michelangelo. These systems automated training, validation, and deployment—but they were proprietary and specific to each company.

Did You Know? Google’s TFX (TensorFlow Extended) paper in 2017 was the first public description of a complete ML pipeline. It introduced the concept of “pipeline components” that are now standard: data validation, data transformation, training, model analysis, and serving. Every modern MLOps tool traces its lineage to TFX concepts.

Phase 3: Open Standards (2018-2021)

Open-source alternatives emerged: MLflow (Databricks, 2018), Kubeflow (Google, 2018), Airflow (Airbnb), and later Metaflow (Netflix, 2019). These tools democratized ML infrastructure but created fragmentation—teams could choose from dozens of tools for each pipeline stage, with minimal interoperability.

Phase 4: Platform Convergence (2021-Present)

Today we’re seeing consolidation around platform-agnostic standards. GitHub Actions has become the dominant CI/CD platform. Dagger enables portable pipelines. OCI (container) standards ensure models run anywhere. The goal is “build once, run anywhere” for ML pipelines.

What You’ll Be Able to Do

By the end of this module, you will:

Understand why CI/CD for ML is different from traditional software
Master GitHub Actions for ML workflows
Build automated testing pipelines for ML code
Implement continuous training (CT) pipelines
Create model validation gates
Use portable CI/CD with Dagger

Why CI/CD for ML is Different

Before diving into the how, let’s understand the why. CI/CD for ML isn’t just “regular CI/CD with different tools.” It’s a fundamentally different problem with unique challenges.

Think of traditional software like building a house from blueprints. The blueprints (code) define exactly what the house will look like. If you follow them correctly, you get a predictable result. The house either matches the blueprints or it doesn’t—there’s no ambiguity.

ML is more like training a dog. You provide inputs (training data) and rewards (loss functions), and the dog (model) learns behaviors. But unlike blueprints, you can’t perfectly predict what behaviors the dog will learn. Two dogs trained identically might behave slightly differently. And even a well-trained dog might behave unexpectedly in new situations.

This uncertainty changes everything about how you need to test and deploy.

The Traditional CI/CD Pipeline

TRADITIONAL SOFTWARE CI/CD
===========================

Code Change → Build → Test → Deploy
     │          │       │       │
     │          │       │       └── Ship binary/container
     │          │       └── Unit + Integration tests
     │          └── Compile/bundle
     └── Git push

Simple because:
- Code is the only artifact
- Tests are deterministic
- "Working" is binary (pass/fail)

The ML CI/CD Challenge

The core challenge is that ML has THREE things that can change independently, and any of them can break your system. Traditional CI/CD only deals with code changes. ML CI/CD must handle code, data, AND model changes—each with its own testing requirements.

This is like the difference between maintaining a car and maintaining a race horse. A car mechanic only worries about mechanical parts. A horse trainer worries about the horse’s physical condition, diet, training regimen, and psychology—all interacting in complex ways. ML systems are more like horses than cars.

ML CI/CD COMPLEXITY
===================

┌─────────────────────────────────────────────────────────────────────┐
│                    THREE THINGS CAN CHANGE                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. CODE                                                            │
│     Model architecture, feature engineering, inference code         │
│     Traditional CI/CD handles this                                  │
│                                                                     │
│  2. DATA                                                            │
│     Training data, validation data, production data drift           │
│     Need data validation, versioning, quality checks               │
│                                                                     │
│  3. MODEL                                                           │
│     Trained weights, hyperparameters, model version                 │
│     Need model validation, A/B testing, rollback                    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Any of these can trigger a pipeline!

Did You Know? Google’s ML platform team coined the term “ML Technical Debt” in a famous 2015 paper. They found that ML systems have a small fraction of actual ML code surrounded by a massive infrastructure for data collection, feature extraction, configuration, and monitoring. This is why CI/CD for ML is so complex—you’re not just testing code.

Continuous X in ML

THE CONTINUOUS SPECTRUM
=======================

CI  (Continuous Integration)
    → Code changes trigger tests
    → Unit tests, linting, type checking
    → Same as traditional software

CD  (Continuous Delivery/Deployment)
    → Successful tests trigger deployment
    → Model packaging, container builds
    → Deploy to staging/production

CT  (Continuous Training) ← NEW FOR ML!
    → Data changes trigger retraining
    → Scheduled or event-driven
    → Automatic model updates

CM  (Continuous Monitoring) ← NEW FOR ML!
    → Track model performance in production
    → Detect data drift, model degradation
    → Trigger retraining when needed

GitHub Actions for ML

GitHub Actions has become the dominant CI/CD platform for ML projects, and for good reason. It’s free for public repositories, integrates seamlessly with GitHub (where most ML projects live), and supports the complex workflows that ML requires.

Think of GitHub Actions like a programmable robot assistant that watches your repository. When you push code, create a pull request, or on a schedule, the robot wakes up and follows the instructions you’ve given it. Those instructions can include running tests, training models, deploying to production, or anything else you can script.

The key to effective ML CI/CD is teaching this robot to check everything that matters: code quality, data quality, model quality, and production readiness.

Anatomy of a Workflow

name: ML Pipeline

# Triggers
on:
  push:
    branches: [main, develop]
    paths:
      - 'src/**'
      - 'tests/**'
      - 'requirements.txt'
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 0 * * 0'  # Weekly retraining
  workflow_dispatch:      # Manual trigger

# Environment variables
env:
  PYTHON_VERSION: '3.10'
  MODEL_REGISTRY: 'models'

# Jobs
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      - run: pip install -r requirements.txt
      - run: pytest tests/

ML-Specific Workflow Patterns

# Pattern 1: Code Quality + ML Tests
jobs:
  code-quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint
        run: ruff check src/
      - name: Type Check
        run: mypy src/
      - name: Format Check
        run: black --check src/

  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: pytest tests/unit/ -v

  data-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate data schema
        run: python -m src.validate_data
      - name: Check data quality
        run: pytest tests/data/ -v

  model-tests:
    runs-on: ubuntu-latest
    needs: [unit-tests, data-tests]
    steps:
      - uses: actions/checkout@v4
      - name: Load model
        run: python -m src.load_model
      - name: Run model tests
        run: pytest tests/model/ -v
      - name: Check model metrics
        run: python -m src.validate_metrics

Caching for ML Workflows

# Cache dependencies (saves 2-5 minutes)
- uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
    restore-keys: |
      ${{ runner.os }}-pip-

# Cache model artifacts (saves download time)
- uses: actions/cache@v4
  with:
    path: models/
    key: models-${{ hashFiles('models/config.json') }}

# Cache Hugging Face models
- uses: actions/cache@v4
  with:
    path: ~/.cache/huggingface
    key: hf-${{ hashFiles('requirements.txt') }}

Did You Know? GitHub Actions provides 2,000 free minutes per month for private repos and unlimited minutes for public repos. A typical ML test suite takes 5-15 minutes, so you can run 130-400 pipeline runs per month for free. Self-hosted runners can reduce this further—and give you GPU access.

Testing Strategies for ML

Testing ML systems requires thinking in layers. Unlike traditional software where you’re mainly checking “does this function return the right value?”, ML testing asks questions like “is this data clean?”, “is this model accurate enough?”, “is this model fast enough?”, and “did this model get worse since last week?”

The testing pyramid visualizes how many tests you should have at each level. The base (unit tests) should be the widest—lots of fast, cheap tests catching obvious bugs. The peak (end-to-end tests) should be narrow—fewer slow, expensive tests validating the whole system.

Think of it like security at an airport. The first layer (unit tests) is the ticket check—fast and catches obvious issues. The middle layers (data and model tests) are like the metal detector and bag scanner—more thorough. The top layer (end-to-end tests) is like an air marshal on the plane—the last line of defense, slow and expensive, but catches what everything else missed.

The ML Testing Pyramid

                    ▲
                   ╱ ╲
                  ╱   ╲     End-to-End Tests
                 ╱ E2E ╲    (Full pipeline validation)
                ╱───────╲
               ╱         ╲   Model Tests
              ╱  MODEL    ╲  (Accuracy, latency, regression)
             ╱─────────────╲
            ╱               ╲  Data Tests
           ╱     DATA        ╲ (Schema, quality, drift)
          ╱───────────────────╲
         ╱                     ╲ Integration Tests
        ╱    INTEGRATION        ╲(API contracts, services)
       ╱─────────────────────────╲
      ╱                           ╲ Unit Tests
     ╱         UNIT                ╲(Functions, transformations)
    ╱───────────────────────────────╲

    MORE ──────────────────────────► FEWER
    FAST ──────────────────────────► SLOW
    CHEAP ─────────────────────────► EXPENSIVE

Unit Tests for ML Code

Unit tests for ML code follow the same principles as traditional software, but focus on the data transformation functions rather than business logic. These are your bread-and-butter tests: fast, deterministic, and numerous.

The key insight is that while ML model outputs are inherently probabilistic (and thus hard to unit test), the code around the model—preprocessing, postprocessing, feature engineering—is deterministic and should be tested thoroughly. If your normalization function returns NaN on edge cases, you want to catch that immediately, not when the model mysteriously fails in production.

import pytest
import numpy as np
from src.preprocessing import normalize, tokenize, extract_features

class TestNormalize:
    """Test normalization functions."""

    def test_normalize_zero_mean(self):
        """Output should have zero mean."""
        data = np.array([1, 2, 3, 4, 5])
        result = normalize(data)
        assert np.isclose(result.mean(), 0, atol=1e-7)

    def test_normalize_unit_variance(self):
        """Output should have unit variance."""
        data = np.array([1, 2, 3, 4, 5])
        result = normalize(data)
        assert np.isclose(result.std(), 1, atol=1e-7)

    def test_normalize_handles_constant(self):
        """Should handle constant arrays without division by zero."""
        data = np.array([5, 5, 5, 5, 5])
        result = normalize(data)
        assert not np.any(np.isnan(result))

    def test_normalize_empty_array(self):
        """Should raise on empty input."""
        with pytest.raises(ValueError):
            normalize(np.array([]))


class TestTokenize:
    """Test tokenization functions."""

    def test_tokenize_basic(self):
        """Basic tokenization should split on whitespace."""
        text = "Hello world"
        tokens = tokenize(text)
        assert tokens == ["hello", "world"]

    def test_tokenize_handles_punctuation(self):
        """Should remove punctuation."""
        text = "Hello, world!"
        tokens = tokenize(text)
        assert tokens == ["hello", "world"]

    def test_tokenize_max_length(self):
        """Should respect max_length parameter."""
        text = "one two three four five"
        tokens = tokenize(text, max_length=3)
        assert len(tokens) == 3

Data Quality Tests

Data quality tests are the ML-specific layer that traditional software doesn’t have. They answer questions like: Is the data schema correct? Are there unexpected nulls? Is the class distribution what we expected? Have we accidentally introduced duplicates?

These tests are crucial because bad data is the silent killer of ML models. A model trained on corrupted data will produce corrupted predictions, but it won’t throw an error. It’ll confidently give you wrong answers. Data quality tests are your firewall against this failure mode.

A 2023 study by Gartner found that poor data quality costs organizations an average of $12.9 million annually. For ML systems, the cost is even higher because bad data doesn’t just cause immediate failures—it trains models that make systematically wrong predictions for months before anyone notices. One financial services company discovered their fraud detection model had been trained on data where 23% of labels were incorrect, leading to $4.2 million in false positive operational costs before the issue was identified.

The best data quality tests codify your assumptions. If you assume all labels are 0, 1, or 2—test for it. If you assume no text is longer than 10,000 characters—test for it. If you assume at least 10% of data comes from each source—test for it. Assumptions that aren’t tested are assumptions that will break silently. Write down every assumption your team makes about the data, then turn each one into a test. This exercise alone often reveals hidden assumptions that team members didn’t know they disagreed about.

import pytest
import pandas as pd
from src.data import load_training_data

@pytest.fixture
def training_data():
    """Load training data for tests."""
    return load_training_data()

class TestDataSchema:
    """Verify data schema expectations."""

    def test_required_columns_exist(self, training_data):
        """All required columns must be present."""
        required = ['text', 'label', 'timestamp', 'source']
        missing = set(required) - set(training_data.columns)
        assert not missing, f"Missing columns: {missing}"

    def test_no_null_in_required_fields(self, training_data):
        """Required fields should not have nulls."""
        required = ['text', 'label']
        for col in required:
            null_count = training_data[col].isnull().sum()
            assert null_count == 0, f"{col} has {null_count} nulls"

    def test_label_values_valid(self, training_data):
        """Labels should be in expected set."""
        valid_labels = {0, 1, 2}  # negative, neutral, positive
        actual_labels = set(training_data['label'].unique())
        invalid = actual_labels - valid_labels
        assert not invalid, f"Invalid labels: {invalid}"


class TestDataQuality:
    """Verify data quality expectations."""

    def test_minimum_samples(self, training_data):
        """Should have minimum number of samples."""
        min_samples = 1000
        assert len(training_data) >= min_samples

    def test_class_balance(self, training_data):
        """Classes should be reasonably balanced."""
        label_counts = training_data['label'].value_counts()
        min_ratio = label_counts.min() / label_counts.max()
        assert min_ratio >= 0.1, f"Class imbalance ratio: {min_ratio}"

    def test_text_length_distribution(self, training_data):
        """Text lengths should be within expected range."""
        lengths = training_data['text'].str.len()
        assert lengths.min() >= 10, "Text too short"
        assert lengths.max() <= 10000, "Text too long"
        assert lengths.median() >= 50, "Median text length too short"

    def test_no_duplicate_texts(self, training_data):
        """Should not have duplicate texts."""
        duplicates = training_data['text'].duplicated().sum()
        duplicate_ratio = duplicates / len(training_data)
        assert duplicate_ratio < 0.01, f"Duplicate ratio: {duplicate_ratio:.2%}"

Model Quality Tests

Model quality tests are the heart of ML CI/CD—they verify that your model actually does what it’s supposed to do. These tests are harder than traditional unit tests because ML model behavior is probabilistic and can be sensitive to initialization, training data, and even hardware.

The trick is to test at different levels:

Smoke tests: Does the model load? Does it produce output at all? Does the output have the right shape?
Sanity tests: Are predictions within reasonable bounds? Does the model predict different classes (not collapsing to a single output)?
Performance tests: Does accuracy meet minimum thresholds? Is inference fast enough?
Regression tests: Is the new model at least as good as the old one?

Regression tests are particularly important. It’s easy to accidentally make a model worse while trying to improve it. Without automated regression checks, you might not notice until users complain—or worse, until you’ve lost revenue due to degraded predictions.

import pytest
import time
import numpy as np
from src.model import load_model, predict

@pytest.fixture(scope="module")
def model():
    """Load model once for all tests."""
    return load_model("models/production/model.pt")

@pytest.fixture
def test_samples():
    """Sample inputs for testing."""
    return [
        "This product is amazing!",
        "Terrible experience, never again.",
        "It's okay, nothing special.",
    ]

class TestModelAccuracy:
    """Verify model accuracy thresholds."""

    def test_accuracy_above_threshold(self, model):
        """Model accuracy should meet minimum threshold."""
        from src.evaluate import evaluate_on_test_set
        metrics = evaluate_on_test_set(model)
        assert metrics['accuracy'] >= 0.85, f"Accuracy {metrics['accuracy']}"

    def test_f1_score_above_threshold(self, model):
        """F1 score should meet minimum threshold."""
        from src.evaluate import evaluate_on_test_set
        metrics = evaluate_on_test_set(model)
        assert metrics['f1'] >= 0.80, f"F1 {metrics['f1']}"

    def test_no_class_collapse(self, model, test_samples):
        """Model should predict multiple classes."""
        predictions = [predict(model, text) for text in test_samples * 10]
        unique_predictions = set(predictions)
        assert len(unique_predictions) >= 2, "Model collapsed to single class"


class TestModelLatency:
    """Verify model inference performance."""

    def test_single_inference_latency(self, model, test_samples):
        """Single inference should be fast."""
        text = test_samples[0]

        start = time.perf_counter()
        predict(model, text)
        latency_ms = (time.perf_counter() - start) * 1000

        assert latency_ms < 100, f"Latency {latency_ms:.1f}ms exceeds 100ms"

    def test_batch_inference_latency(self, model, test_samples):
        """Batch inference should scale efficiently."""
        batch = test_samples * 100  # 300 samples

        start = time.perf_counter()
        for text in batch:
            predict(model, text)
        total_ms = (time.perf_counter() - start) * 1000

        per_sample_ms = total_ms / len(batch)
        assert per_sample_ms < 50, f"Per-sample latency {per_sample_ms:.1f}ms"


class TestModelRegression:
    """Verify model doesn't regress from baseline."""

    def test_no_accuracy_regression(self, model):
        """New model should not be worse than baseline."""
        from src.evaluate import evaluate_on_test_set, load_baseline_metrics

        current = evaluate_on_test_set(model)
        baseline = load_baseline_metrics()

        # Allow 1% regression tolerance
        min_accuracy = baseline['accuracy'] * 0.99
        assert current['accuracy'] >= min_accuracy, (
            f"Regression: {current['accuracy']:.3f} < {min_accuracy:.3f}"
        )

Continuous Training (CT)

Continuous Training is the ML-specific addition to the traditional CI/CD acronym soup. While CI (Continuous Integration) and CD (Continuous Deployment) handle code changes, CT handles model changes triggered by data changes.

Why do we need this? Because ML models decay. The data they were trained on becomes stale. User behavior changes. The world changes. A model trained on 2022 data might make terrible predictions in 2024 because the underlying patterns have shifted.

Think of CT like a gardener who continuously tends a garden. Traditional deployment is like planting a garden once and hoping it survives. CT is the ongoing work: watering (new data), pruning (model refinement), replanting (retraining when models decay). Without the gardener, the garden withers. Without CT, models degrade.

The key insight is that CT should be automatic but gated. You don’t want to deploy every retrained model—only ones that are actually better than what’s in production.

CT Architecture

CONTINUOUS TRAINING PIPELINE
============================

┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   DATA SOURCES           TRIGGERS              PIPELINE             │
│   ============           ========              ========             │
│                                                                     │
│   ┌─────────┐                                                       │
│   │ New Data│ ─────┐                                                │
│   └─────────┘      │     ┌──────────────┐     ┌──────────────┐     │
│                    ├────►│   Trigger    │────►│   Training   │     │
│   ┌─────────┐      │     │   Service    │     │   Pipeline   │     │
│   │Schedule │ ─────┤     └──────────────┘     └──────┬───────┘     │
│   │(Weekly) │      │                                 │              │
│   └─────────┘      │                                 ▼              │
│                    │                          ┌──────────────┐      │
│   ┌─────────┐      │                          │  Validation  │      │
│   │  Drift  │ ─────┘                          │    Gate      │      │
│   │Detected │                                 └──────┬───────┘      │
│   └─────────┘                                        │              │
│                                                      ▼              │
│                                               ┌──────────────┐      │
│                                               │   Deploy?    │      │
│                                               │  (if better) │      │
│                                               └──────────────┘      │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Scheduled Retraining Workflow

name: Continuous Training

on:
  schedule:
    - cron: '0 2 * * 0'  # Every Sunday at 2 AM
  workflow_dispatch:
    inputs:
      force_deploy:
        description: 'Deploy even if metrics are worse'
        required: false
        default: 'false'

jobs:
  fetch-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Fetch latest training data
        run: |
          python -m src.data.fetch \
            --start-date $(date -d '7 days ago' +%Y-%m-%d) \
            --end-date $(date +%Y-%m-%d) \
            --output data/new/

      - name: Upload data artifact
        uses: actions/upload-artifact@v4
        with:
          name: training-data
          path: data/new/

  train:
    needs: fetch-data
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Download data
        uses: actions/download-artifact@v4
        with:
          name: training-data
          path: data/new/

      - name: Train model
        run: |
          python -m src.train \
            --data data/new/ \
            --output models/candidate/ \
            --experiment-name "weekly-retrain-${{ github.run_id }}"

      - name: Upload model
        uses: actions/upload-artifact@v4
        with:
          name: candidate-model
          path: models/candidate/

  validate:
    needs: train
    runs-on: ubuntu-latest
    outputs:
      should_deploy: ${{ steps.compare.outputs.should_deploy }}
    steps:
      - uses: actions/checkout@v4

      - name: Download candidate model
        uses: actions/download-artifact@v4
        with:
          name: candidate-model
          path: models/candidate/

      - name: Download production model
        run: |
          aws s3 cp s3://models/production/ models/production/ --recursive

      - name: Compare models
        id: compare
        run: |
          python -m src.evaluate.compare \
            --candidate models/candidate/ \
            --baseline models/production/ \
            --output metrics.json

          # Check if candidate is better
          BETTER=$(python -c "
          import json
          m = json.load(open('metrics.json'))
          print('true' if m['candidate']['accuracy'] > m['baseline']['accuracy'] else 'false')
          ")
          echo "should_deploy=$BETTER" >> $GITHUB_OUTPUT

  deploy:
    needs: validate
    if: needs.validate.outputs.should_deploy == 'true' || github.event.inputs.force_deploy == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Download candidate model
        uses: actions/download-artifact@v4
        with:
          name: candidate-model
          path: models/candidate/

      - name: Deploy to production
        run: |
          # Upload to S3
          aws s3 cp models/candidate/ s3://models/production/ --recursive

          # Update Kubernetes deployment
          kubectl set image deployment/model-server \
            model=myregistry/model:${{ github.sha }}

      - name: Notify
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -d '{"text": "New model deployed! Run: ${{ github.run_id }}"}'

Did You Know? Uber’s Michelangelo platform processes over 1.5 million predictions per second. They implemented continuous training that automatically retrains models when feature drift exceeds thresholds. Their paper “Meet Michelangelo: Uber’s Machine Learning Platform” (2017) was foundational for MLOps practices.

Model Validation Gates

Validation gates are checkpoints that a model must pass before deployment. They’re the automated version of a human reviewer asking “is this model good enough for production?”

Without validation gates, you’re trusting that whoever pushed the model did all the right checks manually. This is the same mistake traditional software made before CI/CD—trusting developers to remember to run all the tests. Developers are human. Humans forget. Automation doesn’t forget.

The gate pattern is inspired by manufacturing quality control. Think of a car factory where every car passes through inspection stations before leaving. The first station checks the engine. The second checks the brakes. The third checks the electronics. A car that fails any station doesn’t ship—it goes back for fixes. Your models should work the same way.

Quality Gates Pattern

MODEL VALIDATION GATES
======================

Candidate Model
      │
      ▼
┌─────────────────┐
│ Gate 1: Schema  │ → Does model output match expected format?
│   Validation    │   (shapes, types, ranges)
└────────┬────────┘
         │ PASS
         ▼
┌─────────────────┐
│ Gate 2: Metrics │ → Does accuracy meet threshold?
│   Threshold     │   (accuracy >= 0.85, latency < 100ms)
└────────┬────────┘
         │ PASS
         ▼
┌─────────────────┐
│ Gate 3: No      │ → Is it better than current production?
│   Regression    │   (accuracy_new >= accuracy_old * 0.99)
└────────┬────────┘
         │ PASS
         ▼
┌─────────────────┐
│ Gate 4: Shadow  │ → Does it work on real traffic?
│   Testing       │   (A/B test with no user impact)
└────────┬────────┘
         │ PASS
         ▼
    DEPLOY

Implementation

from dataclasses import dataclass
from typing import Callable, Optional
from enum import Enum

class GateStatus(Enum):
    PASSED = "passed"
    FAILED = "failed"
    SKIPPED = "skipped"

@dataclass
class GateResult:
    gate_name: str
    status: GateStatus
    message: str
    metrics: dict = None

class ValidationGate:
    """Base class for validation gates."""

    def __init__(self, name: str, required: bool = True):
        self.name = name
        self.required = required

    def check(self, model, context: dict) -> GateResult:
        raise NotImplementedError


class MetricsThresholdGate(ValidationGate):
    """Check if model meets minimum metrics thresholds."""

    def __init__(
        self,
        thresholds: dict,
        name: str = "metrics_threshold",
    ):
        super().__init__(name)
        self.thresholds = thresholds

    def check(self, model, context: dict) -> GateResult:
        metrics = context.get('metrics', {})

        failures = []
        for metric, threshold in self.thresholds.items():
            value = metrics.get(metric, 0)
            if value < threshold:
                failures.append(
                    f"{metric}: {value:.3f} < {threshold:.3f}"
                )

        if failures:
            return GateResult(
                gate_name=self.name,
                status=GateStatus.FAILED,
                message=f"Thresholds not met: {', '.join(failures)}",
                metrics=metrics,
            )

        return GateResult(
            gate_name=self.name,
            status=GateStatus.PASSED,
            message="All thresholds met",
            metrics=metrics,
        )


class NoRegressionGate(ValidationGate):
    """Check that new model isn't worse than baseline."""

    def __init__(
        self,
        metric: str = "accuracy",
        tolerance: float = 0.01,
        name: str = "no_regression",
    ):
        super().__init__(name)
        self.metric = metric
        self.tolerance = tolerance

    def check(self, model, context: dict) -> GateResult:
        current = context.get('metrics', {}).get(self.metric, 0)
        baseline = context.get('baseline_metrics', {}).get(self.metric, 0)

        min_allowed = baseline * (1 - self.tolerance)

        if current < min_allowed:
            return GateResult(
                gate_name=self.name,
                status=GateStatus.FAILED,
                message=f"Regression: {current:.3f} < {min_allowed:.3f}",
                metrics={"current": current, "baseline": baseline},
            )

        return GateResult(
            gate_name=self.name,
            status=GateStatus.PASSED,
            message=f"No regression: {current:.3f} >= {min_allowed:.3f}",
            metrics={"current": current, "baseline": baseline},
        )


class ValidationPipeline:
    """Run model through validation gates."""

    def __init__(self, gates: list[ValidationGate]):
        self.gates = gates

    def validate(self, model, context: dict) -> tuple[bool, list[GateResult]]:
        results = []
        all_passed = True

        for gate in self.gates:
            result = gate.check(model, context)
            results.append(result)

            if result.status == GateStatus.FAILED and gate.required:
                all_passed = False
                break  # Stop on first required failure

        return all_passed, results

Portable CI/CD with Dagger

Why This Module Matters

Here’s a frustrating reality of CI/CD: your pipeline YAML is vendor-locked. A GitHub Actions workflow doesn’t run on GitLab CI. A GitLab pipeline doesn’t run on Jenkins. A CircleCI config doesn’t run locally. You’re learning platform-specific DSLs that don’t transfer.

This is the same problem that existed for applications before Docker. “Works on my machine” was the dreaded phrase because every machine had different configurations. Docker solved it by containerizing applications. Dagger solves the same problem for CI/CD pipelines.

The core insight is brilliant: write your pipeline as actual code (Python, Go, TypeScript), run it inside containers, and let Dagger handle the orchestration. The same pipeline runs on your laptop, in GitHub Actions, or in any other CI system. No more “it passed locally but failed in CI” mysteries.

THE CI VENDOR LOCK-IN PROBLEM
=============================

Traditional Approach:
┌─────────────────┐
│ GitHub Actions  │ ← Workflow YAML (vendor-specific)
│ GitLab CI       │ ← .gitlab-ci.yml (different syntax)
│ Jenkins         │ ← Jenkinsfile (Groovy DSL)
│ CircleCI        │ ← config.yml (yet another format)
└─────────────────┘

Problems:
- Can't test locally
- Vendor-specific syntax
- Hard to debug
- "Works on CI" ≠ "Works locally"

Dagger Approach:
┌─────────────────┐
│     Dagger      │ ← Write pipelines in Python/Go/TypeScript
│   (Portable)    │ ← Run anywhere: local, GitHub, GitLab, etc.
└─────────────────┘

Benefits:
- Test locally before pushing
- Same code runs everywhere
- Type-safe, IDE support
- Cacheable, reproducible

Did You Know? Dagger was created by Solomon Hykes (the creator of Docker) in 2022. His insight was that CI/CD pipelines have the same portability problem that Docker solved for applications. Dagger pipelines run inside containers, making them truly portable across CI platforms.

Dagger Pipeline Example

import dagger
from dagger import dag, function, object_type

@object_type
class MLPipeline:
    """ML Pipeline with Dagger."""

    @function
    async def test(self, source: dagger.Directory) -> str:
        """Run tests on the ML code."""
        return await (
            dag.container()
            .from_("python:3.10-slim")
            .with_directory("/app", source)
            .with_workdir("/app")
            .with_exec(["pip", "install", "-r", "requirements.txt"])
            .with_exec(["pip", "install", "pytest"])
            .with_exec(["pytest", "tests/", "-v"])
            .stdout()
        )

    @function
    async def lint(self, source: dagger.Directory) -> str:
        """Lint the code."""
        return await (
            dag.container()
            .from_("python:3.10-slim")
            .with_directory("/app", source)
            .with_workdir("/app")
            .with_exec(["pip", "install", "ruff", "mypy"])
            .with_exec(["ruff", "check", "src/"])
            .with_exec(["mypy", "src/"])
            .stdout()
        )

    @function
    async def train(
        self,
        source: dagger.Directory,
        data: dagger.Directory,
        epochs: int = 10,
    ) -> dagger.Directory:
        """Train the model."""
        return await (
            dag.container()
            .from_("pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime")
            .with_directory("/app", source)
            .with_directory("/data", data)
            .with_workdir("/app")
            .with_exec(["pip", "install", "-r", "requirements.txt"])
            .with_exec([
                "python", "-m", "src.train",
                "--data", "/data",
                "--output", "/models",
                "--epochs", str(epochs),
            ])
            .directory("/models")
        )

    @function
    async def build_image(
        self,
        source: dagger.Directory,
        model: dagger.Directory,
    ) -> str:
        """Build production Docker image."""
        container = (
            dag.container()
            .from_("python:3.10-slim")
            .with_directory("/app", source)
            .with_directory("/app/models", model)
            .with_workdir("/app")
            .with_exec(["pip", "install", "-r", "requirements.txt"])
            .with_entrypoint(["python", "-m", "src.serve"])
        )

        # Publish to registry
        address = await container.publish(
            f"myregistry/ml-model:latest"
        )
        return address

    @function
    async def full_pipeline(
        self,
        source: dagger.Directory,
        data: dagger.Directory,
    ) -> str:
        """Run the complete ML pipeline."""
        # Run tests and lint in parallel
        test_result = self.test(source)
        lint_result = self.lint(source)

        # Wait for both
        await test_result
        await lint_result

        # Train model
        model = await self.train(source, data)

        # Build and publish image
        image = await self.build_image(source, model)

        return f"Pipeline complete! Image: {image}"

Running Dagger Locally

# Install Dagger CLI
curl -L https://dl.dagger.io/dagger/install.sh | sh

# Run pipeline locally
dagger call test --source=.

# Run full pipeline
dagger call full-pipeline --source=. --data=./data/

# Call from GitHub Actions
# .github/workflows/dagger.yml
name: Dagger Pipeline
on: push
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dagger/dagger-for-github@v5
        with:
          verb: call
          args: full-pipeline --source=. --data=./data/

Workflow Patterns for ML

Pattern 1: PR Validation

name: PR Validation

on:
  pull_request:
    branches: [main]

jobs:
  quick-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint & Format
        run: |
          pip install ruff black
          ruff check src/
          black --check src/

  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Unit Tests
        run: |
          pip install -r requirements.txt pytest
          pytest tests/unit/ -v --tb=short

  model-smoke-test:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - name: Quick Model Test
        run: |
          pip install -r requirements.txt
          python -m src.test_model --quick

Pattern 2: Release Pipeline

name: Release

on:
  push:
    tags:
      - 'v*'

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Full Test Suite
        run: pytest tests/ -v

  build-image:
    needs: build-and-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker Image
        run: |
          docker build -t myapp:${{ github.ref_name }} .

      - name: Push to Registry
        run: |
          docker push myapp:${{ github.ref_name }}

  deploy-staging:
    needs: build-image
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - name: Deploy to Staging
        run: |
          kubectl set image deployment/app app=myapp:${{ github.ref_name }}

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy to Production
        run: |
          kubectl set image deployment/app app=myapp:${{ github.ref_name }}

Pattern 3: Matrix Testing

# Test across multiple Python versions and OS
jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest]
        python-version: ['3.9', '3.10', '3.11']
        exclude:
          - os: macos-latest
            python-version: '3.9'

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - run: pytest tests/

Common Mistakes and How to Avoid Them

Mistake 1: Testing in Production

Many teams skip comprehensive CI testing because “we’ll catch issues in production.” This is like skipping the parachute check because “we’ll know if it’s broken when we jump.”

The problem:

Production issues affect real users
Debugging in production is expensive (both time and money)
Some bugs are hard to reproduce once they’ve occurred
Rollback may not be possible (data contamination, user impact)

The solution:

Mirror production as closely as possible in CI
Use production data samples (anonymized) for testing
Test with production-like load (staging environment)
Shadow deploy: run new model on production traffic without serving results

Mistake 2: Not Versioning Data

Code is versioned in git. Models are versioned in MLflow. But data? Often it lives in a bucket and nobody tracks which version was used for which model.

The problem:

# Which data did this model use?
model_v3.pt  # No idea. The S3 bucket was updated since training.

The solution:

# DVC (Data Version Control) tracks data alongside code
dvc add data/training.csv
git add data/training.csv.dvc
git commit -m "Training data v3 - added October examples"

Now you can checkout any commit and get the exact data that was used.

Mistake 3: Manual Approval Bottlenecks

Some teams require human approval for every deployment. This sounds safe but creates bottlenecks—models wait days or weeks for review.

The better approach:

Automated gates for objective criteria (accuracy, latency, no regression)
Human approval only for subjective criteria (UI changes, new features)
Tiered risk: routine updates auto-deploy, risky changes need review
Clear escalation paths when automation is uncertain

Mistake 4: Ignoring Cost in CI/CD Design

GPU-intensive jobs can cost $10-100 per run. Running full training on every PR commit adds up fast.

Cost-conscious patterns:

Path filters: only run expensive jobs when relevant files change
Smaller models for PR validation, full training on merge to main
Shared caching across jobs (pip cache, model cache, data cache)
Spot instances for non-urgent training jobs (50-70% savings)
Kill stuck jobs: timeout limits prevent runaway costs

Did You Know? Some companies spend more on CI/CD compute than on production inference. A 2023 survey found that 23% of ML teams had experienced “bill shock” from CI/CD costs. The solution isn’t less testing—it’s smarter testing. Cache aggressively, run smaller validations on PRs, save full training for merge events.

Hands-On Exercises

Exercise 1: Basic ML Workflow

Create a GitHub Actions workflow that:

Runs on push to main
Lints with ruff
Runs pytest
Reports code coverage

Exercise 2: Continuous Training

Create a workflow that:

Runs weekly on schedule
Fetches new data
Retrains the model
Compares with baseline
Deploys if better

Exercise 3: Validation Gates

Implement validation gates for:

Minimum accuracy threshold
Maximum latency requirement
No regression from baseline
Memory usage limit

Production War Stories: When CI/CD Fails (and Saves the Day)

Learning from real failures and successes helps you design better pipelines.

The Model That Passed All Tests (But Was Wrong)

New York. March 2024. A fintech startup had a robust CI/CD pipeline with 94% test coverage. Their credit scoring model passed every automated check: unit tests , data validation , accuracy threshold , latency check .

One month after deployment, they discovered the model was systematically rejecting applicants with certain ZIP codes. The model had learned geographic biases from historical data—and none of their tests caught it.

What went wrong? Their tests validated accuracy but not fairness. The model performed well on aggregate metrics while discriminating against specific groups.

The fix:

Added fairness tests: disparate impact ratio, equalized odds
Slice-based evaluation: accuracy per demographic group
“Failure mode” test suite: adversarial examples designed to catch biases

Lesson: Accuracy isn’t enough. Test for what matters—and fairness matters.

The Pipeline That Saved Christmas

San Francisco. December 15, 2023. An e-commerce company’s ML team was preparing for the holiday rush. Their continuous training pipeline detected something alarming: model accuracy had dropped 8% over the past week.

The automated drift detection triggered an investigation. The root cause? A change in the data pipeline had corrupted 12% of product descriptions with HTML tags. The model was learning to predict based on garbage data.

Because the CT pipeline caught the drift automatically, the team fixed the data issue and retrained before the holiday traffic surge. Without automated monitoring, they might not have noticed until customers complained about bad recommendations.

What went right?

Automated drift detection with alerts
Daily model evaluation on fresh data
Clear runbooks for investigation
Data lineage tracking to find root cause

Lesson: Continuous monitoring isn’t paranoia—it’s preparedness.

The $100,000 GPU Bill

Seattle. August 2023. A startup’s CI/CD pipeline had a bug: every PR triggered a full model training job on expensive GPU instances. For two weeks, nobody noticed. When the AWS bill arrived, the CTO nearly had a heart attack.

What went wrong?

No cost alerts or budgets
Training jobs ran on A100s regardless of changes
No caching of unchanged model artifacts
PRs didn’t distinguish “code that affects training” from “documentation changes”

The fix:

Path filters: only run expensive jobs when ML code changes
Smaller models for PR validation, full training only on merge
Cost alerts at $1000/day
Caching: skip training if data and code haven’t changed

Lesson: Design pipelines for cost-efficiency from day one. GPU minutes add up fast.

Interview Prep: CI/CD for ML

These questions come up in ML engineering and MLOps interviews.

Common Questions

Q: “What makes CI/CD for ML different from traditional software?”

Strong Answer: “Three key differences: First, ML has three artifacts that can change (code, data, model) while traditional software only has code. Second, ML tests are probabilistic—a model might be 85% accurate, not pass/fail. Third, ML needs continuous training because models decay as data distributions shift. This means ML pipelines need data validation, model evaluation gates, and drift monitoring—none of which traditional CI/CD addresses.”

Q: “How would you design a continuous training pipeline?”

Strong Answer: “I’d design it with four stages: First, a trigger mechanism—scheduled (weekly), event-driven (new data arrives), or threshold-based (drift detected). Second, a training stage that versions both code and data, uses reproducible random seeds, and logs all hyperparameters. Third, a validation gate comparing the new model against the current production model—only deploy if better. Fourth, a gradual rollout: shadow mode first, then canary at 5%, then full deployment. I’d also include automatic rollback if post-deployment metrics degrade.”

Q: “Your model passed all tests but performs poorly in production. What would you investigate?”

Strong Answer: “I’d investigate several failure modes: First, data distribution shift—is production data different from test data? Second, feature leakage in test data that doesn’t exist in production. Third, silent infrastructure differences—maybe the test environment has more memory or faster CPUs. Fourth, time-based issues—does the model degrade on data from different time periods? Fifth, bias in test data selection—maybe tests use a non-representative sample. I’d add slice-based evaluation, production traffic replay in CI, and more comprehensive drift detection.”

The Economics of CI/CD for ML

Understanding costs helps you design efficient pipelines.

Cost Components

Component	Typical Cost	Optimization Strategy
Compute (CPU)	$0.05/minute	Use smaller instances for tests
Compute (GPU)	$0.50-3.00/minute	Run only when needed
Storage	$0.02/GB/month	Clean up old artifacts
Network transfer	$0.09/GB	Cache locally, minimize pulls
Secrets management	$0.40/10K calls	Batch secret reads

Cost vs Speed Tradeoffs

Faster pipelines cost more:

Parallel jobs finish faster but cost more compute
Larger instances reduce build time but increase cost
More frequent runs catch issues earlier but consume resources

The optimal balance depends on:

How often you deploy (daily? weekly?)
Cost of production bugs (higher risk = more testing)
Team velocity requirements

Benchmarks: What Teams Actually Spend

Based on industry surveys and published data:

Team Size	Monthly CI/CD Cost	Cost per Deployment
Small (5 devs)	$200-500	$5-20
Medium (20 devs)	$1,000-5,000	$10-50
Large (100+ devs)	$10,000-50,000	$20-100

For ML teams, GPU usage can triple these costs if not managed carefully.

Did You Know? GitHub Actions offers 2,000 free minutes per month for private repositories and unlimited for public repositories. Self-hosted runners can reduce costs by 80% or more if you have spare on-premise hardware—especially for GPU workloads.

Knowledge Check

Test your understanding of CI/CD for ML concepts.

What are the three things that can trigger an ML pipeline?

Code changes (git push), data changes (new training data arrives), and model degradation (detected via monitoring/drift detection). Each requires different validation approaches.

What is Continuous Training (CT)?

CT is the ML-specific addition to CI/CD that automatically retrains models when data changes. Unlike code, models decay over time as the world changes. CT ensures models stay current through scheduled retraining, event-driven retraining (new data), or threshold-based retraining (when monitoring detects degradation).

Why is the ML testing pyramid different from traditional software?

ML adds two new layers: data tests and model tests. Traditional software only needs unit, integration, and E2E tests. ML needs data quality tests (schema, distributions, no corruption) and model quality tests (accuracy thresholds, latency, no regression). The model layer is probabilistic—tests check ranges and statistical properties rather than exact values.

What problem does Dagger solve for CI/CD?

Vendor lock-in. Traditional CI/CD pipelines (GitHub Actions YAML, GitLab CI, Jenkins) use different syntaxes and don’t run locally. Dagger lets you write pipelines in real programming languages (Python, Go, TypeScript) that run identically on your laptop, in GitHub Actions, or anywhere else. It’s Docker for CI/CD.

What are validation gates and why are they important?

Validation gates are automated checkpoints that a model must pass before deployment. They include schema validation (correct output format), metrics thresholds (accuracy >= X), regression checks (not worse than current), and shadow testing (works on real traffic). They’re important because they prevent bad models from reaching production without manual review of every deployment.

The Future of CI/CD for ML

Where is this field heading? Understanding trends helps you make better technology choices today.

Trend 1: AI-Assisted CI/CD

Ironically, AI is being used to improve AI pipelines. Tools like Sourcegraph Cody and GitHub Copilot can generate workflow files. Automated test generation creates data and model tests from code analysis. Intelligent caching predicts which tests are likely to fail, running them first.

Within 2-3 years, expect to see CI/CD systems that automatically detect when models are degrading, generate retraining jobs, and even suggest hyperparameter changes based on historical patterns.

Trend 2: Universal Pipeline Standards

Today’s fragmentation (GitHub Actions, GitLab CI, Jenkins, CircleCI) is giving way to portable standards. Dagger lets you write pipelines in real programming languages. OCI (Open Container Initiative) standardizes container formats. OpenLineage standardizes data lineage tracking. The future is “write once, run anywhere” for ML pipelines.

Trend 3: Shift-Left Security

Security is moving earlier in the pipeline, from “check before deploy” to “check on every commit.” This includes dependency scanning, code scanning, and even model security scanning (checking for adversarial vulnerabilities). Expect security gates to become as common as unit tests.

Trend 4: Cost Intelligence

As cloud bills grow, pipelines will optimize themselves for cost. Spot instance orchestrators that automatically switch to cheaper capacity. Intelligent scheduling that batches jobs during off-peak hours. Automatic right-sizing that chooses the smallest instance that can complete in reasonable time. Cost-aware routing that chooses between cloud providers based on real-time pricing.

Did You Know? Netflix’s Metaflow includes automatic resource estimation—it profiles your training job and predicts how much compute you need. This prevents both under-provisioning (failed jobs) and over-provisioning (wasted money). Expect this capability to become standard in all ML pipeline tools.

⏭️ Next Steps

You now understand CI/CD for ML! Key takeaways:

ML pipelines are triggered by code, data, AND model changes
Testing includes data quality and model quality tests
Continuous Training automates model updates
Validation gates prevent bad models from deploying

Up Next: Module 46 - Kubernetes Fundamentals for ML

Key Takeaways

After completing this module, remember these essential points:

Three Triggers: ML pipelines must handle code changes, data changes, and model degradation. Traditional CI/CD only handles code. Design your pipelines to respond to all three.
Testing Pyramid for ML: Add data quality tests and model quality tests to your traditional unit, integration, and E2E tests. Data tests validate schema and distributions. Model tests validate accuracy, latency, and no regression.
Continuous Training: Models decay as data distributions shift. Implement CT (Continuous Training) through scheduled retraining, event-driven retraining when new data arrives, or threshold-based retraining when monitoring detects degradation.
Validation Gates: Automated checkpoints prevent bad models from reaching production. Include schema validation, metrics thresholds, regression checks, and shadow testing in your gates.
Cost Awareness: GPU jobs are expensive. Use path filters, caching, and smaller models for PR validation. Save full training for merge events. Monitor costs and set alerts.
Portability Matters: Consider tools like Dagger that write pipelines in real programming languages rather than vendor-specific YAML. This enables local testing and prevents lock-in.
Security First: CI/CD handles sensitive credentials. Use proper secret management, audit access, and implement supply chain security (dependency scanning, signed images).
Start Simple, Evolve: Don’t over-engineer from day one. Start with a basic pipeline (lint + test + deploy), then add data validation, model tests, and continuous training as your needs grow. A simple pipeline that runs is infinitely better than a complex pipeline that nobody maintains.
Monitor Everything: The pipeline doesn’t end at deployment. Monitor model performance in production. Detect data drift. Track latency and error rates. Use these signals to trigger retraining automatically.
Document Your Decisions: Future you (and your teammates) will thank you for documenting why you chose specific thresholds, test coverage levels, and deployment strategies. CI/CD pipelines accumulate technical debt like any other code.

Module 45 Complete! You now understand CI/CD for ML! “The best pipeline is the one that catches problems before production.”