Module 3.3: Argo Workflows
Toolkit Track | Complexity: [COMPLEX] | Time: 55-70 min | Prerequisites: CI/CD concepts, Kubernetes Pods and Jobs, container images, basic YAML, and the idea of a Directed Acyclic Graph.
Learning Outcomes
After completing this module, you will be able to:
- Design Argo Workflows that express sequential, parallel, and fan-out/fan-in delivery or data-processing pipelines without overloading a Kubernetes cluster.
- Debug failed workflow nodes by connecting Argo status, pod status, logs, retry policy, resource requests, and artifact movement into one investigation path.
- Evaluate when to use parameters, artifacts, WorkflowTemplates, retries, timeouts, and `parallelism` controls for a given operational scenario.
- Compare Argo Workflows with Tekton and Airflow for CI/CD, ML, batch processing, and event-driven orchestration use cases.
- Implement a reusable workflow pattern that uses dependency gates, runtime parameters, bounded parallelism, and cleanup behavior suitable for shared clusters.
Why This Module Matters
A platform team inherited a nightly model-training pipeline that looked reliable on paper. The repository had a tidy CI workflow, the Python dependencies were pinned, and the training code had unit tests. In production, however, the work ran on a mixture of long-lived virtual machines, manually patched notebooks, and shared shell scripts. Every failed run started the same argument: was the data stale, was the environment different, did the worker run out of memory, or did someone change a hidden dependency?
The team moved each stage into a container and scheduled the work on Kubernetes, but a plain set of Jobs was still not enough. The pipeline needed to extract data, validate it, train several candidates in parallel, compare metrics, publish artifacts, and notify the model registry only if the right branches succeeded. They needed a workflow engine that treated Kubernetes pods as the execution units while still letting engineers describe dependencies, retries, artifacts, and cleanup as one versioned object.
Argo Workflows is a container-native workflow engine for orchestrating parallel jobs on Kubernetes. It is not just “CI inside Kubernetes.” While Tekton focuses on CI/CD pipelines, Argo Workflows is strongest when the shape of the work is a graph: data pipelines, ML workflows, simulation batches, security scans, report generation, or delivery flows where some branches run together and others wait for a merge point.
The senior-level lesson is that Argo gives you power that Kubernetes alone does not constrain for you. A workflow that creates ten pods is convenient; a workflow that creates thousands of pods can stress the API server, scheduler, etcd, storage system, and neighboring workloads. Good Argo design is therefore two skills at once: expressing the dependency graph clearly, and putting operational limits around how that graph runs on a real cluster.
Prerequisites
You should already understand what a Kubernetes Pod is, how a Job differs from a long-running Deployment, and why containers make runtime environments repeatable. If you have used kubectl apply, read pod logs, and followed a basic CI pipeline from clone to test to build, you have enough background to start this module.
This module uses k as a shell alias for kubectl after the first installation command. The alias is not magic; it is just a shorter command used by many Kubernetes practitioners. When you copy commands into a fresh shell, create the alias first so the remaining examples work as written.
```bash
alias k=kubectl
```

Core Concepts: How Argo Thinks About Work
Argo Workflows turns a workflow specification into Kubernetes pods. The specification says what tasks exist, which template each task runs, which tasks must finish before another task can start, and what values or files move between tasks. The controller watches the custom resource, creates pods for runnable nodes, records status back onto the Workflow object, and repeats that loop until the graph succeeds, fails, or is terminated.
```text
┌──────────────────────────────────────────────────────────────────────┐
│                     ARGO WORKFLOWS CONTROL LOOP                      │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   Workflow YAML                                                      │
│   ┌──────────────────────────────────────────────────────────────┐   │
│   │ apiVersion: argoproj.io/v1alpha1                             │   │
│   │ kind: Workflow                                               │   │
│   │ spec:                                                        │   │
│   │   entrypoint: main                                           │   │
│   │   templates:                                                 │   │
│   │     - name: main                                             │   │
│   │       dag:                                                   │   │
│   │         tasks:                                               │   │
│   │           - name: A                                          │   │
│   │           - name: B   dependencies: [A]                      │   │
│   │           - name: C   dependencies: [A]                      │   │
│   │           - name: D   dependencies: [B, C]                   │   │
│   └───────────────────────────────┬──────────────────────────────┘   │
│                                   │                                  │
│                                   ▼                                  │
│   Workflow Controller                                                │
│   ┌──────────────────────────────────────────────────────────────┐   │
│   │ Parses the graph, checks dependencies, creates pods, records │   │
│   │ node phase, handles retries, enforces timeouts, and moves    │   │
│   │ parameters or artifacts according to the template contracts. │   │
│   └───────────────────────────────┬──────────────────────────────┘   │
│                                   │                                  │
│                                   ▼                                  │
│   Kubernetes Execution                                               │
│   ┌──────────────────────────────────────────────────────────────┐   │
│   │                                                              │   │
│   │  A completes                                                 │   │
│   │    ├──────────────► B runs ───────┐                          │   │
│   │    └──────────────► C runs ───────┤                          │   │
│   │                                   ▼                          │   │
│   │                         D runs after B and C                 │   │
│   │                                                              │   │
│   └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

The important mental model is that Argo is not running your code inside the controller. The controller is an orchestrator, and your code runs in ordinary pods. That separation is what makes Argo fit Kubernetes well: scheduling, logs, image pulls, resource limits, service accounts, network policies, and secrets all behave like Kubernetes primitives because the work is executed by Kubernetes primitives.
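The check-and-launch loop can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Argo's implementation: the `deps` map mirrors the A, B, C, D graph from the diagram, the function names `runnable` and `run_to_completion` are hypothetical, and every pod is assumed to succeed on its first attempt.

```python
# Toy model of the controller's scheduling decision; not Argo's real code.
# Assumption: every task succeeds, so one "wave" finishes before the next starts.
deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}


def runnable(deps, done):
    """Return tasks that have not run yet and whose dependencies are all done."""
    return sorted(
        name
        for name, needs in deps.items()
        if name not in done and all(dep in done for dep in needs)
    )


def run_to_completion(deps):
    """Repeat the check-and-launch loop until every task has completed."""
    done, waves = set(), []
    while len(done) < len(deps):
        ready = runnable(deps, done)
        if not ready:
            raise ValueError("no runnable task: cycle or unknown dependency")
        waves.append(ready)  # the controller would create one pod per ready task
        done.update(ready)   # pretend each pod reached the Succeeded phase
    return waves


waves = run_to_completion(deps)
print(waves)  # [['A'], ['B', 'C'], ['D']]
```

The point of the sketch is that nothing in the loop executes user code. It only decides which tasks are eligible, which is why the heavy lifting stays in ordinary pods.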
A Workflow contains templates, and templates are reusable task definitions inside that workflow. A template can run a container, run a script, define a list of steps, or define a DAG. Tasks are invocations of those templates with specific arguments. This distinction matters because a template is like a function definition, while a task is like a function call placed at a specific point in the graph.
| Concept | What it means in practice | Design question it answers |
|---|---|---|
| Workflow | A submitted run with metadata, status, arguments, and templates. | What complete unit of work should the controller execute and track? |
| Template | A reusable definition for a container, script, steps group, or DAG. | What kind of pod or subgraph should this task create? |
| Task | A named use of a template inside a DAG, usually with arguments. | When should this particular unit run, and what inputs should it receive? |
| Parameter | A small string value passed through templates and task outputs. | Is this value small enough to treat as configuration or a simple result? |
| Artifact | A file or directory passed through an artifact repository or local path. | Is this data too large or structured to safely squeeze into a string? |
| WorkflowTemplate | A reusable workflow definition stored in the cluster. | Should many teams or runs share the same graph shape? |
Active check: Before reading further, describe a three-stage pipeline you already know. Mark which stages are task invocations, which parts could become reusable templates, and which outputs are small parameters versus file artifacts. If everything in your sketch is a task but nothing is reusable, you are probably designing a one-off run rather than a platform pattern.
Installing Argo Workflows Safely
Installation is simple, but the field choices around installation are not trivial. Argo installs a controller, a server, service accounts, CRDs, and supporting RBAC. In a learning cluster, it is reasonable to install the upstream manifest into an argo namespace. In a shared platform cluster, the safer pattern is to treat Argo like any other control-plane extension: pin a release, review RBAC, decide authentication mode, decide artifact storage, and decide which namespaces may submit workflows.
The commands below use a local or disposable cluster. They install Argo Workflows into the argo namespace, wait for the controller, and port-forward the UI. The example uses the upstream latest install URL because it is convenient for a lab, but production teams should pin a tested release artifact so a rebuild of the same environment does not silently change controller behavior.
```bash
k create namespace argo
k apply -n argo -f https://github.com/argoproj/argo-workflows/releases/latest/download/install.yaml
k -n argo wait --for=condition=ready pod -l app=workflow-controller --timeout=120s
```

The CLI is useful because it gives workflow-specific commands such as argo submit, argo get, argo logs, and argo terminate. You can still inspect everything with Kubernetes commands, but the Argo CLI presents the graph and node status in a way that maps directly to the workflow specification.
```bash
brew install argo
argo version
```

On Linux, you can download the release binary directly. In a corporate environment, prefer your approved package mirror or internal tool distribution system rather than having every engineer download binaries manually from the internet.
```bash
curl -sLO https://github.com/argoproj/argo-workflows/releases/latest/download/argo-linux-amd64.gz
gunzip argo-linux-amd64.gz
chmod +x argo-linux-amd64
sudo mv argo-linux-amd64 /usr/local/bin/argo
argo version
```

The Argo Server UI is useful for learning because it shows the workflow graph, node status, logs, and artifacts. For a local lab, you can port-forward it and open the HTTPS endpoint on the loopback address. For a real platform, expose the UI through your normal ingress, identity provider, TLS policy, and audit controls.
```bash
k -n argo port-forward svc/argo-server 2746:2746
```

Open https://127.0.0.1:2746 in a browser while the port-forward is running. Some local browsers will warn about the development certificate; that is expected in a lab, but it is not a production authentication design.
Pause and predict: What would change if ten teams shared this one `argo` namespace? Think about workflow names, service accounts, secrets, resource quotas, artifact locations, and log visibility before you answer. The install command succeeds either way, but the governance model changes from “my lab controller” to “shared execution service.”
A useful first verification is to check the custom resource definitions and controller pod. This confirms that Kubernetes knows the Workflow kind and that the controller is present to reconcile submitted runs.
```bash
k get crd | grep workflows.argoproj.io
k -n argo get pods
k -n argo get deploy
```

If the controller pod is not ready, do not start by editing workflow YAML. First inspect the controller deployment, events, and logs. A broken controller means no workflow will reconcile correctly, no matter how valid the workflow specification is.
```bash
k -n argo describe deploy workflow-controller
k -n argo logs deploy/workflow-controller
k -n argo get events --sort-by=.lastTimestamp
```

The installation step teaches a pattern you will use repeatedly: separate platform health from workflow health. If the controller, CRDs, RBAC, or artifact repository are broken, workflow authors will see confusing task failures that are not caused by their code. A senior operator proves the substrate is healthy before debugging the pipeline itself.
From a Single Container to a Graph
The smallest Argo Workflow runs one container. This example looks almost too simple, but it reveals the core fields. generateName asks Kubernetes to create a unique name for each run, entrypoint tells Argo which template starts the workflow, and the hello template defines the container that becomes a pod.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
  namespace: argo
spec:
  entrypoint: hello
  templates:
    - name: hello
      container:
        image: alpine:3.20
        command: [echo]
        args: ["Hello, Argo Workflows!"]
```

Save that as hello-workflow.yaml, then submit and watch it. The --watch flag is helpful while learning because it shows phase transitions immediately instead of making you poll.
```bash
argo submit -n argo hello-workflow.yaml --watch
argo get -n argo @latest
argo logs -n argo @latest
```

The first important design choice is whether you need steps or dag. A steps template is stage-oriented: each outer list item is a stage, and all inner items in the same stage can run in parallel. A dag template is dependency-oriented: each task declares what it depends on, and Argo runs tasks as soon as their dependencies are satisfied.
| Shape | Best fit | Why it helps | Common trap |
|---|---|---|---|
| Single container | One-off utility work, smoke tests, simple reports. | Minimal YAML and direct mapping to one pod. | Teams later bolt on many unrelated concerns without refactoring. |
| Steps | Mostly linear flows with occasional parallel work inside a stage. | The visual order matches the reading order. | Deep nesting becomes hard to reason about. |
| DAG | Complex graphs, fan-out/fan-in, independent branches, shared prerequisites. | Dependencies are explicit and Argo can maximize safe parallelism. | Unbounded parallel branches can overload shared clusters. |
A steps template is a good bridge from traditional CI thinking. The syntax is unusual because each stage is a list of steps, and each step in the same stage can run together. In the example below, step1 runs first, step2 runs second, and step3a plus step3b run together after step2.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-example-
  namespace: argo
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        # Stage 1: runs alone.
        - - name: step1
            template: echo
            arguments:
              parameters:
                - name: message
                  value: "Step 1 prepares the workspace"
        # Stage 2: starts after stage 1 completes.
        - - name: step2
            template: echo
            arguments:
              parameters:
                - name: message
                  value: "Step 2 validates the input"
        # Stage 3: both steps can run together after stage 2.
        - - name: step3a
            template: echo
            arguments:
              parameters:
                - name: message
                  value: "Step 3A publishes a report"
          - name: step3b
            template: echo
            arguments:
              parameters:
                - name: message
                  value: "Step 3B sends a notification"

    - name: echo
      inputs:
        parameters:
          - name: message
      container:
        image: alpine:3.20
        command: [echo]
        args: ["{{inputs.parameters.message}}"]
```

The echo template declares an input parameter named message. Each step passes a different value into the same template. This is the first example of the function-call mental model: the template defines the reusable operation, and each task or step invocation supplies the arguments.
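One way to read the nested list is to rewrite it as explicit dependencies, which is the form a DAG template uses. The sketch below is illustrative only: `steps_to_dependencies` is a hypothetical helper, and it assumes the simple rule that each step waits for every step in the stage immediately before it.

```python
# Hypothetical helper: rewrite a steps-style nested list as DAG dependencies.
# The outer list is the stage order; each inner list can run in parallel.
stages = [["step1"], ["step2"], ["step3a", "step3b"]]


def steps_to_dependencies(stages):
    """Each step waits for every step in the stage immediately before it."""
    deps, previous = {}, []
    for stage in stages:
        for name in stage:
            deps[name] = list(previous)
        previous = stage
    return deps


deps = steps_to_dependencies(stages)
for name, needs in deps.items():
    print(name, "waits for", needs or "nothing")
```

Seen this way, step3a and step3b share the same single dependency, which is why they are free to start together.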
Active check: In the `steps` example, predict which pod names will appear first and which ones can appear together. Then run the workflow and compare your prediction with `argo get -n argo @latest`. If your prediction was wrong, look again at the nested list shape under `steps`.
A DAG expresses the same idea with explicit dependencies rather than stage order. The example below models a delivery flow where checkout must complete before lint, test, and security-scan; build waits for all three quality gates; deploy waits for the image build. The graph is clearer than a steps list once there are several branches.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-example-
  namespace: argo
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: checkout
            template: git-clone

          # The three quality gates share one prerequisite and can run in parallel.
          - name: lint
            template: run-lint
            dependencies: [checkout]

          - name: test
            template: run-test
            dependencies: [checkout]

          - name: security-scan
            template: run-trivy
            dependencies: [checkout]

          # Fan-in: build waits for every quality gate.
          - name: build
            template: build-image
            dependencies: [lint, test, security-scan]

          - name: deploy
            template: kubectl-apply
            dependencies: [build]

    - name: git-clone
      container:
        image: alpine/git:2.45.2
        command: [sh, -c]
        args:
          - |
            git clone https://github.com/argoproj/argo-workflows.git /work
            cd /work
            git rev-parse --short HEAD

    - name: run-lint
      container:
        image: alpine:3.20
        command: [sh, -c]
        args: ["echo 'lint would run here'"]

    - name: run-test
      container:
        image: alpine:3.20
        command: [sh, -c]
        args: ["echo 'tests would run here'"]

    - name: run-trivy
      container:
        image: alpine:3.20
        command: [sh, -c]
        args: ["echo 'security scan would run here'"]

    - name: build-image
      container:
        image: alpine:3.20
        command: [sh, -c]
        args: ["echo 'image build would run here'"]

    - name: kubectl-apply
      container:
        image: bitnami/kubectl:1.35
        command: [sh, -c]
        args: ["kubectl version --client=true"]
```

This YAML is intentionally runnable without registry credentials, but the template names still match real delivery work. In production, run-lint might use a language-specific linter image, build-image might use Kaniko or BuildKit, and deploy might apply manifests or notify a GitOps controller. The field design stays the same: templates define operations, tasks define dependency placement, and arguments make each invocation specific.
```text
┌──────────┐
│ checkout │
└────┬─────┘
     │
     ├──────────────► ┌──────┐
     │                │ lint │
     │                └──┬───┘
     │                   │
     ├──────────────► ┌──▼───┐
     │                │ test │
     │                └──┬───┘
     │                   │
     └──────────────► ┌──▼────────────┐
                      │ security-scan │
                      └──────┬────────┘
                             │
                      ┌──────▼──────┐
                      │    build    │
                      └──────┬──────┘
                             │
                      ┌──────▼──────┐
                      │   deploy    │
                      └─────────────┘
```

The diagram makes one limitation visible: a dependency list is a control-flow dependency, not a shared filesystem. The checkout pod does not automatically give /work to the lint pod. Every task runs in its own pod unless you explicitly use artifacts, volumes, or an external service. Many first Argo failures come from assuming a directory created in one task magically exists in the next one.
Parameters and Artifacts: Choosing the Right Data Contract
A workflow needs data contracts between tasks. Argo gives you parameters for small string values and artifacts for files or directories. The rule is simple, but the operational consequences are large: parameters live in workflow status and are convenient for branch names, versions, metric numbers, or JSON snippets; artifacts belong in an artifact repository when they represent real data, model files, reports, or build outputs.
| Data type | Use it for | Avoid it when | Operational concern |
|---|---|---|---|
| Parameter | Short strings, flags, image tags, small JSON arrays for withParam. | The value is large, binary, sensitive, or repeatedly appended. | It appears in workflow metadata and can make status bulky. |
| Artifact | Files, directories, models, reports, extracted data, scan results. | The value is just a small decision or scalar. | You need artifact storage, retention, access control, and cleanup. |
| External store | Databases, object stores, registries, warehouse tables, long-lived datasets. | A workflow run should be fully self-contained for a tiny lab. | You must pass references and credentials safely. |
The parameter example below has one task generate a message and another task print it. The output parameter reads the content of /tmp/message.txt from the producer pod, stores it as a workflow output value, and injects it into the consumer task. This is useful for small computed values such as a version string, selected environment, or validation result.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: params-example-
  namespace: argo
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: repo
        value: https://github.com/example/app.git
      - name: branch
        value: main

  templates:
    - name: main
      dag:
        tasks:
          - name: generate
            template: generate-message

          - name: consume
            template: print-message
            dependencies: [generate]
            arguments:
              parameters:
                - name: msg
                  value: "{{tasks.generate.outputs.parameters.message}}"

    - name: generate-message
      container:
        image: alpine:3.20
        command: [sh, -c]
        args:
          - |
            echo "Generated for {{workflow.parameters.repo}} on branch {{workflow.parameters.branch}}" > /tmp/message.txt
      outputs:
        parameters:
          - name: message
            valueFrom:
              path: /tmp/message.txt

    - name: print-message
      inputs:
        parameters:
          - name: msg
      container:
        image: alpine:3.20
        command: [echo]
        args: ["Received: {{inputs.parameters.msg}}"]
```

Notice the two levels of substitution. {{workflow.parameters.repo}} reads a workflow-level argument, while {{tasks.generate.outputs.parameters.message}} reads an output from a previous task. That difference matters during debugging because a missing workflow parameter is usually a submission or template-reference problem, while a missing task output is usually a producer-task failure or path problem.
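The two substitution levels can be imitated with a small placeholder renderer. This is a simplified sketch, not Argo's template engine: the `render` function and the flat dictionary scopes are assumptions made for illustration, and real Argo resolution also handles expressions, defaults, and validation that are ignored here.

```python
import re

# Matches {{ name }} placeholders; a simplification of Argo's template syntax.
PLACEHOLDER = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


def render(text, scope):
    """Replace each {{name}} with its value and fail loudly on unknown names."""
    def lookup(match):
        key = match.group(1)
        if key not in scope:
            raise KeyError(f"unresolved reference: {key}")
        return scope[key]
    return PLACEHOLDER.sub(lookup, text)


# Level 1: workflow-level arguments are known at submission time.
workflow_scope = {
    "workflow.parameters.repo": "https://github.com/example/app.git",
    "workflow.parameters.branch": "main",
}
message = render(
    "Generated for {{workflow.parameters.repo}} on branch {{workflow.parameters.branch}}",
    workflow_scope,
)

# Level 2: a task output only exists after the producer task has finished.
task_scope = {"tasks.generate.outputs.parameters.message": message}
arg = render("{{tasks.generate.outputs.parameters.message}}", task_scope)
print("Received:", arg)
```

Calling `render` with an empty scope raises immediately, which loosely mirrors the debugging split described above: the first scope is filled at submission, the second only if the producer ran and wrote its output file.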
Pause and predict: If `generate-message` writes to `/tmp/other.txt` but `valueFrom.path` still points at `/tmp/message.txt`, what phase should you expect for `consume`? Explain whether the consumer is broken, the dependency is broken, or the producer output contract is broken.
Artifacts solve a different problem. The producer writes a file or directory, declares it as an output artifact, and the consumer declares an input artifact with a path where Argo should make it available. This keeps file content out of workflow status and gives the platform a place to manage retention, size, and access control.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifacts-example-
  namespace: argo
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: generate
            template: generate-file

          - name: process
            template: process-file
            dependencies: [generate]
            arguments:
              artifacts:
                - name: input
                  from: "{{tasks.generate.outputs.artifacts.output}}"

    - name: generate-file
      container:
        image: alpine:3.20
        command: [sh, -c]
        args:
          - |
            mkdir -p /tmp/out
            echo '{"records":[1,2,3,4,5]}' > /tmp/out/data.json
      outputs:
        artifacts:
          - name: output
            path: /tmp/out/data.json

    - name: process-file
      inputs:
        artifacts:
          - name: input
            path: /tmp/input.json
      container:
        image: alpine:3.20
        command: [sh, -c]
        args:
          - |
            cat /tmp/input.json
            echo "Processed artifact"
```

For a local lab, Argo may use a default artifact mechanism depending on the installation mode. For a production platform, configure an explicit artifact repository such as S3-compatible storage, GCS, or another supported backend. The workflow should not pretend artifacts are invisible; they are a storage system with cost, permissions, lifecycle policy, and failure modes.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: s3-artifacts-
  namespace: argo
spec:
  entrypoint: main
  artifactRepositoryRef:
    configMap: artifact-repositories
    key: default

  templates:
    - name: main
      dag:
        tasks:
          - name: generate
            template: generate-data

          - name: process
            template: process-data
            dependencies: [generate]
            arguments:
              artifacts:
                - name: input
                  from: "{{tasks.generate.outputs.artifacts.output}}"

    - name: generate-data
      container:
        image: python:3.12-alpine
        command: [python, -c]
        args:
          - |
            import json
            data = {"results": [1, 2, 3, 4, 5]}
            with open("/tmp/data.json", "w", encoding="utf-8") as handle:
                json.dump(data, handle)
      outputs:
        artifacts:
          - name: output
            path: /tmp/data.json
            s3:
              key: workflows/{{workflow.name}}/data.json

    - name: process-data
      inputs:
        artifacts:
          - name: input
            path: /tmp/data.json
      container:
        image: python:3.12-alpine
        command: [python, -c]
        args:
          - |
            import json
            with open("/tmp/data.json", encoding="utf-8") as handle:
                data = json.load(handle)
            print(f"Sum: {sum(data['results'])}")
```

This example uses artifactRepositoryRef to make the storage choice visible. The workflow writer should know whether artifacts are stored centrally, encrypted, retained, and accessible to the right teams. A workflow that produces model files or customer-derived reports is making a data governance decision, not just a YAML decision.
A senior design habit is to write the task contract in words before writing YAML. For example: “The extractor produces one compressed JSON Lines file named raw-orders; the transformer consumes raw-orders at /input/orders.jsonl.gz and produces a validation count parameter plus a transformed Parquet artifact.” That sentence prevents ambiguous template names, accidental string outputs, and hidden coupling through image behavior.
Loops, Fan-Out, and Controlled Parallelism
Loops are where Argo starts to feel more powerful than a traditional CI system. withItems expands a task over a static list, while withParam expands a task over a JSON array produced at runtime. The result is fan-out: one logical task definition becomes many node executions.
The static loop below processes three items. Each item becomes the item input parameter for one task execution. Argo can run the expanded tasks concurrently unless dependencies or parallelism limits prevent it.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallel-loop-
  namespace: argo
spec:
  entrypoint: main
  parallelism: 3
  templates:
    - name: main
      dag:
        tasks:
          - name: process
            template: process-item
            arguments:
              parameters:
                - name: item
                  value: "{{item}}"
            withItems:
              - one
              - two
              - three

    - name: process-item
      inputs:
        parameters:
          - name: item
      container:
        image: alpine:3.20
        command: [echo]
        args: ["Processing: {{inputs.parameters.item}}"]
```

The parallelism: 3 field is not decorative. It tells Argo the maximum number of pods from the workflow that may run at once. In a tiny lab that limit may not matter, but in a shared cluster it is the difference between a batch job that coexists with production and a batch job that floods the control plane.
A dynamic loop uses a producer task to generate the list. The producer prints a JSON array as its result, and the fan-out task consumes that JSON with withParam. This pattern is common for nightly data processing because the list of partitions, files, tenants, or experiments may not be known until the workflow starts.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: fan-out-
  namespace: argo
spec:
  entrypoint: main
  parallelism: 4
  templates:
    - name: main
      dag:
        tasks:
          - name: generate-list
            template: generate

          # Fan-out: one node per element of the JSON array printed by generate-list.
          - name: process-items
            template: process-item
            dependencies: [generate-list]
            arguments:
              parameters:
                - name: item
                  value: "{{item}}"
            withParam: "{{tasks.generate-list.outputs.result}}"

          # Fan-in: waits for every expanded process-items node.
          - name: aggregate
            template: aggregate-results
            dependencies: [process-items]

    - name: generate
      script:
        image: python:3.12-alpine
        command: [python]
        source: |
          import json
          items = ["item1", "item2", "item3", "item4", "item5"]
          print(json.dumps(items))

    - name: process-item
      inputs:
        parameters:
          - name: item
      container:
        image: alpine:3.20
        command: [sh, -c]
        args:
          - |
            echo "Processing: {{inputs.parameters.item}}"
            sleep 2

    - name: aggregate-results
      container:
        image: alpine:3.20
        command: [echo]
        args: ["Aggregation runs after every expanded process-items task completes"]
```

Active check: Change `parallelism` from `4` to `2` in the fan-out example and predict the effect on total runtime. The graph shape does not change, but the scheduler behavior does. That distinction is central to operating Argo responsibly.
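After you have made your own prediction, a rough model can check it. The sketch below is an idealized estimate, not a measurement: the hypothetical `makespan` function assumes each expanded task takes exactly the 2 seconds of its `sleep`, starts tasks in list order, and ignores pod scheduling, image pulls, and the `generate-list` and `aggregate` tasks.

```python
import heapq


def makespan(durations, parallelism):
    """Finish time when at most `parallelism` tasks run at once, started in order."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    finish_times = []  # min-heap holding the finish time of each running task
    clock = 0
    for duration in durations:
        if len(finish_times) == parallelism:
            # Every slot is busy: wait until the earliest running task ends.
            clock = heapq.heappop(finish_times)
        heapq.heappush(finish_times, clock + duration)
    return max(finish_times, default=0)


sleeps = [2] * 5  # five expanded tasks, each sleeping 2 seconds
for limit in (4, 2):
    print(f"parallelism={limit}: about {makespan(sleeps, limit)}s of task time")
```

In a real cluster the absolute numbers will be larger because every pod pays startup overhead, but the relative effect of lowering the cap should be similar.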
Large fan-out workflows need more than a single workflow-level cap. You may need namespace ResourceQuota, LimitRange, lower-priority classes for batch workloads, pod garbage collection, workflow TTL cleanup, and batching logic that converts a huge item list into smaller groups. Argo makes it easy to express “do this for every item,” but Kubernetes still has finite control-plane and node capacity.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: controlled-backtest-
  namespace: argo
spec:
  entrypoint: main
  parallelism: 50
  podGC:
    strategy: OnPodCompletion
  ttlStrategy:
    secondsAfterCompletion: 3600

  templates:
    - name: main
      dag:
        tasks:
          - name: generate-batches
            template: create-batches

          - name: process-batch
            template: batch-processor
            dependencies: [generate-batches]
            arguments:
              parameters:
                - name: batch
                  value: "{{item}}"
            withParam: "{{tasks.generate-batches.outputs.result}}"

          - name: aggregate
            template: aggregate-results
            dependencies: [process-batch]

    - name: create-batches
      script:
        image: python:3.12-alpine
        command: [python]
        source: |
          import json
          batches = [
              {"start": index * 100, "end": (index + 1) * 100}
              for index in range(100)
          ]
          print(json.dumps(batches))

    - name: batch-processor
      inputs:
        parameters:
          - name: batch
      container:
        image: python:3.12-alpine
        command: [python, -c]
        args:
          - |
            import json
            batch = json.loads("""{{inputs.parameters.batch}}""")
            print(f"Processing records {batch['start']} through {batch['end']}")
        resources:
          requests:
            cpu: "200m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "1Gi"
      retryStrategy:
        limit: 3
        backoff:
          duration: 30s
          factor: 2

    - name: aggregate-results
      container:
        image: alpine:3.20
        command: [echo]
        args: ["Aggregating all completed batches"]
```

The batching approach is slower than launching every record as a pod, but it is usually more reliable. Each pod has scheduling overhead, status writes, log collection, image pull behavior, and cleanup. If a record takes milliseconds to process, launching one pod per record is usually worse than processing a batch inside each pod.
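The `create-batches` script hard-codes 100 batches of 100 records. A slightly more general version of the same idea is sketched below; `make_batches` is a hypothetical helper, and the record count is an assumed input rather than something the workflow discovers on its own.

```python
import json


def make_batches(total_records, batch_size):
    """Split the half-open range [0, total_records) into fixed-size batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        {"start": start, "end": min(start + batch_size, total_records)}
        for start in range(0, total_records, batch_size)
    ]


batches = make_batches(10_000, 100)
print(len(batches))             # pods launched: one per batch, not one per record
print(json.dumps(batches[:2]))  # the JSON shape a withParam task would iterate over
```

Raising `batch_size` trades pod count against per-pod runtime and memory, so the right value depends on how long one record takes and how much a single pod can hold.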
Reuse with WorkflowTemplates
A WorkflowTemplate is a reusable workflow definition stored in the cluster. It is useful when platform teams want to publish a supported pipeline shape, such as “clone, test, build, scan, publish,” while application teams provide only the repository, branch, image, and environment values. Reuse matters because copy-pasted workflow YAML ages badly: one team gets the new timeout, another team keeps the old risky retry policy, and a third team quietly removes resource limits to make a deadline.
The example below defines a reusable CI pipeline. The workflow-level arguments declare the public contract: callers must provide a repository and image, while branch defaults to main. The graph then passes those values into templates at the point where they are needed.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: ci-pipeline
  namespace: argo
spec:
  arguments:
    parameters:
      - name: repo
      - name: branch
        value: main
      - name: image

  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: clone
            template: git-clone
            arguments:
              parameters:
                - name: repo
                  value: "{{workflow.parameters.repo}}"
                - name: branch
                  value: "{{workflow.parameters.branch}}"

          - name: test
            template: run-tests
            dependencies: [clone]

          - name: build
            template: build-push
            dependencies: [test]
            arguments:
              parameters:
                - name: image
                  value: "{{workflow.parameters.image}}"

    - name: git-clone
      inputs:
        parameters:
          - name: repo
          - name: branch
      container:
        image: alpine/git:2.45.2
        command: [git, clone, "--branch", "{{inputs.parameters.branch}}", "{{inputs.parameters.repo}}", "/work"]

    - name: run-tests
      container:
        image: golang:1.22
        command: [sh, -c]
        args: ["echo 'go test ./... would run here'"]

    - name: build-push
      inputs:
        parameters:
          - name: image
      container:
        image: gcr.io/kaniko-project/executor:v1.23.2-debug
        args: ["--destination={{inputs.parameters.image}}"]
```

A caller can then submit a small Workflow that references the shared template. This makes the run-specific file shorter and moves the platform-owned graph into one controlled object.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: run-ci-
  namespace: argo
spec:
  workflowTemplateRef:
    name: ci-pipeline
  arguments:
    parameters:
      - name: repo
        value: https://github.com/example/app.git
      - name: image
        value: ghcr.io/example/app:latest
```

The hidden risk is versioning. A namespace-scoped WorkflowTemplate or a cluster-scoped ClusterWorkflowTemplate can change underneath callers unless the platform treats it like an API. Senior platform teams document template contracts, publish change notes, test common caller workflows, and provide a migration path when arguments or behavior change.
Active check: Imagine one team needs a longer test timeout and another team needs a different build image. Decide whether those should be template parameters, separate WorkflowTemplates, or team-owned workflow YAML. Your answer should mention how often the value changes, who owns the risk, and whether the change affects the shared contract.
WorkflowTemplates are also a governance boundary. Platform teams can require certain images, resource requests, retries, artifact conventions, labels, and exit handlers in the shared template. Application teams still get flexibility through parameters, but the dangerous defaults are centralized.
Error Handling, Timeouts, and Debugging
Errors in Argo Workflows fall into several categories. Your application can fail, a pod can fail before the application starts, an artifact can fail to upload or download, a dependency can wait forever because an upstream node failed, or the controller can be unable to create more pods because of quota or RBAC. Good error handling starts by deciding which failures are worth retrying and which failures should stop the graph immediately.
| Failure type | Typical signal | Retry decision | Better control |
|---|---|---|---|
| Transient network call | Exit code from application, timeout from dependency, temporary service error. | Retry with backoff if the operation is idempotent. | retryStrategy with bounded attempts and clear logs. |
| Deterministic validation failure | Same bad input fails every attempt. | Do not retry blindly because it wastes cluster capacity. | Fail fast and publish validation artifacts. |
| Resource exhaustion | Pod shows OOMKilled, Evicted, or long Pending. | Retry only after changing resources, placement, or input size. | Requests, limits, batching, quotas, and node capacity checks. |
| Optional branch failure | A report or non-blocking scan fails while core output is still useful. | Continue only if the business rule allows degraded output. | continueOn and explicit downstream handling. |
| Hung task | Pod runs far longer than the expected envelope. | Terminate and retry only when the operation is safe. | activeDeadlineSeconds and external idempotency. |
A retry strategy belongs on a template because the operation knows whether retrying is safe. Pulling a file from object storage is often retryable. Charging a customer, publishing a release, or sending a deployment command may not be retryable unless the operation is idempotent. The YAML cannot know that business rule for you.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-example-
  namespace: argo
spec:
  entrypoint: flaky-task
  templates:
    - name: flaky-task
      retryStrategy:
        limit: 3
        retryPolicy: Always
        backoff:
          duration: 5s
          factor: 2
          maxDuration: 1m
      container:
        image: alpine:3.20
        command: [sh, -c]
        args:
          - |
            value=$((RANDOM % 2))
            echo "Generated value: ${value}"
            exit "${value}"
```

The `limit` prevents infinite retry loops, and `backoff` gives dependent systems time to recover. Without backoff, a workflow can amplify an outage by hammering a dependency that is already failing. Without a limit, a workflow can consume cluster capacity long after the original run has stopped being useful.
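To make the effect of the backoff fields concrete, the following Python sketch estimates the wait before each retry under a simple exponential model. It is an illustration only: the function name `backoff_delays` is hypothetical, and the assumption that `maxDuration` caps the sum of the waits is a simplification, because the controller may measure the retry window differently.

```python
def backoff_delays(duration_s, factor, limit, max_duration_s):
    """Return the wait before each retry under a simple exponential model.

    Assumption: max_duration_s caps the total time spent waiting between
    retries. The real controller may account for the window differently.
    """
    delays = []
    elapsed = 0
    for attempt in range(limit):
        delay = duration_s * (factor ** attempt)
        if elapsed + delay > max_duration_s:
            break  # the retry window would be exceeded before this attempt
        delays.append(delay)
        elapsed += delay
    return delays


if __name__ == "__main__":
    # Values from the retry example: duration 5s, factor 2, limit 3, maxDuration 1m.
    print(backoff_delays(5, 2, 3, 60))
    # A much larger limit is still bounded by the one-minute window.
    print(backoff_delays(5, 2, 10, 60))
```

Under this model both calls produce the same schedule, which is the point: either the attempt limit or the time window can be the binding constraint, and it helps to know which one applies to a given template.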
continueOn is for explicitly optional work. It should not be used as a quick way to make a red workflow green. If a required security scan fails and the workflow deploys anyway, the YAML is now hiding risk from reviewers and operators.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: continue-example-
  namespace: argo
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: optional-report
            template: might-fail
            continueOn:
              failed: true

          - name: required-deploy
            template: must-succeed
            dependencies: [optional-report]

    - name: might-fail
      container:
        image: alpine:3.20
        command: [sh, -c]
        args: ["echo 'non-blocking report failed'; exit 1"]

    - name: must-succeed
      container:
        image: alpine:3.20
        command: [echo]
        args: ["Deploy is allowed only because optional-report is truly optional"]
```

Timeouts protect the cluster and the business process. A task that normally takes five minutes but has been running for two hours is no longer “just slow”; it is consuming resources, delaying downstream work, and confusing operators. `activeDeadlineSeconds` gives the template an upper bound.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: timeout-example-
  namespace: argo
spec:
  entrypoint: limited-task
  templates:
    - name: limited-task
      activeDeadlineSeconds: 300
      container:
        image: alpine:3.20
        command: [sleep, "600"]
```

Pause and predict: The timeout example sleeps for ten minutes but has a five-minute active deadline. Predict the workflow phase, the pod phase, and the kind of operator message you would want in a production runbook. Then explain why a timeout is better than letting the pod run until someone notices.
A practical debugging workflow starts broad and narrows down. First get the workflow graph and node phases, then identify the failed pod, then inspect pod events and logs, then decide whether the failure is code, scheduling, resources, artifacts, or platform configuration.
```bash
# Start broad: list workflows, then inspect the most recent one.
argo list -n argo
argo get -n argo @latest
argo logs -n argo @latest

# Save the full status of one workflow for closer inspection.
WORKFLOW_NAME="$(argo list -n argo -o name | head -n 1)"
argo get -n argo "${WORKFLOW_NAME}" -o yaml > workflow-status.yaml

# Narrow down to pods and events (k is an alias for kubectl).
k -n argo get pods
k -n argo describe pod <pod-name>
k -n argo logs <pod-name> --previous
k -n argo get events --sort-by=.lastTimestamp
```

If a task is Pending, inspect scheduling constraints, quota, resource requests, image pulls, and service accounts. If a task is Error, inspect init containers and artifact handling before assuming the application code failed. If a task is Failed, the main container usually exited non-zero, so logs and exit codes are the first stop.
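The phase-to-first-check mapping described above can be written down as a small lookup, which is sometimes handy in a runbook script. This is only a sketch: the names `FIRST_CHECKS` and `first_checks` are hypothetical, and the lists simply restate the guidance in the preceding paragraph.

```python
# Where to look first for each node phase, restating the guidance above.
FIRST_CHECKS = {
    "Pending": [
        "scheduling constraints",
        "quota",
        "resource requests",
        "image pulls",
        "service accounts",
    ],
    "Error": [
        "init containers",
        "artifact handling",
    ],
    "Failed": [
        "main container logs",
        "exit codes",
    ],
}


def first_checks(phase):
    """Return the first things to inspect for a node in the given phase."""
    # Fall back to the broad view when the phase has no specific guidance.
    return FIRST_CHECKS.get(phase, ["workflow graph and node phases"])


if __name__ == "__main__":
    for phase in ("Pending", "Error", "Failed", "Running"):
        print(f"{phase}: {', '.join(first_checks(phase))}")
```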
The most common debugging mistake is reading the workflow YAML only. The YAML tells you intended behavior, but Kubernetes status tells you observed behavior. An Argo operator needs both: the graph explains why a task should run, and pod status explains why it did or did not run.
Argo Events and External Triggers
Argo Workflows can be submitted manually, scheduled by CronWorkflow, created by another controller, or triggered through Argo Events. Event integration is useful when the workflow is naturally caused by something outside the cluster: a Git push, an object appearing in storage, a message on a queue, or a webhook from a data platform.
The example below shows a Sensor creating a Workflow from a GitHub push event. The important idea is not the exact GitHub fields; it is the mapping from event payload into workflow parameters. A trigger should create a workflow with explicit inputs, not a workflow that reaches back into an opaque event blob at every task.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: github-sensor
  namespace: argo
spec:
  template:
    serviceAccountName: argo-events-sa
  dependencies:
    - name: github
      eventSourceName: github
      eventName: push
  triggers:
    - template:
        name: workflow-trigger
        k8s:
          operation: create
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: github-triggered-
                namespace: argo
              spec:
                workflowTemplateRef:
                  name: ci-pipeline
                arguments:
                  parameters:
                    - name: repo
                      value: ""
                    - name: branch
                      value: ""
          parameters:
            - src:
                dependencyName: github
                dataKey: body.repository.clone_url
              dest: spec.arguments.parameters.0.value
            - src:
                dependencyName: github
                dataKey: body.ref
              dest: spec.arguments.parameters.1.value
```

Event-driven workflows need deduplication, authorization, and rate limits. A webhook endpoint that blindly creates workflows can become a denial-of-service path into your cluster. Treat event sources like public APIs: authenticate them, validate payloads, limit the trigger rate, and give the resulting workflow a service account with only the permissions it needs.
Active check: A storage bucket sends one event for every uploaded object, and a partner accidentally uploads several thousand small files in a minute. Decide where you would add protection: event source, sensor, workflow `parallelism`, namespace quota, batching logic, or all of them. Explain which layer fails open if the previous layer misses the spike.
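One way to reason about the batching layer in that scenario is to picture the step that sits between event delivery and workflow creation. The Python sketch below deduplicates object keys and groups them into fixed-size batches, so several thousand uploads become a handful of workflow submissions. The function names, the object-key format, and the batch size of 500 are all hypothetical; this is not an Argo Events feature, just an illustration of logic a team might place in front of the trigger.

```python
from itertools import islice


def dedupe(keys):
    """Yield each object key once, preserving first-seen order."""
    seen = set()
    for key in keys:
        if key not in seen:
            seen.add(key)
            yield key


def batches(keys, size):
    """Group keys into lists of at most `size` items."""
    iterator = iter(keys)
    while chunk := list(islice(iterator, size)):
        yield chunk


if __name__ == "__main__":
    # 4,500 events for 3,000 distinct objects (some uploads were re-sent).
    events = [f"uploads/file-{index % 3000}.csv" for index in range(4500)]
    work = list(batches(dedupe(events), 500))
    print(f"{len(events)} events -> {len(work)} workflow submissions")
```

Each batch would then become one workflow parameter, which keeps the number of workflows proportional to the amount of work instead of the number of events.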
Worked Example: Designing a Nightly ETL Workflow
A useful worked example is a nightly ETL pipeline. The pipeline extracts from three databases in parallel, transforms each dataset, loads the warehouse only after all transforms succeed, runs several quality checks, and sends a notification regardless of outcome. This shape teaches most Argo design choices without becoming a toy.
```text
┌────────────────┐      ┌─────────────────┐
│ extract-users  │─────►│ transform-users │─────┐
└────────────────┘      └─────────────────┘     │
                                                │
┌────────────────┐      ┌─────────────────┐     ▼
│ extract-orders │─────►│ transform-orders│──► load warehouse
└────────────────┘      └─────────────────┘     │
                                                │
┌──────────────────┐    ┌───────────────────┐   ▼
│ extract-products │───►│ transform-products│── quality checks
└──────────────────┘    └───────────────────┘   │
                                                ▼
                                         exit notification
```

The graph has two kinds of parallelism. Extracts can run together because the source systems are independent, and quality checks can run together because they read the loaded warehouse state. The load task is a fan-in point because it should not run until all transformed artifacts exist.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: nightly-etl-
  namespace: argo
spec:
  entrypoint: main
  parallelism: 12
  activeDeadlineSeconds: 14400
  onExit: exit-handler
  arguments:
    parameters:
      - name: run-date
        value: "2026-04-26"

  templates:
    - name: main
      dag:
        tasks:
          - name: extract-users
            template: extract
            arguments:
              parameters:
                - name: source
                  value: users

          - name: extract-orders
            template: extract
            arguments:
              parameters:
                - name: source
                  value: orders

          - name: extract-products
            template: extract
            arguments:
              parameters:
                - name: source
                  value: products

          - name: transform-users
            template: transform
            dependencies: [extract-users]
            arguments:
              parameters:
                - name: dataset
                  value: users
              artifacts:
                - name: raw-data
                  from: "{{tasks.extract-users.outputs.artifacts.data}}"

          - name: transform-orders
            template: transform
            dependencies: [extract-orders]
            arguments:
              parameters:
                - name: dataset
                  value: orders
              artifacts:
                - name: raw-data
                  from: "{{tasks.extract-orders.outputs.artifacts.data}}"

          - name: transform-products
            template: transform
            dependencies: [extract-products]
            arguments:
              parameters:
                - name: dataset
                  value: products
              artifacts:
                - name: raw-data
                  from: "{{tasks.extract-products.outputs.artifacts.data}}"

          - name: load-warehouse
            template: load
            dependencies: [transform-users, transform-orders, transform-products]
            arguments:
              artifacts:
                - name: users
                  from: "{{tasks.transform-users.outputs.artifacts.transformed}}"
                - name: orders
                  from: "{{tasks.transform-orders.outputs.artifacts.transformed}}"
                - name: products
                  from: "{{tasks.transform-products.outputs.artifacts.transformed}}"

          - name: quality-checks
            template: run-quality-check
            dependencies: [load-warehouse]
            arguments:
              parameters:
                - name: check
                  value: "{{item}}"
            withItems:
              - row-count
              - null-check
              - duplicate-check
              - freshness
              - business-rules

    - name: extract
      inputs:
        parameters:
          - name: source
      retryStrategy:
        limit: 3
        backoff:
          duration: 60s
          factor: 2
      activeDeadlineSeconds: 1800
      container:
        image: python:3.12-alpine
        command: [python, -c]
        args:
          - |
            import json
            import pathlib
            source = "{{inputs.parameters.source}}"
            pathlib.Path("/tmp/out").mkdir(parents=True, exist_ok=True)
            with open("/tmp/out/data.json", "w", encoding="utf-8") as handle:
                json.dump({"source": source, "records": [1, 2, 3]}, handle)
            print(f"extracted {source}")
      outputs:
        artifacts:
          - name: data
            path: /tmp/out/data.json

    - name: transform
      inputs:
        parameters:
          - name: dataset
        artifacts:
          - name: raw-data
            path: /tmp/raw.json
      container:
        image: python:3.12-alpine
        command: [python, -c]
        args:
          - |
            import json
            dataset = "{{inputs.parameters.dataset}}"
            with open("/tmp/raw.json", encoding="utf-8") as handle:
                payload = json.load(handle)
            payload["dataset"] = dataset
            payload["validated"] = True
            with open("/tmp/transformed.json", "w", encoding="utf-8") as handle:
                json.dump(payload, handle)
            print(f"transformed {dataset}")
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
      outputs:
        artifacts:
          - name: transformed
            path: /tmp/transformed.json

    - name: load
      inputs:
        artifacts:
          - name: users
            path: /tmp/data/users.json
          - name: orders
            path: /tmp/data/orders.json
          - name: products
            path: /tmp/data/products.json
      retryStrategy:
        limit: 2
      container:
        image: alpine:3.20
        command: [sh, -c]
        args:
          - |
            ls -l /tmp/data
            echo "loading warehouse for {{workflow.parameters.run-date}}"

    - name: run-quality-check
      inputs:
        parameters:
          - name: check
      container:
        image: alpine:3.20
        command: [sh, -c]
        args:
          - |
            echo "running quality check {{inputs.parameters.check}}"
            sleep 1

    - name: exit-handler
      steps:
        - - name: notify-status
            template: notify

    - name: notify
      container:
        image: alpine:3.20
        command: [echo]
        args: ["Nightly ETL finished with status {{workflow.status}}"]
```

The field choices are as important as the graph. `parallelism: 12` prevents a quality-check expansion from consuming unlimited cluster capacity. Extract tasks have retries because source connections can fail transiently. Transform tasks have resource requests and limits because data shape can change. `onExit` provides a single notification path for success, failure, or error, which prevents silent failed runs.
This worked example is intentionally complete before the exercise asks you to build a similar workflow. You have seen the design sequence: draw dependencies, choose parameters versus artifacts, place retries only where they are safe, bound parallelism, add timeouts, and provide an exit path. The exercise later changes the scenario so you must apply the same reasoning rather than copy the same YAML.
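The "draw dependencies" step can also be checked mechanically. The Python sketch below encodes the ETL graph as a dependency map and uses the standard-library `graphlib` module to group tasks into waves of work that could run together. Two things are assumptions made for illustration: the expanded names such as `quality-checks(row-count)` are labels chosen here, not Argo node names, and the wave model is a simplification, because the controller starts each task as soon as its own dependencies finish and does not wait for a whole wave.

```python
from graphlib import TopologicalSorter

SOURCES = ["users", "orders", "products"]
CHECKS = ["row-count", "null-check", "duplicate-check", "freshness", "business-rules"]

# Map each task to the set of tasks it depends on.
ETL_DAG = {}
for source in SOURCES:
    ETL_DAG[f"extract-{source}"] = set()
    ETL_DAG[f"transform-{source}"] = {f"extract-{source}"}
ETL_DAG["load-warehouse"] = {f"transform-{source}" for source in SOURCES}
for check in CHECKS:
    ETL_DAG[f"quality-checks({check})"] = {"load-warehouse"}


def waves(graph):
    """Group tasks into waves where every task's dependencies are already done."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


if __name__ == "__main__":
    for number, wave in enumerate(waves(ETL_DAG), start=1):
        print(f"wave {number}: {wave}")
    widest = max(len(wave) for wave in waves(ETL_DAG))
    print(f"widest wave: {widest} tasks (workflow parallelism is 12)")
```

In this model the widest wave is the quality-check fan-out, which is why the workflow-level cap matters most if the check list grows.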
War Story: Uncontrolled Parallelism Can Overload a Cluster
A financial analytics team once used Argo Workflows to backtest many strategy variants. The first prototype worked beautifully with a few dozen inputs, so the team scaled the input list without changing the workflow-level controls. The workflow did exactly what it was told to do: it tried to create an enormous number of pods quickly.
```text
THE 10,000 POD INCIDENT TIMELINE
────────────────────────────────────────────────────────────────────────
FRIDAY, 2:00 PM   Workflow submitted: 10,000 pods requested
FRIDAY, 2:01 PM   1,000 pods scheduled immediately
FRIDAY, 2:02 PM   Kubernetes API server latency spikes to 30s
FRIDAY, 2:05 PM   etcd disk I/O at 100%, cluster becoming unresponsive
FRIDAY, 2:08 PM   Production trading systems experiencing timeouts
FRIDAY, 2:10 PM   kubectl commands failing: "context deadline exceeded"
FRIDAY, 2:15 PM   INCIDENT DECLARED: Production trading halted
FRIDAY, 2:30 PM   Workflow manually terminated while kubectl was still failing
FRIDAY, 2:45 PM   Cluster recovery begins
FRIDAY, 4:00 PM   Normal operations restored
FRIDAY, 4:30 PM   Post-incident analysis starts

IMPACT ASSESSMENT
────────────────────────────────────────────────────────────────────────
Trading halt duration:          105 minutes
Missed trade opportunities:     ~$890,000
Emergency incident response:    12 people × 4 hours = $24,000
Cluster recovery costs:         $15,000
Reputation with trading desk:   Severe damage
Compliance review triggered:    $50,000 in documentation
Weekend remediation work:       $45,000

TOTAL INCIDENT COST: ~$1,024,000
Add: Lost productivity for 2 weeks while implementing fixes: ~$200,000

TOTAL COST OF UNCONTROLLED PARALLELISM: ~$1,224,000
```

The root cause was not that Argo was unreliable. The root cause was that the workflow treated cluster capacity as infinite. Every pod creation created API server work, scheduler work, etcd writes, admission checks, secret lookups, and status updates. The workflow became an accidental load test of the control plane.
```text
ROOT CAUSE ANALYSIS
────────────────────────────────────────────────────────────────────────
1. Argo Workflow Controller attempted to create 10,000 pods rapidly
   └── API server received a burst of pod creation requests

2. Each pod creation triggered several Kubernetes operations
   ├── Pod object write to etcd
   ├── Secret and service account lookups
   ├── Scheduling decisions
   ├── Volume and image pull coordination
   └── Status updates from kubelet and controllers

3. etcd write-ahead log could not keep up
   └── Latency spike caused timeouts across unrelated cluster operations

4. Kubernetes controllers started missing normal reconciliation windows
   └── Production workloads could not scale or report health consistently

5. Trading systems lost supporting service connectivity
   └── Failsafe triggered and automated trading halted
```

The fix was defense in depth. The team added workflow-level parallelism, batched inputs, cleaned completed pods quickly, set TTL cleanup, applied namespace quotas, added LimitRanges, and moved batch workflows to lower scheduling priority. No single control was trusted as the only safeguard.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: backtest-all-strategies-v2
  namespace: argo
spec:
  entrypoint: main
  parallelism: 50
  podGC:
    strategy: OnPodCompletion
  ttlStrategy:
    secondsAfterCompletion: 3600
  templates:
    - name: main
      dag:
        tasks:
          - name: generate-batches
            template: create-batches

          - name: process-batch
            template: batch-processor
            dependencies: [generate-batches]
            arguments:
              parameters:
                - name: batch
                  value: "{{item}}"
            withParam: "{{tasks.generate-batches.outputs.result}}"

          - name: aggregate
            template: aggregate-results
            dependencies: [process-batch]

    - name: create-batches
      script:
        image: python:3.12-alpine
        command: [python]
        source: |
          import json
          batches = [
              {"start": index * 100, "end": (index + 1) * 100}
              for index in range(100)
          ]
          print(json.dumps(batches))

    - name: batch-processor
      inputs:
        parameters:
          - name: batch
      container:
        image: python:3.12-alpine
        command: [python, -c]
        args:
          - |
            import json
            batch = json.loads("""{{inputs.parameters.batch}}""")
            print(f"running batch {batch['start']} to {batch['end']}")
        resources:
          requests:
            cpu: "200m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
      retryStrategy:
        limit: 3
        backoff:
          duration: 30s
          factor: 2

    - name: aggregate-results
      container:
        image: python:3.12-alpine
        command: [python, -c]
        args: ["print('aggregate results')"]
```

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: argo-workflows-quota
  namespace: argo
spec:
  hard:
    pods: "100"
    requests.cpu: "50"
    requests.memory: "100Gi"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: argo-workflows-limits
  namespace: argo
spec:
  limits:
    # default and defaultRequest are only accepted for type: Container.
    - type: Container
      max:
        cpu: "4"
        memory: "8Gi"
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "256Mi"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-workflow
value: -100
globalDefault: false
description: "Low priority for batch data processing"
```

```text
BEFORE VS AFTER
────────────────────────────────────────────────────────────────────────
                        Before          After
Max concurrent pods:    10,000          50
Workflow duration:      FAILED          4 hours
API server latency:     30+ seconds     < 100ms
Production impact:      INCIDENT        None
Success rate:           0%              99.8%
Cluster stability:      Compromised     Stable
```

The lesson is not “avoid Argo for large work.” The lesson is that a workflow engine makes large work easy to describe, so the platform must make safe execution easy as well. Parallelism is a dial, batching is a design tool, and Kubernetes control-plane capacity is a shared dependency.
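A quick capacity estimate shows how the dial and the quota interact. The Python sketch below uses the numbers from the remediated workflow (100 batches of 100 items, `parallelism: 50`, and a 200m CPU and 512Mi memory request per batch pod) and compares the peak against the namespace quota. The function name `plan` is hypothetical, and the estimate counts only the batch-processor pods, so it ignores the generator, aggregator, and any other workloads in the namespace.

```python
import math


def plan(total_items, batch_size, parallelism, cpu_request_millis, memory_request_mib):
    """Estimate pod count and peak resource requests for a batched fan-out."""
    pods = math.ceil(total_items / batch_size)
    concurrent = min(pods, parallelism)  # workflow parallelism caps running pods
    return {
        "pods": pods,
        "max_concurrent": concurrent,
        "peak_cpu_cores": concurrent * cpu_request_millis / 1000,
        "peak_memory_gib": concurrent * memory_request_mib / 1024,
    }


if __name__ == "__main__":
    estimate = plan(
        total_items=10_000,
        batch_size=100,
        parallelism=50,
        cpu_request_millis=200,
        memory_request_mib=512,
    )
    # Quota values from the ResourceQuota example.
    quota = {"pods": 100, "cpu_cores": 50, "memory_gib": 100}
    print(estimate)
    print("fits quota:", (
        estimate["max_concurrent"] <= quota["pods"]
        and estimate["peak_cpu_cores"] <= quota["cpu_cores"]
        and estimate["peak_memory_gib"] <= quota["memory_gib"]
    ))
```

Rerunning the estimate with `batch_size=1` and a very large `parallelism` reproduces the shape of the original incident: thousands of concurrent pods and requests far beyond the quota.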
Choosing Argo Workflows, Tekton, or Airflow
Tool choice should follow workload shape. Tekton is often a better fit for standardized CI/CD tasks because its model and ecosystem center on pipeline runs, catalog tasks, and supply-chain integrations. Airflow remains common for scheduled data workflows with rich operator ecosystems and external-system orchestration. Argo Workflows is strongest when Kubernetes-native execution and graph-shaped parallel work are the core requirement.
| Scenario | Stronger default | Why that choice usually wins | When to reconsider |
|---|---|---|---|
| Standard clone-test-build-publish for many services | Tekton | CI/CD primitives, catalog tasks, and supply-chain tooling match the job. | Use Argo if the pipeline becomes graph-heavy or batch-oriented. |
| ML training with many parameter combinations | Argo Workflows | Dynamic fan-out, artifact passing, pod-native execution, and long-running jobs fit naturally. | Use Airflow if most work happens in external managed services. |
| Nightly warehouse orchestration across many SaaS APIs | Airflow | Mature scheduling and provider ecosystem can reduce custom glue. | Use Argo if each stage is containerized and Kubernetes is the execution plane. |
| Event-driven image or document processing | Argo Workflows plus Argo Events | Event payloads can create bounded Kubernetes workflows with clear graph status. | Add a queue if events can arrive faster than the cluster should process them. |
| GitOps deployment after image publication | Tekton or Argo CD integration | Deployment may be better represented as a GitOps reconciliation than a custom pod. | Use Argo Workflows for pre-deploy batch validation or complex promotion gates. |
The wrong tool is often the one chosen because a team already likes it, not because the workload fits it. Argo can run CI/CD, but that does not make every CI/CD flow an Argo problem. Tekton can express graphs, but that does not make it the best choice for a giant simulation fan-out. Airflow can trigger Kubernetes jobs, but that does not mean Kubernetes-native artifact and pod lifecycle are first-class in every Airflow deployment.
A practical evaluation question is: where should the state live? If the most important state is Kubernetes pod execution, artifacts from containers, and graph node status, Argo is a strong candidate. If the most important state is a chain of standardized delivery tasks, Tekton may be cleaner. If the most important state is a schedule of external system operations, Airflow may be easier to operate and staff.
Did You Know?
- Argo Workflows runs each workflow step as a Kubernetes pod, which means normal cluster controls such as service accounts, resource limits, node selection, and pod events remain central to debugging.
- WorkflowTemplates are API contracts as much as reusable YAML, because changing an argument name, default value, retry policy, or artifact convention can break teams that submit against the template.
- Large workflows usually fail operationally before they fail syntactically, because API server throughput, scheduler capacity, artifact storage, and namespace quota become the limiting systems.
- The Argo project includes Workflows, CD, Events, and Rollouts, and many production platforms combine them rather than expecting one tool to cover every delivery concern.
Common Mistakes
| Mistake | Why it causes trouble | Better approach |
|---|---|---|
| Treating a dependency as shared storage | A downstream pod waits for the upstream pod to finish, but it does not inherit that pod’s filesystem. | Pass small values as parameters and files or directories as artifacts. |
| Launching one pod per tiny record | Pod startup, scheduling, status, and cleanup overhead can dominate the actual work. | Batch small records inside each pod and cap workflow parallelism. |
| Retrying every failure automatically | Deterministic validation errors waste capacity and can hide the real problem. | Retry only idempotent transient operations and fail fast on bad input. |
| Using `continueOn` to hide required failures | The workflow may turn red business risk into green automation status. | Reserve `continueOn` for explicitly optional work with clear downstream handling. |
| Leaving resource requests out of templates | The scheduler cannot make good placement decisions and nodes can be overcommitted. | Set requests and limits that match observed memory and CPU behavior. |
| Copy-pasting WorkflowTemplates into every repository | Fixes and policy improvements drift because every copy becomes its own fork. | Publish supported templates and keep caller workflows small. |
| Debugging only with `argo get` | Argo status shows graph state, but pod events reveal scheduling, image, quota, and runtime failures. | Combine `argo get`, `argo logs`, `k describe pod`, and namespace events. |
| Exposing event triggers without rate and identity controls | External spikes or unauthorized payloads can create uncontrolled workflow load. | Authenticate events, validate payloads, batch inputs, and enforce namespace quotas. |
Question 1
Your team converts a nightly data job into an Argo Workflow. The first task clones a repository into /work, and the second task fails because /work does not exist. The dependency is configured correctly and the first task succeeded. What should you change, and why?
Show Answer
The dependency only controls execution order; it does not share the producer pod’s filesystem with the consumer pod. The correct fix is to declare the cloned repository, generated files, or build context as an output artifact from the first task and pass it as an input artifact to the second task. If the downstream task only needs a commit SHA or image tag, pass that small value as a parameter instead.
A senior answer also checks whether cloning should happen in every task, whether a shared volume is appropriate for the cluster and workflow shape, or whether a build system should consume the repository directly. The key is to make the data contract explicit rather than relying on pod-local paths.
Question 2
A workflow processes a runtime-generated list of customer partitions. The generator task prints a JSON array, and the processor task uses withParam. During a launch, the list grows from dozens of partitions to several thousand. The workflow is valid but starts harming other workloads. Which controls would you add first?
Show Answer
Add workflow-level parallelism, consider batching partitions inside each pod, and enforce namespace ResourceQuota plus LimitRange so the workflow cannot exceed agreed capacity. If the input spike is expected to recur, add rate control or batching before workflow creation as well. withParam is the right feature for dynamic fan-out, but it does not automatically make the fan-out safe.
The strongest answer separates graph correctness from operational safety. The graph may correctly express “process every partition,” while the platform still needs concurrency limits, resource requests, priority, cleanup, and observability. Safe throughput is an engineering decision, not a YAML default.
Question 3
A security scan task sometimes fails because the scanner service returns a temporary network error. Another task fails when a policy violation is found. A teammate proposes the same retryStrategy for both tasks. How would you evaluate that proposal?
Show Answer
The temporary network error is a good candidate for bounded retries with backoff because a later attempt may succeed and the scan operation is usually safe to repeat. The policy violation should not be retried blindly because the same input is likely to fail again, and repeated attempts waste cluster capacity while delaying feedback. The policy task should fail clearly and publish enough logs or artifacts for the owner to fix the violation.
The decision depends on idempotency and failure class. Retry transient infrastructure or dependency failures. Do not retry deterministic validation failures unless the task itself can change the condition it is validating.
Question 4
Your platform team publishes a WorkflowTemplate for application CI. One product team asks to remove resource limits because their tests occasionally hit memory ceilings. Another team asks to increase the test timeout for integration tests. Which change belongs in the shared template, and which might belong in parameters or a separate template?
Show Answer
Removing resource limits from the shared template is usually a platform risk and should not be accepted as a one-team convenience. The better response is to measure the test workload, set realistic requests and limits, or offer a controlled parameter with approved bounds if teams have legitimate size differences. A longer timeout may be a template parameter if it is common and safe within a maximum, or a separate template if it represents a distinct class of workflow such as long integration testing.
The evaluation should consider ownership, blast radius, and contract stability. Shared templates encode safe defaults for many users, so changes that weaken guardrails require stronger justification than changes that expose controlled configuration.
Question 5
A workflow has an optional report-generation task with continueOn.failed: true. A downstream notification says the release is ready even when the report failed. The business now treats the report as required for regulated releases but optional for internal previews. How should the workflow design change?
Show Answer
The workflow should make the business rule explicit instead of using one unconditional continueOn. One design is to add a workflow parameter such as release-type, run the report task, and gate the downstream release notification differently for regulated releases and previews. For regulated releases, report failure should block readiness; for previews, the notification should clearly state that the optional report failed.
The important point is that optionality is not a technical convenience. It is a policy decision. The workflow should encode the policy clearly enough that reviewers and operators can see when a failed task is acceptable.
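The policy in this answer is small enough to state as a function, which can help reviewers agree on the rule before it is encoded in workflow conditions. The Python sketch below is only an illustration: the function name `release_ready`, the `release_type` values, and the message wording are hypothetical and are not taken from Argo.

```python
def release_ready(release_type, report_succeeded):
    """Return (ready, message) for the release notification.

    Assumed policy: the report is required for regulated releases and
    optional, but still disclosed, for internal previews.
    """
    if release_type == "regulated":
        if report_succeeded:
            return True, "release ready; report attached"
        return False, "release blocked; required report failed"
    if release_type == "preview":
        if report_succeeded:
            return True, "preview ready; report attached"
        return True, "preview ready; optional report failed"
    # Unknown release types are rejected instead of silently treated as optional.
    raise ValueError(f"unknown release type: {release_type!r}")


if __name__ == "__main__":
    for release_type in ("regulated", "preview"):
        for report_succeeded in (True, False):
            print(release_type, report_succeeded, release_ready(release_type, report_succeeded))
```

Rejecting unknown release types is a deliberate choice in this sketch, because a typo in the parameter should not quietly downgrade a regulated release to preview rules.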
Question 6
A model-training workflow fails with OOMKilled during one fan-out branch. The same template succeeds for smaller input partitions. What investigation steps would you take, and what design changes might fix the issue?
Show Answer
Start with argo get to identify the failed node, then inspect the pod with k describe pod, check logs including previous logs if the container restarted, and compare the failed partition size with successful partitions. Look at resource requests and limits, node memory pressure, and whether the task accumulates large artifacts in memory. The likely causes are an undersized memory limit, an unusually large partition, application memory growth, or a batching strategy that creates uneven chunks.
Fixes include increasing memory limits based on measurements, splitting large partitions into smaller batches, changing the application to stream data, setting clearer input-size limits, or routing memory-heavy tasks to appropriate nodes. Retrying the same pod without changing the resource or input condition is unlikely to help.
Question 7
A team wants to use Argo Workflows for every pipeline because they already operate Argo CD. They have three workloads: simple microservice CI, hyperparameter training with many parallel runs, and scheduled SaaS-to-warehouse extraction. How would you recommend tools?
Show Answer
For simple microservice CI, Tekton or another CI-focused system may be cleaner because standardized clone, test, build, and publish flows match its strengths. For hyperparameter training with many parallel runs, Argo Workflows is a strong fit because dynamic fan-out, artifacts, and Kubernetes pod execution are central. For scheduled SaaS-to-warehouse extraction, Airflow may be the better default if provider integrations and external orchestration dominate, while Argo is reasonable if each stage is already containerized and Kubernetes-native execution is the main requirement.
The recommendation should be based on workload shape, state location, operator skill, and integration needs. Sharing the Argo name with Argo CD is not enough reason to force every pipeline into Argo Workflows.
Question 8
A GitHub webhook triggers an Argo Events Sensor, and the Sensor creates a Workflow for each push. A misconfigured repository starts sending repeated push events. The cluster remains healthy, but the workflow namespace fills with pending runs. What should you change?
Show Answer
Add protection at multiple layers: authenticate and validate the webhook, deduplicate or rate-limit events before workflow creation, set workflow parallelism, enforce namespace quota, and add TTL cleanup for completed workflows. If pending runs are not useful after a newer event arrives, add cancellation or superseding logic so stale runs do not consume queue space.
The answer should not rely on only one control. Event-driven systems need upstream filtering and downstream resource protection because either layer can fail or be bypassed.
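The upstream half of that answer, deduplication plus rate limiting before workflow creation, can be sketched as plain decision logic. This is not the Argo Events API. The class `PushEventGate`, its parameters, and the thresholds are hypothetical, and in a real deployment the same decision might live in the event source, in Sensor filters, or in a small gateway in front of them.

```python
import time


class PushEventGate:
    """Hypothetical filter that decides whether a push event may create a Workflow."""

    def __init__(self, dedupe_window_s=300, max_per_minute=5):
        self.dedupe_window_s = dedupe_window_s
        self.max_per_minute = max_per_minute
        self._seen = {}    # (repo, commit_sha) -> time the event was last accepted
        self._recent = {}  # repo -> accepted timestamps within the last 60 seconds

    def allow(self, repo, commit_sha, now=None):
        now = time.time() if now is None else now
        last = self._seen.get((repo, commit_sha))
        if last is not None and now - last < self.dedupe_window_s:
            return False  # same commit seen recently: treat as a duplicate
        window = [t for t in self._recent.get(repo, []) if now - t < 60]
        self._recent[repo] = window
        if len(window) >= self.max_per_minute:
            return False  # repository exceeded its per-minute budget
        window.append(now)
        self._seen[(repo, commit_sha)] = now
        return True


gate = PushEventGate(max_per_minute=2)
results = [
    gate.allow("repo", "abc", now=0),   # first event for this commit
    gate.allow("repo", "abc", now=1),   # duplicate commit
    gate.allow("repo", "def", now=2),   # new commit, still within budget
    gate.allow("repo", "ghi", now=3),   # new commit, budget exhausted
    gate.allow("repo", "ghi", now=70),  # budget window has passed
]
print(results)
```

A gate like this only covers the upstream side. Workflow parallelism, namespace quota, and TTL cleanup are still needed downstream in case the filter is bypassed or misconfigured.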
Hands-On Exercise
Scenario: Build a Bounded Data Processing Workflow
You are supporting a platform team that needs a small but realistic Argo Workflow for data processing. The workflow must generate a runtime list of work items, process those items with bounded parallelism, aggregate the result, and expose enough status for an operator to debug a failed task. You will first run the healthy version, then intentionally break one template and practice the debugging path.
Step 1: Create a Disposable Cluster and Install Argo
Use a disposable local cluster for this exercise. The commands assume kind is installed and your current shell has access to Docker or another supported local container runtime.
kind create cluster --name argo-lab
alias k=kubectl
k create namespace argo
k apply -n argo -f https://github.com/argoproj/argo-workflows/releases/latest/download/install.yaml
k patch deployment argo-server -n argo --type='json' -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--auth-mode=server"}]'
k -n argo wait --for=condition=ready pod -l app=workflow-controller --timeout=120s
k -n argo port-forward svc/argo-server 2746:2746
Open https://127.0.0.1:2746 while the port-forward runs. This authentication mode is for a disposable lab only. Do not copy it into a shared cluster.
Step 2: Create the Workflow Manifest
Create a file named bounded-data-pipeline.yaml. This workflow uses withParam for dynamic fan-out, parallelism: 3 to limit concurrency, explicit resources for the processing pods, and a final aggregation task that waits for every expanded processing node.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: bounded-data-pipeline-
  namespace: argo
spec:
  entrypoint: main
  parallelism: 3            # at most three pods of this workflow run at once
  podGC:
    # OnPodSuccess keeps failed pods so Step 5 can describe them and read their logs.
    # OnPodCompletion would also delete failed pods as soon as they finish.
    strategy: OnPodSuccess
  ttlStrategy:
    secondsAfterCompletion: 1800   # delete the Workflow object 30 minutes after it completes

  templates:
    - name: main
      dag:
        tasks:
          - name: generate-data
            template: generate

          # Expands into one node per element of the JSON list printed by generate-data.
          - name: process-items
            template: process
            dependencies: [generate-data]
            arguments:
              parameters:
                - name: item
                  value: "{{item}}"
            withParam: "{{tasks.generate-data.outputs.result}}"

          # Runs only after every expanded process-items node has succeeded.
          - name: aggregate
            template: aggregate
            dependencies: [process-items]

    - name: generate
      script:
        image: python:3.12-alpine
        command: [python]
        source: |
          import json
          items = [f"item-{index}" for index in range(8)]
          print(json.dumps(items))

    - name: process
      inputs:
        parameters:
          - name: item
      activeDeadlineSeconds: 60
      retryStrategy:
        limit: 2
        backoff:
          duration: 5s
          factor: 2
      container:
        image: alpine:3.20
        command: [sh, -c]
        args:
          - |
            echo "Processing {{inputs.parameters.item}}"
            sleep 2
            echo "Finished {{inputs.parameters.item}}"
        resources:
          requests:
            cpu: "100m"
            memory: "64Mi"
          limits:
            cpu: "250m"
            memory: "128Mi"

    - name: aggregate
      container:
        image: alpine:3.20
        command: [echo]
        args: ["All bounded processing tasks completed"]
Step 3: Submit and Observe the Healthy Run
Submit the workflow with the Argo CLI and watch it complete. Then inspect the graph, logs, and pods so you connect the workflow view with Kubernetes objects.
argo submit -n argo bounded-data-pipeline.yaml --watch
argo get -n argo @latest
argo logs -n argo @latest
k -n argo get pods
While the run is active, observe that only a small number of processing pods run concurrently. The list contains more items than the parallelism limit, so Argo must queue some expanded nodes until running nodes finish.
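The queueing behavior can be approximated with a short simulation. This is a rough model rather than a description of the controller's scheduling: it assumes every task takes exactly the 2 seconds of the sleep in the process template and ignores pod startup, image pulls, and retries. The function name `simulate` is hypothetical.

```python
import heapq


def simulate(items, parallelism, task_seconds):
    """Return {item: (start, end)} when at most `parallelism` tasks run at once."""
    running = []  # min-heap of end times for tasks that currently hold a slot
    schedule = {}
    clock = 0
    for item in items:
        if len(running) == parallelism:
            # Every slot is busy: wait until the earliest running task finishes.
            clock = heapq.heappop(running)
        end = clock + task_seconds
        heapq.heappush(running, end)
        schedule[item] = (clock, end)
    return schedule


items = [f"item-{index}" for index in range(8)]
schedule = simulate(items, parallelism=3, task_seconds=2)
makespan = max(end for _, end in schedule.values())
for item, (start, end) in schedule.items():
    print(f"{item}: starts at {start}s, ends at {end}s")
print(f"modeled makespan: {makespan}s")
```

Under these assumptions the eight items run in three waves of at most three tasks each. Real timings will be longer and less regular, but the wave shape should be visible in k -n argo get pods while the run is active.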
Step 4: Break the Workflow on Purpose
Edit the process template command so it fails for one item. This turns the exercise into a debugging scenario rather than a copy-and-run lab.
        args:
          - |
            echo "Processing {{inputs.parameters.item}}"
            if [ "{{inputs.parameters.item}}" = "item-3" ]; then
              echo "simulated bad partition"
              exit 1
            fi
            sleep 2
            echo "Finished {{inputs.parameters.item}}"
Submit the modified workflow and inspect the result.
argo submit -n argo bounded-data-pipeline.yaml --watch
argo get -n argo @latest
argo logs -n argo @latest
Step 5: Debug Like an Operator
Use the workflow graph to identify the failed node, then use Kubernetes status to inspect the pod. Your goal is to connect the Argo failure to the container exit and prove that the aggregate task did not run because an upstream expanded task failed.
argo get -n argo @latest
k -n argo get pods
k -n argo describe pod <failed-pod-name>
k -n argo logs <failed-pod-name>
Explain your finding in one short runbook note. A good note says which item failed, what the container printed, whether retries happened, whether the failure is transient or deterministic, and what change would fix the workflow.
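When writing the note, it helps to estimate how much time the retry policy spent on the failing item. The sketch below assumes the backoff delay before retry n is duration multiplied by factor to the power n, counted from zero, which is the usual reading of an exponential backoff with duration: 5s and factor: 2. The function name `retry_delays` is hypothetical, and actual timings also include pod scheduling and startup, so compare the result against what argo get reports.

```python
def retry_delays(limit, duration_seconds, factor):
    """Modeled wait before each retry: duration * factor**n for retry n (0-based)."""
    return [duration_seconds * factor ** attempt for attempt in range(limit)]


# Values taken from the retryStrategy of the process template.
delays = retry_delays(limit=2, duration_seconds=5, factor=2)
total_attempts = 1 + len(delays)  # the first run plus each retry
wasted_wait = sum(delays)         # seconds spent waiting between attempts

print(f"attempts: {total_attempts}, delays: {delays}, total backoff wait: {wasted_wait}s")
```

For the simulated bad partition every attempt fails the same way, so the modeled backoff wait buys nothing. That is the argument for treating deterministic bad input differently from transient errors in the next step.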
Step 6: Improve the Design
Change the workflow so deterministic bad input fails fast but transient processing still has retries. You can keep the simulated failure, but describe why retrying it is wasteful. Then remove the simulated failure and run the workflow successfully again.
argo submit -n argo bounded-data-pipeline.yaml --watch
argo get -n argo @latest
Success Criteria
- Argo Workflows is installed in the argo namespace and the controller pod is ready.
- You can submit a Workflow and inspect it with both argo get and Kubernetes pod commands.
- The workflow uses withParam to process a runtime-generated list of items.
- parallelism: 3 visibly limits concurrent processing pods.
- You can explain why the aggregate task waits for all expanded processing nodes.
- You intentionally caused one processing task to fail and traced the failure from workflow node to pod logs.
- You can explain whether the failure should be retried, fixed in code, or handled as bad input.
- The final workflow succeeds after you remove the simulated failure.
Cleanup
Delete the disposable cluster when you are finished. This removes the Argo installation, Workflow CRDs, pods, and any lab state created inside the cluster.
kind delete cluster --name argo-lab
Next Module
Continue to Security Tools Toolkit where you will evaluate Vault, OPA, Falco, and supply-chain security tools for platform engineering workflows.