Module 1.1: Advanced Argo Workflows
CAPA Track — Domain 1 (36%) | Complexity: Complex | Time: 50-60 min
The platform team at a fintech company had a problem. Their nightly reconciliation workflow ran 14 steps sequentially, took 3 hours, and failed silently twice a week. Nobody knew until morning standup. After migrating to Argo Workflows with exit handlers for Slack alerts, CronWorkflows for scheduling, memoization to skip unchanged steps, and lifecycle hooks for audit logging, the pipeline shrank to 40 minutes. Failures triggered immediate notifications, and transient errors retried automatically. The team went from 12 hours per week of pipeline babysitting to zero.
Prerequisites
- Module 3.3: Argo Workflows — Container, Script, Steps, DAG, Artifacts
- Kubernetes RBAC basics (ServiceAccounts, Roles)
- CronJob scheduling syntax
What You’ll Be Able to Do
After completing this module, you will be able to:
- Construct advanced Argo Workflows using all 7 template types, including Resource templates for direct Kubernetes object manipulation
- Configure CronWorkflows, memoization, and synchronization locks to build scheduled, efficient, and concurrency-safe pipelines
- Implement exit handlers, lifecycle hooks, and retry strategies that make workflows self-healing and auditable
- Apply workflow security best practices: scoped ServiceAccounts, artifact repository encryption, and RBAC for workflow submission
Why This Module Matters
The CAPA exam dedicates 36% to Domain 1, covering Argo Workflows in depth. Module 3.3 taught the fundamentals. This module covers everything else: remaining template types, scheduled workflows, reusable templates, exit handlers, synchronization, memoization, lifecycle hooks, variables, retry strategies, and security.
Did You Know?
- Argo Workflows supports 7 template types — most teams only use 2-3, but the CAPA exam expects all of them
- CronWorkflows are not Kubernetes CronJobs — they are a separate CRD creating Workflow objects on a schedule
- Memoization has a hard 1MB limit per entry — ConfigMap values are capped at 1MB; exceed it and your workflow fails cryptically
- Expression tags use expr-lang, not Go templates — `{{=expression}}` gives you conditionals, math, and string ops inline
Remaining Template Types
Module 3.3 covered Container, Script, Steps, and DAG. Here are the rest.
Resource Template
Performs CRUD on Kubernetes resources directly — no kubectl container needed.
```yaml
- name: create-configmap
  resource:
    action: create            # create | patch | apply | delete | get
    manifest: |
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: output-{{workflow.name}}
      data:
        result: "done"
    successCondition: "status.phase == Active"
    failureCondition: "status.phase == Failed"
```

The `successCondition`/`failureCondition` fields use jsonpath — useful with Jobs or CRDs where you wait for a status field.
Suspend Template
Pauses execution until manually resumed or until a duration elapses. This is how you build approval gates.
```yaml
- name: approval-gate
  suspend: {}          # No duration: wait indefinitely until resumed
- name: timed-pause
  suspend:
    duration: "30m"    # Auto-resume after 30 minutes
```

Resume from the CLI: `argo resume my-workflow -n argo`
HTTP Template
Makes HTTP requests without spinning up a container. Requires the Argo Server.
```yaml
- name: call-webhook
  http:
    url: "https://api.example.com/notify"
    method: POST
    headers:
      - name: Authorization
        valueFrom:
          secretKeyRef: {name: api-creds, key: token}
    body: '{"workflow": "{{workflow.name}}", "status": "{{workflow.status}}"}'
    successCondition: "response.statusCode >= 200 && response.statusCode < 300"
```

ContainerSet Template
Multiple containers in a single pod sharing volumes. Like init containers, but with dependency ordering.
```yaml
- name: multi-container
  containerSet:
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    containers:
      - name: clone
        image: alpine/git
        command: [sh, -c, "git clone https://github.com/org/repo /workspace/repo"]
      - name: build
        image: golang:1.22
        command: [sh, -c, "cd /workspace/repo && go build ./..."]
        dependencies: [clone]
      - name: test
        image: golang:1.22
        command: [sh, -c, "cd /workspace/repo && go test ./..."]
        dependencies: [clone]
  volumes:
    - name: workspace
      emptyDir: {}
```

Key difference from DAG: all containers share one pod — shared filesystem without artifacts, but limited to a single node's resources.
Data and Plugin Templates
The Data template sources data from artifact storage and applies transformations (e.g., filtering S3 files). The Plugin template extends Argo via executor plugins registered on the cluster. Both are less common on exams, but know they exist.
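As a sketch, a Data template can enumerate keys in an artifact repository and filter them with an expr-lang transformation (the bucket and artifact names here are illustrative placeholders):

```yaml
# Illustrative sketch — bucket and artifact names are placeholders
- name: list-csv-files
  data:
    source:
      artifactPaths:          # enumerate object keys in an artifact repository
        name: raw-data
        s3:
          bucket: example-bucket
    transformation:
      - expression: "filter(data, {# endsWith '.csv'})"   # keep only .csv keys
```

The filtered list is emitted as the template's output, ready to fan out over with `withParam`.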
CronWorkflow
CronWorkflows create Workflow objects on a schedule — they are their own CRD, not a wrapper around Kubernetes CronJobs.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: nightly-etl
spec:
  schedule: "0 2 * * *"            # 2 AM daily
  timezone: "America/New_York"     # Default: UTC
  startingDeadlineSeconds: 300     # Skip if missed by >5 min
  concurrencyPolicy: Replace       # Kill previous if still running
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  workflowSpec:
    entrypoint: main
    templates:
      - name: main
        dag:
          tasks:
            - name: extract
              template: run-etl
            - name: load
              template: run-etl
              dependencies: [extract]
      - name: run-etl
        container:
          image: etl-runner:v3
          command: [python, run.py]
```

| Concurrency Policy | Behavior |
|---|---|
| `Allow` | Multiple concurrent runs permitted |
| `Forbid` | Skip the new run if the previous is still active |
| `Replace` | Kill the running workflow, start the new one |
Backfill: CronWorkflows do not backfill missed runs. Manual trigger: `argo submit -n argo --from cronwf/nightly-etl`
WorkflowTemplate and ClusterWorkflowTemplate
WorkflowTemplate is namespace-scoped; ClusterWorkflowTemplate is cluster-scoped (accessible from any namespace).
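As a sketch, a WorkflowTemplate definition is essentially a Workflow spec under a different kind — the template name below matches the reference example that follows; the namespace is illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: build-test-deploy        # referenced by workflowTemplateRef
  namespace: ci                  # namespace-scoped; illustrative namespace
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: image-tag          # default can be overridden at submission
  templates:
    - name: main
      container:
        image: alpine
        command: [sh, -c, "echo building {{workflow.parameters.image-tag}}"]
```

For a ClusterWorkflowTemplate, change the `kind` and drop `metadata.namespace`.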
Reference an entire template as your workflow:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ci-run-
spec:
  workflowTemplateRef:
    name: build-test-deploy    # WorkflowTemplate
    # clusterScope: true       # Add for ClusterWorkflowTemplate
  arguments:
    parameters:
      - name: image-tag
        value: ghcr.io/org/app:v1.2.3
```

Reference individual templates within a DAG using `templateRef`:
```yaml
dag:
  tasks:
    - name: scan
      templateRef:
        name: org-standard-ci
        template: security-scan
        clusterScope: true
      arguments:
        parameters: [{name: image, value: "myapp:latest"}]
```

Templates are resolved at submission time — updating a WorkflowTemplate does not affect running workflows.
Exit Handlers
Run at workflow end regardless of outcome. Specified via `spec.onExit`.
```yaml
spec:
  entrypoint: main
  onExit: exit-handler
  templates:
    - name: main
      container:
        image: alpine
        command: [sh, -c, "echo 'working'"]
    - name: exit-handler
      steps:
        - - name: success-notify
            template: notify
            when: "{{workflow.status}} == Succeeded"
          - name: failure-notify
            template: alert
            when: "{{workflow.status}} != Succeeded"
```

`{{workflow.status}}` resolves to `Succeeded`, `Failed`, or `Error` inside exit handlers.
Synchronization
Section titled “Synchronization”Mutex — exclusive lock, one workflow at a time:
```yaml
spec:
  synchronization:
    mutex:
      name: deploy-production
```

Semaphore — N concurrent holders, backed by a ConfigMap:
```yaml
# ConfigMap: data: { gpu-jobs: "3" }
spec:
  synchronization:
    semaphore:
      configMapKeyRef:
        name: semaphore-config
        key: gpu-jobs
```

Both can be applied at the workflow level or the template level.
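A template-level lock gates a single step instead of the whole workflow. A sketch — the mutex name, image, and command are illustrative:

```yaml
templates:
  - name: migrate-db
    synchronization:
      mutex:
        name: db-migration     # illustrative; only one migrate-db step holds it at a time
    container:
      image: migrate/migrate   # illustrative image
      command: [migrate, -path, /migrations, up]
```

This lets the rest of the workflow run concurrently while serializing just the contended step.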
Memoization
Cache step outputs in a ConfigMap. If the inputs match, skip execution.
```yaml
- name: expensive-step
  memoize:
    key: "{{inputs.parameters.dataset}}-{{inputs.parameters.version}}"
    maxAge: "24h"
    cache:
      configMap:
        name: memo-cache
  inputs:
    parameters: [{name: dataset}, {name: version}]
  container:
    image: processor:v2
    command: [python, process.py]
  outputs:
    parameters:
      - name: result
        valueFrom:
          path: /tmp/result.json
```

Constraints: 1MB limit per entry (ConfigMap cap), only output parameters are cached (not artifacts), and `maxAge: "0"` means an infinite TTL.
Lifecycle Hooks
Execute actions when a template starts or finishes, without modifying the main logic.
```yaml
- name: deploy
  hooks:
    running:
      template: log-start
    exit:
      template: log-completion
      expression: "steps['deploy'].status == 'Failed'"   # Conditional
  container:
    image: bitnami/kubectl
    command: [kubectl, apply, -f, /manifests/]
```

Triggers: `running` (node starts), `exit` (node finishes regardless of outcome).
Variables: Simple Tags vs Expression Tags
Simple tags (`{{...}}`) — plain string substitution:
```yaml
"{{workflow.name}}"
"{{workflow.status}}"
"{{inputs.parameters.my-param}}"
"{{tasks.task-a.outputs.result}}"
```

Expression tags (`{{=...}}`) — evaluate expr-lang expressions:
```yaml
"{{=workflow.status == 'Succeeded' ? 'PASS' : 'FAIL'}}"
"{{=asInt(inputs.parameters.replicas) + 1}}"
"{{=sprig.upper(workflow.name)}}"
```

Use simple tags for references. Use expression tags for conditionals, math, or string manipulation.
Retry Strategies
Section titled “Retry Strategies”- name: call-api retryStrategy: limit: 5 retryPolicy: OnError # See table backoff: duration: 10s # Initial delay factor: 2 # Multiplier per retry maxDuration: 5m # Cap affinity: nodeAntiAffinity: {} # Retry on different node container: image: curlimages/curl command: [curl, -f, "https://api.example.com/process"]| Policy | Retries on… |
|---|---|
Always | Any failure (non-zero exit, OOM, node failure) |
OnFailure | Non-zero exit code only |
OnError | System errors (OOM, node failure), NOT non-zero exit |
OnTransientError | Transient K8s errors only (pod eviction) |
Security
Per-workflow service accounts for least privilege:
```yaml
spec:
  serviceAccountName: argo-deployer    # Workflow-level
  templates:
    - name: build-step
      serviceAccountName: argo-builder # Template-level override
```

Pod security contexts:
```yaml
- name: secure-step
  securityContext:
    runAsUser: 1000
    runAsNonRoot: true
  container:
    image: my-app:v1
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: [ALL]
```

Resource templates need RBAC on the resources they manage — create a Role granting only the required verbs on specific resources, and bind it to the workflow's ServiceAccount.
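A sketch of that least-privilege RBAC — the Role and ServiceAccount names are illustrative, and the verbs are scoped to the ConfigMap CRUD a Resource template might perform:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workflow-configmap-writer   # illustrative name
  namespace: argo
rules:
  - apiGroups: [""]                 # core API group for ConfigMaps
    resources: [configmaps]
    verbs: [create, get]            # only what the Resource template needs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: workflow-configmap-writer
  namespace: argo
subjects:
  - kind: ServiceAccount
    name: argo-deployer             # the workflow's ServiceAccount
    namespace: argo
roleRef:
  kind: Role
  name: workflow-configmap-writer
  apiGroup: rbac.authorization.k8s.io
```

Verify the grant before running the workflow: `kubectl auth can-i create configmaps -n argo --as=system:serviceaccount:argo:argo-deployer`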
Common Mistakes
| Mistake | Why It Hurts | Better Approach |
|---|---|---|
| `Always` retry for logic errors | Bad code retries forever | `OnError` for infra, `OnFailure` for self-healing bugs |
| Memoized outputs > 1MB | ConfigMap silently fails | Keep memoized outputs small; artifacts for large data |
| CronWorkflow without `startingDeadlineSeconds` | Missed runs vanish silently | Set a deadline, monitor for skips |
| Single SA for all workflows | One compromise = full access | Least-privilege SA per workflow |
| Missing `clusterScope: true` in `templateRef` | ClusterWorkflowTemplate ref fails | Always set it when referencing cluster-scoped templates |
| Exit handler uses artifacts | Artifacts may not be available | Pass data via parameters or an external store |
| Mutex name collisions across teams | Unrelated workflows block each other | Namespace mutex names: `team-a/deploy-prod` |
| Unquoted expression tags | YAML parser breaks on `{{=...}}` | Always quote: `"{{=expr}}"` |
Question 1: What is the difference between a Resource template and a Container running kubectl?
Show Answer
Resource templates operate through the API server directly — no container, no image pull, and support `successCondition`/`failureCondition` for watching status. Container+kubectl is heavier but allows shell scripting. Use Resource for simple CRUD, Container for complex logic.

Question 2: Write the CronWorkflow spec for 3 AM UTC weekdays, skip if missed by >10 min.
Show Answer
```yaml
spec:
  schedule: "0 3 * * 1-5"
  timezone: "UTC"
  startingDeadlineSeconds: 600
  concurrencyPolicy: Forbid
```

Question 3: How does memoization work, and what is its key limitation?
Show Answer
Caches output parameters in a ConfigMap keyed by a user-defined key. On a cache hit (matching key, not expired), it returns the cached output without executing. Key limitation: **1MB per entry** (ConfigMap value cap). Only output parameters are cached, not artifacts.

Question 4: Explain `{{workflow.name}}` vs `{{=workflow.name}}`.
Show Answer
`{{workflow.name}}` is simple string substitution. `{{=workflow.name}}` evaluates an expr-lang expression — identical for simple refs, but expression tags enable logic: `"{{=workflow.status == 'Succeeded' ? 'PASS' : 'FAIL'}}"`.

Question 5: Limit GPU training workflows to 4 concurrent runs. How?
Show Answer
Create a ConfigMap with `data: { gpu: "4" }`, then use `spec.synchronization.semaphore.configMapKeyRef` pointing to that key. The fifth workflow queues until one completes. The ConfigMap value can be changed at runtime.

Question 6: What happens when an exit handler fails?
Show Answer
The workflow's final status becomes `Error`. Design robust exit handlers: add retries, use HTTP templates for speed, keep logic minimal. For critical notifications, use a fallback (dead-letter queue or persistent store).

Question 7: A WorkflowTemplate is updated after a workflow starts. Old or new version?
Show Answer
**Old version.** Templates are resolved at submission time and stored in the Workflow object. Updates do not affect in-flight workflows.

Question 8: Write a retry strategy: 3 retries, 30s exponential backoff capped at 5m, different nodes.
Show Answer
```yaml
retryStrategy:
  limit: 3
  retryPolicy: Always
  backoff: {duration: 30s, factor: 2, maxDuration: 5m}
  affinity:
    nodeAntiAffinity: {}
```

Sequence: attempt 1 runs immediately; retries follow after 30s/60s/120s, each on a different node.
Question 9: When should you use ContainerSet vs DAG with Containers?
Show Answer
**ContainerSet**: shared filesystem, tightly coupled steps, minimal scheduling overhead, fits on one node. **DAG**: independent steps, different resource needs, artifact passing via S3, independent retry/timeout per step, exceeds single-node capacity.

Hands-On Exercise: Production-Ready Scheduled Pipeline
```shell
kind create cluster --name capa-lab
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/latest/download/install.yaml
kubectl -n argo wait --for=condition=ready pod -l app=workflow-controller --timeout=120s
```

Step 1: Create supporting ConfigMaps
```shell
kubectl apply -n argo -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: deploy-semaphore
data:
  limit: "1"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: build-cache
data: {}
EOF
```

Step 2: Create WorkflowTemplate and CronWorkflow
```yaml
# Save as pipeline.yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: build-step
  namespace: argo
spec:
  templates:
    - name: build
      inputs:
        parameters: [{name: app-name}]
      memoize:
        key: "build-{{inputs.parameters.app-name}}"
        maxAge: "1h"
        cache:
          configMap: {name: build-cache}
      container:
        image: alpine
        command: [sh, -c]
        args: ["echo 'Building {{inputs.parameters.app-name}}' && sleep 3 && echo 'done' > /tmp/result.txt"]
      outputs:
        parameters:
          - name: build-id
            valueFrom: {path: /tmp/result.txt}
---
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: scheduled-pipeline
  namespace: argo
spec:
  schedule: "*/5 * * * *"
  startingDeadlineSeconds: 120
  concurrencyPolicy: Forbid
  workflowSpec:
    entrypoint: main
    onExit: cleanup
    synchronization:
      semaphore:
        configMapKeyRef: {name: deploy-semaphore, key: limit}
    templates:
      - name: main
        dag:
          tasks:
            - name: build-app
              templateRef: {name: build-step, template: build}
              arguments:
                parameters: [{name: app-name, value: my-service}]
            - name: approval
              template: pause
              dependencies: [build-app]
            - name: deploy
              template: deploy-step
              dependencies: [approval]
      - name: pause
        suspend: {duration: "10s"}
      - name: deploy-step
        retryStrategy: {limit: 2, retryPolicy: OnError, backoff: {duration: 5s, factor: 2}}
        container:
          image: alpine
          command: [sh, -c, "echo 'Deploying...' && sleep 2 && echo 'Done'"]
      - name: cleanup
        container:
          image: alpine
          command: [sh, -c]
          args: ["echo 'Exit handler: {{workflow.name}} status={{workflow.status}}'"]
```

```shell
kubectl apply -n argo -f pipeline.yaml
# Manually trigger instead of waiting 5 min
argo submit -n argo --from cronwf/scheduled-pipeline --watch
# Run again to verify memoization (build step should be cached)
argo submit -n argo --from cronwf/scheduled-pipeline --watch
```

Success Criteria
- CronWorkflow creates workflows on schedule
- WorkflowTemplate referenced via `templateRef`
- Memoization caches the build on the second run
- Suspend template pauses and auto-resumes
- Exit handler reports workflow status
- Semaphore prevents concurrent runs
Cleanup
```shell
kind delete cluster --name capa-lab
```

Key Takeaways
- Describe all 7 template types and when to use each
- Configure CronWorkflows with timezone, deadline, and concurrency policy
- Create and reference WorkflowTemplates and ClusterWorkflowTemplates
- Implement exit handlers that branch on workflow status
- Use mutexes and semaphores for synchronization
- Configure memoization within the 1MB ConfigMap limit
- Attach lifecycle hooks for audit/observability
- Distinguish simple tags from expression tags
- Design retry strategies with backoff and node anti-affinity
- Apply least-privilege security with per-workflow service accounts
“Advanced workflows are not about complexity for its own sake. They are about making failure visible, recovery automatic, and operations predictable.”