Module 7.12: Ansible Operator SDK Fundamentals
Complexity: [COMPLEX]
Time to Complete: ~90 minutes
Prerequisites: Module 7.4: Ansible for Infrastructure, Kubernetes custom resources from K8s Extending Module 1.2: CRDs Advanced, controller fundamentals from K8s Extending Module 1.3: Controllers and client-go,
operator-sdk, Ansible,kubectl, Docker, and access to a Kubernetes cluster
Prerequisites
Section titled “Prerequisites”Before starting this module, you should be comfortable reading an Ansible role,
explaining why idempotent tasks are safer than shell scripts, and applying
ordinary Kubernetes resources with kubectl. Module 7.4 gives you the Ansible
side: playbooks, roles, collections, variables, and task result handling. The
Kubernetes Extending modules give you the control-plane side: custom resources,
the spec and status split, reconciliation, owner references, and why a
controller turns desired state into child objects.
You also need a local toolchain that can build and run a small operator. The
commands in this module assume kind, kubectl, Docker, make, ansible,
and operator-sdk are on your PATH. Operator SDK installation changes over
time, so use the official installation page for your platform; on macOS,
Homebrew is the simplest path, while Linux users commonly install the signed
release binary. Verify the tools before you begin so any later failure belongs
to the operator workflow rather than to a missing local dependency.
operator-sdk versionansible --versionkubectl version --client=truekind versiondocker version --format '{{.Client.Version}}'You do not need to know Go to complete this module. That is the point of the Ansible Operator pattern. You do need to know enough Kubernetes to recognize that an operator is not a batch job: it is a controller that keeps watching a custom resource, rerunning automation when the desired state or managed children change, and publishing observed state back to the API server.
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Design an Ansible Operator architecture that maps a
watches.yamlentry to an Ansible role, Kubernetes custom resource, RBAC boundary, and reconciliation loop. - Implement a minimum viable
DemoAppoperator with Operator SDK,kubernetes.core.k8stasks, custom status updates, and a kind deployment path. - Diagnose reconciliation failures by reading operator logs, CR status, RBAC denials, Ansible task output, and child-resource ownership signals.
- Evaluate when an Ansible Operator beats a Go Operator or Helm Operator for platform automation work.
- Plan a production readiness path for idempotency, status management, finalizers, image publishing, testing, and future advanced watch patterns.
Why This Module Matters
Section titled “Why This Module Matters”Hypothetical scenario: a platform team already has a mature Ansible role that creates a per-application Deployment, Service, ConfigMap, network policy, and a few external configuration hooks. Application teams keep asking for a self-service API: “I want to submit a small object to the cluster and let the platform handle the rest.” The platform engineers know Kubernetes well enough to model the request as a custom resource, but they do not want the first iteration to become a Go/Kubebuilder project with generated types, controller runtime clients, cache predicates, and a month of new language conventions.
An Ansible Operator gives that team a middle path. Operator SDK still provides
the Kubernetes controller shell: manager process, watches, RBAC manifests, CRD
generation, container image, kustomize deployment, and optional OLM packaging.
Instead of writing the reconcile function in Go, the team maps a custom
resource to an Ansible role. When the resource changes, the operator runs the
role with CR fields as variables. The role uses the modern
kubernetes.core.k8s collection to create or patch child resources. The API
server remains the source of desired state, while Ansible remains the language
the platform team already knows.
This is not a shortcut around controller design. Ansible Operators still need
idempotent logic, clear ownership, safe status reporting, RBAC that matches the
managed resources, and a plan for deletion. The advantage is that the first
controller can be built from the team’s existing operational vocabulary:
roles, variables, tasks, handlers, changed_when, failed_when, and rescue
blocks. This module establishes that vocabulary so later modules can go deep
on advanced watches.yaml patterns, AWX and Event-Driven Ansible integration,
Helm-versus-Ansible decisions, production hardening, and testing.
War Story - The Rollout Role That Became a Controller
Section titled “War Story - The Rollout Role That Became a Controller”Hypothetical scenario: a platform group owned a configuration rollout job for internal web services. Each service needed a Deployment, Service, standard labels, a team-owned ConfigMap, and a sidecar toggle for environments that required extra telemetry. The first version was a normal Ansible playbook run from a release pipeline. It worked during scheduled changes, but it did not respond when an application team edited replicas directly, deleted a Service while debugging, or created a new environment outside the release window.
The team considered a Go Operator because the desired interface was clearly a
Kubernetes API: create a DemoApp-like object and let the platform reconcile
the underlying objects. The barrier was not ideology; it was delivery cost.
The team had strong Ansible habits, a tested role, and a short window to make
the workflow self-service. Rewriting the whole thing in Go would have created
new code paths before they had proven the API contract.
They chose an Ansible Operator for the first release. The winning pattern was
simple: keep the custom resource small, map it to one role, let
kubernetes.core.k8s apply complete object definitions, and update .status
with enough information for application teams to debug readiness without
reading operator logs. The team did not try to model every future toggle in
the first CRD. They shipped image, replicas, port, and a short set of
labels, then added fields only after real users needed them.
The footgun was periodic reconciliation. Someone set a very short
reconcilePeriod because it felt safer to check often. It was not safer. The
role was idempotent, but it still made API calls, produced Ansible artifacts,
and generated log volume on every run. A low period turned harmless drift
checks into controller noise. The fix was to rely on Kubernetes watch events
for ordinary changes, keep the periodic interval conservative, and make
explicit decisions about which dependent resources should trigger new runs.
The lesson is practical: Ansible Operators are strongest when they turn known, idempotent operational procedures into Kubernetes-native reconciliation. They are weakest when teams treat them as cron jobs in disguise or hide a complex state machine inside YAML because they are avoiding a more appropriate Go controller.
Conceptual Model: Ansible Operator Architecture
Section titled “Conceptual Model: Ansible Operator Architecture”An ordinary Kubernetes controller watches one or more object types, receives an event, reads the desired state, compares it with observed state, and issues API calls until the cluster moves closer to the desired state. A Go Operator puts that logic in a reconcile function. A Helm Operator renders a chart with values from the custom resource. An Ansible Operator runs a role or playbook with variables derived from the custom resource.
+--------------------------------------------------------------------+| ANSIBLE OPERATOR MENTAL MODEL |+--------------------------------------------------------------------+| || Custom Resource || apiVersion: app.example.com/v1 || kind: DemoApp || spec: || replicas: 2 || image: nginx:1.27-alpine || || watch event or periodic reconcile || | || v || watches.yaml maps DemoApp to role demoapp || | || v || Operator SDK manager starts ansible-runner || | || v || Role variables include CR spec fields and metadata || | || v || kubernetes.core.k8s creates Deployment and Service || | || v || operator_sdk.util.k8s_status updates observed status || |+--------------------------------------------------------------------+The crucial file is watches.yaml. It tells the operator which group, version,
and kind should trigger which Ansible role or playbook. For a foundational
operator, one custom resource maps to one role. Later modules will explore
selectors, multiple watches, dependent watches, custom finalizers, and
collection roles, but the anchor idea stays the same: watches.yaml is the
bridge between the Kubernetes API and your Ansible content.
---- version: v1 group: app.example.com kind: DemoApp role: demoapp manageStatus: falseThis mapping means: when the controller sees an app.example.com/v1 DemoApp,
run the demoapp Ansible role. manageStatus: false is deliberate in this
module because the role will update a small custom status block itself. If you
leave status management enabled, the Ansible Operator can publish generic
Ansible run information for you. That default is useful for scaffolding, but
production APIs usually deserve status fields that communicate application
state rather than raw automation internals.
The reconcile loop looks familiar if you already write Ansible. Instead of
“run this role against host group webservers,” the operator effectively says
“run this role for this custom resource instance.” The inventory is not a
fleet of servers. The target is the Kubernetes API, accessed through the
operator process and the service account mounted into the controller Pod. The
role receives variables from .spec, plus metadata about the CR instance, and
then uses the Kubernetes collection to converge resources.
+----------------------+ +------------------------------------+| Ansible term | | Operator term |+----------------------+--------+------------------------------------+| Role | | Reconcile implementation || Task | | One step toward desired state || Variable | | CR spec field or derived value || Idempotent module | | Safe repeated API operation || Task failure | | Failed reconcile || Registered result | | Observed state evidence || changed_when | | Signal whether work actually moved || rescue block | | Controlled error handling || k8s_status module | | Custom resource status update |+----------------------+--------+------------------------------------+Operator SDK calls your role when a watched custom resource is created,
updated, or deleted, and it can also rerun periodically according to the
reconcilePeriod option. It may also reconcile when owned dependent resources
change, depending on the watch settings. The healthy default mental model is
event-driven first, conservative periodic resync second. A controller should
react promptly to meaningful Kubernetes events without turning every resource
into a high-frequency polling loop.
spec and status have different jobs. The spec is the user’s desired
state: image, replica count, port, feature toggles, or whatever contract your
CRD exposes. The status is the operator’s observed state: whether the child
Deployment is ready, which Service was created, what phase the operator last
reported, and which error is currently blocking convergence. Users should edit
spec; controllers should write status.
Finalizers are the deletion hook. Without a custom finalizer, Kubernetes can delete the custom resource and garbage collect owned in-cluster children. That is enough for a simple DemoApp. If the operator creates external resources, the delete path may need a role or playbook that removes them before Kubernetes lets the CR disappear. This module mentions finalizers only to set vocabulary; the future arc module on production patterns will handle finalizer design in depth.
Pause and predict: if the role creates a Deployment without an owner reference
or operator-injected ownership metadata, what happens when the custom resource
is deleted? The child object may remain behind because Kubernetes does not know
the CR owns it. Ansible Operator’s Kubernetes proxy normally helps inject owner
references for same-namespace objects created through the Kubernetes modules,
but you should still verify ownership with kubectl get deployment -o yaml
when designing cleanup behavior.
Ansible Operator, Helm Operator, and Go Operator
Section titled “Ansible Operator, Helm Operator, and Go Operator”The three common Operator SDK implementation styles solve different problems. They all expose Kubernetes APIs, but they place the implementation logic in different places. Choosing between them is less about which tool is fashionable and more about the shape of your desired state, the skills of the team, and the complexity of the reconciliation logic.
| Implementation | Reconcile logic lives in | Best fit | Main trade-off |
|---|---|---|---|
| Ansible Operator | Ansible role or playbook | Teams with existing roles, procedural setup, post-install checks, or mixed Kubernetes and external tasks | Slower and less type-safe than Go for complex state machines |
| Helm Operator | Helm chart rendered from CR values | Applications that are already cleanly packaged as charts and mostly need install or upgrade semantics | Harder to express imperative validation, branching, and custom status |
| Go Operator | Go code using controller-runtime | Latency-sensitive controllers, complex resource graphs, custom caches, advanced predicates, webhooks, or strong typing | Higher implementation cost for teams that do not already write Go controllers |
An Ansible Operator is not “Helm with more YAML.” Helm is a template renderer plus release manager. Ansible is a task runner with modules, facts, conditionals, registered results, retries, and error blocks. If your operator needs to call a Kubernetes API, wait for readiness, inspect a resource, branch on the observed result, and update a status field, Ansible can express that more naturally than a chart. If your operator mostly installs a packaged application with a values file, a Helm Operator may be simpler.
An Ansible Operator is also not “Go without Go.” You still need controller discipline. The controller process, cache, RBAC, CRD, image, and deployment shape are real Kubernetes artifacts. Ansible only changes the implementation language of the reconcile body. If the reconcile body needs sophisticated queues, fine-grained event predicates, admission webhooks, cross-resource indexing, or high-frequency decisions, Go is usually the right tool.
The practical rule is this: use Ansible when the domain logic already looks like a role, use Helm when the domain logic already looks like a chart, and use Go when the domain logic already looks like a controller.
Designing the DemoApp API Before Scaffolding
Section titled “Designing the DemoApp API Before Scaffolding”Before running operator-sdk init, pause on the API contract. Operators become
hard to change when teams treat the first custom resource as an internal
implementation detail. A custom resource is a product surface. Application
teams will copy examples, dashboards will query fields, runbooks will mention
status values, and automation may patch spec fields directly. Even a teaching
operator should distinguish user intent from controller internals so the first
version teaches the right design habits.
The DemoApp API in this module is intentionally small: replicas, image,
and port. These fields are enough to prove the whole loop without pretending
the operator is a production application platform. replicas changes the
Deployment scale. image changes the container image. port changes the
Service port that other workloads would use inside the cluster. Every field is
observable in a child resource, which makes debugging concrete for learners.
Notice what the API does not expose. It does not expose every Deployment field, every Service field, raw pod template patches, arbitrary annotations, or a free-form map of extra Kubernetes objects. Those features are tempting because they make the operator appear flexible, but they also erase the purpose of the custom API. If users need to understand every underlying Kubernetes field, the operator has not simplified the workflow; it has merely wrapped raw manifests in another layer.
Good operator APIs are opinionated at the boundary and transparent in status. The spec should contain the knobs the platform team wants users to own. The role can apply labels, probes, selectors, service type, and naming conventions consistently. The status should then explain what happened: whether the Deployment is ready, which Service name was created, and which endpoint the user can test. That split lets the platform team enforce defaults without turning the CR into a black box.
Schema validation is part of the teaching model, not an optional polish step.
If replicas must be between 1 and 10, the API server should reject 20
before Ansible runs. If port must be a valid TCP port, the CRD should say so.
Fast API rejection produces better user feedback and prevents the role from
becoming a second, inconsistent validation layer. The role can still assert
critical assumptions, but it should not be the only guardrail.
Status design should be equally deliberate. A beginner operator often starts with whatever status the framework can produce automatically. That is useful while scaffolding, but it rarely answers the user’s question. A user does not usually ask, “Did Ansible task 5 mark changed?” They ask, “Is my application ready, how many replicas are ready, and what Service should I call?” The status fields in this module answer those operational questions directly.
The DemoApp status is still not a full Kubernetes conditions model. That is
intentional. A production API would likely publish conditions such as
Ready, Progressing, and ReconcileFailed, each with status, reason,
message, and last transition time. This module stays lighter so the first
operator remains readable. The important habit is that status belongs to the
controller and should communicate observed state rather than repeat the spec.
RBAC belongs in API design too. The moment a CRD promises “create a DemoApp,” the operator service account needs permission to create the child resources that promise requires. If the role creates Deployments and Services, the RBAC must say exactly that. If a later version creates Ingresses, Secrets, Jobs, or external-provider objects, the permission review changes. A small API makes that review easier because each field maps to a known operational effect.
Ownership is another early design decision. The custom resource should be the stable parent, and the child objects should be named from the parent metadata. If child names include the image tag, environment label, or a mutable display name, updates can create new objects instead of patching existing ones. Stable identity is what makes reconciliation feel boring: the same CR keeps managing the same Deployment and Service as their specs evolve.
The same principle applies to labels. The role uses both
app.kubernetes.io/name and app.kubernetes.io/instance because those labels
separate the operator’s application class from the individual CR instance. The
Service selector depends on those labels, the Deployment selector depends on
those labels, and troubleshooting commands can filter by those labels. A
controller that changes labels casually can break its own Service routing or
make garbage collection harder to reason about.
Finally, keep the first CRD boring enough that future modules can extend it. Advanced watch patterns, AWX integration, finalizers, Molecule tests, and Kuttl tests all become easier to teach when the first API is not overloaded. The vocabulary you need today is simple: a spec with a few user-owned fields, a role that reconciles child resources, status that reports observed state, and RBAC that permits exactly the promised operations.
Building a Minimum Viable Operator
Section titled “Building a Minimum Viable Operator”The minimum viable operator in this module manages a custom resource called
DemoApp. A user creates a small object with replicas, image, and port.
The Ansible role reconciles a Deployment and Service, then writes a compact
status block back to the CR. The example is intentionally small because the
goal is to understand the SDK structure rather than to hide the model under a
large application.
Create a working directory and initialize the project with the Ansible plugin.
The domain becomes part of the API group, so example.com plus group app
will produce app.example.com.
mkdir demoapp-operatorcd demoapp-operator
operator-sdk init --plugins=ansible --domain=example.comAdd the custom resource API and ask the SDK to generate a role. The role name
will match the kind in lower-case form, so this module uses demoapp
throughout the example.
operator-sdk create api --group=app --version=v1 --kind=DemoApp --generate-roleThe exact generated files can change slightly between SDK versions, but the
important layout is stable. watches.yaml maps the custom resource to the
role, roles/ contains your Ansible implementation, and config/ contains the
Kubernetes manifests used to install the CRD, service account, RBAC, and
controller Deployment.
demoapp-operator/├── Dockerfile├── Makefile├── PROJECT├── requirements.yml├── watches.yaml├── roles/│ └── demoapp/│ ├── defaults/│ │ └── main.yml│ ├── meta/│ │ └── main.yml│ └── tasks/│ └── main.yml└── config/ ├── crd/ │ └── bases/ ├── default/ ├── manager/ ├── rbac/ └── samples/Open watches.yaml and make the mapping explicit. The SDK usually scaffolds
the basic entry for you, but editing it now reinforces the contract. This
module disables generic status management because the role will publish the
status fields directly.
---- version: v1 group: app.example.com kind: DemoApp role: demoapp manageStatus: falseAdd the Ansible collections required by the role. The Kubernetes collection
provides kubernetes.core.k8s and kubernetes.core.k8s_info; the Operator SDK
utility collection provides operator_sdk.util.k8s_status. In a scaffolded
project, check both requirements.yml and the role metadata. The
requirements.yml file drives collection installation during make docker-build (the SDK
Dockerfile runs ansible-galaxy collection install -r requirements.yml). The
meta/main.yml collections: entry is a Galaxy dependency declaration — useful
only if the role is later published to Galaxy as a standalone artifact.
---collections: - name: kubernetes.core - name: operator_sdk.util---collections: - kubernetes.core - operator_sdk.utilThe custom resource should have a small structural schema. Edit the CRD under
config/crd/bases/app.example.com_demoapps.yaml so users get validation before
the role ever runs. The full generated CRD contains metadata and conversion
details; the important part for this module is the schema under the served
version.
apiVersion: apiextensions.k8s.io/v1kind: CustomResourceDefinitionmetadata: name: demoapps.app.example.comspec: group: app.example.com names: kind: DemoApp listKind: DemoAppList plural: demoapps singular: demoapp scope: Namespaced versions: - name: v1 served: true storage: true schema: openAPIV3Schema: type: object properties: apiVersion: type: string kind: type: string metadata: type: object spec: type: object properties: replicas: type: integer minimum: 1 maximum: 10 default: 1 image: type: string default: nginx:1.27-alpine port: type: integer minimum: 1 maximum: 65535 default: 80 required: - replicas status: type: object x-kubernetes-preserve-unknown-fields: true subresources: status: {}Now write the role. The role does four things: derive defaults, validate the
requested replica count, reconcile a Deployment, reconcile a Service, and write
status. Notice that the role never shells out to kubectl. It uses the
Kubernetes collection so Ansible can perform idempotent API operations and
return structured results.
---- name: Derive DemoApp desired values ansible.builtin.set_fact: demoapp_name: "{{ ansible_operator_meta.name }}" demoapp_namespace: "{{ ansible_operator_meta.namespace }}" demoapp_replicas: "{{ replicas | default(1) | int }}" demoapp_image: "{{ image | default('nginx:1.27-alpine') }}" demoapp_port: "{{ port | default(80) | int }}"
- name: Validate DemoApp replica bounds ansible.builtin.assert: that: - demoapp_replicas >= 1 - demoapp_replicas <= 10 fail_msg: "DemoApp replicas must be between 1 and 10."
- name: Reconcile DemoApp Deployment kubernetes.core.k8s: state: present wait: true wait_timeout: 120 # This blocks the reconcile loop for up to 120 s; production operators should use `k8s_info` with `retries`/`until` (see the Common Patterns section below) rather than blocking the worker on slow or flapping pods. definition: apiVersion: apps/v1 kind: Deployment metadata: name: "{{ demoapp_name }}" namespace: "{{ demoapp_namespace }}" labels: app.kubernetes.io/name: demoapp app.kubernetes.io/instance: "{{ demoapp_name }}" app.kubernetes.io/managed-by: ansible-operator spec: replicas: "{{ demoapp_replicas }}" selector: matchLabels: app.kubernetes.io/name: demoapp app.kubernetes.io/instance: "{{ demoapp_name }}" template: metadata: labels: app.kubernetes.io/name: demoapp app.kubernetes.io/instance: "{{ demoapp_name }}" spec: containers: - name: demoapp image: "{{ demoapp_image }}" imagePullPolicy: IfNotPresent ports: - name: http containerPort: 80 # The `containerPort` stays at 80 because nginx serves on 80 # regardless of which port the Service exposes; the CR `port` field # controls only the Service-facing port (its `targetPort` continues # to route to container port 80). readinessProbe: httpGet: path: / port: http initialDelaySeconds: 3 periodSeconds: 5 livenessProbe: httpGet: path: / port: http initialDelaySeconds: 10 periodSeconds: 10
- name: Reconcile DemoApp Service kubernetes.core.k8s: state: present definition: apiVersion: v1 kind: Service metadata: name: "{{ demoapp_name }}" namespace: "{{ demoapp_namespace }}" labels: app.kubernetes.io/name: demoapp app.kubernetes.io/instance: "{{ demoapp_name }}" app.kubernetes.io/managed-by: ansible-operator spec: type: ClusterIP selector: app.kubernetes.io/name: demoapp app.kubernetes.io/instance: "{{ demoapp_name }}" ports: - name: http port: "{{ demoapp_port }}" targetPort: http
- name: Read DemoApp Deployment after reconcile kubernetes.core.k8s_info: api_version: apps/v1 kind: Deployment name: "{{ demoapp_name }}" namespace: "{{ demoapp_namespace }}" register: demoapp_deployment
- name: Publish DemoApp status operator_sdk.util.k8s_status: api_version: app.example.com/v1 kind: DemoApp name: "{{ demoapp_name }}" namespace: "{{ demoapp_namespace }}" status: phase: >- {{ 'Ready' if ((demoapp_deployment.resources | first | default({}, true)).status.readyReplicas | default(0) | int) == demoapp_replicas else 'Progressing' }} replicas: "{{ demoapp_replicas }}" readyReplicas: "{{ (demoapp_deployment.resources | first | default({}, true)).status.readyReplicas | default(0) }}" serviceName: "{{ demoapp_name }}" endpoint: "http://{{ demoapp_name }}.{{ demoapp_namespace }}.svc.cluster.local:{{ demoapp_port }}"The role is deliberately boring. Boring is good here. Reconciliation runs more
than once, so the tasks must be safe when the Deployment and Service already
exist. The kubernetes.core.k8s module patches resources toward the supplied
definition rather than creating duplicates. The status task publishes the
current observation and can run repeatedly without changing the desired spec.
The operator service account also needs permission to manage the child
resources. Edit config/rbac/role.yaml and make sure it includes Deployments,
Services, and the custom resource status subresource. The generated file may
already contain the custom resource rules; add the core and apps permissions if
they are missing.
apiVersion: rbac.authorization.k8s.io/v1kind: ClusterRolemetadata: name: manager-rolerules: - apiGroups: - "" resources: - services verbs: - create - delete - get - list - patch - update - watch - apiGroups: - apps resources: - deployments verbs: - create - delete - get - list - patch - update - watch - apiGroups: - app.example.com resources: - demoapps - demoapps/status - demoapps/finalizers verbs: - create - delete - get - list - patch - update - watchUpdate the sample custom resource under
config/samples/app_v1_demoapp.yaml. The sample is what you will apply after
the operator is running.
apiVersion: app.example.com/v1kind: DemoAppmetadata: name: demoapp-samplespec: replicas: 2 image: nginx:1.27-alpine port: 80Build the operator image. In a normal shared cluster, push the image to a registry that the cluster can pull. Replace the registry namespace with one you control. This command path is the production-shaped path because Kubernetes nodes pull the operator image from a real registry.
export IMG="quay.io/YOUR_NAMESPACE/demoapp-operator:v0.1.0"
make docker-build docker-push IMG="${IMG}"make deploy IMG="${IMG}"For a local kind lab, you can build the image and load it directly into the kind node instead of using a remote registry. This keeps the lab runnable even when you do not want to create a registry repository for a throwaway example.
export IMG="demoapp-operator:v0.1.0"
make docker-build IMG="${IMG}"kind load docker-image "${IMG}" --name ansible-operator-labmake deploy IMG="${IMG}"The make deploy target uses kustomize under the hood. If you want to see the
manifests before applying them, render the default overlay. This is especially
useful when debugging image names, service account names, or RBAC scope.
kustomize build config/default | sed -n '1,180p'Create a namespace for the managed application and apply the sample custom
resource. The operator itself is usually deployed into a system namespace such
as demoapp-operator-system, while the custom resources can live in tenant
namespaces if RBAC allows it.
kubectl create namespace demoapp-labkubectl apply -n demoapp-lab -f config/samples/app_v1_demoapp.yamlObserve the reconciliation. First confirm that the custom resource exists, then inspect the child Deployment and Service. The Deployment should converge to two ready replicas, and the Service should point at the Pods selected by the labels in the role.
kubectl get demoapps.app.example.com -n demoapp-labkubectl get deployment,service,pods -n demoapp-labkubectl get demoapp demoapp-sample -n demoapp-lab -o yamlExpected output will vary in timestamps and image IDs, but the shape should be recognizable. The custom resource should exist, the Deployment should have the requested replica count, and the Service should expose port 80.
NAME AGEdemoapp-sample 1m
NAME READY UP-TO-DATE AVAILABLE AGEdeployment.apps/demoapp-sample 2/2 2 2 1m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGEservice/demoapp-sample ClusterIP 10.96.120.15 <none> 80/TCP 1mFinally, check the operator logs. Ansible output is one of the best early
debugging surfaces because it tells you which task ran, which task changed
something, and which task failed. In a generated SDK project, the controller
Deployment name usually includes controller-manager.
kubectl logs -n demoapp-operator-system deployment/demoapp-operator-controller-managerPause and predict: if you edit spec.replicas from 2 to 3, should the
operator create a new Deployment or patch the existing Deployment? The correct
behavior is a patch. The resource identity is still the same; only desired
state changed. If you see duplicate Deployments, your role is deriving names
from mutable inputs or bypassing the Kubernetes module in a way that breaks
idempotency.
Common Patterns for Ansible Reconcilers
Section titled “Common Patterns for Ansible Reconcilers”The first production pattern is to make every task idempotent. A reconcile loop
will run after create events, update events, dependent resource changes, and
periodic resyncs. A task that is safe only once is not a controller task. Use
kubernetes.core.k8s with complete resource definitions, stable names, stable
labels, and explicit state. Avoid shell commands that call kubectl apply
unless you have a very specific reason and a test proving idempotency.
- name: Reconcile a ConfigMap with a stable name kubernetes.core.k8s: state: present definition: apiVersion: v1 kind: ConfigMap metadata: name: "{{ ansible_operator_meta.name }}-settings" namespace: "{{ ansible_operator_meta.namespace }}" data: logLevel: "{{ log_level | default('info') }}"The second pattern is to treat status as a user interface. A status field is
not a dump of every Ansible variable; it is the compact observation users need
after they apply a CR. For a workload wrapper, useful fields might include
phase, readyReplicas, serviceName, endpoint, and a conditions list.
For an external integration, useful fields might include provider ID, last
sync time, and a clear failure reason. Keep status stable enough that humans
and automation can rely on it.
- name: Publish a focused status block operator_sdk.util.k8s_status: api_version: app.example.com/v1 kind: DemoApp name: "{{ ansible_operator_meta.name }}" namespace: "{{ ansible_operator_meta.namespace }}" status: phase: Ready readyReplicas: 2 serviceName: "{{ ansible_operator_meta.name }}"The third pattern is controlled failure. A failed Ansible task becomes a failed
reconcile, which is useful when the error is truly blocking. For expected
transient states, use retries, readiness checks, failed_when, and rescue
blocks so the status tells the truth. Do not hide every failure, but do not
make the controller crash-loop because a Deployment needed another few seconds
to become ready.
- name: Read Deployment availability with controlled retry kubernetes.core.k8s_info: api_version: apps/v1 kind: Deployment name: "{{ ansible_operator_meta.name }}" namespace: "{{ ansible_operator_meta.namespace }}" register: deployment_read retries: 10 delay: 6 until: > (deployment_read.resources | length) == 1 and ((deployment_read.resources | first | default({}, true)).status.readyReplicas | default(0) | int) >= (replicas | default(1) | int)The fourth pattern is to keep variable translation obvious. Operator SDK passes
custom resource spec fields into Ansible as variables and, by default, case
converts camelCase into snake_case. If your CRD uses serviceAccountName, your
role should expect service_account_name unless you explicitly disable the
conversion in watches.yaml. Document that mapping because it becomes a
common source of “variable is undefined” failures.
---- version: v1 group: app.example.com kind: DemoApp role: demoapp snakeCaseParameters: trueThe fifth pattern is to keep external side effects behind explicit fields and finalizers. It is easy to call a cloud API from Ansible, but a controller must also know how to observe, retry, report, and delete that external state. If the first version only manages in-cluster resources, you can rely on owner references and ordinary garbage collection. Once you create external state, design the finalizer and status story before adding the task.
Decision Framework: When Ansible Operator Beats Go Operator
Section titled “Decision Framework: When Ansible Operator Beats Go Operator”An Ansible Operator beats a Go Operator when the reconciliation problem is mostly operational automation and the team can express it safely with existing roles. It does not beat Go because YAML is inherently easier. It beats Go when the cost of introducing a full Go controller exceeds the value of strong typing, custom caches, advanced event filtering, and fine-grained control over the reconcile loop.
| Question | Ansible Operator is a strong fit when | Go Operator is a stronger fit when |
|---|---|---|
| Team skill | The team already writes and reviews Ansible roles well | The team already maintains Go services or controllers |
| Logic shape | Reconciliation is a sequence of idempotent tasks | Reconciliation is a complex state machine |
| Resource graph | A small CR manages a modest set of children | Many related resource types need indexes and predicates |
| Latency | Seconds-level convergence is acceptable | Sub-second or high-frequency decisions matter |
| Validation | Ansible assertions and CRD schema are enough | Compile-time types and admission webhooks are central |
| External systems | Existing modules already automate the external API | Custom API clients, caches, or streaming watches are needed |
| Delivery pressure | A working role can become a controller quickly | Long-term controller complexity justifies the upfront cost |
| Operations | Logs and task output are useful to the team | Structured metrics and custom reconcile paths dominate |
Use this table as a design review tool, not as a permanent identity. A team can start with an Ansible Operator to prove the custom resource contract, then rewrite the controller in Go later if the API becomes central and the state machine outgrows role-based automation. The custom resource interface is often more durable than the first implementation language.
When Not to Use Ansible Operator
Section titled “When Not to Use Ansible Operator”Do not use an Ansible Operator for latency-sensitive controllers. A role run has Ansible startup overhead, task execution overhead, API calls through modules, and artifact handling. That is acceptable for many platform workflows, but it is the wrong shape for controllers that need very fast reactions to large numbers of events.
Do not use an Ansible Operator for high-frequency reconcile loops. A short
reconcilePeriod can multiply API traffic and logs even when no meaningful
state has changed. If you are tempted to reconcile every few seconds, first
ask why watch events and dependent watches are not enough. If the answer is
“the domain requires constant control decisions,” Go is probably a better
implementation.
Do not use an Ansible Operator as a place to hide a complex state machine. Ansible can branch, retry, rescue, include tasks, and call APIs, but a large graph of states, timers, cross-resource dependencies, and partial failure paths becomes hard to reason about in YAML. Complex controllers benefit from typed structs, unit tests, explicit state transitions, and controller-runtime primitives.
Do not use an Ansible Operator when the desired behavior is already a clean Helm chart install. If the CR mostly passes values into templates and the application lifecycle is install, upgrade, and uninstall, a Helm Operator or plain GitOps-managed Helm release may be less surprising. Ansible earns its place when task logic, checks, external operations, or status reporting matter.
Do not use an Ansible Operator to bypass platform boundaries. The operator Pod runs with a service account, and that service account can be powerful. A role that talks to arbitrary external APIs or broad cluster resources must go through the same security review as any controller. YAML does not make side effects safer.
Troubleshooting the First Reconcile
Section titled “Troubleshooting the First Reconcile”The fastest troubleshooting path starts with the custom resource, not with the
operator code. Read the CR and confirm that the spec the API server stored
is the spec you intended to submit. If schema defaults applied, they should be
visible. If validation rejected the object, no role run happened, so operator
logs will not explain the failure. This separation matters because API
validation failures and reconcile failures live in different parts of the
system.
After the CR is accepted, inspect status. A useful Ansible Operator status
field should tell you whether the role reached the point where it could observe
child resources. If phase is missing, the role may not have run, the status
task may have failed, or manageStatus may be fighting with your custom
status behavior. If phase is Progressing, compare the reported ready
replicas with the Deployment status before assuming Ansible is wrong.
The next step is child-resource inspection. Look at the Deployment and Service labels, selectors, owner references, and events. A Service with the wrong selector can make an application look down even while Pods are healthy. A Deployment with a missing owner reference may survive CR deletion. A Pod image pull error may be a registry problem, not an operator problem. The controller is responsible for creating resources, but Kubernetes still owns scheduling, pulling images, probing containers, and recording Pod events.
Only after those checks should you read the operator logs in detail. Ansible
logs are task-oriented, which is useful when you know which CR triggered the
run. Search around the CR name and namespace, then identify the first failed
task rather than the last stack trace. If the first failed task is a
kubernetes.core.k8s call with a forbidden error, the fix is probably RBAC. If
the failed task is an assertion, the fix is probably input validation or a
clearer CRD schema.
Finally, verify permissions as the operator service account. kubectl auth can-i is not a replacement for logs, but it quickly proves whether the
controller is allowed to perform the operation your role requested. This habit
prevents a common time sink: editing Ansible tasks for an hour when the real
bug is that the deployed ClusterRole cannot patch Deployments in the tenant
namespace.
Production Readiness Checklist
Section titled “Production Readiness Checklist”Use this checklist when a prototype starts moving toward shared-cluster use. The goal is not to make a small operator heavyweight; the goal is to identify the production decisions that are easy to forget when the first lab works.
- The CRD schema validates required fields, allowed ranges, and defaults.
- The role is idempotent when the same CR is reconciled repeatedly.
- The role uses
kubernetes.core.k8sandkubernetes.core.k8s_infoinstead of shelling out tokubectl. - The operator service account has only the Kubernetes verbs and resources the role actually needs.
- The role publishes stable status fields that help users debug without reading controller logs.
- Failure paths use clear Ansible errors, controlled retries, and status updates for expected transient states.
- Any external resource creation has a deletion story, usually involving a finalizer.
- The image tag is immutable in production deployment manifests.
- The container registry, image signature, and vulnerability scanning workflow are defined.
- The operator logs are collected, searchable, and mapped to CR names and namespaces.
- The team has a local test path and a cluster test path before changing production roles.
- The team has decided which future advanced watch patterns are allowed before adding multiple CRDs or dependent watches.
Did You Know?
Section titled “Did You Know?”- Operator SDK’s current documentation shows the Ansible quickstart using
operator-sdk init --domain example.com --plugins ansible, followed byoperator-sdk create api --generate-role, which is the same scaffold pattern this module uses. watches.yamlsupports a periodicreconcilePeriodfor cases where watch events are not enough, and the documentation warns that low values can make expensive operators perform worse rather than safer.- Custom resource
specfields are converted to Ansible variables, and the default case conversion turns camelCase fields such asserviceAccountNameinto snake_case variables such asservice_account_name. - The modern Kubernetes Ansible collection is
kubernetes.core; older examples that use legacy Kubernetes modules should be treated as historical unless a maintained project still documents that exact path.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Consequence | Fix |
|---|---|---|
| Treating the operator as a scheduled playbook runner | The role runs repeatedly without a clear event model, creating noisy logs and surprising API load | Design for watch-driven reconciliation first, then add periodic reconciliation only for a specific reason |
Using shell tasks that call kubectl | Tasks become harder to make idempotent, harder to inspect, and dependent on CLI behavior inside the image | Use kubernetes.core.k8s and kubernetes.core.k8s_info for structured API operations |
| Forgetting RBAC for child resources | The CR is accepted but reconciliation fails with forbidden errors when the role creates Deployments or Services | Add only the required verbs and resources to config/rbac/role.yaml, then test with the deployed service account |
| Publishing raw Ansible output as the only status signal | Users cannot tell whether the application is ready without reading controller logs | Set manageStatus: false and write focused fields with operator_sdk.util.k8s_status |
| Deriving child resource names from mutable spec fields | Edits create duplicate Deployments or Services instead of patching the existing children | Derive names from CR metadata and keep mutable inputs inside the child spec |
Setting a very short reconcilePeriod | The operator polls aggressively, burns API budget, and hides real event-driven behavior under noise | Keep the period conservative and rely on watched resource events for ordinary convergence |
| Creating external resources without deletion design | Deleting the CR leaves cloud or SaaS resources behind because Kubernetes garbage collection cannot see them | Add a finalizer and make deletion idempotent before shipping the external side effect |
| Letting the CRD schema stay loose | Invalid values reach the role and fail later with unclear Ansible errors | Add structural schema validation for required fields, ranges, enums, and defaults |
Quiz: Self-Check
Section titled “Quiz: Self-Check”Your team has a working Ansible role that creates a Deployment and Service from a few variables. Product teams want a Kubernetes-native self-service API. Why might an Ansible Operator be a reasonable first implementation?
An Ansible Operator lets the team keep the existing role vocabulary while
placing it behind a Kubernetes custom resource and reconcile loop. The team can
ship a small CRD, map it in watches.yaml, and use kubernetes.core.k8s tasks
to converge child resources without immediately writing a Go controller. This
is a good fit when seconds-level convergence is acceptable and the logic is a
sequence of idempotent tasks rather than a complex state machine.
A `DemoApp` custom resource is accepted, but the operator logs show forbidden errors when creating Deployments. What should you inspect first?
Inspect the operator service account’s RBAC, especially config/rbac/role.yaml
and the rendered ClusterRole or Role applied to the cluster. The CRD being
accepted only proves the API server understands the custom resource; it does
not prove the controller can create child objects. Add the minimal apps
Deployment verbs the role needs, redeploy the operator manifests, and rerun the
reconcile by editing or reapplying the CR.
Your CR has `spec.serviceAccountName`, but the role fails because `serviceAccountName` is undefined. What is the likely cause?
Operator SDK passes spec fields into Ansible and, by default, converts camelCase
field names into snake_case variables. The role should look for
service_account_name unless snakeCaseParameters was disabled in
watches.yaml. The better fix is to document the mapping and keep role
variable names consistent with the operator’s conversion behavior rather than
guessing at both forms.
Note: the CRD declares spec.serviceAccountName (camelCase), but inside the Ansible
role you receive it as service_account_name (snake_case) because watches.yaml
has snakeCaseParameters: true. This is the most common point of confusion when
porting between SDK and Ansible Operator.
A teammate proposes setting `reconcilePeriod: 5s` because frequent reconciliation feels safer. How would you evaluate that proposal?
Frequent periodic reconciliation is not automatically safer; it can create API load, log noise, and repeated Ansible artifact churn even when no desired state changed. First confirm whether ordinary watch events and dependent resource watches already cover the failure mode the teammate is worried about. A short period should be reserved for a documented case where no useful Kubernetes event exists and where the operator’s API cost has been tested.
The role creates an external DNS record through an API call when the CR is created. What design topic becomes mandatory before production?
Deletion design becomes mandatory because Kubernetes garbage collection cannot clean up an external DNS record by itself. The operator likely needs a finalizer so a delete event can run an idempotent cleanup role or playbook before the custom resource is removed. The status should also expose enough external identity information for operators to confirm which remote object is being managed.
The `DemoApp` role creates a new Deployment every time `spec.image` changes. Which implementation smell does that reveal?
The role is probably deriving the child Deployment name from mutable spec data,
or it is creating resources outside an idempotent kubernetes.core.k8s
definition. Child resource identity should normally come from stable CR
metadata such as the CR name and namespace. Mutable values such as image,
replicas, and port belong inside the Deployment or Service spec so edits patch
the existing child resources.
Hands-On Exercise: Runnable End-to-End Lab
Section titled “Hands-On Exercise: Runnable End-to-End Lab”This lab builds the DemoApp operator in a kind cluster. It uses a local image
load path so you do not need a public registry, and it keeps the managed
application in a separate namespace from the operator. The lab assumes your
machine can run Docker containers and that operator-sdk is installed from
the official installation instructions.
Task 1: Create the kind Cluster
Section titled “Task 1: Create the kind Cluster”Create a Kubernetes 1.35 kind cluster for the operator lab. If you already have a cluster with this name, delete it first so the lab starts from known state.
kind delete cluster --name ansible-operator-labkind create cluster --name ansible-operator-lab --image kindest/node:v1.35.0kubectl cluster-info --context kind-ansible-operator-labIf your local kind version doesn’t have the v1.35.0 node image cached, drop the --image flag entirely — kind create cluster will default to whatever node image your installed kind ships with, which is typically a recent stable release.
Expected output should show the control plane endpoint for the kind cluster and
confirm that kubectl is using the correct context. If kubectl points at a
different cluster, stop and fix the context before deploying the operator.
kubectl config current-contextSolution guidance for Task 1
The context should be kind-ansible-operator-lab. If it is not, run
kubectl config use-context kind-ansible-operator-lab and repeat the current
context check. Starting with a known local cluster keeps the operator RBAC,
namespace, and image-loading behavior predictable during the rest of the lab.
Task 2: Scaffold the Operator Project
Section titled “Task 2: Scaffold the Operator Project”Create the SDK project and generate the API. The command names and flags are the same as the conceptual walkthrough, but this time you will edit the files for a real local run.
mkdir demoapp-operatorcd demoapp-operator
operator-sdk init --plugins=ansible --domain=example.comoperator-sdk create api --group=app --version=v1 --kind=DemoApp --generate-roleInspect the generated files before editing. This is not busywork; knowing what the SDK generated helps you debug later when a manifest, RBAC rule, or image name is not what you expected.
find . -maxdepth 3 -type f | sort | sed -n '1,120p'cat watches.yamlSolution guidance for Task 2
You should see watches.yaml, roles/demoapp/, config/rbac/,
config/crd/, config/manager/, and config/samples/. If the role directory
has a different case or spelling, use the generated name consistently rather
than forcing the tutorial’s spelling. The watch entry must point to the role
name that exists inside roles/.
Task 3: Add the Role, Schema, and RBAC
Section titled “Task 3: Add the Role, Schema, and RBAC”Replace or edit watches.yaml so it maps DemoApp to the demoapp role and
disables generic status management for this module.
---- version: v1 group: app.example.com kind: DemoApp role: demoapp manageStatus: falseEnsure requirements.yml includes the two collections used by the role.
---collections: - name: kubernetes.core - name: operator_sdk.utilEnsure roles/demoapp/meta/main.yml declares the same collections for role
execution.
---collections: - kubernetes.core - operator_sdk.utilEdit roles/demoapp/tasks/main.yml with the role from the walkthrough. If you
copy the full role, keep the names stable and keep imagePullPolicy: IfNotPresent; that setting lets kind use the locally loaded operator-managed
workload image if you later experiment with custom images.
---- name: Derive DemoApp desired values ansible.builtin.set_fact: demoapp_name: "{{ ansible_operator_meta.name }}" demoapp_namespace: "{{ ansible_operator_meta.namespace }}" demoapp_replicas: "{{ replicas | default(1) | int }}" demoapp_image: "{{ image | default('nginx:1.27-alpine') }}" demoapp_port: "{{ port | default(80) | int }}"
- name: Validate DemoApp replica bounds ansible.builtin.assert: that: - demoapp_replicas >= 1 - demoapp_replicas <= 10 fail_msg: "DemoApp replicas must be between 1 and 10."
- name: Reconcile DemoApp Deployment kubernetes.core.k8s: state: present wait: true wait_timeout: 120 definition: apiVersion: apps/v1 kind: Deployment metadata: name: "{{ demoapp_name }}" namespace: "{{ demoapp_namespace }}" labels: app.kubernetes.io/name: demoapp app.kubernetes.io/instance: "{{ demoapp_name }}" app.kubernetes.io/managed-by: ansible-operator spec: replicas: "{{ demoapp_replicas }}" selector: matchLabels: app.kubernetes.io/name: demoapp app.kubernetes.io/instance: "{{ demoapp_name }}" template: metadata: labels: app.kubernetes.io/name: demoapp app.kubernetes.io/instance: "{{ demoapp_name }}" spec: containers: - name: demoapp image: "{{ demoapp_image }}" imagePullPolicy: IfNotPresent ports: - name: http containerPort: 80 readinessProbe: httpGet: path: / port: http initialDelaySeconds: 3 periodSeconds: 5 livenessProbe: httpGet: path: / port: http initialDelaySeconds: 10 periodSeconds: 10
- name: Reconcile DemoApp Service kubernetes.core.k8s: state: present definition: apiVersion: v1 kind: Service metadata: name: "{{ demoapp_name }}" namespace: "{{ demoapp_namespace }}" labels: app.kubernetes.io/name: demoapp app.kubernetes.io/instance: "{{ demoapp_name }}" app.kubernetes.io/managed-by: ansible-operator spec: type: ClusterIP selector: app.kubernetes.io/name: demoapp app.kubernetes.io/instance: "{{ demoapp_name }}" ports: - name: http port: "{{ demoapp_port }}" targetPort: http
- name: Read DemoApp Deployment after reconcile kubernetes.core.k8s_info: api_version: apps/v1 kind: Deployment name: "{{ demoapp_name }}" namespace: "{{ demoapp_namespace }}" register: demoapp_deployment
- name: Publish DemoApp status operator_sdk.util.k8s_status: api_version: app.example.com/v1 kind: DemoApp name: "{{ demoapp_name }}" namespace: "{{ demoapp_namespace }}" status: phase: >- {{ 'Ready' if ((demoapp_deployment.resources | first | default({}, true)).status.readyReplicas | default(0) | int) == demoapp_replicas else 'Progressing' }} replicas: "{{ demoapp_replicas }}" readyReplicas: "{{ (demoapp_deployment.resources | first | default({}, true)).status.readyReplicas | default(0) }}" serviceName: "{{ demoapp_name }}" endpoint: "http://{{ demoapp_name }}.{{ demoapp_namespace }}.svc.cluster.local:{{ demoapp_port }}"Add the CRD schema and RBAC snippets from the walkthrough. Then render the operator manifests before deploying so you can catch indentation mistakes or image placeholder problems early.
kustomize build config/default >demoapp-operator-rendered.yamlsed -n '1,220p' demoapp-operator-rendered.yamlSolution guidance for Task 3
If kustomize fails, the problem is usually malformed YAML in the CRD, RBAC, or manager overlay. If kustomize succeeds but the operator later fails to create child resources, inspect RBAC first. Rendering manifests before deployment is one of the fastest ways to move from “the SDK generated something” to “I know what Kubernetes will receive.”
Task 4: Build and Deploy the Operator
Section titled “Task 4: Build and Deploy the Operator”Build the operator image, load it into kind, and deploy the operator. This path avoids a remote image registry while still exercising the same image and kustomize deployment mechanics used by the production-shaped flow.
export IMG="demoapp-operator:v0.1.0"
make docker-build IMG="${IMG}"kind load docker-image "${IMG}" --name ansible-operator-labmake deploy IMG="${IMG}"Wait for the controller Deployment. The namespace name comes from the generated
project name; if your directory name differs, inspect config/default or list
namespaces to find the generated system namespace.
kubectl get namespaces | grep demoappkubectl rollout status deployment/demoapp-operator-controller-manager \ -n demoapp-operator-system \ --timeout=180sSolution guidance for Task 4
If the rollout waits forever, check the controller Pod events and image name.
In kind, kind load docker-image must target the same cluster name you created
in Task 1. If the Pod starts but crashes, read the logs; collection install
errors, malformed watches.yaml, and RBAC mistakes usually appear clearly in
the controller output.
Task 5: Create a DemoApp Instance and Observe Reconciliation
Section titled “Task 5: Create a DemoApp Instance and Observe Reconciliation”Create the namespace for the managed application and apply the sample custom
resource. If your sample file lacks a namespace, the -n demoapp-lab flag
places the CR in the lab namespace.
kubectl create namespace demoapp-labkubectl apply -n demoapp-lab -f config/samples/app_v1_demoapp.yamlWatch the child resources. A successful reconciliation creates a Deployment, two Pods, and a Service with the same stable instance label.
kubectl get demoapps.app.example.com -n demoapp-labkubectl get deployment,service,pods -n demoapp-labkubectl get demoapp demoapp-sample -n demoapp-lab -o jsonpath='{.status.phase}{"\n"}'Scale through the custom resource instead of editing the Deployment. This is the point of the operator API: users change desired state on the CR, and the controller converges children.
kubectl patch demoapp demoapp-sample \ -n demoapp-lab \ --type=merge \ -p '{"spec":{"replicas":3}}'
kubectl rollout status deployment/demoapp-sample -n demoapp-lab --timeout=180skubectl get deployment demoapp-sample -n demoapp-labSolution guidance for Task 5
The Deployment should scale to three replicas without creating a second
Deployment. If the status remains Progressing, check the Pods and then the
operator logs. If the Deployment does not change at all, confirm the CR patch
actually changed .spec.replicas and that the operator has permission to patch
Deployments in demoapp-lab.
Task 6: Break It Deliberately and Diagnose the Failure
Section titled “Task 6: Break It Deliberately and Diagnose the Failure”Patch the CR to an invalid replica count. The CRD schema should reject the request before Ansible runs. This proves validation is happening at the API boundary.
kubectl patch demoapp demoapp-sample \ -n demoapp-lab \ --type=merge \ -p '{"spec":{"replicas":20}}'Now simulate an RBAC-style failure by reading the operator logs and identifying the service account that would need additional permissions if a task failed. You do not need to break RBAC manually for the lab; the diagnostic workflow is the important part.
kubectl get pod -n demoapp-operator-systemkubectl logs -n demoapp-operator-system deployment/demoapp-operator-controller-manager | tail -n 80kubectl auth can-i patch deployments \ --as=system:serviceaccount:demoapp-operator-system:demoapp-operator-controller-manager \ -n demoapp-labSolution guidance for Task 6
The invalid patch should fail because the schema caps replicas at 10. The
kubectl auth can-i command should answer yes after RBAC is correct. When a
real reconcile fails, collect three pieces of evidence before changing code:
the CR .status, the operator logs around the failed task, and the Kubernetes
authorization result for the service account.
Task 7: Clean Up
Section titled “Task 7: Clean Up”Delete the custom resource, undeploy the operator, and delete the kind cluster. Cleanup matters because controller labs can leave CRDs and namespaces behind, which makes later runs confusing.
kubectl delete demoapp demoapp-sample -n demoapp-labmake undeploykind delete cluster --name ansible-operator-labSuccess Criteria
Section titled “Success Criteria”- You designed the
watches.yamlmapping fromDemoAppto thedemoapprole and can explain how the mapping starts reconciliation. - You implemented the DemoApp operator with Operator SDK,
kubernetes.core.k8s, andoperator_sdk.util.k8s_status. - You deployed the operator image into a kind cluster and created a
DemoAppcustom resource. - You observed the Deployment, Service, Pods, and custom resource status created by the role.
- You diagnosed at least one validation or reconciliation failure using CR status, operator logs, and RBAC checks.
- You evaluated whether the example would still fit Ansible Operator if it grew external resources, short reconcile periods, or a complex state machine.
Next Arc Modules
Section titled “Next Arc Modules”This module is the vocabulary-setter for the Ansible Operator arc. Future
modules can now assume that watches.yaml, role-driven reconciliation,
kubernetes.core, custom status, and the Ansible-versus-Go trade-off have
already been introduced.
- Module 7.13 (planned): Advanced
watches.yamlPatterns - selectors, multiple watches, dependent watches, reconcile period tuning, and annotation overrides. - Module 7.14 (planned): AWX, Tower, and Event-Driven Ansible Integration - where cluster controllers and enterprise automation controllers should meet, and where they should remain separate.
- Module 7.15 (planned): Helm vs. Ansible Operator Decision Framework - a deeper implementation-choice guide for chart-shaped, role-shaped, and code-shaped controllers.
- Module 7.16 (planned): Production Ansible Operator Patterns - finalizers, status conditions, RBAC minimization, image publishing, observability, and upgrade strategy.
- Module 7.17 (planned): Testing Ansible Operators with Molecule and Kuttl - local role tests, cluster integration tests, and CI gates for operator behavior.
References
Section titled “References”- https://sdk.operatorframework.io/docs/building-operators/ansible/
- https://sdk.operatorframework.io/docs/building-operators/ansible/quickstart/
- https://sdk.operatorframework.io/docs/building-operators/ansible/tutorial/
- https://sdk.operatorframework.io/docs/building-operators/ansible/reference/watches/
- https://sdk.operatorframework.io/docs/building-operators/ansible/reference/information-flow-ansible-operator/
- https://sdk.operatorframework.io/docs/building-operators/ansible/reference/finalizers/
- https://sdk.operatorframework.io/docs/installation/
- https://github.com/ansible-collections/kubernetes.core
- https://docs.ansible.com/ansible/latest/collections/kubernetes/core/k8s_module.html
- https://docs.ansible.com/ansible/latest/collections/kubernetes/core/k8s_info_module.html
- https://github.com/operator-framework/ansible-operator-plugins
- https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
- https://kubernetes.io/docs/concepts/architecture/controller/
Next Module
Section titled “Next Module”Continue with Module 7.13 (planned): Advanced watches.yaml Patterns to
learn how one Ansible Operator can manage multiple kinds, tune event behavior,
control dependent watches, and make deliberate choices about periodic
reconciliation.