Module 1.5: CRDs & Operators - Extending Kubernetes
Complexity: [MEDIUM] - New to CKA 2025
Time to Complete: 35-45 minutes
Prerequisites: Module 1.1 (Control Plane understanding)
Learning Outcomes
After this module, you will be able to:
- Design a Custom Resource Definition with schema validation, scope, printer columns, and subresources that fit a real operational purpose.
- Debug CRD discovery, API group, version, validation, and deletion problems using kubectl and Kubernetes API metadata.
- Implement an operator-style reconciliation loop that turns a custom resource into ordinary Kubernetes objects.
- Evaluate when a CRD and operator are better than built-in resources, Helm values, or plain configuration files.
- Diagnose operator reconciliation failures by inspecting custom resource status, events, controller pods, and RBAC boundaries.
Why This Module Matters
Exercise scenario: your platform team wants application teams to request TLS certificates, monitoring targets, and small databases without opening tickets for every supporting object. The teams do not want to learn every Secret, Service, StatefulSet, and certificate challenge detail, but the cluster still needs validation, auditability, RBAC, and repeatable operations. A Custom Resource Definition gives the API server a new resource type, and an operator gives that resource type behavior.
This is the same extension model behind widely used Kubernetes tools such as cert-manager, the Prometheus Operator, Argo CD, Istio, and many storage and database operators. Those tools do not ask the Kubernetes project to add a built-in Certificate, ServiceMonitor, or PostgreSQL kind. They register those kinds through CRDs, then run controllers that watch the new resources and reconcile ordinary Kubernetes objects until the cluster matches the desired state.
For the CKA, the point is not to become an operator author in one lesson. The point is to recognize what happens when the API grows beyond Pods and Deployments, know how to inspect the new types, and separate storage from behavior. A CRD by itself stores and validates data; an operator watches that data and acts on it. Confusing those two ideas is the fastest path to mystery failures during installation, upgrades, and incident response.
Think of Kubernetes like a public building with a front desk. Built-in resources are standard request forms the staff already knows how to process: Pod, Service, Deployment, ConfigMap. A CRD adds a new form to the desk, such as Certificate or Database. An operator is the trained staff member who reads that form, orders the right materials, checks progress, and updates the requester when the work is complete.
This module also matters because CRDs are often the first place where platform engineering becomes visible to application teams. A platform team can expose a small, stable API that says “I need a certificate for this DNS name” or “I need this workload monitored” while hiding the operational workflow behind the scenes. That is not merely convenience. It reduces the number of manual steps a team must remember, narrows who needs permission to touch sensitive resources, and gives the organization a consistent audit trail.
There is a cost to that power. Once a CRD is installed and teams begin committing custom resources to Git, the CRD becomes an API that people depend on. Renaming fields, changing defaults, altering scope, or removing versions can break deployments just as surely as changing a public REST API. Treat CRD design as contract design: start with the user problem, expose the smallest useful intent, and plan how the API can evolve without surprising every cluster that already stores those objects.
Custom Resources Extend the API, Not the Scheduler
A CRD extends the Kubernetes API with a new resource type. After the CRD is accepted by the API server, the new kind participates in normal API machinery: discovery, validation, storage in etcd, watch streams, RBAC checks, kubectl get, kubectl describe, and deletion. That is powerful because clients do not need a separate database or side API to track platform intent; the cluster API becomes the shared contract.
The extension is deliberately narrow. Installing a CRD does not teach the scheduler how to place a database, does not create pods, and does not run backups. The API server stores custom resources and enforces the schema you provide. Behavior comes from a controller, often called an operator, that observes the custom resources and creates or updates other resources. Pause and predict: if you define a Database CRD and then create a Database object with replicas: 3, what happens before any operator is installed?
The distinction matters because troubleshooting starts at different layers. If kubectl apply says it has no match for a kind, discovery or CRD installation is broken. If the object exists but nothing changes in the workload, the controller is absent, unhealthy, unauthorized, or unable to reconcile. If the object is rejected with a field error, the CRD schema is doing its job and the manifest does not match the declared API contract.
A useful way to reason about CRDs is to compare them with ConfigMaps. Both can store structured information, and both can be read by controllers or applications. The difference is that a CRD gets its own resource identity, discovery metadata, schema, RBAC verbs, watch stream, status model, and lifecycle behavior. If the data is just configuration consumed by one application, a ConfigMap may be enough. If the data is a platform object that many users create, inspect, secure, and automate, a CRD becomes more appropriate.
Custom resources also fit Kubernetes’ declarative style better than ad hoc scripts. A script can create several resources once, but it usually does not keep checking whether they still match the desired state. A custom resource can remain in the API as the long-lived statement of intent, while a controller repairs drift over time. That is the same mental model you already use with Deployments: you declare replicas, and a controller keeps working until the observed Pods match that declaration.
    Built-in Resources:          Custom Resources (via CRDs):
    ├── Pod                      ├── Certificate (cert-manager)
    ├── Deployment               ├── Prometheus (prometheus-operator)
    ├── Service                  ├── PostgreSQL (postgres-operator)
    ├── ConfigMap                ├── VirtualService (istio)
    └── ...                      └── YourOwnResource

Once a CRD exists, kubectl treats the new resource family like any other API resource. The plural name becomes the everyday command target, while short names can make interactive exploration faster. The commands below are intentionally ordinary because that is the design goal: custom resources should feel native to operators, scripts, admission policies, and humans who already know Kubernetes.
    # Built-in resource
    kubectl get pods

    # Custom resource (after CRD is installed)
    kubectl get certificates
    kubectl get prometheuses
    kubectl get postgresqls

A CRD definition has several pieces that appear again in real clusters. The API group keeps your type separate from built-in groups and from other extension projects. The version list controls which versions are served and which version is stored. The OpenAPI v3 schema tells the API server what shape valid custom resources must have. Scope determines whether instances live inside namespaces or at cluster level.
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: crontabs.stable.example.com  # <plural>.<group>
    spec:
      group: stable.example.com          # API group
      versions:
        - name: v1                       # API version
          served: true                   # Enable this version
          storage: true                  # Store in etcd
          schema:
            openAPIV3Schema:             # Validation schema
              type: object
              properties:
                spec:
                  type: object
                  properties:
                    cronSpec:
                      type: string
                    image:
                      type: string
                    replicas:
                      type: integer
      scope: Namespaced                  # or Cluster
      names:
        plural: crontabs                 # kubectl get crontabs
        singular: crontab                # kubectl get crontab
        kind: CronTab                    # Kind in YAML
        shortNames:
          - ct                           # kubectl get ct

The naming convention on metadata.name is not decorative. Kubernetes expects the CRD name to be the plural resource name followed by the API group, such as crontabs.stable.example.com. If that does not match spec.names.plural and spec.group, installation fails. This makes discovery unambiguous when many extension projects add resources with similar kinds.
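The naming rule can be checked before a CRD is ever applied. The following is a minimal sketch in Python, assuming the CRD manifest has already been loaded into a dictionary; the helper names `expected_crd_name` and `crd_name_is_consistent` are hypothetical and are not part of any Kubernetes client library.

```python
def expected_crd_name(crd: dict) -> str:
    """Return the name the naming rule requires: <plural>.<group>."""
    spec = crd["spec"]
    return f"{spec['names']['plural']}.{spec['group']}"


def crd_name_is_consistent(crd: dict) -> bool:
    """True when metadata.name matches spec.names.plural and spec.group."""
    return crd["metadata"]["name"] == expected_crd_name(crd)


# Trimmed-down version of the CronTab CRD from this module.
crontab_crd = {
    "metadata": {"name": "crontabs.stable.example.com"},
    "spec": {
        "group": "stable.example.com",
        "names": {"plural": "crontabs", "kind": "CronTab"},
    },
}

print(expected_crd_name(crontab_crd))       # crontabs.stable.example.com
print(crd_name_is_consistent(crontab_crd))  # True
```

A check like this could run in a pre-commit hook or CI step so that a mismatched name is caught during review rather than at install time.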
The API group deserves careful naming because it becomes part of every custom resource manifest. Public projects usually use a domain they control, while internal platforms often use an internal domain or organization-specific group. Avoid generic groups such as apps.example.com in real production APIs unless they are only for training, because collisions and unclear ownership make later operations harder. The group should tell a future engineer where the type came from and which team owns its lifecycle.
    # Apply the CRD
    kubectl apply -f crontab-crd.yaml

    # Verify it was created
    kubectl get crd crontabs.stable.example.com

    # Now you can create instances
    kubectl get crontabs
    # No resources found (expected - we haven't created any yet)

CRDs are stored in etcd like built-in resources, but the API server is still stricter than a plain key-value store. It validates objects against the schema, exposes the type through discovery, enforces RBAC on verbs such as get and update, and lets clients watch for changes. That shared machinery is why custom resources work cleanly with GitOps tools, admission controllers, audit logs, and standard operational workflows.
That shared machinery is also why CRDs should be installed through the same controlled process as other cluster APIs. A broken CRD can block application installs, and a removed CRD can delete the custom resources that depended on it. In mature clusters, CRD changes are reviewed like code, applied before the custom resources that use them, and tested during upgrades. If a chart or manifest bundle mixes CRDs and custom resources, check the installation order before assuming a later error is a controller problem.
The first custom resource instance looks like any other manifest because the CRD has made stable.example.com/v1 and CronTab recognizable to the API server. The object below records desired state, not an actual running cron implementation. Without a controller, it will sit in etcd as valid declarative data, visible and editable, but not acted upon by the cluster.
    apiVersion: stable.example.com/v1
    kind: CronTab
    metadata:
      name: my-cron-job
      namespace: default
    spec:
      cronSpec: "* * * * */5"
      image: my-awesome-cron-image
      replicas: 3

    kubectl apply -f my-crontab.yaml
    kubectl get crontabs
    kubectl get ct  # Using shortName
    kubectl describe crontab my-cron-job

Standard operations work because the custom resource is now an API resource. You can list it across namespaces, edit it, delete it, watch it, and retrieve its stored YAML. The exact verbs available still depend on RBAC, so a developer might be able to create a Certificate in one namespace while only platform administrators can manage a cluster-scoped issuer.
The important habit is to read the resource through both the user view and the API view. The user view asks whether the object expresses the desired outcome clearly. The API view asks which group, version, scope, schema, status, and permissions control that object. When a cluster contains many operators, the second view keeps you from treating every unfamiliar kind as mysterious. It is still Kubernetes API machinery, just extended by a project-specific contract.
    # Create
    kubectl apply -f crontab.yaml

    # List
    kubectl get crontabs -A

    # Describe
    kubectl describe crontab my-cron-job

    # Edit
    kubectl edit crontab my-cron-job

    # Delete
    kubectl delete crontab my-cron-job

    # Watch
    kubectl get crontabs -w

    # Get as YAML
    kubectl get crontab my-cron-job -o yaml

Before running the next command sequence in a practice cluster, decide which layer you expect to answer each question. Does kubectl api-resources prove that a controller is working, or only that the API server knows the type? Does kubectl describe on a custom resource show desired state, observed status, or both? That mental split makes CRD debugging feel much less random.
Another practical detail is that discovery can lag briefly during installation because clients cache API resource information. If you install a CRD and immediately apply a custom resource from a script, an older client cache or an ordering issue can produce a confusing “no matches” message. Re-running discovery, applying CRDs before dependent resources, and keeping installation steps explicit are simple ways to avoid that noise. In exams, slow down and prove the type exists before chasing unrelated symptoms.
Operators Turn Custom Resources Into Work
An operator is a controller that understands a specific domain and reconciles custom resources into real cluster changes. It watches the API, reads desired state from custom resources, compares that desired state with the actual objects in the cluster or an external system, and makes changes until the two match. The controller repeats this loop because Kubernetes is eventually consistent: pods restart, users edit objects, nodes fail, and external APIs return temporary errors.
The word operator is sometimes used loosely, but the CKA-relevant pattern is precise: CRDs define the API surface, and controllers implement the behavior. Many operators also ship RBAC, ServiceAccounts, Deployments, leader election configuration, admission webhooks, conversion webhooks, and status update logic. Those supporting pieces matter because a controller with the wrong permissions can watch resources successfully while failing to create the Deployment, Secret, or StatefulSet it is supposed to manage.
Operators are especially useful when the managed system has operational knowledge that does not fit into a static template. Databases need backups, failover, upgrades, and storage checks. Certificate systems need challenge handling, renewal timing, and external issuer communication. Monitoring systems need target discovery and generated configuration. A template can produce initial YAML, but an operator can keep interpreting intent as the world changes around it.
That does not mean every abstraction needs an operator. A controller is another running component with logs, metrics, RBAC, upgrades, and failure modes. If the desired behavior is one-time rendering, a chart or Kustomize overlay may be simpler. If the behavior must react to changes, repair drift, talk to external systems, and publish status, the operator pattern earns its complexity. The best operators reduce the operational surface exposed to users while taking responsibility for the harder moving parts.
    ┌────────────────────────────────────────────────────────────────┐
    │                        Operator Pattern                        │
    │                                                                │
    │  You create:                                                   │
    │  ┌─────────────────────────────────────────┐                   │
    │  │ apiVersion: databases.example.com/v1    │                   │
    │  │ kind: PostgreSQL                        │                   │
    │  │ spec:                                   │                   │
    │  │   version: "15"                         │                   │
    │  │   replicas: 3                           │                   │
    │  │   storage: 100Gi                        │                   │
    │  └─────────────────────────────────────────┘                   │
    │                       │                                        │
    │                       ▼                                        │
    │  ┌─────────────────────────────────────────┐                   │
    │  │          Operator (Controller)          │                   │
    │  │                                         │                   │
    │  │ Watches PostgreSQL resources            │                   │
    │  │ Creates:                                │                   │
    │  │   • StatefulSet with 3 replicas         │                   │
    │  │   • PVCs for 100Gi storage              │                   │
    │  │   • Services for connections            │                   │
    │  │   • Secrets for credentials             │                   │
    │  │   • ConfigMaps for configuration        │                   │
    │  │                                         │                   │
    │  │ Manages:                                │                   │
    │  │   • Automatic failover                  │                   │
    │  │   • Backups                             │                   │
    │  │   • Version upgrades                    │                   │
    │  └─────────────────────────────────────────┘                   │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘

In this diagram, the custom resource is the order form and the generated Kubernetes objects are the work product. The operator is not a one-time installer like a Helm chart; it keeps watching after the first apply. If a managed Deployment is deleted, the operator may recreate it. If a desired replica count changes, the operator may update the child object. If an external certificate authority rejects a request, the operator may write that failure into status conditions.
Owner references often connect managed child objects back to the custom resource, but they are not the whole story. Operators may manage objects that cannot safely be garbage-collected, objects in other namespaces, or resources in external systems such as DNS providers and cloud APIs. That is why status, events, finalizers, and logs remain important. You need to know whether the operator intended to own the child object, whether cleanup is blocked, and whether external work completed.
The reconciliation loop is the controller’s heartbeat. It starts with a watch event, reads desired and current state, calculates the difference, acts, and then waits for the next event or retry. Good controllers are idempotent, which means repeating the same reconciliation should be safe. That property is essential because watches can reconnect, retries can happen after partial work, and several resource changes can arrive while earlier work is still settling.
Reconciliation is easier to understand if you stop thinking in terms of commands and start thinking in terms of convergence. A user changes desired state. The controller notices eventually. The controller makes one or more changes. The cluster or an external system reports new observed state. The controller records that observation and decides whether more work remains. This loop may finish in seconds, or it may stay pending while waiting for storage, DNS, certificates, or human approval.
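The convergence idea can be sketched without a real cluster. The Python example below is an illustration only: `FakeCluster` is a hypothetical in-memory stand-in for the API server, and `reconcile` is an assumed shape for a controller step, not the API of any operator framework. It shows the idempotency property described above, where repeating a reconciliation after the state has converged changes nothing.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """Hypothetical in-memory stand-in for the API server (not a real client)."""
    crontabs: dict = field(default_factory=dict)     # name -> CronTab spec
    deployments: dict = field(default_factory=dict)  # name -> child object


def reconcile(cluster: FakeCluster, name: str) -> str:
    """Move the child object toward the desired state and report the action."""
    spec = cluster.crontabs.get(name)
    if spec is None:
        # Desired state is gone: remove the child if it still exists.
        return "deleted" if cluster.deployments.pop(name, None) else "noop"
    desired = {"image": spec["image"], "replicas": spec.get("replicas", 1)}
    current = cluster.deployments.get(name)
    if current == desired:
        return "noop"  # Already converged; repeating is safe.
    cluster.deployments[name] = desired
    return "created" if current is None else "updated"


cluster = FakeCluster()
cluster.crontabs["my-cron-job"] = {"image": "my-awesome-cron-image", "replicas": 3}
print(reconcile(cluster, "my-cron-job"))  # created
print(reconcile(cluster, "my-cron-job"))  # noop
cluster.deployments["my-cron-job"]["replicas"] = 1  # simulate manual drift
print(reconcile(cluster, "my-cron-job"))  # updated
```

A real controller would be triggered by watch events and would also write status, but the compare-then-act core tends to look like this: compute the desired child from the spec, compare it with what exists, and change only what differs.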
    ┌─────────────────────────────────────────────────────────────┐
    │                     Reconciliation Loop                     │
    │                                                             │
    │   ┌─────────┐                                               │
    │   │  Watch  │◄─────────────────────────────────────────┐    │
    │   └────┬────┘                                          │    │
    │        │  Event: PostgreSQL resource changed           │    │
    │        ▼                                               │    │
    │   ┌─────────┐                                          │    │
    │   │  Read   │  Get current state from cluster          │    │
    │   └────┬────┘                                          │    │
    │        │                                               │    │
    │        ▼                                               │    │
    │   ┌─────────┐                                          │    │
    │   │ Compare │  Current state vs. Desired state         │    │
    │   └────┬────┘                                          │    │
    │        │                                               │    │
    │        ▼                                               │    │
    │   ┌─────────┐                                          │    │
    │   │   Act   │  Create/Update/Delete resources          │    │
    │   └────┬────┘                                          │    │
    │        │                                               │    │
    │        └─────────────────────────────────────────────►─┘    │
    │                       Repeat forever                        │
    │                                                             │
    └─────────────────────────────────────────────────────────────┘

Hypothetical scenario: a Certificate custom resource exists, but the Secret named in spec.secretName never appears. The first question is not “is Kubernetes broken?” It is “which controller should reconcile this type, and can that controller see the object and write the target Secret?” You would check the CRD, the custom resource status and events, the operator Deployment, the operator logs, and the RBAC binding that gives the controller permission to manage Secrets.
What would happen if the cert-manager controller pod crashed while its CRDs remained installed? You could still create Certificate custom resources because the API server knows the type, validates the schema, and stores accepted objects. Existing TLS Secrets would continue to serve traffic because they are ordinary Secrets already consumed by Ingresses or workloads. New issuance and renewal would stall until the controller recovered, because no active reconciliation loop would be processing those custom resources.
This failure mode is common enough that it should become an instinct. When the API accepts an object but nothing follows, inspect the controller path. When the API rejects an object, inspect the CRD and schema path. When the child objects exist but drift back after manual edits, inspect which custom resource owns the desired state. Those three questions cover a large share of operator incidents without requiring deep knowledge of the specific product.
    # Install cert-manager (includes CRDs)
    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml

    # Check CRDs created
    kubectl get crd | grep cert-manager
    # certificates.cert-manager.io
    # clusterissuers.cert-manager.io
    # issuers.cert-manager.io
    # ...

The Certificate object below is a useful example because the custom resource is short while the actual workflow is not. A controller may need to create CertificateRequests, solve ACME challenges, update status conditions, handle retries, and write a Secret only after successful issuance. The learner-facing API stays compact because the operator hides the procedural work behind a declarative type.
    # Create a Certificate resource
    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: myapp-tls
      namespace: default
    spec:
      secretName: myapp-tls-secret
      issuerRef:
        name: letsencrypt-prod
        kind: ClusterIssuer
      dnsNames:
        - myapp.example.com

The cert-manager operator watches this Certificate and requests a certificate, completes the ACME challenge, stores the certificate in the named Secret, and renews before expiration. That list is not magic hidden inside the API server. It is controller behavior, so failures appear in the custom resource status, events, related resources, and controller logs rather than in scheduler output.
The user-facing resource is intentionally declarative. It names the Secret that should contain the certificate, references an issuer, and lists DNS names. It does not tell cert-manager how many retries to perform or which temporary resources to create for a challenge. That separation is the value of the operator pattern: the custom resource captures stable intent, while the controller can improve its implementation across releases without changing every application manifest.
The Prometheus Operator follows the same pattern with monitoring resources. A Prometheus custom resource declares a Prometheus instance, while ServiceMonitor and related resources describe scrape targets and rules. The operator watches those custom resources and produces StatefulSets, ConfigMaps, Services, and generated configuration. The result is a Kubernetes-native monitoring API rather than a pile of manually edited configuration files.
Monitoring resources also show why selectors and labels matter in operator APIs. Creating a ServiceMonitor does not guarantee that every Prometheus instance will select it. The Prometheus custom resource may define selectors that include only certain labels or namespaces. When scraping fails, the answer may be in the relationship between custom resources, not in the target application’s Pods. Always read both sides of the declarative contract.
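The selection relationship can be made concrete with a small sketch. The Python function below is a simplification that handles only matchLabels; it deliberately ignores matchExpressions and namespace selectors, and the name `matches_labels` is a hypothetical helper rather than anything shipped by the Prometheus Operator.

```python
def matches_labels(selector: dict, labels: dict) -> bool:
    """True when every matchLabels pair in the selector appears in labels."""
    match_labels = selector.get("matchLabels", {})
    return all(labels.get(key) == value for key, value in match_labels.items())


# Selector as it might appear on a Prometheus custom resource.
service_monitor_selector = {"matchLabels": {"team": "frontend"}}

# Labels on two hypothetical ServiceMonitor objects.
frontend_monitor = {"team": "frontend", "app": "web"}
backend_monitor = {"team": "backend", "app": "api"}

print(matches_labels(service_monitor_selector, frontend_monitor))  # True
print(matches_labels(service_monitor_selector, backend_monitor))   # False
```

When a ServiceMonitor exists but its targets never appear, comparing its labels against the selector on the Prometheus resource in this way is often a quicker first check than reading controller logs.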
    # Check Prometheus CRDs
    kubectl get crd | grep monitoring.coreos.com
    # prometheuses.monitoring.coreos.com
    # servicemonitors.monitoring.coreos.com
    # alertmanagers.monitoring.coreos.com

    # Create a Prometheus instance
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: main
      namespace: monitoring
    spec:
      replicas: 2
      serviceAccountName: prometheus
      serviceMonitorSelector:
        matchLabels:
          team: frontend

Discovery is the skill that keeps you oriented in an unfamiliar cluster. Start from CRDs to learn which custom types exist, then move to custom resources to find instances, then locate the controller that owns those types. Namespaces are often clues: cert-manager commonly runs in cert-manager, monitoring stacks often run in monitoring, and platform teams may install operators in dedicated system namespaces.
In a real cluster, there may be dozens or hundreds of CRDs, so pattern recognition helps. Group names often identify products, such as cert-manager.io or monitoring.coreos.com. Resource names often identify the user-facing concept, such as certificates, issuers, servicemonitors, or prometheusrules. Controller pods often include the product or operator name. These clues are not proof by themselves, but together they form a fast map of which API extension belongs to which controller.
    # List all CRDs in cluster
    kubectl get crd

    # See all custom resources of a type
    kubectl get certificates -A

    # Check if operator is running
    kubectl get pods -A | grep operator
    kubectl get pods -A | grep -E "cert-manager|prometheus"

Schema, Scope, Status, and Lifecycle
Schema validation turns a CRD from loose storage into a real API contract. Without a useful schema, the API server accepts misspelled fields and nonsensical values, leaving the operator to discover bad input later. With a schema, invalid objects are rejected early, field types are documented through discovery, and clients can produce better errors. In Kubernetes 1.35 and later, structural schemas remain the expected baseline for production CRDs.
The example below adds required fields, a cron-like pattern, numeric bounds, and a default. It is still intentionally small, but it shows the mindset: put stable validation at the API boundary, then let the controller focus on behavior. Do not put every business rule into a static schema; rules that depend on external systems, current cluster state, or asynchronous work belong in admission webhooks or reconciliation status.
Good schemas reduce ambiguity for both people and machines. If a field is required, say so. If a number has safe bounds, enforce them. If a string has a small valid set, use an enum. If a field is optional but the controller assumes a default, declare that default in the CRD when appropriate. These choices make errors appear when the manifest is applied rather than minutes later in a controller log that the application team may not be allowed to read.
There is still a balance to keep. A schema that is too loose pushes every mistake into reconciliation, which makes failures slower and less predictable. A schema that is too rigid can block legitimate use cases or make version upgrades painful. For an exam scenario, focus on identifying where the rejection happens. For a real platform, review the schema with both API users and operator maintainers so the contract is useful without becoming brittle.
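To make "rejected at the API boundary" concrete, here is a minimal Python sketch that mimics the kind of checks a CRD schema can express: required fields, a cron-like pattern, numeric bounds, and a default. It reuses the CronTab field names and the pattern from this module, but `validate_crontab_spec` and its error strings are illustrative assumptions; the real validation is performed by the API server, and its messages are worded differently.

```python
import re

# Same cron-like pattern used in this module's CronTab schema.
CRON_PATTERN = re.compile(r"^(\d+|\*)(/\d+)?(\s+(\d+|\*)(/\d+)?){4}$")


def validate_crontab_spec(spec: dict) -> list:
    """Return a list of validation errors; an empty list means the spec passes."""
    errors = []
    for name in ("cronSpec", "image"):
        if name not in spec:
            errors.append(f"spec.{name}: required")
    cron = spec.get("cronSpec")
    if isinstance(cron, str) and not CRON_PATTERN.match(cron):
        errors.append("spec.cronSpec: does not match pattern")
    replicas = spec.get("replicas", 1)  # default applied when the field is absent
    if isinstance(replicas, bool) or not isinstance(replicas, int):
        errors.append("spec.replicas: must be an integer")
    elif not 1 <= replicas <= 10:
        errors.append("spec.replicas: must be between 1 and 10")
    return errors


good = {"cronSpec": "* * * * */5", "image": "my-awesome-cron-image", "replicas": 3}
bad = {"cronSpec": "every five minutes", "replicas": 15}

print(validate_crontab_spec(good))  # []
for problem in validate_crontab_spec(bad):
    print(problem)
```

Notice that every check here depends only on the manifest itself. Rules that need to look at other objects or external systems do not fit this shape, which is why they belong in admission webhooks or reconciliation status instead.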
    schema:
      openAPIV3Schema:
        type: object
        required:
          - spec
        properties:
          spec:
            type: object
            required:
              - cronSpec
              - image
            properties:
              cronSpec:
                type: string
                pattern: '^(\d+|\*)(/\d+)?(\s+(\d+|\*)(/\d+)?){4}$'
              image:
                type: string
              replicas:
                type: integer
                minimum: 1
                maximum: 10
                default: 1

    # This would fail validation
    kubectl apply -f bad-crontab.yaml
    # Error: spec.replicas: Invalid value: 15: must be <= 10

Additional printer columns improve the day-two experience. A custom resource might have dozens of fields, but the most useful fields should appear in kubectl get without forcing every operator to read YAML. This is especially helpful during exams and incidents, where you need to scan status quickly. Choose columns that help answer operational questions, not every field that exists.
Printer columns are a small feature with a large usability effect. A certificate list that shows readiness and expiration is easier to operate than one that only shows names. A backup list that shows schedule, last run, and phase is easier to triage than a YAML-only interface. If users constantly need the same jsonpath command, that is a signal that the CRD should expose a better column.
    versions:
      - name: v1
        additionalPrinterColumns:
          - name: Schedule
            type: string
            jsonPath: .spec.cronSpec
          - name: Replicas
            type: integer
            jsonPath: .spec.replicas
          - name: Age
            type: date
            jsonPath: .metadata.creationTimestamp

    kubectl get crontabs
    # NAME          SCHEDULE      REPLICAS   AGE
    # my-cron-job   * * * * */5   3          5m

Subresources split responsibilities inside the same custom resource. The status subresource lets controllers update observed state without racing with users who edit desired state under spec. The scale subresource lets generic tooling scale a custom resource when the CRD maps the desired and observed replica paths. Before running this in a cluster, predict which user or controller should update spec.replicas and which one should update status.replicas.
The spec-status split is one of the most important Kubernetes habits to preserve in custom APIs. Users write desired state under spec, and controllers write observed state under status. When those responsibilities blur, field ownership becomes confusing and automation starts overwriting user intent. A clean status model lets users ask practical questions: has the controller seen my latest generation, what phase is it in, and what condition explains the current block?
    versions:
      - name: v1
        subresources:
          status: {}  # Enable /status subresource
          scale:      # Enable kubectl scale
            specReplicasPath: .spec.replicas
            statusReplicasPath: .status.replicas

    # Now this works
    kubectl scale crontab my-cron-job --replicas=5

Scope is a design decision about ownership. Namespaced custom resources fit application or team-owned intent, such as a certificate for one namespace, a database for one service, or an application deployment policy. Cluster-scoped resources fit shared infrastructure, such as a cluster issuer, storage profile, global policy, or node-level integration. The wrong scope creates either unnecessary duplication or excessive centralization.
Scope also changes the security model. Namespaced resources can usually be delegated with Roles and RoleBindings, which lets teams manage their own instances without touching other namespaces. Cluster-scoped resources need ClusterRoles and cluster-wide names, so they should represent concepts that truly cross namespace boundaries. If you are unsure, ask where ownership and blast radius belong. The answer often tells you the correct scope before any YAML is written.
    # Namespaced
    scope: Namespaced
    # Resources exist within a namespace
    # kubectl get crontabs -n myapp

    # Cluster-scoped
    scope: Cluster
    # Resources are cluster-wide (like Nodes, PVs)
    # kubectl get clusterissuers (cert-manager example)

| Scope | Use When | Examples |
|---|---|---|
| Namespaced | Resource belongs to a team/app | Certificate, Database, Application |
| Cluster | Resource is shared/global | ClusterIssuer, StorageProfile |
Deleting a CRD is a lifecycle event, not a harmless cleanup command. Kubernetes removes the custom resource instances for that CRD because their type no longer exists. Objects created by an operator, such as Secrets or Deployments, may or may not be deleted depending on owner references, finalizers, and controller behavior. Stop and think: if a CRD is deleted before the operator has removed its finalizers, what evidence would you look for to confirm which resources survived?
Finalizers are another reason custom resources are more than simple YAML files. A finalizer can block deletion until a controller finishes cleanup, such as removing external cloud resources or taking a final backup. That makes deletion safer, but it also creates failure modes: if the controller is gone and the finalizer remains, the resource can stay stuck in a terminating state. The fix should be deliberate because removing a finalizer tells Kubernetes to stop waiting for cleanup.
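The stuck-in-terminating behavior follows from a simple rule: an object marked for deletion is only removed once its finalizer list is empty. The Python sketch below models that rule with plain dictionaries. The function names, the `orders-db` object, and the `example.com/backup` finalizer are all hypothetical, and the timestamp is a placeholder value.

```python
def request_delete(obj: dict) -> None:
    """Mark the object for deletion, as happens on a delete request."""
    obj["metadata"]["deletionTimestamp"] = "2025-01-01T00:00:00Z"  # placeholder


def remove_finalizer(obj: dict, name: str) -> None:
    """What a controller does once its cleanup work has finished."""
    finalizers = obj["metadata"].get("finalizers", [])
    obj["metadata"]["finalizers"] = [f for f in finalizers if f != name]


def lifecycle_phase(obj: dict) -> str:
    """Classify the object as Active, Terminating, or Deleted."""
    meta = obj["metadata"]
    if "deletionTimestamp" not in meta:
        return "Active"
    return "Terminating" if meta.get("finalizers") else "Deleted"


db = {"metadata": {"name": "orders-db", "finalizers": ["example.com/backup"]}}
print(lifecycle_phase(db))  # Active
request_delete(db)
print(lifecycle_phase(db))  # Terminating
remove_finalizer(db, "example.com/backup")
print(lifecycle_phase(db))  # Deleted
```

If the controller that owns the finalizer is no longer running, nothing ever calls the equivalent of `remove_finalizer`, and the object stays in the terminating state until someone intervenes.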
Versioning adds one more lifecycle concern. A CRD can serve multiple versions, but one version is marked for storage. Mature APIs use this to move users gradually from one shape to another, sometimes with conversion webhooks. For CKA work, you mostly need to recognize served versions and storage versions when a manifest fails. In production, you need a migration plan before removing a served version because old manifests and automation may still depend on it.
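When a manifest fails after an upgrade, one quick check is whether its apiVersion is still served by the CRD. The Python sketch below works on a CRD loaded into a dictionary; the `widgets.example.com` group, its version list, and the helper name `check_api_version` are invented for illustration.

```python
def check_api_version(crd: dict, api_version: str) -> str:
    """Describe how a manifest's apiVersion relates to the CRD's version list."""
    group, _, version = api_version.rpartition("/")
    if group != crd["spec"]["group"]:
        return "wrong group"
    for entry in crd["spec"]["versions"]:
        if entry["name"] == version:
            if not entry.get("served", False):
                return "not served"
            return "served (storage)" if entry.get("storage") else "served"
    return "unknown version"


# Hypothetical CRD with one retired, one transitional, and one current version.
widget_crd = {
    "spec": {
        "group": "widgets.example.com",
        "versions": [
            {"name": "v1alpha1", "served": False, "storage": False},
            {"name": "v1beta1", "served": True, "storage": False},
            {"name": "v1", "served": True, "storage": True},
        ],
    }
}

print(check_api_version(widget_crd, "widgets.example.com/v1"))        # served (storage)
print(check_api_version(widget_crd, "widgets.example.com/v1beta1"))   # served
print(check_api_version(widget_crd, "widgets.example.com/v1alpha1"))  # not served
print(check_api_version(widget_crd, "widgets.example.com/v2"))        # unknown version
```

The same information is visible by reading spec.versions in the CRD YAML; the sketch just spells out the decision you make while reading it.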
Treat CRD backups as part of cluster recovery. Backing up etcd may preserve custom resources, but GitOps manifests are often the faster human recovery path because they show which objects should exist and why. Exporting a CRD alone is not enough; you also need the custom resource instances and the operator configuration that gives them behavior. This is especially important before uninstalling an operator, because uninstall steps may remove CRDs and therefore the custom resources they define.
Exam-Relevant Inspection and Troubleshooting
The most reliable CRD troubleshooting sequence starts with discovery. Ask whether the type exists, which group and version it belongs to, whether it is namespaced, and which short names are available. kubectl api-resources answers many of these questions without forcing you to read the full CRD. When a manifest fails with “no matches for kind”, compare its apiVersion and kind to discovery output before chasing controller logs.
After discovery, separate object existence from object readiness. A custom resource can exist, be syntactically valid, and still represent work that has not completed. Read metadata.generation and any status field that records the observed generation if the operator provides one. If the observed generation lags behind, the controller may not have processed the latest spec. If generations match but a condition is false, the operator has processed the request and is telling you why it cannot finish.
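The generation comparison can be sketched as a small function. It assumes the operator publishes `.status.observedGeneration`, which not every operator does, and the name `check_observed_generation` is hypothetical.

```bash
# Hypothetical helper: compare metadata.generation with status.observedGeneration.
# Assumes the operator writes .status.observedGeneration; many do, some do not.
check_observed_generation() {
  local resource="$1" name="$2" namespace="${3:-default}"
  local generation observed
  generation="$(kubectl get "$resource" "$name" -n "$namespace" \
    -o jsonpath='{.metadata.generation}')"
  observed="$(kubectl get "$resource" "$name" -n "$namespace" \
    -o jsonpath='{.status.observedGeneration}')"

  if [ -z "$generation" ]; then
    echo "object not found or unreadable"
  elif [ -z "$observed" ]; then
    echo "no observedGeneration reported; check conditions and controller logs"
  elif [ "$observed" -lt "$generation" ]; then
    echo "controller is behind: observed $observed, latest spec is $generation"
  else
    echo "controller has processed generation $generation; read status conditions next"
  fi
}
```

A "controller is behind" result points at the controller itself (crashed, throttled, or missing permissions), while a processed generation with a false condition points at the request or an external dependency.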
```bash
# List all CRDs
kubectl get crd

# Get details about a CRD
kubectl describe crd certificates.cert-manager.io

# See the full CRD definition
kubectl get crd certificates.cert-manager.io -o yaml
```

Custom resource inspection comes next. A successful kubectl get proves the object exists, but kubectl describe and kubectl get -o yaml show the fields that matter during diagnosis. Look for status.conditions, events, observed generation, finalizers, owner references, and whether the object is in the namespace you expect. Many mature operators use status conditions as their primary support interface.
Events are useful, but they are not a durable database of everything that happened. They can expire, and high-volume clusters can make them noisy. Status conditions tend to be more stable because they live on the object and are updated by the controller. Logs then give the controller’s detailed view, especially when an external system rejects a request. Use all three sources together: events for recent hints, status for current state, and logs for controller reasoning.
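Gathering the three sources together might look like the sketch below. The function name `triage_custom_resource` and its argument order are assumptions for illustration, and it presumes the controller runs as a Deployment whose name and namespace you already know.

```bash
# Hypothetical helper: show events, status conditions, and controller logs
# for one custom resource in a single pass.
# Arguments: resource, object name, object namespace,
#            controller Deployment name, controller namespace.
triage_custom_resource() {
  local resource="$1" name="$2" namespace="$3" controller="$4" controller_ns="$5"

  echo "== recent events =="
  kubectl get events -n "$namespace" \
    --field-selector "involvedObject.name=$name" --sort-by=.lastTimestamp

  echo "== status conditions =="
  kubectl get "$resource" "$name" -n "$namespace" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'

  echo "== controller logs =="
  kubectl logs "deployment/$controller" -n "$controller_ns" --tail=50
}
```

Reading the sections in this order mirrors the guidance above: recent hints first, current state second, and controller reasoning last.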
```bash
# List custom resources
kubectl get <resource-name> -A

# Get specific resource
kubectl get certificate my-cert -o yaml

# Edit custom resource
kubectl edit certificate my-cert

# Delete custom resource
kubectl delete certificate my-cert
```

API discovery also helps you avoid guessing plural names. The kind may be Certificate, but the resource may be certificates; the kind may be PrometheusRule, but the resource name may be prometheusrules. The API group identifies the owning extension project, which helps you find the right documentation, controller Deployment, and RBAC rules when behavior is missing.
Plural names and short names can be deceptively important under time pressure. If a command fails because you guessed the wrong resource name, the error may look like a missing CRD even though the API is healthy. kubectl api-resources is faster and safer than guessing. It shows the resource name, short names, API group, namespaced status, and kind in one place, which is exactly the information you need to form the next command correctly.
```bash
# List all resource types (including custom)
kubectl api-resources

# Filter by group
kubectl api-resources --api-group=cert-manager.io

# Show if namespaced
kubectl api-resources --namespaced=true
```

A common exam trap is stopping after the CRD exists. The CRD only proves the API server can store the object. If behavior is missing, inspect the controller Deployment, its logs, its leader election status if applicable, and the permissions on its ServiceAccount. A controller can fail quietly from the learner’s perspective because the custom resource still appears healthy until you inspect its status or events.
Hypothetical scenario: a team applies a ServiceMonitor, but Prometheus never starts scraping the service. A good diagnostic path is to verify the servicemonitors.monitoring.coreos.com CRD, list the ServiceMonitor in the right namespace, check labels against the Prometheus resource selector, inspect Prometheus Operator logs, and confirm RBAC allows the operator to read the namespace. Jumping straight to pod restarts skips the declarative contract that the operator is actually watching.
The same process works for validation failures. If the API server rejects the object, read the field path in the error and compare it with the CRD schema. If the object is accepted but the operator reports Ready=False, read status conditions and events. If status never changes, check whether the controller has the update permission on the resource’s status subresource. Each symptom points to a different boundary.
RBAC failures deserve special attention because they can look like operator bugs. A controller may have permission to list custom resources but not to create a child Secret, update a status subresource, or watch resources in another namespace. The resulting error usually appears in controller logs, and sometimes in status conditions. When a custom resource stays pending with no obvious schema problem, check the ServiceAccount, Roles, ClusterRoles, and bindings that belong to the operator.
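The permission checks described above can be asked of the API server directly with `kubectl auth can-i` and impersonation. The sketch below assumes your own user is allowed to impersonate ServiceAccounts; `check_operator_rbac` is a hypothetical name, and the three checks are examples rather than a complete list.

```bash
# Hypothetical helper: ask whether an operator's ServiceAccount may perform
# a few typical controller actions. Prints one yes/no answer per check.
# Arguments: ServiceAccount namespace, ServiceAccount name,
#            custom resource name (plural), namespace to check in.
check_operator_rbac() {
  local sa_namespace="$1" sa_name="$2" resource="$3" target_ns="$4"
  local identity="system:serviceaccount:${sa_namespace}:${sa_name}"

  printf '%s' "list $resource: "
  kubectl auth can-i list "$resource" --as="$identity" -n "$target_ns"

  printf '%s' "update $resource/status: "
  kubectl auth can-i update "$resource" --subresource=status \
    --as="$identity" -n "$target_ns"

  printf '%s' "create secrets: "
  kubectl auth can-i create secrets --as="$identity" -n "$target_ns"
}
```

A `no` on the status line while the list line says `yes` matches the symptom where objects are reconciled but their status never changes.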
Admission webhooks can add another layer. Some operators install validating or mutating webhooks for their custom resources, so an apply can fail even when the static CRD schema looks permissive. The error message usually names the webhook, which tells you the request reached admission but was rejected by extension logic. That is different from a missing CRD and different from a reconciliation failure after storage. Place the symptom on the request path before choosing a fix.
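Placing a symptom on the request path can be practiced with a small classifier over error text. This is a sketch only: the function name `classify_apply_error` is hypothetical, and the substrings it matches are typical of kubectl and API server messages but may differ between versions.

```bash
# Hypothetical helper: map an apply error message to the stage of the request
# path that most likely produced it. Substrings are typical, not guaranteed.
classify_apply_error() {
  local message="$1"
  case "$message" in
    *"no matches for kind"*) echo "discovery: CRD missing or wrong apiVersion" ;;
    *"admission webhook"*)   echo "admission: rejected by a webhook" ;;
    *"is forbidden"*)        echo "authorization: RBAC denied the request" ;;
    *"is invalid"*)          echo "schema: rejected by CRD validation" ;;
    *)                       echo "unknown: read the full message" ;;
  esac
}
```

Each result suggests a different next step: install or fix the CRD, inspect the named webhook, review RBAC bindings, or compare the manifest with the CRD schema.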
Patterns & Anti-Patterns
Good CRD design starts with a stable user-facing contract and hides unstable implementation details behind the controller. A learner should be able to read the custom resource and understand intent without knowing every generated child object. When the CRD mirrors every field in a Deployment or StatefulSet, it stops simplifying operations and becomes a second, less familiar API for the same problem.
Use these patterns when they make the cluster easier to operate, not merely because custom resources are available. CRDs are best when they encode reusable operational intent: certificate issuance, backup policy, database lifecycle, monitoring target discovery, or a platform-specific application abstraction. The operator then owns the procedural work that humans would otherwise repeat and forget during stressful moments.
A strong pattern is to design from the runbook backward. If an on-call engineer needs to answer whether a backup is scheduled, last completed, currently failing, or blocked by credentials, those answers should appear in fields, status conditions, printer columns, or related events. The CRD is not just a developer input format; it is also an operations surface. That perspective produces APIs that are easier to debug after the original author has moved on.
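As an illustration of that operations surface, an on-call query for a hypothetical `Backup` resource might pull schedule and readiness into one table. The resource name `backups`, the `.spec.schedule` field, and the `Ready` condition are all assumptions for this sketch, not part of any specific operator.

```bash
# Hypothetical helper: one-line-per-object overview of Backup resources.
# Assumes a "backups" custom resource with .spec.schedule and a Ready condition.
backup_overview() {
  kubectl get backups -A -o \
    custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,SCHEDULE:.spec.schedule,READY:.status.conditions[?(@.type=="Ready")].status'
}
```

If the same columns are declared as printer columns on the CRD, a plain `kubectl get backups` gives the same triage view without a custom query.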
| Pattern | When to Use It | Why It Works | Scaling Consideration |
|---|---|---|---|
| Intent-focused CRD | Teams need a simple resource such as Certificate, Database, or BackupPolicy | The CRD describes the desired outcome while the operator handles details | Keep the spec small and version it carefully as usage grows |
| Status conditions | Users need to diagnose asynchronous work | Conditions expose observed state without changing desired state | Standardize condition names so dashboards and runbooks stay consistent |
| Namespaced ownership | Application teams own instances independently | RBAC and quotas can follow namespace boundaries | Provide cluster-scoped templates or issuers for shared defaults |
| Printer columns | Operators need fast kubectl get triage | Important fields appear in lists without reading full YAML | Avoid exposing volatile internals that change every reconcile loop |
Anti-patterns usually come from treating CRDs as either too magical or too trivial. They are not magic because a controller must still reconcile behavior. They are not trivial because once a CRD is widely used, changing schema, versions, and deletion behavior becomes API maintenance. A careless CRD can create long-lived operational debt because every GitOps repo, script, and user’s muscle memory depends on it.
Another anti-pattern is hiding all errors in controller logs. Logs are necessary for maintainers, but API users should not need cluster-admin access to learn why their object is not ready. Status conditions, events, and clear validation messages move the failure closer to the person who can fix the manifest. This is why mature operators invest heavily in status design: it converts asynchronous work into a supportable API conversation.
| Anti-Pattern | What Goes Wrong | Better Alternative |
|---|---|---|
| CRD without controller for behavior | Users create valid objects that never affect workloads | Either install the operator or document the CRD as data-only |
| Overloaded spec fields | The CRD becomes a copy of every underlying Kubernetes object | Expose intent and let the controller choose implementation details |
| No status subresource | Controllers and users race over the same object fields | Enable status and put observed state under .status |
| Unsafe CRD deletion | All custom resources of that type disappear unexpectedly | Back up manifests, delete instances intentionally, and understand finalizers |
Decision Framework
Choosing a CRD is an API design decision. Use one when the resource represents a durable concept in your platform, users need Kubernetes-native workflows, and a controller can reliably turn desired state into observed state. Avoid one when a simpler built-in resource, Helm chart value, ConfigMap, or documentation convention solves the problem without introducing a new API surface and lifecycle.
The question “could we build this as a CRD?” is less useful than “who benefits from this becoming an API?” If one team applies one manifest twice a year, a CRD may be ceremony. If many teams repeat a risky sequence every week, a CRD plus operator can remove toil and reduce mistakes. The stronger the operational knowledge behind the workflow, the stronger the case for an operator rather than a static template.
```text
Need a new Kubernetes-facing abstraction?
        |
        v
Is the desired state durable and reusable across teams?
        |-- no --> Prefer built-in resources, Helm values, or a ConfigMap.
        | yes
        v
Does something need to reconcile or validate asynchronous work?
        |-- no --> A CRD may be data-only, but document that clearly.
        | yes
        v
Can you operate the controller, RBAC, upgrades, and status model?
        |-- no --> Use an existing operator or narrow the problem first.
        | yes
        v
Design a CRD with schema, scope, status, printer columns, and versioning.
```

The decision should also include who will be on call. A custom API can make application teams faster, but the platform team now owns API compatibility, controller health, upgrade paths, and documentation. If the abstraction saves ten teams from repeating risky manual procedures, that tradeoff is often worth it. If it hides two YAML fields behind a new controller, the operational cost usually outweighs the benefit.
Use the following final test before introducing a new custom API: can you explain the desired state in one or two sentences, can you validate the safest parts at admission time, can you expose progress through status, and can you recover if the controller is unavailable? If the answer is yes, you likely have a real Kubernetes extension. If the answer is no, narrow the API or choose a simpler mechanism until ownership is clearer.
| Choice | Use It When | Avoid It When |
|---|---|---|
| Built-in resource | Kubernetes already models the thing directly | Users need a domain concept that spans several resources |
| Helm values | Installation-time templating is enough | The desired state must be watched and reconciled after install |
| ConfigMap | Consumers only need static configuration | You need validation, status, RBAC by resource type, or watches |
| CRD plus operator | The platform needs a durable API and active reconciliation | No team can own controller operations and version compatibility |
Did You Know?
- Kubernetes added CRDs as the stable replacement for the older ThirdPartyResource extension mechanism, and `apiextensions.k8s.io/v1` has been the normal production API for years.
- A CRD can serve several versions while storing one version, which lets API authors migrate clients gradually instead of breaking every manifest at once.
- Operator SDK, Kubebuilder, and controller-runtime all build on the same Kubernetes watch and reconcile ideas, even though they package scaffolding differently.
- Finalizers are ordinary strings in metadata, but they can block deletion until a controller removes them after cleanup.
Common Mistakes
| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Creating a custom resource before the CRD | The API server has no registered kind or discovery entry yet | Install the CRD first, then reapply the custom resource after discovery updates |
| Using the wrong API group or version | Manifests are copied from another operator release or old documentation | Check kubectl api-resources and kubectl get crd <name> -o yaml for the served versions |
| Expecting a CRD to create workloads by itself | The CRD stores desired state, but no reconciliation logic is running | Install and verify the operator controller, then inspect custom resource status |
| Deleting a CRD before deleting instances | The type and its custom resources are removed together | Export or back up instances, delete them intentionally, and understand finalizers before CRD removal |
| Assuming every custom resource is namespaced | Some platform resources are cluster-scoped and ignore -n | Check kubectl api-resources --namespaced=false or read spec.scope on the CRD |
| Ignoring status conditions | Operators often report useful failures on the custom resource, not only in logs | Use kubectl describe and inspect .status.conditions, events, and observed generation |
| Granting the operator broad cluster-admin permissions | It is faster during installation but risky during incidents and audits | Use the documented RBAC, then narrow permissions to the resources and subresources it reconciles |
| Changing CRD schemas casually | Existing manifests, GitOps pipelines, and stored objects depend on the API contract | Version the API, provide conversion or migration guidance, and test upgrades before rollout |
1. Your team applies a `Database` custom resource and it is accepted, but no StatefulSet appears. What do you check first, and why?
Start by proving which layer is working. The accepted custom resource means the CRD exists and the API server can store the object, so the next checks are the operator Deployment, its logs, its ServiceAccount permissions, and the custom resource status or events. A CRD alone does not create a StatefulSet, so repeatedly editing the custom resource is unlikely to help until a controller is watching and reconciling it. If status is empty, also check whether the controller has permission to update the status subresource.
2. A manifest fails with `no matches for kind "Certificate" in version "cert-manager.io/v1"`. How would you debug the API side before investigating cert-manager logs?
First run kubectl get crd | grep cert-manager and kubectl api-resources --api-group=cert-manager.io to confirm the CRD is installed and serving the expected version. If the CRD is missing, install or repair cert-manager before applying the resource. If the group exists but the version differs, update the manifest to a served version or upgrade the CRD according to the vendor documentation. Controller logs are secondary here because the API server rejected the object before any controller could reconcile it.
3. A platform engineer proposes a cluster-wide `BackupPolicy` CRD but sets `scope: Namespaced`. What operational problems should you raise?
Namespaced scope would force every namespace to duplicate what is meant to be a shared policy, increasing drift and making audit results harder to interpret. It also changes RBAC and command behavior because users would need namespace-specific permissions and kubectl commands. If the policy truly governs the whole cluster, scope: Cluster is the clearer API, with ClusterRoles and globally unique names. If teams need local overrides, use a separate namespaced resource or fields that reference a cluster-scoped policy.
4. You delete `certificates.cert-manager.io` while many `Certificate` resources exist. What survives, what is lost, and what recovery path is realistic?
Deleting the CRD removes the custom resources of that type because the API type is gone. Secrets already written by cert-manager may continue to exist and keep serving existing TLS traffic, but the desired certificate objects and renewal intent are gone. Recovery usually means reinstalling the CRD and recreating the Certificate manifests from Git or backup. This is why CRD deletion should be treated as API removal, not ordinary cleanup.
5. A `CronTab` custom resource is rejected because `spec.replicas` is `15`, but the team insists the operator could handle that value. Where should the fix be made?
The rejection happened at the API validation boundary, so the manifest or CRD schema must change before the operator is involved. If replicas above 10 are truly valid, update the CRD schema through a planned API change and test stored objects and clients. If the limit is intentional, the team should correct the custom resource to fit the contract. Operators should not be expected to reconcile objects the API server refuses to store.
6. Prometheus is not scraping a service after a `ServiceMonitor` is created. Which CRD and operator-specific clues help you diagnose the problem?
Confirm that the servicemonitors.monitoring.coreos.com CRD exists and the ServiceMonitor is in the namespace expected by the Prometheus resource. Then compare labels and selectors between the ServiceMonitor and the Prometheus custom resource because the operator may only select monitors with matching labels. Inspect the Prometheus Operator logs and the status of related resources for reconciliation errors. This path checks the declarative link before assuming the workload pods are the problem.
7. An operator-managed Deployment is manually scaled from two replicas to five, but later returns to two. What does this tell you about reconciliation?
It shows that the operator treats the custom resource as the source of truth and repairs drift in child resources. Manual changes to generated objects may appear to work briefly, but the next reconciliation loop compares actual state with desired state and updates the child object back to the declared value. The durable fix is to edit the custom resource field that controls replicas, if the CRD exposes one. This is expected behavior, not the Deployment controller fighting the user by itself.
Hands-On Exercise
In this exercise you will create a small CRD, create custom resources, inspect discovery, add validation concepts, and run a simple educational operator loop. The shell examples are designed for a disposable Kubernetes cluster. They create local files in your current directory and remove them during cleanup, so run them from a scratch workspace rather than inside an application repository.
Read each task as a miniature troubleshooting lab, not just a typing exercise. After each apply, ask which component accepted the request and which component, if any, is expected to act on it. When an object does not appear, use discovery. When an object is rejected, use schema errors. When an object exists but nothing else changes, use the controller path. That repetition is the point of the lab.
Task 1: Create a Website CRD
Create a Website resource that records a URL and a replica count. This first CRD does not run a website by itself; it gives the API server a new type and lets you practice discovery and validation. Notice that the schema requires spec.url, while spec.replicas has a default.
```bash
cat > website-crd.yaml << 'EOF'
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: websites.stable.example.com
spec:
  group: stable.example.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required:
                - url
              properties:
                url:
                  type: string
                replicas:
                  type: integer
                  default: 1
      additionalPrinterColumns:
        - name: URL
          type: string
          jsonPath: .spec.url
        - name: Replicas
          type: integer
          jsonPath: .spec.replicas
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp
  scope: Namespaced
  names:
    plural: websites
    singular: website
    kind: Website
    shortNames:
      - ws
EOF

kubectl apply -f website-crd.yaml
kubectl get crd websites.stable.example.com
kubectl api-resources | grep website
```

Solution notes
The CRD should appear as websites.stable.example.com, and kubectl api-resources should show the websites resource with the short name ws. If discovery does not show it, inspect the CRD installation error before creating instances. The API server must know the type before it can accept a Website object.
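Because discovery updates shortly after the CRD is accepted, scripts that apply a CRD and an instance back to back can wait for the CRD's `Established` condition first. A minimal sketch, with the hypothetical helper name `apply_after_crd_ready` and an arbitrary 30-second timeout:

```bash
# Hypothetical helper: wait until the API server reports the CRD as
# established, then apply a manifest that uses the new type.
# Arguments: CRD name, path to the custom resource manifest.
apply_after_crd_ready() {
  local crd="$1" manifest="$2"
  # Stop early if the CRD never becomes established within the timeout.
  kubectl wait --for condition=established "crd/$crd" --timeout=30s || return 1
  kubectl apply -f "$manifest"
}
```

For this task the call would be `apply_after_crd_ready websites.stable.example.com my-website.yaml`, once `my-website.yaml` exists.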
Task 2: Create and Inspect Custom Resources
Create one Website instance, then use the plural, singular, and short-name forms to inspect it. This task is intentionally about API behavior, not website hosting. The object records desired state and becomes visible to standard Kubernetes tooling.
```bash
cat > my-website.yaml << 'EOF'
apiVersion: stable.example.com/v1
kind: Website
metadata:
  name: company-site
  namespace: default
spec:
  url: https://example.com
  replicas: 3
EOF

kubectl apply -f my-website.yaml
```

```bash
# List websites
kubectl get websites
kubectl get ws  # Short name

# Describe
kubectl describe website company-site

# Get as YAML
kubectl get website company-site -o yaml

# Edit
kubectl edit website company-site
```

```bash
cat > blog.yaml << 'EOF'
apiVersion: stable.example.com/v1
kind: Website
metadata:
  name: blog
spec:
  url: https://blog.example.com
  replicas: 2
EOF

kubectl apply -f blog.yaml
kubectl get ws
```

Solution notes
Both company-site and blog should appear with the custom printer columns if your cluster has accepted the CRD exactly as written. If the output does not show the columns, check that additionalPrinterColumns is nested under the served version. If kubectl get ws fails, verify that shortNames was accepted in the CRD.
Task 3: Explore Installed Operators and CRDs
Use discovery commands to identify any operator-managed APIs already present in your cluster. A small practice cluster may not have cert-manager or Prometheus installed, so a missing result is not automatically a failure. The point is to practice moving from CRD names to resource instances and then to likely controller pods.
```bash
# Check for cert-manager
kubectl get crd | grep cert-manager

# Check for prometheus operator
kubectl get crd | grep monitoring.coreos.com

# List all CRDs
kubectl get crd

# Get details on a specific CRD
kubectl get crd <crd-name> -o yaml | head -50

# List instances of a CRD
kubectl get <resource-name> -A

# Describe a CRD
kubectl describe crd <crd-name>
```

Solution notes
For an installed operator, you should be able to connect at least three clues: the CRD group, one or more custom resources, and a controller pod or Deployment. For example, cert-manager resources use the cert-manager.io group, while Prometheus Operator resources use monitoring.coreos.com. If no common operators are installed, use your Website CRD for the same discovery process.
Task 4: Build Validation and Status Examples
Start with a minimal App CRD to practice the create, query, and delete cycle. Then create a Database CRD that validates its fields, and apply an invalid resource and a valid one so you can see the difference between API validation failure and successful storage. After that, create a tiny Task CRD with a status subresource so you can inspect how status is modeled even when no controller updates it.
```bash
# Create CRD
cat << 'EOF' | kubectl apply -f -
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: apps.example.com
spec:
  group: example.com
  names:
    kind: App
    listKind: AppList
    plural: apps
    singular: app
    shortNames:
      - ap
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                image:
                  type: string
                replicas:
                  type: integer
EOF

# Verify CRD exists
kubectl get crd apps.example.com

# Create an instance
cat << 'EOF' | kubectl apply -f -
apiVersion: example.com/v1
kind: App
metadata:
  name: my-app
spec:
  image: nginx:1.25
  replicas: 3
EOF

# Query using short name
kubectl get ap

# Cleanup
kubectl delete app my-app
kubectl delete crd apps.example.com
```

```bash
cat << 'EOF' | kubectl apply -f -
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.stable.example.com
spec:
  group: stable.example.com
  names:
    kind: Database
    plural: databases
    singular: database
    shortNames:
      - db
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          required:
            - spec
          properties:
            spec:
              type: object
              required:
                - engine
                - version
              properties:
                engine:
                  type: string
                  enum:
                    - postgres
                    - mysql
                    - mongodb
                version:
                  type: string
                storage:
                  type: string
                  default: "10Gi"
EOF

# Try to create invalid resource (should fail)
cat << 'EOF' | kubectl apply -f -
apiVersion: stable.example.com/v1
kind: Database
metadata:
  name: invalid-db
spec:
  engine: oracle  # Not in enum!
  version: "14"
EOF

# Create valid resource
cat << 'EOF' | kubectl apply -f -
apiVersion: stable.example.com/v1
kind: Database
metadata:
  name: prod-db
spec:
  engine: postgres
  version: "14"
EOF

# Cleanup
kubectl delete database prod-db
kubectl delete crd databases.stable.example.com
```

```bash
cat << 'EOF' | kubectl apply -f -
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: tasks.work.example.com
spec:
  group: work.example.com
  names:
    kind: Task
    plural: tasks
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      subresources:
        status: {}
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                command:
                  type: string
            status:
              type: object
              properties:
                phase:
                  type: string
                completedAt:
                  type: string
EOF

# Create task
cat << 'EOF' | kubectl apply -f -
apiVersion: work.example.com/v1
kind: Task
metadata:
  name: build-job
spec:
  command: "make build"
EOF

# View the task
kubectl get task build-job -o yaml

# Cleanup
kubectl delete task build-job
kubectl delete crd tasks.work.example.com
```

Solution notes
The invalid Database should be rejected before storage because oracle is not in the enum. The valid object should be stored and listed. The Task object should show a status field only if something writes it, but the CRD now exposes a status subresource that a controller could update separately from user-managed spec.
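To see the status subresource in action without a real controller, you can write `.status.phase` the way a controller would. This sketch assumes a kubectl release that supports `--subresource` on `patch`, and it must run while `build-job` and the Task CRD still exist, that is, before the cleanup commands in this task. The helper name `set_task_phase` is hypothetical.

```bash
# Hypothetical helper: update only .status.phase of a Task through the
# status subresource, leaving spec untouched.
# Assumes kubectl supports --subresource for patch.
set_task_phase() {
  local name="$1" phase="$2"
  kubectl patch task "$name" --subresource=status --type=merge \
    -p "{\"status\":{\"phase\":\"$phase\"}}"
}
```

After `set_task_phase build-job Succeeded`, `kubectl get task build-job -o yaml` should show a `status` block, which illustrates that status is written through a separate endpoint from user-managed spec.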
Task 5: Diagnose a Missing CRD and Design a Backup CRD
Apply a resource for a type that does not exist, observe the failure, and then design a Backup CRD with required fields. This mirrors the most common first failure when installing operator-managed applications from incomplete manifests: the custom resource is applied before the CRD.
```bash
# Try to create a resource for non-existent CRD
cat << 'EOF' | kubectl apply -f -
apiVersion: nonexistent.example.com/v1
kind: Widget
metadata:
  name: test
spec:
  size: large
EOF

# Error: no matches for kind "Widget"

# Diagnose
kubectl get crd | grep widget            # Nothing
kubectl api-resources | grep -i widget   # Nothing

# Solution: CRD must be created before resources
# Create the CRD first, then the resource
```

Design and implement a CRD for a Backup resource with group backup.example.com, required fields source, destination, and schedule, an optional integer retention defaulting to seven days, and schedule validation as a string field. Then create a sample Backup resource and query it through its short name.

```bash
# YOUR TASK: Create the CRD and a sample Backup resource
```

Solution

```bash
cat << 'EOF' | kubectl apply -f -
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backups.backup.example.com
spec:
  group: backup.example.com
  names:
    kind: Backup
    plural: backups
    shortNames:
      - bk
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          required:
            - spec
          properties:
            spec:
              type: object
              required:
                - source
                - destination
                - schedule
              properties:
                source:
                  type: string
                destination:
                  type: string
                schedule:
                  type: string
                retention:
                  type: integer
                  default: 7
EOF

cat << 'EOF' | kubectl apply -f -
apiVersion: backup.example.com/v1
kind: Backup
metadata:
  name: daily-db-backup
spec:
  source: /data/postgres
  destination: s3://backups/postgres
  schedule: "0 2 * * *"
  retention: 14
EOF

kubectl get bk
kubectl delete backup daily-db-backup
kubectl delete crd backups.backup.example.com
```

Task 6: Run a Basic Educational Operator
This final task uses a Bash loop as an educational controller for the Website CRD. It is not a production operator, but it makes reconciliation visible: the loop lists custom resources, reads their desired state, and applies a Deployment named after each website. After you delete the managed Deployment, the loop recreates it from the custom resource.
```bash
# 1. First, create the Website CRD
cat << 'EOF' | kubectl apply -f -
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: websites.stable.example.com
spec:
  group: stable.example.com
  scope: Namespaced
  names:
    plural: websites
    singular: website
    kind: Website
    shortNames: [ws]
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                image: {type: string, default: "nginx:alpine"}
                replicas: {type: integer, default: 1}
EOF

# 2. Create the Operator Script
# This represents the controller's reconciliation loop
cat << 'EOF' > website-operator.sh
#!/bin/bash
echo "Starting Website Operator..."
while true; do
  # Find all Website custom resources in the current namespace
  for ws in $(kubectl get websites -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
    image=$(kubectl get website "$ws" -o jsonpath='{.spec.image}')
    replicas=$(kubectl get website "$ws" -o jsonpath='{.spec.replicas}')

    # Reconcile: Ensure a Deployment exists with the desired state
    kubectl create deployment "${ws}-site" --image="$image" --replicas="$replicas" \
      --dry-run=client -o yaml | kubectl apply -f - >/dev/null 2>&1
    echo "Reconciled Website: $ws -> Image: $image, Replicas: $replicas"
  done
  sleep 5
done
EOF
chmod +x website-operator.sh

# 3. Run the operator in the background
./website-operator.sh &
OPERATOR_PID=$!

# 4. Create a Custom Resource
cat << 'EOF' | kubectl apply -f -
apiVersion: stable.example.com/v1
kind: Website
metadata:
  name: my-portfolio
spec:
  image: "nginx:alpine"
  replicas: 2
EOF

# 5. Observe the reconciliation (Wait a few seconds for the loop)
sleep 6
kubectl get deployments
kubectl get pods

# 6. Test the Reconciliation Loop
# The operator should fight back if we delete the managed deployment
echo "Deleting the managed deployment to simulate a failure..."
kubectl delete deployment my-portfolio-site

# 7. Check again in 5-10 seconds
sleep 6
kubectl get deployments
# The operator recreated it! This is the reconciliation loop in action.

# 8. Clean up
kill "$OPERATOR_PID"
kubectl delete website my-portfolio
kubectl delete deployment my-portfolio-site
kubectl delete crd websites.stable.example.com
rm website-operator.sh
```

Solution notes
When the script is running, the Deployment should appear after the Website custom resource exists. If you delete the Deployment while the script continues running, the next loop should recreate it. If it does not, inspect whether kubectl get websites returns the custom resource, whether the script has permission to create Deployments, and whether the Deployment name matches the script’s naming convention.
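The educational loop reapplies the Deployment on every pass. A controller more typically compares desired and actual state and acts only on drift. The sketch below shows that comparison for replicas only; `reconcile_website` is a hypothetical name, and it relies on the same `<name>-site` naming convention as the script above.

```bash
# Hypothetical helper: one reconciliation pass for a single Website.
# Compares desired replicas with the Deployment's replicas and acts on drift.
reconcile_website() {
  local ws="$1" deploy="${1}-site"
  local desired actual
  desired="$(kubectl get website "$ws" -o jsonpath='{.spec.replicas}')"
  actual="$(kubectl get deployment "$deploy" -o jsonpath='{.spec.replicas}' 2>/dev/null)"

  if [ -z "$actual" ]; then
    # Creation is left to the main loop in this sketch.
    echo "missing: $deploy must be created"
  elif [ "$actual" != "$desired" ]; then
    kubectl scale deployment "$deploy" --replicas="$desired" >/dev/null
    echo "drift: scaled $deploy from $actual to $desired"
  else
    echo "in sync: $deploy has $desired replicas"
  fi
}
```

Manually scaling `my-portfolio-site` to five replicas and then calling `reconcile_website my-portfolio` would report drift and return the Deployment to the value declared in the custom resource.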
Success Criteria
- Design a CRD with a valid group, names, scope, version, schema, and printer columns.
- Create custom resources and query them with plural, singular, and short-name forms.
- Debug CRD discovery and validation failures using `kubectl get crd`, `kubectl api-resources`, and apply errors.
- Implement an operator-style reconciliation loop that recreates a Deployment from a custom resource.
- Evaluate whether a custom resource should be namespaced or cluster-scoped.
- Diagnose operator reconciliation failures by checking custom resource status, events, controller pods, logs, and RBAC.
Cleanup
Run this cleanup if you completed the earlier tasks and want to remove the practice resources. Some objects may already be gone if you followed every embedded cleanup command, so --ignore-not-found keeps the final sweep harmless. If the Website CRD itself has already been deleted, the first command may instead report an unknown resource type, which is safe to ignore.
```bash
kubectl delete website company-site blog my-portfolio --ignore-not-found
kubectl delete deployment my-portfolio-site --ignore-not-found
kubectl delete crd websites.stable.example.com --ignore-not-found
kubectl delete crd apps.example.com --ignore-not-found
kubectl delete crd databases.stable.example.com --ignore-not-found
kubectl delete crd tasks.work.example.com --ignore-not-found
kubectl delete crd backups.backup.example.com --ignore-not-found
rm -f website-crd.yaml my-website.yaml blog.yaml website-operator.sh
```

Sources
- Kubernetes documentation: Custom Resources
- Kubernetes documentation: Extend the Kubernetes API with CustomResourceDefinitions
- Kubernetes documentation: API concepts
- Kubernetes documentation: Controllers
- Kubernetes documentation: Finalizers
- Kubernetes documentation: API access control
- Kubebuilder book: CronJob tutorial
- Kubebuilder book: Status subresource
- Operator SDK documentation
- controller-runtime documentation
- cert-manager documentation
- Prometheus Operator documentation
Next Module
Module 1.6: RBAC - Role-Based Access Control for securing your cluster.