Database Operators
Цей контент ще не доступний вашою мовою.
Database Operators
Section titled “Database Operators”Operating stateful databases on bare-metal Kubernetes requires replacing cloud-provider managed services (like RDS or ElastiCache) with in-cluster orchestration. Standard StatefulSet primitives handle identity and stable storage, but lack application-specific knowledge required for safe failover, replication scaling, Point-in-Time Recovery (PITR), and zero-downtime upgrades.
Database operators encode DBA operational routines into custom controllers.
Learning Outcomes
Section titled “Learning Outcomes”- Evaluate and deploy appropriate database operators for PostgreSQL, MySQL, and caching layers on bare metal.
- Configure CloudNativePG for high availability, streaming replication, and Point-In-Time Recovery (PITR) against S3-compatible local object storage.
- Implement database connection pooling using pgBouncer to prevent connection starvation under high microservice pod churn.
- Design MySQL sharding topologies using Vitess for horizontally scalable relational data.
- Diagnose split-brain scenarios, fencing failures, and WAL (Write-Ahead Log) exhaustion in clustered database deployments.
The Bare-Metal Database Architecture
Section titled “The Bare-Metal Database Architecture”On managed cloud providers, database operators often rely on underlying infrastructure APIs (e.g., EBS snapshots) for backups. On bare metal, you must provide the complete stack:
- Fast Local Storage: Databases require low latency. You must use Local Persistent Volumes (via TopoLVM, OpenEBS LocalPV) or highly optimized network block storage (Ceph RBD).
- Object Storage: For continuous backup and PITR, operators stream Write-Ahead Logs (WALs) and base backups to an S3-compatible endpoint. On bare metal, this is typically an internal MinIO cluster or Ceph RadosGW.
- Fencing Mechanisms: To prevent split-brain scenarios during network partitions, operators must reliably “fence” (isolate or kill) the old primary before promoting a replica.
graph TD subgraph K8s Cluster O[Operator Controller] P[Primary DB Pod] R1[Replica DB Pod 1] R2[Replica DB Pod 2]
O -->|Monitors & Manages| P O -->|Monitors & Manages| R1 O -->|Monitors & Manages| R2
P -.->|Streaming Replication| R1 P -.->|Streaming Replication| R2 end
subgraph Bare Metal Infrastructure S3[MinIO / Ceph RGW S3] Local[Local NVMe PVs]
P -->|Archives WALs| S3 P --- Local R1 --- Local R2 --- Local endPostgreSQL: CloudNativePG (CNPG)
Section titled “PostgreSQL: CloudNativePG (CNPG)”CloudNativePG is the current standard for running PostgreSQL on Kubernetes. Unlike older operators that wrap external high-availability tools (like Patroni or Stolon), CNPG interacts directly with the Kubernetes API server for leader election and state management.
Architecture and Replication
Section titled “Architecture and Replication”CNPG deploys instances as a single Cluster Custom Resource. It uses Kubernetes endpoints and leases for leader election.
- Primary: Receives read/write traffic.
- Replicas: Maintain state via asynchronous or synchronous streaming replication.
- Services: The operator automatically generates three services:
<cluster>-rw: Routes exclusively to the Primary.<cluster>-ro: Routes to Replicas for read scaling.<cluster>-r: Routes to any available node (rarely used).
Connection Pooling (pgBouncer)
Section titled “Connection Pooling (pgBouncer)”PostgreSQL spawns a separate OS process for every connection. In Kubernetes, hundreds of microservice pods continuously connecting and disconnecting will exhaust PostgreSQL’s connection limits and crash the node due to OOM (Out of Memory).
CNPG integrates pgBouncer natively via the Pooler CRD. It maintains a persistent connection pool to the database while multiplexing incoming client connections.
Storage and WAL Archiving
Section titled “Storage and WAL Archiving”A production CNPG cluster must have an S3-compatible backup destination configured. Without it, PostgreSQL will retain WAL files locally until they are successfully archived. If the archiver fails (e.g., MinIO is down), the local persistent volume will fill up with WAL files, causing the primary database to crash and refuse to restart.
MySQL at Scale: Vitess
Section titled “MySQL at Scale: Vitess”When a single MySQL primary cannot handle the write throughput or dataset size, vertical scaling on bare metal eventually hits a hardware ceiling. Vitess is a database clustering system for horizontal scaling of MySQL, originally built at YouTube.
Vitess Topology
Section titled “Vitess Topology”Vitess hides the complexity of database sharding from the application. The application connects to Vitess as if it were a standard single-node MySQL database.
graph TD App[Application] -->|MySQL Protocol| VTGate[VTGate Router]
subgraph Vitess Cluster VTGate -->|gRPC| VTTablet1[VTTablet] VTGate -->|gRPC| VTTablet2[VTTablet] VTGate -->|gRPC| VTTablet3[VTTablet]
Topo[Topology Server / etcd] -.-> VTGate Topo -.-> VTTablet1 end
subgraph Shard 0 VTTablet1 --> MySQL_M[MySQL Primary] VTTablet2 --> MySQL_R1[MySQL Replica] end
subgraph Shard 1 VTTablet3 --> MySQL_S1_M[MySQL Primary] end- VTGate: A stateless proxy that parses SQL queries, reads the VSchema (Vitess Schema), and routes the query to the correct underlying shard.
- VTTablet: A sidecar process deployed alongside every
mysqldprocess. It intercepts queries, implements connection pooling, and protects MySQL from bad queries (e.g., queries returning too many rows). - Topology Server: A highly available datastore (typically etcd) that stores cluster metadata, routing rules, and shard configurations.
In-Memory Datastores: Redis, Valkey, and Memcached
Section titled “In-Memory Datastores: Redis, Valkey, and Memcached”In-memory data grids are essential for caching, session management, and rate limiting.
The Shift to Valkey
Section titled “The Shift to Valkey”Following Redis Ltd’s transition to the Server Side Public License (SSPL), the Linux Foundation forked the project into Valkey. For new bare-metal deployments, Valkey operators (or community Redis operators migrating to Valkey) are the standard to avoid restrictive licensing issues in enterprise environments.
Deployment Topologies
Section titled “Deployment Topologies”| Datastore | Architecture | K8s Operator / Pattern | Best For |
|---|---|---|---|
| Memcached | Shared-nothing, consistent hashing handled by client. | Deployment + Headless Service. Operator rarely needed. | Simple, transient key/value caching. High throughput, low complexity. |
| Valkey / Redis (Sentinel) | Primary-Replica with Sentinel nodes for election. | KubeDB, Spotahome Redis Operator. | General caching, pub/sub, single-threaded high performance. |
| Valkey / Redis (Cluster) | Sharded architecture. Multiple Primaries. | KubeDB, OT-Container-Kit Redis Operator. | Datasets exceeding the memory capacity of a single bare-metal node. |
Hands-on Lab
Section titled “Hands-on Lab”This lab deploys a highly available PostgreSQL cluster using CloudNativePG, configures an internal MinIO instance as the S3 backup target for WAL archiving, and validates failover.
Prerequisites
Section titled “Prerequisites”- A running K8s cluster (e.g.,
kindwith 3 worker nodes). kubectlandhelminstalled.- Default StorageClass configured (standard
kindprovisioner is sufficient for this lab).
Step 1: Install CloudNativePG Operator
Section titled “Step 1: Install CloudNativePG Operator”Deploy the CNPG controller using the official manifests.
kubectl apply --server-side -f \ https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.22/releases/cnpg-1.22.1.yaml
# Verify the controller is runningkubectl get pods -n cnpg-systemStep 2: Deploy an Internal MinIO for Backups
Section titled “Step 2: Deploy an Internal MinIO for Backups”On bare metal, you need local object storage. We will deploy a minimal MinIO instance.
helm repo add bitnami https://charts.bitnami.com/bitnamihelm install minio bitnami/minio \ --set auth.rootUser=admin \ --set auth.rootPassword=supersecret \ --set defaultBuckets="cnpg-backups" \ --namespace minio-system --create-namespaceWait for MinIO to become ready:
kubectl rollout status deployment/minio -n minio-systemStep 3: Configure Backup Credentials
Section titled “Step 3: Configure Backup Credentials”Create a K8s Secret in the default namespace holding the MinIO credentials so CNPG can authenticate.
kubectl create secret generic minio-creds \ --from-literal=ACCESS_KEY_ID=admin \ --from-literal=ACCESS_SECRET_KEY=supersecretStep 4: Deploy the PostgreSQL Cluster
Section titled “Step 4: Deploy the PostgreSQL Cluster”Apply the following manifest. It provisions a 3-node PostgreSQL cluster and configures continuous WAL archiving to MinIO.
apiVersion: postgresql.cnpg.io/v1kind: Clustermetadata: name: dojo-dbspec: instances: 3
# Anti-affinity ensures pods run on different bare-metal nodes affinity: enablePodAntiAffinity: true topologyKey: kubernetes.io/hostname
storage: size: 1Gi
backup: barmanObjectStore: destinationPath: s3://cnpg-backups/ endpointURL: http://minio.minio-system.svc.cluster.local:9000 s3Credentials: accessKeyId: name: minio-creds key: ACCESS_KEY_ID secretAccessKey: name: minio-creds key: ACCESS_SECRET_KEY wal: compression: gzipkubectl apply -f pg-cluster.yamlStep 5: Verify Cluster Health and Replication
Section titled “Step 5: Verify Cluster Health and Replication”CloudNativePG provides a standard kubectl plugin, but you can inspect the CRD directly:
# Watch the pods initialize (takes ~2 minutes)kubectl get pods -l cnpg.io/cluster=dojo-db -w
# Check the cluster statuskubectl get cluster dojo-db -o yaml | grep phase# Expected output: phase: Cluster in healthy stateVerify WAL archiving is succeeding by checking the MinIO pod logs, or check the CNPG operator logs for backup events.
Step 6: Simulate Node Failure and Observe Failover
Section titled “Step 6: Simulate Node Failure and Observe Failover”Identify the current primary database pod.
kubectl get pods -l cnpg.io/cluster=dojo-db,cnpg.io/instanceRole=primaryDelete the primary pod forcefully to simulate a node crash:
# Replace pod name with your actual primary pod namekubectl delete pod dojo-db-1 --force --grace-period=0Immediately watch the pods and the Cluster resource. CNPG will detect the failure via API leases, fence the old primary, select the replica with the most advanced LSN (Log Sequence Number), and promote it.
# Watch the roles changekubectl get pods -l cnpg.io/cluster=dojo-db -L cnpg.io/instanceRoleTraffic sent to the dojo-db-rw service will automatically route to the newly promoted primary with minimal disruption.
Practitioner Gotchas
Section titled “Practitioner Gotchas”1. The Local Disk WAL Trap
Section titled “1. The Local Disk WAL Trap”If your S3 backup target (MinIO/Ceph) goes down, CNPG will fail to archive WALs. By design, PostgreSQL will not delete unarchived WALs. Over hours or days, the local PV will fill to 100%. The database will crash and enter a CrashLoopBackOff. Fix: Monitor the cnpg_wal_archive_status metric. Have a runbook for temporarily disabling archiving or expanding the PVC via volume expansion.
2. Connection Starvation During Pod Restarts
Section titled “2. Connection Starvation During Pod Restarts”A deployment scaling from 10 to 50 pods might instantly open 500 new connections to PostgreSQL. If max_connections is hit, the application crashes. Increasing max_connections excessively causes PostgreSQL to consume all available RAM and OOMKill. Fix: Always deploy the CNPG Pooler (pgBouncer) in transaction mode and point applications to the Pooler service, not the direct DB service.
3. CPU Limits Throttling Latency
Section titled “3. CPU Limits Throttling Latency”Setting strict CPU limits on database pods in Kubernetes utilizes the CFS (Completely Fair Scheduler) quota system. Sudden spikes in query complexity can result in severe CPU throttling, adding hundreds of milliseconds to query latency even if the physical node has idle cores. Fix: For databases on bare metal, set CPU requests accurately for scheduling, but frequently omit CPU limits (or set them very high) to allow burst capacity, relying instead on dedicated node pools or vertical scaling.
4. OOMKills During Index Creation
Section titled “4. OOMKills During Index Creation”A CREATE INDEX or VACUUM operation requires significant working memory (maintenance_work_mem). If K8s memory limits are configured too tightly around normal operational metrics, these maintenance tasks will trigger an immediate OOMKill from the Kubelet. Fix: Leave a minimum 20-30% buffer between PostgreSQL’s configured shared buffers/work memory and the container’s hard memory limit.
1. A bare-metal Kubernetes cluster running CloudNativePG experiences a failure in its MinIO object storage cluster that lasts for 48 hours. What is the most likely consequence for the PostgreSQL cluster?
- A) The database switches to synchronous replication mode automatically.
- B) The database purges the oldest data to make room for new transactions.
- C) The primary node’s storage volume fills up with Write-Ahead Logs (WALs), eventually causing the database to crash.
- D) The operator stops accepting read queries and enters read-only mode.
Correct Answer: C (Postgres retains WALs until they are successfully archived. If the archiver fails, the disk fills up).
2. In a Vitess architecture deployed on Kubernetes, which component is responsible for parsing SQL queries and routing them to the correct underlying MySQL shard?
- A) VTTablet
- B) VTGate
- C) Topology Server
- D) VReplication
Correct Answer: B (VTGate is the stateless proxy router that handles query routing based on the VSchema).
3. Why is connection pooling (e.g., pgBouncer) considered mandatory for PostgreSQL running in a microservice-heavy Kubernetes environment?
- A) Because Kubernetes networking limits the number of TCP ports available on a Service.
- B) Because PostgreSQL forks a new OS process for every connection, leading to rapid memory exhaustion when microservices rapidly scale up.
- C) Because pgBouncer encrypts traffic between pods automatically.
- D) Because PostgreSQL cannot perform Point-In-Time Recovery without a pooling layer.
Correct Answer: B (Process-per-connection architecture makes Postgres highly sensitive to connection churn and volume).
4. You need to deploy a highly available, in-memory caching cluster for a new enterprise application. To avoid strict Server Side Public License (SSPL) restrictions, which technology should you implement?
- A) Redis Enterprise
- B) Memcached
- C) Valkey
- D) Cassandra
Correct Answer: C (Valkey is the Linux Foundation fork of Redis created specifically to maintain an open-source alternative to SSPL Redis).
5. When provisioning storage for a database operator on bare metal, why is it critical that the StorageClass supports WaitForFirstConsumer?
- A) It prevents the volume from being encrypted before the pod mounts it.
- B) It delays volume binding until the pod is scheduled, ensuring the local Persistent Volume is created on the exact physical node where the pod will run.
- C) It guarantees that the database application has started before the storage is attached.
- D) It formats the disk with the XFS filesystem automatically.
Correct Answer: B (Volume binding mode must be delayed for local storage so the PV and Pod land on the same physical host).