Private AIOps
Private AIOps
Section titled “Private AIOps”Operating Kubernetes on bare metal often involves strict data privacy, compliance, or air-gapped constraints that preclude the use of SaaS observability platforms like Datadog, New Relic, or cloud-vendor AIOps tools. Private AIOps brings machine learning (ML) and artificial intelligence (AI) directly into your cluster to automate root cause analysis (RCA), detect metric anomalies, and execute self-healing workflows without transmitting cluster state or telemetry over the internet.
This module covers the architecture and implementation of in-cluster AIOps, focusing on localized telemetry anomaly detection, AI-augmented incident response using Robusta and local Large Language Models (LLMs), and predictive scaling.
Learning Outcomes
Section titled “Learning Outcomes”- Design an air-gapped, privacy-preserving AIOps architecture using local LLMs and open-source telemetry pipelines.
- Configure Robusta to enrich Prometheus alerts with cluster context and AI-generated root cause analysis without relying on external SaaS APIs.
- Implement statistical anomaly detection rules in Prometheus and VictoriaMetrics to replace brittle static alerting thresholds.
- Evaluate predictive scaling mechanisms against reactive Horizontal Pod Autoscaler (HPA) behaviors for spiky workloads.
- Establish operational guardrails to constrain automated self-healing workflows, preventing cascading failures caused by AI hallucinations or aggressive remediation loops.
Architectural Paradigms for Private AIOps
Section titled “Architectural Paradigms for Private AIOps”A Private AIOps stack shifts the computational burden of telemetry analysis and natural language reasoning to the local cluster infrastructure.
The architecture consists of four distinct pillars:
- Telemetry Ingestion & Storage: Prometheus, VictoriaMetrics, or Thanos for metrics; Vector or Fluent Bit for logs.
- Anomaly Detection Engine: Statistical engines (VictoriaMetrics
vmanomaly) or advanced PromQL queries that generate dynamic alerts based on historical baselines rather than static thresholds. - Incident Response & Automation: Robusta or custom Kubernetes Operators acting as the glue between alerts, cluster state, and the AI backend.
- Local AI Backend: An OpenAI-compatible API serving local models (e.g., Ollama or vLLM running Llama 3, Qwen, or Mistral).
graph TD subgraph Bare-Metal Cluster P[Prometheus / VictoriaMetrics] -->|Fires Alert| A[Alertmanager] A -->|Webhook| R[Robusta Runner]
K[kube-apiserver] -.->|Fetch Pod Logs / Events| R
R -->|System Prompt + Context| L[Local LLM via vLLM/Ollama] L -->|RCA & Remediation Steps| R
R -->|Formats Message| S[Notification Sink: Slack/Teams/Email] R -->|Trigger Playbook| K endTrade-offs: Local vs. SaaS AIOps
Section titled “Trade-offs: Local vs. SaaS AIOps”| Feature | Private AIOps (Local LLM / vmanomaly) | SaaS AIOps (Datadog / OpenAI) |
|---|---|---|
| Data Privacy | Absolute. Telemetry never leaves the cluster. | High risk. Cluster state and secrets may leak in prompts. |
| Latency | Dependent on local GPU/CPU hardware. Can be slow. | Low latency for API calls, fast inference. |
| Model Capability | Limited by local VRAM (typically 7B-70B parameter models). | State-of-the-art reasoning (GPT-4, Claude 3.5 Sonnet). |
| Cost | High CapEx (GPUs, power, cooling). Zero OpEx. | Low CapEx. High, unpredictable OpEx based on token usage. |
Anomaly Detection in Cluster Metrics
Section titled “Anomaly Detection in Cluster Metrics”Static thresholds in Prometheus (container_memory_usage_bytes > 8Gi) inevitably lead to alert fatigue. Workloads have diurnal patterns, weekly cycles, and seasonal spikes. Anomaly detection identifies deviations from expected patterns.
PromQL Statistical Baselines
Section titled “PromQL Statistical Baselines”Before deploying complex ML models, leverage Prometheus’s native statistical functions. You can calculate a rolling baseline using avg_over_time and standard deviations with stddev_over_time to detect anomalies based on Z-scores.
groups:- name: node-anomaly-detection rules: - record: node:cpu_usage:z_score expr: > ( instance:node_cpu_utilisation:rate1m - avg_over_time(instance:node_cpu_utilisation:rate1m[1w]) ) / stddev_over_time(instance:node_cpu_utilisation:rate1m[1w])
- alert: NodeCPUUsageAnomaly expr: node:cpu_usage:z_score > 3 # 3 standard deviations above the 1-week norm for: 15m labels: severity: warning annotations: summary: "Anomalous CPU usage detected on {{ $labels.instance }}"Machine Learning with VictoriaMetrics Anomaly Detection (vmanomaly)
Section titled “Machine Learning with VictoriaMetrics Anomaly Detection (vmanomaly)”For environments already utilizing VictoriaMetrics, vmanomaly provides an external service that queries historical data, applies ML models (like Prophet or Holt-Winters), and pushes an anomaly score metric (anomaly_score) back into the time-series database.
A typical vmanomaly configuration defines a model and a schedule:
models: prophet_model: class: "prophet" args: interval_width: 0.98
reader: datasource_url: "http://victoria-metrics:8428" queries: ingress_latency: "histogram_quantile(0.99, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le, ingress))"
writer: datasource_url: "http://victoria-metrics:8428"You then alert on the resulting metric:
- alert: IngressLatencyAnomaly expr: anomaly_score{model="prophet_model", query_name="ingress_latency"} > 1.0 for: 5mAI-Augmented Incident Response with Robusta
Section titled “AI-Augmented Incident Response with Robusta”When an alert fires, engineers waste minutes (or hours) manually gathering context: running kubectl get events, checking logs, and viewing associated Grafana dashboards. Robusta automates this triage phase.
Robusta consists of an in-cluster runner that intercepts Alertmanager webhooks, runs predefined Python “playbooks” to gather cluster state, and forwards the enriched package to a sink.
Integrating Local LLMs
Section titled “Integrating Local LLMs”To utilize Robusta for RCA without leaking data to OpenAI, you must point Robusta’s AI features to a local, OpenAI-compatible endpoint. Ollama or vLLM are the standard choices.
The LLM is prompted with the alert details, recent events, and pod logs. The model returns a structured RCA and suggested remediation steps.
Prerequisite hardware: A 7B parameter model (e.g., llama3.1:8b or mistral) requires ~6-8GB of VRAM. A 70B parameter model requires ~40GB of VRAM. For pure log analysis and RCA, smaller models often suffice if the system prompt is well-engineered.
Configuring the Robusta LLM Endpoint
Section titled “Configuring the Robusta LLM Endpoint”In your values.yaml for the Robusta Helm chart, override the ChatGPT integration settings to point to your in-cluster LLM service:
globalConfig: chat_gpt_api_key: "dummy-key" # Required by the API client, but ignored by Ollama chat_gpt_endpoint: "http://ollama.monitoring.svc.cluster.local:11434/v1" chat_gpt_model: "llama3.1:8b" # Must match the model pulled in Ollama
sinksConfig: - slack_sink: name: main_slack_sink slack_channel: "production-alerts" api_key: "xoxb-your-slack-bot-token"
enablePrometheusStack: false # Set to false if you already have kube-prometheus-stackWhen Robusta receives an alert (e.g., CrashLoopBackOff), it automatically queries the local endpoint, analyzes the crash logs, and appends the AI’s explanation to the Slack message.
Predictive Scaling
Section titled “Predictive Scaling”Standard Horizontal Pod Autoscaler (HPA) is reactive. It scales up after CPU spikes or queue lengths increase. This causes a cold-start latency window where incoming requests are dropped or degraded while new pods initialize.
Predictive scaling analyzes historical metric patterns to scale out proactively, before the anticipated load arrives.
KEDA Predictive Scaling
Section titled “KEDA Predictive Scaling”KEDA (Kubernetes Event-driven Autoscaling) can integrate with predictive models via external metric scalers. While native predictive scaling in KEDA is still maturing, the standard pattern involves deploying a custom metrics API server that exposes predictions generated by an ML model as standard Kubernetes custom metrics.
Alternatively, operators use tools like Predictive Horizontal Pod Autoscaler (PHPA) which acts as a drop-in replacement or wrapper around the standard HPA. It fetches historical metrics, runs statistical models (like Holt-Winters), and overrides the replica count ahead of time.
The Dangers of Black-Box Scaling
Section titled “The Dangers of Black-Box Scaling”Predictive scaling introduces significant operational risk. If the model hallucinates a traffic spike (e.g., due to anomalous historical data from a previous outage), it may scale deployments to their maximum limits, exhausting node resources and starving other workloads.
Guardrails for Predictive Scaling:
- Tight
maxReplicasbounds: Never setmaxReplicasarbitrarily high. Cap it at 120% of your known peak capacity. - Fallback to Reactive HPA: If the predictive metric server goes down or returns NaN, the system must immediately fall back to standard CPU/Memory reactive scaling.
- Ignore Anomaly Windows: The training pipeline for the predictive model must strip out data generated during outages or DDoS attacks, otherwise the model will predict “outages” as standard seasonal traffic.
Self-Healing Workflows and Guardrails
Section titled “Self-Healing Workflows and Guardrails”The ultimate goal of AIOps is automated remediation: the system detects an anomaly, the LLM determines the fix, and an operator applies it. On bare metal, automated remediation usually involves node cordoning, pod evictions, restarting deadlocked services, or rolling back deployments.
Implementing Playbooks
Section titled “Implementing Playbooks”Robusta allows you to define custom Python playbooks triggered by specific Prometheus alerts.
# Example Robusta custom playbook triggercustomPlaybooks: - triggers: - on_prometheus_alert: alert_name: HighErrorRate actions: - custom_rollback_action: namespace: "{{ alert.labels.namespace }}" deployment: "{{ alert.labels.deployment }}"The “Human-in-the-Loop” Mandate
Section titled “The “Human-in-the-Loop” Mandate”Permitting an LLM to autonomously execute state-mutating commands (kubectl delete, kubectl rollout undo) is a severe anti-pattern in production environments. Models hallucinate, and incorrect remediation can transform a localized failure into a cluster-wide outage.
Strict Guardrails for Remediation:
- Read-Only Service Accounts: The RBAC token mounted to the AIOps engine must strictly be bound to
get,list, andwatchverbs. - Action Suggestions, Not Execution: The LLM should generate the specific
kubectlcommands required to fix the issue and output them in the notification sink. The human operator copies, validates, and executes the command. - Approval Webhooks: If automated execution is required, implement a Slack interactive button (e.g., “Approve Rollback”). The button triggers an intermediary service that validates the request against an allowlist of permitted actions before executing it with elevated privileges.
- Rate Limiting: If a self-healing loop initiates, it must be rate-limited. If a node is cordoned and workloads are evicted, the system must wait for cluster stabilization before cordoning a second node. Failure to rate-limit can lead to all nodes being cordoned simultaneously.
Hands-on Lab: AI-Enriched Alerting with Robusta and Local LLMs
Section titled “Hands-on Lab: AI-Enriched Alerting with Robusta and Local LLMs”In this lab, you will deploy a local Ollama instance running a small LLM, install Robusta, and configure it to intercept a failing Pod alert, utilizing the local AI to explain the failure.
Prerequisites
Section titled “Prerequisites”- A Kubernetes cluster (v1.32+).
kindorminikubeis sufficient if you have at least 8GB of RAM available. helmandkubectlinstalled.kube-prometheus-stackinstalled in themonitoringnamespace.
Step 1: Deploy Local LLM (Ollama)
Section titled “Step 1: Deploy Local LLM (Ollama)”Deploy Ollama to serve the AI model locally. We will use llama3.2:1b for lab environments due to its low memory footprint. In production, use a larger model.
Create ollama.yaml:
apiVersion: apps/v1kind: Deploymentmetadata: name: ollama namespace: monitoringspec: replicas: 1 selector: matchLabels: app: ollama template: metadata: labels: app: ollama spec: containers: - name: ollama image: ollama/ollama:0.5.4 ports: - containerPort: 11434 # Allocate sufficient resources. Adjust based on your hardware. resources: requests: cpu: "2" memory: "4Gi" limits: memory: "8Gi"---apiVersion: v1kind: Servicemetadata: name: ollama namespace: monitoringspec: selector: app: ollama ports: - port: 11434 targetPort: 11434Apply the manifests:
kubectl apply -f ollama.yamlWait for the pod to be ready, then exec into it to pull the model:
kubectl wait --for=condition=ready pod -l app=ollama -n monitoring --timeout=120skubectl exec -it deploy/ollama -n monitoring -- ollama run llama3.2:1b(Type /bye to exit the Ollama prompt once the model is pulled).
Step 2: Configure and Install Robusta
Section titled “Step 2: Configure and Install Robusta”Generate a Robusta configuration. You will need a Robusta account for the UI sink (optional) or you can configure a Slack sink. For this lab, we will configure Robusta to use the local Ollama instance and output to stdout for verification.
Create robusta-values.yaml:
globalConfig: cluster_name: "lab-cluster" # Point to the local Ollama service using the OpenAI compatibility layer chat_gpt_endpoint: "http://ollama.monitoring.svc.cluster.local:11434/v1" chat_gpt_model: "llama3.2:1b" chat_gpt_api_key: "dummy-key" # Required field, but ignored by Ollama
sinksConfig: # Output alerts to the robusta-runner logs for easy viewing in the lab - stdout_sink: name: main_stdout_sink
enablePlatform: false # Disable cloud UI for true private AIOpsenablePrometheusStack: false # We assume you already have Prometheus
# Configure the AI enrichment playbookcustomPlaybooks: - triggers: - on_prometheus_alert: {} actions: - ask_chat_gpt: prompt: "Analyze this Prometheus alert and the provided Kubernetes context. Provide a concise root cause analysis and exactly two kubectl commands to investigate or remediate the issue."Install Robusta via Helm:
helm repo add robusta https://robusta-charts.storage.googleapis.comhelm repo updatehelm install robusta robusta/robusta -f robusta-values.yaml -n monitoringVerify the runner is active:
kubectl get pods -n monitoring -l app=robusta-runner# EXPECTED: READY 1/1Step 3: Configure Alertmanager to forward to Robusta
Section titled “Step 3: Configure Alertmanager to forward to Robusta”Update your kube-prometheus-stack values to forward alerts to Robusta.
alertmanager: config: route: receiver: 'robusta' group_by: ['alertname', 'namespace'] group_wait: 10s group_interval: 1m repeat_interval: 4h receivers: - name: 'robusta' webhook_configs: - url: 'http://robusta-runner.monitoring.svc.cluster.local/api/alerts' send_resolved: trueApply the update:
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack -f prometheus-values.yaml -n monitoringStep 4: Trigger an Alert and Verify AI Enrichment
Section titled “Step 4: Trigger an Alert and Verify AI Enrichment”Deploy a failing pod to trigger a standard KubePodCrashLooping or KubePodNotReady alert.
apiVersion: v1kind: Podmetadata: name: failing-app namespace: default labels: app: failingspec: containers: - name: app image: busybox command: ["/bin/sh", "-c", "echo 'Connecting to database...'; sleep 2; exit 1"]kubectl apply -f crashing-pod.yamlWait 5-10 minutes for Prometheus to fire the alert and Alertmanager to route it to Robusta.
Tail the Robusta runner logs to observe the AI enrichment:
kubectl logs -f deploy/robusta-runner -n monitoringExpected Output (Truncated): You should see Robusta intercept the webhook, make an HTTP POST to the Ollama endpoint, and output the enriched alert containing the LLM’s response.
[INFO] Triggering playbook ask_chat_gpt for alert KubePodCrashLooping[INFO] Querying LLM endpoint http://ollama.monitoring.svc.cluster.local:11434/v1...[STDOUT_SINK] Alert: KubePodCrashLooping in defaultAI Analysis: The pod 'failing-app' is in a CrashLoopBackOff state. The logs indicate the process exits with code 1 immediately after printing "Connecting to database...". This suggests a missing configuration, network policy blocking DB access, or missing credentials.
Suggested Actions:1. View detailed pod events: `kubectl describe pod failing-app -n default`2. Check recent logs: `kubectl logs failing-app -n default --previous`Troubleshooting
Section titled “Troubleshooting”- LLM Connection Refused: Ensure the Ollama pod is running and the service endpoint
http://ollama.monitoring.svc.cluster.local:11434resolves from within therobusta-runnerpod. - Alerts not arriving: Check Alertmanager logs (
kubectl logs statefulset/alertmanager-kube-prometheus-stack-alertmanager -n monitoring) to ensure the webhook to Robusta is succeeding. - OOMKilled Ollama: Local LLMs consume significant memory. If the Ollama pod restarts with
OOMKilled, increase the memory limits or use a smaller model quantization (e.g.,q4_0).
Practitioner Gotchas
Section titled “Practitioner Gotchas”- Context Window Truncation: Feeding
kubectl describe podand 500 lines of logs into a local LLM will rapidly exceed the model’s context window (often 8k tokens for smaller models), resulting in truncated prompts or HTTP 500 errors. Implement log tailing limits (e.g., last 50 lines) before sending data to the AI backend. - LLM API Timeout Constraints: Robusta and Alertmanager webhooks expect relatively fast responses. Inference on CPU-only bare metal nodes for a 8B parameter model can take 30-60 seconds. This will cause webhook timeouts. You must increase Alertmanager webhook timeout configurations or ensure GPU acceleration for the LLM inference server.
- Hallucinated Kubernetes Objects: LLMs confidently hallucinate CRDs and specific resource names. An AI might suggest running
kubectl restart deployment/db-backendwhen the actual resource is a StatefulSet namedpostgres-db. Never pipe AI output directly intobashorkubectl. - Standard Deviation Alert Flapping: Using Z-scores for anomaly detection on metrics with low baseline variance (e.g., a service that receives exactly 1 request per hour) will cause minor fluctuations to appear as massive anomalies (Z-score > 10). Always combine statistical anomaly detection with an absolute minimum threshold (e.g.,
z_score > 3 AND rate > 100).
1. You are configuring Robusta to use a local LLM running in your cluster to ensure strict data privacy. The Robusta Helm chart requires a chat_gpt_api_key. What is the correct approach?
- A) Leave the field empty or comment it out; the local API doesn’t need it.
- B) Provide a dummy string; the OpenAI compatible client requires the field to be populated, but the local backend (like Ollama) will ignore it.
- C) Generate an API key from the OpenAI platform and use it to authenticate the local traffic.
- D) Configure the Robusta runner RBAC to read the key dynamically from the
kube-systemnamespace.
Answer
**Correct Answer: B.** The underlying Python OpenAI client library used by Robusta strictly requires an API key string to initialize, even if the base URL is pointed to a local unauthenticated service like Ollama. A dummy string satisfies the client.2. Which of the following is a critical operational guardrail when implementing Predictive Horizontal Pod Autoscaling (PHPA)?
- A) Setting the minimum replica count to zero to maximize resource utilization during off-peak hours.
- B) Ensuring the predictive model has admin access to the cluster to create new node pools.
- C) Capping the
maxReplicasto a conservative boundary (e.g., 120% of expected peak) to prevent hallucinated traffic spikes from exhausting cluster resources. - D) Disabling the reactive HPA entirely to prevent conflict with the predictive scaler.
Answer
**Correct Answer: C.** Predictive models can hallucinate or over-predict based on dirty historical data. Tight `maxReplicas` bounds prevent the scaler from requesting infinite pods and crashing the cluster.3. When implementing PromQL-based anomaly detection using Z-scores (expr: node:cpu_usage:z_score > 3), you notice the alert is flapping continuously on a low-traffic development cluster. What is the most likely architectural flaw?
- A) The standard deviation is too high, masking the true anomalies.
- B) The baseline time window (
1w) is too short to capture standard variance. - C) Evaluating long time windows is causing Prometheus to silently drop metrics.
- D) The metric has a near-zero baseline variance, meaning tiny absolute changes generate massive Z-scores.
Answer
**Correct Answer: D.** If a baseline is perfectly flat, the standard deviation approaches zero. Any minor change divided by near-zero results in a massive Z-score, triggering the anomaly alert. You must add an absolute minimum threshold to the rule.4. Why is routing automated remediation playbooks directly from an LLM to the Kubernetes API server (e.g., kubectl delete pod) considered a severe anti-pattern?
- A) The API server rate limits incoming requests from service accounts by default.
- B) LLMs hallucinate resource names and remediation steps, which can escalate a localized failure into a cluster outage if executed autonomously.
- C) Robusta does not support executing commands against the local cluster.
- D) Local LLMs cannot process JSON output required by the Kubernetes API.
Answer
**Correct Answer: B.** LLMs lack situational awareness and absolute determinism. They may hallucinate a fix that destroys healthy resources. Human-in-the-loop verification is mandatory for destructive actions.5. You are running a 7B parameter local LLM for private AIOps on bare metal without GPUs. You notice that Robusta often fails to attach AI summaries to Slack alerts, and the Robusta logs show HTTP 504 Gateway Timeout. What is the primary cause?
- A) The Alertmanager webhook timeout is shorter than the CPU-bound inference time required by the LLM.
- B) Robusta cannot communicate with services in the
monitoringnamespace. - C) The LLM’s context window has been exceeded by the size of the Slack payload.
- D) The Slack API is rejecting messages that contain AI-generated content.