Module 1.4: Loki
Toolkit Track | Complexity: [COMPLEX] | Time: 40-45 min
Prerequisites
Before starting this module:
- Module 1.1: Prometheus — Labels and querying concepts
- Module 1.3: Grafana — Visualization and exploration
- Basic understanding of log aggregation
- kubectl log experience
The infrastructure team at a fast-growing SaaS company stared at their AWS bill in disbelief. Elasticsearch: $127,000 last month. For logs. Not revenue-generating features, not customer-facing services—just storing text that nobody read 99% of the time.
“We need these logs for compliance,” the security team argued. “Seven years retention, fully searchable.”
The math was brutal: 2TB of logs per day, indexed across 12 fields, replicated 3x, stored for 2,555 days. The five-year projection showed $8.4 million in Elasticsearch infrastructure costs alone.
Then the platform architect discovered Loki. Same 2TB/day, but stored in S3 at $0.023/GB instead of hot Elasticsearch nodes at $0.30/GB. Labels indexed, content compressed. The seven-year compliance requirement? Achievable for $340,000—a 96% cost reduction.
Six months after migration, the CFO asked the infrastructure team to present at the board meeting. “This is the first time anyone’s ever invited infrastructure to talk about cost savings,” the architect joked. The board approved their next three proposals on the spot.
What You’ll Be Able to Do
After completing this module, you will be able to:
- Deploy Loki for cost-effective log aggregation with label-based indexing and object storage backends
- Implement LogQL queries for log exploration, metric extraction, and pattern-based alerting
- Configure Promtail and Grafana Alloy agents for Kubernetes log collection with label enrichment
- Optimize Loki’s storage configuration and retention policies for high-volume log environments
Why This Module Matters
Logs tell you what happened. Metrics tell you something is wrong; logs tell you why. But traditional logging solutions (ELK, Splunk) are expensive: indexing every field in every log line costs storage and compute.
Loki takes a different approach: index only labels, store logs compressed. It’s “Prometheus for logs”—same label-based discovery, fraction of the cost. At scale, this matters enormously.
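To make the trade-off concrete, here is a minimal Python sketch of the idea. It is not Loki's storage engine: `ToyLogStore` and its methods are hypothetical names invented for illustration. Labels are the only index, and line content is scanned at query time.

```python
from collections import defaultdict


class ToyLogStore:
    """Toy model of label-only indexing (illustrative, not Loki's real design)."""

    def __init__(self):
        # Index: one entry per unique label set (a "stream").
        # Log lines are appended as-is; their content is never indexed.
        self.streams = defaultdict(list)

    def push(self, labels, line):
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains):
        wanted = set(selector.items())
        hits = []
        for label_set, lines in self.streams.items():
            if wanted <= label_set:  # cheap step: compare labels only
                # expensive step: brute-force scan of the selected streams
                hits.extend(line for line in lines if contains in line)
        return hits


store = ToyLogStore()
store.push({"namespace": "production", "app": "api"}, "User login failed")
store.push({"namespace": "production", "app": "api"}, "User login ok")
store.push({"namespace": "staging", "app": "api"}, "User login failed")

matches = store.query({"namespace": "production"}, "failed")
print(len(store.streams), matches)
```

The index grows with the number of distinct label sets, not with the number of log lines, which is why narrow label selectors keep the scan step small.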
Did You Know?
- Loki was inspired by frustration: Grafana engineers were tired of running expensive Elasticsearch clusters just to find a needle in a haystack of logs
- Loki can be 10-100x cheaper than Elasticsearch for the same volume—because it doesn’t index log content, only labels
- The name comes from Norse mythology—Loki is the trickster god, fitting for a system that “tricks” you into thinking you have full-text search
- Loki’s query language (LogQL) borrows heavily from PromQL—if you know Prometheus, you’re 80% there
Loki Architecture
┌─────────────────────────────────────────────────────────────────┐
│                        LOKI ARCHITECTURE                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  LOG SOURCES                                                    │
│  ┌──────────────┐                                               │
│  │  App Pods    │                                               │
│  │  stdout/err  │─────┐                                         │
│  └──────────────┘     │                                         │
│  ┌──────────────┐     │     ┌─────────────────────────────────┐ │
│  │ System Logs  │     ├────▶│            PROMTAIL             │ │
│  │ /var/log/*   │     │     │  ┌───────────────────────────┐  │ │
│  └──────────────┘     │     │  │ • Discover log files      │  │ │
│  ┌──────────────┐     │     │  │ • Extract labels          │  │ │
│  │ K8s Events   │─────┘     │  │ • Push to Loki            │  │ │
│  └──────────────┘           │  └───────────────────────────┘  │ │
│                             └──────────────┬──────────────────┘ │
│                                            │                    │
│                                            ▼                    │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                       LOKI CLUSTER                        │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────────────┐   │  │
│  │  │ Distributor│  │  Ingester  │  │      Querier       │   │  │
│  │  │            │  │            │  │                    │   │  │
│  │  │  Receives  │─▶│  Builds    │  │  Queries chunks    │   │  │
│  │  │  logs      │  │  chunks    │  │  from ingesters    │   │  │
│  │  │            │  │  in memory │  │  and object store  │   │  │
│  │  └────────────┘  └─────┬──────┘  └────────────────────┘   │  │
│  │                        │                                  │  │
│  │                        ▼                                  │  │
│  │ ┌───────────────────────────────────────────────────────┐ │  │
│  │ │             OBJECT STORAGE (S3/GCS/Minio)             │ │  │
│  │ │  ┌─────────────────┐   ┌─────────────────────────┐    │ │  │
│  │ │  │     Chunks      │   │          Index          │    │ │  │
│  │ │  │  (compressed    │   │  (labels → chunk refs)  │    │ │  │
│  │ │  │   log lines)    │   │                         │    │ │  │
│  │ │  └─────────────────┘   └─────────────────────────┘    │ │  │
│  │ └───────────────────────────────────────────────────────┘ │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Core Components
| Component | Purpose |
|---|---|
| Promtail | Agent that discovers, labels, and pushes logs |
| Distributor | Receives log streams, validates, routes to ingesters |
| Ingester | Builds compressed chunks, holds recent data |
| Querier | Executes LogQL queries across ingesters and storage |
| Compactor | Merges and deduplicates index files |
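The routing step in the Distributor row can be pictured with a small Python sketch. This is a deliberate simplification: Loki uses a token-based hash ring, while the sketch below uses a plain hash-and-modulo scheme. The function names (`stream_key`, `pick_ingesters`) and the ingester names are hypothetical, chosen only for illustration.

```python
import hashlib


def stream_key(tenant, labels):
    """Canonical identity of a stream: tenant plus sorted label pairs."""
    canon = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{tenant}/{{{canon}}}"


def pick_ingesters(tenant, labels, ingesters, replication_factor=3):
    """Choose which ingesters receive a stream (simplified modulo placement)."""
    digest = hashlib.sha256(stream_key(tenant, labels).encode()).digest()
    start = int.from_bytes(digest[:8], "big") % len(ingesters)
    count = min(replication_factor, len(ingesters))
    # Take `count` consecutive ingesters, wrapping around the list.
    return [ingesters[(start + i) % len(ingesters)] for i in range(count)]


ingesters = ["ingester-0", "ingester-1", "ingester-2", "ingester-3"]
labels = {"namespace": "production", "app": "api"}
targets = pick_ingesters("team-a", labels, ingesters)
print(stream_key("team-a", labels), targets)
```

The point of the sketch is that placement depends only on the tenant and the label set, so every line of a given stream lands on the same ingesters and can be packed into the same chunks.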
Why Labels Matter
LOKI'S KEY INSIGHT: Only index labels, not log content

Traditional (Elasticsearch):                  Loki:
─────────────────────────────────────────────────────────────────
Every word indexed:                           Only labels indexed:
{                                             {
  "timestamp": "2024-01-15T10:30:00Z",          namespace="production",
  "message": "User login failed",   ◀─ idx      app="api-gateway",
  "user": "john@example.com",       ◀─ idx      pod="api-gateway-xyz",
  "error": "invalid password",      ◀─ idx      container="api"
  "ip": "192.168.1.100",            ◀─ idx    }
  "pod": "api-gateway-xyz"          ◀─ idx
}                                             Log line: stored compressed

Storage: 10GB                                 Storage: 1GB
Index: 8GB                                    Index: 100MB
LogQL: Query Language
Basic Syntax
{label="value"} |= "search term" | json | line_format "{{.field}}"
└── stream      └── filter       └── parser └── formatter
    selector
Stream Selectors
# Exact match
{namespace="production"}

# Regex match
{namespace=~"prod.*"}

# Not equal
{namespace!="kube-system"}

# Multiple labels (AND)
{namespace="production", app="api-gateway"}

# Regex OR
{namespace=~"production|staging"}
Line Filters
# Contains (case-sensitive)
{app="api"} |= "error"

# Does not contain
{app="api"} != "debug"

# Regex match
{app="api"} |~ "error|warn|fatal"

# Regex not match
{app="api"} !~ "health|ready"
Parsers
# JSON parser - extracts all JSON fields as labels
{app="api"} | json

# JSON with specific fields
{app="api"} | json status_code="response.status"

# Logfmt parser (key=value format)
{app="api"} | logfmt

# Pattern parser (custom format)
{app="nginx"} | pattern `<ip> - - [<timestamp>] "<method> <path> <_>" <status> <size>`

# Regex parser
{app="api"} | regexp `(?P<level>\w+) (?P<msg>.*)`
Label Filters (after parsing)
# Filter by extracted label
{app="api"} | json | status_code >= 500

# String comparison
{app="api"} | json | level = "error"

# Duration comparison
{app="api"} | logfmt | duration > 1s

# Size comparison
{app="api"} | json | bytes > 1MB
Aggregations (Metric Queries)
# Count logs per minute
count_over_time({app="api"} |= "error" [1m])

# Rate of logs per second
rate({app="api"} |= "error" [5m])

# Bytes processed
bytes_rate({app="api"} [5m])

# Sum by label
sum by (status_code) (
  count_over_time({app="api"} | json [5m])
)

# Top 5 error sources
topk(5,
  sum by (pod) (
    count_over_time({app="api"} |= "error" [1h])
  )
)
Practical LogQL Examples
# Find all errors in production in the last hour
# (a log query takes its window from the Grafana time picker or the API
# start/end parameters; a [1h] range is only valid inside metric queries)
{namespace="production"} |= "error"

# Parse JSON logs and filter by status
{app="api-gateway"}
  | json
  | status_code >= 400
  | line_format "{{.method}} {{.path}} - {{.status_code}}"

# Calculate error rate per service
sum by (app) (
  rate({namespace="production"} |= "error" [5m])
)

# Find slow requests (>1s) with context
{app="api"}
  | json
  | duration > 1s
  | line_format "{{.path}} took {{.duration}} - user: {{.user_id}}"

# Logs around a specific time (context): run the query with the time range
# set to 2024-01-15T10:30:00Z through 2024-01-15T10:35:00Z
# (label filters cannot compare timestamp strings with >= or <=)
{app="api"} | json
Promtail Configuration
Basic Promtail Setup
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml  # Track file read positions

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod

    relabel_configs:
      # Keep only pods with logging enabled
      - source_labels: [__meta_kubernetes_pod_annotation_promtail_io_scrape]
        action: keep
        regex: true

      # Extract namespace label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace

      # Extract pod name
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

      # Extract container name
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container

      # Extract app label from pod
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
Pipeline Stages
scrape_configs:
  - job_name: app-logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: app
          __path__: /var/log/app/*.log

    pipeline_stages:
      # Parse JSON logs
      - json:
          expressions:
            level: level
            msg: message
            user_id: user.id
            ts: ts  # assumes the log carries a ts field; used by the timestamp stage

      # Add labels from parsed fields
      # (user_id is left out: it is high-cardinality and belongs in log content)
      - labels:
          level:

      # Drop debug logs in production
      - match:
          selector: '{job="app"}'
          stages:
            - drop:
                # The raw line is JSON, so match the JSON form of the field
                expression: '.*"level":\s*"debug".*'

      # Restructure log line
      - output:
          source: msg

      # Add timestamp from log
      - timestamp:
          source: ts
          format: RFC3339
Multiline Logs (Stack Traces)
scrape_configs:
  - job_name: java-app
    static_configs:
      - targets:
          - localhost
        labels:
          job: java
          __path__: /var/log/java/*.log

    pipeline_stages:
      # Combine multiline stack traces
      - multiline:
          firstline: '^\d{4}-\d{2}-\d{2}'  # Starts with date
          max_wait_time: 3s
          max_lines: 128

      # Then parse as usual
      - json:
          expressions:
            level: level
            exception: exception.class
Loki Deployment Patterns
Monolithic (Simple)
# Good for: < 100GB/day, single team
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
        - name: loki
          image: grafana/loki:2.9.0
          args:
            - -config.file=/etc/loki/loki.yaml
          ports:
            - containerPort: 3100
          volumeMounts:
            - name: config
              mountPath: /etc/loki
            - name: storage
              mountPath: /loki
      volumes:
        - name: config
          configMap:
            name: loki-config
        - name: storage
          persistentVolumeClaim:
            claimName: loki-storage
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
  namespace: monitoring
data:
  loki.yaml: |
    auth_enabled: false

    server:
      http_listen_port: 3100

    common:
      path_prefix: /loki
      storage:
        filesystem:
          chunks_directory: /loki/chunks
          rules_directory: /loki/rules
      replication_factor: 1
      ring:
        kvstore:
          store: inmemory

    schema_config:
      configs:
        - from: 2020-10-24
          store: boltdb-shipper
          object_store: filesystem
          schema: v11
          index:
            prefix: index_
            period: 24h

    ruler:
      alertmanager_url: http://alertmanager:9093

    limits_config:
      retention_period: 168h  # 7 days
Microservices Mode (Scale)
# Good for: > 100GB/day, multi-tenant
# Deploy each component separately

# Distributor (2+ replicas)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki-distributor
spec:
  replicas: 2
  template:
    spec:
      containers:
        - name: distributor
          image: grafana/loki:2.9.0
          args:
            - -target=distributor
            - -config.file=/etc/loki/loki.yaml
---
# Ingester (3+ replicas, StatefulSet)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki-ingester
spec:
  replicas: 3
  serviceName: loki-ingester
  template:
    spec:
      containers:
        - name: ingester
          image: grafana/loki:2.9.0
          args:
            - -target=ingester
            - -config.file=/etc/loki/loki.yaml
          volumeMounts:
            - name: wal
              mountPath: /loki/wal
  volumeClaimTemplates:
    - metadata:
        name: wal
      spec:
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 10Gi
---
# Querier (2+ replicas)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki-querier
spec:
  replicas: 2
  template:
    spec:
      containers:
        - name: querier
          image: grafana/loki:2.9.0
          args:
            - -target=querier
            - -config.file=/etc/loki/loki.yaml
Object Storage Configuration (Production)
# loki-config.yaml for S3-compatible storage
auth_enabled: true  # Enable multi-tenancy

storage_config:
  aws:
    s3: s3://access-key:secret-key@region/bucket-name
    s3forcepathstyle: true

  boltdb_shipper:
    active_index_directory: /loki/index
    shared_store: s3
    cache_location: /loki/cache

# Or for GCS
storage_config:
  gcs:
    bucket_name: my-loki-bucket

  boltdb_shipper:
    active_index_directory: /loki/index
    shared_store: gcs

# Or for Azure
storage_config:
  azure:
    account_name: mystorageaccount
    account_key: base64-encoded-key
    container_name: loki-logs
Multi-Tenancy
Enabling Multi-Tenancy
auth_enabled: true  # This is all it takes

limits_config:
  # Per-tenant limits
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_streams_per_user: 10000
  max_global_streams_per_user: 0  # 0 = unlimited

# Override for specific tenants
# (per-tenant overrides are read from the runtime config file, set via runtime_config.file)
overrides:
  tenant-a:
    ingestion_rate_mb: 50
    retention_period: 720h  # 30 days

  tenant-b:
    ingestion_rate_mb: 5
    retention_period: 168h  # 7 days
Sending Logs with Tenant ID
# Promtail config - send tenant header
clients:
  - url: http://loki:3100/loki/api/v1/push
    tenant_id: team-a

# Or via HTTP header
# X-Scope-OrgID: team-a
Querying with Tenant ID
# Grafana: Configure data source with HTTP header
# X-Scope-OrgID: team-a

# Or query directly:
curl -H "X-Scope-OrgID: team-a" \
  "http://loki:3100/loki/api/v1/query?query={app=\"api\"}"
Alerting with Loki
Recording Rules
groups:
  - name: recording-rules
    interval: 1m
    rules:
      # Create a metric from logs
      - record: http_requests:rate5m
        expr: |
          sum by (app, status_code) (
            rate({namespace="production"} | json [5m])
          )

      # Error rate by service
      - record: error_rate:1m
        expr: |
          sum by (app) (
            rate({namespace="production"} |= "error" [1m])
          )
          /
          sum by (app) (
            rate({namespace="production"} [1m])
          )
Alerting Rules
groups:
  - name: alerting-rules
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (app) (
            rate({namespace="production"} |= "error" [5m])
          ) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.app }}"
          description: "{{ $labels.app }} has {{ $value }} errors/sec"

      - alert: NoLogsReceived
        # With no logs at all, rate() returns an empty result rather than 0,
        # so "== 0" would never fire; absent_over_time covers that case.
        expr: |
          absent_over_time({namespace="production"}[5m])
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "No logs from production"
          description: "Haven't received logs from production in 10 minutes"

      - alert: HighLatencyRequests
        expr: |
          count_over_time(
            {app="api"} | json | duration > 5s [5m]
          ) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Many slow requests in API"
Grafana Integration
Adding Loki Data Source
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 1000
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: my-tenant
Correlating Metrics and Logs
In Grafana:
1. Metrics panel: Prometheus query showing error spike
2. Click data point → "Explore" → switch to Loki
3. Automatic time range correlation
4. See exact logs during incident

Or use "Derived Fields":
- Extract trace_id from logs
- Link to Tempo/Jaeger for traces
- Full observability correlation
Common Mistakes
| Mistake | Why It’s Bad | Better Approach |
|---|---|---|
| Too many labels | Cardinality explosion, slow queries | Use 5-10 static labels, parse dynamic values at query time |
| Storing structured data as labels | Labels should be low-cardinality | Keep user_id, request_id in log content, not labels |
| No retention policy | Disk fills up, costs explode | Set retention_period, use lifecycle policies on S3 |
| Querying all namespaces | Slow, expensive queries | Always filter by namespace first |
| Not using time ranges | Scans entire history | Keep the query window short with the Grafana time picker (or the API start/end parameters); use range selectors such as [5m] inside metric queries |
| Ignoring line limits | Returning millions of lines | Use ` |
War Story: The Stream Explosion That Killed Production
A startup migrated from Elasticsearch to Loki for cost savings. Initial results were spectacular: 80% cost reduction, faster queries, simpler operations. The VP of Engineering sent a company-wide email celebrating the win.
Week 3, Tuesday 2:14 PM: Developers noticed log queries timing out.
Week 3, Tuesday 2:45 PM: Loki ingesters started OOMing. Kubernetes restarted them in a loop.
Week 3, Tuesday 3:30 PM: Complete logging blackout. No logs ingested or queryable.
Week 3, Tuesday 4:00 PM: Root cause identified. A well-intentioned developer had added a “helpful” label to make debugging easier:
# The problematic Promtail config
pipeline_stages:
  - json:
      expressions:
        request_id: request_id  # ← This created a new stream per request
  - labels:
      request_id:               # ← Labels must be low-cardinality!
The Math That Broke Everything:
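Request volume: 1,000,000 requests/day
Labels before:  ~500 unique combinations (namespace × service × pod)
Labels after:   up to 500 × 1,000,000 = 500,000,000 possible combinations;
                because every request_id is unique, at least 1,000,000 new streams per day

Ingester memory:
  Before:     500 streams × 64KB each = 32MB
  After:      1,000,000 streams × 64KB each = 64GB for a single day of traffic
  Worst case: 500M streams × 64KB each = 32TB (impossible)
Financial Impact: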
─────────────────────────────────────────────────────────────────
Incident duration:   4 hours
Logs lost:           ~2.8 million events
Compliance gap:      SOC 2 finding (minor)
Engineering hours:   8 engineers × 6 hours = $9,600
Cluster rebuild:     $2,400 (new nodes, data migration)
Audit remediation:   $15,000 (documentation, controls)
─────────────────────────────────────────────────────────────────
Total Impact:        ~$27,000
The Fix: Move high-cardinality values to log content, query with parsers:
# Correct approach
pipeline_stages:
  - json:
      expressions:
        request_id: request_id
  # Don't add request_id as label!
  - output:
      source: message  # Keep request_id in log content
Query at runtime:
{app="api"} | json | request_id="abc123-def456"
The Lesson: Labels are for grouping, not for unique identifiers. If a value has more than ~100 unique values, it should be parsed at query time, not indexed as a label.
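The effect of a single high-cardinality label can be checked with a few lines of Python. This is an illustrative sketch with made-up sample data; `count_streams` is a hypothetical helper, and a stream is modelled simply as a unique combination of label values.

```python
def count_streams(entries, label_keys):
    """Number of distinct streams if only `label_keys` are used as labels."""
    return len({tuple((key, entry[key]) for key in label_keys) for entry in entries})


# 1,000 sample requests spread over 3 pods, each with a unique request_id.
entries = [
    {
        "namespace": "production",
        "app": "api",
        "pod": f"api-{i % 3}",
        "request_id": f"req-{i}",
    }
    for i in range(1000)
]

low = count_streams(entries, ["namespace", "app", "pod"])
high = count_streams(entries, ["namespace", "app", "pod", "request_id"])
print(f"without request_id: {low} streams, with request_id: {high} streams")
```

Without `request_id` the stream count stays at the number of pods; with it, the count tracks the number of requests, so it keeps growing for as long as traffic arrives.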
Question 1
What makes Loki cheaper than Elasticsearch for the same log volume?
Show Answer
Loki only indexes labels, not log content. Elasticsearch indexes every word in every log line. For 1TB of logs, Elasticsearch might need 800GB of index storage, while Loki needs only a few GB for label indices. The logs themselves are stored compressed without content indexing.
This trade-off means Loki queries are “grep at scale” - they must scan log content at query time. But for most use cases, filtering by labels first makes this fast enough.
Question 2
Given this LogQL query, explain what it returns:
topk(3,
  sum by (pod) (
    count_over_time({namespace="prod", app="api"} |= "error" | json | status_code >= 500 [1h])
  )
)
Show Answer
This query returns the top 3 pods with the most 5xx errors in the last hour.
Breaking it down:
- {namespace="prod", app="api"}: Select logs from API pods in production
- |= "error": Filter lines containing “error”
- | json: Parse as JSON
- | status_code >= 500: Keep only 5xx status codes
- count_over_time(...[1h]): Count matching lines in 1-hour windows
- sum by (pod): Sum counts per pod
- topk(3, ...): Return top 3 results
This is useful for finding which pods are generating the most errors during an incident.
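The same pipeline can be mimicked in plain Python to see what each stage contributes. This is a sketch over hand-written sample lines, not how Loki evaluates queries; `top_error_pods` is a hypothetical helper. One simplification is labelled in the code: lines that are not valid JSON are skipped here, whereas Loki marks them with an `__error__` label.

```python
import json
from collections import Counter


def top_error_pods(entries, k=3):
    """Emulate: topk(k, sum by (pod) (count_over_time(... |= "error" | json | status_code >= 500)))."""
    counts = Counter()
    for pod, line in entries:
        if "error" not in line:          # |= "error"
            continue
        try:
            fields = json.loads(line)    # | json
        except json.JSONDecodeError:
            continue                     # simplification: Loki would tag the line with __error__
        if not isinstance(fields, dict):
            continue
        status = fields.get("status_code")
        if isinstance(status, int) and status >= 500:  # | status_code >= 500
            counts[pod] += 1             # count_over_time + sum by (pod)
    return counts.most_common(k)         # topk(k, ...)


entries = [
    ("api-1", '{"level":"error","status_code":500,"msg":"db error"}'),
    ("api-1", '{"level":"error","status_code":502,"msg":"upstream error"}'),
    ("api-2", '{"level":"error","status_code":500,"msg":"db error"}'),
    ("api-2", '{"level":"warn","status_code":404,"msg":"error: not found"}'),
    ("api-3", '{"level":"info","status_code":200,"msg":"ok"}'),
    ("api-3", "plain text error line"),
]
top = top_error_pods(entries)
print(top)
```

Note how the 404 line passes the text filter because it contains "error" but is removed by the status filter: the cheap line filter narrows the set, and the parsed label filter makes it precise.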
Question 3
Why would you use a StatefulSet for Loki ingesters instead of a Deployment?
Show Answer
Ingesters need stable network identities and persistent storage for the Write-Ahead Log (WAL).
- Stable identity: The hash ring uses pod names to distribute log streams. If pod names change on restart (as with Deployments), streams could be assigned to wrong ingesters.
- WAL persistence: Ingesters buffer logs in memory and write to a WAL before flushing to object storage. If a pod crashes, the WAL allows recovery of unflushed data. This requires persistent volumes that survive pod restarts.
- Ordered deployment: StatefulSets deploy pods one at a time, allowing the cluster to rebalance between each. Deployments might start all replicas simultaneously, causing thundering herd issues.
Question 4
How would you configure Promtail to combine Java stack traces into a single log entry?
Show Answer
Use the multiline pipeline stage with a regex that matches the start of a new log entry:
pipeline_stages:
  - multiline:
      firstline: '^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}'
      max_wait_time: 3s
      max_lines: 128
This tells Promtail:
- A new log entry starts with a timestamp (ISO format in this case)
- Wait up to 3 seconds for more lines before flushing
- Combine up to 128 lines into one entry
Without this, each line of a stack trace becomes a separate log entry, making them impossible to search together.
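The grouping rule is easy to reproduce outside Promtail. The following Python sketch applies the same idea (a `firstline` regex starts a new entry, `max_lines` caps its size) to a list of lines. `combine_multiline` is a hypothetical function written for illustration, and it does not model `max_wait_time`, which only matters for live tailing.

```python
import re


def combine_multiline(lines, firstline=r"^\d{4}-\d{2}-\d{2}", max_lines=128):
    """Group raw lines into entries; a line matching `firstline` starts a new entry."""
    start = re.compile(firstline)
    entries, current = [], []
    for line in lines:
        # Flush the entry being built when a new one starts or the cap is hit.
        if current and (start.search(line) or len(current) >= max_lines):
            entries.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        entries.append("\n".join(current))
    return entries


raw = [
    "2024-01-15T10:30:00 ERROR request failed",
    "java.lang.RuntimeException: boom",
    "    at com.example.Main.main(Main.java:5)",
    "2024-01-15T10:30:01 INFO request ok",
]
combined = combine_multiline(raw)
print(len(combined))
```

The three lines of the stack trace end up in one entry, so a search for the exception class also returns the timestamped line that introduced it.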
Question 5
Calculate: You have 500GB of logs per day. Elasticsearch costs $0.30/GB/month for hot storage. Loki with S3 costs $0.023/GB/month. What’s the monthly and annual savings?
Show Answer
Monthly calculation:
Daily volume:   500 GB
Monthly volume: 500 × 30 = 15,000 GB

Elasticsearch: 15,000 × $0.30 = $4,500/month
Loki + S3:     15,000 × $0.023 = $345/month

Monthly savings: $4,155 (92% reduction)
Annual calculation:
Elasticsearch: $4,500 × 12 = $54,000/year
Loki + S3:     $345 × 12 = $4,140/year

Annual savings: $49,860
Caveats:
- Elasticsearch may need 3x replication (multiply by 3)
- Loki needs querier/ingester compute (add ~$500-1000/month)
- S3 has request costs (add ~$50-100/month for GET/PUT)
- True savings: ~85-90% after all factors
Real-world: Companies typically see 80-95% cost reduction migrating from ELK to Loki.
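The arithmetic above can be wrapped in a small Python helper so the caveats are easy to plug in. This is a back-of-the-envelope sketch: `monthly_storage_cost` is a hypothetical function, and it uses the same simplification as the answer (one month of retained data at a flat per-GB price).

```python
def monthly_storage_cost(gb_per_day, price_per_gb_month, days=30, replication=1):
    """Cost of holding one month of logs at a flat per-GB monthly price."""
    return gb_per_day * days * replication * price_per_gb_month


es_monthly = monthly_storage_cost(500, 0.30)
loki_monthly = monthly_storage_cost(500, 0.023)
monthly_savings = es_monthly - loki_monthly
reduction_pct = 100 * monthly_savings / es_monthly

print(f"Elasticsearch: ${es_monthly:,.0f}/month")
print(f"Loki + S3:     ${loki_monthly:,.0f}/month")
print(f"Savings:       ${monthly_savings:,.0f}/month ({reduction_pct:.0f}%)")
print(f"Annual:        ${monthly_savings * 12:,.0f}")

# Caveat from the list above: 3x replication on the Elasticsearch side.
es_replicated = monthly_storage_cost(500, 0.30, replication=3)
print(f"Elasticsearch with 3x replication: ${es_replicated:,.0f}/month")
```

Compute and request costs on the Loki side are not modelled here; add them as flat monthly amounts to approximate the "true savings" figure.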
Question 6
Your LogQL query {app="api"} |= "error" returns 10 million results and times out. How would you make it efficient?
Show Answer
Step 1: Add time constraints (most important):
# A log query takes its window from the request, not from the query text:
# set the Grafana time picker (or the API start/end parameters) to the last hour
{app="api"} |= "error"
Step 2: Add more label filters:
{app="api", namespace="production", level="error"}
Step 3: Use aggregation instead of raw logs:
# Count errors per 5 minutes
sum by (service) (
  count_over_time({app="api"} |= "error" [5m])
)
Step 4: Limit output:
# The line limit is also a request setting: Grafana's "Line limit" option,
# logcli --limit, or the API limit parameter
{app="api"} |= "error"
Step 5: Parse and filter:
{app="api"} | json | level="error" | status_code >= 500
Root cause: Loki scans all chunks matching labels. More specific labels + shorter time range = fewer chunks scanned.
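When querying Loki's HTTP API directly, the time window and the line limit travel as request parameters next to the LogQL string. The Python sketch below only builds the URL for the `query_range` endpoint and sends nothing. `build_query_range_url` is a hypothetical helper, and `http://loki:3100` is the service address used elsewhere in this module.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlencode, urlsplit

NANOS_PER_SECOND = 1_000_000_000


def build_query_range_url(base_url, logql, end, lookback=timedelta(hours=1), limit=1000):
    """Build a query_range URL with an explicit time window and line limit."""
    start = end - lookback
    params = {
        "query": logql,
        # Timestamps as nanosecond Unix epochs (whole seconds are enough here).
        "start": int(start.timestamp()) * NANOS_PER_SECOND,
        "end": int(end.timestamp()) * NANOS_PER_SECOND,
        "limit": limit,
        "direction": "backward",  # newest lines first
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


end = datetime(2024, 1, 15, 11, 0, tzinfo=timezone.utc)
url = build_query_range_url("http://loki:3100", '{app="api"} |= "error"', end)
params = parse_qs(urlsplit(url).query)
print(url)
```

Keeping the window and the limit outside the query text means the same LogQL expression can be reused for a 5-minute drill-down or a 24-hour sweep by changing only the request parameters.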
Question 7
What’s the difference between count_over_time(), rate(), and bytes_rate() in LogQL?
Show Answer
| Function | Returns | Use Case |
|---|---|---|
| count_over_time({...}[5m]) | Number of log lines | “How many errors in 5 minutes?” |
| rate({...}[5m]) | Log lines per second | “What’s the error rate per second?” |
| bytes_rate({...}[5m]) | Bytes per second | “How much log data am I ingesting?” |
Relationship:
rate() = count_over_time() / range_in_seconds
# These are equivalent:
rate({app="api"}[5m])
# ≈ count_over_time({app="api"}[5m]) / 300
Practical examples:
# Alert: more than 100 errors per second
# (rate() is per-second; for a per-minute threshold use count_over_time(...[1m]) > 100)
rate({app="api"} |= "error" [1m]) > 100

# Dashboard: Errors per 5-minute window
count_over_time({app="api"} |= "error" [5m])

# Capacity planning: Log ingestion rate
sum(bytes_rate({namespace="production"}[5m]))
Question 8
How would you set up Loki alerting to page when a specific error message appears more than 10 times in 5 minutes?
Show Answer
Loki ruler configuration:
groups:
  - name: critical-errors
    interval: 1m
    rules:
      - alert: CriticalDatabaseError
        expr: |
          sum(
            count_over_time(
              {app="api"} |= "FATAL: database connection failed" [5m]
            )
          ) > 10
        for: 0m  # Alert immediately when condition is true
        labels:
          severity: critical
          team: database
        annotations:
          summary: "Database connection failures spiking"
          description: "{{ $value }} database connection failures in last 5 minutes"
          runbook: "https://wiki.company.com/runbooks/db-connection"

      - alert: PaymentProcessingError
        expr: |
          sum by (payment_provider) (
            count_over_time(
              {app="payment-service"}
                | json
                | level="error"
                | error_type="payment_declined" [5m]
            )
          ) > 10
        for: 2m  # Sustained for 2 minutes
        labels:
          severity: warning
        annotations:
          summary: "Payment failures for {{ $labels.payment_provider }}"
Key elements:
- for: 0m vs for: 2m: How long the condition must be true
- sum(): Aggregate across all matching streams
- by (label): Group results for per-dimension alerting
- Route to Alertmanager just like Prometheus alerts
Hands-On Exercise
Scenario: Debug a Production Issue
You’re on-call and receive an alert: “API error rate spiked.” Use Loki to investigate.
# Create kind cluster
kind create cluster --name loki-lab

# Install Loki stack
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install loki grafana/loki-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.enabled=true \
  --set prometheus.enabled=true \
  --set promtail.enabled=true

# Wait for pods
kubectl -n monitoring wait --for=condition=ready pod -l app.kubernetes.io/name=grafana --timeout=120s

# Get Grafana password
kubectl -n monitoring get secret loki-grafana -o jsonpath="{.data.admin-password}" | base64 -d
echo

# Port forward
kubectl -n monitoring port-forward svc/loki-grafana 3000:80 &
Deploy Test Application
apiVersion: apps/v1
kind: Deployment
metadata:
  name: error-generator
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: error-generator
  template:
    metadata:
      labels:
        app: error-generator
    spec:
      containers:
        - name: app
          image: busybox
          command: ["/bin/sh", "-c"]
          args:
            - |
              while true; do
                # Generate mix of logs
                echo '{"level":"info","msg":"Request processed","status_code":200,"path":"/api/users","duration":"50ms"}'
                echo '{"level":"info","msg":"Request processed","status_code":200,"path":"/api/orders","duration":"120ms"}'

                # Occasionally generate errors
                if [ $((RANDOM % 5)) -eq 0 ]; then
                  echo '{"level":"error","msg":"Database connection failed","status_code":500,"path":"/api/users","error":"connection timeout"}'
                fi
                if [ $((RANDOM % 10)) -eq 0 ]; then
                  echo '{"level":"error","msg":"External API error","status_code":502,"path":"/api/payments","error":"upstream unavailable"}'
                fi

                sleep 1
              done
kubectl apply -f error-app.yaml
Investigation Tasks
- Find the error spike
  - Open Grafana (http://localhost:3000)
  - Go to Explore, select Loki
  - Query:
    {app="error-generator"} |= "error"
  - Observe error patterns
- Calculate error rate
  sum(rate({app="error-generator"} |= "error" [1m]))
- Break down by error type
  sum by (path, status_code) (
    count_over_time({app="error-generator"} | json | level="error" [5m])
  )
- Find slow requests
  {app="error-generator"}
    | json
    | duration > 100ms
    | line_format "{{.path}} took {{.duration}}"
- Create a dashboard showing:
  - Error rate over time
  - Errors by path (pie chart)
  - Recent error logs (logs panel)
Success Criteria
- Can query logs by namespace and app
- Can filter by log level using JSON parser
- Can calculate error rates
- Can create LogQL aggregation queries
- Dashboard shows error trends
Cleanup
kind delete cluster --name loki-lab
Key Takeaways
Before moving on, ensure you can:
- Explain why Loki is cheaper than Elasticsearch (index-free design, label-based)
- Calculate approximate storage costs: logs/day × compression ratio × retention
- Configure Promtail with pipeline stages for JSON extraction and multiline parsing
- Write LogQL queries using stream selectors, line filters, and parsers
- Calculate log rates using rate() and count_over_time() functions
- Identify and avoid high-cardinality labels (request IDs, user IDs as labels)
- Set up Grafana dashboards with log panels and LogQL variables
- Configure recording rules for expensive queries run frequently
- Design retention policies balancing compliance and cost
- Troubleshoot common issues: dropped logs, stream explosion, ingestion limits
Next Module
Continue to Module 1.5: Distributed Tracing where we’ll connect logs to traces using Jaeger and Tempo.
“Logs tell the story. Labels help you find the right chapter. LogQL lets you skip to the good parts.”