Module 1.2: OTel Collector Advanced
Complexity:
[COMPLEX]- Multiple interacting components, pipeline logicTime to Complete: 60-75 minutes
Prerequisites: Module 1 (OpenTelemetry Fundamentals), basic Kubernetes knowledge
OTCA Domain: Domain 3 - OTel Collector (26% of exam)
Learning Outcomes
Section titled “Learning Outcomes”After completing this module, you will be able to:
- Design multi-pipeline Collector configurations that route traces, metrics, and logs through distinct receiver, processor, connector, and exporter chains.
- Configure advanced processors, including
memory_limiter,filter,transform,tail_sampling, andbatch, to reduce telemetry volume while preserving critical signals. - Deploy the Collector as a Kubernetes 1.35 DaemonSet agent and Deployment gateway with resource limits, health checks, and scaling behavior that match the workload.
- Diagnose Collector pipeline issues using the debug exporter, zpages, and Collector internal metrics to identify bottlenecks, drops, and exporter failures.
- Evaluate OTLP transports, connectors, and Collector distributions so you can choose between gRPC, HTTP, span-derived metrics, Core, Contrib, and custom builds.
Why This Module Matters
Section titled “Why This Module Matters”Hypothetical scenario: your platform team has just replaced three vendor agents with OpenTelemetry Collectors across a Kubernetes 1.35 cluster. The pods are Ready, the health endpoint returns success, and application teams are already sending OTLP traces, Prometheus-format metrics, and container logs into the new path. On Monday morning, the incident review asks why slow checkout traces are missing while noisy readiness checks still appear in the backend. The Collector did not crash, and Kubernetes did not report a failed rollout; the failure lives in the pipeline logic between receiver, processor, connector, and exporter.
That is the operational reason this module spends so much time on configuration shape. The Collector is not merely a sidecar that forwards whatever it sees. It is a programmable telemetry data plane: it receives data through multiple protocols, applies ordered processors, bridges signals through connectors, sends data to one or more backends, and exposes its own health and debugging surfaces. A configuration can be syntactically valid while still dropping the exact spans you need, duplicating cluster metrics, or making tail-sampling decisions with incomplete traces.
For the OTCA exam, Domain 3 matters because it tests whether you can reason about that data plane under constraints, not whether you can memorize a single example file. For real operations, the same skill decides whether observability becomes a reliable troubleshooting tool or another distributed system that needs debugging during an outage. In this lesson, you will build from the Collector’s config anatomy to multi-signal production patterns, then practice validating a working pipeline with debug output, zpages, and internal metrics.
Collector Architecture and Configuration Anatomy
Section titled “Collector Architecture and Configuration Anatomy”The Collector configuration is best read as a wiring diagram rather than as a long YAML file. Receivers describe how telemetry enters, processors describe what happens to it in memory, exporters describe where it leaves, connectors bridge one pipeline into another, extensions expose supporting services, and the service section decides which components are actually active. A component definition by itself is only inventory; the pipeline list is the assembly line that makes it run.
That distinction prevents a common exam and production mistake. You can declare a filter processor, a debug exporter, or a zpages extension in the right top-level section, but none of those components affects telemetry until the service stanza references them. Treat the top-level component blocks like parts on a workbench, then treat service.pipelines as the exact order in which those parts are bolted into the machine.
# The five building blocks of every Collector configreceivers: # How data gets IN to the Collectorprocessors: # How data gets TRANSFORMED inside the Collectorexporters: # How data gets OUT of the Collectorconnectors: # Bridge between pipelines (output of one, input of another)extensions: # Auxiliary services (health checks, auth, debugging)
service: # Wires everything together into pipelines extensions: [health_check, zpages] pipelines: traces: receivers: [otlp] processors: [batch] exporters: [otlp/backend] metrics: receivers: [otlp, prometheus] processors: [batch] exporters: [prometheusremotewrite] logs: receivers: [otlp, filelog] processors: [batch] exporters: [otlp/backend]The most important design rule is that processors execute in the order listed in a pipeline. A memory_limiter near the end is less useful because data has already accumulated in earlier processors, while batch near the beginning can increase memory pressure before later filters remove unwanted telemetry. The exam often presents this as a simple ordering question, but the deeper lesson is that each processor changes the risk profile of the next component.
┌─────────────────────────────────────────────────────────────────┐│ OTel Collector ││ ││ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││ │Receivers │──▶│Processors│──▶│Exporters │──▶│ Backends │ ││ │ │ │ │ │ │ │ │ ││ │ otlp │ │ batch │ │ otlp │ │ Jaeger │ ││ │ prometheus│ │ filter │ │ prometheus│ │ Prometheus│ ││ │ filelog │ │ transform│ │ debug │ │ Loki │ ││ └──────────┘ └──────────┘ └──────────┘ └──────────┘ ││ ││ ┌────────────────────────────────────────────────────────────┐ ││ │ Extensions: health_check, zpages, pprof, bearertokenauth │ ││ └────────────────────────────────────────────────────────────┘ │└─────────────────────────────────────────────────────────────────┘Notice that the diagram shows extensions beside the pipeline rather than inside it. Extensions do not transform telemetry records, but they can be essential for operating the Collector safely. Health checks make Kubernetes probes meaningful, zpages help you inspect live pipeline state, pprof supports profiling, and authentication extensions let receivers or exporters enforce basic trust boundaries before telemetry crosses namespace or network edges.
Receivers define the front door of the Collector. The OTLP receiver is the universal input for OpenTelemetry-native traffic, and it commonly listens on gRPC port 4317 and HTTP port 4318. The Prometheus receiver scrapes metrics endpoints, the filelog receiver reads node log files, the hostmetrics receiver collects local system metrics, and the Kubernetes cluster receiver reads cluster-wide state from the API server. Those receivers solve different collection problems, so you should not run all of them everywhere just because the distribution includes them.
receivers: otlp: protocols: grpc: endpoint: 0.0.0.0:4317 max_recv_msg_size_mib: 4 # Default: 4 MiB http: endpoint: 0.0.0.0:4318 cors: allowed_origins: ["*"] # For browser-based appsWhen you choose between OTLP/gRPC and OTLP/HTTP, think first about the network path rather than the application language. gRPC is efficient for service-to-Collector and Collector-to-Collector traffic when you control HTTP/2 routing, but HTTP is friendlier to browsers, older proxies, and debugging with ordinary tools. If a question says a browser is sending telemetry directly, or a proxy cannot handle HTTP/2 cleanly, OTLP/HTTP is usually the pragmatic answer.
| Aspect | gRPC (:4317) | HTTP (:4318) |
|---|---|---|
| Performance | Higher throughput, streaming | Slightly lower |
| Compression | Built-in (gzip, zstd) | Requires config |
| Firewall-friendly | No (HTTP/2, specific ports) | Yes (standard HTTP) |
| Browser support | No (needs proxy) | Yes (for web apps) |
| Best for | Service-to-collector, collector-to-collector | Browser RUM, edge ingestion |
Prometheus, file, host, and Kubernetes receivers add useful non-OTLP entry points, but they also tie the Collector to a deployment location. A filelog receiver needs node filesystem access, so it belongs on an agent DaemonSet. A k8s_cluster receiver reads global Kubernetes state, so running it on every node duplicates metrics and increases API pressure. Before running this, what output do you expect from each receiver if it is moved from an agent to a gateway, and which receivers would simply stop seeing their data source?
receivers: prometheus: config: scrape_configs: - job_name: 'k8s-pods' scrape_interval: 15s kubernetes_sd_configs: - role: podPrometheus Kubernetes service discovery also needs Kubernetes API RBAC, typically get, list, and watch on the resource kinds it discovers, such as pods, endpoints, and services in the namespaces it scrapes.
receivers: filelog: include: [/var/log/pods/*/*/*.log] operators: - type: json_parser timestamp: parse_from: attributes.time layout: '%Y-%m-%dT%H:%M:%S.%LZ'receivers: hostmetrics: collection_interval: 30s scrapers: cpu: {} memory: {} disk: {} filesystem: {} network: {} load: {}receivers: k8s_cluster: collection_interval: 30s node_conditions_to_report: [Ready, MemoryPressure] allocatable_types_to_report: [cpu, memory]The k8s_cluster receiver is the clean example of “right component, wrong placement.” It needs Kubernetes API permissions, and it observes objects that are already global from the perspective of a namespace. If you run it as a DaemonSet on every node, each agent reports similar cluster-level facts; if you run one gateway instance or one coordinated gateway Deployment, you get the same information with less duplicate work and less RBAC sprawl.
Processors are where the Collector becomes a data shaping system instead of a forwarding proxy. The memory_limiter protects the process before buffering grows, batch improves exporter efficiency, filter removes telemetry you intentionally do not want, attributes edits attributes, transform applies OTTL statements, and tail_sampling delays decisions until it can evaluate a trace. Those processors can all be valid, but the order decides whether they reduce risk or amplify it.
processors: batch: send_batch_size: 8192 # Number of items per batch send_batch_max_size: 10000 # Hard upper limit timeout: 200ms # Flush interval even if batch isn't fullThe batch processor is almost always present because exporters are more efficient when they send groups of telemetry items rather than one record at a time. Batching reduces request overhead and improves compression, but it also means the Collector briefly holds more data in memory. That tradeoff is why batch normally appears late in a pipeline, after processors have limited memory, dropped unwanted data, and redacted fields that should not leave the cluster.
processors: memory_limiter: check_interval: 1s limit_mib: 512 # Hard limit spike_limit_mib: 128 # Buffer for spikesIn this memory limiter example, the processor begins refusing data when memory approaches the configured limit minus the spike allowance, and it force-drops under harder pressure. That behavior is not a substitute for correct resource requests and limits, but it gives the Collector a controlled failure mode before the container is killed. It is usually better to lose some telemetry intentionally and visibly than to have Kubernetes restart the entire pipeline without preserving context.
processors: filter: error_mode: ignore traces: span: - 'attributes["http.route"] == "/healthz"' # Drop health checks - 'attributes["http.route"] == "/readyz"' metrics: metric: - 'name == "http.server.duration" and resource.attributes["service.name"] == "debug-svc"'The filter processor should be treated like a firewall rule for observability data. It is powerful because it removes low-value signals close to the source, but a broad condition can silently discard evidence you will need later. Use error_mode: ignore so a malformed record does not stop the processor, and validate the filter with the debug exporter before sending traffic only to the long-term backend.
processors: attributes: actions: - key: environment value: production action: upsert # Insert or update - key: db.password action: delete # Remove sensitive data - key: user.email action: hash # Hash PIIAttribute processing gives you a predictable way to normalize resource and span metadata before downstream queries depend on it. Adding an environment attribute can make dashboards consistent, deleting a password-like attribute prevents accidental exposure, and hashing an email address preserves grouping without retaining the original value. The key operational habit is to make these transformations explicit and reviewed, because observability metadata often becomes part of alerting, retention, and cost controls.
processors: transform: error_mode: ignore trace_statements: - context: span statements: - set(attributes["deployment.env"], "prod") where resource.attributes["k8s.namespace.name"] == "production" - truncate_all(attributes, 256) # Limit attribute value length - replace_pattern(attributes["http.url"], "token=([^&]*)", "token=***") metric_statements: - context: datapoint statements: - convert_sum_to_gauge() where metric.name == "system.cpu.time" log_statements: - context: log statements: - merge_maps(attributes, ParseJSON(body), "insert") where IsMatch(body, "^\\{")OTTL, the OpenTelemetry Transformation Language, is a high-leverage exam topic because it appears wherever the Collector needs more expressive changes than simple attribute actions. Functions such as set, delete, truncate_all, replace_pattern, merge_maps, ParseJSON, and IsMatch let you alter telemetry based on context. The risk is that expressive rules deserve the same review discipline as application code, especially when they touch URLs, identifiers, or signal names that dashboards rely on.
processors: tail_sampling: decision_wait: 10s # Wait for trace to complete num_traces: 100000 # Traces held in memory policies: - name: errors-always type: status_code status_code: {status_codes: [ERROR]} - name: slow-traces type: latency latency: {threshold_ms: 1000} - name: low-volume-sample type: probabilistic probabilistic: {sampling_percentage: 10}Tail sampling is different from head sampling because it waits until enough spans have arrived to judge the trace. That makes policies such as “keep all errors, keep all slow traces, sample the rest” possible, but it also means the Collector must hold traces in memory and route every span for a trace to the same decision point. Pause and predict: what do you think happens if a trace is split across two gateway replicas before tail sampling runs?
Exporters, Connectors, and OTLP Transport Choices
Section titled “Exporters, Connectors, and OTLP Transport Choices”Exporters are the back door of the Collector, and their behavior often determines whether a healthy-looking pipeline is actually delivering data. An OTLP exporter can send to another Collector or an observability backend, OTLP/HTTP can cross environments where gRPC is awkward, Prometheus can expose a scrape endpoint, debug can print records for validation, and file can write telemetry to disk. Each exporter has reliability and security settings that matter as much as the destination name.
exporters: otlp: endpoint: tempo.observability.svc.cluster.local:4317 tls: insecure: false cert_file: /certs/client.crt key_file: /certs/client.key compression: gzip # or zstd retry_on_failure: enabled: true initial_interval: 5s max_interval: 30s max_elapsed_time: 300sThe OTLP exporter is the default answer for Collector-to-Collector and Collector-to-backend communication because it preserves OpenTelemetry semantics cleanly. TLS and retry settings deserve explicit review: internal cluster traffic may use service mesh or private network controls, while traffic leaving the cluster should have transport security and clear retry limits. Unlimited retry pressure can make a backend outage worse, but no retry can turn a short network interruption into preventable data loss.
exporters: otlphttp: endpoint: https://ingest.example.com compression: gzip headers: Authorization: "Bearer ${env:API_TOKEN}"OTLP/HTTP is the practical choice when the network path favors ordinary HTTP handling, but the authentication example also shows why examples should use environment variables and placeholders rather than embedded secrets. A Collector ConfigMap is often visible to several platform roles, and pushing realistic credentials into examples or Git history is both unnecessary and dangerous. Keep sensitive values in Kubernetes Secrets or an external secret manager, then reference them intentionally.
exporters: prometheus: endpoint: 0.0.0.0:8889 namespace: otel resource_to_telemetry_conversion: enabled: true # Promote resource attributes to labelsThe Prometheus exporter flips the usual push pattern into a scrape pattern. That can be useful when Prometheus is already the metrics backend, but promoting resource attributes to labels should be done carefully because label cardinality affects storage, query speed, and cost. A service name and namespace are usually helpful labels; user IDs, raw URLs, or request-specific values are almost always a problem.
exporters: debug: verbosity: detailed # basic | normal | detailed sampling_initial: 5 # First N items logged sampling_thereafter: 200 # Then every Nth itemThe debug exporter is not a production backend, but it is one of the safest ways to prove a pipeline is working before you depend on it. Add it beside the real exporter during rollout, send a known trace or metric, and confirm that the transformed record looks the way you expect. Remove or reduce verbose debug output after validation because detailed telemetry logs can grow quickly and may include sensitive attributes if redaction is not complete.
exporters: file: path: /data/otel-output.json rotation: max_megabytes: 100 max_days: 7 max_backups: 5Connectors are special because they behave as an exporter in one pipeline and a receiver in another. The spanmetrics connector is the classic example: traces enter the traces pipeline, the connector derives RED-style metrics, and those generated metrics enter the metrics pipeline. This lets you create request rate, error, and duration metrics from trace data, but it also means dimension choices can create high-cardinality metrics if you include fields that vary per request.
connectors: spanmetrics: histogram: explicit: buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 500ms, 1s, 5s] dimensions: - name: http.method - name: http.status_code namespace: traces.spanmetrics
service: pipelines: traces: receivers: [otlp] processors: [batch] exporters: [otlp/tempo, spanmetrics] # spanmetrics is an exporter here metrics: receivers: [otlp, spanmetrics] # spanmetrics is a receiver here processors: [batch] exporters: [prometheus]Read that configuration slowly because it teaches the connector mental model better than a definition does. The traces pipeline exports to both Tempo and spanmetrics, while the metrics pipeline receives from both ordinary OTLP and spanmetrics. If you see a connector in only one side of that relationship, the derived signal will not appear where you expect it.
connectors: count: traces: spans: - name: span.count description: "Count of spans" logs: log_records: - name: log.record.count description: "Count of log records"OTLP itself has two common transports, and the right answer depends on constraints rather than preference. gRPC uses HTTP/2 and Protocol Buffers, supports streaming patterns, and is a strong internal default. HTTP can use Protobuf or JSON over request-response paths such as /v1/traces, /v1/metrics, and /v1/logs, which makes it easier to work through proxies and easier to inspect with ordinary HTTP tooling.
| Feature | OTLP/gRPC | OTLP/HTTP |
|---|---|---|
| Transport | HTTP/2 with Protocol Buffers | HTTP/1.1 with Protobuf or JSON |
| Port | 4317 | 4318 |
| Compression | gzip, zstd (built-in) | gzip (via Content-Encoding) |
| Streaming | Yes (bidirectional) | No (request/response) |
| Path (traces) | N/A (gRPC service) | /v1/traces |
| Path (metrics) | N/A | /v1/metrics |
| Path (logs) | N/A | /v1/logs |
| Proxy support | Needs HTTP/2-aware proxy | Works with any HTTP proxy |
Compression should be a deliberate exporter setting in production. Telemetry data contains repeated attribute names, service names, and resource metadata, so compression often provides meaningful bandwidth reduction. zstd can be attractive for internal traffic when supported end to end, while gzip is the more broadly compatible choice for external endpoints and mixed infrastructure.
exporters: otlp: endpoint: gateway:4317 compression: zstd # Best ratio for telemetry data otlphttp: endpoint: https://ingest.example.com compression: gzip # More widely supportedWhich approach would you choose here and why: an internal agent-to-gateway path in a cluster you control, or browser-originated telemetry that must cross a corporate proxy? The first case usually favors OTLP/gRPC with efficient compression and service discovery. The second usually favors OTLP/HTTP, CORS-aware ingestion, and stricter attention to authentication and rate limits at the edge.
Kubernetes Deployment Patterns and the Operator
Section titled “Kubernetes Deployment Patterns and the Operator”Kubernetes changes the Collector design conversation because placement controls what data the Collector can see. A DaemonSet agent runs one Collector per node, which makes it good for local logs, host metrics, and near-source buffering. A Deployment gateway runs a shared pool, which makes it good for aggregation, tail sampling, routing, and backend-specific export policy. The most common production pattern uses both because neither placement solves every problem alone.
Agent Mode (DaemonSet) Gateway Mode (Deployment)────────────────────── ─────────────────────────
┌─────────────────────┐ ┌─────────────────────┐│ Node 1 │ │ Node 1 ││ ┌─────┐ ┌────────┐ │ │ ┌─────┐ ││ │App A│─▶│Collector│─┤ │ │App A│──┐ ││ └─────┘ │(Agent) │ │ │ └─────┘ │ ││ ┌─────┐ │ │ │ │ ┌─────┐ │ ││ │App B│─▶│ │ │ │ │App B│──┤ ││ └─────┘ └───┬────┘ │ │ └─────┘ │ │└─────────────┼──────┘ └──────────┼──────────┘ │ │ ▼ │┌─────────────────────┐ ││ Node 2 │ ┌─────────▼──────────┐│ ┌─────┐ ┌────────┐ │ │ Gateway Collector ││ │App C│─▶│Collector│─┤───▶Backend │ (Deployment, 2+ ││ └─────┘ │(Agent) │ │ │ replicas) │──▶Backend│ └───┬────┘ │ │ │└─────────────┼──────┘ └─────────▲──────────┘ │ │ ▼ ┌──────────┼──────────┐ Backend │ Node 2 │ │ ┌─────┐ │ │ │ │App C│──┘ │ │ └─────┘ │ └─────────────────────┘The agent and gateway split is also a scaling boundary. Agents scale with nodes and provide backpressure close to workloads, while gateways scale with telemetry volume and export complexity. Tail sampling belongs at the gateway because it needs enough spans from the same trace to make a decision, and host-level collection belongs at the agent because the gateway cannot read every node’s local files or system counters.
| Aspect | Agent (DaemonSet) | Gateway (Deployment) |
|---|---|---|
| Deployment | One per node | Shared pool (2+ replicas) |
| Resource use | Light per node | Heavier but centralized |
| Tail sampling | Not possible (incomplete traces) | Yes (full traces arrive) |
| Host metrics | Yes (local access) | No |
| Filelog | Yes (local files) | No |
| Scaling | Scales with nodes | HPA on CPU/memory |
| Best for | Collection, basic processing | Aggregation, sampling, routing |
The simple production chain is applications to node agents, agents to gateways, and gateways to one or more backends. That shape lets you keep node-local collection simple while centralizing expensive or policy-heavy processing. It also creates a clear failure model: if a backend is unavailable, gateways can absorb retry behavior without every application or node agent needing full backend-specific logic.
Apps ──▶ Agent (DaemonSet) ──▶ Gateway (Deployment) ──▶ Backends - hostmetrics - tail_sampling - filelog - spanmetrics - memory_limiter - routing - batch - export to N backendsHorizontal gateway scaling creates one subtle trace problem: all spans of a trace must reach the same gateway if tail sampling is enabled. Ordinary round-robin load balancing can split a trace across replicas, so each gateway sees an incomplete story and makes a poor decision. The load balancing exporter solves that by routing based on trace ID and discovering gateway instances through a resolver such as DNS.
# On the Agentexporters: loadbalancing: protocol: otlp: tls: insecure: true resolver: dns: hostname: otel-gateway-headless.observability.svc.cluster.local port: 4317The OpenTelemetry Operator adds Kubernetes-native management on top of those deployment modes. Instead of hand-writing every Deployment, DaemonSet, Service, and ConfigMap, you can describe an OpenTelemetryCollector custom resource and let the Operator reconcile the underlying objects. It also supports auto-instrumentation through the Instrumentation custom resource, which is useful when application teams need a low-friction starting point.
# Install cert-manager first (required dependency)kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.5/cert-manager.yaml
# Install the OTel Operatorkubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yamlOperator-managed Collectors are still Collectors, so the same pipeline reasoning applies. The mode field chooses daemonset, deployment, statefulset, or sidecar, while spec.config contains the receiver, processor, exporter, connector, extension, and service configuration. Do not let the custom resource abstraction hide the data-flow questions you already learned to ask.
apiVersion: opentelemetry.io/v1beta1kind: OpenTelemetryCollectormetadata: name: otel-agent namespace: observabilityspec: mode: daemonset # daemonset | deployment | statefulset | sidecar image: otel/opentelemetry-collector-contrib:0.98.0 config: receivers: otlp: protocols: grpc: endpoint: 0.0.0.0:4317 http: endpoint: 0.0.0.0:4318 processors: memory_limiter: check_interval: 1s limit_mib: 512 batch: {} exporters: otlp: endpoint: otel-gateway.observability.svc.cluster.local:4317 tls: insecure: true service: pipelines: traces: receivers: [otlp] processors: [memory_limiter, batch] exporters: [otlp]Auto-instrumentation is powerful because the Operator can inject language agents into pods through annotations, but it should be introduced with clear ownership. The application still needs compatible runtimes, predictable resource overhead, and an endpoint that can accept the resulting telemetry. Start with a small namespace or a single workload, then expand after you have verified trace quality, attribute names, and sampling behavior.
Before copying an Instrumentation manifest, confirm the CRD versions served by the installed Operator. Some distributions and exam contexts expect a newer served version such as v1alpha2, while older and upstream examples may still show v1alpha1 with conversion support.
apiVersion: opentelemetry.io/v1alpha1kind: Instrumentationmetadata: name: auto-instrumentation namespace: observabilityspec: exporter: endpoint: http://otel-agent-collector.observability.svc.cluster.local:4318 propagators: - tracecontext - baggage sampler: type: parentbased_traceidratio argument: "0.25" java: image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest python: image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest nodejs: image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latestapiVersion: apps/v1kind: Deploymentmetadata: name: my-java-appspec: template: metadata: annotations: instrumentation.opentelemetry.io/inject-java: "true" # Java auto-instrument # Other options: # instrumentation.opentelemetry.io/inject-python: "true" # instrumentation.opentelemetry.io/inject-nodejs: "true" # instrumentation.opentelemetry.io/inject-dotnet: "true" spec: containers: - name: app image: my-java-app:latestCollector distributions matter more as you move from lab to production. The Core distribution gives you a smaller, project-maintained base; Contrib gives you broad integration coverage; a custom distribution built with the OpenTelemetry Collector Builder includes only the components you select. The exam expects you to know the tradeoff, and production security reviews usually care about it because every unused component is dependency surface.
| Distribution | Components | Use Case |
|---|---|---|
Core (otel/opentelemetry-collector) | ~20 components (otlp, batch, debug, etc.) | Minimal footprint, security-sensitive environments |
Contrib (otel/opentelemetry-collector-contrib) | 200+ components (all community receivers, processors, exporters) | Development, when you need specific integrations |
Custom (built with ocb) | Exactly what you choose | Production — include only what you use |
The Collector Builder configuration is explicit dependency management for your telemetry data plane. You declare the distribution metadata and the exact receiver, processor, and exporter modules you want compiled in. That can reduce binary size, startup time, and audit scope, but it also means you own the build process and must update the selected components as OpenTelemetry releases move forward.
dist: name: my-collector description: "Production collector" output_path: ./dist otelcol_version: "0.98.0"
receivers: - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.98.0 - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/filelogreceiver v0.98.0
processors: - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.98.0 - gomod: go.opentelemetry.io/collector/processor/memorylimiterprocessor v0.98.0
exporters: - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.98.0 - gomod: go.opentelemetry.io/collector/exporter/debugexporter v0.98.0# Build itocb --config builder-config.yamlA custom Collector is not automatically better; it is better when you have a stable component set, a release process, and a reason to reduce footprint or dependency surface. During early experimentation, Contrib is often the faster path because it includes the receiver or exporter you are testing. Once the pipeline stabilizes, a custom distribution lets you remove components that were useful during discovery but unnecessary for long-term operation.
Debugging and Operating a Multi-Signal Pipeline
Section titled “Debugging and Operating a Multi-Signal Pipeline”Debugging a Collector starts with a basic question: did telemetry enter, change, and leave the pipeline the way you intended? Kubernetes pod health only answers whether the process is alive. The Collector’s own telemetry, debug exporter, and zpages answer whether receivers accepted data, processors dropped or transformed it, exporters sent it, and extensions are running. Those signals should be present in every serious rollout plan.
extensions: health_check: endpoint: 0.0.0.0:13133 # Liveness/readiness probe target
zpages: endpoint: 0.0.0.0:55679 # Internal debug UI at /debug/tracez, /debug/pipelinez
pprof: endpoint: 0.0.0.0:1777 # Go pprof profiling
bearertokenauth: token: "${env:OTEL_AUTH_TOKEN}"
service: extensions: [health_check, zpages, pprof, bearertokenauth]| Extension | Purpose | Default Port |
|---|---|---|
health_check | K8s liveness/readiness probes | 13133 |
zpages | Debug UI: pipeline status, trace samples | 55679 |
pprof | Performance profiling | 1777 |
bearertokenauth | Authenticate incoming/outgoing requests | N/A |
The debug exporter should be used as a temporary observability mirror. When a filter, transform, or connector is introduced, add debug beside the real exporter and send a known test record. If the record appears before a processor but not after it, you have narrowed the failure without guessing about the backend, network policy, or dashboard query.
exporters: debug: verbosity: detailed
service: pipelines: traces: receivers: [otlp] processors: [batch] exporters: [otlp/backend, debug] # Add debug alongside real exporterInternal telemetry is the Collector’s dashboard of itself. Receiver accepted counters show whether input is arriving, processor dropped counters show whether filtering or pressure is removing data, exporter sent counters show successful output, and exporter failed counters reveal backend or network problems. These metrics let you distinguish “the app emitted nothing” from “the Collector dropped it” from “the backend rejected it.”
service: telemetry: logs: level: debug # debug | info | warn | error encoding: json # For structured log parsing metrics: level: detailed # none | basic | normal | detailed address: 0.0.0.0:8888 # Collector's own /metrics endpointThe internal metric names are intentionally operational: otelcol_receiver_accepted_spans says spans arrived, otelcol_processor_dropped_spans says a processor removed spans, otelcol_exporter_sent_spans says spans left successfully, and otelcol_exporter_send_failed_spans says an exporter could not deliver. When you diagnose Collector pipeline issues, compare those counters by signal and pipeline before changing application instrumentation.
| Endpoint | What It Shows |
|---|---|
/debug/pipelinez | Active pipelines and their components |
/debug/tracez | Sample traces flowing through the Collector |
/debug/rpcz | gRPC call statistics |
/debug/extensionz | Running extensions |
# Port-forward to access zpageskubectl port-forward svc/otel-collector 55679:55679The following complete configuration ties the ideas together. It receives traces, metrics, and logs, scrapes Prometheus and host metrics, filters health checks, redacts sensitive values, generates span metrics, exports to separate backends, enables health and zpages, and exposes internal metrics. It is intentionally richer than a minimal exam answer because production failures usually happen where signals and policy meet.
receivers: otlp: protocols: grpc: endpoint: 0.0.0.0:4317 http: endpoint: 0.0.0.0:4318 prometheus: config: scrape_configs: - job_name: 'k8s-pods' scrape_interval: 15s kubernetes_sd_configs: - role: pod filelog: include: [/var/log/pods/*/*/*.log] operators: - type: json_parser hostmetrics: collection_interval: 30s scrapers: cpu: {} memory: {} disk: {}
processors: memory_limiter: check_interval: 1s limit_mib: 1024 spike_limit_mib: 256 batch: send_batch_size: 8192 timeout: 200ms filter/healthz: error_mode: ignore traces: span: - 'attributes["http.route"] == "/healthz"' - 'attributes["http.route"] == "/readyz"' transform/redact: error_mode: ignore trace_statements: - context: span statements: - replace_pattern(attributes["http.url"], "token=([^&]*)", "token=REDACTED") log_statements: - context: log statements: - replace_pattern(body, "password=\\S+", "password=***")
exporters: otlp/tempo: endpoint: tempo.observability.svc.cluster.local:4317 tls: insecure: true otlp/loki: endpoint: loki.observability.svc.cluster.local:3100 tls: insecure: true prometheus: endpoint: 0.0.0.0:8889 debug: verbosity: basic
connectors: spanmetrics: histogram: explicit: buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 500ms, 1s, 5s] dimensions: - name: http.method - name: http.status_code
extensions: health_check: endpoint: 0.0.0.0:13133 zpages: endpoint: 0.0.0.0:55679
service: extensions: [health_check, zpages] telemetry: logs: level: info metrics: address: 0.0.0.0:8888 pipelines: traces: receivers: [otlp] processors: [memory_limiter, filter/healthz, transform/redact, batch] exporters: [otlp/tempo, spanmetrics, debug] metrics: receivers: [otlp, prometheus, hostmetrics, spanmetrics] processors: [memory_limiter, batch] exporters: [prometheus, debug] logs: receivers: [otlp, filelog] processors: [memory_limiter, transform/redact, batch] exporters: [otlp/loki, debug]When reviewing a configuration like this, walk one signal at a time. For traces, OTLP input passes through memory protection, health-check filtering, redaction, batching, Tempo export, spanmetric generation, and debug output. For metrics, OTLP, Prometheus, host, and derived span metrics share a pipeline before Prometheus and debug export. For logs, OTLP and file input share redaction and batching before the Loki-style OTLP endpoint and debug mirror.
That walk-through is the fastest way to find hidden mistakes. If a processor exists but is not in the relevant signal pipeline, it cannot affect that signal. If a connector appears as an exporter but not as a receiver in another pipeline, its output has nowhere useful to go. If debug output is present but backend data is missing, the exporter, backend, authentication, or network path becomes the next investigation target.
Worked Review: Safely Changing a Collector Pipeline
Section titled “Worked Review: Safely Changing a Collector Pipeline”Imagine that the next change request is to reduce telemetry cost while keeping enough detail for incident response. A weak review would look only for valid YAML and perhaps confirm that the Collector pod restarts cleanly. A strong review follows the trace of data through the pipeline and asks what each component can drop, transform, buffer, or duplicate. That review style is slower on the first few attempts, but it prevents expensive surprises after the Collector becomes the shared path for every team.
Start with receiver intent. If applications send OTLP directly to a gateway, the gateway can receive application spans, but it cannot read each node’s container log files unless those files are mounted from the node. If agents scrape pod metrics, they need discovery permissions and a clear label strategy. If a browser sends OTLP/HTTP, CORS and authentication are not optional edge details; they are part of the receiver contract because the receiver is exposed to a different trust boundary.
Next, review processor order as a safety sequence. Memory protection should happen before expensive buffering, broad filtering should happen before backend export, redaction should happen before records cross a boundary, and batching should happen after most per-record work is complete. This order is not just about performance. It also decides which metrics you can use to explain a drop, because a processor can only report work that reaches it.
Then review every filter as if it were a production access rule. Health-check spans are usually safe to drop, but a condition that matches service names, URL prefixes, or status codes can remove evidence from real incidents. Use positive examples and negative examples when testing a filter: one record that should be dropped, one similar record that must survive, and one malformed record that proves error_mode behavior. Without those cases, a filter can pass a happy-path demo and still damage observability.
Transform rules deserve the same review because they can rename attributes that dashboards, alerts, and service maps already depend on. A replace_pattern that redacts a token is helpful; a broad set that overwrites a resource attribute can break grouping across an entire backend. When a transform changes names, write down the downstream query or dashboard that will consume the result, then validate that query against debug output before the change reaches a shared environment.
Sampling policy needs an even stricter placement review. Head sampling can happen near the source because the decision is made before a full trace exists, but tail sampling is a gateway concern because it uses facts discovered later in the trace. A policy that keeps errors and slow requests is only meaningful if all spans for a trace arrive at the same sampling processor. If traffic is balanced randomly across gateways, the policy is still configured but the data it evaluates is incomplete.
Connector review is about both directions of signal flow. When spanmetrics appears in a traces pipeline exporter list, it receives spans and creates metrics. When the same connector appears in a metrics pipeline receiver list, those created metrics enter the metrics path and can be exported. If either side is missing, the connector may look present while the derived signal never reaches the intended backend. That is why connector bugs often feel like silent failures.
Cardinality review belongs beside connector review because generated metrics can grow faster than hand-written metrics. The most tempting dimensions are often the most dangerous: raw URLs, user identifiers, pod names in short-lived workloads, or request IDs that change every call. Stable dimensions such as method, status code, route template, service name, and namespace usually preserve useful grouping without creating a new time series for every request.
Exporter review should separate delivery concerns from data-shaping concerns. If debug output shows a record after processing, the Collector produced the right payload. If the backend is missing it, investigate exporter endpoint, TLS, authentication, compression support, retries, network policy, and backend limits. Changing a filter or transform at that point can mask the real problem because the data path has already reached the exporter boundary.
Retry settings are especially easy to overlook because they sound like a reliability improvement by default. Retries help with transient network or backend failures, but they also consume memory and can increase pressure when a backend is already unhealthy. A production gateway should have retry behavior that matches backend expectations, queueing decisions that match memory limits, and alerts that distinguish temporary send failures from sustained exporter failure.
Health checks need similar humility. A successful liveness probe means the Collector process can answer the health extension; it does not mean every receiver is accepting data or every exporter is delivering it. Readiness probes are still important because Kubernetes needs a signal for rollout and restart decisions, but pipeline validation must come from the Collector’s own telemetry and from known test records flowing through the configured paths.
Resource sizing should be reviewed with signal volume in mind rather than copied from a sample. A node agent that reads logs can experience spikes during crash loops, while a gateway that performs tail sampling holds trace state until decision_wait expires. Memory limits, spike limits, batch sizes, and sampling buffers interact, so a configuration that works for a small namespace may not survive a busy cluster without tuning.
RBAC is another placement clue. A filelog receiver needs hostPath-style access and node scheduling; a Kubernetes cluster receiver needs API permissions; an Operator-managed Collector needs permissions appropriate to the objects it reconciles. If a design grants every Collector instance broad permissions because one receiver needs them, reconsider the placement. Splitting agent and gateway responsibilities often reduces both runtime load and permission scope.
Version review matters because Collector examples age quickly. The structure in this module is stable, but component names, maturity levels, and image versions should be checked against the OpenTelemetry documentation before a production rollout. Kubernetes version 1.35 does not change the core DaemonSet and Deployment reasoning, yet clusters can differ in admission policy, Pod Security settings, and network policy that affect how the Collector is allowed to run.
Finally, write the rollback plan in terms of data flow. If a transform breaks dashboards, you should know whether to remove only that processor, remove it from one signal pipeline, or switch exporter output back to debug for validation. If gateway sampling loses traces, you should know whether the fix is routing affinity, sampling parameters, or temporarily bypassing tail sampling. A Collector rollback that simply redeploys the previous YAML may be enough, but the review should identify the smallest safe reversal.
This review pattern is also how you answer scenario questions under exam pressure. Identify the signal, locate the Collector placement, trace the pipeline order, verify connector direction, then inspect exporter and diagnostic surfaces. The details vary from question to question, but that sequence keeps you from treating a healthy pod as a healthy telemetry path or treating a backend symptom as an application instrumentation bug.
There is one more habit worth practicing before you leave the review: name the evidence that would change your mind. If you believe a filter is dropping spans, the evidence is a processor dropped counter or a debug comparison before and after that processor. If you believe the backend is rejecting data, the evidence is exporter failure metrics or backend-side rejection logs while debug output still shows processed records. Clear evidence targets keep troubleshooting from becoming a sequence of speculative YAML edits.
The same evidence mindset improves change communication. Instead of saying “we added tail sampling,” say “we added gateway tail sampling, routed traces by trace ID, kept errors and slow traces, and verified accepted, dropped, sent, and failed counters during rollout.” That sentence tells reviewers what changed, where it runs, why it is safe, and how the team checked it. Good Collector operations are often less about clever configuration and more about making every data-flow assumption observable.
Patterns & Anti-Patterns
Section titled “Patterns & Anti-Patterns”Strong Collector designs are usually boring in the best sense: node agents collect what only nodes can see, gateways perform centralized policy, processors are ordered by risk reduction, and debug surfaces stay available during rollout. The patterns below are not decorative architecture rules; they are operational shortcuts that reduce ambiguity when telemetry disappears, duplicates, or overwhelms a backend.
| Pattern | When to Use It | Why It Works | Scaling Consideration |
|---|---|---|---|
| Agent plus gateway | Clusters with logs, host metrics, and distributed traces | Agents collect local data while gateways centralize sampling and routing | Scale agents with nodes and gateways with telemetry volume |
| Memory limiter first, batch last | Almost every traces, metrics, or logs pipeline | The Collector rejects pressure before buffering and exports efficiently after processing | Tune limits against pod memory requests and backend throughput |
| Debug mirror during rollout | New filters, transforms, connectors, or exporters | You can inspect records before blaming applications or backends | Reduce verbosity after validation to control log volume |
| Trace-ID-aware gateway routing | Tail sampling with more than one gateway replica | All spans for one trace reach the same sampling decision point | Use a headless Service or resolver strategy that exposes replicas |
Anti-patterns often begin as convenience. A team runs Contrib everywhere because it is easy, places tail sampling on agents because agents are already deployed, or leaves a broad filter untested because the Collector stayed Ready. Those choices are understandable during experiments, but they create confusion when the system becomes part of incident response.
| Anti-pattern | What Goes Wrong | Better Alternative |
|---|---|---|
| Running global receivers on every agent | Duplicate metrics and unnecessary API server load | Run cluster-wide receivers in a gateway or single coordinated instance |
| Treating health checks as pipeline validation | The process is alive even when telemetry is dropped | Use debug exporter, zpages, and internal metrics for data-flow validation |
| Using request-specific dimensions in spanmetrics | Metrics cardinality grows too quickly | Use stable dimensions such as method, status code, service, and route templates |
| Keeping broad Contrib builds forever | Larger dependency surface and unused components | Build a custom Collector once the production component set stabilizes |
Decision Framework
Section titled “Decision Framework”Collector decisions become manageable when you separate data source, processing policy, and export constraints. First ask where the telemetry originates and which Collector placement can see it. Then decide whether records need local protection, central aggregation, sampling, redaction, derived metrics, or backend-specific routing. Finally, choose the transport and distribution that fit your network and operational maturity.
| Decision | Choose This | When the Constraints Look Like This |
|---|---|---|
| Placement | DaemonSet agent | Node logs, host metrics, node-local buffering, or low-latency collection near workloads |
| Placement | Deployment gateway | Tail sampling, centralized routing, shared authentication, spanmetrics, or backend fan-out |
| Transport | OTLP/gRPC | Internal traffic, HTTP/2 support, high throughput, Collector-to-Collector paths |
| Transport | OTLP/HTTP | Browser telemetry, HTTP-only proxies, JSON debugging, simple edge ingestion |
| Distribution | Core | Minimal component set and security-sensitive baseline |
| Distribution | Contrib | Discovery, labs, or integrations not present in Core |
| Distribution | Custom | Stable production pipeline with explicit dependency control |
Use this mental flow when a scenario question gives you more information than you need. If the problem mentions missing complete traces after scaling gateways, look at trace routing before changing sampling policy. If it mentions duplicated cluster metrics, look at receiver placement before touching Prometheus queries. If it mentions exporter failures while debug output still shows records, focus on transport, credentials, TLS, backend availability, and retry behavior.
Telemetry source? ├─ Node-local files or host metrics ──▶ Agent DaemonSet │ └─ Forward to gateway for shared policy ├─ Application OTLP signals ─────────▶ Agent or gateway, based on latency and ownership │ └─ Use OTLP/gRPC internally when possible └─ Cluster-wide API metrics ─────────▶ Gateway or single collector instance └─ Avoid per-node duplication
Need tail sampling or spanmetrics? ├─ Yes ──▶ Gateway with trace-aware routing and enough memory └─ No ───▶ Keep processing close to source unless backend policy needs centralizationThe framework is deliberately compact because real incidents do not wait while you admire a perfect architecture diagram. Start with visibility, then placement, then order, then export. That sequence keeps you from fixing the wrong layer, which is the most common failure mode when a Collector configuration is valid YAML but invalid operations.
Did You Know?
Section titled “Did You Know?”- OTLP/gRPC and OTLP/HTTP use different default ports: 4317 for gRPC and 4318 for HTTP, and many exam scenarios hinge on recognizing which transport is being described.
- The
spanmetricsconnector turns trace data into RED-style metrics, so one trace pipeline can also feed request rate, error, and duration metrics into a metrics pipeline. - The Collector has Core, Contrib, and custom distribution models; Contrib includes 200+ components, while
ocblets production teams compile only the components they use. - zpages exposes live debugging paths such as
/debug/pipelinezand/debug/tracezon port 55679 when the extension is enabled.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Declaring a component but not adding it to a pipeline | The top-level config reads like the component is active, but service.pipelines is the real wiring | Always trace the relevant signal through receivers, processors, connectors, and exporters under service |
Placing batch before memory_limiter | Batching feels like a universal optimization, so it gets added first | Put memory_limiter early and batch late so pressure is controlled before buffering grows |
| Running tail sampling on agents | Agents are already deployed on every node, so sampling seems close to the source | Run tail sampling on gateways and use trace-ID-aware routing across gateway replicas |
Running k8s_cluster on every node | Teams copy the same receiver set into every Collector mode | Run cluster-wide receivers in one gateway-style placement with the right RBAC |
Forgetting error_mode: ignore on filter or transform processors | The rule examples work on clean data during testing | Set error handling deliberately and validate malformed or unexpected records with debug output |
| Promoting unstable attributes to Prometheus labels | Resource-to-telemetry conversion looks convenient for every attribute | Promote only bounded, stable dimensions such as service, namespace, method, and route templates |
| Treating pod readiness as proof of telemetry delivery | Kubernetes can only tell that the Collector process is responding | Check debug exporter output, zpages, and otelcol_* internal metrics for actual data flow |
| Keeping a broad Contrib image after the pipeline stabilizes | It speeds up early experimentation and never gets revisited | Move to a custom Collector build when component choices and release ownership are mature |
Question 1: Your team designs a multi-pipeline Collector configuration for traces, metrics, and logs, but the new `filter/healthz` processor is not changing trace output. What do you check first?
Check whether filter/healthz is listed under service.pipelines.traces.processors, not merely declared under the top-level processors block. A Collector component exists only as inventory until a pipeline references it. If the processor is present in the wrong signal pipeline, it still will not affect traces, even though the configuration can load successfully. After confirming pipeline wiring, use the debug exporter to compare records before and after the processor.
Question 2: A gateway Deployment was scaled to several replicas, and tail sampling started missing slow traces. The policy still says to keep slow traces. What design problem is most likely?
The likely problem is that spans from the same trace are being split across gateway replicas before the tail_sampling processor sees them. Tail sampling needs enough of the completed trace to make a correct decision, so the agent-to-gateway path should use trace-ID-aware routing such as the load balancing exporter. Changing the latency threshold would not fix incomplete trace visibility. Adding more replicas can make the issue worse unless routing preserves trace affinity.
Question 3: You configure advanced processors to reduce volume, but an exporter reports fewer spans than expected after a transform rollout. Which signals help you diagnose whether data was dropped inside the Collector or rejected by the backend?
Compare Collector internal metrics across the receiver, processor, and exporter stages. otelcol_receiver_accepted_spans shows whether spans entered, otelcol_processor_dropped_spans shows whether a processor removed them, otelcol_exporter_sent_spans shows successful export, and otelcol_exporter_send_failed_spans points toward backend or network failure. The debug exporter can also mirror representative records so you can inspect transformed attributes. This approach narrows the failure layer before you change application instrumentation.
Question 4: A Kubernetes 1.35 cluster needs container logs, host metrics, spanmetrics, and tail sampling. How would you deploy the Collector, and why?
Use a DaemonSet agent for container logs and host metrics because those data sources are node-local, then forward to a gateway Deployment for spanmetrics and tail sampling. The gateway can centralize derived metrics, sampling decisions, routing, and backend authentication. Tail sampling should not run on agents because each agent may see only part of a distributed trace. This split also lets you scale gateways independently from node count.
Question 5: A browser-based application cannot send OTLP/gRPC through the enterprise proxy, but the team still wants OpenTelemetry-native ingestion. Which OTLP transport do you evaluate, and what tradeoff do you accept?
Evaluate OTLP/HTTP on port 4318 because it works through ordinary HTTP infrastructure and supports paths such as /v1/traces. The tradeoff is that it may not provide the same internal throughput characteristics as gRPC, and you need to configure edge concerns such as CORS, authentication, compression, and rate limits carefully. For internal Collector-to-Collector traffic, OTLP/gRPC remains a strong default when HTTP/2 is supported. The transport choice follows the network constraint rather than a universal preference.
Question 6: Your spanmetrics connector produces far more metrics series than expected after adding request URL as a dimension. What is wrong with the design?
The connector is using an unstable, high-cardinality dimension. Raw request URLs can contain IDs, query strings, and other per-request values, so deriving metrics from them can create a large number of time series. Use stable dimensions such as http.method, http.status_code, service name, and route templates instead. The connector is useful, but it inherits the same cardinality discipline required for any metrics pipeline.
Question 7: A security review asks why production Collectors still use the Contrib distribution even though the pipeline only needs OTLP, filelog, memory limiter, batch, transform, debug, and one OTLP exporter. What should you propose?
Propose building a custom Collector with ocb once the component set and release process are stable. Contrib is useful during discovery because it includes many integrations, but it also ships many components the production pipeline may never use. A custom build reduces binary size and dependency surface while keeping the required receivers, processors, and exporters explicit. The proposal should include upgrade ownership, because a custom distribution must still track OpenTelemetry releases.
Hands-On Exercise: Build a Multi-Signal Pipeline
Section titled “Hands-On Exercise: Build a Multi-Signal Pipeline”Exercise scenario: you are preparing a small observability namespace for a team that wants one Collector receiving traces, metrics, and logs through OTLP, generating span-derived metrics, and exposing enough diagnostics to prove the pipeline works before a backend is added. The lab uses debug output instead of a vendor backend so you can focus on Collector behavior. The same validation pattern applies when you later replace debug with production exporters.
You need a working Kubernetes cluster, kubectl, and the ability to create a namespace. The commands below use kind for a disposable local cluster, but you can skip that first command if you already have a lab cluster. Keep the Collector in an observability namespace so later cleanup is straightforward and the Service names match the examples.
# Create a kind cluster (skip if you already have one)kind create cluster --name otel-lab
# Create namespacekubectl create namespace observabilityTask 1: Deploy the Collector
Section titled “Task 1: Deploy the Collector”Apply the ConfigMap, Deployment, and Service in one command. The configuration includes OTLP receivers for all signals, memory_limiter and batch processors, a spanmetrics connector, debug exporters, health checks, zpages, and internal metrics. Read the service.pipelines section before applying it so you can predict which components are active for each signal.
kubectl apply -n observability -f - <<'EOF'apiVersion: v1kind: ConfigMapmetadata: name: otel-collector-configdata: config.yaml: | receivers: otlp: protocols: grpc: endpoint: 0.0.0.0:4317 http: endpoint: 0.0.0.0:4318
processors: memory_limiter: check_interval: 1s limit_mib: 256 spike_limit_mib: 64 batch: send_batch_size: 1024 timeout: 1s
connectors: spanmetrics: dimensions: - name: http.method
exporters: debug: verbosity: detailed
extensions: health_check: endpoint: 0.0.0.0:13133 zpages: endpoint: 0.0.0.0:55679
service: extensions: [health_check, zpages] telemetry: metrics: address: 0.0.0.0:8888 pipelines: traces: receivers: [otlp] processors: [memory_limiter, batch] exporters: [debug, spanmetrics] metrics: receivers: [otlp, spanmetrics] processors: [memory_limiter, batch] exporters: [debug] logs: receivers: [otlp] processors: [memory_limiter, batch] exporters: [debug]---apiVersion: apps/v1kind: Deploymentmetadata: name: otel-collectorspec: replicas: 1 selector: matchLabels: app: otel-collector template: metadata: labels: app: otel-collector spec: containers: - name: collector image: otel/opentelemetry-collector-contrib:0.98.0 args: ["--config=/etc/otel/config.yaml"] ports: - containerPort: 4317 - containerPort: 4318 - containerPort: 13133 - containerPort: 55679 volumeMounts: - name: config mountPath: /etc/otel livenessProbe: httpGet: path: / port: 13133 readinessProbe: httpGet: path: / port: 13133 volumes: - name: config configMap: name: otel-collector-config---apiVersion: v1kind: Servicemetadata: name: otel-collectorspec: selector: app: otel-collector ports: - name: otlp-grpc port: 4317 - name: otlp-http port: 4318 - name: health port: 13133 - name: zpages port: 55679EOFSolution notes for Task 1
The Deployment should create one Collector pod, and the readiness probe should pass through the health check extension. If the pod does not become Ready, inspect the logs for configuration parsing errors first because the Collector validates component names and pipeline references during startup. A successful rollout does not yet prove telemetry is flowing; it only proves the process loaded and the health extension responded.
Task 2: Send Test Telemetry
Section titled “Task 2: Send Test Telemetry”Wait for readiness, port-forward the OTLP/HTTP Service port, and send a minimal trace payload. The payload uses a fixed service name and an HTTP method attribute so the debug exporter and spanmetrics connector have visible data to work with. Before running the curl command, predict where you expect to see the trace and which derived metric pipeline should also receive a signal.
# Wait for the collector to be readykubectl wait --for=condition=ready pod -l app=otel-collector -n observability --timeout=60s
# Port-forward to send datakubectl port-forward -n observability svc/otel-collector 4318:4318 &
# Send a test trace via OTLP/HTTPcurl -X POST http://localhost:4318/v1/traces \ -H "Content-Type: application/json" \ -d '{ "resourceSpans": [{ "resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "test-service"}}]}, "scopeSpans": [{ "spans": [{ "traceId": "5b8aa5a2d2c872e8321cf37308d69df2", "spanId": "051581bf3cb55c13", "name": "GET /api/users", "kind": 2, "startTimeUnixNano": "1000000000", "endTimeUnixNano": "2000000000", "attributes": [ {"key": "http.method", "value": {"stringValue": "GET"}}, {"key": "http.status_code", "value": {"intValue": "200"}} ] }] }] }] }'Solution notes for Task 2
The HTTP request should return a successful response from the Collector receiver. If it fails, check that the port-forward is still running and that the Service exposes port 4318. If the request succeeds but nothing appears in logs, verify that the traces pipeline exports to debug and that the Collector pod you are reading is the current Ready pod.
Task 3: Verify Pipeline Behavior
Section titled “Task 3: Verify Pipeline Behavior”Use three independent checks: debug logs for the trace record, zpages for active pipelines, and internal metrics for accepted spans. This triangulation is more reliable than a single check because each surface answers a different question. Logs show representative payload content, zpages show pipeline structure, and metrics show counters that can be used in alerts.
# Check collector logs — you should see the trace in debug outputkubectl logs -n observability -l app=otel-collector --tail=50
# Check zpageskubectl port-forward -n observability svc/otel-collector 55679:55679 &# Open http://localhost:55679/debug/pipelinez in your browser
# Check Collector's own metricskubectl port-forward -n observability svc/otel-collector 8888:8888 &curl -s http://localhost:8888/metrics | grep otelcol_receiver_acceptedSolution notes for Task 3
You should see the test span in the Collector logs, all three pipelines listed in zpages, and receiver accepted counters in the internal metrics endpoint. If debug logs show the trace but internal metrics do not, make sure the metrics telemetry address is exposed by the pod and port-forwarded correctly. If zpages is unreachable, check that the extension is enabled under service.extensions, not just declared under extensions.
Task 4: Explain the Production Changes
Section titled “Task 4: Explain the Production Changes”Write down the changes you would make before moving this lab pattern into production. Consider where you would split agent and gateway responsibilities, whether you would keep Contrib or build a custom Collector, which exporter would replace debug, how you would protect credentials, and what metrics would become alerts. This task forces you to connect the working lab to the design framework instead of treating it as a copy-paste manifest.
Solution notes for Task 4
A production design would usually run node agents for file logs and host metrics, then forward to gateways for sampling, spanmetrics, routing, and backend exports. Debug output would become temporary or sampled, credentials would move into Secrets or external secret management, and internal Collector metrics would feed alerts for accepted, dropped, sent, and failed telemetry. A stable pipeline should also consider a custom Collector distribution rather than keeping a broad Contrib image forever.
Success Criteria
Section titled “Success Criteria”- Design multi-pipeline Collector configurations by tracing the lab’s traces, metrics, and logs through the active
service.pipelinesentries. - Configure advanced processors by explaining why
memory_limiterappears beforebatchand wherefilter,transform, ortail_samplingwould fit. - Deploy Collector workloads in Kubernetes 1.35 and explain when this Deployment should become an agent DaemonSet, a gateway Deployment, or both.
- Diagnose Collector pipeline issues by using debug logs, zpages, and
otelcol_*internal metrics instead of relying only on pod readiness. - Evaluate OTLP transports, connectors, and distributions by choosing a production exporter, a spanmetrics dimension set, and Core, Contrib, or custom Collector packaging.
Sources
Section titled “Sources”- OpenTelemetry Collector configuration
- OpenTelemetry Collector deployment
- OpenTelemetry Protocol specification
- OpenTelemetry Collector transformation documentation
- OpenTelemetry Collector batch processor
- OpenTelemetry Collector memory limiter processor
- OpenTelemetry Collector tail sampling processor
- OpenTelemetry Collector spanmetrics connector
- OpenTelemetry Operator
- OpenTelemetry Collector Builder
- Kubernetes DaemonSet documentation
- Kubernetes Deployment documentation
Next Module
Section titled “Next Module”OTCA Track Overview - Instrument applications using OTel SDKs across multiple languages.