Module 1.2: OpenTelemetry

Toolkit Track | Complexity: [COMPLEX] | Time: 60-75 min

Before starting this module, you should be comfortable reading service logs, reasoning about HTTP request flows, and interpreting basic Prometheus metrics. You do not need to be an OpenTelemetry expert yet, but you should already understand why distributed systems need more than single-process logging.

Complete these first:

  • Module 1.1: Prometheus
  • Observability Theory Track
  • Basic understanding of distributed tracing concepts such as traces, spans, and parent-child relationships
  • Basic Kubernetes skills with kubectl; this module uses k as a shorthand alias after the first command

After completing this module, you will be able to:

  • Design an OpenTelemetry signal flow that routes traces, metrics, and logs through a Collector without locking applications to one backend.
  • Implement automatic and manual instrumentation choices for services, then justify where each approach fits.
  • Debug broken traces by inspecting resource attributes, propagation headers, Collector pipelines, and backend export behavior.
  • Evaluate head sampling, tail sampling, filtering, and batching trade-offs for cost, reliability, and incident usefulness.
  • Operate an OpenTelemetry Collector in Kubernetes with memory limits, health checks, and verification steps that expose configuration mistakes.

Large organizations often inherit multiple tracing systems across generations of services, so one end-to-end request can appear as disconnected fragments in different backends.

In regulated environments, teams often need end-to-end tracing for critical transaction flows. A practical migration path is to place OpenTelemetry between applications and backends, ingest existing signals through compatible receivers, normalize them in a Collector, and migrate services incrementally instead of rewriting every application at once.

That design matters because it turns the migration from an application rewrite into an infrastructure rollout. Teams can prove one critical path first, such as payments, then improve service instrumentation one workload at a time. OpenTelemetry does not magically fix missing spans, bad service names, or sampling mistakes, but it gives the platform a standard vocabulary and a neutral transport. That is the skill you are building in this module: not memorizing OpenTelemetry terms, but designing and debugging a telemetry path that survives real production constraints.

OpenTelemetry is the instrumentation and telemetry transport standard for modern observability. It gives application teams a common API and SDK for creating telemetry, a protocol called OTLP for moving it, and a Collector for receiving, processing, and exporting it. The practical result is that your application code can describe what happened without being tied to one observability backend.

The important word is standard, not tool. OpenTelemetry is not a full observability backend by itself. It does not replace dashboards, long-term storage, alerting rules, or incident workflows. Instead, it sits at the boundary where applications produce telemetry and platforms decide where that telemetry should go. This boundary is where vendor lock-in usually grows, so standardizing it gives the platform team leverage.

┌──────────────────────────────────────────────────────────────────────────────┐
│                          OPENTELEMETRY SIGNAL PATH                           │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Application Code                                                            │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │  Business operation: "create order", "charge card", "ship item"        │  │
│  │                                                                        │  │
│  │  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐                │  │
│  │  │ Traces       │   │ Metrics      │   │ Logs         │                │  │
│  │  │ span trees   │   │ measurements │   │ event lines  │                │  │
│  │  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘                │  │
│  │         │                  │                  │                        │  │
│  │         └──────────────────┼──────────────────┘                        │  │
│  │                            ▼                                           │  │
│  │                OpenTelemetry API and SDK                               │  │
│  └────────────────────────────┬───────────────────────────────────────────┘  │
│                               │ OTLP gRPC or OTLP HTTP                       │
│                               ▼                                              │
│  Collector Layer                                                             │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │  Receivers ─────────▶  Processors ─────────▶  Exporters                │  │
│  │  otlp, jaeger, zipkin  batch, filter, memory  otlp, prometheus, debug  │  │
│  └────────────────────────────┬───────────────────────────────────────────┘  │
│                               │                                              │
│                               ▼                                              │
│  Backends: tracing store, metrics store, log store, commercial APM, archive  │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

A service can emit telemetry directly to a backend, but that choice pushes platform concerns into every workload. When you need to add a second backend, filter noisy spans, change sampling policy, or buffer during an outage, every service becomes part of the migration. The Collector moves those concerns into infrastructure where platform teams can operate them consistently.

| Component | What It Does | Design Question |
| --- | --- | --- |
| API | Defines vendor-neutral calls used by instrumentation libraries and application code | Can developers create telemetry without importing backend-specific packages? |
| SDK | Implements batching, sampling, resource metadata, and export behavior inside the process | What work should happen in the application before data leaves the process? |
| OTLP | Carries traces, metrics, and logs over gRPC or HTTP | Which transport fits the network path between workloads and collectors? |
| Collector | Receives, processes, and exports telemetry outside the application | What should be centralized so teams do not repeat it in every service? |
| Semantic conventions | Standardize attribute names for common operations | Will dashboards and queries work across services written by different teams? |

Stop and think: If your organization changed tracing vendors next quarter, which parts of your current applications would need code changes? Separate the answer into instrumentation code, runtime configuration, Collector configuration, and backend dashboard work. The more work you find inside application repositories, the more value a vendor-neutral instrumentation layer can provide.

OpenTelemetry also helps with correlation. A trace explains how one request moved through services. Metrics explain aggregate behavior such as request rate, error rate, and latency distribution. Logs explain individual events with detailed context. When these signals share resource attributes and trace identifiers, an engineer can move from an alert to a metric, from the metric to an exemplar, from the exemplar to a trace, and from the trace to logs for the failing span.

TRACES                      METRICS                     LOGS
──────────────────────────────────────────────────────────────────────────────
Span tree for one request   Aggregated measurements     Timestamped events
│                           │                           │
├─ trace_id                 ├─ counter                  ├─ severity
├─ span_id                  ├─ gauge                    ├─ body
├─ parent_span_id           ├─ histogram                ├─ attributes
├─ attributes               ├─ resource attributes      ├─ trace_id
├─ events                   └─ exemplars to traces      └─ span_id
└─ status

Shared correlation keys: service.name, deployment.environment, trace_id, span_id

A mature platform treats these signals as connected evidence, not separate products. If checkout-service latency spikes, a metric should show the spike, a trace should identify whether the delay is payment, inventory, or shipping, and logs should show the local details around the span that failed. OpenTelemetry gives you the shared identifiers and transport needed to make that investigation possible.
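As a rough sketch of that investigation path, the snippet below joins log records to failing spans through the shared trace_id and span_id keys. The record shapes, names, and values are illustrative assumptions; they are simplified dicts, not the OTLP data model or any backend's query API.

```python
# Illustrative records only: the field names mirror the correlation keys
# discussed above, but these dicts are a simplified stand-in for real data.
spans = [
    {"trace_id": "t1", "span_id": "s1", "name": "POST /orders", "status": "OK"},
    {"trace_id": "t1", "span_id": "s2", "name": "charge_card", "status": "ERROR"},
]
logs = [
    {"trace_id": "t1", "span_id": "s2", "severity": "ERROR", "body": "card declined"},
    {"trace_id": "t1", "span_id": "s1", "severity": "INFO", "body": "order received"},
    {"trace_id": "t9", "span_id": "s7", "severity": "INFO", "body": "unrelated request"},
]


def logs_for_failed_spans(spans, logs):
    """Return the log bodies attached to spans whose status is ERROR."""
    failed = {(s["trace_id"], s["span_id"]) for s in spans if s["status"] == "ERROR"}
    return [log["body"] for log in logs if (log["trace_id"], log["span_id"]) in failed]


print(logs_for_failed_spans(spans, logs))
```

The join only works because both signals carry the same identifiers; a log line without trace_id and span_id would fall out of the result no matter how relevant it is.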

2. Instrumentation Strategy: Automatic First, Manual Where It Matters

Instrumentation is the act of making software describe its behavior. Automatic instrumentation attaches to common frameworks and libraries, such as HTTP servers, database clients, message queues, and gRPC clients. Manual instrumentation adds spans, attributes, metrics, and events around business operations that generic library hooks cannot understand.

Start with automatic instrumentation because it gives fast coverage and exposes the request graph quickly. A Java agent can instrument Spring, JDBC, and HTTP clients without changing application code, and Python auto-instrumentation offers similar coverage for many common frameworks and client libraries. This is especially useful when your first goal is to find missing service-to-service edges rather than model every business decision.

┌──────────────────────────────────────────────────────────────────────────────┐
│                        AUTO VS MANUAL INSTRUMENTATION                        │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Automatic instrumentation                                                   │
│  ┌─────────────────────┐      captures      ┌────────────────────────────┐   │
│  │ HTTP framework      │───────────────────▶│ inbound request spans      │   │
│  │ DB client           │───────────────────▶│ query spans                │   │
│  │ Message library     │───────────────────▶│ publish and consume spans  │   │
│  └─────────────────────┘                    └────────────────────────────┘   │
│                                                                              │
│  Manual instrumentation                                                      │
│  ┌─────────────────────┐      explains      ┌────────────────────────────┐   │
│  │ approve_loan()      │───────────────────▶│ business decision span     │   │
│  │ calculate_risk()    │───────────────────▶│ domain attributes          │   │
│  │ reserve_inventory() │───────────────────▶│ failure reason events      │   │
│  └─────────────────────┘                    └────────────────────────────┘   │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Manual instrumentation is still necessary because the most important production questions are often business questions. A generic HTTP span can tell you that POST /orders took too long. It cannot tell you whether the delay happened while validating a coupon, reserving inventory, retrying payment authorization, or waiting on fraud scoring. You add manual spans at the boundaries where business meaning changes.

Here is a runnable Python example that creates a trace, adds business attributes, records an exception, and exports through OTLP. It is deliberately small so you can see the mechanics before Kubernetes, sampling, and Collector processing add more layers.

# requirements:
# pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
from random import random

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Resource attributes identify the service that produced the telemetry.
resource = Resource.create(
    {
        "service.name": "order-service",
        "service.version": "1.0.0",
        "deployment.environment": "dev",
    }
)

# Batch spans in-process and export them over OTLP/gRPC to a local Collector.
provider = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(endpoint="http://127.0.0.1:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("order-service")


def validate_order(order_id: str) -> None:
    with tracer.start_as_current_span("validate_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("validation.type", "full")
        # Fail roughly one run in five so error traces show up in the backend.
        if random() < 0.2:
            raise ValueError("inventory reservation failed")


def save_order(order_id: str) -> None:
    with tracer.start_as_current_span("save_order") as span:
        span.set_attribute("order.id", order_id)


def process_order(order_id: str) -> None:
    # Root span; the two calls below create child spans under it.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        try:
            validate_order(order_id)
            save_order(order_id)
        except Exception as exc:
            # The context manager also records uncaught exceptions by default;
            # the explicit calls are shown here to make the mechanics visible.
            span.record_exception(exc)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(exc)))
            raise


if __name__ == "__main__":
    try:
        process_order("order-123")
    finally:
        # Flush buffered spans even when the order fails.
        provider.shutdown()

The worked example has a root span called process_order and child spans for validation and saving. If validation fails, the exception is recorded on the business span, not only in a log line. That difference matters during incidents because the failing decision becomes visible in the trace tree, and an outcome-based sampling policy can keep error traces even when successful traces are sampled aggressively.

Pause and predict: If the service.name resource attribute is missing from the example, what will the backend show? Most systems will still ingest the spans, but they may appear under an unknown service or be grouped with unrelated telemetry. Predict how that would affect ownership, alert routing, and dashboard filters before reading further.

The resource is not decoration. It is the identity card attached to telemetry. service.name, service.version, and deployment.environment let backends group data correctly and let platform teams distinguish a production checkout failure from a development test. Missing or inconsistent resource attributes are one of the most common causes of “OpenTelemetry is working, but nobody can find anything.”
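A small pre-deployment check can catch missing identity before telemetry reaches a backend. The sketch below is a hypothetical helper, not part of any OpenTelemetry SDK; the list of required keys simply mirrors the three attributes named above, and the two example resources are invented.

```python
# Assumption: these three keys are the minimum this platform requires.
REQUIRED_RESOURCE_KEYS = ("service.name", "service.version", "deployment.environment")


def missing_resource_attributes(resource: dict) -> list:
    """Return the required keys that are absent or empty in a resource dict."""
    return [key for key in REQUIRED_RESOURCE_KEYS if not resource.get(key)]


# Hypothetical resources as two different teams might configure them.
checkout = {
    "service.name": "checkout-service",
    "service.version": "2.3.1",
    "deployment.environment": "prod",
}
payment = {"service.version": "0.9.0"}

print(missing_resource_attributes(checkout))
print(missing_resource_attributes(payment))
```

Running a check like this in CI or in an admission step is one way to keep the "nobody can find anything" problem out of production, although where the check lives is a platform decision.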

Metrics follow the same principle. You use counters for events that only increase, histograms for distributions such as latency and payload size, and gauges for values that rise and fall. The code below records request counts and durations with attributes that are useful for aggregation. Avoid putting high-cardinality values such as raw user IDs, order IDs, or full URLs into metric attributes, because each distinct value creates a new time series and can explode storage costs.

# requirements:
# pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
import time
from random import uniform

from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

resource = Resource.create(
    {
        "service.name": "order-service",
        "deployment.environment": "dev",
    }
)

# Export aggregated metrics over OTLP/gRPC every five seconds.
exporter = OTLPMetricExporter(endpoint="http://127.0.0.1:4317", insecure=True)
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=5000)
provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter("order-service")

request_counter = meter.create_counter(
    name="http.server.requests",
    unit="1",
    description="Total HTTP requests handled by the service",
)
duration_histogram = meter.create_histogram(
    name="http.server.duration",
    unit="s",
    description="HTTP request duration in seconds",
)


def handle_request(route: str, method: str, status_code: int) -> None:
    # Use a monotonic clock for durations; sleep stands in for real work.
    start = time.perf_counter()
    time.sleep(uniform(0.01, 0.2))
    elapsed = time.perf_counter() - start
    # Bounded attributes only: route template, method, and status code.
    attributes = {
        "http.route": route,
        "http.request.method": method,
        "http.response.status_code": status_code,
    }
    request_counter.add(1, attributes)
    duration_histogram.record(elapsed, attributes)


if __name__ == "__main__":
    handle_request("/orders/{id}", "GET", 200)
    # Shutdown flushes any metrics that have not been exported yet.
    provider.shutdown()

This example uses route templates rather than raw paths. /orders/{id} is safe because it groups many requests into one series. /orders/123, /orders/124, and every other concrete ID would create a new time series, which raises cost and makes dashboards slower. Senior observability work is often about this kind of restraint: record the attribute that supports a decision, not every value that happens to exist.
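To make the cardinality difference concrete, the sketch below counts distinct series for raw paths versus route templates. The paths are invented, and to_route_template is a hypothetical stand-in: in practice the web framework or its instrumentation usually supplies the route template, so a regex like this is only for illustration.

```python
import re

# Hypothetical request paths observed by one service.
raw_paths = ["/orders/123", "/orders/124", "/orders/125", "/users/7/orders/9"]


def to_route_template(path: str) -> str:
    """Replace purely numeric path segments with a {id} placeholder."""
    return re.sub(r"/\d+(?=/|$)", "/{id}", path)


# One time series per distinct attribute value.
raw_series = len(set(raw_paths))
templated_series = len({to_route_template(p) for p in raw_paths})
print(raw_series, templated_series)
```

With four requests the difference is 4 series versus 2. The raw count keeps growing with every new ID, while the templated count is bounded by the number of routes the service exposes.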

3. Context Propagation: The Difference Between Spans and Traces

A span is one timed operation. A trace is a connected tree of spans that share a trace ID. The connection depends on context propagation: each service must extract incoming trace context, create child spans under that context, and inject updated context into outgoing requests. When propagation breaks, the backend receives spans but cannot connect them into one request journey.

┌──────────────────────────────────────────────────────────────────────────────┐
│                             CONTEXT PROPAGATION                              │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Client request                                                              │
│         │                                                                    │
│         ▼                                                                    │
│  ┌──────────────┐    HTTP headers     ┌──────────────┐    gRPC metadata      │
│  │ Service A    │────────────────────▶│ Service B    │────────────────────▶  │
│  │ span_id: a1  │ traceparent header  │ span_id: b1  │ traceparent metadata  │
│  │ parent: none │                     │ parent: a1   │                       │
│  └──────┬───────┘                     └──────┬───────┘                       │
│         │                                    │                               │
│         │                                    ▼                               │
│         │                             ┌──────────────┐                       │
│         │                             │ Service C    │                       │
│         │                             │ span_id: c1  │                       │
│         │                             │ parent: b1   │                       │
│         │                             └──────────────┘                       │
│         │                                                                    │
│         ▼                                                                    │
│  One trace ID ties A, B, and C together; parent span IDs build the tree.     │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

The W3C traceparent header is the default modern propagation format. Some environments also need B3 propagation for compatibility with Zipkin-era services. During migrations, it is common to accept more than one propagator so that old and new workloads can share context while teams gradually standardize.

# requirements:
# pip install opentelemetry-api opentelemetry-propagator-b3
from opentelemetry.baggage.propagation import W3CBaggagePropagator
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.b3 import B3MultiFormat
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

# Accept and emit W3C Trace Context, W3C Baggage, and B3 multi-header formats
# so old and new workloads can share context during a migration.
set_global_textmap(
    CompositePropagator(
        [
            TraceContextTextMapPropagator(),
            W3CBaggagePropagator(),
            B3MultiFormat(),
        ]
    )
)

The most useful propagation debugging habit is to follow the boundary, not the service. If Service A and Service C appear in one trace but Service B appears as a separate root trace, inspect the request into B and the request out of B. The failure may be an uninstrumented custom client, a reverse proxy stripping headers, an async queue that does not copy message attributes, or manual spans created without the active parent context.

Stop and think: A team says, “The Collector is losing spans because checkout and payment are in separate traces.” What evidence would prove or disprove that claim? Think about where trace IDs are created, where headers cross process boundaries, and whether a Collector can reconstruct a parent-child relationship that the applications never emitted.

A Collector can transform, filter, sample, and export telemetry, but it cannot infer missing parentage reliably after the fact. If two services create different trace IDs because propagation failed, the Collector sees two independent traces. This is why propagation debugging starts in application boundaries and network intermediaries before blaming the backend.
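The mechanics of extract, create child, and inject can be sketched without any SDK. The snippet below is a simplified illustration of the W3C traceparent layout (version, 32-hex trace ID, 16-hex parent span ID, flags); it skips validation rules that a real propagator applies, and the helper names extract, start_span, and inject are hypothetical rather than OpenTelemetry API calls.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def extract(headers: dict):
    """Return (trace_id, parent_span_id, flags), or None if the header is absent or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return None
    _version, trace_id, span_id, flags = match.groups()
    return trace_id, span_id, flags


def start_span(headers: dict) -> dict:
    """Continue the incoming trace if context exists; otherwise start a new root."""
    ctx = extract(headers)
    trace_id, parent_id, flags = ctx if ctx else (secrets.token_hex(16), None, "01")
    return {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_id": parent_id,
        "flags": flags,
    }


def inject(span: dict) -> dict:
    """Build outgoing headers so the next service becomes a child of this span."""
    return {"traceparent": f"00-{span['trace_id']}-{span['span_id']}-{span['flags']}"}


a = start_span({})         # Service A: no incoming context, so a new root trace
b = start_span(inject(a))  # Service B: header preserved, so B joins A's trace
c = start_span({})         # Service C: header stripped on the way, so a new root
print(b["trace_id"] == a["trace_id"], c["trace_id"] == a["trace_id"])
```

Service C ends up with its own trace ID because it never saw the header. Nothing downstream, including a Collector, has the information needed to reattach it to A.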

4. The Collector: Receivers, Processors, Exporters, and Pipelines

The OpenTelemetry Collector is the operational control point for telemetry. Receivers accept data, processors modify or decide what to keep, exporters send data onward, and pipelines connect those pieces per signal type. A Collector can receive OTLP from applications, scrape Prometheus endpoints, accept legacy Zipkin traffic, batch data, apply memory limits, remove noisy spans, and export to multiple destinations.

┌──────────────────────────────────────────────────────────────────────────────┐
│                                OTEL COLLECTOR                                │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  RECEIVERS                       PROCESSORS                  EXPORTERS       │
│  ┌──────────────┐                ┌──────────────┐            ┌────────────┐  │
│  │ otlp/grpc    │───────────────▶│ memory_limit │───────────▶│ otlp       │  │
│  └──────────────┘                └──────────────┘            └────────────┘  │
│  ┌──────────────┐                ┌──────────────┐            ┌────────────┐  │
│  │ otlp/http    │───────────────▶│ batch        │───────────▶│ prometheus │  │
│  └──────────────┘                └──────────────┘            └────────────┘  │
│  ┌──────────────┐                ┌──────────────┐            ┌────────────┐  │
│  │ prometheus   │───────────────▶│ resource     │───────────▶│ debug      │  │
│  └──────────────┘                └──────────────┘            └────────────┘  │
│  ┌──────────────┐                ┌──────────────┐            ┌────────────┐  │
│  │ zipkin       │───────────────▶│ filter       │───────────▶│ archive    │  │
│  └──────────────┘                └──────────────┘            └────────────┘  │
│                                                                              │
│  Pipelines bind selected receivers, processors, and exporters per signal.    │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

A minimal production-minded Collector configuration should include a memory limiter before expensive processing, batching before export, explicit endpoints, and health or debug extensions for operations. The configuration below uses the debug exporter for local verification and an OTLP exporter for a tracing backend such as Tempo or Jaeger with OTLP enabled.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector
          static_configs:
            - targets:
                - 127.0.0.1:8888
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    timeout: 1s
    send_batch_size: 1024
  resource:
    attributes:
      - key: deployment.environment
        value: dev
        action: upsert
  filter/drop_health_checks:
    traces:
      span:
        - 'name == "GET /healthz"'
        - 'name == "GET /readyz"'
exporters:
  debug:
    verbosity: detailed
  otlp/tempo:
    endpoint: tempo.observability.svc.cluster.local:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
service:
  extensions:
    - health_check
  pipelines:
    traces:
      receivers:
        - otlp
      processors:
        - memory_limiter
        - filter/drop_health_checks
        - batch
      exporters:
        - debug
        - otlp/tempo
    metrics:
      receivers:
        - otlp
        - prometheus
      processors:
        - memory_limiter
        - resource
        - batch
      exporters:
        - prometheus
    logs:
      receivers:
        - otlp
      processors:
        - memory_limiter
        - batch
      exporters:
        - debug
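A frequent configuration mistake is a pipeline that references a component nobody defined. The sketch below checks that relationship on a plain dict shaped like the YAML above. It is a hypothetical helper for illustration; the Collector performs its own validation at startup, and this does not replace it.

```python
def undefined_components(config: dict) -> list:
    """List (pipeline, kind, name) references that have no matching definition."""
    problems = []
    for pipeline, spec in config["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            for name in spec.get(kind, []):
                if name not in config.get(kind, {}):
                    problems.append((pipeline, kind, name))
    return problems


# A trimmed-down dict with one deliberate mistake: the logs pipeline
# references "batch", but no batch processor is defined.
config = {
    "receivers": {"otlp": {}},
    "processors": {"memory_limiter": {}},
    "exporters": {"debug": {}},
    "service": {
        "pipelines": {
            "logs": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["debug"],
            }
        }
    },
}
print(undefined_components(config))
```

The reverse case, a component that is defined but never placed in a pipeline, is quieter and often more confusing, because the configuration loads while the component does nothing.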

Processor order matters. The memory limiter should run early because it protects the Collector from overload before queues grow. Filters usually run before batching so unwanted telemetry does not consume batch capacity. Batching should run late because it improves export efficiency after data has been shaped.
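The effect of running a filter before batching can be shown with a toy model. This is a simplified sketch, not the Collector's implementation: it treats send_batch_size as a hard cap on batch size, which the real batch processor does not do, and the span dicts are invented.

```python
def drop_health_checks(spans):
    """Filter stage: remove spans named after health endpoints."""
    noisy = {"GET /healthz", "GET /readyz"}
    return [s for s in spans if s["name"] not in noisy]


def batch(spans, send_batch_size):
    """Batch stage: group spans into export requests of at most send_batch_size."""
    return [spans[i:i + send_batch_size] for i in range(0, len(spans), send_batch_size)]


# Six health-check spans and four business spans arrive together.
incoming = [{"name": "GET /healthz"}] * 6 + [{"name": "POST /orders"}] * 4

filter_then_batch = batch(drop_health_checks(incoming), send_batch_size=4)
batch_only = batch(incoming, send_batch_size=4)
print(len(filter_then_batch), len(batch_only))
```

Filtering first produces one export request instead of three in this toy case, which is the intuition behind shaping data before batching it.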

A common mistake, even on experienced teams, is treating one Collector as both a node-local agent and a central gateway. Agent Collectors run near workloads and usually focus on receiving local telemetry, adding Kubernetes metadata, and forwarding efficiently. Gateway Collectors run as shared services and usually apply routing, sampling, redaction, and multi-backend export. Combining both roles can work in small environments, but separating them becomes cleaner as traffic and ownership grow.

┌──────────────────────────────────────────────────────────────────────────────┐
│                         AGENT AND GATEWAY DEPLOYMENT                         │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Kubernetes Node A                         Observability Namespace           │
│  ┌──────────────────────────────┐          ┌──────────────────────────────┐  │
│  │ app pod ─────┐               │          │ gateway collector replica 1  │  │
│  │ app pod ─────┼──▶ agent      │─────────▶│ gateway collector replica 2  │  │
│  │ app pod ─────┘    collector  │          │ gateway collector replica 3  │  │
│  └──────────────────────────────┘          └──────────────┬───────────────┘  │
│                                                           │                  │
│  Kubernetes Node B                                        ▼                  │
│  ┌──────────────────────────────┐          ┌──────────────────────────────┐  │
│  │ app pod ─────┐               │          │ trace backend, metrics store,│  │
│  │ app pod ─────┼──▶ agent      │─────────▶│ log backend, archive export  │  │
│  │ app pod ─────┘    collector  │          └──────────────────────────────┘  │
│  └──────────────────────────────┘                                            │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Sampling decides which traces are kept. Filtering removes telemetry that should not be exported. Batching changes how telemetry is sent. These controls are operationally connected because they all affect cost, memory, and incident usefulness. A platform that samples too aggressively saves money but misses rare failures. A platform that keeps everything may discover the incident quickly and then lose budget or storage capacity.

Head sampling makes the decision near the beginning of a trace. It is simple, cheap, and easy to run in application SDKs or agent Collectors. Its weakness is that it decides before the system knows whether the trace will become slow or fail, so rare but important traces can be dropped.
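A ratio-based head sampler can be sketched in a few lines. The function below is an illustration in the spirit of trace-ID ratio sampling, not the exact algorithm of any particular SDK, and the trace ID is an arbitrary example value.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64."""
    low_64_bits = int(trace_id_hex[-16:], 16)
    return low_64_bits < int(ratio * 2**64)


# Assume this request later fails with an error. The sampler cannot know that:
# it only sees the trace ID at the moment the trace starts.
error_trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

print(head_sample(error_trace_id, 0.1))
print(head_sample(error_trace_id, 0.7))
```

Because the decision depends only on the trace ID, every service that applies the same ratio reaches the same verdict, which keeps sampled traces complete. The cost is visible in the first call: at a 10 percent ratio this trace is dropped regardless of how it ends.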

Tail sampling waits until a trace is complete or until a decision timeout is reached. It can keep all error traces, all slow traces, and a smaller percentage of normal traces. Its weakness is complexity: the Collector must buffer spans long enough to make the decision, and horizontally scaled tail-sampling gateways must see all spans for the same trace or use a load-balancing strategy that routes trace IDs consistently.

| Aspect | Head sampling | Tail sampling |
| --- | --- | --- |
| Decision time | Trace start | After spans arrive |
| Memory cost | Low | Higher |
| Can keep all errors | No | Yes |
| Can keep slow traces | No | Yes |
| Operational risk | Missed incidents | Buffering and routing |
| Best fit | Edge, high-volume normal traffic | Central gateway, incident analysis |

Here is a Collector tail-sampling configuration that keeps error traces, keeps slow traces, and samples a small percentage of normal traces. The numbers are intentionally conservative for a small lab. In production, you would size num_traces, expected_new_traces_per_sec, memory limits, and gateway replicas using measured traffic rather than guesses.

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 5000
    expected_new_traces_per_sec: 300
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes:
            - ERROR
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: sample-normal-traffic
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
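One way to sanity-check these numbers is a back-of-the-envelope estimate: traces that must stay buffered are roughly decision_wait multiplied by the rate of new traces. The helpers below are hypothetical and encode only that rule of thumb; real sizing should come from measured traffic and memory usage.

```python
def traces_in_flight(decision_wait_s: float, new_traces_per_sec: float) -> float:
    """Approximate number of traces waiting for a sampling decision."""
    return decision_wait_s * new_traces_per_sec


def buffer_headroom(num_traces: int, decision_wait_s: float, new_traces_per_sec: float) -> float:
    """Ratio of buffer capacity to traces in flight; below 1.0 means early eviction."""
    return num_traces / traces_in_flight(decision_wait_s, new_traces_per_sec)


# Lab values from the configuration above, then a hypothetical busier gateway.
lab = buffer_headroom(num_traces=5000, decision_wait_s=10, new_traces_per_sec=300)
busy = buffer_headroom(num_traces=5000, decision_wait_s=10, new_traces_per_sec=2000)
print(round(lab, 2), round(busy, 2))
```

At 300 new traces per second the lab values leave headroom. At 2000 per second the same buffer would hold only a quarter of the traces still awaiting a decision, so decisions would be made on traces that are not yet complete.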

Pause and predict: Your gateway Collector runs three replicas behind a normal Kubernetes Service, and you enable tail sampling. Some spans from one trace land on replica one, while other spans from the same trace land on replica two. What will happen to sampling decisions? Predict the failure mode before reading the answer.

Tail sampling only works well when the decision maker sees enough of the trace. If spans from the same trace are scattered across independent Collectors, each replica may make incomplete decisions. One replica may keep a partial trace because it saw an error span, while another may drop related normal spans. Production designs usually place tail sampling in a gateway layer and add routing that keeps spans for the same trace together, or they accept head sampling when the operational complexity of tail sampling is not justified.
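The failure mode and the fix can both be sketched in a few lines. This is a toy model under stated assumptions: keep_trace covers only the error and latency policies (the probabilistic policy is omitted to keep the result deterministic), it uses the longest span as a stand-in for trace duration, and replica_for is a hypothetical hash-based router rather than the algorithm of any real load-balancing component.

```python
import hashlib


def keep_trace(spans, latency_threshold_ms=1000):
    """Tail decision: keep if any span errored or any span is slow."""
    has_error = any(s["status"] == "ERROR" for s in spans)
    is_slow = max(s["duration_ms"] for s in spans) >= latency_threshold_ms
    return has_error or is_slow


def replica_for(trace_id: str, replicas: int) -> int:
    """Route by trace ID so every span of a trace reaches the same replica."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % replicas


trace = [
    {"trace_id": "abc", "name": "checkout", "status": "OK", "duration_ms": 120},
    {"trace_id": "abc", "name": "payment", "status": "ERROR", "duration_ms": 80},
]

# Scattered: each replica sees one span and decides alone.
scattered = [keep_trace([span]) for span in trace]
# Trace-aware routing: all spans land on one replica, which decides once.
together = keep_trace(trace)
print(scattered, together)
```

In the scattered case the checkout span is dropped while the payment span is kept, so the stored trace is partial. Routing on the trace ID gives one replica the whole picture and one consistent decision.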

| Control | Strength | Risk | Good Use |
| --- | --- | --- | --- |
| Head sampling | Simple and low overhead | Drops traces before knowing whether they matter | Reducing high-volume normal traffic near the source |
| Tail sampling | Keeps traces based on outcome | Requires buffering and trace-aware routing | Preserving errors and slow requests under cost limits |
| Filtering | Removes predictable noise | Can hide useful signals if rules are too broad | Dropping health checks and synthetic probes |
| Batching | Improves export efficiency | Can delay visibility slightly | Almost every Collector pipeline |
| Memory limiting | Protects Collector stability | Drops telemetry under pressure | All production Collectors |

Cost control is not only about sampling percentages. Attribute design can be more important. A metric with user.id, order.id, or full URL paths can create very high-cardinality series even when request volume is moderate. A trace attribute with sensitive data can create compliance risk even if ingestion cost is acceptable. Good OpenTelemetry design asks: “Will this attribute help us route, aggregate, debug, or explain behavior?” If not, leave it out or put it somewhere safer.

Debugging OpenTelemetry works best when you test one boundary at a time. First prove the application creates telemetry. Then prove the SDK exports it. Then prove the Collector receives it. Then prove processors keep it. Then prove exporters deliver it. Jumping straight to the backend often wastes time because a missing dashboard can be caused by a missing resource attribute, a network policy, a dropped processor rule, or a backend authentication error.

┌──────────────────────────────────────────────────────────────────────────────┐
│                        TELEMETRY DEBUGGING CHECKLIST                         │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. Application creates telemetry                                            │
│     └─ Is the SDK initialized before requests start?                         │
│                                                                              │
│  2. Resource attributes identify the service                                 │
│     └─ Is service.name set consistently?                                     │
│                                                                              │
│  3. SDK exports to the expected endpoint                                     │
│     └─ Is the app using 4317 for gRPC or 4318 for HTTP?                      │
│                                                                              │
│  4. Collector receives the signal                                            │
│     └─ Do Collector logs or self-metrics show accepted spans or metrics?     │
│                                                                              │
│  5. Processors keep the signal                                               │
│     └─ Did a filter, sampler, or memory limiter drop the data?               │
│                                                                              │
│  6. Exporter delivers to backend                                             │
│     └─ Are auth, TLS, DNS, and backend limits correct?                       │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Collector self-metrics are an underused debugging tool. Metrics such as accepted spans, refused spans, processor drops, exporter queue size, and send failures can tell you whether the problem is upstream or downstream. If accepted spans increase but exporter failures also increase, the Collector is receiving data and failing to deliver it. If accepted spans stay at zero, focus on application endpoints, protocol mismatch, DNS, and network policy.
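That reasoning can be written down as a small triage function. The sketch below is hypothetical: the parameter names stand for counter deltas read from the Collector's self-metrics over a short window, the exact metric names vary by Collector version, and the refused-spans branch is an added assumption rather than something stated above.

```python
def diagnose(accepted_spans: int, refused_spans: int, send_failed_spans: int) -> str:
    """Map self-metric deltas to the part of the path worth checking first."""
    if accepted_spans == 0 and refused_spans == 0:
        return "upstream: check app endpoint, protocol, DNS, and network policy"
    if refused_spans > 0:
        # Assumption: refusals usually point at the receiver or memory pressure.
        return "receiver: check memory limiter pressure and payload errors"
    if send_failed_spans > 0:
        return "downstream: check backend auth, TLS, DNS, and limits"
    return "collector path healthy: check backend queries and resource attributes"


print(diagnose(accepted_spans=0, refused_spans=0, send_failed_spans=0))
print(diagnose(accepted_spans=1200, refused_spans=0, send_failed_spans=300))
```

The order of the checks follows the signal path, so the first non-healthy answer names the earliest boundary that needs attention.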

A protocol mismatch is easy to miss. OTLP/gRPC commonly uses port 4317; OTLP/HTTP commonly uses port 4318. Sending HTTP payloads to the gRPC receiver, or gRPC traffic to the HTTP receiver, often looks like a generic connection or parsing failure. During incident response, explicitly verify the exporter type and endpoint instead of assuming the port is enough.
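A quick consistency check between exporter protocol and endpoint port can catch this before deployment. The helper below is a hypothetical sketch: it assumes the conventional ports named above, assumes the endpoint includes a scheme, and uses the protocol strings grpc and http/protobuf as labels.

```python
from urllib.parse import urlparse

# Conventional OTLP ports; a deployment may legitimately remap them.
DEFAULT_PORTS = {"grpc": 4317, "http/protobuf": 4318}


def protocol_port_warning(protocol: str, endpoint: str):
    """Return a warning string when protocol and port look mismatched, else None."""
    expected = DEFAULT_PORTS.get(protocol)
    if expected is None:
        return f"unknown protocol {protocol!r}"
    port = urlparse(endpoint).port  # requires a scheme such as http://
    if port != expected:
        return f"{protocol} exporter targets port {port}; conventional port is {expected}"
    return None


print(protocol_port_warning("grpc", "http://127.0.0.1:4317"))
print(protocol_port_warning("http/protobuf", "http://127.0.0.1:4317"))
```

A warning here is a prompt to look at the receiver configuration, not proof of a fault, since ports can be remapped by a Service or proxy.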

# Verify Collector pods and service endpoints.
kubectl get pods -n observability
kubectl get svc -n observability otel-collector
# Define the k alias used by the remaining commands.
alias k=kubectl
# Check Collector logs for receiver, processor, and exporter errors.
k logs -n observability -l app=otel-collector --tail=100
# Port-forward the Collector's Prometheus exporter from the lab.
# This command blocks, so run the curl below from a second terminal.
k port-forward -n observability svc/otel-collector 8889:8889
# Inspect exported metrics from the Collector lab endpoint.
curl -s http://127.0.0.1:8889/metrics | grep -E "otelcol|http"

When telemetry is missing, resist the urge to change several things at once. Add the debug exporter temporarily, send one known request, and confirm whether the Collector prints the span. If it does, the application-to-Collector path works and you should focus on processors and backend export. If it does not, focus on SDK configuration, service discovery, protocol, and network connectivity.

| Mistake | What Breaks | How to Diagnose | Better Practice |
| --- | --- | --- | --- |
| Missing service.name or environment attributes | Telemetry arrives but cannot be grouped, owned, or routed correctly | Search the backend for unknown services and inspect resource attributes in debug export | Set resource attributes in SDKs, operators, or Collector resource processors |
| Exporting directly from every app to every backend | Backend changes require application redeploys and repeated configuration | Count how many repositories contain backend exporter settings | Send workloads to a Collector and change backend routing centrally |
| Using OTLP/HTTP against the gRPC receiver, or the reverse | Export attempts fail with connection, parsing, or unavailable errors | Compare exporter type with port 4317 or 4318 and receiver configuration | Standardize endpoint names and document which protocol each port expects |
| Running the Collector without memory_limiter and resource limits | Spikes or backend outages can cause OOM kills and telemetry loss | Check pod restarts, memory usage, and exporter queue growth | Put memory_limiter early in every pipeline and set Kubernetes memory limits |
| Applying broad filters before proving signal value | Useful traces disappear and teams blame instrumentation | Temporarily add a debug exporter before and after filter changes | Filter predictable noise such as health checks, not whole routes or services casually |
| Enabling tail sampling without trace-aware gateway routing | Traces become partial or sampling decisions appear inconsistent | Compare spans for one trace across Collector replicas and gateway logs | Use routing that keeps trace IDs together, or use head sampling until routing is ready |
| Adding high-cardinality attributes to metrics | Storage cost and query latency grow rapidly | Inspect label values for user IDs, order IDs, raw paths, or request IDs | Use route templates and bounded labels; keep unique IDs in traces or logs |
| Assuming auto-instrumentation captures business meaning | Traces show framework calls but not why the operation mattered | Look for spans named only after HTTP routes or database calls | Add manual spans around domain decisions, retries, and external dependency boundaries |

1. Your team deploys auto-instrumentation and sees HTTP spans, but all services appear as `unknown_service`. What do you check and how do you fix it?

Check resource attributes first, especially `service.name`, `service.version`, and `deployment.environment`. The instrumentation may be creating spans correctly, but the backend cannot group them under useful service identities. Fix the SDK, agent environment variables, OpenTelemetry Operator configuration, or Collector resource processor so every workload emits a stable `service.name`. Then send one request and verify the debug exporter or backend shows the corrected resource.
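The resource check in this answer can be sketched as a small audit function. This is an illustration only: `audit_resource` and `REQUIRED_KEYS` are hypothetical names, and the input is assumed to be a plain dictionary of resource attributes that you copied out of debug exporter output.

```python
# Hypothetical helper: audit one resource's attributes for identity problems.
REQUIRED_KEYS = ("service.name", "service.version", "deployment.environment")


def audit_resource(attributes: dict) -> list:
    """Return human-readable problems found in one resource's attributes."""
    problems = []
    for key in REQUIRED_KEYS:
        if not attributes.get(key):
            problems.append(f"missing {key}")
    # SDKs fall back to a name beginning with "unknown_service" when none is set.
    name = attributes.get("service.name", "")
    if name.startswith("unknown_service"):
        problems.append("service.name is the unknown_service fallback")
    return problems


if __name__ == "__main__":
    print(audit_resource({"service.name": "unknown_service:python"}))
```

An empty list means the resource carries all three attributes and a real service name.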

2. A payment trace shows checkout and inventory, but payment appears as a separate root trace. The Collector is receiving spans from all three services. Where should you investigate first?

Investigate context propagation at the payment service boundary. Confirm checkout injects `traceparent`, confirm any proxy preserves the header, confirm payment extracts the incoming context before creating spans, and confirm payment injects context into further outgoing calls. Since the Collector receives spans from all services, the missing relationship is probably created before collection, not inside the backend.
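When you capture headers on both sides of the boundary, a quick comparison of trace IDs shows whether propagation survived the hop. The sketch below assumes W3C Trace Context headers and uses hypothetical helper names (`trace_id_from`, `same_trace`). It is a simplified parser that checks only the basic `traceparent` shape, not every rule in the specification.

```python
import re
from typing import Optional

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def trace_id_from(headers: dict) -> Optional[str]:
    """Return the trace ID from a traceparent header, or None if absent or invalid."""
    # HTTP header names are case-insensitive.
    lowered = {key.lower(): value for key, value in headers.items()}
    match = TRACEPARENT.match(lowered.get("traceparent", "").strip())
    if not match:
        return None
    trace_id, parent_id = match.group(2), match.group(3)
    # All-zero IDs are invalid in W3C Trace Context.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id


def same_trace(outgoing: dict, incoming: dict) -> bool:
    """True when the caller's outgoing header and the callee's incoming header share a trace ID."""
    sent = trace_id_from(outgoing)
    return sent is not None and sent == trace_id_from(incoming)
```

If `same_trace` is false for checkout's outgoing request and payment's incoming request, look at proxies or middleware between the two services before looking at the backend.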

3. Your platform wants to keep every error trace but only a small percentage of successful traces. Which sampling approach fits, and what operational risk must you manage?

Tail sampling fits because it can wait until spans reveal status codes, latency, or attributes before deciding what to keep. The operational risk is buffering and trace completeness. Gateway Collectors need enough memory and must receive all or most spans for the same trace, usually through trace-aware routing. Without that, tail sampling can create partial traces or inconsistent decisions.
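The policy described here, keep every error trace and a small share of successful ones, can be sketched in a few lines. In practice this decision runs inside a Collector's tail sampling processor, not in application code. The function name `keep_trace`, the span dictionaries, and the `status` field below are assumptions made only for illustration.

```python
import hashlib


def keep_trace(trace_id: str, spans: list, success_keep_ratio: float = 0.05) -> bool:
    """Decide once all spans of a trace are buffered: keep errors, sample the rest."""
    # Any span with an error status keeps the whole trace.
    if any(span.get("status") == "ERROR" for span in spans):
        return True
    # Hash the trace ID so the same trace always gets the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < success_keep_ratio * 2**64
```

The function needs the full list of spans before it can answer, which is why tail sampling requires buffering and why all spans of a trace must reach the same Collector instance.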

4. A Collector receives spans, but the backend shows none. What sequence of checks would isolate whether the problem is processing or export?

First add or enable a debug exporter in the same traces pipeline to prove spans survive receivers and early processors. Then inspect Collector self-metrics and logs for processor drops, memory limiter pressure, exporter queue growth, and send failures. If debug output shows spans but backend export fails, focus on exporter endpoint, protocol, TLS, authentication, DNS, and backend ingestion limits. If debug output shows no spans, move upstream to SDK endpoint and receiver configuration.
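The same sequence can be applied to the Collector's self-metrics. The sketch below parses Prometheus text output and points at the likely boundary. The default metric names are assumptions: they match names used by many Collector releases, but some versions add a `_total` suffix, so check your own metrics endpoint and pass the names you actually see. The parser also assumes samples carry no trailing timestamps.

```python
def parse_counters(metrics_text: str) -> dict:
    """Sum Prometheus text-format samples by metric name, ignoring labels."""
    totals = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        if not series:
            continue
        name = series.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def locate_problem(
    totals: dict,
    accepted: str = "otelcol_receiver_accepted_spans",   # assumed name
    sent: str = "otelcol_exporter_sent_spans",           # assumed name
    failed: str = "otelcol_exporter_send_failed_spans",  # assumed name
) -> str:
    """Map counter values to the boundary worth investigating first."""
    if totals.get(accepted, 0) == 0:
        return "upstream: check SDK endpoint and receiver configuration"
    if totals.get(failed, 0) > 0:
        return "export: check endpoint, protocol, TLS, authentication, DNS, backend limits"
    if totals.get(sent, 0) < totals.get(accepted, 0):
        return "processing: check filters, sampling, and memory limiter pressure"
    return "no drop visible in these counters"
```

A small gap between accepted and sent spans can also be a batch that has not been flushed yet, so compare two readings taken a few seconds apart before drawing conclusions.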

5. A developer adds `user.id` and full request URL labels to a latency histogram so incidents are easier to debug. What do you recommend?

Do not put unbounded identifiers into metric attributes. They create high-cardinality series that increase cost and slow queries. Use route templates such as `/orders/{id}`, bounded dimensions such as method and status code, and keep unique identifiers in traces or logs where they support point investigation. If a user-specific investigation is required, link from metrics to exemplars or traces rather than turning every user into a metric series.
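To see the cardinality effect concretely, compare the number of distinct label values before and after templating. This is a rough sketch: `route_template` is a hypothetical helper that only recognizes numeric and UUID-shaped segments. In a real service, prefer the route pattern your web framework already knows over guessing from the raw path.

```python
import re

# Path segments that look like numeric IDs or UUIDs.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$"
)


def route_template(path: str) -> str:
    """Replace ID-like path segments with {id} and drop the query string."""
    path = path.split("?", 1)[0]
    segments = ["{id}" if _ID_SEGMENT.match(part) else part for part in path.split("/")]
    return "/".join(segments)


raw_paths = ["/orders/1001", "/orders/1002?coupon=A", "/orders/1003"]
raw_series = len(set(raw_paths))                                # one series per raw URL
templated_series = len({route_template(p) for p in raw_paths})  # one series per route
```

Here three raw URLs would create three series, while the templated label creates one, and that gap keeps growing with every new order.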

6. Your app exports to `http://otel-collector:4317`, but the SDK is configured for OTLP/HTTP. The Collector has gRPC on `4317` and HTTP on `4318`. What symptom do you expect and what is the fix?

The application will likely fail export with connection, protocol, or parsing errors because OTLP/HTTP is being sent to the gRPC receiver. Change the endpoint to `http://otel-collector:4318` for OTLP/HTTP, or change the exporter to OTLP/gRPC if you want to use `4317`. After changing it, verify accepted spans or metrics in Collector logs or self-metrics.
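A pre-deployment check for this mismatch can be very small. The sketch below assumes the Collector listens on the conventional ports used in this module (4317 for gRPC, 4318 for HTTP) and that the protocol is written as `grpc` or `http/protobuf`, the values accepted by the `OTEL_EXPORTER_OTLP_PROTOCOL` environment variable. The function name `check_endpoint` is hypothetical.

```python
from urllib.parse import urlparse

# Conventional OTLP ports; a Collector can be configured differently.
DEFAULT_PORTS = {"grpc": 4317, "http/protobuf": 4318}


def check_endpoint(protocol: str, endpoint: str) -> str:
    """Compare the exporter protocol with the port in the endpoint URL."""
    expected = DEFAULT_PORTS.get(protocol)
    if expected is None:
        return f"unknown protocol {protocol!r}"
    port = urlparse(endpoint).port
    if port != expected:
        return f"mismatch: {protocol} exporter pointed at port {port}, expected {expected}"
    return "ok"


if __name__ == "__main__":
    print(check_endpoint("http/protobuf", "http://otel-collector:4317"))
```

The check only compares conventions, so treat a mismatch as a prompt to read the receiver configuration, not as proof of a fault.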

7. A team wants to remove all health-check spans with a filter processor. How do you make the change without hiding real traffic?

Start by identifying exact span names or route attributes for health checks, then add a narrow filter such as `GET /healthz` and `GET /readyz`. Keep a debug exporter during the change or compare accepted and dropped span counts before and after rollout. Avoid broad patterns such as every route containing `health` unless you have proved they only match probes. The goal is to reduce predictable noise while preserving user-facing failure evidence.
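The difference between a narrow and a broad match is easy to demonstrate offline before touching the Collector. The sketch below is plain Python, not filter processor syntax, and the route `/api/healthcare/claims` is an invented example of real traffic that a substring match would catch by accident.

```python
# Exact span names produced by probes in this example.
HEALTH_SPAN_NAMES = {"GET /healthz", "GET /readyz"}


def drop_narrow(span_name: str) -> bool:
    """Drop only spans whose name exactly matches a known probe."""
    return span_name in HEALTH_SPAN_NAMES


def drop_broad(span_name: str) -> bool:
    """Drop anything containing the substring 'health' (too broad)."""
    return "health" in span_name.lower()


spans = ["GET /healthz", "GET /readyz", "GET /checkout", "POST /api/healthcare/claims"]

# Spans the broad rule removes even though they are not probes.
wrongly_dropped = [name for name in spans if drop_broad(name) and not drop_narrow(name)]
# Probe spans the broad rule fails to remove.
missed_probes = [name for name in spans if drop_narrow(name) and not drop_broad(name)]
```

The broad rule both hides a user-facing route and misses the readiness probe, which is the kind of evidence worth collecting before a filter reaches production.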

8. During a backend outage, Collector memory grows and pods restart. Which Collector settings and Kubernetes settings do you inspect?

Inspect the `memory_limiter` processor, exporter queue settings, batch processor settings, and Kubernetes memory limits. A backend outage can cause queued telemetry to accumulate, especially if the Collector has no early memory protection. The better design sets pod memory limits, places `memory_limiter` early in each pipeline, uses bounded queues, and accepts that telemetry may be dropped under pressure rather than allowing the Collector to crash repeatedly.
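The trade-off in this answer, drop under pressure instead of growing until the process is killed, is the behavior of any bounded queue. The class below is a conceptual sketch with a hypothetical name (`BoundedExportQueue`). It is not how the Collector implements its sending queue, only a model of why a size limit turns an outage into counted drops rather than unbounded memory growth.

```python
from collections import deque


class BoundedExportQueue:
    """Queue that rejects new items when full and counts what it rejected."""

    def __init__(self, max_items: int) -> None:
        self._items = deque()
        self._max_items = max_items
        self.dropped = 0

    def offer(self, item) -> bool:
        """Enqueue an item, or drop it and return False when the queue is full."""
        if len(self._items) >= self._max_items:
            self.dropped += 1
            return False
        self._items.append(item)
        return True

    def __len__(self) -> int:
        return len(self._items)


# Simulate a backend outage: nothing is drained while five batches arrive.
queue = BoundedExportQueue(max_items=2)
results = [queue.offer(f"batch-{i}") for i in range(5)]
```

Memory use stays capped at two batches, and the `dropped` counter is the signal an operator would alert on.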

Hands-On Exercise: Build and Debug an OpenTelemetry Pipeline

In this exercise you will deploy an OpenTelemetry Collector, send traces and metrics from a small Python application, and verify the pipeline from the application to the Collector. The goal is not to build a perfect production stack. The goal is to practice the debugging sequence you will use when a real platform says, “Telemetry is missing.”

Step 1: Create the Namespace

kubectl create namespace observability
alias k=kubectl

Success criteria:

  • The `observability` namespace exists.
  • You can run `k get ns observability` successfully.
  • You understand that the alias is only a shell shortcut for this exercise.

Step 2: Deploy the Collector

k apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 256
        spike_limit_mib: 64
      batch:
        timeout: 1s
        send_batch_size: 256
      resource:
        attributes:
          - key: deployment.environment
            value: lab
            action: upsert
    exporters:
      debug:
        verbosity: detailed
      prometheus:
        endpoint: 0.0.0.0:8889
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
    service:
      extensions:
        - health_check
      pipelines:
        traces:
          receivers:
            - otlp
          processors:
            - memory_limiter
            - resource
            - batch
          exporters:
            - debug
        metrics:
          receivers:
            - otlp
          processors:
            - memory_limiter
            - resource
            - batch
          exporters:
            - prometheus
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.99.0
          args:
            - --config=/etc/otel/config.yaml
          ports:
            - name: otlp-grpc
              containerPort: 4317
            - name: otlp-http
              containerPort: 4318
            - name: metrics
              containerPort: 8889
            - name: health
              containerPort: 13133
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318
    - name: metrics
      port: 8889
      targetPort: 8889
    - name: health
      port: 13133
      targetPort: 13133
EOF

Success criteria:

  • `k get pods -n observability` shows the Collector pod running.
  • `k logs -n observability -l app=otel-collector --tail=50` shows no configuration errors.
  • You can explain why `memory_limiter` appears before `batch` in the pipelines.
  • You can identify which port is OTLP/gRPC and which port is OTLP/HTTP.
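The processor ordering in the pipelines above can be checked mechanically. The Collector documentation recommends placing `memory_limiter` first so it can refuse data before later processors, such as `batch`, hold it in memory. The helper below is a hypothetical lint function (`check_processor_order`), not part of any OpenTelemetry tool, and it only inspects a list of processor names that you pass in.

```python
def check_processor_order(processors: list) -> list:
    """Return ordering problems for one pipeline's processor list."""
    problems = []
    if "memory_limiter" not in processors:
        problems.append("memory_limiter is missing")
    elif processors[0] != "memory_limiter":
        problems.append("memory_limiter should be the first processor")
    if "batch" in processors and "memory_limiter" in processors:
        # batch buffers data, so it must not run before memory protection.
        if processors.index("batch") < processors.index("memory_limiter"):
            problems.append("batch runs before memory_limiter")
    return problems


if __name__ == "__main__":
    # The order used by both lab pipelines.
    print(check_processor_order(["memory_limiter", "resource", "batch"]))
```

Running it against the lab's processor list returns an empty list, while swapping `batch` to the front reports both problems.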

Step 3: Deploy a Small Instrumented Application

k apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: demo-app-code
  namespace: observability
data:
  app.py: |
    import random
    import time

    from flask import Flask
    from opentelemetry import metrics, trace
    from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.flask import FlaskInstrumentor
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    resource = Resource.create(
        {
            "service.name": "demo-app",
            "service.version": "1.0.0",
        }
    )

    trace_provider = TracerProvider(resource=resource)
    trace_exporter = OTLPSpanExporter(
        endpoint="http://otel-collector.observability.svc.cluster.local:4317",
        insecure=True,
    )
    trace_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
    trace.set_tracer_provider(trace_provider)

    metric_exporter = OTLPMetricExporter(
        endpoint="http://otel-collector.observability.svc.cluster.local:4317",
        insecure=True,
    )
    metric_reader = PeriodicExportingMetricReader(
        metric_exporter,
        export_interval_millis=5000,
    )
    metric_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
    metrics.set_meter_provider(metric_provider)

    tracer = trace.get_tracer("demo-app")
    meter = metrics.get_meter("demo-app")
    request_counter = meter.create_counter("demo.requests", unit="1")
    request_duration = meter.create_histogram("demo.request.duration", unit="s")

    app = Flask(__name__)
    FlaskInstrumentor().instrument_app(app)

    @app.route("/")
    def index():
        start = time.time()
        request_counter.add(1, {"http.route": "/", "http.request.method": "GET"})
        request_duration.record(time.time() - start, {"http.route": "/"})
        return "Hello from OpenTelemetry\n"

    @app.route("/checkout")
    def checkout():
        start = time.time()
        with tracer.start_as_current_span("reserve_inventory") as span:
            delay = random.uniform(0.05, 0.3)
            span.set_attribute("inventory.delay_seconds", delay)
            time.sleep(delay)
        with tracer.start_as_current_span("authorize_payment") as span:
            approved = random.random() > 0.2
            span.set_attribute("payment.approved", approved)
            if not approved:
                span.set_attribute("payment.failure_reason", "issuer_declined")
        elapsed = time.time() - start
        request_counter.add(1, {"http.route": "/checkout", "http.request.method": "GET"})
        request_duration.record(elapsed, {"http.route": "/checkout"})
        return f"checkout completed in {elapsed:.3f}s\n"

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
  namespace: observability
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: app
          image: python:3.12-slim
          command:
            - /bin/sh
            - -c
          args:
            - |
              pip install --no-cache-dir flask opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-instrumentation-flask &&
              python -u /app/app.py
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: app-code
              mountPath: /app
      volumes:
        - name: app-code
          configMap:
            name: demo-app-code
---
apiVersion: v1
kind: Service
metadata:
  name: demo-app
  namespace: observability
spec:
  selector:
    app: demo-app
  ports:
    - name: http
      port: 8080
      targetPort: 8080
EOF

Success criteria:

  • `k get pods -n observability -l app=demo-app` shows the demo application running.
  • The application code sets `service.name` before creating traces and metrics.
  • You can identify the manually created spans inside the `/checkout` handler.
  • You can explain which spans Flask auto-instrumentation should add.

Step 4: Generate Traffic and Verify Traces

k port-forward -n observability svc/demo-app 8080:8080

In a second terminal:

curl -s http://127.0.0.1:8080/
curl -s http://127.0.0.1:8080/checkout
curl -s http://127.0.0.1:8080/checkout

Then inspect Collector logs:

k logs -n observability -l app=otel-collector --tail=200 | grep -E "demo-app|reserve_inventory|authorize_payment"

Success criteria:

  • You see evidence of `demo-app` in Collector debug output.
  • You see spans for the `/checkout` request.
  • You can distinguish framework spans from manual business spans.
  • You can explain why the debug exporter is useful during first deployment.

Step 5: Verify Metrics

k port-forward -n observability svc/otel-collector 8889:8889

In a second terminal:

curl -s http://127.0.0.1:8889/metrics | grep demo

Success criteria:

  • You can query the Collector’s Prometheus exporter.
  • You see `demo.requests` or translated metric names exposed for scraping.
  • You can explain why route templates are safer metric labels than raw URLs.
  • You can describe how Prometheus from Module 1.1 would scrape this endpoint.
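When you look for the demo metrics, expect Prometheus-style names, not the dotted OpenTelemetry names. The function below is a deliberately simplified approximation of that translation, written only to help you guess what to grep for. The name `approx_prometheus_name` and the small unit table are assumptions, the real exporter applies more rules, and the actual `/metrics` output is the authority.

```python
import re

# A few common unit abbreviations and the words typically appended for them.
UNIT_WORDS = {"s": "seconds", "ms": "milliseconds", "By": "bytes"}


def approx_prometheus_name(otel_name: str, unit: str = "", monotonic_counter: bool = False) -> str:
    """Approximate the Prometheus name for an OpenTelemetry instrument."""
    # Prometheus names cannot contain dots, so they become underscores.
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    word = UNIT_WORDS.get(unit)
    if word and not name.endswith("_" + word):
        name += "_" + word
    # Monotonic counters conventionally end in _total.
    if monotonic_counter and not name.endswith("_total"):
        name += "_total"
    return name


if __name__ == "__main__":
    print(approx_prometheus_name("demo.requests", unit="1", monotonic_counter=True))
    print(approx_prometheus_name("demo.request.duration", unit="s"))
```

Both results still contain `demo`, which is why the `grep demo` check in this step works regardless of the exact suffixes.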

Step 6: Test a Protocol Mismatch

Change the application exporter endpoint from port 4317 to 4318 while still using the gRPC exporter, redeploy the application, and generate traffic again. Then inspect the application logs and Collector logs.

Success criteria:

  • You can observe export failures or missing accepted telemetry.
  • You can explain why OTLP/gRPC and OTLP/HTTP are not interchangeable just because both are OTLP.
  • You can restore the correct endpoint and verify telemetry returns.
  • You can write a short debugging note that names the boundary you tested.

Step 7: Clean Up

k delete namespace observability

Success criteria:

  • The namespace and lab workloads are removed.
  • You can summarize the full signal path from application SDK to Collector exporter.
  • You can name at least two places where telemetry could be dropped intentionally.
  • You can name at least two places where telemetry could be lost accidentally.

Continue to Module 1.3: Grafana to learn how dashboards and visual exploration turn telemetry into operational decisions.
