Перейти до вмісту

Module 1.1: OTel API & SDK Deep Dive

Цей контент ще не доступний вашою мовою.

Complexity: [COMPLEX] - Core domain, 46% of OTCA exam weight

Time to Complete: 90-120 minutes

Prerequisites: Basic familiarity with distributed systems, HTTP, and at least one programming language (Python or Go)


After completing this module, you will be able to:

  1. Instrument an application with the OpenTelemetry SDK, configuring TracerProvider, MeterProvider, and LoggerProvider pipelines
  2. Implement custom spans with attributes, events, and status codes that capture business-meaningful telemetry
  3. Configure exporters (OTLP, Prometheus, console) and processors (batch, filter) to control how signals leave your application
  4. Evaluate the tradeoffs between automatic and manual instrumentation and select the right approach for each service

A team at a mid-size fintech company had Jaeger deployed, Prometheus scraping, and Fluentd shipping logs. Three separate instrumentation libraries, three configuration surfaces, three vendor lock-in vectors. When they needed to switch from Jaeger to Grafana Tempo, it took six weeks to re-instrument 40 services. Then OpenTelemetry arrived, and the next migration — Tempo to an OTLP-native backend — took an afternoon. One SDK, one wire protocol, swap the exporter.

That is the entire point of OpenTelemetry: decouple instrumentation from backends. Domain 2 of the OTCA exam (46% of your score) tests whether you understand the machinery that makes this possible — the Provider/Processor/Exporter pipelines for traces, metrics, and logs.

Get this module right and you have nearly half the exam locked down.


  • OpenTelemetry is the second most active CNCF project after Kubernetes itself, with 1,000+ contributors across 11 language SIGs.
  • The OTel Collector is optional. The SDK can export directly to backends. The Collector adds routing, batching, and multi-backend fan-out, but it is a deployment choice, not a requirement.
  • Semantic conventions took years to stabilize. The rename from http.method to http.request.method (v1.23+) broke dashboards everywhere — which is why the exam tests whether you know the current stable names.
  • Zero-code instrumentation in Java modifies bytecode at load time. The Java agent uses java.lang.instrument and byte-buddy to rewrite class files before the JVM executes them — no source changes needed.

Part 1: The Three Pillars — Provider Pipelines

Section titled “Part 1: The Three Pillars — Provider Pipelines”

Every signal in OTel follows the same pattern: Provider → Processor/Reader → Exporter. Understand one and you understand all three.

┌──────────────────────────────────────────────────────────────────┐
│ TracerProvider │
│ │
│ Resource: { service.name="cart-svc", deployment.env="prod" } │
│ │
│ ┌────────┐ ┌──────────────────┐ ┌──────────────────────┐ │
│ │ Tracer │───▶│ SpanProcessor │───▶│ SpanExporter │ │
│ │ │ │ │ │ │ │
│ │ start │ │ SimpleProcessor │ │ OTLPSpanExporter │ │
│ │ span() │ │ (sync, dev) │ │ ConsoleSpanExporter │ │
│ │ │ │ │ │ JaegerExporter │ │
│ │ end │ │ BatchProcessor │ │ ZipkinExporter │ │
│ │ span() │ │ (async, prod) │ │ │ │
│ └────────┘ └──────────────────┘ └──────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

SpanProcessor — controls when spans are handed to the exporter:

ProcessorBehaviorUse Case
SimpleSpanProcessorExports each span immediately, synchronouslyLocal dev, debugging
BatchSpanProcessorBuffers spans, exports in batches on a timerProduction (always)

BatchSpanProcessor tuning knobs (and their env vars):

ParameterEnv VarDefaultWhat It Does
maxQueueSizeOTEL_BSP_MAX_QUEUE_SIZE2048Buffer capacity before dropping
scheduledDelayMillisOTEL_BSP_SCHEDULE_DELAY5000Flush interval in ms
maxExportBatchSizeOTEL_BSP_MAX_EXPORT_BATCH_SIZE512Spans per export call

SpanExporter — controls where spans go:

ExporterProtocolNotes
OTLP (gRPC)grpc on port 4317Default, preferred
OTLP (HTTP)http/protobuf on port 4318Firewall-friendly
ConsolestdoutDebugging only
JaegerThrift/gRPCLegacy — use OTLP instead
ZipkinHTTP/JSONLegacy — use OTLP instead

Exam tip: OTLP is the only exporter the spec mandates SDKs implement. Jaeger and Zipkin exporters are community add-ons.

┌──────────────────────────────────────────────────────────────────┐
│ MeterProvider │
│ │
│ Resource: { service.name="cart-svc" } │
│ │
│ ┌────────┐ ┌──────────────────┐ ┌──────────────────────┐ │
│ │ Meter │───▶│ MetricReader │───▶│ MetricExporter │ │
│ │ │ │ │ │ │ │
│ │counter │ │ PeriodicReader │ │ OTLPMetricExporter │ │
│ │histo │ │ (push, 60s) │ │ ConsoleExporter │ │
│ │gauge │ │ │ │ │ │
│ │ │ │ PrometheusReader │ │ (pull: /metrics) │ │
│ └────────┘ └──────────────────┘ └──────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

MetricReader — controls how metric data is collected:

ReaderModelUse Case
PeriodicExportingMetricReaderPush (exports on interval)OTLP backends, Collector
PrometheusMetricReaderPull (exposes /metrics)Prometheus scrape targets
┌──────────────────────────────────────────────────────────────────┐
│ LoggerProvider │
│ │
│ ┌──────────────┐ ┌──────────────────┐ ┌────────────────┐ │
│ │ Log Bridge │───▶│ LogProcessor │───▶│ LogExporter │ │
│ │ │ │ │ │ │ │
│ │ Python │ │ SimpleProcessor │ │ OTLP │ │
│ │ stdlib │ │ BatchProcessor │ │ Console │ │
│ │ Java Log4j │ │ │ │ │ │
│ │ .NET ILogger │ │ │ │ │ │
│ └──────────────┘ └──────────────────┘ └────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

Key concept: OTel does NOT replace your logging library. A log bridge connects existing loggers (Python logging, Java Log4j/SLF4J) to the OTel pipeline. You keep writing logger.info("order placed") — the bridge enriches log records with trace context and routes them through the OTel exporter.

1.4 Signal Maturity — Know This for the Exam

Section titled “1.4 Signal Maturity — Know This for the Exam”
SignalSpec StatusSDK StatusNotes
TracesStableStable in all major languagesSafe for production
MetricsStableStable in most languagesGA since 2023
LogsStable (data model)Mixed (bridge APIs vary)Bridges are the integration point
BaggageStableStablePropagation mechanism
ProfilingExperimentalEarly developmentNot on exam yet

A span is not just “a timer with a name.” The exam expects you to know every field.

Span {
trace_id: "abc123..." ← 16-byte, shared across all spans in trace
span_id: "def456..." ← 8-byte, unique to this span
parent_id: "789abc..." ← links to parent span (or empty for root)
name: "GET /api/cart" ← operation name
kind: SERVER ← see below
start_time: 1711234567.123 ← nanosecond precision
end_time: 1711234567.456
status: { code: OK } ← Unset, Ok, Error
attributes: { "http.request.method": "GET", "http.response.status_code": 200 }
events: [ { name: "cache.miss", timestamp: ... } ]
links: [ { trace_id: "other_trace", span_id: "other_span" } ]
resource: { "service.name": "cart-svc", "host.name": "node-1" }
}
KindWho Creates ItExample
INTERNALDefault, in-process operationsBusiness logic, helper functions
SERVERThe service receiving a requestHTTP handler, gRPC server method
CLIENTThe service making a requestHTTP client call, DB query
PRODUCEREnqueuing a message (async)Kafka producer, SQS send
CONSUMERProcessing a queued messageKafka consumer, SQS handler

Exam trap: A span for “calling the database” is CLIENT, not INTERNAL. The span kind describes the role in the communication, not whether it is “internal to your code.”

This distinction trips people up constantly.

AttributesResources
ScopePer-span (or per-metric data point)Per-Provider (applies to ALL telemetry)
Set whenDuring span/metric creationAt SDK initialization
Examplehttp.request.method=GETservice.name=cart-svc
ChangesDifferent per operationSame for entire process lifetime

Events are timestamped annotations within a span. The most common use is recording exceptions:

from opentelemetry import trace
span = trace.get_current_span()
try:
process_order(order_id)
except Exception as e:
span.set_status(trace.StatusCode.ERROR, str(e))
span.record_exception(e) # adds an event with exception.type, exception.message, exception.stacktrace
raise

record_exception is syntactic sugar — it creates an event named exception with attributes from semantic conventions.

Links connect spans across separate traces. Unlike parent-child (which is within one trace), links say “this span is related to that other span.”

Use cases:

  • A batch job that processes items from multiple incoming requests
  • Fan-in: a span that aggregates results from several independent traces
  • Retries where each attempt creates a new trace

Synchronous instruments are called inline with your code — you call .add() or .record() at the point of measurement.

Asynchronous (observable) instruments register a callback. The SDK calls your function at collection time to read the current value.

InstrumentSync/AsyncMonotonic?Example
CounterSyncYes (only increases)requests_total, bytes_sent
UpDownCounterSyncNo (increases and decreases)active_connections, queue_depth
HistogramSyncN/A (records distributions)request_duration_ms, response_size
Observable CounterAsyncYesCPU time (read from /proc)
Observable UpDownCounterAsyncNoCurrent memory usage
Observable GaugeAsyncNo (point-in-time value)Temperature, current CPU %

When to use async: When the value already exists somewhere (OS counter, connection pool stats) and you just want to observe it at collection time. Do NOT poll in a loop — let the SDK invoke your callback.

This is one of the most tested concepts in Domain 2.

TemporalityWhat It ReportsWho Prefers It
CumulativeTotal since process start (e.g., counter = 1042)Prometheus, OTLP default for counters
DeltaChange since last collection (e.g., counter += 17)StatsD, some commercial backends

Why it matters: If your exporter expects delta but the SDK sends cumulative, you get nonsense numbers. The SDK’s MetricReader determines temporality preference.

Cumulative: |---100---|---250---|---430---| (running total)
Delta: |---100---|---150---|---180---| (per-interval change)

The SDK aggregates raw measurements before export:

InstrumentDefault Aggregation
CounterSum
HistogramExplicit Bucket Histogram
GaugeLast Value

3.4 Exemplars — Linking Metrics to Traces

Section titled “3.4 Exemplars — Linking Metrics to Traces”

Exemplars attach a trace_id and span_id to a metric data point. When you see a latency spike on a dashboard, exemplars let you click through to the exact trace that caused it.

histogram bucket [200ms-500ms]: count=47, exemplar={trace_id="abc123", value=423ms}

Exemplars are configured on the SDK level and typically sampled (not every data point gets one).


4.1 How Context Crosses Service Boundaries

Section titled “4.1 How Context Crosses Service Boundaries”
Service A Service B
┌──────────────┐ HTTP Request ┌──────────────┐
│ │ │ │
│ span-A │ headers: │ span-B │
│ trace=abc │──────────────────▶│ trace=abc │
│ │ traceparent: │ parent=A │
│ │ 00-abc...-A.. │ │
└──────────────┘ └──────────────┘

The SDK uses propagators to inject context into carriers (HTTP headers, message attributes) and extract it on the receiving side.

The traceparent header:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
version trace-id span-id flags
(sampled)

The tracestate header carries vendor-specific data (e.g., sampling decisions for a particular backend).

Baggage propagates arbitrary key-value pairs across service boundaries. Unlike trace context (which carries IDs), baggage carries business data.

from opentelemetry import baggage, context
ctx = baggage.set_baggage("user.tier", "premium")
# attach ctx — now all downstream services can read user.tier

Warning: Baggage is sent in headers to every downstream service. Do NOT put sensitive data (PII, tokens) in baggage. It is visible on the wire.


5.1 Environment Variables (Exam Favorites)

Section titled “5.1 Environment Variables (Exam Favorites)”
VariablePurposeExample
OTEL_SERVICE_NAMESets service.name resource attributecart-service
OTEL_EXPORTER_OTLP_ENDPOINTBase URL for all OTLP signalshttp://collector:4317
OTEL_EXPORTER_OTLP_PROTOCOLWire protocolgrpc, http/protobuf
OTEL_EXPORTER_OTLP_HEADERSAuth headersapi-key=secret123
OTEL_TRACES_EXPORTERTrace exporter typeotlp, console, none
OTEL_METRICS_EXPORTERMetric exporter typeotlp, prometheus, none
OTEL_LOGS_EXPORTERLog exporter typeotlp, console, none
OTEL_TRACES_SAMPLERSampling strategyalways_on, traceidratio
OTEL_TRACES_SAMPLER_ARGSampler parameter0.1 (10% sampling)
OTEL_RESOURCE_ATTRIBUTESAdditional resource attrsdeployment.env=prod,service.version=2.1

Exam tip: Env vars take lowest precedence. Programmatic config overrides env vars. Signal-specific env vars (e.g., OTEL_EXPORTER_OTLP_TRACES_ENDPOINT) override the general one (OTEL_EXPORTER_OTLP_ENDPOINT).

5.2 Semantic Conventions (Stable Naming Standards)

Section titled “5.2 Semantic Conventions (Stable Naming Standards)”

The exam tests whether you know the correct attribute names. These are the most common:

ConventionAttributeExample Value
HTTPhttp.request.methodGET
HTTPhttp.response.status_code200
HTTPurl.fullhttps://api.example.com/cart
Serviceservice.namecart-service
Serviceservice.version1.4.2
Databasedb.systempostgresql
Databasedb.statementSELECT * FROM orders
Messagingmessaging.systemkafka
Messagingmessaging.operationpublish

Watch out: The old convention http.method was renamed to http.request.method. The exam uses the current stable names from semconv v1.23+.


Zero-code (auto) instrumentation wraps library calls without changing your source.

LanguageMechanismCommand
JavaBytecode manipulation via java.lang.instrument agentjava -javaagent:opentelemetry-javaagent.jar -jar myapp.jar
PythonMonkey-patching libraries at import timeopentelemetry-instrument python myapp.py
.NETCLR profiler API and startup hooksVia env vars: CORECLR_ENABLE_PROFILING=1
Node.jsRequire hooks / module patchingnode --require @opentelemetry/auto-instrumentations-node myapp.js

Auto-instrumentation covers libraries, not business logic:

  • HTTP servers and clients (Flask, requests, net/http)
  • Database drivers (psycopg2, JDBC, database/sql)
  • Messaging (Kafka clients, RabbitMQ)
  • gRPC

You still need manual instrumentation for:

  • Custom business spans (“process-order”, “validate-cart”)
  • Custom metrics (order value, queue depth)
  • Adding attributes specific to your domain

7.1 Python — Manual Tracing with TracerProvider

Section titled “7.1 Python — Manual Tracing with TracerProvider”
trace_example.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource
# 1. Create a Resource (identifies this service)
resource = Resource.create({"service.name": "order-service", "service.version": "1.0.0"})
# 2. Build the pipeline: Provider → Processor → Exporter
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
# 3. Register globally
trace.set_tracer_provider(provider)
# 4. Get a tracer (scoped by instrumentation library name + version)
tracer = trace.get_tracer("order.instrumentation", "0.1.0")
# 5. Create spans
with tracer.start_as_current_span("process-order", kind=trace.SpanKind.INTERNAL) as span:
span.set_attribute("order.id", "ORD-12345")
span.set_attribute("order.item_count", 3)
with tracer.start_as_current_span("validate-payment", kind=trace.SpanKind.CLIENT) as child:
child.set_attribute("payment.provider", "stripe")
# ... call payment API
child.set_status(trace.StatusCode.OK)
span.add_event("order.confirmed", {"order.total": 59.99})
# 6. Flush before exit
provider.shutdown()
metrics_example.py
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader, ConsoleMetricExporter
from opentelemetry.sdk.resources import Resource
resource = Resource.create({"service.name": "order-service"})
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=5000)
provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter("order.instrumentation", "0.1.0")
# Sync Counter — total orders placed (only goes up)
order_counter = meter.create_counter("orders.placed", unit="1", description="Total orders placed")
# Sync Histogram — order processing duration
duration_hist = meter.create_histogram("orders.duration", unit="ms", description="Order processing time")
# Sync UpDownCounter — items currently in carts
cart_items = meter.create_up_down_counter("cart.items", unit="1", description="Active cart items")
# Async Gauge — current queue depth (observed, not recorded)
def read_queue_depth(options):
depth = get_queue_depth() # your function that reads the real value
options.observe(depth, {"queue.name": "orders"})
meter.create_observable_gauge("queue.depth", callbacks=[read_queue_depth])
# Usage
order_counter.add(1, {"order.type": "standard"})
duration_hist.record(142.5, {"order.type": "standard"})
cart_items.add(3, {"user.id": "u-123"}) # customer adds 3 items
cart_items.add(-1, {"user.id": "u-123"}) # customer removes 1 item

7.3 Go — Manual Tracing with Context Propagation

Section titled “7.3 Go — Manual Tracing with Context Propagation”
package main
import (
"context"
"net/http"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/propagation"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
"go.opentelemetry.io/otel/sdk/resource"
semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
"go.opentelemetry.io/otel/trace"
"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
)
func initTracer() *sdktrace.TracerProvider {
exporter, _ := stdouttrace.New(stdouttrace.WithPrettyPrint())
res, _ := resource.New(context.Background(),
resource.WithAttributes(semconv.ServiceName("order-service")),
)
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter),
sdktrace.WithResource(res),
)
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
propagation.TraceContext{}, propagation.Baggage{},
))
return tp
}
func handleOrder(w http.ResponseWriter, r *http.Request) {
// Extract incoming context from headers (continues the trace from upstream)
ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
tracer := otel.Tracer("order.instrumentation")
ctx, span := tracer.Start(ctx, "handle-order", trace.WithSpanKind(trace.SpanKindServer))
defer span.End()
span.SetAttributes(attribute.String("order.id", "ORD-12345"))
callPaymentService(ctx)
}
func callPaymentService(ctx context.Context) {
tracer := otel.Tracer("order.instrumentation")
ctx, span := tracer.Start(ctx, "call-payment", trace.WithSpanKind(trace.SpanKindClient))
defer span.End()
// Inject context into outgoing request — payment-svc continues the trace
req, _ := http.NewRequestWithContext(ctx, "POST", "http://payment-svc/charge", nil)
otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
http.DefaultClient.Do(req)
}
func recordMetrics() {
meter := otel.Meter("order.instrumentation")
orderCounter, _ := meter.Int64Counter("orders.placed",
metric.WithUnit("1"), metric.WithDescription("Total orders placed"))
durationHist, _ := meter.Float64Histogram("orders.duration",
metric.WithUnit("ms"), metric.WithDescription("Order processing time"))
// Async Gauge — SDK calls this callback at collection time
meter.Float64ObservableGauge("queue.depth",
metric.WithFloat64Callback(func(_ context.Context, o metric.Float64Observer) error {
o.Observe(42.0, metric.WithAttributes(attribute.String("queue.name", "orders")))
return nil
}),
)
ctx := context.Background()
orderCounter.Add(ctx, 1, metric.WithAttributes(attribute.String("order.type", "standard")))
durationHist.Record(ctx, 142.5, metric.WithAttributes(attribute.String("order.type", "standard")))
}

MistakeWhy It’s WrongFix
Using SimpleSpanProcessor in productionExports synchronously on the hot path, adds latencyAlways use BatchSpanProcessor in prod
Setting service.name as a span attributeIt’s a resource attribute, not per-spanSet it on the Resource passed to the Provider
Using INTERNAL kind for outgoing HTTP callsMisrepresents the span’s role in the traceUse CLIENT for outgoing requests, SERVER for incoming
Putting PII in BaggageBaggage is propagated in cleartext HTTP headersOnly put non-sensitive routing/context data in Baggage
Confusing cumulative and delta temporalityBackends interpret the numbers differentlyMatch temporality to your backend’s expectation
Calling record_exception without setting Error statusException is logged as event but span shows OKAlways call set_status(ERROR) alongside record_exception
Using deprecated http.method attribute nameOld semconv, will fail exam questionsUse http.request.method (semconv v1.23+)
Creating a new TracerProvider per requestMassive overhead, breaks batchingCreate ONE provider at startup, share it globally

Test yourself — these mirror OTCA exam style questions.

Q1: Which SpanProcessor should you use in production and why?

BatchSpanProcessor. It buffers spans and exports them asynchronously in batches, avoiding the latency overhead of synchronous per-span exports that SimpleSpanProcessor introduces.

Q2: A service receives an HTTP request and then calls a database. What span kinds are correct for (a) the HTTP handler span and (b) the database call span?

(a) SERVER — the service is receiving an inbound request. (b) CLIENT — the service is making an outbound call to the database.

Q3: What is the difference between span attributes and resources?

Resources are set once at SDK initialization and apply to ALL telemetry from that provider (e.g., service.name, host.name). Attributes are per-span (or per-metric data point) and describe the specific operation (e.g., http.request.method, db.statement).

Q4: You set OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317 and OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://traces-collector:4317. Which endpoint do traces go to?

http://traces-collector:4317. Signal-specific env vars override the general OTEL_EXPORTER_OTLP_ENDPOINT.

Q5: Name three differences between a Counter and a Histogram instrument.
  1. Counter records monotonically increasing sums (e.g., total requests); Histogram records distributions of values (e.g., latency).
  2. Counter uses .add(); Histogram uses .record().
  3. Counter aggregates to a Sum; Histogram aggregates to bucket counts, sum, min, max.
Q6: What does a log bridge do, and why doesn't OTel replace existing logging libraries?

A log bridge connects existing language-native loggers (Python logging, Java Log4j) to the OTel LoggerProvider pipeline. OTel does not replace loggers because applications already have extensive logging — the bridge enriches existing logs with trace context and routes them to OTel exporters without requiring code changes.

Q7: Explain cumulative vs delta temporality for a Counter that receives: +10, +20, +5 over three collection intervals.

Cumulative reports the running total: 10, 30, 35. Delta reports the per-interval change: 10, 20, 5.

Q8: What header does W3C TraceContext use for propagation, and what are its four fields?

The traceparent header. Its four fields are: version (2 hex chars, currently 00), trace-id (32 hex chars), parent-id/span-id (16 hex chars), and trace-flags (2 hex chars, 01 = sampled).

Q9: A Java application needs OpenTelemetry instrumentation but the team cannot modify source code. What approach should they use?

Use the OpenTelemetry Java agent (-javaagent:opentelemetry-javaagent.jar). It uses bytecode manipulation at class load time to automatically instrument supported libraries (HTTP, gRPC, JDBC, etc.) without any source code changes.

Q10: What are exemplars and when would you use them?

Exemplars are sample trace_id/span_id references attached to metric data points. They link metrics to traces — when you see a latency spike in a histogram bucket on your dashboard, exemplars let you jump directly to a representative trace that contributed to that bucket.


Hands-On Exercise: Build an Instrumented Python Service

Section titled “Hands-On Exercise: Build an Instrumented Python Service”

Goal: Wire up a complete OTel pipeline with traces and metrics, then verify the output.

Terminal window
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
  1. Create a Python script that initializes a TracerProvider with a BatchSpanProcessor and ConsoleSpanExporter.
  2. Create a MeterProvider with a PeriodicExportingMetricReader (5-second interval) and ConsoleMetricExporter.
  3. Set the resource to service.name=exercise-svc.
  4. Create a parent span called handle-request with kind SERVER.
  5. Inside it, create a child span call-database with kind CLIENT and attribute db.system=postgresql.
  6. Record a counter increment for requests.total with attribute http.request.method=GET.
  7. Record a histogram observation for request.duration with value 142.5 ms.
  8. In the child span, simulate an exception: call record_exception and set status to ERROR.
  9. Flush and verify you see both spans and metrics in console output.
  • Console output shows two spans with correct parent-child relationship (same trace_id, child’s parent_id matches parent’s span_id)
  • Parent span kind is SPAN_KIND_SERVER, child is SPAN_KIND_CLIENT
  • Child span has status ERROR and an exception event
  • Metrics output shows requests.total counter and request.duration histogram
  • Resource on all telemetry shows service.name=exercise-svc

  1. All three signals follow the same pattern: Provider → Processor/Reader → Exporter. Learn one, understand all.
  2. BatchSpanProcessor for production, always. SimpleSpanProcessor is for debugging only.
  3. Resources describe the service, attributes describe the operation. Never put service.name on a span attribute.
  4. Span kind describes the communication role, not whether code is “internal.” Outgoing calls are CLIENT.
  5. OTLP is the standard export protocol. Jaeger and Zipkin exporters exist but OTLP is preferred and spec-mandated.
  6. Env vars configure everything but programmatic config wins. Signal-specific vars override general ones.
  7. Auto-instrumentation covers libraries, not business logic. You still need manual spans for domain-specific operations.
  8. Cumulative vs delta temporality must match your backend. Getting this wrong produces garbage data.

Next Module: Module 2: OTel Collector Architecture — how to receive, process, and export telemetry at scale.