Module 1.1: OTel API & SDK Deep Dive

Complexity: [COMPLEX] - Core domain, 46% of OTCA exam weight

Time to Complete: 90-120 minutes

Prerequisites: Basic familiarity with distributed systems, HTTP request flow, service-to-service calls, and either Python or Go


After completing this module, you will be able to:

  1. Design an OpenTelemetry SDK pipeline that routes traces, metrics, and logs through the correct providers, processors, readers, and exporters for a production service.
  2. Debug broken trace continuity by analyzing span kinds, parent-child relationships, W3C TraceContext headers, propagators, and baggage usage.
  3. Implement manual instrumentation that adds business-relevant spans, attributes, events, exceptions, metrics, and resources without duplicating what auto-instrumentation already provides.
  4. Evaluate tradeoffs between synchronous and asynchronous metric instruments, cumulative and delta temporality, console and OTLP exporters, and simple and batch processing.
  5. Refactor reference-style SDK snippets into runnable input-to-solution instrumentation patterns that you can adapt for real services and OTCA exam scenarios.

Hypothetical scenario: a platform team at a payment company had a familiar observability problem. Every service emitted telemetry, but no two services described the same operation the same way. The Java checkout service used a vendor-specific tracer, the Python fraud service logged request IDs by hand, and the Go payment gateway exported Prometheus metrics with labels that did not match the trace names. When a production incident crossed all three services, the on-call engineer spent more time translating telemetry than debugging the customer-facing failure.

Their first attempted fix was to standardize on a single backend. That helped dashboards look consistent, but it did not solve the deeper problem because instrumentation was still coupled to one vendor’s library and one export format. When the company later moved from one trace backend to another, teams had to touch application code, redeploy services, and re-check every custom integration. The migration risk came from a design mistake: the telemetry pipeline was embedded in application logic instead of being treated as a configurable SDK boundary.

OpenTelemetry changes that boundary. Application code creates spans, metrics, logs, attributes, and context using a stable API, while the SDK decides how telemetry is sampled, batched, aggregated, and exported. That separation is the reason a team can keep business instrumentation in source code while changing exporters, collectors, sampling policies, or backends through configuration. For OTCA, this is not trivia; it is the mechanism behind almost every scenario question in the API and SDK domain.

This module teaches the SDK from the inside out. You will first build the mental model for provider pipelines, then inspect spans and metrics as data structures, then connect that data across service boundaries through propagation. After that, you will turn reference snippets into worked examples where an uninstrumented service becomes a traceable and measurable service step by step. The goal is not to memorize class names; the goal is to recognize where telemetry is created, enriched, buffered, transformed, and shipped.

A senior practitioner does not ask only, “Can I emit a span?” They ask, “Will this span help the next engineer isolate the failure without reading the source?” They also ask whether a metric will aggregate correctly, whether a resource attribute belongs on every signal, whether baggage leaks sensitive data, and whether exporter configuration can change without rebuilding the service. That is the level this module targets.

Although the OTCA exam is not a Kubernetes operations test, most production OpenTelemetry deployments you meet in this curriculum will run beside Kubernetes 1.35+ workloads, collectors, sidecars, gateways, and admission-controlled namespaces. That matters because SDK decisions made inside application code become platform behaviors once hundreds of pods inherit the same environment variables and export to the same collector fleet. If one service chooses unstable metric attributes or leaks baggage, the damage is not limited to a local library choice; it becomes a cluster-wide cost, privacy, and incident-response problem. This module therefore treats the SDK as the first control point in a larger observability system.


OpenTelemetry has many language-specific classes, but the model is deliberately repetitive. For each signal, application code talks to an API object, the SDK provider owns configuration, a processor or reader prepares data, and an exporter sends data somewhere else. Once you recognize that shape, a new language SDK becomes easier to read because the names change less than the responsibilities. The provider is the boundary between “my code creates telemetry” and “the SDK manages telemetry.”

┌────────────────────────────────────────────────────────────────────────────┐
│ One OpenTelemetry SDK Boundary │
│ │
│ Application code │
│ creates telemetry │
│ │ │
│ ▼ │
│ ┌──────────────┐ ┌──────────────────┐ ┌──────────────────────┐ │
│ │ API object │─────▶│ SDK provider │─────▶│ Processor or reader │ │
│ │ tracer/meter │ │ owns resource │ │ batches or collects │ │
│ └──────────────┘ └──────────────────┘ └──────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Exporter sends data │ │
│ │ to console, OTLP, │ │
│ │ Prometheus, backend │ │
│ └──────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────┘

The API is intentionally small because instrumentation code should not know much about backend plumbing. A handler should be able to start a span, set an attribute, record a metric, and keep doing business work. The SDK provider then applies resource identity, sampling, aggregation, batching, and export policy. This is the design that lets the same instrumented code run in a laptop demo, a staging namespace, or a production cluster with different export destinations.

| Signal | API Object Used by Code | SDK Owner | Middle Component | Export Destination Examples |
| --- | --- | --- | --- | --- |
| Traces | Tracer | TracerProvider | SpanProcessor | Console, OTLP, trace backend |
| Metrics | Meter | MeterProvider | MetricReader | Console, OTLP, Prometheus scrape |
| Logs | Logger bridge | LoggerProvider | LogRecordProcessor | Console, OTLP, log backend |

The trace pipeline is usually the first one learners understand because a span feels concrete. Your application starts and ends spans, the provider attaches a resource and sampling behavior, the processor decides when ended spans are exported, and the exporter serializes them to a destination. The most important production choice is almost always the processor, because exporting on the request path can turn observability into latency. That is why batch processing is the normal production default.

┌────────────────────────────────────────────────────────────────────────────┐
│ TracerProvider │
│ │
│ Resource: service.name="checkout", deployment.environment="prod" │
│ │
│ ┌──────────────┐ ┌──────────────────┐ ┌──────────────────────┐ │
│ │ Tracer │─────▶│ SpanProcessor │─────▶│ SpanExporter │ │
│ │ │ │ │ │ │ │
│ │ start span │ │ Simple: sync │ │ Console: local debug │ │
│ │ set attrs │ │ Batch: async │ │ OTLP: collector │ │
│ │ add events │ │ │ │ Backend: vendor API │ │
│ │ end span │ │ │ │ │ │
│ └──────────────┘ └──────────────────┘ └──────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────┘

| Span Processor | How It Behaves | Practical Use | Risk If Misused |
| --- | --- | --- | --- |
| SimpleSpanProcessor | Exports each ended span immediately on the caller path | Local debugging and tiny demos | Adds exporter latency to application requests |
| BatchSpanProcessor | Queues ended spans and exports them asynchronously in groups | Production services and load tests | Drops spans if the process crashes before flushing |
| Custom filtering processor | Drops or changes spans before export | Reducing cost or removing unsafe attributes | Can hide failures if filtering is too broad |

A batch processor is not a magic lossless queue. It has a queue size, a flush interval, and a maximum export batch size, so the tuning question is really about tradeoffs. A larger queue absorbs bursts but uses more memory, a shorter delay gives fresher traces but increases export overhead, and a larger batch improves throughput but may create bigger export spikes. For OTCA, focus on the behavior first: batch means asynchronous buffering; simple means synchronous export.

| Batch Setting | Common Environment Variable | What You Tune | Failure Mode When Wrong |
| --- | --- | --- | --- |
| Queue size | OTEL_BSP_MAX_QUEUE_SIZE | How many ended spans can wait for export | High-volume bursts overflow the queue and drop spans |
| Flush delay | OTEL_BSP_SCHEDULE_DELAY | How often the processor attempts export | Incidents appear late in the backend |
| Batch size | OTEL_BSP_MAX_EXPORT_BATCH_SIZE | How many spans go in one export call | Export calls become too small or too bursty |
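
The queue and batch tradeoffs above can be sketched with a small, single-threaded model. This is not the SDK's BatchSpanProcessor; `ToyBatchProcessor` and its fields are hypothetical names used only to show that ending a span enqueues it, that a full queue drops spans, and that export happens later in bounded groups.

```python
from collections import deque


class ToyBatchProcessor:
    """Simplified model of batch span processing (hypothetical, not the real SDK class)."""

    def __init__(self, max_queue_size=4, max_export_batch_size=2):
        self.max_queue_size = max_queue_size
        self.max_export_batch_size = max_export_batch_size
        self.queue = deque()
        self.dropped = 0
        self.exported_batches = []

    def on_end(self, span_name):
        # Runs on the request path: enqueue only, never export here.
        if len(self.queue) >= self.max_queue_size:
            self.dropped += 1  # queue is full, so this span is lost
            return
        self.queue.append(span_name)

    def flush_once(self):
        # Runs off the request path (a background timer in a real SDK).
        batch = []
        while self.queue and len(batch) < self.max_export_batch_size:
            batch.append(self.queue.popleft())
        if batch:
            self.exported_batches.append(batch)
        return batch


processor = ToyBatchProcessor(max_queue_size=4, max_export_batch_size=2)
for i in range(6):  # a burst of six ended spans arrives before any flush
    processor.on_end(f"span-{i}")

first_batch = processor.flush_once()
print(first_batch, "dropped:", processor.dropped, "still queued:", len(processor.queue))
```

In this model, raising `max_queue_size` absorbs the burst at the cost of memory, while raising `max_export_batch_size` drains the queue in fewer, larger export calls.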

Metrics use the same provider idea but a different middle component. A meter creates instruments, instruments record measurements, and the reader decides when collection happens and applies the temporality the exporter prefers. This distinction matters because metrics are not shipped one measurement at a time; the SDK aggregates many recordings into sums, histograms, last values, or other forms. When a metric looks wrong in a backend, the bug is often in instrument choice, aggregation, attribute cardinality, or temporality rather than in the exporter.

┌────────────────────────────────────────────────────────────────────────────┐
│ MeterProvider │
│ │
│ Resource: service.name="checkout", service.version="2.4.1" │
│ │
│ ┌──────────────┐ ┌──────────────────┐ ┌──────────────────────┐ │
│ │ Meter │─────▶│ MetricReader │─────▶│ MetricExporter │ │
│ │ │ │ │ │ │ │
│ │ counter │ │ Periodic: push │ │ OTLP: collector │ │
│ │ histogram │ │ Prometheus: pull │ │ Console: local debug │ │
│ │ observable │ │ │ │ Prometheus endpoint │ │
│ └──────────────┘ └──────────────────┘ └──────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────┘

| Metric Reader | Collection Model | Best Fit | Design Implication |
| --- | --- | --- | --- |
| PeriodicExportingMetricReader | Push on an interval | OTLP exporters and collector pipelines | Application initiates exports on schedule |
| Prometheus reader | Pull through a scrape endpoint | Prometheus-native environments | Prometheus controls scrape timing |
| Manual collection reader | Explicit collection trigger | Tests and specialized integrations | Caller must remember to collect |

Logs complete the three-signal picture, but they are easy to misunderstand. OpenTelemetry usually does not replace the logging API that application teams already use; instead, a bridge connects existing log records into the OTel log pipeline. That bridge can attach trace context so a log line written inside a request span carries the same trace and span identifiers as the trace data. This lets logs, metrics, and traces point at the same incident without forcing every team to abandon familiar logging libraries.

┌────────────────────────────────────────────────────────────────────────────┐
│ LoggerProvider │
│ │
│ ┌──────────────────┐ ┌────────────────────┐ ┌────────────────────┐ │
│ │ Existing logger │───▶│ OTel log bridge │───▶│ LogRecordProcessor │ │
│ │ │ │ │ │ │ │
│ │ Python logging │ │ adds trace context │ │ simple or batch │ │
│ │ Java Log4j │ │ maps severity │ │ filtering possible │ │
│ │ .NET ILogger │ │ preserves message │ │ │ │
│ └──────────────────┘ └────────────────────┘ └─────────┬──────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────┐ │
│ │ LogExporter │ │
│ │ console or OTLP │ │
│ └────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────┘

| Signal | Mature Concept to Know | What Usually Goes Wrong | Debugging Question |
| --- | --- | --- | --- |
| Traces | Parent-child span relationships and span kinds | Broken propagation or missing server spans | Do downstream spans share the same trace ID? |
| Metrics | Instrument choice, aggregation, temporality, cardinality | Counters reset unexpectedly or histograms lack useful buckets | Is the backend interpreting the temporality correctly? |
| Logs | Bridge behavior and trace correlation | Logs arrive but cannot be joined to traces | Was the log emitted while a span was current? |
| Baggage | Business context propagated as headers | Sensitive data leaks downstream | Would this value be safe on the wire? |
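
The log-correlation idea can be illustrated with only the standard library. The sketch below is a stand-in for a log bridge, not the OpenTelemetry one: `current_span` is a hypothetical context variable playing the role of the active span context, and `TraceContextFilter` is an assumed helper that stamps each log record with whatever span is current.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for the SDK's "current span" context.
current_span = contextvars.ContextVar("current_span", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the current trace and span IDs onto every log record."""

    def filter(self, record):
        span = current_span.get()
        record.trace_id = span["trace_id"] if span else "-"
        record.span_id = span["span_id"] if span else "-"
        return True  # never suppress the record, only enrich it


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s")
)
logger = logging.getLogger("checkout.demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())

logger.info("outside any span")  # no span is current, so IDs are "-"
token = current_span.set(
    {"trace_id": "0af7651916cd43dd8448eb211c80319c", "span_id": "b7ad6b7169203331"}
)
logger.info("payment authorized")  # emitted while a span is current
current_span.reset(token)

lines = stream.getvalue().splitlines()
print("\n".join(lines))
```

The first line cannot be joined to any trace, which is the symptom the debugging question in the table is probing for.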

Stop and think: Your team says “OpenTelemetry is slow” after adding console export to every request in a high-traffic service. Before blaming tracing itself, identify which component in the trace pipeline is on the request path and which processor choice would remove most of that overhead.

The answer should point at the simple processor and console exporter combination. Tracing APIs do add some work, but synchronous export is a much larger problem because it makes every request wait for serialization and output. A batch processor moves export off the hot path, and an OTLP exporter to a collector usually behaves more like a production architecture. This is the difference between debugging a symptom and diagnosing the pipeline design.

A span is a structured record of work, not just a timer with a name. It has identity fields that connect it to a trace, timing fields that define duration, a kind that describes communication role, attributes that describe the operation, events that mark notable moments, and status that summarizes the outcome. If any of those fields are wrong, the trace can still appear in the backend but mislead the engineer reading it. High-quality instrumentation is therefore a design task, not a decoration task.

┌────────────────────────────────────────────────────────────────────────────┐
│ Span │
│ │
│ Identity: trace_id, span_id, parent_span_id │
│ Operation: name, kind, start_time, end_time │
│ Outcome: status code and optional description │
│ Detail: attributes and timestamped events │
│ Context: links to other traces when parent-child is not correct │
│ Resource: service, host, process, deployment, runtime identity │
└────────────────────────────────────────────────────────────────────────────┘

A trace ID groups spans that belong to the same distributed operation. A span ID identifies one unit of work inside that operation, and a parent span ID describes the tree. When Service A receives a request and calls Service B, Service B should usually create a server span with the same trace ID and the Service A client span as its parent. When that continuity breaks, the backend shows two separate traces and the incident timeline becomes harder to reconstruct.

| Field | Scope | Example Value | Design Question |
| --- | --- | --- | --- |
| trace_id | Whole distributed operation | One checkout request across services | Are related spans grouped together? |
| span_id | One span | The database query span | Can another span name this one as parent? |
| parent_span_id | Parent-child tree | HTTP client span as parent of server span | Does the timeline show cause and effect? |
| name | Operation label | POST /checkout or charge-card | Would this name group similar work? |
| kind | Communication role | SERVER, CLIENT, PRODUCER, CONSUMER, INTERNAL | Does the span describe a boundary crossing correctly? |
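
A minimal sketch of these identity fields as plain data follows, assuming shortened placeholder IDs and a hypothetical `SpanRecord` type rather than any SDK class. It shows how broken continuity appears in exported data: a downstream server span becomes a second root in a second trace.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    """Hypothetical, minimal view of a span's identity fields."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    name: str
    kind: str


def group_by_trace(spans):
    """Group spans the way a backend groups them: by trace_id."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return dict(traces)


def root_spans(spans):
    """A root span has no parent; one request should normally have one root."""
    return [s for s in spans if s.parent_span_id is None]


spans = [
    SpanRecord("t1", "a1", None, "POST /checkout", "SERVER"),
    SpanRecord("t1", "a2", "a1", "call-payment", "CLIENT"),
    SpanRecord("t1", "b1", "a2", "POST /charge", "SERVER"),   # context extracted correctly
    SpanRecord("t2", "c1", None, "POST /fraud-check", "SERVER"),  # context was lost upstream
]

traces = group_by_trace(spans)
roots = [s.name for s in root_spans(spans)]
print(len(traces), roots)
```

One user request produced two traces and two roots here, which is the pattern to look for when a team reports spans that exist but do not connect.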

Span kind is one of the most useful small details in a trace because it tells readers what role the service played. A server span means the service received work, a client span means it called another service or dependency, a producer span means it put work onto a queue, and a consumer span means it processed work from a queue. Internal spans are still valuable, but they should represent in-process work rather than outbound communication. Mislabeling outbound calls as internal makes dependency maps and latency breakdowns less useful.

| Span Kind | Use It When | Example | Common Mislabel |
| --- | --- | --- | --- |
| SERVER | The service receives a request | HTTP handler or gRPC method | Marking an inbound handler as INTERNAL |
| CLIENT | The service calls another dependency | HTTP client, database query, cache call | Marking a database call as INTERNAL |
| PRODUCER | The service enqueues or publishes work | Kafka publish or queue send | Treating publish as a generic client call |
| CONSUMER | The service receives work from messaging | Worker processing a queue message | Losing the relationship to the producer |
| INTERNAL | Work stays inside the process | Validation, pricing calculation, template rendering | Using it for every custom span |

Attributes and resources are different because they answer different questions. A resource describes the entity producing telemetry, such as the service, deployment environment, version, host, process, or Kubernetes pod. An attribute describes one operation or one metric data point, such as HTTP method, route, status code, database system, order type, or queue name. Setting service.name as an attribute on every span instead of once on the resource creates noisy telemetry and breaks resource-based grouping.

| Question | Use a Resource When | Use an Attribute When |
| --- | --- | --- |
| Does this describe the process or service? | Yes, set it on the provider resource | No, avoid repeating it per span |
| Does it change per request or operation? | Usually no | Usually yes |
| Does every signal from this SDK share it? | Yes | No |
| Example | service.name=checkout | http.request.method=POST |
| Example | deployment.environment=prod | http.response.status_code=200 |

Events are useful when a span needs a timeline inside the timeline. For example, a checkout span may add events for cart.validated, payment.authorized, and inventory.reserved, especially when those steps are too small or too numerous to deserve separate spans. Exception recording is a special case of events: the SDK records exception type, message, and stack trace as an event. However, recording an exception event does not automatically prove the span failed in every SDK and configuration; setting error status makes the outcome visible to trace readers and alert rules.

| Span Detail | Best Use | Poor Use | Better Alternative |
| --- | --- | --- | --- |
| Attribute | Stable dimensions used for filtering and grouping | Unique IDs with unbounded cardinality everywhere | Use selective attributes and logs for high-cardinality values |
| Event | Moment inside the span timeline | Replacing every child span with events | Use child spans for meaningful nested work |
| Status | Final operation result | Setting error for every handled retry | Set error when the span’s operation failed |
| Link | Relationship across traces | Replacing normal parent-child propagation | Use parent-child when one operation directly caused another |

Links are the right tool when a parent-child tree would lie. A batch worker may process messages that came from several independent requests, so one consumer span cannot have all of those producer spans as parents. A span link lets the worker say, “this processing is related to those earlier spans,” while keeping its own trace structure. Links are especially important in fan-in, batching, and retry designs where causal relationships are real but not tree-shaped.

Pause and predict: A checkout service receives an HTTP request, creates a SERVER span, then calls PostgreSQL and Redis. If both dependency calls are marked INTERNAL, what will the trace viewer probably hide or misrepresent when someone investigates slow checkout requests?

The trace viewer may fail to show those operations as outbound dependencies, and service maps may understate how much checkout depends on PostgreSQL and Redis. The spans might still show duration, but the communication role is wrong, so readers lose the boundary-crossing signal. The correct design is a server span for the inbound request and client spans for the outbound database and cache calls. Internal spans should be reserved for meaningful in-process work such as pricing or validation.

Metrics are easy to emit and easy to ruin. A counter with the wrong attributes can create too many time series, a gauge used where a histogram belongs can hide latency distribution, and a temporality mismatch can make a healthy service look broken. The SDK helps by giving you instrument types with specific semantics, but it cannot decide your measurement intent for you. The first design question is always, “What decision should this metric help someone make?”

Synchronous instruments are called when your code knows that something happened. A counter increments when a request completes, a histogram records the duration when a handler finishes, and an up-down counter changes when a connection opens or closes. Asynchronous instruments are callbacks that observe a value that already exists somewhere else, such as queue depth, memory usage, or connection pool size. Using an asynchronous callback to poll a remote service on every collection interval is a performance and reliability smell.

| Instrument | Sync or Async | Can Decrease | Good Example | Bad Example |
| --- | --- | --- | --- | --- |
| Counter | Synchronous | No | Total successful checkouts | Current queue depth |
| UpDownCounter | Synchronous | Yes | Active WebSocket connections | Total requests since start |
| Histogram | Synchronous | Not applicable | Request duration or response size | Current CPU percent |
| Observable Counter | Asynchronous | No | CPU time read from the OS | Business event count in request code |
| Observable UpDownCounter | Asynchronous | Yes | Current worker pool size | Latency distribution |
| Observable Gauge | Asynchronous | Yes | Queue depth or temperature | Total completed orders |
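
The synchronous versus asynchronous split can be modeled in a few lines. The classes below are toy stand-ins with hypothetical names, not SDK instruments. One assumption to note: this model raises on a negative counter increment to make the rule visible, whereas a real SDK may instead log a warning and ignore the value.

```python
class ToyCounter:
    """Synchronous and monotonic: code calls add() when something happened."""

    def __init__(self):
        self.total = 0

    def add(self, amount):
        if amount < 0:
            # Assumption for illustration; real SDKs may warn and drop instead.
            raise ValueError("counters only move forward; use an up-down counter")
        self.total += amount


class ToyUpDownCounter:
    """Synchronous and non-monotonic: the value may rise and fall."""

    def __init__(self):
        self.total = 0

    def add(self, amount):
        self.total += amount


class ToyObservableGauge:
    """Asynchronous: the reader invokes the callback at collection time."""

    def __init__(self, callback):
        self.callback = callback

    def collect(self):
        return self.callback()


checkouts = ToyCounter()
checkouts.add(1)
checkouts.add(1)

connections = ToyUpDownCounter()
connections.add(+1)   # connection opened
connections.add(+1)   # connection opened
connections.add(-1)   # connection closed

pending_orders = ["o-1", "o-2", "o-3"]           # state that already exists in-process
queue_depth = ToyObservableGauge(lambda: len(pending_orders))

print(checkouts.total, connections.total, queue_depth.collect())
```

The gauge callback reads a local value that already exists, which keeps collection cheap; a callback that makes a network call on every collection is the smell described above.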

Histograms deserve extra attention because they answer questions that averages cannot answer. If the average checkout duration is acceptable but a small group of customers sees extreme latency, a histogram can preserve that distribution while a simple gauge cannot. A well-named histogram such as http.server.request.duration or orders.processing.duration lets dashboards show percentiles, bucket counts, and exemplars. For performance debugging, histograms are often the bridge from “something got slower” to “which traces should I inspect?”

| Measurement Need | Recommended Instrument | Why It Fits | Example Attribute |
| --- | --- | --- | --- |
| Count business events | Counter | Events only move forward | order.type=standard |
| Track active work | UpDownCounter | Value increases and decreases | worker.pool=checkout |
| Measure latency | Histogram | Distribution matters | http.route=/checkout |
| Observe queue depth | Observable Gauge | Value exists outside request flow | queue.name=orders |
| Observe total CPU time | Observable Counter | OS counter increases over time | cpu.state=user |

Temporality defines what a metric value means over time. Cumulative temporality reports the total since a starting point, while delta temporality reports the change since the last collection. If a counter records increments of ten, twenty, and five across three collection windows, cumulative reports ten, thirty, and thirty-five, while delta reports ten, twenty, and five. Neither is universally better; the backend and reader must agree on interpretation.

┌────────────────────────────────────────────────────────────────────────────┐
│ Counter Temporality Example │
│ │
│ Raw increments: +10 +20 +5 │
│ │
│ Cumulative export: 10 30 35 │
│ Delta export: 10 20 5 │
│ │
│ Reader and backend must agree on which row the exported values represent. │
└────────────────────────────────────────────────────────────────────────────┘

| Temporality | What the Exported Value Means | Backend Preference Example | Debugging Symptom |
| --- | --- | --- | --- |
| Cumulative | Total since start or reset | Prometheus-style counter interpretation | Reset handling matters during restarts |
| Delta | Change since previous collection | Some StatsD-like or vendor pipelines | Values look too small if read as totals |
| Backend conversion | Collector or backend converts meaning | Mixed estate during migrations | Rate graphs disagree between tools |
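
The arithmetic behind the two temporalities is short enough to write out. The sketch below reuses the increments of ten, twenty, and five from the text; `cumulative_to_delta` is a hypothetical helper, and its reset handling (treating a drop in value as a restart) is an assumption about one reasonable conversion policy rather than a description of any specific backend.

```python
from itertools import accumulate

increments = [10, 20, 5]                            # raw recordings per collection window

delta_export = list(increments)                     # change since the previous collection
cumulative_export = list(accumulate(increments))    # running total since start


def cumulative_to_delta(points):
    """Convert cumulative points to deltas, treating a decrease as a counter reset."""
    previous = 0
    deltas = []
    for value in points:
        if value < previous:
            deltas.append(value)  # assumed policy: restart, so the new value is the delta
        else:
            deltas.append(value - previous)
        previous = value
    return deltas


print(cumulative_export)
print(cumulative_to_delta(cumulative_export))
print(cumulative_to_delta([10, 30, 4]))  # the process restarted before the third point
```

Reading the delta row as totals would make the service appear to have handled five events instead of thirty-five, which is the "values look too small" symptom in the table.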

Aggregation is the SDK’s way of turning many raw measurements into exportable data. Counters commonly aggregate as sums, gauges as last values, and histograms as bucketed distributions. This is why a metric instrument is not just a method name; it determines what math will be possible later. If you record request duration in a counter, no downstream backend can recover a useful latency distribution because the raw shape was lost at collection time.

| Instrument | Typical Aggregation | Useful Dashboard View | Risk to Watch |
| --- | --- | --- | --- |
| Counter | Sum | Request rate, error rate, throughput | High-cardinality attributes explode series |
| UpDownCounter | Sum with positive and negative changes | Active sessions or in-flight work | Missing decrement leaves value inflated |
| Histogram | Explicit buckets or exponential buckets | Percentiles and latency heatmaps | Poor buckets hide important ranges |
| Observable Gauge | Last value | Queue depth or current utilization | Callback latency affects collection |
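
To make the loss of shape concrete, the sketch below aggregates the same five measurements three ways. The durations and bucket boundaries are made-up illustration values, and `explicit_bucket_counts` is a hypothetical helper that assumes each bucket includes its upper boundary.

```python
from bisect import bisect_left

durations_ms = [12, 48, 250, 251, 900]   # made-up request durations
boundaries = [50, 250, 500]              # made-up explicit bucket boundaries


def explicit_bucket_counts(values, bounds):
    """Count values per bucket: (-inf, 50], (50, 250], (250, 500], (500, +inf)."""
    counts = [0] * (len(bounds) + 1)
    for value in values:
        # bisect_left places a value equal to a boundary in the bucket it closes.
        counts[bisect_left(bounds, value)] += 1
    return counts


as_sum = sum(durations_ms)               # what a counter-style aggregation keeps
as_last_value = durations_ms[-1]         # what a gauge-style aggregation keeps
as_histogram = explicit_bucket_counts(durations_ms, boundaries)

print(as_sum, as_last_value, as_histogram)
```

Only the histogram result still shows that one request landed far outside the others; the sum and the last value cannot answer a percentile question afterwards.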

Exemplars connect metrics back to traces. When a histogram bucket contains a latency spike, an exemplar can attach a sampled trace ID and span ID to one measurement in that bucket. That lets an engineer click from “the p99 checkout latency spiked” to “here is one trace that contributed to that spike.” Exemplars are not a replacement for traces or metrics; they are a bridge between aggregate behavior and individual request evidence.

┌────────────────────────────────────────────────────────────────────────────┐
│ Histogram With Exemplar │
│ │
│ orders.duration bucket 250ms-500ms │
│ count=18 │
│ exemplar: value=392ms, trace_id=0af7651916cd43dd8448eb211c80319c │
│ │
│ Dashboard question: "Which trace explains this slow bucket?" │
└────────────────────────────────────────────────────────────────────────────┘

A metric naming decision should survive operational use. Names should describe what is measured, units should be explicit, and attributes should be bounded enough to aggregate. For example, orders.duration with unit ms and attribute order.type is useful, while adding user.id to every duration measurement can create one series per user. High cardinality is not merely expensive; it can make dashboards slow, alerts noisy, and backends drop data.
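
The cardinality effect can be estimated before shipping an instrument. This sketch counts distinct attribute combinations in synthetic measurements; `series_count` is a hypothetical helper, and the assumption is that a backend stores roughly one time series per unique attribute combination.

```python
def series_count(measurements, attribute_keys):
    """Count distinct attribute combinations, a rough proxy for time series."""
    return len({tuple(m[key] for key in attribute_keys) for m in measurements})


# Synthetic data: 1000 measurements, two order types, one unique user per measurement.
measurements = [
    {
        "order.type": "standard" if i % 2 == 0 else "express",
        "user.id": f"user-{i}",
        "duration_ms": 100 + i,
    }
    for i in range(1000)
]

bounded = series_count(measurements, ["order.type"])
unbounded = series_count(measurements, ["order.type", "user.id"])
print(bounded, unbounded)
```

Adding the single unbounded attribute multiplies the series count by the number of users, which is why per-user identifiers usually belong on spans or logs rather than on metric attributes.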

A distributed trace only works when context crosses process boundaries. The upstream service must inject context into a carrier such as HTTP headers or message attributes, and the downstream service must extract that context before starting its own span. When either side forgets its half, the trace tree breaks even though both services may be individually instrumented. This is why propagation bugs often show up as “we have spans, but they are in separate traces.”

┌──────────────────────────┐ HTTP request ┌────────────────────┐
│ Service A: checkout │──────────────────────────▶│ Service B: payment │
│ │ traceparent header │ │
│ client span: call-payment│ │ server span: POST │
│ trace_id: same value │ │ trace_id: same │
│ span_id: parent candidate│ │ parent: A client │
└──────────────────────────┘ └────────────────────┘

The W3C TraceContext format is the default propagation standard in modern OpenTelemetry configurations. The traceparent header carries a version, trace ID, parent span ID, and trace flags, while tracestate carries vendor-specific state. In a debugging scenario, you rarely need to recite the header grammar; you need to recognize whether the same trace ID survived the boundary and whether the downstream service used the extracted parent. That practical interpretation matters more than memorizing field widths.

┌────────────────────────────────────────────────────────────────────────────┐
│ traceparent Header Shape │
│ │
│ traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01 │
│ │ │ │ │ │
│ │ │ │ └ flags │
│ │ │ └ parent span ID │
│ │ └ trace ID shared across the distributed operation │
│ └ version │
└────────────────────────────────────────────────────────────────────────────┘

| Propagation Piece | What It Carries | Why It Matters | Debugging Check |
| --- | --- | --- | --- |
| traceparent | Trace ID, parent span ID, flags | Establishes distributed trace continuity | Same trace ID across services |
| tracestate | Vendor or platform-specific state | Preserves backend-specific decisions | Present when a backend requires it |
| Propagator | Injects and extracts context | Connects SDK to carrier format | Same propagator configured on both sides |
| Carrier | HTTP headers or message metadata | Moves context over a boundary | Header or metadata visible on the wire |
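
When checking headers captured on the wire, a small parser helps confirm whether the trace ID survived a hop. The sketch below handles only the four-field shape shown in the diagram and is not a full W3C TraceContext implementation; `parse_traceparent` is a hypothetical helper, not a propagator API.

```python
import re

# Four lowercase-hex fields: version, trace ID, parent span ID, flags.
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return the header fields as a dict, or None if the value is unusable."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs and version "ff" are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest flag bit is the sampled flag
    }


upstream = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
downstream = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-00f067aa0ba902b7-01")
broken = parse_traceparent("not-a-traceparent")

print(upstream["trace_id"] == downstream["trace_id"], upstream["sampled"], broken)
```

Comparing `trace_id` across two captured headers answers the first debugging check in the table; differing parent span IDs between hops are expected, because each hop names its own calling span.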

Baggage is separate from trace identity. It carries application context as key-value pairs, and that context travels downstream with requests. This can be useful for routing, experiment cohort, tenant tier, or other non-sensitive operational hints. The danger is that baggage is transmitted in headers or metadata, so it must be treated as data visible to downstream systems and infrastructure.

| Baggage Candidate | Good or Bad | Reason | Safer Alternative |
| --- | --- | --- | --- |
| tenant.tier=enterprise | Good if non-sensitive | Useful for routing or sampling policy | Keep value low-cardinality |
| user.email=person@example.com | Bad | Personal data travels downstream | Use internal user category instead |
| auth.token=secret | Bad | Credential leakage risk | Never propagate secrets as baggage |
| experiment.group=B | Good if bounded | Useful for analysis and debugging | Document allowed values |
| cart.value=123.45 | Depends | Could be sensitive or high-cardinality | Record as span attribute selectively |
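
One way to enforce the table's policy is an allowlist applied before anything is placed in baggage. The sketch below is an illustration under assumptions: `ALLOWED_BAGGAGE_KEYS` is a made-up policy, and `build_baggage_header` is a hypothetical helper that produces a comma-separated `key=value` string with percent-encoded values instead of calling a real propagator.

```python
from urllib.parse import quote

# Hypothetical policy: only these keys may leave the process as baggage.
ALLOWED_BAGGAGE_KEYS = {"tenant.tier", "experiment.group"}


def build_baggage_header(candidates):
    """Keep allowlisted entries and serialize them as a baggage-style header value."""
    safe = {key: value for key, value in candidates.items() if key in ALLOWED_BAGGAGE_KEYS}
    # Values are percent-encoded; entries are sorted for a stable, testable output.
    return ",".join(f"{key}={quote(str(value), safe='')}" for key, value in sorted(safe.items()))


candidates = {
    "tenant.tier": "enterprise",
    "user.email": "person@example.com",   # personal data: must not propagate
    "auth.token": "secret",               # credential: must not propagate
    "experiment.group": "B",
}

header_value = build_baggage_header(candidates)
print(header_value)
```

An allowlist fails closed: a new key added by a developer stays inside the process until someone decides it is safe on the wire.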

Propagation also behaves differently across synchronous and asynchronous boundaries. An HTTP call has an obvious request and response, so a client span and downstream server span usually form a clear parent-child relationship. A message queue may decouple producer and consumer in time, may batch messages, or may retry delivery, so links may be a better representation than a direct parent in some designs. A strong OTCA answer explains the relationship, not just the mechanism.

Stop and think: A queue consumer processes one batch that contains messages created by several different user requests. Would a single parent span accurately represent the batch, or would span links preserve the relationship more honestly?

Span links usually preserve the relationship more honestly because the batch work is related to multiple earlier traces. Choosing one parent would imply a single causal chain and hide the fan-in nature of the workload. The consumer can still create a span for batch processing, but links let it reference the producer contexts attached to each message. This is a good example of trace design serving the reader rather than forcing every workflow into a tree.
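
The fan-in case can be modeled as data. In this sketch a batch consumer span starts its own trace and records one link per message instead of choosing a single parent. `SpanContextRef`, `BatchSpan`, `start_batch_span`, and the message fields are hypothetical names for illustration, not SDK types.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SpanContextRef:
    """Identity of a span in another trace (what a link points at)."""
    trace_id: str
    span_id: str


@dataclass
class BatchSpan:
    name: str
    kind: str
    trace_id: str
    parent_span_id: Optional[str] = None
    links: List[SpanContextRef] = field(default_factory=list)


def start_batch_span(messages, new_trace_id):
    """Create one consumer span that links to every producer context in the batch."""
    links = [SpanContextRef(m["trace_id"], m["span_id"]) for m in messages]
    return BatchSpan(
        name="process-order-batch",
        kind="CONSUMER",
        trace_id=new_trace_id,
        parent_span_id=None,   # no single producer is the honest parent
        links=links,
    )


# Three messages that originated from three independent user requests.
messages = [
    {"trace_id": "trace-a", "span_id": "span-1", "body": "order 1"},
    {"trace_id": "trace-b", "span_id": "span-7", "body": "order 2"},
    {"trace_id": "trace-c", "span_id": "span-3", "body": "order 3"},
]

batch_span = start_batch_span(messages, new_trace_id="trace-batch")
linked_traces = sorted(link.trace_id for link in batch_span.links)
print(batch_span.parent_span_id, linked_traces)
```

A reader starting from any of the three original traces can still reach the batch span through its link, without the trace tree claiming a single cause.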

Part 5: SDK Configuration and Export Choices

OpenTelemetry SDKs are configurable through code and environment variables. Code is useful when an application must construct providers, processors, instruments, or resources directly. Environment variables are useful when platform teams need consistent behavior across many services without rebuilding each application. The practical rule is to keep business instrumentation in code and keep deployment-specific export decisions in configuration whenever possible.

| Configuration Area | Programmatic Example | Environment Example | Prefer Environment When |
| --- | --- | --- | --- |
| Service identity | Resource attributes in SDK setup | OTEL_SERVICE_NAME | Each deployment sets its own name |
| OTLP endpoint | Exporter constructor argument | OTEL_EXPORTER_OTLP_ENDPOINT | Endpoint differs by environment |
| Export protocol | Exporter configuration | OTEL_EXPORTER_OTLP_PROTOCOL | Firewall or collector policy differs |
| Trace sampling | Sampler object | OTEL_TRACES_SAMPLER | Platform owns sampling policy |
| Signal enablement | Provider setup | OTEL_TRACES_EXPORTER | Service needs quick disable or console debug |

Signal-specific configuration generally overrides general configuration. For example, a platform may set a general OTLP endpoint for all signals but send traces to a specialized trace collector during a migration. This precedence lets teams change one signal without disturbing the others. In exam scenarios, look for the more specific variable when two settings appear to conflict.
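
The precedence rule can be written as a lookup order. The sketch below is a simplified model of how an exporter might resolve its trace endpoint, not SDK code: `resolve_traces_endpoint` is a hypothetical helper, the `grpc` fallback is an assumption because default protocols differ between SDKs, and the `/v1/traces` suffix reflects how the general endpoint is treated for OTLP over HTTP while a signal-specific endpoint is used as given.

```python
def resolve_traces_endpoint(env):
    """Pick the trace endpoint: the signal-specific variable wins over the general one."""
    specific = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
    if specific:
        return specific  # used as-is

    general = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if not general:
        return None  # this sketch does not model SDK built-in defaults

    # Assumption: fall back to grpc when no protocol is set; real SDK defaults vary.
    protocol = env.get("OTEL_EXPORTER_OTLP_PROTOCOL", "grpc")
    if protocol.startswith("http/"):
        # For OTLP over HTTP, the general endpoint is a base URL per signal path.
        return general.rstrip("/") + "/v1/traces"
    return general


migration_env = {
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector:4317",
    "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": "http://trace-collector:4317",
}
general_only_env = {"OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector:4317"}
http_env = {
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector:4318",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
}

print(resolve_traces_endpoint(migration_env))
print(resolve_traces_endpoint(general_only_env))
print(resolve_traces_endpoint(http_env))
```

Passing `os.environ` instead of a dictionary would apply the same lookup to a live process, which can be a quick check when two settings appear to conflict.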

| Environment Variable | Purpose | Example Value | Operational Meaning |
| --- | --- | --- | --- |
| OTEL_SERVICE_NAME | Sets service.name resource attribute | checkout | Groups telemetry by service |
| OTEL_RESOURCE_ATTRIBUTES | Adds resource attributes | deployment.environment=prod,service.version=2.4.1 | Enriches every signal from the process |
| OTEL_EXPORTER_OTLP_ENDPOINT | General OTLP endpoint | http://collector:4317 | Default destination for OTLP signals |
| OTEL_EXPORTER_OTLP_TRACES_ENDPOINT | Trace-specific OTLP endpoint | http://trace-collector:4317 | Overrides the general endpoint for traces |
| OTEL_EXPORTER_OTLP_PROTOCOL | OTLP transport encoding | grpc or http/protobuf | Must match collector receiver configuration |
| OTEL_TRACES_SAMPLER | Trace sampling strategy | always_on or traceidratio | Controls which traces are sampled |
| OTEL_TRACES_SAMPLER_ARG | Sampler argument | 0.10 | Ten percent sampling for ratio sampler |
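
To make the resource variables concrete, the following sketch turns OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES into a single attribute dictionary. The helper name `resource_from_env` is hypothetical, and the parsing is a simplification that skips the decoding and validation a real SDK performs. It also models the rule that OTEL_SERVICE_NAME takes precedence over a service.name entry inside OTEL_RESOURCE_ATTRIBUTES.

```python
from typing import Dict, Mapping


# Hypothetical helper: a simplified reading of the resource environment variables.
def resource_from_env(env: Mapping[str, str]) -> Dict[str, str]:
    attributes: Dict[str, str] = {}
    for pair in env.get("OTEL_RESOURCE_ATTRIBUTES", "").split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attributes[key.strip()] = value.strip()
    # OTEL_SERVICE_NAME wins over a service.name given in the attribute list.
    if env.get("OTEL_SERVICE_NAME"):
        attributes["service.name"] = env["OTEL_SERVICE_NAME"]
    return attributes


env = {
    "OTEL_SERVICE_NAME": "checkout",
    "OTEL_RESOURCE_ATTRIBUTES": "deployment.environment=prod,service.version=2.4.1,service.name=ignored",
}
print(resource_from_env(env))
```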

OTLP is the standard export protocol you should expect to see in modern OTel designs. The exporter may send directly to a backend, but many production architectures send to an OpenTelemetry Collector first. The collector can receive telemetry, batch it, filter it, enrich it, and fan it out to one or more backends. This module focuses on the SDK side, but the export choice should already make you think about the collector architecture in the next module.

| Exporter Choice | Good Fit | Tradeoff | Senior-Level Question |
| --- | --- | --- | --- |
| Console exporter | Local learning and debugging | Not suitable for production throughput | Did we accidentally leave this in a hot path? |
| OTLP exporter | Standard production path | Requires collector or backend endpoint | Is protocol and endpoint configured consistently? |
| Prometheus reader/export path | Prometheus-native metric scraping | Pull model differs from trace export | Does this service expose a scrape endpoint securely? |
| Legacy Jaeger or Zipkin exporter | Older estates during migration | Less portable than OTLP | Can the collector translate OTLP instead? |
| No-op exporter | Tests or temporary disablement | Telemetry disappears | Is this intentionally disabled or misconfigured? |

Semantic conventions are naming standards that make telemetry portable. They help dashboards, alert rules, and queries work across libraries and languages. For HTTP spans, current stable conventions use names such as http.request.method and http.response.status_code. Older names such as http.method and http.status_code may still appear in older dashboards or examples, so exam scenarios expect you to understand that semantic conventions evolve and that consistent naming is part of portability.

| Domain | Recommended Attribute | Example Value | Why It Helps |
| --- | --- | --- | --- |
| HTTP request | http.request.method | POST | Groups requests by method |
| HTTP response | http.response.status_code | 201 | Supports error-rate analysis |
| URL | url.full | https://api.example.test/checkout | Preserves complete request target when safe |
| Service | service.name | checkout | Identifies the emitting service |
| Service | service.version | 2.4.1 | Correlates telemetry with releases |
| Database | db.system | postgresql | Identifies dependency technology |
| Messaging | messaging.system | kafka | Identifies messaging backend |

The most important semantic convention habit is consistency. If half the fleet uses http.method and the other half uses http.request.method, a dashboard filter may silently miss data. If one team records service.name as a span attribute and another sets it as a resource, resource-based service views become incomplete. Instrumentation is not just about creating telemetry; it is about creating telemetry that can be queried with confidence.
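
One way to picture the mixed-fleet problem is a normalization step that renames older HTTP attribute keys to the current ones before querying. The sketch below is illustrative only: `normalize_attributes` is a hypothetical helper, the mapping lists just three renames, and in practice this kind of translation would more likely live in a collector pipeline than in application code.

```python
from typing import Any, Dict

# A few renames from the older HTTP conventions to the names used in this module.
LEGACY_TO_CURRENT = {
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
    "http.url": "url.full",
}


def normalize_attributes(attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with legacy keys renamed; an existing current key is kept."""
    normalized = {k: v for k, v in attributes.items() if k not in LEGACY_TO_CURRENT}
    for old, new in LEGACY_TO_CURRENT.items():
        if old in attributes:
            normalized.setdefault(new, attributes[old])
    return normalized


old_span = {"http.method": "POST", "http.status_code": 201}
new_span = {"http.request.method": "POST", "http.response.status_code": 201}
print(normalize_attributes(old_span) == normalize_attributes(new_span))
```

After normalization both spans answer the same dashboard filter, which is the property the consistency habit is meant to protect.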

Part 6: Auto-Instrumentation, Manual Instrumentation, and the Boundary Between Them

Auto-instrumentation wraps supported libraries without requiring application developers to edit every call site. In Java, an agent can modify bytecode at class load time; in Python, instrumentation commonly patches libraries during import; in Node.js, require hooks can wrap modules; in .NET, profiling and startup hooks can attach instrumentation. This is powerful because it gives teams baseline traces for HTTP frameworks, clients, databases, messaging libraries, and gRPC quickly. It is not enough for business observability because libraries do not know what your checkout, renewal, refund, or fraud decision means.

| Language | Common Auto-Instrumentation Mechanism | Typical Command or Setup | What It Usually Captures |
| --- | --- | --- | --- |
| Java | Java agent and bytecode instrumentation | java -javaagent:opentelemetry-javaagent.jar -jar app.jar | HTTP, JDBC, gRPC, messaging libraries |
| Python | Import-time monkey patching | opentelemetry-instrument .venv/bin/python app.py | Flask, FastAPI, requests, database clients |
| .NET | CLR profiler and startup hooks | Environment variables and runtime hooks | ASP.NET, HTTP clients, database libraries |
| Node.js | Require hooks and instrumentation packages | node --require @opentelemetry/auto-instrumentations-node/register app.js | Express, HTTP, database clients |

Manual instrumentation should fill the semantic gaps that libraries cannot see. A library can tell that an HTTP request happened, but it cannot know that reserve-inventory is the critical business step or that order.type=subscription is the dimension operations needs. A database library can record a query span, but it cannot decide whether a failed payment authorization should mark the parent checkout span as error. Manual spans, metrics, attributes, and events should therefore describe the domain, not duplicate what auto-instrumentation already creates.

| Situation | Auto-Instrumentation Enough? | Add Manual Instrumentation? | Reason |
| --- | --- | --- | --- |
| Basic HTTP server latency | Often yes | Maybe add route naming if missing | Framework instrumentation knows request boundaries |
| Payment authorization business step | No | Yes | Library cannot infer business meaning |
| Database query timing | Often yes | Add attributes carefully if needed | Driver instrumentation captures dependency call |
| Queue depth from broker API | No | Yes, usually async metric | Value exists outside request flow |
| Feature flag branch taken | No | Yes, event or attribute | Business context matters during incidents |

A good instrumentation review asks whether the next engineer can debug a real incident from the telemetry. If auto-instrumentation creates ten spans but none of them reveal which business decision failed, manual instrumentation is needed. If manual instrumentation creates dozens of nested spans around every helper function, the trace becomes noisy and expensive. The best result is a layered trace: automatic spans show technical boundaries, while manual spans and metrics show business intent.

| Design Smell | What It Looks Like | Refactor Direction |
| --- | --- | --- |
| Duplicate spans | Manual HTTP span wraps an auto-generated HTTP server span with the same name | Keep the auto span and add attributes or events |
| Missing business context | Trace shows database and HTTP calls but not checkout decision points | Add manual internal spans for domain steps |
| Attribute overload | Every span carries user ID, email, cart ID, and request payload | Move sensitive or high-cardinality data out of span attributes |
| Broken correlation | Logs have request IDs but no trace IDs | Configure log bridge and current span context |
| Backend lock-in | Application imports vendor tracing APIs directly | Use OTel API and SDK exporters |

Part 7: Worked Examples From Input to Solution

Code snippets that sit on the page as reference material do not teach learners how to move from a problem to a solution, so this section uses an input-to-solution pattern instead. Each example starts with an operational problem, shows the uninstrumented or incomplete input, states the desired telemetry, and then walks through the solution. After the worked examples, the exercise gives you a similar “your turn” task.

7.1 Worked Example A: Turn an Unobservable Checkout Function Into a Trace

The starting problem is intentionally small. A checkout function validates a cart, calls a payment function, records an order, and returns an order ID. When it fails in production, logs may show an exception, but there is no trace structure showing which step failed or how long each step took. The goal is to create one parent span for the checkout operation and child spans for meaningful business steps without instrumenting every helper line.

Input file: checkout_plain.py

```python
import random
import time


def validate_cart(cart_id: str) -> None:
    time.sleep(0.03)
    if not cart_id:
        raise ValueError("cart_id is required")


def charge_payment(cart_id: str) -> str:
    time.sleep(0.05)
    if random.random() < 0.20:
        raise RuntimeError("payment gateway timeout")
    return "PAY-1001"


def record_order(cart_id: str, payment_id: str) -> str:
    time.sleep(0.02)
    return f"ORD-{cart_id}-{payment_id}"


def checkout(cart_id: str) -> str:
    validate_cart(cart_id)
    payment_id = charge_payment(cart_id)
    return record_order(cart_id, payment_id)


if __name__ == "__main__":
    print(checkout("CART-123"))
```

The first design decision is where the root span belongs. The checkout function is the operation the business cares about, so it should be the parent span. The validation and payment steps are meaningful enough to become child spans because failures there lead to different operational actions. The record_order helper is short in this example, but it still crosses a persistence boundary in many real services, so a child span is reasonable if it represents a database write.

| Design Choice | Decision | Reason |
| --- | --- | --- |
| Parent span name | checkout | Describes the business operation |
| Validation span kind | INTERNAL | Work stays inside the process |
| Payment span kind | CLIENT | Represents an outbound dependency |
| Order recording span kind | CLIENT in real persistence path | Database writes are dependency calls |
| Cart ID handling | Attribute in demo only | In production, evaluate sensitivity and cardinality |

The solution starts by creating one TracerProvider at process startup. The resource goes on the provider because every span from this process belongs to the same service. The batch processor is used even though the exporter is console because learners should practice the production shape early. Finally, exception handling records the exception event, sets error status, and re-raises so instrumentation does not swallow real failures.

Solution file: checkout_traced.py

```python
import random
import time

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace import SpanKind, Status, StatusCode

# One resource and one provider per process, created at startup.
resource = Resource.create(
    {
        "service.name": "checkout-service",
        "service.version": "1.0.0",
        "deployment.environment": "local",
    }
)
trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(trace_provider)
tracer = trace.get_tracer("kubedojo.checkout", "0.1.0")


def validate_cart(cart_id: str) -> None:
    # start_as_current_span also records uncaught exceptions by default, so the
    # explicit calls below may produce a second exception event; they are kept
    # to make each error-handling step visible.
    with tracer.start_as_current_span("validate-cart", kind=SpanKind.INTERNAL) as span:
        span.set_attribute("cart.present", bool(cart_id))
        time.sleep(0.03)
        if not cart_id:
            error = ValueError("cart_id is required")
            span.record_exception(error)
            span.set_status(Status(StatusCode.ERROR, str(error)))
            raise error


def charge_payment(cart_id: str) -> str:
    # CLIENT kind: this step stands in for an outbound call to a payment gateway.
    with tracer.start_as_current_span("charge-payment", kind=SpanKind.CLIENT) as span:
        span.set_attribute("payment.provider", "demo-gateway")
        span.set_attribute("cart.id", cart_id)
        time.sleep(0.05)
        if random.random() < 0.20:
            error = RuntimeError("payment gateway timeout")
            span.record_exception(error)
            span.set_status(Status(StatusCode.ERROR, str(error)))
            raise error
        span.set_status(Status(StatusCode.OK))
        return "PAY-1001"


def record_order(cart_id: str, payment_id: str) -> str:
    # CLIENT kind: this step stands in for a database write.
    with tracer.start_as_current_span("record-order", kind=SpanKind.CLIENT) as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("payment.id", payment_id)
        time.sleep(0.02)
        order_id = f"ORD-{cart_id}-{payment_id}"
        span.add_event("order.recorded", {"order.id": order_id})
        return order_id


def checkout(cart_id: str) -> str:
    # Parent span: the child spans above attach to it through the current context.
    with tracer.start_as_current_span("checkout", kind=SpanKind.INTERNAL) as span:
        span.set_attribute("cart.id", cart_id)
        try:
            validate_cart(cart_id)
            payment_id = charge_payment(cart_id)
            order_id = record_order(cart_id, payment_id)
            span.add_event("checkout.completed", {"order.id": order_id})
            return order_id
        except Exception as error:
            # Mark the parent as failed, then re-raise so the caller still sees the error.
            span.record_exception(error)
            span.set_status(Status(StatusCode.ERROR, str(error)))
            raise


if __name__ == "__main__":
    try:
        print(checkout("CART-123"))
    finally:
        # Flush buffered spans before the script exits.
        trace_provider.shutdown()
```

Run the solution in a clean virtual environment so the command matches the repository’s local Python convention. The batch processor buffers finished spans and exports them on a schedule or at shutdown, so in a script this short the console output typically appears when shutdown flushes the queue. That makes the finally block part of the example rather than an optional cleanup detail. In a long-running service, shutdown would happen during process termination instead of after one function call. For a short script, failing to flush before exit can make learners think instrumentation failed.

```shell
.venv/bin/python -m pip install opentelemetry-api opentelemetry-sdk
.venv/bin/python checkout_traced.py
```

When you inspect the output, look for the same trace ID across the parent and child spans. The checkout span should be the parent, and on a successful run the validation, payment, and record-order spans should appear as its children. If the simulated payment failure occurs, both the payment span and the parent checkout span should show error status, and no record-order span appears because that step never runs. That propagation of failure status is a design choice: the dependency failed, and therefore the checkout operation failed.
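
The inspection steps above can also be written as a small check. This is a sketch under stated assumptions: the flat span dictionaries (`name`, `trace_id`, `span_id`, `parent_id`) are a simplified, hypothetical shape and not the exact structure the console exporter prints, and the IDs are made up.

```python
from typing import Dict, List, Optional

Span = Dict[str, Optional[str]]


def check_trace_tree(spans: List[Span], root_name: str) -> List[str]:
    """Return a list of problems; an empty list means the tree looks connected."""
    roots = [s for s in spans if s["name"] == root_name]
    if len(roots) != 1:
        return [f"expected one '{root_name}' span, found {len(roots)}"]
    root = roots[0]
    problems: List[str] = []
    if root["parent_id"] is not None:
        problems.append(f"'{root_name}' is not a root span")
    for span in spans:
        if span is root:
            continue
        if span["trace_id"] != root["trace_id"]:
            problems.append(f"'{span['name']}' has a different trace ID")
        if span["parent_id"] != root["span_id"]:
            problems.append(f"'{span['name']}' is not a child of '{root_name}'")
    return problems


# Made-up IDs; children are listed first because they end before the parent.
spans = [
    {"name": "validate-cart", "trace_id": "0xabc", "span_id": "0x02", "parent_id": "0x01"},
    {"name": "charge-payment", "trace_id": "0xabc", "span_id": "0x03", "parent_id": "0x01"},
    {"name": "checkout", "trace_id": "0xabc", "span_id": "0x01", "parent_id": None},
]
print(check_trace_tree(spans, "checkout"))
```

A span that reports a different trace ID or a missing parent is the local equivalent of the broken-continuity symptom you will debug across services later in this module.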

7.2 Worked Example B: Add Metrics Without Turning Attributes Into a Cost Problem

The traced function tells you which request failed, but it does not answer aggregate questions such as “how many checkout attempts are failing?” or “how long does checkout usually take?” That is where metrics belong. The goal is to add a counter for attempts, a counter for failures, a histogram for duration, and an observable gauge for queue depth. The important constraint is to choose bounded attributes, because metric attributes define time series cardinality.

Metric design input

| Operational Question | Metric Needed | Instrument | Attribute Plan |
| --- | --- | --- | --- |
| How many checkout attempts occurred? | checkout.attempts | Counter | checkout.channel only |
| How many failed? | checkout.failures | Counter | error.type with bounded exception names |
| How long did checkout take? | checkout.duration | Histogram | checkout.channel only |
| How deep is the queue now? | checkout.queue.depth | Observable Gauge | queue.name only |

A tempting but poor metric design would attach cart.id or user.id to every metric. That might feel useful during one investigation, but it creates many time series and makes aggregate dashboards worse. Trace attributes can sometimes carry a request-level identifier when justified, and logs can hold more detailed request data. Metrics should usually use attributes that are bounded, stable, and useful for grouping.
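
The cost of an unbounded attribute is simple arithmetic: the worst-case number of time series for one metric is the product of the distinct values of each attribute. The sketch below uses made-up counts (three channels, fifty thousand carts) purely for illustration, and `worst_case_series` is a hypothetical helper name.

```python
from math import prod
from typing import Dict


def worst_case_series(distinct_values: Dict[str, int]) -> int:
    """Upper bound on time series: the product of distinct values per attribute."""
    return prod(distinct_values.values())


bounded = {"checkout.channel": 3}
with_cart_id = {"checkout.channel": 3, "cart.id": 50_000}

print(worst_case_series(bounded))       # 3
print(worst_case_series(with_cart_id))  # 150000
```

Adding one request-level identifier multiplies the series count by the number of identifiers seen, which is why the attribute plan above keeps cart.id off every metric.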

Solution file: checkout_traces_metrics.py

```python
import random
import time
from typing import Iterable

from opentelemetry import metrics, trace
from opentelemetry.metrics import Observation
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace import SpanKind, Status, StatusCode

# The same resource is shared by the tracer and meter providers.
resource = Resource.create(
    {
        "service.name": "checkout-service",
        "service.version": "1.0.0",
        "deployment.environment": "local",
    }
)

trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(trace_provider)

# The reader collects and exports metrics once per second.
metric_reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(),
    export_interval_millis=1000,
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

tracer = trace.get_tracer("kubedojo.checkout", "0.1.0")
meter = metrics.get_meter("kubedojo.checkout", "0.1.0")

checkout_attempts = meter.create_counter(
    "checkout.attempts",
    unit="1",
    description="Total checkout attempts",
)
checkout_failures = meter.create_counter(
    "checkout.failures",
    unit="1",
    description="Total failed checkout attempts",
)
checkout_duration = meter.create_histogram(
    "checkout.duration",
    unit="ms",
    description="Checkout processing duration",
)


def current_queue_depth() -> int:
    # Placeholder value; a real service would query its queue or broker.
    return 3


def observe_queue_depth(options) -> Iterable[Observation]:
    # Callback invoked by the SDK at collection time, outside any request.
    return [Observation(current_queue_depth(), {"queue.name": "checkout"})]


meter.create_observable_gauge(
    "checkout.queue.depth",
    callbacks=[observe_queue_depth],
    unit="1",
    description="Current checkout queue depth",
)


def charge_payment() -> str:
    with tracer.start_as_current_span("charge-payment", kind=SpanKind.CLIENT) as span:
        span.set_attribute("payment.provider", "demo-gateway")
        time.sleep(0.05)
        if random.random() < 0.20:
            error = RuntimeError("payment gateway timeout")
            span.record_exception(error)
            span.set_status(Status(StatusCode.ERROR, str(error)))
            raise error
        return "PAY-1001"


def checkout(cart_id: str, channel: str) -> str:
    start = time.perf_counter()
    # Metric attributes stay bounded: channel only, never cart_id.
    checkout_attempts.add(1, {"checkout.channel": channel})
    with tracer.start_as_current_span("checkout", kind=SpanKind.INTERNAL) as span:
        # Request-level detail belongs on the span.
        span.set_attribute("cart.id", cart_id)
        span.set_attribute("checkout.channel", channel)
        try:
            payment_id = charge_payment()
            order_id = f"ORD-{cart_id}-{payment_id}"
            span.add_event("checkout.completed", {"order.id": order_id})
            return order_id
        except Exception as error:
            checkout_failures.add(1, {"error.type": type(error).__name__})
            span.record_exception(error)
            span.set_status(Status(StatusCode.ERROR, str(error)))
            raise
        finally:
            # Duration is recorded for successes and failures alike.
            elapsed_ms = (time.perf_counter() - start) * 1000
            checkout_duration.record(elapsed_ms, {"checkout.channel": channel})


if __name__ == "__main__":
    try:
        for index in range(5):
            try:
                print(checkout(f"CART-{index}", "web"))
            except RuntimeError as error:
                print(f"checkout failed: {error}")
            time.sleep(0.20)
        # Wait longer than one export interval so the reader emits at least once.
        time.sleep(1.50)
    finally:
        meter_provider.shutdown()
        trace_provider.shutdown()
```

The metric example adds one more mental model: metrics record aggregate facts even when traces are sampled. If a production sampler keeps only a fraction of traces, counters and histograms can still represent the whole workload. That is why traces and metrics are complementary rather than redundant. During an incident, metrics usually tell you that something is wrong, and traces help you inspect representative examples.

```shell
.venv/bin/python checkout_traces_metrics.py
```

After running the file, inspect the console output for metric names and resource attributes. You should see counters and histograms associated with service.name=checkout-service. You should not see cart.id attached to the metric streams, because the example intentionally keeps that request-specific detail on the trace. That boundary is a senior-level instrumentation habit: put high-cardinality request evidence where it helps debugging, and keep metrics aggregatable.
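
The earlier point that metrics cover the whole workload even when traces are sampled can be illustrated with a short simulation. This is a simplified model and not SDK code: `keep_trace` imitates a ratio-style decision on the low 64 bits of a random trace ID, `simulate` is a hypothetical driver, and a plain integer stands in for the checkout.attempts counter.

```python
import random

TRACE_ID_MASK = (1 << 64) - 1  # compare on the low 64 bits of the trace ID


def keep_trace(trace_id: int, ratio: float) -> bool:
    """Simplified ratio decision: keep IDs whose low bits fall under the bound."""
    return (trace_id & TRACE_ID_MASK) < int(ratio * (1 << 64))


def simulate(requests: int, ratio: float, seed: int = 7):
    rng = random.Random(seed)  # seeded so repeated runs agree
    attempts = 0               # stands in for the checkout.attempts counter
    sampled_traces = 0
    for _ in range(requests):
        attempts += 1          # metrics see every request
        if keep_trace(rng.getrandbits(128), ratio):
            sampled_traces += 1  # traces see only the sampled share
    return attempts, sampled_traces


attempts, sampled = simulate(requests=1000, ratio=0.10)
print(f"counter={attempts} sampled_traces={sampled}")
```

The counter always equals the number of requests, while the sampled trace count hovers around the configured ratio. An error rate computed from sampled traces alone would therefore rest on a much smaller population than the metric.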

7.3 Worked Example C: Continue a Trace Across an HTTP Boundary

The third worked example focuses on propagation rather than local spans. Imagine a gateway service receives an inbound request and calls a payment service. The gateway can create a client span, but the payment service will not join the trace unless the gateway injects context and the payment service extracts it. A broken propagator configuration creates two valid-looking traces that fail to connect.

Propagation design

| Boundary | Sender Responsibility | Receiver Responsibility | Failure Symptom |
| --- | --- | --- | --- |
| Gateway to payment over HTTP | Inject context into request headers | Extract context before starting server span | Payment trace appears separate |
| Producer to queue | Store context in message metadata | Extract or link context during consume | Consumer work loses upstream cause |
| Service to background thread | Carry context into task execution | Start span with carried context | Child work appears as a new root |
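
Before looking at SDK code, it can help to see what inject and extract do to the headers themselves. The sketch below hand-rolls the W3C traceparent value (version, trace ID, parent span ID, flags) with the standard library only. The `SpanContext` tuple and the `inject` and `extract` functions are hypothetical stand-ins for a propagator, and the sketch ignores tracestate, baggage, and case-insensitive header lookup.

```python
import re
from typing import NamedTuple, Optional

# version 00, 32 hex trace ID, 16 hex parent span ID, 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def inject(context: SpanContext, headers: dict) -> None:
    """Sender side: write the current span's identity into outbound headers."""
    flags = "01" if context.sampled else "00"
    headers["traceparent"] = f"00-{context.trace_id}-{context.span_id}-{flags}"


def extract(headers: dict) -> Optional[SpanContext]:
    """Receiver side: read the parent identity, or None if absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


gateway_span = SpanContext("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", True)
outbound_headers: dict = {}
inject(gateway_span, outbound_headers)
print(outbound_headers["traceparent"])
print(extract(outbound_headers))
```

If the receiver gets `None` back, it has no parent to attach to and starts a new root span, which is the "payment trace appears separate" symptom from the table.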

The following Go example is complete enough to show the required imports and SDK setup, but it intentionally keeps the HTTP server small. The key lines are the propagator configuration, extraction from inbound headers, and injection into outbound headers. Without those lines, spans may still exist, but trace continuity across services will be broken. This is the difference between local instrumentation and distributed tracing.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	stdouttrace "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

func initTracer() (*sdktrace.TracerProvider, error) {
	exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		return nil, err
	}
	res, err := resource.New(
		context.Background(),
		resource.WithAttributes(attribute.String("service.name", "gateway-service")),
	)
	if err != nil {
		return nil, err
	}
	provider := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(provider)
	// Register W3C TraceContext and Baggage as the global propagators.
	otel.SetTextMapPropagator(
		propagation.NewCompositeTextMapPropagator(
			propagation.TraceContext{},
			propagation.Baggage{},
		),
	)
	return provider, nil
}

func gatewayHandler(w http.ResponseWriter, r *http.Request) {
	// Extract the caller's trace context from the inbound headers.
	ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
	tracer := otel.Tracer("kubedojo.gateway")
	ctx, span := tracer.Start(ctx, "POST /checkout", trace.WithSpanKind(trace.SpanKindServer))
	defer span.End()

	// Start the client span before injecting so the outbound traceparent
	// carries the client span's ID as the downstream parent.
	clientCtx, clientSpan := tracer.Start(ctx, "POST payment-service", trace.WithSpanKind(trace.SpanKindClient))
	defer clientSpan.End()

	req, err := http.NewRequestWithContext(clientCtx, http.MethodPost, "http://127.0.0.1:8081/pay", nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	otel.GetTextMapPropagator().Inject(clientCtx, propagation.HeaderCarrier(req.Header))

	// The outbound call is simulated here; a real gateway would send req with an http.Client.
	time.Sleep(25 * time.Millisecond)
	fmt.Fprintln(w, "checkout accepted")
}

func main() {
	provider, err := initTracer()
	if err != nil {
		log.Fatal(err)
	}
	defer func() {
		_ = provider.Shutdown(context.Background())
	}()
	http.HandleFunc("/checkout", gatewayHandler)
	// Note: log.Fatal exits without running deferred calls; a real service
	// would shut the provider down from a signal handler instead.
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```

This example also shows why span kind matters during propagation. The gateway handler span is a server span because it receives a request. The outbound payment span is a client span because it calls another service. The payment service, if implemented separately, should extract the headers and create its own server span with the extracted context, causing the trace tree to show both sides of the boundary.

The safest OpenTelemetry SDK designs separate three concerns that are often tangled together in early instrumentation efforts: application code describes meaningful work, the SDK pipeline controls collection behavior, and deployment configuration chooses export destinations. When those concerns stay separate, a service can keep stable business spans while the platform changes collectors, sampling ratios, or backend vendors through environment configuration. When those concerns are mixed, every backend migration becomes an application release, every local debugging choice risks leaking into production, and every service invents a slightly different telemetry dialect.

| Pattern | When to Use It | Why It Works | Scaling Consideration |
| --- | --- | --- | --- |
| Provider-owned resource identity | Every service, job, worker, and CLI that emits telemetry | One resource attaches shared identity to traces, metrics, and logs | Standardize service.name, service.version, and deployment.environment across Kubernetes 1.35+ workloads |
| Batch traces, bounded metrics | Production paths with real request volume | Export work moves off the request path while metrics remain aggregatable | Tune queue size, export interval, and metric attributes before adding traffic |
| Auto plus manual layering | Services with supported frameworks and important business steps | Libraries capture technical boundaries while manual spans capture domain intent | Review traces for duplicate spans after enabling auto-instrumentation |
| Environment-owned export policy | Multiple environments, collectors, or backends | Applications avoid hard-coded endpoints and protocols | Document precedence between general and signal-specific OTLP variables |

The provider-owned resource pattern looks ordinary, but it removes a surprising amount of operational friction. If every service sets service.name differently, dashboards fragment even when spans are valid. If a deployment writes version information as a span attribute instead of as a resource, metrics and logs may not carry the same release identity as traces. A good review asks whether a person could filter every signal from the same pod, rollout, and service without knowing which language SDK produced it.

The batch-traces and bounded-metrics pattern keeps observability from competing with the workload it is supposed to explain. Trace export can tolerate asynchronous buffering because a span's data is complete once the span ends, so delivery can happen off the request path, while metrics need disciplined attributes because every new label combination becomes a new time series. This is why a batch processor and an attribute review often belong in the same pull request. One protects latency, and the other protects the backend from a cardinality problem that only appears after real traffic arrives.
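
The buffering behavior can be pictured with a toy model. The class below is not the SDK's BatchSpanProcessor: `ToyBatchProcessor`, its queue and batch sizes, and the string "spans" are all assumptions chosen to show three ideas in a few lines, namely that ending a span only appends to a buffer, that export happens in batches away from the request path, and that a full queue drops spans rather than blocking.

```python
from typing import Callable, List


class ToyBatchProcessor:
    """Toy model of batching: buffer on span end, export in batches, drop when full."""

    def __init__(self, export: Callable[[List[str]], None], max_queue: int = 4, batch_size: int = 2):
        self._export = export
        self._max_queue = max_queue
        self._batch_size = batch_size
        self._queue: List[str] = []
        self.dropped = 0

    def on_end(self, span_name: str) -> None:
        # The request path only appends to a buffer; it never calls the exporter.
        if len(self._queue) >= self._max_queue:
            self.dropped += 1
            return
        self._queue.append(span_name)

    def flush(self) -> None:
        # A real SDK does this from a background worker on a timer or when a batch fills.
        while self._queue:
            batch = self._queue[: self._batch_size]
            self._queue = self._queue[self._batch_size :]
            self._export(batch)

    def shutdown(self) -> None:
        self.flush()


exported: List[List[str]] = []
processor = ToyBatchProcessor(exported.append)
for name in ["validate-cart", "charge-payment", "record-order", "checkout", "extra-span"]:
    processor.on_end(name)

print(exported)           # nothing exported yet
processor.shutdown()
print(exported)           # two batches of two spans
print(processor.dropped)  # the fifth span did not fit in the queue
```

The dropped counter is the reason queue size appears next to export interval in the scaling column: a queue that is too small for the traffic loses spans silently.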

The auto-plus-manual pattern is the one most teams eventually converge on. Auto-instrumentation is excellent at showing inbound HTTP requests, outbound HTTP calls, database queries, and messaging operations, but it cannot name the business decision that matters during an incident. Manual instrumentation should add the missing business layer without replacing the library layer. If a trace already contains a framework server span named POST /checkout, a manual span with the same name is noise; a manual span named fraud-decision or reserve-inventory is evidence.

| Anti-Pattern | What Goes Wrong | Why Teams Fall Into It | Better Alternative |
| --- | --- | --- | --- |
| Exporter in business logic | Backend changes require source edits and redeploys | The first demo hard-codes console or vendor exporters | Keep exporter choice in SDK setup and environment variables |
| Metric attributes copied from logs | Series count grows with users, carts, orders, or request IDs | Engineers want per-request debugging from aggregate data | Use traces and logs for request evidence, metrics for bounded grouping |
| Baggage as a hidden context bag | Headers carry sensitive or high-cardinality values downstream | Baggage feels like convenient distributed storage | Allow only documented, non-sensitive, low-cardinality baggage keys |
| Manual spans around every helper | Traces become long, expensive, and hard to read | More spans feel like more observability | Instrument meaningful operations and dependency boundaries |

These anti-patterns share a common mistake: they optimize for the first person writing instrumentation instead of the next person debugging with it. The first person may want a quick exporter, every local variable as an attribute, or a span around each function to prove the SDK works. The next person needs a trace that tells a coherent story, a metric that aggregates across thousands of requests, and a log that correlates without exposing secrets. Good SDK design is therefore a reader-centered discipline.
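
For the baggage anti-pattern in particular, the "allow only documented keys" alternative can be sketched as an allowlist filter over a baggage header value. This is an illustrative simplification: `filter_baggage` and the key names in `ALLOWED` are hypothetical, and the parsing assumes a plain comma-separated `key=value` list.

```python
from typing import Iterable


def filter_baggage(header: str, allowed_keys: Iterable[str]) -> str:
    """Keep only allowlisted members from a comma-separated baggage header value."""
    allowed = set(allowed_keys)
    kept = []
    for member in header.split(","):
        key = member.split("=", 1)[0].strip()
        if key in allowed:
            kept.append(member.strip())
    return ",".join(kept)


ALLOWED = {"tenant.tier", "checkout.channel"}  # hypothetical documented keys
incoming = "tenant.tier=gold,user.email=a%40example.test,checkout.channel=web"
print(filter_baggage(incoming, ALLOWED))  # tenant.tier=gold,checkout.channel=web
```

Anything not on the list, such as the email address here, stops at this service instead of travelling to every downstream dependency.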

Use the decision flow below when you review an instrumentation change, answer an OTCA scenario, or refactor a reference snippet into a runnable service pattern. Start with the signal question, then work outward to pipeline behavior and deployment control. That order prevents a common mistake where a team debates exporters before deciding what the telemetry should mean.

┌────────────────────────────────────────────────────────────────────────────┐
│ SDK Design Decision Flow │
│ │
│ 1. What operational question must be answered? │
│ │ │
│ ▼ │
│ 2. Is the evidence per operation, aggregate, or log-like narrative? │
│ │ │
│ ├── Per operation ─────▶ trace span, event, status, or link │
│ ├── Aggregate ────────▶ counter, histogram, observable instrument │
│ └── Narrative ────────▶ log bridge with current trace context │
│ │ │
│ ▼ │
│ 3. Which attributes are safe, bounded, and semantically consistent? │
│ │ │
│ ▼ │
│ 4. Which SDK component controls delivery: processor, reader, exporter? │
│ │ │
│ ▼ │
│ 5. Which settings belong in code, and which belong in environment config? │
└────────────────────────────────────────────────────────────────────────────┘

The first branch separates traces, metrics, and logs by the job they perform. If the question is “which step failed for this checkout request?”, a span, event, status, or linked trace context is the right shape because the evidence belongs to one operation. If the question is “are failures increasing for the checkout workload?”, a counter or histogram is the right shape because the evidence must aggregate across many operations. If the question is “what message did the application write while this span was current?”, a log bridge with trace correlation is the right shape because the evidence is narrative and timestamped.

| Decision Point | Choose This | When the Scenario Says | Watch For |
| --- | --- | --- | --- |
| Span processor | BatchSpanProcessor | Production service, latency-sensitive path, OTLP export | Shutdown flush and queue overflow behavior |
| Span processor | SimpleSpanProcessor | Local demo, unit test, one-shot script with console output | Do not carry this into high-traffic request paths |
| Metric instrument | Counter | Count only increases, such as attempts or errors | Avoid current-state values such as queue depth |
| Metric instrument | Histogram | Distribution matters, especially latency or size | Pick units and attributes that preserve meaning |
| Propagation design | Parent-child | One operation directly causes the next operation | Same trace ID and correct span kind across boundary |
| Propagation design | Links | Batch, fan-in, retry, or async work relates to multiple causes | Do not force one arbitrary parent |
| Configuration location | Environment variables | Endpoint, protocol, sampler, service name differ by deployment | Signal-specific variables may override general variables |
| Configuration location | Code | Business spans, metric names, and resource defaults are part of the service | Avoid hard-coding backend endpoints |

Pause and predict: if you move OTEL_EXPORTER_OTLP_ENDPOINT from the deployment manifest into application code, what happens when a staging collector changes hostnames but the service image is already built? The service now needs a new build or a code-specific override, even though the telemetry destination is a deployment concern. That is an avoidable coupling between source code and platform routing. Keeping export configuration outside the business logic lets a Kubernetes rollout change collector topology without changing the instrumentation that describes checkout behavior.

The same framework helps you refactor reference-style snippets. A snippet that only creates a tracer and prints a span is not yet a production pattern because it lacks resource identity, shutdown behavior, error status, and a deployment-owned export path. To turn it into a reusable pattern, ask which provider owns the signal, which processor or reader controls delivery, which attributes are safe to query, and which final check proves the signal answers the original operational question. That refactoring habit is what makes SDK knowledge useful outside exam flashcards.

  • OpenTelemetry separates API from SDK intentionally. Libraries can depend on the API to create telemetry without forcing applications to use a specific exporter, processor, sampler, or backend.
  • The Collector is optional from the SDK perspective. An SDK can export directly to a backend, but production teams often use the Collector for routing, batching, filtering, and backend migration flexibility.
  • Semantic conventions are operational contracts. A small naming mismatch such as using an old HTTP attribute can break dashboards even when spans are technically being exported.
  • Auto-instrumentation is a starting point, not a complete observability strategy. It captures common library boundaries, while manual instrumentation captures business meaning and incident-specific evidence.

| Mistake | Why It Happens | How to Fix It |
| --- | --- | --- |
| Using SimpleSpanProcessor in production services | Export happens synchronously and can add backend or console latency to request handling | Use BatchSpanProcessor and flush during shutdown |
| Setting service.name as a span attribute | Service identity belongs to the resource and should apply to all emitted telemetry | Set service.name on the provider resource or through OTEL_SERVICE_NAME |
| Marking outbound database or HTTP calls as INTERNAL | Dependency maps and trace readers lose the communication-role signal | Use CLIENT for outbound dependency calls and SERVER for inbound handlers |
| Recording exceptions without setting span error status | The exception event exists, but the span may not summarize the operation as failed | Record the exception and set StatusCode.ERROR when the operation fails |
| Putting user identifiers, tokens, or emails in baggage | Baggage propagates downstream through headers or metadata and may leak sensitive data | Use bounded, non-sensitive context or keep details in protected logs |
| Adding high-cardinality request IDs to metric attributes | Each unique value can create a separate time series and overload dashboards or backends | Keep metrics aggregatable and put request-specific evidence in traces or logs |
| Confusing cumulative and delta temporality | Backend graphs can show misleading rates, resets, or totals | Match reader/exporter temporality with backend expectations |
| Duplicating auto-instrumented spans manually | Traces become noisy, costs rise, and readers see two spans for the same operation | Keep auto spans for technical boundaries and add manual spans for business steps |

Test yourself with scenario-based OTCA-style questions. Each question asks you to apply the SDK model to a realistic situation, not to recite a definition.

Q1: Your team added OpenTelemetry to a high-traffic API, and p95 latency increased after enabling console trace export with a simple processor. What do you change first, and why?

Change the trace pipeline to use a BatchSpanProcessor, and stop treating console export as a production destination. The simple processor exports ended spans synchronously on the request path, so exporter latency becomes application latency. A batch processor buffers spans and exports asynchronously, which preserves trace emission while removing most export work from request handling.
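The difference between the two processors can be modelled without the SDK. The sketch below is a standard-library illustration, not the OpenTelemetry implementation: `RecordingExporter`, `SimpleProcessor`, and `BatchProcessor` are hypothetical stand-ins, and the exporter only records which thread did the export work.

```python
import queue
import threading


class RecordingExporter:
    """Hypothetical exporter that notes which thread performed each export."""

    def __init__(self):
        self.calls = []  # list of (thread_name, [span names])

    def export(self, spans):
        self.calls.append((threading.current_thread().name, list(spans)))


class SimpleProcessor:
    """Exports every ended span immediately, on the caller's thread."""

    def __init__(self, exporter):
        self.exporter = exporter

    def on_end(self, span):
        self.exporter.export([span])  # export cost lands on the request path


class BatchProcessor:
    """Queues ended spans; a background worker exports them in batches."""

    def __init__(self, exporter, max_batch=4):
        self.exporter = exporter
        self.max_batch = max_batch
        self.pending = queue.Queue()
        self.worker = threading.Thread(
            target=self._run, name="export-worker", daemon=True
        )
        self.worker.start()

    def on_end(self, span):
        self.pending.put(span)  # cheap enqueue, no exporter call here

    def _run(self):
        stopping = False
        while not stopping:
            batch = []
            item = self.pending.get()  # block until a span or the stop sentinel
            while True:
                if item is None:
                    stopping = True
                    break
                batch.append(item)
                if len(batch) >= self.max_batch:
                    break
                try:
                    item = self.pending.get_nowait()
                except queue.Empty:
                    break
            if batch:
                self.exporter.export(batch)

    def shutdown(self):
        self.pending.put(None)  # sentinel: drain what is queued, then stop
        self.worker.join()


simple_exporter = RecordingExporter()
simple = SimpleProcessor(simple_exporter)
batch_exporter = RecordingExporter()
batch = BatchProcessor(batch_exporter)

for span_name in ["GET /a", "GET /b", "GET /c"]:
    simple.on_end(span_name)
    batch.on_end(span_name)
batch.shutdown()  # flush before exit so queued spans are not lost
```

In this model every simple-processor export runs on the thread that ended the span, while every batch export runs on `export-worker`. The `shutdown()` call plays the role of the flush that short-lived processes need before exiting.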

Q2: A checkout trace shows an inbound HTTP span, but the payment-service spans appear as separate root traces. Both services have OTel installed. What do you inspect next?

Inspect propagation at the service boundary. The gateway must inject context into outbound HTTP headers, and payment-service must extract context before starting its server span. Check that both sides use compatible propagators such as W3C TraceContext, then verify that the same trace ID appears on both sides of the request.
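To make the inject and extract steps concrete, here is a minimal standard-library sketch of the W3C `traceparent` header format (`version-traceid-parentid-flags`). It is an illustration rather than the SDK propagator: the `inject` and `extract` helpers are hypothetical, only version `00` is handled, and header names are assumed to be lower-cased already.

```python
import re

# version 00, 32 hex chars of trace ID, 16 hex chars of parent span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers, trace_id, span_id, sampled=True):
    """Write a traceparent header onto an outbound request's headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers):
    """Return (trace_id, parent_span_id, sampled), or None when no valid context exists."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if int(trace_id, 16) == 0 or int(span_id, 16) == 0:
        return None  # all-zero IDs are invalid
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


# Gateway side: the IDs of the active CLIENT span (example values).
gateway_trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
gateway_span_id = "00f067aa0ba902b7"
outbound_headers = {}
inject(outbound_headers, gateway_trace_id, gateway_span_id)

# Payment-service side: extract before starting the SERVER span.
continued = extract(outbound_headers)

# If a proxy or client library drops the header, there is nothing to continue,
# so the receiving service would start a new root trace.
broken = extract({})
```

When `extract` returns `None`, the receiving service has no parent to attach to, which matches the "separate root traces" symptom in the question. Comparing the trace ID on both sides of the request is the quickest confirmation.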

Q3: A dashboard for checkout latency is useless because every metric series includes `cart.id`. What instrumentation change should you recommend?

Remove cart.id from metric attributes and keep the metric dimensions bounded, such as checkout.channel or order.type if those values have controlled sets. Request-specific identifiers can belong in trace attributes or protected logs when justified. Metrics should preserve aggregate behavior, and high-cardinality identifiers create too many time series.
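A rough way to see the cost is to estimate the worst-case number of time series as the product of distinct values per attribute. The sketch below uses made-up attribute values and a hypothetical `worst_case_series` helper; real backends count series per metric and per observed combination, so treat the result as an upper bound.

```python
from math import prod


def worst_case_series(attribute_values):
    """Upper bound on series count: product of distinct values per attribute."""
    return prod(len(values) for values in attribute_values.values())


# Bounded dimensions with controlled value sets (illustrative values).
bounded = {
    "checkout.channel": {"web", "mobile", "pos"},
    "order.type": {"standard", "subscription"},
}

# The same metric after someone adds a per-request identifier.
with_cart_id = dict(bounded)
with_cart_id["cart.id"] = {f"CART-{n}" for n in range(50_000)}

bounded_series = worst_case_series(bounded)
unbounded_series = worst_case_series(with_cart_id)
```

With these assumed values the bounded metric tops out at 6 series, while adding `cart.id` raises the ceiling to 300,000 and keeps growing with every new cart.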

Q4: A Python service records exceptions with `span.record_exception(error)`, but trace search does not show failed operations reliably. What is missing?

The code should also set the span status to error when the operation fails, for example with Status(StatusCode.ERROR, str(error)). Recording an exception adds a timestamped event, but status summarizes the outcome of the span. A trace reader or alert often relies on the status field to identify failed spans quickly.

Q5: A batch worker processes messages that came from several different checkout requests. A learner wants to choose one incoming message as the parent span. How would you evaluate that design?

Choosing one message as the parent misrepresents the fan-in relationship because the batch work is related to multiple earlier traces. A better design is to create a consumer or batch-processing span and add links to the contexts carried by the messages. Parent-child relationships are best for direct causal chains, while links preserve relationships across batching and fan-in.

Q6: Auto-instrumentation gives a service HTTP and database spans, but incident responders still cannot tell whether failures happen during fraud checks or inventory reservation. What should the team add?

They should add manual instrumentation around meaningful business steps, such as fraud-check and reserve-inventory, with carefully chosen attributes and events. Auto-instrumentation captures library boundaries, but it cannot infer domain intent. Manual spans should complement the automatic spans rather than duplicate HTTP or database spans already created by the instrumentation library.

Q7: A service sends cumulative counters to a backend that expects delta values, and request-rate graphs look wrong after every restart. What part of the SDK/export path should you investigate?

Investigate the metric reader, exporter, collector conversion, and backend temporality expectations. Cumulative values represent totals since start or reset, while delta values represent changes since the previous collection. If the backend interprets one as the other, rates and resets can look misleading even though the application increments the counter correctly.
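A small standard-library sketch can show why the two temporalities must not be confused. It assumes a single counter sampled at regular intervals, treats the first sample as a delta from zero, and treats any drop in value as a restart; the `cumulative_to_delta` helper is hypothetical and far simpler than what a Collector processor or backend does with start timestamps.

```python
def cumulative_to_delta(samples):
    """Convert cumulative counter samples to per-interval deltas.

    A sample lower than its predecessor is treated as a counter reset
    (for example, a process restart), so its full value is the delta.
    """
    deltas = []
    previous = None
    for value in samples:
        if previous is None or value < previous:
            deltas.append(value)
        else:
            deltas.append(value - previous)
        previous = value
    return deltas


# Cumulative request counts; the service restarts before the fourth sample.
cumulative = [10, 25, 40, 5, 12]

deltas = cumulative_to_delta(cumulative)

# What a reader sees if it subtracts neighbours without handling the reset.
naive = [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]
```

Here `deltas` is `[10, 15, 15, 5, 7]`, which sums to the 52 requests actually served, while `naive` is `[15, 15, -35, 7]`. The negative value at the restart is the kind of artifact that makes request-rate graphs look wrong even though the application incremented the counter correctly.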

Q8: A teammate copied a reference-style SDK snippet into a service: it creates one span, hard-codes an OTLP endpoint, omits resource attributes, and never shuts down the provider. The deployment also sets `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`. How would you refactor the snippet and handle endpoint precedence?

Start by writing the operational input: which request, dependency, or business step should the telemetry explain. Then create a provider with resource attributes, use a production-shaped processor or reader, move endpoint and protocol choices to environment configuration, and add shutdown or flush behavior so short-lived processes export data. For OTLP traces, the trace-specific endpoint should override the general OTLP endpoint, while other signals can keep the general endpoint unless they also have signal-specific overrides. Finally, add a validation step that inspects trace identity, span kinds, metric names, bounded attributes, and the resolved exporter destination, because a refactor is not complete until the output proves it answers the original debugging question.
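The endpoint precedence can be sketched as a small resolver. This is an illustration of the documented OTLP/HTTP behavior, not SDK code: it assumes the `http/protobuf` protocol, where a signal-specific endpoint is used as-is and the general endpoint is treated as a base URL with `/v1/traces` appended. The `resolve_traces_endpoint` helper and the hostnames are hypothetical.

```python
DEFAULT_HTTP_ENDPOINT = "http://localhost:4318"  # OTLP/HTTP default base URL


def resolve_traces_endpoint(env):
    """Resolve the trace export URL from an environment mapping (OTLP/HTTP assumed)."""
    specific = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
    if specific:
        return specific  # signal-specific value wins and is used unchanged
    general = env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or DEFAULT_HTTP_ENDPOINT
    return general.rstrip("/") + "/v1/traces"  # base URL plus the signal path


# Both variables set, as in the deployment from the question.
both_set = resolve_traces_endpoint({
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector.staging:4318",
    "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": "http://traces.staging:4318/v1/traces",
})

# Only the general variable set: traces follow the shared collector.
general_only = resolve_traces_endpoint({
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector.staging:4318",
})

# Nothing set: fall back to the local default.
unset = resolve_traces_endpoint({})
```

Passing `os.environ` instead of a literal dictionary gives the same answer for a running process. Logging the resolved value at startup is one way to perform the "resolved exporter destination" check without hard-coding the endpoint in the service.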


Hands-On Exercise: Build and Diagnose an Instrumented Checkout Script

Goal: Convert a small checkout script into a complete OpenTelemetry SDK example that emits useful traces and metrics, then diagnose the output as if you were reviewing a teammate’s instrumentation.

Your team has a checkout job that sometimes fails when the payment gateway times out. The current script prints success or failure, but it does not show where time is spent, which step failed, or whether failures are increasing. You need to add tracing and metrics in a way that would still make sense if the script became a long-running service. Use console exporters for the exercise so you can inspect the output locally.

Create or reuse the repository virtual environment, then install the packages required for console trace and metric export.

```shell
.venv/bin/python -m pip install opentelemetry-api opentelemetry-sdk
```

Create otel_checkout_exercise.py with this uninstrumented input.

```python
import random
import time


def validate_cart(cart_id: str) -> None:
    time.sleep(0.02)
    if not cart_id:
        raise ValueError("cart_id is required")


def charge_payment() -> str:
    time.sleep(0.04)
    if random.random() < 0.25:
        raise RuntimeError("payment gateway timeout")
    return "PAY-1001"


def reserve_inventory() -> None:
    time.sleep(0.03)


def checkout(cart_id: str, channel: str) -> str:
    validate_cart(cart_id)
    payment_id = charge_payment()
    reserve_inventory()
    return f"ORD-{cart_id}-{payment_id}"


if __name__ == "__main__":
    for index in range(6):
        try:
            print(checkout(f"CART-{index}", "web"))
        except RuntimeError as error:
            print(f"checkout failed: {error}")
```

  1. Add a Resource with service.name=checkout-exercise, service.version=0.1.0, and deployment.environment.name=local (older semantic-convention versions name this key deployment.environment).
  2. Configure one TracerProvider at startup with a BatchSpanProcessor and ConsoleSpanExporter.
  3. Configure one MeterProvider at startup with a PeriodicExportingMetricReader and ConsoleMetricExporter.
  4. Create a parent span named checkout for the business operation and child spans named validate-cart, charge-payment, and reserve-inventory.
  5. Use INTERNAL for validation and inventory reservation unless your implementation simulates a real dependency call.
  6. Use CLIENT for payment because the payment gateway represents an outbound dependency.
  7. Add a counter named checkout.attempts with a bounded attribute such as checkout.channel.
  8. Add a counter named checkout.failures with a bounded attribute such as error.type.
  9. Add a histogram named checkout.duration with unit ms and record elapsed checkout time.
  10. When an exception occurs, record it on the active span, set error status, update the failure counter, and re-raise or handle it deliberately.
  11. Flush both providers before the script exits so console output is visible.
  12. Review the output and write down which field proves parent-child trace continuity.

Success criteria:

  • The script runs with .venv/bin/python otel_checkout_exercise.py and exits without import errors.
  • Console trace output shows a checkout parent span and child spans for validation, payment, and inventory reservation.
  • Payment failures show an exception event and error status on the payment span.
  • The parent checkout span also reflects failure when checkout cannot complete.
  • All spans include the resource attribute service.name=checkout-exercise.
  • Metric output includes checkout.attempts, checkout.failures, and checkout.duration.
  • Metric attributes do not include cart.id, user.id, email addresses, tokens, or other high-cardinality or sensitive values.
  • You can explain whether each span kind describes in-process work, inbound work, outbound work, producer work, or consumer work.

After running the script, inspect one successful trace and one failed trace. Confirm that child spans share the parent trace ID and that their parent span ID points back to the checkout span. If a span appears as a separate root, revisit where the span was started and whether current context was preserved. For this single-process exercise, current context should flow automatically through nested start_as_current_span blocks.

Now inspect the metrics. The attempt counter should increase for every checkout attempt, the failure counter should increase only when checkout fails, and the duration histogram should record both successful and failed attempts. If the histogram only records successes, move the duration recording into a finally block so failed requests are included. This is a realistic production concern because excluding failures can make the slowest or most important requests disappear from latency data.
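The `finally` placement can be sketched with plain Python before wiring in real instruments. In this illustration a dictionary stands in for the attempt counter, failure counter, and duration histogram; `record_checkout`, `succeed`, and `time_out` are hypothetical names, not part of the exercise script.

```python
import time

# Stand-ins for checkout.attempts, checkout.failures, and checkout.duration.
metrics = {"attempts": 0, "failures": 0, "durations_ms": []}


def record_checkout(work):
    """Run one checkout attempt and record metrics for success and failure alike."""
    metrics["attempts"] += 1
    start = time.perf_counter()
    try:
        return work()
    except RuntimeError:
        metrics["failures"] += 1
        raise  # let the caller decide how to handle the failure
    finally:
        # Runs on both paths, so failed attempts still contribute a duration.
        elapsed_ms = (time.perf_counter() - start) * 1000
        metrics["durations_ms"].append(elapsed_ms)


def succeed():
    return "ORD-CART-0-PAY-1001"


def time_out():
    raise RuntimeError("payment gateway timeout")


record_checkout(succeed)
try:
    record_checkout(time_out)
except RuntimeError:
    pass
```

After one success and one failure, the stand-in histogram holds two durations while the failure count is one. If the append were moved to just after `work()` inside the `try`, the failed attempt would be missing from the latency data.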

Finally, review your attributes. If you added cart.id to metrics, remove it and explain why it belongs in trace evidence, not aggregate metric dimensions. If you put service.name on every span manually, move it to the resource and explain why provider-level identity is the correct scope. If you used SimpleSpanProcessor, switch to batch processing and explain how request-path export changes latency behavior.


Next Module: Module 2: OTel Collector Architecture — how to receive, process, route, and export telemetry at scale.