Module 3.3: Instrumentation Principles
Complexity:
[MEDIUM]Time to Complete: 30-35 minutes
Prerequisites: Module 3.2: The Three Pillars
Track: Foundations
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Design instrumentation that captures high-cardinality dimensions without creating cardinality explosions or unsustainable cost
- Apply the RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors) methods to determine what to instrument in a service
- Implement structured logging and context propagation patterns that make cross-service debugging possible
- Evaluate whether a service’s instrumentation is sufficient to diagnose production incidents without redeploying code
The Startup That Went Blind at the Worst Possible Moment
Section titled “The Startup That Went Blind at the Worst Possible Moment”November 2019. A Fast-Growing Fintech Startup. Series B Funding Just Closed.
The engineering team had built fast. Really fast. In 18 months, they went from zero to 2 million users, handling $40 million in transactions monthly. The codebase was a masterpiece of velocity—features shipped weekly, bugs fixed within hours.
But they had cut one corner: observability. Every service logged to stdout with basic print statements. No structured data. No trace IDs. No metrics beyond “is the server up?” When things worked, nobody noticed. When things broke, they guessed.
November 15th, 3:47 PM. Their payment processor reports a critical vulnerability. They need to identify every transaction processed through a specific code path in the last 72 hours. Regulatory deadline: 4 hours.
The senior engineer opens their logging system. Searches for “payment.” Gets 847 million results. Unstructured text. No way to filter by transaction ID. No way to identify the affected code path. No correlation between services.
Hour 1: They try regex patterns on raw logs. The patterns match too much. False positives everywhere.
Hour 2: They dump logs to CSV and manually filter in Excel. The file is 47GB. Excel crashes.
Hour 3: They start reading logs line by line, trying to reconstruct transaction flows manually. Ten engineers working in parallel.
Hour 4: They miss the deadline.
The Fallout:
- Regulatory fine: $2.3 million
- Emergency audit: 6 weeks of engineering time diverted
- Customer notification required: 847,000 users informed of potential data exposure
- Reputation damage: Two enterprise deals cancelled
- Investor confidence: Series C delayed by 9 months
Total cost: $18.7 million in direct losses, plus immeasurable opportunity cost.
The root cause: Not the vulnerability itself—that was a common library issue affecting thousands of companies. The root cause was invisible instrumentation. They couldn’t see what their systems had done.
THE INSTRUMENTATION BLINDNESS TRAP═══════════════════════════════════════════════════════════════════════════════
WHAT THEY HAD WHAT THEY NEEDED───────────────────────────────────── ─────────────────────────────────────print("processing payment") Structured: {"event": "payment_start", "transaction_id": "T-12345", "user_id": "U-6789", "trace_id": "abc-123"}
No metrics Counters: payments_processed_total Histograms: payment_duration_seconds Gauges: active_transactions
No tracing Spans: payment-service → fraud-check → processor-api → ledger-write
TIME TO ANSWER "What transactions used code path X?"
With their setup: 4+ hours (and failed)With proper setup: 47 seconds (WHERE code_path = 'X' AND timestamp > '3 days ago')After the incident, they spent 3 months instrumenting properly. The same regulatory request—which came again during the next quarterly audit—took 2 minutes to answer.
Why This Module Matters
Section titled “Why This Module Matters”You understand logs, metrics, and traces. But where do they come from? Instrumentation—the code that generates telemetry.
Bad instrumentation means missing data when you need it most. Over-instrumentation means noise, performance overhead, and massive bills. Good instrumentation means the right data at the right time, with enough context to debug any issue.
This module teaches you instrumentation principles that apply regardless of language or tool. The goal isn’t to make you an expert in any specific SDK, but to give you the judgment to instrument effectively.
The Security Camera Analogy
Security cameras need to be placed strategically. Too few and you have blind spots. Too many and you’re drowning in footage with no budget to store it. You need cameras at entry points, high-value areas, and places where things go wrong. Instrumentation is the same: strategic placement at service boundaries, business operations, and error-prone code.
What You’ll Learn
Section titled “What You’ll Learn”- What to instrument (and what not to)
- Where instrumentation belongs (service boundaries, business operations)
- Instrumentation patterns for logs, metrics, and traces
- The cost of instrumentation
- Context propagation and correlation
Part 1: What to Instrument
Section titled “Part 1: What to Instrument”1.1 The Instrumentation Decision
Section titled “1.1 The Instrumentation Decision”Not everything needs instrumentation. Focus on:
INSTRUMENTATION PRIORITY PYRAMID═══════════════════════════════════════════════════════════════
▲ ╱ ╲ ╱ ╲ ERRORS ╱ ███ ╲ Everything that fails ╱───────╲ (highest priority) ╱ ╲ ╱ █████ ╲ BOUNDARIES ╱─────────────╲ Incoming/outgoing requests ╱ ╲ (most value) ╱ ███████ ╲ ╱───────────────────╲ BUSINESS OPERATIONS ╱ ╲ Key domain events ╱ ███████████ ╲ (checkout, signup) ╱─────────────────────────╲ ╱ ╲ INTERNAL OPERATIONS ╱ █████████████ ╲ Database queries, cache ╱───────────────────────────────╲ (selective) ╱ ╲ ╱ ░░░░░░░░░░░░░ ╲ LOW-LEVEL LOOPS ╱─────────────────────────────────────╲ Skip unless debugging╱ ╲ (usually not needed)1.2 Must-Instrument: Service Boundaries
Section titled “1.2 Must-Instrument: Service Boundaries”Every request entering or leaving your service should be instrumented:
SERVICE BOUNDARY INSTRUMENTATION═══════════════════════════════════════════════════════════════
INCOMING (Inbound)─────────────────────────────────────────────────────────────→ HTTP request received Log: request_id, method, path, headers, user_id Metric: http_requests_total, http_request_duration Trace: Create/continue span
→ Message consumed (Kafka, RabbitMQ) Log: message_id, topic/queue, consumer_group Metric: messages_consumed_total, processing_duration Trace: Extract trace context from message headers
OUTGOING (Outbound)─────────────────────────────────────────────────────────────→ HTTP request to another service Log: request_id, target_service, path, response_status Metric: outbound_requests_total, outbound_request_duration Trace: Create child span, propagate context
→ Database query Log: query (sanitized), duration, rows_affected Metric: db_query_duration, db_connections_active Trace: Child span with db.system, db.statement tags
→ Cache access Log: cache_key, hit/miss Metric: cache_hits_total, cache_misses_total Trace: Child span with cache.hit tag1.3 Must-Instrument: Business Operations
Section titled “1.3 Must-Instrument: Business Operations”Key domain events that matter to the business:
BUSINESS OPERATION EXAMPLES═══════════════════════════════════════════════════════════════
E-commerce:├── user_signup├── product_viewed├── add_to_cart├── checkout_started├── payment_processed├── order_completed└── refund_issued
SaaS:├── account_created├── user_invited├── subscription_started├── feature_used├── limit_reached├── subscription_cancelled└── account_deleted
For each business operation:- Log with full context (who, what, when, outcome)- Metric for volume and success rate- Span if part of a larger flow1.4 Must-Instrument: Errors
Section titled “1.4 Must-Instrument: Errors”Every error should be captured with context:
ERROR INSTRUMENTATION═══════════════════════════════════════════════════════════════
GOOD ERROR LOG{ "timestamp": "2024-01-15T10:32:15.789Z", "level": "error", "message": "Payment processing failed", "trace_id": "abc-123", "user_id": "12345", "order_id": "ORD-789", "amount": 99.99, "payment_method": "credit_card", "error_code": "CARD_DECLINED", "error_message": "Insufficient funds", "stack_trace": "...", "retry_count": 2, "service": "payment-api", "version": "2.3.1"}
WHAT TO INCLUDE├── Context: Who? (user_id), What? (order_id), When? (timestamp)├── Error details: code, message, stack trace├── Correlation: trace_id, request_id├── Environment: service, version, region└── State: retry_count, previous_attemptsTry This (2 minutes)
List 3 business operations in your system that should definitely be instrumented:
For each: Are they currently instrumented with logs, metrics, and traces?
Part 2: Instrumentation Patterns
Section titled “Part 2: Instrumentation Patterns”2.1 Logging Patterns
Section titled “2.1 Logging Patterns”Structured Logging
# BAD: Unstructured, hard to querylogger.info(f"User {user_id} placed order {order_id} for ${amount}")
# GOOD: Structured, queryablelogger.info("order_placed", extra={ "user_id": user_id, "order_id": order_id, "amount": amount, "item_count": len(items), "trace_id": get_trace_id()})Log Levels
| Level | When to Use | Example |
|---|---|---|
| DEBUG | Development/troubleshooting | Variable values, flow details |
| INFO | Normal operations | Requests received, operations completed |
| WARN | Unexpected but handled | Retry succeeded, fallback used |
| ERROR | Failures requiring attention | Request failed, exception caught |
| FATAL | System cannot continue | Startup failure, critical dependency down |
2.2 Metrics Patterns
Section titled “2.2 Metrics Patterns”USE Method (Utilization, Saturation, Errors):
USE METHOD METRICS═══════════════════════════════════════════════════════════════
For every resource (CPU, memory, connections, queues):
UTILIZATION: How busy is it? cpu_usage_percent memory_usage_bytes connection_pool_usage_ratio
SATURATION: How overloaded is it? request_queue_length thread_pool_pending connection_wait_time
ERRORS: How often does it fail? connection_errors_total timeout_errors_total oom_kills_totalRED Method (Rate, Errors, Duration):
RED METHOD METRICS═══════════════════════════════════════════════════════════════
For every service (user-facing):
RATE: How many requests per second? http_requests_total (rate over time)
ERRORS: How many requests fail? http_requests_total{status="5xx"} error_rate = errors / total
DURATION: How long do requests take? http_request_duration_seconds (histogram) p50, p90, p99 latencies2.3 Tracing Patterns
Section titled “2.3 Tracing Patterns”Span Naming
SPAN NAMING CONVENTIONS═══════════════════════════════════════════════════════════════
GOOD span names (low cardinality, descriptive): HTTP GET /api/users/{id} DB SELECT users QUEUE process order CACHE get session
BAD span names (high cardinality, will explode): HTTP GET /api/users/12345 ← includes ID DB SELECT * FROM users WHERE id = 12345 ← includes query GET user_12345 ← includes ID
Put variable data in TAGS, not span names: Span: HTTP GET /api/users/{id} Tags: http.user_id=12345Span Attributes
| Attribute | Type | Example |
|---|---|---|
| http.method | string | ”GET”, “POST” |
| http.status_code | int | 200, 500 |
| http.url | string | ”/api/users/12345” |
| db.system | string | ”postgresql”, “redis” |
| db.statement | string | ”SELECT…” (sanitized) |
| error | boolean | true |
| error.message | string | ”Connection refused” |
Part 3: Context Propagation
Section titled “Part 3: Context Propagation”3.1 Why Context Matters
Section titled “3.1 Why Context Matters”Without context, each service starts fresh. With context, you can:
CONTEXT PROPAGATION VALUE═══════════════════════════════════════════════════════════════
WITHOUT CONTEXT────────────────────────────────────────Service A logs: "Request received from user 123"Service B logs: "Processing order"Service C logs: "Database query executed"
Q: Did Service C's query relate to Service A's request?A: No idea. No connection between them.
WITH CONTEXT────────────────────────────────────────Service A logs: "Request received" trace_id=abc-123 user_id=123Service B logs: "Processing order" trace_id=abc-123Service C logs: "Database query" trace_id=abc-123
Q: Did Service C's query relate to Service A's request?A: Yes! Same trace_id. Can reconstruct full flow.3.2 What Context to Propagate
Section titled “3.2 What Context to Propagate”CONTEXT PROPAGATION═══════════════════════════════════════════════════════════════
STANDARD (W3C Trace Context)─────────────────────────────────────────────────────────────Header: traceparentFormat: {version}-{trace_id}-{parent_span_id}-{flags}Example: 00-abc123def456-span001-01
Header: tracestateFormat: vendor-specific dataExample: congo=t61rcWkgMzE,rojo=00f067aa0ba902b7
APPLICATION CONTEXT (Baggage)─────────────────────────────────────────────────────────────Header: baggageFormat: key=value pairsExample: user_id=12345,tenant=acme,feature_flag=new_checkout
Use baggage for:- User/tenant ID (for filtering logs)- Feature flags (for understanding behavior)- Request source (for debugging)3.3 Propagation Implementation
Section titled “3.3 Propagation Implementation”PROPAGATION FLOW═══════════════════════════════════════════════════════════════
1. RECEIVE REQUEST ┌─────────────────────────────────────────────────────────┐ │ Extract context from incoming headers: │ │ traceparent → trace_id, parent_span_id │ │ baggage → user_id, tenant │ │ │ │ Store in thread-local/async context │ └─────────────────────────────────────────────────────────┘
2. DO WORK ┌─────────────────────────────────────────────────────────┐ │ Log with context: logger.info("...", trace_id=ctx.id) │ │ Create spans as children of parent_span │ │ Record metrics (optionally tag with trace_id) │ └─────────────────────────────────────────────────────────┘
3. CALL DOWNSTREAM ┌─────────────────────────────────────────────────────────┐ │ Inject context into outgoing headers: │ │ Set traceparent, baggage on HTTP client │ │ Set trace headers on message/queue │ └─────────────────────────────────────────────────────────┘
4. DOWNSTREAM REPEATS ┌─────────────────────────────────────────────────────────┐ │ Extract → Work → Inject → ... │ │ Chain continues through all services │ └─────────────────────────────────────────────────────────┘Gotcha: Broken Propagation
Context propagation breaks when:
- A service doesn’t extract/inject headers
- An async queue drops headers
- A third-party library doesn’t propagate
- Manual HTTP clients skip instrumentation
Test propagation by checking: “Can I see the full trace?” If spans are disconnected, propagation is broken somewhere.
Part 4: The Cost of Instrumentation
Section titled “Part 4: The Cost of Instrumentation”4.1 Performance Overhead
Section titled “4.1 Performance Overhead”INSTRUMENTATION OVERHEAD═══════════════════════════════════════════════════════════════
LOGGING─────────────────────────────────────────────────────────────- CPU: Serializing log data (JSON encoding)- Memory: Buffering logs before flush- I/O: Writing to disk/network- Latency: Typically <1ms per log if async
Mitigation:- Use async logging (don't block request)- Sample verbose logs- Use appropriate log levels
METRICS─────────────────────────────────────────────────────────────- CPU: Incrementing counters (negligible)- Memory: Storing histogram buckets- Network: Scraping/pushing (batched)- Latency: Typically microseconds
Mitigation:- Keep cardinality bounded- Use appropriate histogram buckets- Batch metric updates
TRACING─────────────────────────────────────────────────────────────- CPU: Creating spans, encoding- Memory: Storing span data until export- Network: Exporting spans (batched)- Latency: 1-5ms overhead typical
Mitigation:- Sample traces (not all requests)- Use head-based or tail-based sampling- Batch exports4.2 Storage and Cost
Section titled “4.2 Storage and Cost”COST CONSIDERATIONS═══════════════════════════════════════════════════════════════
LOGS1 request = ~500 bytes of logs1M requests/day = 500 MB/day = 15 GB/monthAt $0.50/GB storage + query costs = $$$
METRICS1 metric × 100 label combinations × 15s scrape = moderateHigh cardinality = storage explosionAt $0.10/million samples = manageable
TRACES1 request = 5 spans × 200 bytes = 1KB1M requests/day sampled at 1% = 10GB/monthFull traces at 100% = 1TB/month = $$$$
Sampling is essential for traces at scale.4.3 Sampling Strategies
Section titled “4.3 Sampling Strategies”SAMPLING STRATEGIES═══════════════════════════════════════════════════════════════
HEAD-BASED SAMPLING─────────────────────────────────────────────────────────────Decide at the START whether to trace
Request arrives → Random: 1% → Trace → Random: 99% → Don't trace
Pros: Simple, predictable costCons: Might miss interesting requests
TAIL-BASED SAMPLING─────────────────────────────────────────────────────────────Decide at the END based on what happened
Request completes → Error? → Keep trace → Slow (>1s)? → Keep trace → Normal → Sample at 0.1%
Pros: Keeps interesting tracesCons: Must buffer all traces temporarily, more complex
ADAPTIVE SAMPLING─────────────────────────────────────────────────────────────Adjust rate based on traffic
Low traffic → Sample 100% High traffic → Sample 1% Error spike → Increase error sampling
Pros: Cost control with coverageCons: Complex to implement correctlyTry This (2 minutes)
For your system:
- Current trace sampling rate: ____%
- Estimated monthly trace storage cost: $_____
- If you sampled at 1%, would you still catch the important issues?
Part 5: Instrumentation Best Practices
Section titled “Part 5: Instrumentation Best Practices”5.1 The Golden Rules
Section titled “5.1 The Golden Rules”INSTRUMENTATION GOLDEN RULES═══════════════════════════════════════════════════════════════
1. INSTRUMENT AT BOUNDARIES Every request in, every request out Don't trust others to instrument for you
2. INCLUDE CONTEXT trace_id in every log user_id, request_id where relevant Correlation is key
3. NAME CONSISTENTLY Same field names across services user_id everywhere, not user_id/userId/uid
4. KEEP CARDINALITY BOUNDED Logs: high cardinality OK Metrics: low cardinality only Traces: low cardinality span names
5. SAMPLE APPROPRIATELY 100% of errors Sample normal requests Adjust based on value vs. cost
6. TEST YOUR INSTRUMENTATION Can you find logs for a trace? Can you see the full request flow? Do metrics match reality?5.2 Common Patterns
Section titled “5.2 Common Patterns”Request Timing
# Pattern: Measure duration of operationsstart_time = time.time()try: result = do_operation() metrics.histogram("operation_duration_seconds", time.time() - start_time, labels={"status": "success"})except Exception as e: metrics.histogram("operation_duration_seconds", time.time() - start_time, labels={"status": "error"}) raiseError Counting
# Pattern: Count errors with contexttry: result = call_external_service()except TimeoutError: metrics.counter("external_service_errors_total", labels={"error_type": "timeout", "service": "payment"}) logger.error("External service timeout", extra={ "service": "payment", "timeout_ms": 5000, "trace_id": get_trace_id() }) raiseWar Story: The $50,000/Month Instrumentation Disaster
2021. A Series A SaaS Startup. 50 Engineers.
The VP of Engineering declared: “We will never be caught blind by an incident. Instrument everything.”
The team took it literally. Every function got timing metrics. Every variable assignment got logged. Every code branch got a span. They were proud of their 847,000 active time series and 2.3TB of daily logs.
Month 1: Observability bill: $12,000. “Worth it for visibility.”
Month 3: Bill: $28,000. Dashboards taking 45 seconds to load. “We’ll optimize later.”
Month 6: Bill: $52,000. Queries timing out. Engineers avoiding the observability stack because it’s too slow. Alert fatigue from 400+ daily alerts.
The Breaking Point: A P0 incident occurs. The on-call engineer opens their dashboard. 47-second load time. They search for the error in logs. Query times out after 5 minutes. They try to find relevant metrics among 847,000 series. Needle in a haystack.
The incident that should have taken 15 minutes to debug took 3 hours. Their “comprehensive observability” was actually anti-observability—so much noise that signal was invisible.
The Recovery:
Before After 847,000 metric series 12,400 series 2.3 TB daily logs 180 GB daily logs Every function instrumented Boundaries + business ops + errors $52,000/month $4,200/month 45-second dashboard load 1.2-second dashboard load 400 daily alerts (90% noise) 23 daily alerts (95% actionable) What they removed:
- Loop iteration metrics (who cares how many times a for-loop ran?)
- Internal function timing (nobody debugs at this granularity)
- Debug-level logs in production (useful in dev, noise in prod)
- Metrics with unbounded cardinality (user_id as a label = disaster)
What they kept:
- Every HTTP request in/out (boundaries)
- Every database query (boundaries)
- Business operations (signup, payment, order)
- All errors with full context
The Lesson: “Instrument everything” is not a strategy. It’s an anti-pattern. The goal isn’t maximum data—it’s maximum signal. Every datapoint you add should answer a question you’ll actually ask.
Did You Know?
Section titled “Did You Know?”-
Facebook instruments 10 billion events per second and samples aggressively. They use machine learning to decide which traces to keep based on predicted interestingness.
-
Prometheus was designed for pull-based metrics (scraping), while StatsD uses push. Pull is simpler operationally (no service discovery at client), but push handles short-lived processes better.
-
The OpenTelemetry project merged OpenTracing and OpenCensus in 2019. It’s now the second-largest CNCF project (after Kubernetes) by contributors.
-
LinkedIn discovered that 1% of their traces consumed 50% of their tracing storage. These “mega-traces” from long-running batch jobs needed special handling. The lesson: not all requests are equal—design your sampling strategy accordingly.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Problem | Solution |
|---|---|---|
| Logging sensitive data | PII in logs, compliance issues | Sanitize before logging, use allow-lists |
| High-cardinality metrics | Storage explosion, slow queries | Move high-cardinality to logs |
| No sampling | Massive costs, performance impact | Implement sampling (keep 100% of errors) |
| Inconsistent field names | Can’t correlate across services | Define and enforce naming conventions |
| Forgetting async boundaries | Context lost in async code | Use context propagation libraries |
| Not testing instrumentation | Missing data when you need it | Verify telemetry works before incidents |
-
Why should you instrument at service boundaries rather than deep inside business logic?
Answer
Service boundaries are where:
- Requests enter and leave - This is what users experience
- Context changes - You cross trust/network boundaries
- Failures occur - Network, timeout, authentication issues
- Dependencies are called - External services, databases, caches
Instrumenting boundaries captures the most valuable data (request flow, latencies, errors) with the least code changes. Internal business logic can be traced when needed, but boundaries are always valuable.
Additionally, boundary instrumentation is often automatic (HTTP middleware, database drivers) while internal instrumentation requires manual effort.
-
What’s the difference between head-based and tail-based sampling?
Answer
Head-based sampling decides at the START of a request whether to trace it:
- Simple: Random decision at entry point
- Predictable: Fixed sampling rate
- Problem: Might miss interesting (error, slow) requests
Tail-based sampling decides at the END based on what happened:
- Smart: Keep all errors, all slow requests, sample normal ones
- Complex: Must buffer all traces until decision point
- Better coverage: Important traces are kept
Trade-off: Head-based is simpler and cheaper; tail-based captures more valuable data but requires more infrastructure (collectors that can buffer and decide).
-
Why is consistent field naming important across services?
Answer
Consistent field names enable:
- Cross-service queries:
WHERE user_id = 123works everywhere - Correlation: Same trace_id field links logs across services
- Dashboards: One dashboard works for all services
- Automation: Alert rules apply uniformly
Inconsistent naming (user_id vs userId vs uid) means:
- Manual mapping for every query
- Separate dashboards per service
- Can’t easily correlate related events
- Increased cognitive load for operators
Define a naming convention (snake_case, camelCase) and enforce it across teams.
- Cross-service queries:
-
How do you decide what to log vs. what to make a metric?
Answer
Use LOGS when:
- You need rich context (error messages, stack traces)
- Data has high cardinality (user_id, request details)
- You’re debugging specific events
- Events are infrequent
Use METRICS when:
- You need aggregations (rate, count, average)
- You’re alerting on thresholds
- Data has low cardinality (endpoint, status code)
- You need efficient time-series storage
- You’re building dashboards
Rule of thumb: If you’ll ask “how many” or “what’s the average,” use metrics. If you’ll ask “what happened to this specific thing,” use logs. Use both when appropriate—they complement each other.
-
A service has 100 endpoints, 50 status codes, and 1 million users. An engineer wants to add a metric with labels for endpoint, status, and user_id. Calculate the potential cardinality explosion and explain why this is problematic.
Answer
Calculation: 100 endpoints × 50 status codes × 1,000,000 users = 5 billion potential time series
Why this is problematic:
- Storage: Each time series needs ~3KB of memory for efficient querying. 5B × 3KB = 15TB of memory
- Query performance: Scanning 5B series for a dashboard = timeout
- Cost: At $0.10 per 1000 active series, that’s $500,000/month
- Prometheus limits: Default recommendation is <10M series
The fix:
- Remove user_id from metric labels (move to logs instead)
- Keep endpoint × status = 5,000 series (manageable)
- Log user_id with structured fields for per-user debugging
Rule: Metrics = low cardinality (bounded dimensions). Logs = high cardinality welcome.
-
Your team uses head-based sampling at 1%. A critical bug affects 0.1% of requests. Calculate the probability of catching this bug in your traces. What sampling strategy would help?
Answer
With head-based 1% sampling:
- Bug affects 0.1% of requests
- You sample 1% of all requests randomly
- Probability of sampling a buggy request: 1% × 0.1% = 0.001% of total traffic
- If you have 1M requests/day: 1M × 0.00001 = 10 buggy traces/day
The problem: You might catch 10 examples, but:
- If the bug is intermittent, you might not even notice the pattern
- If you need to analyze 100+ cases to understand the bug, you’re waiting 10+ days
Better strategy: Tail-based sampling:
- Keep 100% of error traces (not sampled)
- Bug causes errors → all buggy requests are traced
- Analysis: immediate, comprehensive
Even better: Adaptive sampling:
- Normal requests: 1% sampling
- Errors: 100% retention
- Slow requests (>p99): 100% retention
- New code paths: elevated sampling for first 24 hours
-
An engineer adds this log line:
logger.info(f"Processing request for user {user.email} with password {user.password}"). List all the problems with this instrumentation.Answer
Critical problems:
- PII exposure: Email is personally identifiable information (PII)
- Credential leak: Password in logs is a severe security vulnerability
- Compliance violation: GDPR, CCPA, SOC2 all prohibit logging credentials
- Unstructured format: f-string isn’t queryable
- Missing context: No trace_id, request_id, timestamp
- Audit liability: Logs may be retained for years—credentials persist
The fix:
logger.info("request_processing_started", extra={"user_id": user.id, # ID, not email"trace_id": get_trace_id(),"request_id": request.id,# NEVER log: password, tokens, SSN, credit card, email (use user_id)})Best practice: Use allow-lists (explicitly list what CAN be logged) rather than deny-lists (trying to catch everything that shouldn’t be logged). Assume everything is sensitive unless proven otherwise.
-
Your service calls a third-party API. Traces show the span but context propagation breaks—downstream spans aren’t connected. List three possible causes and how to debug each.
Answer
Possible causes and debugging:
-
HTTP client not instrumented
- Debug: Check if traceparent header is being sent
- Fix: Use OTel-instrumented HTTP client or add manual header injection
Terminal window # Verify header propagationcurl -v yourservice.com 2>&1 | grep -i traceparent -
Third-party API doesn’t propagate context
- Debug: Check if third-party returns/forwards trace context
- Fix: You can’t fix their code, but you can:
- End your span when you make the call
- Start a new trace when response returns (link to original via span attributes)
-
Async boundary drops context
- Debug: Check if context is preserved across await/callback
- Fix: Use context propagation utilities:
# Python example with OTelfrom opentelemetry import contextctx = context.get_current()# ... in callback ...context.attach(ctx)
Debugging workflow:
- Add debug logging at call site: log trace_id before and after
- Inspect outgoing HTTP headers
- Check if response includes trace context
- Verify context storage across async boundaries
-
Key Takeaways
Section titled “Key Takeaways”INSTRUMENTATION ESSENTIALS CHECKLIST═══════════════════════════════════════════════════════════════════════════════
WHAT TO INSTRUMENT (PRIORITY ORDER)☑ 1. Errors - EVERYTHING that fails, with full context☑ 2. Boundaries - every request in, every request out☑ 3. Business operations - signup, payment, order, key events☑ 4. Dependencies - database, cache, external APIs☑ 5. Internal operations - SELECTIVE, not exhaustive
INSTRUMENTATION RULES☑ Structured logs with consistent field names☑ trace_id in EVERY log line (non-negotiable)☑ Low-cardinality metric labels only☑ Span names should be low-cardinality (variables in tags)☑ NEVER log passwords, tokens, PII, credentials
COST AWARENESS☑ Every metric costs storage and query time☑ Every log line costs storage and search time☑ Every span costs memory and export bandwidth☑ Sample traces strategically (100% for errors, % for normal)☑ Calculate cardinality BEFORE adding labels
CONTEXT PROPAGATION☑ Use W3C Trace Context standard (traceparent header)☑ Test propagation across ALL service boundaries☑ Include context in async operations (queues, callbacks)☑ Use baggage for application-level context (user_id, tenant)
COMMON ANTI-PATTERNS TO AVOID☑ "Instrument everything" → instrument strategically☑ High-cardinality metrics → move to logs☑ Unstructured logs → structured JSON with fields☑ Missing trace_id → always include correlation☑ Sensitive data in logs → use allow-listsHands-On Exercise
Section titled “Hands-On Exercise”Task: Instrument a sample service endpoint with all three pillars.
Part 1: Define Instrumentation Points (10 minutes)
For this endpoint: POST /api/orders (create order)
Flow:1. Receive HTTP request2. Validate user authentication3. Validate order data4. Check inventory (DB call)5. Reserve inventory (DB call)6. Create order record (DB call)7. Publish order event (message queue)8. Return responseFor each step, define what to instrument:
| Step | Log Event | Metric | Trace Span |
|---|---|---|---|
| 1. Receive | ”request_received” + method, path, user_id | http_requests_total | HTTP POST /api/orders |
| 2. Auth | ”auth_validated” or “auth_failed” | auth_duration_seconds | auth.validate |
| 3. Validate | |||
| 4. Check inventory | |||
| 5. Reserve inventory | |||
| 6. Create order | |||
| 7. Publish event | |||
| 8. Response |
Part 2: Write Pseudo-Code (15 minutes)
Write instrumentation code for steps 4-6 (database operations):
# Your pseudo-code here
def check_inventory(items, trace_context): # Create span # Log start # Call database # Log result # Record metric (duration, success/failure) # Return result pass
def reserve_inventory(items, order_id, trace_context): # Similar pattern pass
def create_order(order_data, trace_context): # Similar pattern passPart 3: Error Handling (10 minutes)
What happens when step 5 (reserve inventory) fails because an item is out of stock?
Define the instrumentation:
| Signal | What to Capture |
|---|---|
| Log | level=WARN, message=?, fields=? |
| Metric | which metric? labels? |
| Trace | span status? error tags? |
How would you find all “out of stock” errors in the last hour?
Success Criteria:
- All 8 steps have defined instrumentation (logs, metrics, traces)
- Database operations include: duration metrics, child spans, contextual logs
- Error case instrumentation is comprehensive
- Query for finding errors is defined
Further Reading
Section titled “Further Reading”-
“Observability Engineering” - Charity Majors et al. Chapters on instrumentation practices.
-
OpenTelemetry Instrumentation Docs - Language-specific instrumentation guides.
-
“Distributed Tracing in Practice” - Austin Parker et al. Deep dive into tracing instrumentation.
Next Module
Section titled “Next Module”Module 3.4: From Data to Insight - Using observability data: alerting, debugging, and building understanding.