Fundamentals
The model, the wire, and the moving parts. Everything that's true regardless of language.
1.0 The tracing model
A trace is the recorded path a unit of work takes through a distributed system. It is a directed acyclic graph of spans, each span representing a bounded operation — an HTTP request handler, a database query, a function call. A trace has no separate object on the wire; it exists only as the set of spans sharing a trace_id.
The relationship between spans is captured by parent_span_id. A complete trace has exactly one root span (the one with no parent) and any number of descendants. Sibling spans may overlap in time, which encodes concurrency, or run sequentially.
Two design properties of this model are worth internalizing:
- Spans are emitted independently. Each service emits its own spans to the backend. There is no central coordinator. The backend reassembles the trace by grouping on trace_id. This means a trace can be partial: if one service drops its spans, the rest still arrive.
- Trace context is carried in-band. The trace_id and parent span ID travel with the request itself (HTTP headers, gRPC metadata, Kafka record headers). This is what stitches the graph together.
Everything downstream — sampling, storage layout in Tempo, the TraceQL query model — falls out of these two facts.
- Traces are reassembled by the backend from independently emitted spans — partial traces are normal and useful.
- Context (trace_id + parent_span_id) travels in-band with the request; that is the only thing that connects the graph.
- Span kind is not decorative — it tells the UI and sampling logic how to treat a span relative to network boundaries.
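As a rough illustration of backend-side reassembly, the sketch below groups independently emitted spans by trace_id and reports which spans are roots and which are orphans (their parent never arrived). The dict-based span shape and the function name `assemble` are assumptions made for this example; they are not an OTLP or Tempo API.

```python
from collections import defaultdict


def assemble(spans):
    """Group spans by trace_id and report roots and orphans per trace."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)

    report = {}
    for trace_id, members in traces.items():
        known_ids = {s["span_id"] for s in members}
        roots = [s["span_id"] for s in members if not s["parent_span_id"]]
        # An orphan names a parent that never arrived (a partial trace).
        orphans = [
            s["span_id"]
            for s in members
            if s["parent_span_id"] and s["parent_span_id"] not in known_ids
        ]
        report[trace_id] = {"roots": roots, "orphans": orphans}
    return report


spans = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": ""},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a"},
    # Span "c" was dropped by its service, so "d" points at a missing parent.
    {"trace_id": "t1", "span_id": "d", "parent_span_id": "c"},
]
result = assemble(spans)
```

The trace is still usable with span "c" missing: the root and its direct child are intact, and the orphan marks where the gap is.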
2.0 Spans: anatomy and lifecycle
A span is the unit of work. Its complete shape on the wire (OTLP encoding) is:
| Field | Type | Notes |
|---|---|---|
| trace_id | 16 bytes | 128-bit. Generated at root. Encoded as 32-char hex. |
| span_id | 8 bytes | 64-bit. Unique within trace. 16-char hex. |
| parent_span_id | 8 bytes | Empty for root spans. |
| name | string | Low-cardinality. Use route templates, not raw URIs. |
| kind | enum | SERVER, CLIENT, PRODUCER, CONSUMER, INTERNAL. |
| start_time | uint64 nanos | Unix epoch. |
| end_time | uint64 nanos | Unix epoch. |
| attributes | map | Key-value, typed. Bounded (default 128). |
| events | list | Timestamped annotations within the span. |
| links | list | References to other spans (often other traces). |
| status | enum | UNSET (default), OK, ERROR. |
2.1 Span kind
Span kind is more semantically loaded than it looks. It tells the backend how to interpret the span in relation to network boundaries:
- SERVER: The span represents handling an incoming request. There is a corresponding CLIENT span on the caller's side.
- CLIENT: Outgoing synchronous call. Its child on the remote side is the SERVER span.
- PRODUCER: Asynchronous send (message queue, event bus). The corresponding CONSUMER may be created later or never linked by parent.
- CONSUMER: Async receive. Often uses span links rather than a direct parent relationship.
- INTERNAL: Default. Local operation with no network boundary.
Tempo's metrics-generator depends on this: its service-graph processor builds the graph by pairing CLIENT spans with SERVER spans across services. Mis-kinded spans produce broken graphs.
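To see why a wrong kind breaks the graph, here is a simplified sketch of the pairing idea. It is not Tempo's implementation; the span dicts, the `service` field, and the function name `service_edges` are assumptions for illustration only.

```python
def service_edges(spans):
    """Return (caller, callee) service pairs found via CLIENT -> SERVER parentage."""
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    edges = set()
    for span in spans:
        if span["kind"] != "SERVER":
            continue
        parent = by_id.get((span["trace_id"], span["parent_span_id"]))
        # Only a CLIENT parent marks a network hop between two services.
        if parent is not None and parent["kind"] == "CLIENT":
            edges.add((parent["service"], span["service"]))
    return edges


spans = [
    {"trace_id": "t1", "span_id": "1", "parent_span_id": "", "kind": "SERVER", "service": "gateway"},
    {"trace_id": "t1", "span_id": "2", "parent_span_id": "1", "kind": "CLIENT", "service": "gateway"},
    {"trace_id": "t1", "span_id": "3", "parent_span_id": "2", "kind": "SERVER", "service": "payments"},
    # Mis-kinded: this outgoing call should be CLIENT, so the hop to "ledger" is missed.
    {"trace_id": "t1", "span_id": "4", "parent_span_id": "3", "kind": "INTERNAL", "service": "payments"},
    {"trace_id": "t1", "span_id": "5", "parent_span_id": "4", "kind": "SERVER", "service": "ledger"},
]
edges = service_edges(spans)
```

Only the gateway-to-payments edge is found; the payments-to-ledger hop disappears because the calling span was left as INTERNAL.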
2.2 Attributes vs events vs links
Three places to put information, easy to mix up:
- Attributes describe the span itself. They are queryable in TraceQL. Use them for things that characterize the whole operation: http.route, db.query.text, user.id, messaging.kafka.partition.
- Events are timestamped annotations within the span: discrete things that happened during the operation. Use sparingly; an event is "I got a cache miss at t=12ms" or a recorded exception. Events are not first-class queryable in most backends.
- Links reference other spans. Two main uses: (1) a CONSUMER span linking to the PRODUCER span that originated the work, when the consumer's parent is something else (the batch processor); (2) a batch operation that fans in from many traces.
2.3 Span lifecycle
A span has three observable states:
- Recording. Between start and end. Attribute mutations are accepted. IsRecording() returns true.
- Ended. Past the end call. Any further mutation is dropped silently (this is a common source of bugs).
- Exported. Picked up by the SpanProcessor and serialized. Once here, it is past the SDK's reach.
A span that is not sampled at creation may still be created as a "non-recording" placeholder — important for context propagation, because the trace ID still needs to travel downstream. IsRecording() is the right gate around expensive attribute computation.
Setting attributes after end() is a no-op and does not raise. In long codepaths where a span end and a late attribute set are not visually colocated (e.g., exception handlers, finally blocks), it is easy to lose data. Always set attributes and add events before end; an ended span drops late events just as it drops late attributes.
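The failure mode is easy to reproduce with a toy model. The `ToySpan` class below is a hypothetical stand-in written for this example, not the SDK's span class; it only mimics the "drop silently after end" behaviour described above.

```python
class ToySpan:
    """Toy model of the recording/ended states; not the SDK's Span class."""

    def __init__(self, name):
        self.name = name
        self.attributes = {}
        self._ended = False

    def is_recording(self):
        return not self._ended

    def set_attribute(self, key, value):
        if self._ended:
            return  # dropped silently, no exception raised
        self.attributes[key] = value

    def end(self):
        self._ended = True


span = ToySpan("charge-card")
try:
    span.set_attribute("payment.amount", 42)
finally:
    span.end()

# Too late: the span has already ended, so this value is lost without any error.
span.set_attribute("payment.result", "ok")
```

Nothing fails at runtime, which is why this kind of bug usually shows up only as a missing attribute in the backend.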
2.4 Recording exceptions
An exception is recorded as a span event with the conventional name exception and three attributes: exception.type, exception.message, exception.stacktrace. The SDK exposes a record_exception() helper that produces this canonical shape. Recording an exception does not automatically mark the span as ERROR — you must also set status. The conventional pattern:
python
from opentelemetry.trace import Status, StatusCode

# Always do BOTH on failure
span.record_exception(exc)
span.set_status(Status(StatusCode.ERROR, str(exc)))
3.0 Context and propagation
The thing that turns a pile of spans into a trace is context propagation: the trace ID and current span ID must travel with the work, both within a process (across function calls, async boundaries) and across processes (HTTP, gRPC, Kafka).
3.1 In-process: the Context object
OTel has a generic Context abstraction — an immutable map of values, with a "current" context per execution. Setting the current context returns a token; restoring it requires the token. Almost every language SDK wires this into the language's native async primitive:
- Python: contextvars.ContextVar. Propagates across await, asyncio.create_task, asyncio.gather. Does not propagate across ThreadPoolExecutor.submit unless you wrap the callable in context.run.
- Node.js: AsyncLocalStorage (the modern path). Propagates across Promise chains, await, setImmediate, setTimeout. Does not propagate into Worker threads.
This is why "the active span" works at all when you call tracer.start_as_current_span() deep in a call stack — the context has been installed at the request boundary by the framework's middleware.
3.2 Cross-process: W3C Trace Context
OTel defaults to the W3C Trace Context standard. Two HTTP headers do the work:
http
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  └──────── trace-id (32 hex) ───┘ └─ parent-id ──┘ │
             │                                                    │
             version                                              flags
Decoded:
- version: 00, the only version in use today.
- trace-id: the 128-bit trace ID. Must be non-zero.
- parent-id: the span ID of the caller's currently active span. The receiver uses this as parent_span_id on its new SERVER span.
- flags: 8-bit. Currently only sampled (bit 0) is defined. 01 = sampled, 00 = not sampled.
A companion tracestate header carries vendor-specific data (typically used by SaaS observability vendors; you can usually ignore it in self-hosted setups).
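For debugging headers captured off the wire, a small decoder is handy. The following is a minimal sketch that only handles the version 00 layout shown above; `TraceParent` and `parse_traceparent` are names invented for this example, and in application code the SDK's propagator should do this work.

```python
import re
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})"
    r"-(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the decoded fields, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    # All-zero trace or parent IDs are invalid.
    if set(match["trace_id"]) == {"0"} or set(match["parent_id"]) == {"0"}:
        return None
    sampled = bool(int(match["flags"], 16) & 0x01)  # bit 0 is the sampled flag
    return TraceParent(match["version"], match["trace_id"], match["parent_id"], sampled)


parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

Returning None for anything malformed mirrors what a receiver typically does: ignore the header and start a new trace.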
3.3 Baggage: a separate channel
Baggage is propagated via the baggage HTTP header and carries arbitrary key-value pairs that flow with the request. It is not automatically copied to span attributes — that is intentional. Baggage is meant for things you want to make available downstream (e.g., a feature-flag context, tenant ID) without polluting every span. To put baggage values on a span, you do it explicitly.
Baggage propagates out of your service. Do not put secrets, PII, or anything you wouldn't print to a log into baggage. Many production setups strip baggage at trust boundaries (egress gateway).
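To make the wire format concrete, the sketch below encodes and decodes a baggage header value by hand. It is a simplified reading of the format (comma-separated key=value members, percent-encoded values, optional ;properties ignored); the function names are made up for this example, and real services should rely on the SDK's baggage propagator.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries: dict) -> str:
    """Serialize key-value pairs into a baggage header value."""
    return ",".join(f"{key}={quote(str(value), safe='')}" for key, value in entries.items())


def decode_baggage(header: str) -> dict:
    """Parse a baggage header value, ignoring optional ;properties."""
    entries = {}
    for member in header.split(","):
        pair = member.split(";", 1)[0]  # drop properties such as ";ttl=1"
        if "=" not in pair:
            continue  # skip malformed members
        key, value = pair.split("=", 1)
        entries[key.strip()] = unquote(value.strip())
    return entries


header = encode_baggage({"tenant.id": "acme", "feature.experiment": "checkout v2"})
decoded = decode_baggage(header)
```

Seeing the header as plain text is a useful reminder of the warning above: anything placed in baggage is readable by every hop that receives the request.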
3.4 The sampled flag and consistency
The W3C sampled flag matters enormously. When a service receives a request with sampled=1, its tracer (under default parent-based sampling) will also sample its span. With sampled=0, it will not. This produces head-sampling consistency: either the whole trace is sampled or none of it is.
This is the right default — a partial trace with the middle missing is worse than no trace at all. But it also means the sampling decision at the edge service propagates through the entire system. If you want a different decision (e.g., always sample if there's an error) you do it via tail sampling in the Collector, which we'll come back to.
3.5 Debugging broken traces: a systematic guide
When traces are broken you will see one or more of these symptoms in Tempo / Grafana:
- Orphan spans (root spans that are not the ones you expect, or many tiny roots for what should be one request).
- Multiple roots for a single logical operation.
- Missing children — a CLIENT span has no corresponding SERVER span on the downstream service (or the SERVER has no parent).
- Wrong parentage — a span that should be deep in the tree appears at the top level.
- Trace ID changes mid-request (rare but catastrophic when it happens).
Follow this decision tree:
- Is the trace ID the same across services? Look at any two spans that should be connected. If trace_id differs, propagation failed at the network boundary between them. Check the traceparent header on the wire (curl -v, tcpdump, or your service mesh logs).
- Is the parent_span_id correct on the receiving side? The receiving service must take the parent-id from the incoming traceparent and use it when it calls tracer.start_span(..., context=extract(...)) or lets auto-instrumentation do it.
- Did an in-process boundary lose the context? The most common Python culprits are ThreadPoolExecutor, loop.run_in_executor, concurrent.futures.ProcessPoolExecutor, Celery tasks, and any library that spawns its own thread pool without copying context (many DB drivers, some HTTP/2 clients, gRPC async under certain configurations).
- Is the instrumentation actually running? The fastest check: set OTEL_TRACES_EXPORTER=console (or add a ConsoleSpanExporter in code) and look for output. If you see spans but they have no parents or wrong trace IDs, the problem is propagation, not export.
- Are you in a forked worker model? Gunicorn with pre-fork + early initialization is a classic. The Provider (and its processors/exporters) must be created after the fork.
In any service, temporarily add this at startup (remove before prod):
python
from opentelemetry import trace

provider = trace.get_tracer_provider()
print("TracerProvider:", type(provider).__name__)
# Only the SDK TracerProvider exposes a sampler; API-level providers do not.
print("Sampler:", getattr(provider, "sampler", None))
If you see NoOpTracerProvider or ProxyTracerProvider that never got replaced, your setup_tracing() was never called or was called too late.
3.6 Manual context injection and extraction
Auto-instrumentation handles 95% of cases. The other 5% (custom protocols, legacy systems, CLI tools that call your services, tests) require explicit code.
Python example (httpx manual + context)
python
import httpx
from opentelemetry.propagate import inject

async def call_downstream_with_manual_propagation(url, payload):
    carrier = {}
    # Inject the *current* context (whatever span is active) into the carrier,
    # using the globally configured propagators.
    inject(carrier)
    # The carrier now contains {"traceparent": "..."} plus tracestate/baggage when present.
    async with httpx.AsyncClient() as client:
        resp = await client.post(url, json=payload, headers=carrier)
    # On the receiving side the other service will do:
    #   ctx = extract(request.headers)
    #   with tracer.start_as_current_span("...", context=ctx): ...
    return resp
Node/TypeScript equivalent
ts
import { context, propagation, trace } from '@opentelemetry/api';

const tracer = trace.getTracer('manual-propagation-example');

function callWithPropagation(url: string, body: any) {
  const carrier: Record<string, string> = {};
  propagation.inject(context.active(), carrier); // puts traceparent etc.
  return fetch(url, {
    method: 'POST',
    headers: carrier,
    body: JSON.stringify(body),
  });
}

// Receiving side (Express example; `app` is your Express application)
app.post('/work', (req, res) => {
  const ctx = propagation.extract(context.active(), req.headers);
  const span = tracer.startSpan('handle-work', {}, ctx);
  context.with(trace.setSpan(ctx, span), () => {
    // your handler logic: the active span is now correct
    span.end();
  });
});
Always prefer the high-level propagation.inject / propagation.extract over touching headers yourself: they use whichever propagators are configured, so the same code keeps working if you switch formats (b3, jaeger, etc.) via OTEL_PROPAGATORS.
3.7 Thread, executor, and worker boundaries (the places context dies)
Python — safe patterns
python
# ThreadPoolExecutor: the single most common source of lost context
import asyncio
from concurrent.futures import ThreadPoolExecutor
from contextvars import copy_context

executor = ThreadPoolExecutor(max_workers=8)

def do_cpu_work(arg):
    # This runs in a worker thread. It sees the caller's context only
    # because it is invoked through ctx.run below.
    ...

async def schedule_from_async(arg):
    ctx = copy_context()  # snapshot the caller's context, including the active span
    future = executor.submit(ctx.run, do_cpu_work, arg)
    return await asyncio.wrap_future(future)
asyncio.to_thread (Python 3.9+) copies the current context into the worker thread for you. loop.run_in_executor does not, so keep the copy_context().run wrapper there and with any library that manages its own thread pool.
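The effect can be reproduced with the standard library alone. In this sketch, `current_trace_id` is a hypothetical stand-in for the context variable the SDK uses to hold the active span; no OTel packages are involved.

```python
from concurrent.futures import ThreadPoolExecutor
from contextvars import ContextVar, copy_context

# Stand-in for the SDK's "current span" slot (hypothetical name).
current_trace_id = ContextVar("current_trace_id", default=None)


def read_trace_id():
    return current_trace_id.get()


current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")

with ThreadPoolExecutor(max_workers=1) as pool:
    # Plain submit: the worker thread does not run in the caller's context.
    lost = pool.submit(read_trace_id).result()
    # Wrapped submit: the callable runs inside a copy of the caller's context.
    kept = pool.submit(copy_context().run, read_trace_id).result()
```

On a standard CPython build, `lost` comes back as the default value while `kept` carries the trace ID, which is exactly the difference between an orphaned span and a correctly parented one.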
Process boundaries (multiprocessing, Celery, RQ, Dramatiq, Prefect tasks, etc.): the new process has a completely fresh Python interpreter. You must serialize the traceparent string (or the whole carrier) as part of the task payload and extract it at the beginning of the worker function before any spans are created.
python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# In the task producer
carrier = {}
inject(carrier)
task_payload = {"traceparent": carrier.get("traceparent"), "data": ...}

# In the Celery task (or equivalent); `celery` is your Celery app instance
@celery.task
def heavy_task(payload):
    if payload.get("traceparent"):
        ctx = extract({"traceparent": payload["traceparent"]})
        with tracer.start_as_current_span("heavy-task", context=ctx):
            ...
    else:
        with tracer.start_as_current_span("heavy-task"):
            ...
Node.js / TypeScript
AsyncLocalStorage (the OTel context carrier) does not cross worker_threads or child_process. Same rule: pass the carrier in the message and re-activate it on the other side with propagation.extract + context.with.
If you ever see a span whose parent_span_id is empty when it should have a parent, or whose trace_id is different from the request that started it, 90% of the time the root cause is one of: thread pool, process fork, third-party library that uses its own executor, or a missing context.with / tracer.start_as_current_span(..., context=...) in a framework hook.
3.8 Baggage: when and how to use it for real work
Baggage is the only OTel mechanism that flows arbitrary values across process boundaries without being tied to a single span. Use it for:
- Tenant / customer ID (so every downstream service can emit it on logs/metrics without you threading it through every function signature).
- Feature flag / experiment context ("variant=green") so A/B results can be correlated in traces.
- Request priority or "debug this request" flags that should influence sampling or logging verbosity downstream.
Never put anything you would not be happy to see in a log line or in a trace attribute that a contractor might query.
Practical pattern (Python)
python
from opentelemetry import context
from opentelemetry.baggage import set_baggage, get_baggage
from opentelemetry.propagate import inject, extract

# At the edge (gateway / auth middleware)
ctx = set_baggage("tenant.id", tenant_id, context=context.get_current())
ctx = set_baggage("feature.experiment", "checkout-redesign-v2", context=ctx)
token = context.attach(ctx)  # make it the current context for this request
# ... handle the request, then call context.detach(token) when it finishes

# Later, anywhere in the same request (even in a background task that captured the context)
tenant = get_baggage("tenant.id")

# On any outgoing call the baggage header is injected automatically if the
# configured propagators include baggage (the default)
carrier = {}
inject(carrier)  # contains both traceparent and baggage

# Receiving service
ctx = extract(incoming_headers)
tenant = get_baggage("tenant.id", context=ctx)
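Baggage is not part of the exported span data, so the Collector never sees it. If you want selected baggage keys queryable without manual work at every call site, copy them onto spans inside the SDK, for example with a span processor that reads baggage when a span starts.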
3.9 Propagation validation checklist (run this in every new service)
- Install the instrumentation packages for every HTTP client, DB driver, and messaging library you actually use.
- Verify with OTEL_TRACES_EXPORTER=console that a single incoming request produces a connected tree (one root, children have the right parents).
- Force a cross-thread or cross-process operation and confirm the child still has the correct parent.
- Send a request to a downstream service you control; confirm the downstream SERVER span has your CLIENT span as its parent.
- Check that baggage you set at the edge is readable in the leaf service.
- Add the service to your golden dashboard and confirm exemplar links actually land on the correct trace.
Do this once per language stack. After that, the only things that usually break propagation are new thread pools or new "clever" background job libraries.
4.0 Sampling
Sampling exists because traces are expensive to store and most are duplicates of each other. The question is where you sample and by what rule.
4.1 Head sampling
The decision is made when the root span is created, before you know anything about the trace's outcome. It is fast, cheap, and consistent across the trace (because the flag propagates).
Common head samplers:
- AlwaysOn / AlwaysOff — for development or fully relying on tail sampling downstream.
- TraceIdRatioBased(p) — sample with probability p. The decision is computed from the trace ID itself, so it's deterministic per trace.
- ParentBased — defer to the parent's sampled flag if there is one; fall back to a configured root sampler if not. This is the OTel default and almost always what you want.
The composition that actually ships in most services is ParentBased(root=TraceIdRatioBased(0.1)): trust upstream's decision; if you're the edge, sample 10%.
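The logic of that composition fits in a few lines. The sketch below is an illustration of the idea only: it derives the ratio decision from the low 64 bits of the trace ID, which is an assumption made for this example, and real SDK samplers may use a different portion of the ID or a different comparison. The function names are hypothetical.

```python
from typing import Optional


def ratio_sampled(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic per-trace decision derived from the trace ID itself."""
    low_64 = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_64 < round(ratio * (1 << 64))


def parent_based(trace_id_hex: str, parent_sampled: Optional[bool], root_ratio: float) -> bool:
    """Follow the parent's sampled flag if there is one; otherwise apply the root ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id_hex, root_ratio)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
edge_decision = parent_based(trace_id, parent_sampled=None, root_ratio=0.1)
downstream_decision = parent_based(trace_id, parent_sampled=edge_decision, root_ratio=0.1)
```

Because the ratio decision depends only on the trace ID, every service that evaluates it for the same trace reaches the same answer, and downstream services simply inherit the edge's decision through the flag.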
4.2 Tail sampling
The decision is made after the trace completes (or after a timeout), in the Collector. You see the full trace before deciding to keep it. This is far more expensive: the Collector must buffer all spans for the wait period, and when several instances run, every span of a given trace must reach the same instance (typically via consistent hashing on trace ID in a load-balancing tier in front of the tail-sampling processor).
Why pay the cost: tail sampling lets you keep interesting traces — errors, latency outliers, traces touching specific endpoints — at near-100%, while dropping the bulk. A typical policy mix:
yaml (otelcol or alloy)
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample-rest
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
For tail sampling to work across multiple Collector instances, all spans for a given trace must land on the same instance. In Alloy you arrange this with a loadbalancing exporter that routes by trace ID, fronting a downstream pool that runs tail sampling.
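What the three policies in the mix above amount to can be sketched in plain Python. This is a simplification for intuition, not the processor's actual evaluation logic; the span dicts, field names, and the `keep_trace` function are assumptions made for this example.

```python
import random


def keep_trace(spans, threshold_ms=1000, sampling_percentage=5, rng=random.random):
    """Return the name of the first matching policy, or None to drop the trace."""
    if any(span["status"] == "ERROR" for span in spans):
        return "errors"
    start = min(span["start_ms"] for span in spans)
    end = max(span["end_ms"] for span in spans)
    if end - start > threshold_ms:
        return "slow"
    if rng() * 100 < sampling_percentage:
        return "sample-rest"
    return None


failed = [{"status": "ERROR", "start_ms": 0, "end_ms": 20}]
slow = [{"status": "OK", "start_ms": 0, "end_ms": 1500}]
fast = [{"status": "UNSET", "start_ms": 0, "end_ms": 30}]

decisions = [keep_trace(failed), keep_trace(slow)]
```

The decision needs every span of the trace in one place, which is why the buffering window and the same-instance routing requirement exist.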
4.3 What to choose in practice
For most teams running Tempo on a budget:
- Set client-side sampling to AlwaysOn.
- Do tail sampling at the Collector tier with policies that keep all errors, all slow traces, and a few percent of the rest.
- Use TraceQL metrics (or the metrics-generator) to derive RED metrics from unsampled traffic, so dropping spans doesn't hide error rates.
This composition gets you both: full visibility on the bad stuff, manageable storage cost on the good.
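Span-derived metrics must be generated before tail sampling in the pipeline; that's the point. The generator sees every span and turns them into Prometheus-style RED metrics that go to Mimir, then sampling decides which spans actually reach Tempo storage. You get full-fidelity metrics even though only sampled traces are stored. Note that Tempo's own metrics-generator only sees spans that reach Tempo, so with Collector-side tail sampling the metrics have to be produced in the Collector tier (for example with a span-metrics connector).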
5.0 The SDK pipeline
Inside any OTel SDK, the journey of a span from creation to export goes through the same few pieces. Knowing them makes config and debugging tractable.
5.1 TracerProvider
The single most important object. It owns the resource, the sampler, and the list of span processors. There is normally one per process. The usual pattern is to register it as the global provider and then ask the global API for tracers.
5.2 SpanProcessor: Simple vs Batch
When a span ends, it is handed to all configured processors. Two implementations matter:
- SimpleSpanProcessor exports each span immediately, synchronously. Useful in tests and in short-lived serverless functions where the process might shut down before a batch flushes. Adds latency to every span end. Do not use it in long-running production services.
- BatchSpanProcessor buffers spans and exports in batches, on a timer or when the queue hits a threshold. This is the production default. Its tuning knobs (in env-var form) are:
  - OTEL_BSP_MAX_QUEUE_SIZE: drop threshold (default 2048).
  - OTEL_BSP_MAX_EXPORT_BATCH_SIZE: batch size (default 512).
  - OTEL_BSP_SCHEDULE_DELAY: flush interval in ms (default 5000).
  - OTEL_BSP_EXPORT_TIMEOUT: per-export timeout in ms (default 30000).
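How the queue size and batch size interact is easiest to see in a toy model. `ToyBatchProcessor` below is a hypothetical, single-threaded sketch written for this example; the real processor runs a background thread with a timer and timeouts, which are left out here.

```python
from collections import deque


class ToyBatchProcessor:
    """Toy model of a bounded queue with batched export; not the SDK class."""

    def __init__(self, export, max_queue_size=2048, max_export_batch_size=512):
        self.export = export
        self.max_queue_size = max_queue_size
        self.max_export_batch_size = max_export_batch_size
        self.queue = deque()
        self.dropped = 0

    def on_end(self, span):
        if len(self.queue) >= self.max_queue_size:
            self.dropped += 1  # queue full: the span is lost
            return
        self.queue.append(span)

    def flush(self):
        """Stand-in for the timer tick: drain the queue in batches."""
        while self.queue:
            size = min(self.max_export_batch_size, len(self.queue))
            self.export([self.queue.popleft() for _ in range(size)])


batches = []
processor = ToyBatchProcessor(batches.append, max_queue_size=5, max_export_batch_size=2)
for i in range(7):
    processor.on_end(f"span-{i}")
processor.flush()
```

Seven spans arrive between flushes but the queue holds only five, so two are dropped; the rest leave in batches of at most two. A slow exporter has the same effect as a small queue, which is why queue depth is worth monitoring.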
5.3 Exporter
For Tempo, you want OTLP. The two transports:
| Transport | Default port | Notes |
|---|---|---|
| OTLP/gRPC | 4317 | Lower overhead, persistent connection, harder through some proxies. |
| OTLP/HTTP (protobuf) | 4318 | One request per batch, easier to debug, works through any HTTP-aware infra. |
| OTLP/HTTP (JSON) | 4318 | Same path, content-type application/json. Slow; use only for debug. |
In a service mesh with mTLS and good throughput, gRPC is usually the right choice. Through Istio with east-west gateways across clusters (your setup), OTLP/HTTP can be operationally simpler.
5.4 Shutdown semantics
The SDK must be flushed on process shutdown. Both Python and Node.js will lose any buffered spans on hard exit unless you call shutdown() on the provider (which drains processors and exporters). FastAPI and Express both support lifecycle hooks; wire shutdown there. Don't trust process signals alone: Kubernetes' default terminationGracePeriodSeconds (30s) is plenty, but only if you actually flush.
6.0 OTLP: the protocol
OTLP (OpenTelemetry Protocol) is the wire format that everything OTel speaks. There is exactly one schema, defined in Protocol Buffers, with three transports binding to it (gRPC, HTTP/protobuf, HTTP/JSON).
The relevant request type for tracing is ExportTraceServiceRequest. Its shape:
protobuf (conceptual)
ExportTraceServiceRequest {
  ResourceSpans[] resource_spans
}
ResourceSpans {
  Resource resource          // service.name, host, k8s.*, ...
  ScopeSpans[] scope_spans   // grouped by instrumentation library
}
ScopeSpans {
  InstrumentationScope scope // e.g., "opentelemetry-instrumentation-fastapi" 0.46b0
  Span[] spans
}
The two-level grouping (Resource → Scope → Spans) means a single OTLP request can carry spans from many libraries within one service. The Collector preserves this structure as it processes.
6.1 Endpoint configuration
Five environment variables are the canonical configuration surface and they work identically across languages:
| Variable | Default | Notes |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 (gRPC) | Applies to all signals. |
| OTEL_EXPORTER_OTLP_TRACES_ENDPOINT | — | Traces-only override. |
| OTEL_EXPORTER_OTLP_PROTOCOL | grpc | grpc · http/protobuf · http/json |
| OTEL_EXPORTER_OTLP_HEADERS | — | For auth: api-key=abc,x-tenant=foo |
| OTEL_EXPORTER_OTLP_COMPRESSION | none | gzip recommended for HTTP. |
For HTTP, the OTel spec says the SDK should not append the standard path (/v1/traces) if you set OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, but should append it if you set the generic OTEL_EXPORTER_OTLP_ENDPOINT. Different SDKs have historically gotten this slightly wrong. If you get 404s, set the per-signal endpoint with the full path.
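The path rule described above can be written down as a small resolver, which is useful when checking what URL an exporter should end up calling. This is a sketch of the rule as stated here, not any SDK's code; the function name `traces_url` and the `http://localhost:4318` fallback for OTLP/HTTP are assumptions for the example.

```python
def traces_url(env: dict) -> str:
    """Resolve the OTLP/HTTP traces URL from environment-style settings."""
    per_signal = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
    if per_signal:
        return per_signal  # used as-is, no path appended
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")
    return base.rstrip("/") + "/v1/traces"


generic = traces_url({"OTEL_EXPORTER_OTLP_ENDPOINT": "http://alloy.observability:4318/"})
explicit = traces_url({
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://alloy.observability:4318",
    "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": "http://tempo.internal:4318/custom/traces",
})
```

If an SDK's behaviour differs from this resolver, setting the per-signal variable to the full URL sidesteps the ambiguity.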
7.0 Resource and semantic conventions
The Resource is the set of attributes that describe the entity producing telemetry — the service, the host, the container, the cloud provider. It is attached at the TracerProvider level and stamped onto every span on export. You do not set it per-span.
The minimum useful resource is just service.name. Without it, the SDK falls back to a placeholder such as "unknown_service" and your dashboards will be useless. Set it from OTEL_SERVICE_NAME or via OTEL_RESOURCE_ATTRIBUTES:
env
OTEL_SERVICE_NAME=payments-api
OTEL_RESOURCE_ATTRIBUTES=service.namespace=checkout,service.version=2.14.3,deployment.environment=prod
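How those two variables combine can be sketched as follows. This is an illustration of the precedence (OTEL_SERVICE_NAME overrides a service.name given in the attribute list), not the SDK's resource detector; `resource_from_env` is a name invented for this example.

```python
from urllib.parse import unquote


def resource_from_env(env: dict) -> dict:
    """Build resource attributes from OTEL_* style settings."""
    attributes = {}
    raw = env.get("OTEL_RESOURCE_ATTRIBUTES", "")
    for item in raw.split(","):
        if "=" not in item:
            continue  # skip empty or malformed entries
        key, value = item.split("=", 1)
        attributes[key.strip()] = unquote(value.strip())
    # OTEL_SERVICE_NAME wins over service.name in the attribute list.
    if env.get("OTEL_SERVICE_NAME"):
        attributes["service.name"] = env["OTEL_SERVICE_NAME"]
    attributes.setdefault("service.name", "unknown_service")
    return attributes


resource = resource_from_env({
    "OTEL_SERVICE_NAME": "payments-api",
    "OTEL_RESOURCE_ATTRIBUTES": (
        "service.namespace=checkout,service.version=2.14.3,deployment.environment=prod"
    ),
})
```

Passing a plain dict instead of reading os.environ keeps the sketch easy to test; in a real process the SDK reads the environment for you.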
7.1 Resource detectors
SDKs ship with resource detectors that auto-populate common attributes from the environment:
- Process detector: process.pid, process.runtime.name, process.runtime.version.
- Host detector: host.name, host.arch.
- Container detector: container.id from cgroup parsing.
- Kubernetes detector: typically populated by the Downward API + collector enrichment, not in-SDK.
- Cloud detectors: AWS, GCP, Azure metadata services.
In Kubernetes, the cleanest pattern is to let the SDK detect the basics and let the k8sattributes processor in the Collector add the K8s-specific labels (pod, namespace, node, owner). This avoids putting cluster-API permissions in every application pod.
7.2 Semantic conventions
Semantic conventions (semconv) are the canonical attribute names. Stick to them rigorously — every dashboard, every TraceQL query, every Collector processor expects them. The main families:
| Family | Key attributes |
|---|---|
| HTTP | http.request.method, http.route, http.response.status_code, url.full, url.path, server.address |
| RPC | rpc.system, rpc.service, rpc.method, rpc.grpc.status_code |
| Database | db.system.name, db.namespace, db.query.text, db.operation.name, db.collection.name |
| Messaging | messaging.system, messaging.destination.name, messaging.operation.type, messaging.kafka.partition |
| Errors | error.type, plus the exception event with exception.type, exception.message |
Semconv recently went through a renaming round (e.g., http.method → http.request.method, db.statement → db.query.text). Newer auto-instrumentation releases emit the new names, in some cases only after opting in; older queries against historical data may need both. The OTEL_SEMCONV_STABILITY_OPT_IN environment variable can pin behavior during migration.
8.0 Performance and cost
Tracing is not free. The two costs are runtime overhead in the application and storage cost at the backend. Both are tractable if you know what to watch.
8.1 Runtime overhead
For typical web workloads with batch processors and reasonable sampling, you can expect 1–5% CPU overhead. The dominant cost is usually attribute serialization, not span creation. If you put a 10 KB JSON blob into db.query.text on every span, the SDK will pay for that on the export hot path.
Things that move the needle:
- Keep attribute count well below the default limit of 128 per span. Real spans rarely need more than 10–20.
- Avoid putting large strings (request bodies, full SQL with literals, stack traces in non-error paths) on spans.
- Use span.is_recording() as a guard around any expensive attribute computation. Non-recording spans drop sets silently anyway, but the computation still runs.
- BatchSpanProcessor queue depth is a stability lever. If your exporter or Collector slows down, the queue fills and spans get dropped. Monitor it.
8.2 Cardinality
This is the big one. Span attributes are not a Prometheus label set — they don't blow up time series. But they do affect:
- Tempo storage size, since every attribute is stored in the vParquet blocks.
- Service-graph and RED metrics generated from spans, which are Prometheus time series. The metrics-generator uses a configurable allowlist of attributes for dimensions; anything outside the list is dropped from metrics but preserved on the span. Watch this carefully.
- TraceQL query cost. Attributes with dedicated columns are fast; the rest are scanned.
The discipline is: high-cardinality attributes (user IDs, trace IDs of related work, request IDs) belong on spans. They are searchable, and you usually want them. They do not belong as dimensions on metrics.
8.3 PII
Auto-instrumentation libraries are generally conservative by default: they capture HTTP routes (templated), method, and status code, but not headers or bodies. Some instrumentations do record full URLs, query string included, so check what yours emits. Most libraries expose hooks to opt in to more. Before you opt anything in, audit: query strings frequently contain tokens, bodies contain everything. The Collector's attributes and redaction processors can scrub or hash sensitive fields if upstream sanity fails.
9.0 Trace correlation
A trace by itself is one of three signals. Its leverage comes from correlating it with the other two — logs and metrics — at query time. Tempo and the LGTM stack are built around this triangle.
9.1 Trace ↔ Logs
The mechanism is simple: every log line emitted during a span carries trace_id and span_id as fields. The OTel logging instrumentations for both Python and Node.js do this automatically. In Grafana, you configure the Loki datasource with a derived field that turns a trace_id in a log line into a link to the Tempo trace view.
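The mechanism can be shown with the standard logging module alone. In the sketch below, `active_ids` is a hypothetical stand-in for the span context that the OTel logging instrumentation reads; the filter class and log format are assumptions for illustration, not the instrumentation's own code.

```python
import io
import logging
from contextvars import ContextVar

# Hypothetical stand-in for the active span context kept by the SDK.
active_ids = ContextVar("active_ids", default=("", ""))


class TraceContextFilter(logging.Filter):
    """Copy the active trace_id and span_id onto every log record."""

    def filter(self, record):
        record.trace_id, record.span_id = active_ids.get()
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

active_ids.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
logger.info("charge accepted")
line = stream.getvalue().strip()
```

A `trace_id=...` token in a predictable position is all a derived field needs in order to turn the log line into a link.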
The reverse direction (trace → logs) is wired in the Tempo datasource: configure "Trace to logs" with the Loki datasource, and the trace view will offer a "View logs for this span" button that runs a Loki query such as {service_name="x"} filtered on the span's trace ID.
9.2 Trace ↔ Metrics: exemplars
An exemplar is a trace ID attached to a single Prometheus sample, saying "this measurement came from this trace". The metrics-generator in Tempo produces these automatically: every RED metric bucket can carry an exemplar pointing at a trace that contributed to it.
In Grafana, a histogram panel with exemplars enabled shows small markers on the chart. Click → jump straight to that trace in Tempo. This is the most direct path from "the latency is bad" to "here is a slow trace".
9.3 Trace ↔ Profiles
The newest leg. If you run Pyroscope or a similar continuous profiler, profiles can be tagged with the active trace ID, and Tempo exposes a "trace to profiles" link. Less broadly deployed but worth knowing about; it lets you go from a slow span directly to a flame graph of what the CPU was doing during it.
Python & FastAPI
Library names, real configuration, and the gotchas that ship with async Python.
P.1 Packages
The OTel Python ecosystem is split into three groups of packages, all on PyPI:
| Package | Purpose |
|---|---|
| opentelemetry-api | The API surface. Tracer interface, no implementation. Libraries depend only on this. |
| opentelemetry-sdk | The actual implementation. TracerProvider, samplers, processors. Your application depends on this. |
| opentelemetry-exporter-otlp | Metapackage that pulls in both gRPC and HTTP OTLP exporters. |
| opentelemetry-exporter-otlp-proto-grpc | gRPC OTLP exporter (preferred for high-throughput). |
| opentelemetry-exporter-otlp-proto-http | HTTP OTLP exporter. |
| opentelemetry-instrumentation | The auto-instrumentation runner (opentelemetry-instrument CLI) and the BaseInstrumentor abstraction. |
| opentelemetry-instrumentation-fastapi | FastAPI instrumentation. |
| opentelemetry-instrumentation-{httpx,requests,...} | One per supported library. |
| opentelemetry-distro | The "batteries-included" default distro: configures BatchSpanProcessor + OTLP exporter from env vars. |
The pragmatic install for a FastAPI service that calls Postgres via SQLAlchemy and downstream services via httpx:
requirements.txt
opentelemetry-distro
opentelemetry-exporter-otlp
opentelemetry-instrumentation-fastapi
opentelemetry-instrumentation-httpx
opentelemetry-instrumentation-sqlalchemy
opentelemetry-instrumentation-asyncpg
opentelemetry-instrumentation-logging
The instrumentation libraries' versions track the SDK version closely but carry a beta suffix such as b0 or b1 (for example 0.46b0), since they are released as pre-1.0. Pin them all to a known good combination; drift between instrumentation and SDK versions causes obscure failures.
P.2Bootstrap (manual)
The fully manual setup is verbose but every line is something you might want to tweak in production. Put this in a module that runs at the top of main.py before importing FastAPI:
python# telemetry.py
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBased, ALWAYS_ON
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.propagate import set_global_textmap
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
def setup_tracing(service_name: str, service_version: str, environment: str) -> None:
    resource = Resource.create({
        "service.name": service_name,
        "service.version": service_version,
        "deployment.environment": environment,
    })
    provider = TracerProvider(
        resource=resource,
        sampler=ParentBased(root=ALWAYS_ON),  # tail-sample at Collector
    )
    exporter = OTLPSpanExporter(
        endpoint="http://alloy.observability:4317",
        insecure=True,  # mesh-internal; mTLS terminated by Istio
    )
    provider.add_span_processor(BatchSpanProcessor(
        exporter,
        max_queue_size=4096,
        max_export_batch_size=512,
        schedule_delay_millis=5000,
    ))
    trace.set_tracer_provider(provider)
    set_global_textmap(TraceContextTextMapPropagator())
Note three intentional choices: ALWAYS_ON root sampler (deferring decisions to Collector tail sampling), insecure=True for an in-mesh endpoint (Istio handles encryption), and a larger queue than default (4096 vs 2048) for spike absorption.
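To make the queue numbers concrete, here is a toy model of a bounded span queue drained in fixed-size batches. It is a simplified sketch only: `ToyBatchQueue` and `simulate_spike` are names invented for this illustration, not SDK classes, and the real processor also runs a background export thread, a timer, and its own policy for which span is discarded when the queue is full.

```python
from collections import deque


class ToyBatchQueue:
    """Simplified stand-in for a batching span processor (not the SDK class)."""

    def __init__(self, max_queue_size: int, max_export_batch_size: int) -> None:
        self.max_queue_size = max_queue_size
        self.max_export_batch_size = max_export_batch_size
        self.queue = deque()
        self.dropped = 0

    def on_end(self, span_name: str) -> None:
        # A full queue loses a span instead of blocking the request path.
        # This toy only counts the loss; it does not model which span is lost.
        if len(self.queue) >= self.max_queue_size:
            self.dropped += 1
            return
        self.queue.append(span_name)

    def export_once(self) -> list:
        # Each export cycle ships at most max_export_batch_size spans.
        size = min(self.max_export_batch_size, len(self.queue))
        return [self.queue.popleft() for _ in range(size)]


def simulate_spike(queue_size: int, spike: int) -> int:
    """Return how many spans are lost if `spike` spans end before any export runs."""
    q = ToyBatchQueue(queue_size, 512)
    for i in range(spike):
        q.on_end(f"span-{i}")
    return q.dropped


if __name__ == "__main__":
    print(simulate_spike(2048, 3000))  # default-sized queue
    print(simulate_spike(4096, 3000))  # the larger queue from the bootstrap
```

In this model a burst of 3000 spans between export cycles overflows the default-sized queue but fits in the larger one, which is the spike-absorption argument in miniature.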
P.3Auto-instrumentation
The opentelemetry-instrument CLI wraps your process and patches in instrumentation libraries before your code imports the underlying libraries. The simplest possible deployment:
shellOTEL_SERVICE_NAME=payments-api \
OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy.observability:4317 \
OTEL_TRACES_SAMPLER=always_on \
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=prod,service.version=2.14.3 \
opentelemetry-instrument uvicorn main:app --host 0.0.0.0 --port 8000
This will autoload instrumentation for every library it can detect (FastAPI, httpx, SQLAlchemy, asyncpg, Redis, Celery, etc.) — installed packages are scanned, matched against installed instrumentation packages, and wired up. No code change required.
The CLI obeys the standard env vars and also accepts:
- OTEL_PYTHON_DISABLED_INSTRUMENTATIONS: comma list, e.g., urllib,urllib3 to skip noisy ones.
- OTEL_PYTHON_FASTAPI_EXCLUDED_URLS: regex for paths to skip (health checks).
- OTEL_PROPAGATORS: defaults to tracecontext,baggage; add b3 if you bridge to legacy.
P.3.1Hybrid: auto + manual
The common production pattern is: let auto-instrumentation handle framework and library spans, then add manual spans inside business logic for the operations you actually care about. The two coexist — they share the same TracerProvider — and the manual spans nest cleanly inside the auto-instrumented HTTP span.
pythonfrom opentelemetry import trace
tracer = trace.get_tracer(__name__)
async def settle_payment(order_id: str, amount: int) -> PaymentResult:
    with tracer.start_as_current_span(
        "settle_payment",
        attributes={"order.id": order_id, "payment.amount_minor": amount},
    ) as span:
        result = await psp_client.charge(order_id, amount)
        span.set_attribute("psp.reference", result.reference)
        if result.declined:
            span.set_status(trace.Status(trace.StatusCode.ERROR, result.reason))
        return result
P.4FastAPI specifics
Auto-instrumentation handles FastAPI well, but a few things are worth knowing.
P.4.1Manual FastAPI instrumentation
If you want explicit control (a common choice — auto magic can be hard to debug), instrument FastAPI yourself:
pythonfrom opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from fastapi import FastAPI
app = FastAPI()
def server_request_hook(span, scope):
    # Add custom attributes from the ASGI scope at request start.
    if span and span.is_recording():
        # ASGI headers are a list of (name, value) byte pairs with lowercased names.
        for name, value in scope.get("headers", []):
            if name == b"x-tenant-id":
                span.set_attribute("tenant.id", value.decode())
FastAPIInstrumentor.instrument_app(
    app,
    excluded_urls="/health,/metrics,/ready",
    server_request_hook=server_request_hook,
)
The excluded_urls argument takes a comma-separated regex list and is the correct place to drop health checks — they create high-volume, low-value spans.
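Because each entry is a regex rather than a literal path, short patterns can match more than intended. The sketch below assumes the entries are joined into one alternation and applied with a regex search, which is my understanding of the instrumentation's behavior but should be verified against your installed version; `build_exclude_matcher` is a name invented for this example.

```python
import re


def build_exclude_matcher(excluded_urls: str):
    """Compile a comma-separated list of regexes into one search-based matcher."""
    patterns = [p.strip() for p in excluded_urls.split(",") if p.strip()]
    combined = re.compile("|".join(patterns)) if patterns else None

    def is_excluded(url: str) -> bool:
        # search, not fullmatch: a pattern may match anywhere in the URL.
        return bool(combined and combined.search(url))

    return is_excluded


is_excluded = build_exclude_matcher("/health,/metrics,/ready")

if __name__ == "__main__":
    for url in (
        "http://svc/health",
        "http://svc/orders/42",
        "http://svc/healthcheck-report",
    ):
        print(url, is_excluded(url))
```

Under that assumption, a path such as /healthcheck-report is excluded as well. Anchoring the pattern (for example `/health$`) narrows the match if that matters for your routes.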
P.4.2Route templates vs raw paths
FastAPI's instrumentation sets http.route to the route template (/orders/{order_id}), not the resolved path (/orders/A8742). This is what you want — it gives bounded cardinality on the span name and on derived RED metrics. If you see raw paths showing up, the instrumentation isn't matching the route correctly (often because of middleware ordering or a manually written ASGI handler).
P.4.3Dependencies and shared spans
FastAPI's dependency-injection system runs each dependency within the request span's context. You can grab the current span from anywhere in a dep and annotate it:
pythonfrom opentelemetry import trace
from fastapi import Depends, Request
def get_current_user(request: Request) -> User:
    user = resolve_user_from_jwt(request.headers["authorization"])
    trace.get_current_span().set_attribute("user.id", user.id)
    return user
@app.get("/orders")
async def list_orders(user: User = Depends(get_current_user)):
    ...
P.4.4BackgroundTasks
FastAPI's BackgroundTasks runs after the response is sent. Critically, it runs in the same task with the request's context still installed, so spans you create from a background task will be parented to the request span — even though the request has already returned to the client. This is usually the right behavior; just be aware that the request span's duration will include the background task time. If you don't want that, create a fresh root span and use links to the original.
P.5Async, contextvars, lifespan, and the real world
OTel Python uses contextvars.ContextVar for the active context. In modern asyncio this mostly "just works" for await, create_task, gather, and TaskGroup. The hard part is everything else that real services do: lifespan startup/shutdown, background tasks, dependencies, thread pools, process pools, and third-party libraries that spawn their own concurrency.
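The "just works" cases can be checked with nothing but the standard library. In this sketch a plain `ContextVar` named `current_span` stands in for the SDK's active-span slot (an assumption made for illustration; no OTel code is involved), and two concurrent "requests" show that each task sees only its own value.

```python
import asyncio
import contextvars

# Stand-in for the SDK's "current span" slot (hypothetical name).
current_span = contextvars.ContextVar("current_span", default="none")


async def child(label: str) -> str:
    await asyncio.sleep(0)
    return f"{label} sees {current_span.get()}"


async def handle_request(name: str) -> list:
    current_span.set(name)  # like entering the request's SERVER span
    awaited = await child("await")
    # create_task copies the current context at creation time.
    task = asyncio.create_task(child("create_task"))
    gathered = await asyncio.gather(child("gather"))
    return [awaited, await task, *gathered]


async def main() -> dict:
    # Each request runs in its own task, so the two values never mix.
    a, b = await asyncio.gather(handle_request("span-A"), handle_request("span-B"))
    return {"A": a, "B": b}


if __name__ == "__main__":
    for request, lines in asyncio.run(main()).items():
        print(request, lines)
```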
P.5.1The one lifespan pattern you should copy
This is the minimal correct production-grade setup for a FastAPI service. Put the tracing bootstrap in a module that is imported before any other application code that might create spans.
python# telemetry.py
import os
from contextlib import asynccontextmanager
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBased, ALWAYS_ON
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
# ... other instrumentors you actually use
def setup_tracing(service_name: str, version: str, env: str) -> None:
    resource = Resource.create({
        "service.name": service_name,
        "service.version": version,
        "deployment.environment": env,
    })
    provider = TracerProvider(
        resource=resource,
        sampler=ParentBased(root=ALWAYS_ON),  # tail sample later
    )
    exporter = OTLPSpanExporter(
        endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://alloy:4317"),
        insecure=True,
    )
    provider.add_span_processor(BatchSpanProcessor(
        exporter,
        max_queue_size=4096,
        max_export_batch_size=512,
        schedule_delay_millis=2000,
    ))
    trace.set_tracer_provider(provider)
@asynccontextmanager
async def lifespan(app: FastAPI):
    # 1. Setup runs once per worker process at startup (after any gunicorn fork)
    setup_tracing(
        service_name=os.getenv("OTEL_SERVICE_NAME", "unknown"),
        version=os.getenv("SERVICE_VERSION", "0.0.0"),
        env=os.getenv("DEPLOYMENT_ENV", "dev"),
    )
    # 2. Instrument frameworks AFTER the provider exists.
    #    If instrument_app raises because the app has already started, move that
    #    call to module level, right after `app = FastAPI(...)`.
    FastAPIInstrumentor.instrument_app(app, excluded_urls=r".*/(health|ready|metrics).*")
    HTTPXClientInstrumentor().instrument()
    # Async engines are instrumented through their sync_engine (see P.7)
    SQLAlchemyInstrumentor().instrument(engine=your_async_engine.sync_engine)
    yield
    # 3. CRITICAL: flush before the process exits
    trace.get_tracer_provider().shutdown()
app = FastAPI(lifespan=lifespan)
Never call shutdown() inside a request handler. It must be in the lifespan so it runs exactly once when the worker is terminating.
P.5.2BackgroundTasks and exception handling
FastAPI BackgroundTasks run after the response is sent, so the response status (usually OK) has already been recorded on the request span by the time the background work starts. If the background work fails, that error is invisible on the request span unless you do extra work.
pythonfrom fastapi import BackgroundTasks
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
tracer = trace.get_tracer(__name__)
@app.post("/start-long-job")
async def start_job(bg: BackgroundTasks):
    span = trace.get_current_span()  # the request span
    def _job():
        # Parent the job explicitly to the request span instead of relying on
        # whatever context the task runner happens to install.
        ctx = trace.set_span_in_context(span)
        with tracer.start_as_current_span("background-job", context=ctx) as job_span:
            try:
                do_expensive_thing()
            except Exception as exc:
                job_span.record_exception(exc)
                job_span.set_status(Status(StatusCode.ERROR))
                # Optional: also mark the original request span as having had a background failure
                span.set_attribute("background_job.failed", True)
    bg.add_task(_job)
    return {"accepted": True}
If you use asyncio.create_task or TaskGroup inside the request, the context is usually propagated automatically — but be explicit with context=trace.set_span_in_context(current_span) when in doubt.
P.5.3Dependencies (Depends) and shared spans
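FastAPI dependencies run inside the request's context: async dependencies run in the same task as the route handler, and sync (def) dependencies run in a threadpool with that context copied in. Either way, the active span is the one created by the FastAPI instrumentation for the incoming request. If a dependency does heavy work you want to see as a child span, start one manually: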
pythonfrom fastapi import Depends
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
def get_db_session():
    # Becomes a child of the request span, which is current while dependencies run.
    with tracer.start_as_current_span("db.get_session"):
        # expensive pool checkout, auth check, etc.
        session = ...
        return session
@app.get("/user/{id}")
async def get_user(id: int, db=Depends(get_db_session)):
    with tracer.start_as_current_span("user.fetch"):
        ...
Do not create a new root span in a dependency unless you have a very good reason — it will break the tree.
P.5.4SQLAlchemy 2.0 async, asyncpg, and connection pool noise
With opentelemetry-instrumentation-sqlalchemy + async engine you get a span for every execute / stream. At high volume this produces enormous cardinality on db.statement unless you normalize.
Recommended approach:
- Keep the instrumentation (you want to see N+1 queries).
- In the Collector, use the attributes processor or a transform to replace the full SQL with a digest or the first 200 chars + "..." for normal queries.
- For the 5-10 "golden" queries that matter, add a span attribute db.query.name in your repository layer and filter on that in TraceQL.
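The normalization itself is simple enough to sketch. The text recommends doing it in the Collector; the Python below only illustrates the transformation (strip literals, collapse whitespace, hash the resulting shape, truncate for display). The function name and the `db.statement.digest` attribute key are assumptions made for this example, and the regexes are deliberately naive rather than a full SQL tokenizer.

```python
import hashlib
import re

_WHITESPACE = re.compile(r"\s+")
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
# Bare numbers only: skips digits inside identifiers (col_1) and placeholders ($1).
_NUMBER_LITERAL = re.compile(r"(?<![\w$])\d+(?:\.\d+)?\b")


def normalize_statement(sql: str, max_len: int = 200) -> dict:
    """Return a truncated statement plus a stable digest of its literal-free shape."""
    shape = _STRING_LITERAL.sub("?", sql)
    shape = _NUMBER_LITERAL.sub("?", shape)
    shape = _WHITESPACE.sub(" ", shape).strip()
    digest = hashlib.sha256(shape.encode("utf-8")).hexdigest()[:16]
    truncated = shape if len(shape) <= max_len else shape[:max_len] + "..."
    # "db.statement.digest" is a hypothetical attribute name for this sketch.
    return {"db.statement": truncated, "db.statement.digest": digest}


if __name__ == "__main__":
    print(normalize_statement("SELECT * FROM users WHERE id = 42 AND name = 'bob'"))
```

Two statements that differ only in their literal values end up with the same digest, which is what keeps cardinality bounded.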
asyncpg instrumentation is separate (opentelemetry-instrumentation-asyncpg). It gives you lower-level TCP connect / query spans. Many teams run only the SQLAlchemy layer and accept that raw asyncpg calls inside SQLAlchemy are not double-instrumented.
P.5.5Other common async gotchas
- arq / rq / dramatiq / huey workers: treat them like Celery — propagate the carrier in the job payload and extract at the start of the worker function.
- redis-py async: the official instrumentation exists but is young. For critical paths, wrap the calls in manual spans and inject the current context into any Lua scripts or pipeline metadata if you need cross-service correlation through Redis.
- greenlet / gevent / eventlet: These replace the Python scheduler. contextvars support is partial or requires monkey-patching. Test thoroughly; many teams simply avoid them when they need reliable tracing.
- uvicorn --reload in development: The reloader process itself does not run your app code. Spans only come from the child. This is expected.
If you ever manually create a task, submit work to an executor, or cross a process boundary, capture the current span/context and re-attach it explicitly on the other side. The automatic propagation only covers the paths the stdlib asyncio and contextvars team knew about. Everything else is on you.
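The executor case is the one that most often bites. The standard-library sketch below again uses a plain `ContextVar` named `current_span` as a stand-in for the SDK's active-span slot (an assumption for illustration) and compares three ways of handing blocking work to a thread.

```python
import asyncio
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the SDK's active-span slot (hypothetical name).
current_span = contextvars.ContextVar("current_span", default="none")


def blocking_work() -> str:
    return current_span.get()


async def demo() -> dict:
    current_span.set("request-span")
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Plain submission: on standard CPython builds the worker thread
        # does not inherit the caller's context, so the default is returned.
        bare = await loop.run_in_executor(pool, blocking_work)
        # Capture the context here, then run the function inside it on the worker.
        ctx = contextvars.copy_context()
        wrapped = await loop.run_in_executor(pool, ctx.run, blocking_work)
    # asyncio.to_thread copies the current context for you.
    to_thread = await asyncio.to_thread(blocking_work)
    return {"bare": bare, "copy_context": wrapped, "to_thread": to_thread}


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The same capture-then-run shape applies to process pools and queues, except that a `Context` object cannot cross a process boundary, so there you serialize the trace context into a carrier instead.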
P.6Outbound HTTP propagation
The instrumentation libraries do two things on outgoing requests: create a CLIENT span and inject traceparent into the outgoing headers. You don't have to do anything beyond installing them.
| Client library | Instrumentation package |
|---|---|
| httpx (sync + async) | opentelemetry-instrumentation-httpx |
| requests | opentelemetry-instrumentation-requests |
| aiohttp client | opentelemetry-instrumentation-aiohttp-client |
| urllib3 | opentelemetry-instrumentation-urllib3 |
| grpcio (client) | opentelemetry-instrumentation-grpc |
If your service calls a downstream that doesn't speak W3C Trace Context (a legacy Zipkin-B3 service, say), you can change the propagator globally:
envOTEL_PROPAGATORS=tracecontext,baggage,b3multi
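The SDK runs all configured propagators on every request: inject all, extract whichever is present. The cost is small. The b3 and b3multi propagators ship in the separate opentelemetry-propagator-b3 package, so install it alongside the SDK.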
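What "inject all, extract whichever is present" means on the wire can be sketched without the SDK. The functions below are illustrative stand-ins and not the OTel propagator API: they write one context as both a W3C `traceparent` header and B3 multi-headers, then read back whichever format is found. Real propagators also handle `tracestate`, reject all-zero IDs, and accept 64-bit B3 trace IDs, none of which is modeled here.

```python
import re
from typing import Optional, Tuple

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(trace_id: int, span_id: int, sampled: bool, carrier: dict) -> None:
    """Write the same context in both W3C and B3 multi-header form."""
    flags = "01" if sampled else "00"
    carrier["traceparent"] = f"00-{trace_id:032x}-{span_id:016x}-{flags}"
    carrier["x-b3-traceid"] = f"{trace_id:032x}"
    carrier["x-b3-spanid"] = f"{span_id:016x}"
    carrier["x-b3-sampled"] = "1" if sampled else "0"


def extract(carrier: dict) -> Optional[Tuple[int, int, bool]]:
    """Return (trace_id, span_id, sampled) from whichever format is present."""
    match = TRACEPARENT_RE.match(carrier.get("traceparent", ""))
    if match:
        trace_hex, span_hex, flags = match.groups()
        return int(trace_hex, 16), int(span_hex, 16), bool(int(flags, 16) & 0x01)
    if "x-b3-traceid" in carrier and "x-b3-spanid" in carrier:
        return (
            int(carrier["x-b3-traceid"], 16),
            int(carrier["x-b3-spanid"], 16),
            carrier.get("x-b3-sampled") == "1",
        )
    return None


if __name__ == "__main__":
    headers = {}
    inject(0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7, True, headers)
    print(headers)
    print(extract(headers))
```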
P.7Databases
Available instrumentations cover most of the mainstream:
| Library | Notes |
|---|---|
| sqlalchemy | Wraps the engine. Captures db.statement with parameter placeholders (not values). Works for both sync and async engines. |
| asyncpg | Direct Postgres async driver. |
| psycopg (v3) | Captures connection & query timing. |
| pymongo | Captures collection, operation, query shape. |
| redis / aioredis | Per-command spans. Can be noisy; consider sampling. |
| aiokafka | Producer + consumer instrumentation with link-based correlation. |
SQLAlchemy is the trickiest because the engine is usually created at import time. Instrument it explicitly:
pythonfrom opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from sqlalchemy.ext.asyncio import create_async_engine
engine = create_async_engine(DB_URL)
SQLAlchemyInstrumentor().instrument(engine=engine.sync_engine)
Auto-instrumentation captures the parameterized SQL (SELECT * FROM users WHERE id = $1) — not the parameter values. This is safe by default. If you have raw string-formatted SQL in your code, the literals will end up on spans. Audit.
P.8Messaging: Celery, Kafka, RabbitMQ
Messaging is where the abstraction matters most because the parent-child relationship is replaced by a producer-link-consumer one.
P.8.1Celery
opentelemetry-instrumentation-celery covers both the dispatch side (creates a PRODUCER span at .delay()/.apply_async()) and the worker side (creates a CONSUMER span when the task runs). It propagates trace context through the task headers automatically. The consumer span links back to the producer.
P.8.2Kafka (aiokafka / confluent-kafka)
Trace context is propagated via Kafka record headers (traceparent as a header value). On the producer side a PRODUCER span is created and the headers are injected. On the consumer side the instrumentation extracts the context, creates a CONSUMER span as a new root with a link to the producer.
Why a link and not a parent? Because the consumer often processes messages in batches, and a single processing operation might consume messages from many different producers — there's no single parent to choose. Links capture all of them without forcing a hierarchy.
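As a data-structure view of that argument, here is a minimal sketch with invented dataclasses (they are not the OTel API types): a span has at most one parent but any number of links, so a batch consumer can reference every producer without picking one.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str


@dataclass
class Span:
    name: str
    context: SpanContext
    parent: Optional[SpanContext] = None          # at most one
    links: List[SpanContext] = field(default_factory=list)  # any number


def start_batch_consumer_span(messages: List[dict]) -> Span:
    """Create one root CONSUMER span that links to every producer in the batch."""
    producers = [SpanContext(m["trace_id"], m["span_id"]) for m in messages]
    return Span(
        name="process_batch",
        context=SpanContext(trace_id="new-trace", span_id="consumer-1"),
        parent=None,  # no single producer can claim to be the parent
        links=producers,
    )


if __name__ == "__main__":
    batch = [
        {"trace_id": "trace-a", "span_id": "p1"},
        {"trace_id": "trace-b", "span_id": "p2"},
        {"trace_id": "trace-c", "span_id": "p3"},
    ]
    print(start_batch_consumer_span(batch))
```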
P.8.3The custom worker pattern
If you write custom worker code (a script that pulls work from a queue), you need to extract context manually:
pythonfrom opentelemetry.propagate import extract
from opentelemetry import context, trace
tracer = trace.get_tracer(__name__)
def handle_message(message):
    ctx = extract(message.headers)  # headers as dict-like
    token = context.attach(ctx)
    try:
        with tracer.start_as_current_span(
            "process_message",
            kind=trace.SpanKind.CONSUMER,
        ) as span:
            span.set_attribute("messaging.system", "rabbitmq")
            do_work(message)
    finally:
        context.detach(token)
P.9Logging correlation
The opentelemetry-instrumentation-logging package adds a log record factory that injects otelTraceID, otelSpanID, and otelServiceName into every log record. You then reference these in your formatter:
pythonfrom opentelemetry.instrumentation.logging import LoggingInstrumentor
LoggingInstrumentor().instrument(set_logging_format=False)
import logging
logging.basicConfig(
format='%(asctime)s %(levelname)s [trace=%(otelTraceID)s span=%(otelSpanID)s] %(name)s: %(message)s',
)
For JSON logging with structlog or python-json-logger, add a processor/formatter that pulls the active span and writes its IDs as JSON fields. This is what Loki then picks up and what makes trace→log navigation work in Grafana.
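For the plain `logging` module, one possible shape of such a formatter is sketched below. It assumes the `otelTraceID` and `otelSpanID` record attributes described above are present when the instrumentation is active; the class name, the output field names, and the "0" fallback are choices made for this example.

```python
import json
import logging


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object, including trace IDs when present."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Fall back to "0" so records emitted without instrumentation still format.
            "trace_id": getattr(record, "otelTraceID", "0"),
            "span_id": getattr(record, "otelSpanID", "0"),
        }
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonTraceFormatter())
    log = logging.getLogger("payments")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    # `extra` simulates what the logging instrumentation would attach.
    log.info("charged %s", "A8742", extra={"otelTraceID": "abc", "otelSpanID": "def"})
```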
In the Tempo datasource config in Grafana, enable "Trace to logs" with your Loki datasource, set filter by trace ID to true, and pick a tag like service.name to scope the query. The trace-view UI will then surface a "View logs" button per span that opens a filtered Loki query in a new pane.
P.10Common gotchas in Python
- Gunicorn pre-fork. If you initialize the TracerProvider in the parent process before fork, all workers share state in confusing ways. Initialize in a post-fork hook (post_fork in gunicorn.conf.py), or just use Uvicorn directly with its --workers flag, which starts each worker process before the app is imported.
- uvicorn's --reload. The reloader spawns child processes; instrumentation works fine in the child but you'll see no spans from the parent (correctly, since the parent does no real work). Don't be confused.
- Pydantic's model_validate from threads. If you're doing CPU-bound work in asyncio.to_thread, context propagation works: asyncio.to_thread (Python 3.9+) copies the current context into the worker thread. loop.run_in_executor does not, so there you must wrap the call with copy_context().run(...).
- Late SDK shutdown. Always wire provider.shutdown() into FastAPI's lifespan to flush the BatchSpanProcessor:
pythonfrom contextlib import asynccontextmanager
@asynccontextmanager
async def lifespan(app: FastAPI):
    setup_tracing(...)
    yield
    trace.get_tracer_provider().shutdown()
app = FastAPI(lifespan=lifespan)
- Exception in background task. If an unhandled exception escapes a FastAPI background task, the request span's status is already set to OK (the response succeeded). The background span will show the error, but a casual look at the request span won't. Consider recording the failure on the request span as an attribute from inside the background task, or run the background work under a separate root span linked to the request.
P.11Production patterns — the complete recipe
This is the configuration that experienced teams converge on after the first six months of real production pain. Treat it as a checklist, not dogma.
P.11.1Application-side configuration (the 80% that matters)
| Concern | Recommended value / pattern | Why |
|---|---|---|
| Sampler at SDK | ParentBased(root=ALWAYS_ON) | Never drop at the edge if you can afford the CPU. Tail sampling + metrics-generator gives you the best of both worlds. |
| BatchSpanProcessor | max_queue_size=4096, max_export_batch=512, delay=2s | Absorbs the 10x traffic spikes you see during deploys or marketing events without dropping spans in memory. |
| Exporter | OTLP/gRPC to http://localhost:4317 or the node-local Alloy DaemonSet | Lowest latency, connection reuse, mTLS handled by service mesh if present. |
| Resource attributes | Via OTEL_RESOURCE_ATTRIBUTES + Downward API + k8sattributes processor | Never bake node name, pod name, zone into the image. Let the infrastructure layer enrich. |
| Excluded URLs | regex for /health, /ready, /metrics, /favicon, static assets | These can be 30-70% of your traffic. They drown real traces and explode cardinality. |
| Shutdown | provider.shutdown() in lifespan + terminationGracePeriodSeconds: 60 + preStop hook (sleep 5-10s) | The single most common cause of "last 5 seconds of traces are missing". |
P.11.2Concrete Kubernetes / deployment knobs
yaml# example Deployment snippet
spec:
  template:
    spec:
      containers:
        - name: app
          env:
            - name: OTEL_SERVICE_NAME
              value: "payments-api"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://127.0.0.1:4317"  # talks to local Alloy via hostNetwork or DaemonSet
            - name: OTEL_TRACES_SAMPLER
              value: "always_on"
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "deployment.environment=prod,service.version=2.14.3"
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 8"]  # give the exporter a chance to flush
      terminationGracePeriodSeconds: 60
On the Alloy side, run a lightweight DaemonSet whose OTLP receiver listens on port 4317 and forwards with batching + retry, plus tail sampling if you do it centrally.
P.11.3Collector / Alloy pipeline recommendations (the other half of the story)
A minimal but production-hardened tail-sampling + metrics pipeline usually looks like:
- Receivers: otlp (both gRPC and HTTP)
- Processors (in order): memory_limiter, k8sattributes (if not done at edge), batch (big batches), attributes (redact PII, normalize SQL), tail_sampling (the policy that actually keeps the good stuff)
- Exporters: loadbalancing (for tail sampling across instances) → otlp to Tempo, plus the metrics-generator for RED + exemplars to Mimir.
Example tail_sampling policy that most teams start with:
yamlprocessors:
  tail_sampling:
    decision_wait: 15s   # how long to buffer a trace before deciding
    num_traces: 200000   # memory budget; watch this metric
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: ["ERROR"] }
      - name: slow-1s
        type: latency
        latency: { threshold_ms: 1000 }
      - name: slow-500ms-tenant-critical
        type: latency
        latency: { threshold_ms: 500 }
        # + attribute filter on tenant.id == "acme" if you have it in baggage/attributes
      - name: keep-5pct
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
Tune decision_wait against your p99 trace duration. If you have 30-second traces, 15s is too short.
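To reason about what such a policy set keeps, it can help to model the decision in a few lines. This is a simplified sketch, not the Collector's implementation: the `TraceSummary` type and `keep_trace` function are invented here, the tenant-specific policy is left out, and the modulo on the trace ID is a deterministic stand-in for the processor's real probabilistic sampling.

```python
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: int       # 128-bit trace ID as an integer
    duration_ms: float  # duration of the whole trace
    has_error: bool     # True if any span has status ERROR


def keep_trace(trace: TraceSummary, slow_ms: float = 1000, keep_pct: int = 5) -> str:
    """Return the name of the first matching policy, or "drop".

    A trace is kept if any policy matches, so the order only affects the label.
    """
    if trace.has_error:
        return "errors"
    if trace.duration_ms >= slow_ms:
        return "slow-1s"
    # Same trace ID, same decision: a stand-in for probabilistic sampling.
    if trace.trace_id % 100 < keep_pct:
        return "keep-5pct"
    return "drop"


if __name__ == "__main__":
    print(keep_trace(TraceSummary(trace_id=50, duration_ms=1500, has_error=False)))
    print(keep_trace(TraceSummary(trace_id=50, duration_ms=10, has_error=False)))
```

The model also shows why decision_wait matters: `duration_ms` and `has_error` are only known once the spans have arrived, so a trace that outlives the wait is judged on incomplete data.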
P.11.4Manual business spans — the real value
After you have HTTP + DB for free, the next 10x leap in usefulness comes from spans around the things your users and on-call engineers actually care about:
pythonwith tracer.start_as_current_span("payment.settle", attributes={
    "payment.id": payment.id,
    "payment.amount": str(amount),
    "merchant.id": merchant.id,
}) as span:
    result = await ledger.settle(...)
    if not result.success:
        span.set_attribute("payment.decline_reason", result.reason)
        span.set_status(Status(StatusCode.ERROR))
These are the spans you will be staring at in TraceQL when the CFO asks "why did the checkout conversion drop 4% last Tuesday between 14:00 and 16:00 for EU customers?"
P.11.5Logging correlation (the highest leverage 30 minutes you will ever spend)
Install opentelemetry-instrumentation-logging and enable its LoggingInstrumentor (see P.9). It adds otelTraceID, otelSpanID, and otelServiceName to every logging record.
If you would rather choose the field names yourself, add a filter in your logging config (or an equivalent structlog / loguru processor):
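pythonimport logging
from opentelemetry import trace
class TraceContextFilter(logging.Filter):
    def filter(self, record):
        # Defaults keep a formatter that references these fields from failing
        # on records emitted outside any span.
        record.trace_id = "0" * 32
        record.span_id = "0" * 16
        ctx = trace.get_current_span().get_span_context()
        # is_valid (rather than is_recording) also covers sampled-out spans.
        if ctx.is_valid:
            record.trace_id = f"{ctx.trace_id:032x}"
            record.span_id = f"{ctx.span_id:016x}"
        return True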
Now every log line that was emitted during a traced request carries the IDs. In Grafana you can click "View logs for this span" and land exactly on the right lines. This single integration removes 70% of "I have a trace but I don't know what the service was thinking" moments.
P.11.6Validation you should run in CI or on deploy
- Start the service with OTEL_TRACES_EXPORTER=console + a synthetic request that touches every integration point (HTTP out, DB, background job, messaging). Assert that the emitted JSON has exactly one root and that every expected child relationship exists.
- Run the same flow against a real local Tempo (or the dev collector) and assert that TraceQL { resource.service.name = "X" } | count() returns the expected number of spans for that test trace.
- Have a golden trace ID in your integration test that you can query after the fact.
If any of these checks fail, the deploy is red. Tracing is not "nice to have" — it is part of the contract the service has with the platform.
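The first check can be a small pure function. The sketch below assumes each exported span has already been parsed into a dict with `name`, `span_id`, and `parent_id` keys (`None` for the root); that shape and the function name are assumptions of this example, so adapt the field access to whatever your exporter actually emits.

```python
from typing import Iterable, List, Tuple


def validate_trace(spans: List[dict], expected_edges: Iterable[Tuple[str, str]]) -> List[str]:
    """Return a list of problems; an empty list means the trace looks as expected."""
    problems: List[str] = []
    roots = [s for s in spans if s["parent_id"] is None]
    if len(roots) != 1:
        problems.append(f"expected exactly one root span, found {len(roots)}")

    by_id = {s["span_id"]: s for s in spans}
    # (parent name, child name) pairs that actually exist in the trace.
    actual_edges = {
        (by_id[s["parent_id"]]["name"], s["name"])
        for s in spans
        if s["parent_id"] in by_id
    }
    for parent, child in expected_edges:
        if (parent, child) not in actual_edges:
            problems.append(f"missing relationship: {parent} -> {child}")
    return problems


if __name__ == "__main__":
    spans = [
        {"name": "GET /orders", "span_id": "a", "parent_id": None},
        {"name": "SELECT orders", "span_id": "b", "parent_id": "a"},
    ]
    print(validate_trace(spans, [("GET /orders", "SELECT orders")]))
```

In CI, fail the job when the returned list is non-empty and print the problems as the failure message.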
TypeScript & Node.js
Module loading order, AsyncLocalStorage, and why bundlers will ruin your day.
T.1Packages
The npm package layout mirrors Python's structure but with more granularity:
| Package | Purpose |
|---|---|
| @opentelemetry/api | API only. Libraries depend on this, peer-dep style. |
| @opentelemetry/sdk-trace-base | Core SDK: TracerProvider, processors, samplers. |
| @opentelemetry/sdk-trace-node | Node-specific SDK with AsyncLocalStorage context manager. |
| @opentelemetry/sdk-node | The all-in-one bootstrapper for Node services (recommended). |
| @opentelemetry/resources | Resource construction. |
| @opentelemetry/exporter-trace-otlp-grpc | gRPC OTLP exporter. |
| @opentelemetry/exporter-trace-otlp-http | HTTP OTLP exporter. |
| @opentelemetry/exporter-trace-otlp-proto | HTTP/protobuf OTLP exporter. |
| @opentelemetry/auto-instrumentations-node | The "everything" instrumentation pack. |
| @opentelemetry/instrumentation-{http,express,...} | Individual instrumentations. |
| @opentelemetry/semantic-conventions | Typed constants for semconv attribute names. |
Minimal production install:
json (package.json deps)"@opentelemetry/api": "^1.9.0",
"@opentelemetry/sdk-node": "^0.55.0",
"@opentelemetry/auto-instrumentations-node": "^0.52.0",
"@opentelemetry/exporter-trace-otlp-grpc": "^0.55.0",
"@opentelemetry/resources": "^1.28.0",
"@opentelemetry/semantic-conventions": "^1.28.0"
The version split between the API (1.x) and the SDK/exporters (0.x) is intentional: the API is stable and follows semver; the SDK is still pre-1.0 but in practice very stable. Pin both.
T.2NodeSDK bootstrap
The crucial fact about Node instrumentation: it must be loaded before any of the libraries it instruments are imported. The instrumentation works by monkey-patching modules at require/import time. If http is already loaded into the module cache when you set up OTel, none of its requests will be traced.
This makes Node OTel setup a two-file pattern:
typescript// instrumentation.ts: must be loaded first
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} from '@opentelemetry/semantic-conventions';
import { AlwaysOnSampler, ParentBasedSampler } from '@opentelemetry/sdk-trace-base';
const sdk = new NodeSDK({
  // With the 1.x resources package pinned above, use the Resource class;
  // the 2.x line replaces it with resourceFromAttributes().
  resource: new Resource({
    [ATTR_SERVICE_NAME]: process.env.OTEL_SERVICE_NAME ?? 'unknown',
    [ATTR_SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '0.0.0',
    'deployment.environment': process.env.ENV ?? 'dev',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://alloy.observability:4317',
  }),
  sampler: new ParentBasedSampler({ root: new AlwaysOnSampler() }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false }, // extremely noisy
      '@opentelemetry/instrumentation-http': {
        ignoreIncomingRequestHook: (req) => {
          // req.url can be undefined, so default it to keep the return type boolean.
          const url = req.url ?? '';
          return url.startsWith('/health') || url.startsWith('/metrics') || url.startsWith('/ready');
        },
      },
    }),
  ],
});
sdk.start();
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => process.exit(0))
    .catch(() => process.exit(1));
});
Then your main.ts or server.ts imports nothing OTel-related — the SDK is loaded via a CLI flag:
shell# For CommonJS
node --require ./dist/instrumentation.js dist/main.js
# For ESM
node --import ./dist/instrumentation.js dist/main.js
The --require/--import flag runs instrumentation.js before any other code, including your app's first import. This is what makes auto-instrumentation work.
If you use tsx or ts-node-dev in development, the require flag needs to point to the TS file directly: tsx --require ./src/instrumentation.ts src/main.ts. Watch for double-loading when both the CLI flag and a top-level import './instrumentation' exist.
T.3Auto-instrumentation
The auto-instrumentations-node package is a meta-package that includes every supported instrumentation. The default is "enable everything that can be enabled". To control:
typescriptgetNodeAutoInstrumentations({
  '@opentelemetry/instrumentation-fs': { enabled: false },
  '@opentelemetry/instrumentation-dns': { enabled: false },
  '@opentelemetry/instrumentation-net': { enabled: false },
  '@opentelemetry/instrumentation-http': {
    ignoreIncomingRequestHook: (req) => isHealthCheck(req.url),
    requestHook: (span, request) => {
      span.setAttribute('tenant.id', getTenant(request));
    },
  },
  '@opentelemetry/instrumentation-pg': {
    enhancedDatabaseReporting: true,
  },
});
Three families to disable by default: fs, dns, net. They are extremely chatty (every file read becomes a span, every DNS lookup becomes a span) and rarely interesting. The HTTP, gRPC, and DB instrumentations carry their weight.
T.4Express, Fastify, NestJS
The HTTP instrumentation is the foundation — it intercepts at the http module level, before Express or Fastify even sees the request. The framework-specific instrumentations layer on top to give you route attribution.
With auto-instrumentation enabled, you get:
- Express: a root SERVER span at the HTTP level, plus child spans for each middleware and route handler. The route template (/users/:id) shows up on the root span's http.route.
- Fastify: similar, with the route plugin's path on http.route.
- NestJS: works via the underlying Express or Fastify adapter. The controller class/method shows up via the dedicated NestJS instrumentation if installed.
For Fastify specifically, the instrumentation needs to load before any route registration. If you're using ESM and lazy-loading routes, double-check ordering.
T.4.1Adding attributes from inside a handler
typescriptimport { trace } from '@opentelemetry/api';
import { Request, Response } from 'express';
app.get('/orders/:id', async (req: Request, res: Response) => {
  const span = trace.getActiveSpan();
  span?.setAttribute('order.id', req.params.id);
  span?.setAttribute('user.id', req.user.id);
  const order = await loadOrder(req.params.id);
  res.json(order);
});
The trace.getActiveSpan() returns the span that the HTTP instrumentation created. Setting attributes on it puts them on the root SERVER span, which is what shows up in service-graph filtering and TraceQL searches.
T.5AsyncLocalStorage and context flow
Modern Node OTel uses AsyncLocalStorage from Node's async_hooks module. This API gives a key-value store that flows through async operations: await, Promise chains, setTimeout, setImmediate, process.nextTick, callbacks via EventEmitter. The NodeSDK installs an AsyncLocalStorageContextManager as the default.
What this gets you:
typescriptapp.get('/things', async (req, res) => {
  // Active span here is the SERVER span.
  await Promise.all([
    fetchA(), // nested span will parent to SERVER span
    fetchB(), // same
  ]);
  setTimeout(() => {
    // Even here, the SERVER span is still active.
    trace.getActiveSpan()?.setAttribute('cleanup.done', true);
  }, 100);
  res.json({});
});
Where AsyncLocalStorage does not propagate:
- Worker threads (worker_threads module). A worker is a fresh V8 isolate; context doesn't cross. If you dispatch work to a worker, you must carry the trace context yourself: inject it into a carrier to get traceparent, send that as part of the message, and extract it on the other side.
- Child processes (child_process.spawn). Same story.
- Some EventEmitter patterns with detached listeners. Rare in practice but if you store an event handler and call it from outside any async hook chain, the context may be lost. The fix is to bind the handler: context.bind(context.active(), handler).
T.6Manual spans in TypeScript
Getting a tracer and creating a span is straightforward:
typescriptimport { trace, SpanStatusCode, SpanKind } from '@opentelemetry/api';
const tracer = trace.getTracer('payments-domain', '2.14.3');
async function settlePayment(orderId: string, amount: number): Promise<PaymentResult> {
return tracer.startActiveSpan(
'settle_payment',
{ kind: SpanKind.INTERNAL, attributes: { 'order.id': orderId, 'payment.amount_minor': amount } },
async (span) => {
try {
const result = await pspClient.charge(orderId, amount);
span.setAttribute('psp.reference', result.reference);
if (result.declined) {
span.setStatus({ code: SpanStatusCode.ERROR, message: result.reason });
}
return result;
} catch (err) {
span.recordException(err as Error);
span.setStatus({ code: SpanStatusCode.ERROR });
throw err;
} finally {
span.end();
}
},
);
}
startActiveSpan is the idiomatic API — it creates the span, makes it the active context for the duration of the callback, and gives you the span as an argument. Don't forget span.end(): TypeScript's type system doesn't enforce it. A common pattern is a small helper:
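typescriptimport type { Span } from '@opentelemetry/api';
// tracer and SpanStatusCode come from the previous block.
async function traced<T>(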
name: string,
attrs: Record<string, string | number | boolean>,
fn: (span: Span) => Promise<T>,
): Promise<T> {
return tracer.startActiveSpan(name, { attributes: attrs }, async (span) => {
try {
return await fn(span);
} catch (err) {
span.recordException(err as Error);
span.setStatus({ code: SpanStatusCode.ERROR });
throw err;
} finally {
span.end();
}
});
}
Then your business code stays readable: const r = await traced('settle_payment', { 'order.id': id }, async (span) => {...});
T.7Outbound HTTP propagation
The HTTP instrumentation intercepts both incoming and outgoing requests. For outgoing requests built on Node's built-in http/https module (which includes node-fetch, axios with the http adapter, and got), the traceparent header is injected automatically and a CLIENT span is created.
The native fetch (Node 18+, undici-based) has its own instrumentation:
typescriptimport { UndiciInstrumentation } from '@opentelemetry/instrumentation-undici'; // Already included in auto-instrumentations-node since 0.50.
If you see "no CLIENT span" for outgoing requests, it's almost always because:
- You're using global fetch and the undici instrumentation is missing or not enabled.
- The HTTP client was loaded before the SDK was set up (the require/import order problem).
- A bundler has rewired the imports such that the monkey-patch can't reach them. See T.11.
T.8Databases
Available instrumentations cover the common Node DB libraries:
| Library | Notes |
|---|---|
| pg | node-postgres. Parameterized query on span; literals not captured by default. |
| mysql / mysql2 | Both supported. mysql2 has better connection-pool support. |
| mongodb | Captures collection and command shape. |
| ioredis | Per-command. Noisy at scale. |
| redis (v4+) | Same caveats as ioredis. |
| prisma | Special: Prisma has its own tracing built in. You enable the tracing preview feature in the schema, and Prisma emits OTel spans natively. The @prisma/instrumentation package wires them in. |
Prisma deserves a note. Add the preview feature in schema.prisma:
prismagenerator client {
provider = "prisma-client-js"
previewFeatures = ["tracing"]
}
Then in your bootstrap, register Prisma's instrumentation:
typescriptimport { PrismaInstrumentation } from '@prisma/instrumentation';
new NodeSDK({
...
instrumentations: [
getNodeAutoInstrumentations(...),
new PrismaInstrumentation(),
],
});
You get spans for each Prisma query, including the underlying SQL (when available) and the engine round-trip time. Useful for diagnosing the "is Prisma slow or is Postgres slow" question.
T.9Messaging
The Node ecosystem has slimmer messaging coverage than Python's, but the major ones are there:
| Library | Package |
|---|---|
| kafkajs | @opentelemetry/instrumentation-kafkajs |
| amqplib (RabbitMQ) | @opentelemetry/instrumentation-amqplib |
| aws-sdk SQS/SNS | @opentelemetry/instrumentation-aws-sdk |
| BullMQ | @opentelemetry/instrumentation-bullmq (community) |
kafkajs is the most commonly used. The instrumentation handles header injection on the producer and extraction on the consumer, creating linked PRODUCER and CONSUMER spans. Make sure you're on a kafkajs version compatible with the instrumentation's range (it's strict).
T.10Logging correlation
Pino is the most common Node logger. Use @opentelemetry/instrumentation-pino:
typescriptimport { PinoInstrumentation } from '@opentelemetry/instrumentation-pino';
new NodeSDK({
instrumentations: [
getNodeAutoInstrumentations(),
new PinoInstrumentation({
logHook: (span, record) => {
record['resource.service.name'] = serviceName;
},
}),
],
});
The instrumentation modifies Pino's mixin to add trace_id and span_id fields to every log line. The field names align with what Loki/Grafana expect for trace-to-logs linking.
For Winston, the analogous package is @opentelemetry/instrumentation-winston. For Bunyan, it's @opentelemetry/instrumentation-bunyan.
console.log does not get trace IDs attached automatically. The official position is that console is not a structured logger. In practice, if your codebase has a lot of console.log calls and you want them correlated, either migrate to Pino or write a small wrapper that calls trace.getActiveSpan()?.spanContext() (the active span may be undefined) and prefixes the output.
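A minimal sketch of such a wrapper follows. The names SpanIds, formatWithTrace, and makeTracedLog are hypothetical; the span lookup is passed in as a function so the sketch stays independent of the SDK and is easy to test.

```typescript
// The two fields needed from a span context.
interface SpanIds {
  traceId: string;
  spanId: string;
}

// Returns the active span's IDs, or undefined when no span is active.
type SpanIdsProvider = () => SpanIds | undefined;

// Prefixes the message with trace_id/span_id when a span is active; otherwise leaves it alone.
function formatWithTrace(message: string, getIds: SpanIdsProvider): string {
  const ids = getIds();
  return ids ? `trace_id=${ids.traceId} span_id=${ids.spanId} ${message}` : message;
}

// Builds a log function bound to a provider and an output sink (console.log by default).
function makeTracedLog(
  getIds: SpanIdsProvider,
  sink: (line: string) => void = console.log,
): (message: string) => void {
  return (message: string): void => sink(formatWithTrace(message, getIds));
}
```

In an instrumented service the provider would typically be () => trace.getActiveSpan()?.spanContext(), since the span context exposes traceId and spanId. The key=value prefix is an assumption about your log format; adjust it to whatever your Loki derived-field regex expects.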
T.11ESM, bundlers, and the patching problem
Node OTel's auto-instrumentation works by intercepting require() calls (CommonJS) or import-time hooks (ESM). It replaces the loaded module's exports with patched versions. Two situations break this.
T.11.1ESM
Pure ESM Node apps need the loader hook variant. Use --import instead of --require:
shellnode --import ./instrumentation.mjs ./main.mjs
Under the hood, the OTel SDK uses import-in-the-middle (the ESM equivalent of require-in-the-middle). Some instrumentations don't yet have full ESM support; check the per-instrumentation README. The ecosystem has converged on ESM support throughout 2024–2025 but a few edge cases remain.
T.11.2Bundlers
Webpack, esbuild, rollup, tsup — all of them, by default, will inline your dependencies into a single output file. After bundling, there is no node_modules/express/index.js to monkey-patch. The instrumentation can't find what to patch, and silently produces no spans.
Three resolutions:
- Don't bundle the server. The simplest and most common choice: tsc emit + node_modules at runtime. The image-size savings from bundling rarely justify the operational complexity.
- Mark dependencies as external. Tell the bundler to leave node_modules alone for the instrumented libraries. esbuild: --external:express --external:pg. This works but requires maintaining the list.
- Use a bundler-aware approach. Some teams have had success with @vercel/ncc + selective externals, but this is fragile. Avoid if you can.
In NestJS, nest build --webpack bundles dependencies, which breaks OTel auto-instrumentation. Either use tsc output (the default nest build without --webpack) or carefully externalize. The NestJS docs recently added a note about this.
T.11.3tsx, ts-node, ts-node-dev
For development with TypeScript, tsx is now the de-facto choice. It supports --import and works with OTel. Older ts-node-dev has known issues with module patching due to its custom transformer; if you see no spans in dev, suspect this first.
T.12Production patterns (TypeScript)
Mirroring the Python list:
- Sampling: AlwaysOnSampler at the SDK. Same reasoning: tail-sample at the Collector.
- Disable noise. fs, dns, net instrumentations off. Health/metric endpoints filtered in the HTTP instrumentation.
- Resource attributes from env. OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES. Don't put cluster-specific attributes in code.
- Graceful shutdown. sdk.shutdown() on SIGTERM. Critical for ephemeral pods; without it, the last few seconds of spans never reach Tempo.
- Pino + Pino instrumentation. JSON logs with trace IDs. Aligns with Loki defaults.
- Manual spans for business operations. Same as Python: auto handles framework and library spans; you handle the domain.
- Avoid bundling. Or carefully manage externals if you must bundle.
- Use the semconv constants. Import from @opentelemetry/semantic-conventions rather than hand-writing attribute names. Type-checked and migration-resilient.
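As an illustration of the "disable noise" item, the health and metrics filter can be written as a small pure predicate. IGNORED_PATHS and shouldIgnoreRequest are hypothetical names, and the path list is an assumption; substitute your own probe endpoints.

```typescript
// Endpoints that should never produce SERVER spans (assumed list).
const IGNORED_PATHS = ["/health", "/ready", "/metrics"];

// True when the request URL targets one of the ignored endpoints.
// Query strings are stripped first; "/healthcheck" is NOT treated as "/health".
function shouldIgnoreRequest(url: string | undefined): boolean {
  if (!url) return false;
  const queryStart = url.indexOf("?");
  const path = queryStart === -1 ? url : url.slice(0, queryStart);
  return IGNORED_PATHS.some((ignored) => path === ignored || path.startsWith(`${ignored}/`));
}
```

Assuming the HTTP instrumentation's ignoreIncomingRequestHook option, the wiring would look roughly like ignoreIncomingRequestHook: (req) => shouldIgnoreRequest(req.url). Keeping the predicate separate from the SDK config makes it trivial to unit-test.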
The Path to Tempo
From your application's exporter to a TraceQL panel in a Grafana dashboard.
IV.1The wire path
Your application emits spans via OTLP to an Alloy DaemonSet (one instance per node, reached via localhost or the node IP). Alloy forwards to a central OTel Collector (or another Alloy tier), which does the heavy lifting (tail sampling, attribute scrubbing, metric generation) and then sends the spans on to Tempo's distributor.
For your existing Alloy + Tempo setup, the application-side concerns are exactly what this document has covered — set OTEL_EXPORTER_OTLP_ENDPOINT to the local Alloy, set OTEL_SERVICE_NAME and resource attributes correctly, sample at AlwaysOn, ship.
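Since misconfigured environment variables are an easy way to lose traces, a startup check can help. The sketch below assumes the comma-separated key=value format of OTEL_RESOURCE_ATTRIBUTES; parseResourceAttributes and missingOtelEnv are hypothetical helpers, not part of the SDK, which does its own parsing.

```typescript
// Parses "k1=v1,k2=v2" into a map. Malformed entries are skipped rather than thrown on.
function parseResourceAttributes(raw: string | undefined): Record<string, string> {
  const attrs: Record<string, string> = {};
  if (!raw) return attrs;
  for (const pair of raw.split(",")) {
    const eq = pair.indexOf("=");
    if (eq <= 0) continue; // no key or no "=": skip
    const key = pair.slice(0, eq).trim();
    const value = pair.slice(eq + 1).trim();
    if (!key) continue;
    try {
      attrs[key] = decodeURIComponent(value); // values may be percent-encoded
    } catch {
      attrs[key] = value; // keep the raw value if decoding fails
    }
  }
  return attrs;
}

// Lists the variables this document treats as mandatory that are unset or empty.
function missingOtelEnv(env: Record<string, string | undefined>): string[] {
  const required = ["OTEL_EXPORTER_OTLP_ENDPOINT", "OTEL_SERVICE_NAME"];
  return required.filter((name) => !env[name]);
}
```

Calling missingOtelEnv(process.env) at boot and logging the result is usually enough to catch a pod that would otherwise export nowhere or report itself as an unknown service.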
IV.2TraceQL for dashboards
TraceQL is Tempo's query language. The minimum mental model:
- Spansets. The result of a TraceQL query is a set of spans. The filter expression selects them. Curly braces wrap the filter: { resource.service.name = "payments-api" }.
- Span-level attributes are prefixed with span.: { span.http.status_code = 500 }. A leading dot with no scope, as in { .http.status_code = 500 }, matches the attribute at either span or resource level.
- Resource-level attributes are prefixed with resource.: { resource.service.name = "payments" }.
- Intrinsics are unprefixed: name, duration, kind, status. Example: { duration > 500ms }. Newer Tempo versions also accept a colon-scoped spelling such as span:duration.
IV.2.1Useful TraceQL idioms
traceql# All errored payment requests
{ resource.service.name = "payments-api" && status = error }
# Slow checkouts that hit Postgres
{ name = "POST /checkout" && duration > 1s } >> { span.db.system = "postgresql" }
# All spans of a tenant
{ .tenant.id = "acme-corp" }
# Traces where the payments service called the fraud service
{ resource.service.name = "payments-api" } >> { resource.service.name = "fraud-api" }
The >> operator is the descendant operator: the query returns spans matching the right-hand side that have an ancestor matching the left-hand side. > is the direct-child equivalent. && and || compose conditions within a single spanset.
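When dashboards are generated from code, it can help to assemble these queries from parts instead of concatenating strings by hand. The following is a minimal sketch with hypothetical helpers (str, raw, cond, spanset, descendant); it only covers the constructs discussed above and does no validation of field names.

```typescript
// A quoted string literal vs. a raw token (durations like 500ms, keywords like error).
type Literal = { kind: "string"; text: string } | { kind: "raw"; text: string };

const str = (text: string): Literal => ({ kind: "string", text });
const raw = (text: string): Literal => ({ kind: "raw", text });

type Operator = "=" | "!=" | ">" | ">=" | "<" | "<=";

// Renders one condition, e.g. duration > 500ms or resource.service.name = "payments-api".
function cond(field: string, op: Operator, value: Literal): string {
  const rendered = value.kind === "string" ? JSON.stringify(value.text) : value.text;
  return `${field} ${op} ${rendered}`;
}

// Wraps conditions in a spanset filter, AND-ing them together.
function spanset(...conditions: string[]): string {
  return `{ ${conditions.join(" && ")} }`;
}

// Joins two spansets with the descendant operator (>>).
function descendant(ancestor: string, descendantSet: string): string {
  return `${ancestor} >> ${descendantSet}`;
}

const erroredPayments = spanset(
  cond("resource.service.name", "=", str("payments-api")),
  cond("status", "=", raw("error")),
);
```

Here erroredPayments renders to { resource.service.name = "payments-api" && status = error }, matching the first idiom above.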
IV.2.2TraceQL metrics
TraceQL can also produce metrics on-the-fly:
traceql# Request rate per route
{ resource.service.name = "payments-api" } | rate() by (span.http.route)
# p95 latency per route
{ resource.service.name = "payments-api" } | quantile_over_time(duration, 0.95) by (span.http.route)
These run against trace data directly (not against pre-aggregated metrics) which means they're slower than Prometheus queries but more flexible — you can dimension by any span attribute, not just what the metrics-generator pre-aggregated. Useful for one-off questions ("what's the p99 for tenant X on route Y in the last 30 minutes") that aren't worth a permanent metric.
IV.2.3Dashboard composition
A useful tracing dashboard for a service usually has:
- Top: RED metrics from the metrics-generator (Mimir) — request rate, error rate, p50/p95/p99 latency, by route. These are Prometheus queries, fast, with exemplars enabled.
- Middle: a TraceQL search panel — most recent errors, slowest traces in window.
- Bottom: service-graph panel showing dependencies and edges with error/latency annotations.
The exemplar links are what turn this from a static dashboard into something investigable: click a spike on the latency chart, land in the slow trace itself, see the bottleneck.
L.0Logs, Metrics & Logfire — completing the observability picture
The study projects in this workspace (observability-fastapi and observability-node on the feature/logfire-migration branch) were migrated from raw OpenTelemetry to Logfire (Pydantic) and then extended with production-grade logging and metrics. Logfire is a thin, opinionated wrapper that gives you a much nicer API while still emitting standard OTLP for traces, logs, and metrics with automatic context correlation.
L.0.1Why Logfire for these use cases?
- Trace use cases now covered: manual nested spans for business flows (checkout.process → validate_cart, payment.process with CLIENT flavor, order creation, db-style user.get_by_id), proper error recording via logfire.error()/logfire.warning(), context in background tasks, health exclusion, graceful shutdown.
- All of the concrete patterns in main.py and server.ts (the /api/checkout, /api/users/:id, /api/slow, /api/error, background-job endpoints) are the living reference for the theory in sections 2–3 and P.4–P.11 / T.* above.
L.0.2Configuration (the one-time setup)
python# telemetry.py
import os

import logfire

logfire.configure(
token=os.getenv("LOGFIRE_TOKEN"),
service_name=service_name,
service_version=service_version,
environment=environment,
)
logfire.instrument_fastapi(app, excluded_urls=["/health", "/ready", "/metrics"])
logfire.instrument_system_metrics() # CPU, mem, GC, process metrics
ts// instrumentation.ts
logfire.configure({
token: process.env.LOGFIRE_TOKEN,
serviceName: process.env.OTEL_SERVICE_NAME || 'observability-node',
metrics: {}, // enable the metrics signal alongside traces
});
L.0.3Manual business spans + logs + metrics in one flow (the checkout usecase)
This is the exact pattern now running in both projects. Note the use of semantic-convention-friendly attribute names and low-cardinality metric labels.
pythoncheckout_attempts = logfire.metric_counter("checkout.attempts", unit="1")
payment_duration = logfire.metric_histogram("payment.duration", unit="ms")
with logfire.span("checkout.process", checkout_id=chk, user_id=uid) as span:
checkout_attempts.add(1, {"status": "started"})
logger.info("Checkout started", extra={"checkout_id": chk}) # correlated via Logfire logging hook
with logfire.span("checkout.validate_cart"): ...
t0 = time.perf_counter()
pay = await process_payment()
payment_duration.record((time.perf_counter()-t0)*1000, {"provider": "stripe"})
...
checkout_attempts.add(1, {"status": "success"})
logfire.info("Checkout completed")
In the Node/TS version the shape is logfire.span(name, attrs, async (span) => { ... }) + logfire.info(...) + OTel meter.createCounter(...).add(1, labels). The active span context makes every metric and log entry carry the trace linkage automatically.
L.0.4Semantic conventions across the three signals (what the demos actually use)
| Signal | Common attributes / names used in the projects |
|---|---|
| Traces (spans) | user.id, checkout.id, payment.id, order.id, db.system, db.operation, db.statement, db.rows.count, span.kind=client (or attr), cart.*, delay.seconds |
| Logs | severity via logfire.error/warning/info or stdlib logger.*; body + attributes; automatic trace_id / span_id injected by the SDK/Logfire handler |
| Metrics | checkout.attempts{status} (counter), payment.duration{provider} (hist), system.* from instrument_system_metrics (process.cpu.utilization, system.memory.utilization, cpython.gc.* etc.) — all with resource.service.* |
Follow the OTel semantic conventions catalog for the stable set (http.*, db.*, etc.). High-cardinality values (IDs, trace IDs) stay on spans/logs; low-cardinality dimensions go on metrics.
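One way to enforce the cardinality rule mechanically is an allowlist of metric label keys: anything not on the list stays on the span. The sketch below is an assumption about how that could look; METRIC_LABEL_ALLOWLIST and splitByCardinality are hypothetical names, and the allowlisted keys are taken from the labels used in the table above.

```typescript
type Attrs = Record<string, string | number | boolean>;

// Keys considered safe as metric labels (low cardinality). Everything else is span/log only.
const METRIC_LABEL_ALLOWLIST: ReadonlySet<string> = new Set(["status", "provider", "http.route"]);

// Splits one attribute bag into metric labels and span-only attributes.
function splitByCardinality(
  attrs: Attrs,
  allowlist: ReadonlySet<string> = METRIC_LABEL_ALLOWLIST,
): { metricLabels: Attrs; spanOnly: Attrs } {
  const metricLabels: Attrs = {};
  const spanOnly: Attrs = {};
  for (const [key, value] of Object.entries(attrs)) {
    if (allowlist.has(key)) {
      metricLabels[key] = value;
    } else {
      spanOnly[key] = value;
    }
  }
  return { metricLabels, spanOnly };
}
```

For example, splitting { status: "success", provider: "stripe", "checkout.id": "chk_123" } keeps status and provider as metric labels and leaves checkout.id for the span, so a new ID per request never creates a new time series.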
L.0.5Correlation that just works
Because Logfire (and the underlying OTel SDK) propagates the SpanContext via contextvars (Python) / AsyncLocalStorage (Node), any logfire.* call or stdlib log made while the span is active automatically receives the current trace/span IDs, and metric measurements recorded in that context can carry them as exemplars. In Grafana this means:
- Loki logs get derived fields for trace_id → Tempo link
- Prometheus metrics carry the trace ID through exemplars (including those from the metrics-generator), so a latency spike can jump you straight into the offending trace
No manual trace_id threading required for the common case.
Start the FastAPI or Node server (see each README), hit /api/checkout a few times, then look at the Logfire UI or your local Grafana (Tempo + Loki + Prometheus). You will see the full connected story: the manual spans, the log lines with trace context, the counter increments, the system metrics, and the histogram distribution — all linked.
V.0Closing notes & production checklist
Tracing rewards consistency. The most common reasons traces are unhelpful in production are not exotic — they are: missing service.name, raw paths in span names, broken context propagation across one async boundary, health checks drowning out real traffic, and shutdowns that don't flush. Every one of these is preventable and the failure modes are sharp: either it's right and your traces are excellent, or it's wrong and you have garbage.
The second-order recommendation is to push the discipline into platform-level conventions: a shared bootstrap module (or sidecar) per language stack, a baseline Collector pipeline with the right processors, dashboards built once and reused, opinions encoded as defaults. Individual teams shouldn't have to rediscover that fs instrumentation is too noisy or that bundlers break things.
And the third: the value of traces compounds with logs and metrics correlation. A trace alone is a clue; a trace linked to its logs and to the metric exemplar that took you there is a complete picture. The work to wire those links once is the single highest-leverage thing in the entire pipeline.