Tracing and OpenTelemetry
Distributed tracing follows a single request as it travels across services, recording how long each hop took and where it failed. In a system of many microservices, it answers the question metrics and logs alone struggle with: where in the call chain did this slow or failing request actually spend its time?
OpenTelemetry is the standard that unifies how traces (along with metrics and logs) are produced, so instrumentation is no longer tied to one vendor's backend. Together, distributed tracing and OpenTelemetry make a request's journey visible.
Why Traces
Metrics tell you the checkout service is slow; logs tell you what one instance did; neither easily tells you that a request spent 400 ms waiting on a downstream inventory call that then timed out. A trace captures the whole path of one request across services as a tree of timed operations, so you can see the critical path and the failure point. For latency and dependency problems in distributed systems, tracing is the right tool.
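As a rough illustration of reading a trace as a tree of timed operations, the sketch below models the checkout example with a hypothetical `Span` class. The span names and timings are made up. The path walk is a simplification that follows the longest-running child at each level; real tracing backends compute the critical path more carefully when children run in parallel.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical minimal span: a named, timed operation with children."""
    name: str
    start_ms: int
    end_ms: int
    error: bool = False
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def slowest_path(span: Span) -> list[Span]:
    """Walk from the root, following the longest-running child each time."""
    path = [span]
    while span.children:
        span = max(span.children, key=lambda s: s.duration_ms)
        path.append(span)
    return path


# Illustrative trace: checkout waits 400 ms on an inventory call that fails.
trace = Span("checkout", 0, 520, error=True, children=[
    Span("auth.verify", 5, 35),
    Span("cart.load", 40, 95),
    Span("inventory.reserve", 100, 500, error=True),
])

path = slowest_path(trace)
path_names = [s.name for s in path]
failed_names = [s.name for s in path if s.error]
```

Here `path_names` points straight at the inventory call, which is the kind of answer that aggregate metrics or a single service's logs do not give directly.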
Spans, Traces, and Context
A span is one timed operation, such as a service handling part of the request. A trace is the tree of spans for one request, tied together by a shared trace ID. The mechanism that links them is context propagation: each service passes the trace context (the trace ID and the parent span ID) in request headers to the next, so spans created in different services join the same trace. Broken propagation, where a service drops the headers, produces orphan spans and incomplete traces, and is one of the most common tracing bugs.
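Context propagation can be sketched with the W3C Trace Context `traceparent` header, whose format is `version-traceid-parentid-flags`. The helper names below (`inject`, `extract`, `handle_request`) are hypothetical and use only the standard library; in practice an OpenTelemetry SDK propagator does this for you, and this version skips some of the specification's validation rules.

```python
import re
import secrets

# traceparent: 2-hex version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Read (trace_id, parent_span_id) from incoming headers, or None."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, _flags = match.groups()
    return trace_id, parent_id


def handle_request(headers: dict) -> dict:
    """Hypothetical handler: join the caller's trace or start a new one."""
    context = extract(headers)
    if context is None:
        trace_id, parent_id = secrets.token_hex(16), None  # new root span
    else:
        trace_id, parent_id = context
    return {"trace_id": trace_id, "span_id": secrets.token_hex(8), "parent_id": parent_id}


# Service A starts the trace and calls service B with the header attached.
root = handle_request({})
outgoing = {}
inject(outgoing, root["trace_id"], root["span_id"])
child = handle_request(outgoing)

# A service that drops the headers starts an unrelated trace: an orphan span.
orphan = handle_request({})
```

The `child` span shares the root's trace ID and names the root as its parent, while `orphan` gets a fresh trace ID and no parent, which is what a dropped header looks like in the resulting data.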
OpenTelemetry
OpenTelemetry (OTel) is a CNCF project and the de facto standard for telemetry: a vendor-neutral API and SDK for instrumenting code, plus the OTel Collector, which receives, processes, and exports telemetry to any backend. Instrument once against OTel and you can send traces to Jaeger, Tempo, or a cloud tracing service by changing exporter configuration rather than application code. It has become the default way to instrument, replacing the fragmented, vendor-specific libraries that came before.
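The "instrument once, export anywhere" idea is a decoupling of application code from the backend. The sketch below illustrates only that design; `Exporter`, `ListExporter`, and `CountingExporter` are hypothetical stand-ins and are not the OpenTelemetry SDK's classes.

```python
from typing import Protocol


class Exporter(Protocol):
    """Hypothetical backend interface; NOT the real OpenTelemetry SDK API."""

    def export(self, span: dict) -> None: ...


class ListExporter:
    """Stand-in for one backend: keeps every span in memory."""

    def __init__(self) -> None:
        self.spans: list[dict] = []

    def export(self, span: dict) -> None:
        self.spans.append(span)


class CountingExporter:
    """Stand-in for a different backend: only counts spans per name."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}

    def export(self, span: dict) -> None:
        self.counts[span["name"]] = self.counts.get(span["name"], 0) + 1


def checkout(exporter: Exporter) -> None:
    """Application code, written once against the interface only."""
    exporter.export({"name": "checkout", "duration_ms": 120})


backend_a = ListExporter()
backend_b = CountingExporter()
checkout(backend_a)  # same application code...
checkout(backend_b)  # ...different backend, selected outside the application
```

The `checkout` function never names a backend, so switching destinations is a wiring decision rather than a code change.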
Sampling and Correlation
Tracing every request is expensive in overhead and storage, so traces are sampled. Head sampling decides up front, before the request's outcome is known (for example, keep 1% of traces); tail sampling decides after seeing the whole trace (for example, keep all errors and slow requests). Choosing a sampling strategy is the main cost lever. The payoff multiplies when telemetry is correlated: a trace ID in your structured logs (Topic 45) and exemplars linking metrics (Topic 46) to traces let you jump from a latency spike to the exact slow request. Traces are the third pillar, most valuable alongside the other two.
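The two sampling styles can be sketched as two small decision functions. This is an illustration under stated assumptions, not a production sampler: the function names, the dictionary span shape, and the 1000 ms slow threshold are all hypothetical. The head sampler derives its decision from the trace ID so that every service reaches the same answer without coordination.

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Decide up front from the trace ID alone (hex string, 32 chars)."""
    low_64_bits = int(trace_id[-16:], 16)
    # Keep the trace if its ID falls in the lowest `rate` fraction of the range.
    return low_64_bits < int(rate * 2**64)


def tail_sample(spans: list[dict], slow_ms: int = 1000) -> bool:
    """Decide after the trace completes: keep errors and slow traces."""
    has_error = any(span["error"] for span in spans)
    started = min(span["start_ms"] for span in spans)
    finished = max(span["end_ms"] for span in spans)
    return has_error or (finished - started) >= slow_ms


# Illustrative traces, each a list of finished spans.
fast_ok = [{"start_ms": 0, "end_ms": 80, "error": False}]
failed = [{"start_ms": 0, "end_ms": 80, "error": True}]
slow = [{"start_ms": 0, "end_ms": 1500, "error": False}]

keep_low_id = head_sample("0" * 32, rate=0.01)
keep_high_id = head_sample("f" * 32, rate=0.01)
tail_decisions = [tail_sample(t) for t in (fast_ok, failed, slow)]
```

Head sampling is cheap but blind to outcomes, so it can discard the one failing request you care about. Tail sampling keeps the interesting traces at the cost of buffering every span until the trace is complete.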
Metrics: aggregate numeric trends. They show what is happening and how much. Cheap, always-on.
Logs: discrete events from one component. They show what a service did.
Traces: one request across services. They show where time went and where it failed. Sampled.
Common Pitfalls
- Sampling at 100% and paying unsustainable overhead and storage cost.
- Breaking context propagation (dropping trace headers), producing orphan spans and broken traces.
- Instrumenting blindly without correlating traces to logs and metrics, losing most of the value.
- Adopting a vendor-specific tracing library instead of OpenTelemetry and getting locked in.
- Treating traces as a replacement for metrics rather than the third, complementary pillar.
Best Practices
- Instrument with OpenTelemetry so you stay backend-neutral and future-proof.
- Ensure context propagation across every service so traces stay complete.
- Choose a sampling strategy (head or tail) deliberately as the main cost control.
- Correlate the three pillars — put the trace ID in logs and link metrics to traces.
- Reach for tracing specifically for cross-service latency and dependency problems.
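Putting the trace ID in logs, as recommended above, can be sketched with the standard `logging` module. The names `current_trace_id`, `TraceIdFilter`, and `JsonFormatter` are hypothetical, the trace ID is an example value, and the output goes to an in-memory stream only so the result can be inspected here.

```python
import contextvars
import io
import json
import logging

# Holds the trace ID of the request currently being handled (assumed to be
# set by whatever code extracts the incoming trace context).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, including the trace ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.addFilter(TraceIdFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("inventory call timed out")

entry = json.loads(stream.getvalue())
```

With the trace ID as a field in each log entry, a log search and a trace lookup can be joined on the same key.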
Knowledge Check
What does a distributed trace capture that metrics and logs don't?
- The full path of one request across services, showing where time was spent and where it failed
- Aggregate CPU and memory usage trends rolled up over time
- The raw line-by-line text output streamed from a single container's stdout and stderr over time
- The cluster's desired state as recorded in each object's spec
What causes orphan spans and incomplete traces?
- Broken context propagation — a service not passing the trace context to the next
- Sampling every request at a full 100% head-based rate
- Adopting OpenTelemetry SDKs to consistently instrument every single service in the call path
- Attaching too many high-cardinality labels to metrics
What does OpenTelemetry provide?
- A vendor-neutral API/SDK and Collector so you instrument once and export to any backend
- A durable storage backend that indexes, retains, and lets you query all the collected traces
- A built-in autoscaler that scales Pods on trace latency
- A drop-in replacement for the Prometheus metrics stack