Topic 47

Tracing and OpenTelemetry

Distributed tracing follows a single request as it travels across services, recording how long each hop took and where it failed. In a system of many microservices, it answers the question metrics and logs alone struggle with: where in the call chain did this slow or failing request actually spend its time?

OpenTelemetry is the standard that unifies how traces (along with metrics and logs) are produced, so instrumentation is no longer tied to one vendor's backend. Together, distributed tracing and OpenTelemetry make a request's journey visible.

Why Traces

Metrics tell you the checkout service is slow; logs tell you what one instance did; neither easily tells you that a request spent 400ms waiting on a downstream inventory call that then timed out. A trace captures the whole path of one request across services as a tree of timed operations, so you can see the critical path and the failure point. For latency and dependency problems in distributed systems, tracing is the tool.
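
As a rough illustration of how a trace shows where the time went, the sketch below takes the spans of one hypothetical checkout request and computes each span's self time (its own duration minus the time spent in its children). The span names, IDs, and durations are invented to mirror the inventory example above, and the calculation assumes child spans run one after another rather than in parallel.

```python
from collections import defaultdict

# Hypothetical spans for one request: each has an ID, a parent ID, and a duration.
spans = [
    {"id": "a", "parent": None, "name": "checkout", "ms": 450},
    {"id": "b", "parent": "a", "name": "auth", "ms": 30},
    {"id": "c", "parent": "a", "name": "inventory", "ms": 400},
]


def self_times(spans):
    """Return each span's own time: its duration minus its direct children's durations."""
    child_total = defaultdict(int)
    for span in spans:
        if span["parent"] is not None:
            child_total[span["parent"]] += span["ms"]
    return {span["name"]: span["ms"] - child_total[span["id"]] for span in spans}


times = self_times(spans)
slowest = max(times, key=times.get)  # the span where most of the request's time was spent
print(times, slowest)
```

Here the checkout span looks slow in aggregate, but almost all of its 450 ms belongs to the inventory child span, which is the kind of attribution metrics and single-service logs do not give you directly.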

Spans, Traces, and Context

Figure: a trace is one request across services, timed.

A span is one timed operation, such as a service handling its part of the request. A trace is the tree of spans for one request, tied together by a shared trace ID. The mechanism that links them is context propagation: each service passes the trace context (the trace ID and the calling span's ID) to the next service, typically in request headers such as the W3C traceparent header, so spans created in different services join the same trace. Broken propagation, where a service drops those headers, produces orphan spans and incomplete traces, and is one of the most common tracing bugs.
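
The propagation mechanism can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the OpenTelemetry SDK: the function names and the gateway, checkout, and inventory call chain are hypothetical, headers are modelled as a plain dict with a lowercase key, and the header layout follows the W3C Trace Context traceparent format (version, trace ID, parent span ID, flags).

```python
import re
import secrets

# W3C traceparent layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def start_span(incoming_headers: dict) -> dict:
    """Join the caller's trace if a valid traceparent arrived, else start a new trace."""
    match = TRACEPARENT_RE.match(incoming_headers.get("traceparent", ""))
    if match:
        trace_id, parent_id = match.group(2), match.group(3)
    else:
        trace_id, parent_id = secrets.token_hex(16), None  # root span of a new trace
    # 8 random bytes give the 16 hex characters of a span ID.
    return {"trace_id": trace_id, "span_id": secrets.token_hex(8), "parent_id": parent_id}


def outgoing_headers(span: dict) -> dict:
    """Headers to attach to the next downstream call; flags 01 mark the trace as sampled."""
    return {"traceparent": f"00-{span['trace_id']}-{span['span_id']}-01"}


# Hypothetical call chain: gateway -> checkout -> inventory.
gateway = start_span({})
checkout = start_span(outgoing_headers(gateway))
inventory = start_span(outgoing_headers(checkout))

# A service that receives no header starts a fresh trace: an orphan span.
orphan = start_span({})
```

All three spans in the chain share the gateway's trace ID, and each records its caller's span ID as its parent, which is what lets a backend rebuild the tree. The orphan span has a different trace ID, so it can never be attached to the original request.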

OpenTelemetry

OpenTelemetry (OTel) is a CNCF project and the de facto standard for telemetry: vendor-neutral APIs and SDKs for instrumenting code, plus the OTel Collector, which receives, processes, and exports telemetry to any supported backend. Instrument once against OTel and you can send traces to Jaeger, Grafana Tempo, or a cloud tracing service without changing the application code. It has become the default way to instrument, replacing the fragmented, vendor-specific libraries that came before.

Sampling and Correlation

Tracing every request is expensive in overhead and storage, so traces are sampled. Head sampling decides up front, before the request's outcome is known (for example, keep 1% of traces); tail sampling decides after the whole trace has been seen (for example, keep all errors and slow requests). Choosing a sampling strategy is the main cost lever. The payoff multiplies when telemetry is correlated: a trace ID in your structured logs (Topic 45) and exemplars that link metrics (Topic 46) to traces let you jump from a latency spike to the exact slow request. Traces are the third pillar, and they are most valuable alongside the other two.
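
The two sampling styles can be contrasted with a small standard-library sketch. It shows one common approach rather than the exact algorithm of any particular SDK or Collector: the function names, the span fields (error, start_ms, end_ms), and the thresholds are assumptions made for illustration.

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at the start of the trace, using only the trace ID.

    Deriving the verdict from the ID means every service that sees the same
    trace ID reaches the same decision without coordinating.
    """
    bucket = int(trace_id[-16:], 16)  # lower 64 bits of the 128-bit trace ID
    return bucket < ratio * 2**64


def tail_sample(spans: list, slow_ms: float) -> bool:
    """Decide after the trace completes: keep it if any span failed or the trace was slow."""
    has_error = any(span["error"] for span in spans)
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or total_ms >= slow_ms


# Hypothetical completed traces.
healthy = [{"error": False, "start_ms": 0, "end_ms": 40}]
failed = [{"error": False, "start_ms": 0, "end_ms": 35},
          {"error": True, "start_ms": 5, "end_ms": 30}]
slow = [{"error": False, "start_ms": 0, "end_ms": 900}]

kept = [tail_sample(t, slow_ms=500) for t in (healthy, failed, slow)]
```

Head sampling is cheap because the decision needs no buffering, but it cannot know which traces will turn out to be interesting. Tail sampling keeps the failed and slow traces and drops the healthy one, at the cost of holding every span until the trace is complete.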

Metrics vs logs vs traces

Metrics — aggregate numeric trends — what is happening and how much. Cheap, always-on.

Logs — discrete events from one component — what a service did.

Traces — one request across services — where time went and where it failed. Sampled.

Common Mistakes

Sampling at 100% and paying unsustainable overhead and storage cost.
Breaking context propagation (dropping trace headers), producing orphan spans and broken traces.
Instrumenting blindly without correlating traces to logs and metrics, losing most of the value.
Adopting a vendor-specific tracing library instead of OpenTelemetry and getting locked in.
Treating traces as a replacement for metrics rather than the third, complementary pillar.

Best Practices

Instrument with OpenTelemetry so you stay backend-neutral and future-proof.
Ensure context propagation across every service so traces stay complete.
Choose a sampling strategy (head or tail) deliberately as the main cost control.
Correlate the three pillars — put the trace ID in logs and link metrics to traces.
Reach for tracing specifically for cross-service latency and dependency problems.
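
Putting the trace ID in logs can be sketched with Python's standard logging module. This is an illustrative setup under stated assumptions: the logger name, the JSON field names, and the hand-set trace ID are hypothetical, and in a real service the ID would be read from the active span rather than assigned manually.

```python
import contextvars
import io
import json
import logging

# Holds the trace ID of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so the trace ID is a searchable field."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


def build_logger(stream):
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")  # normally taken from the active span
log.info("inventory call timed out")
entry = json.loads(buffer.getvalue())
```

With the ID attached to every line, searching the log store for a trace ID returns exactly the log lines written while that request was being handled, in whichever services logged it.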

Related

Logging: correlate via trace IDs (Topic 45)
Metrics and monitoring: the other two pillars (Topic 46)
Cloud tracing: X-Ray, Cloud Trace, Application Insights as backends

Knowledge Check

What does a distributed trace capture that metrics and logs don't?

The full path of one request across services, showing where time was spent and where it failed
Aggregate CPU and memory usage trends rolled up over time
The raw line-by-line text output streamed from a single container's stdout and stderr over time
The cluster's desired state as recorded in each object's spec

What causes orphan spans and incomplete traces?

Broken context propagation — a service not passing the trace context to the next
Sampling every request at a full 100% head-based rate
Adopting OpenTelemetry SDKs to consistently instrument every single service in the call path
Attaching too many high-cardinality labels to metrics

What does OpenTelemetry provide?

A vendor-neutral API/SDK and Collector so you instrument once and export to any backend
A durable storage backend that indexes, retains, and lets you query all the collected traces
A built-in autoscaler that scales Pods on trace latency
A drop-in replacement for the Prometheus metrics stack
