Monitoring Microservices with Distributed Tracing
Modern microservice architectures create complex request flows that are hard to debug without end-to-end visibility.

When you move from a single application to a microservice architecture, something that used to be a simple stack trace becomes a scattered conversation between services. A single user request might hop through an API gateway, an authentication service, a billing service, and a notification service before it returns. If that request slows down or fails, where do you even start? Logging tells you what happened at each service in isolation. Metrics tell you that something is wrong in aggregate. But tracing tells you the story of that one request, service call by service call, including the time spent in each hop and the context passed along the way.
I’ve been in war rooms where the dashboard screams “P99 latency is up,” but logs show everything “succeeded.” The bottleneck was a third-party API call that took 30 seconds inside one service, and we only found it by piecing together a trace that spanned five different teams. Distributed tracing isn’t a luxury in microservices; it’s the difference between guessing and knowing.
In this post, I’ll walk through why distributed tracing matters, how it works in practice, and how to add it to a realistic microservice setup. We’ll look at concrete patterns, code you can adapt, and tradeoffs I’ve learned the hard way.
Where distributed tracing fits today
Distributed tracing has matured from an academic concept to a production standard. The Cloud Native Computing Foundation’s OpenTelemetry project unified many previous efforts (like OpenTracing and OpenCensus) into a vendor-neutral set of APIs, SDKs, and protocols. That means you can instrument your code once and send traces to multiple backends, such as Jaeger, Tempo, Zipkin, or commercial platforms like Honeycomb, Datadog, or New Relic.
In real-world projects, tracing is used alongside logs and metrics to form a modern observability triad. It’s especially valuable in:
- Microservice systems with deep call graphs.
- Event-driven architectures using queues (Kafka, RabbitMQ).
- Serverless functions where you don’t control the runtime but still need request context.
- Mobile and web frontends that participate in the request path.
Who typically uses tracing? Platform teams provide the instrumentation libraries and collectors, while product teams add instrumentation to their services. SREs use traces to detect latency regressions; developers use them to debug specific user issues; support teams use them to validate reports from customers.
Compared to alternatives:
- Logging gives you event records; you can embed a correlation ID and grep for it, but reconstructing timelines is manual and brittle.
- Metrics give you aggregates; you’ll know if average latency increases but not which path caused it.
- Profiling shows you CPU/memory hotspots within a single process; tracing shows cross-service timing and propagation.
Tracing is not a replacement for the others; it’s a complement. If you only have logs and metrics, you’ll often know something is wrong and roughly what, but not why or where exactly.
Core concepts and capabilities
Distributed tracing is about recording a causal path through your system. Here are the essentials:
- Spans: A unit of work, like an HTTP handler, a database query, or a message processing loop. Spans have a start time, end time, and attributes (key-value pairs).
- Traces: A collection of spans forming a single request path. Every span belongs to a trace. A trace has a unique trace ID.
- Context propagation: How you pass trace and span IDs between services, typically via HTTP headers (traceparent, tracestate) or message metadata. Without propagation, you lose the thread between services.
- Sampling: Because tracing can be voluminous, you often sample a percentage of requests. Head-based sampling decides at the entry point; tail-based sampling decides after the full trace is complete (useful for capturing only slow or errored traces).
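To make context propagation concrete, here is the W3C traceparent header pulled apart with plain Python. This is a sketch for intuition only; in real services the OpenTelemetry propagator parses and injects this header for you. The header value is the example from the W3C Trace Context specification.

```python
# A traceparent header has four dash-separated hex fields:
# version, 16-byte trace ID, 8-byte parent span ID, and flags.
def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,   # shared by every span in the trace
        "span_id": span_id,     # the caller's span, i.e. our parent
        "sampled": bool(int(flags, 16) & 0x01),
    }

# Example value from the W3C Trace Context specification
ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

A downstream service reuses the trace ID, records the parent span ID as its parent, and honors the sampled flag so the whole trace is kept or dropped consistently.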
OpenTelemetry is the most common way to instrument modern systems. In code, you create spans around operations and propagate context automatically with instrumented libraries. In a microservice, you’ll often have:
- An entry service that receives external traffic.
- Mid-tier services that call each other via HTTP or gRPC.
- Downstream calls to databases, caches, or message queues.
A simple trace timeline might look like:
- Service A receives a GET /orders/{id} request.
  - Span: HTTP handler
  - Span: DB query
- Service A calls Service B via HTTP POST /payments/verify.
  - Span: HTTP client call
- Service B does work.
  - Span: DB query
  - Span: External API call
The trace viewer shows you a Gantt chart of these spans, highlighting latencies and attributes.
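As a mental model for that Gantt chart, a trace is just a tree of timed spans. The following toy (it is not the OpenTelemetry API, and the class and function names are invented for illustration) records wall-clock time per span and renders the indented rows a trace viewer draws:

```python
import time
from dataclasses import dataclass, field

@dataclass
class ToySpan:
    name: str
    children: list = field(default_factory=list)
    start: float = 0.0
    end: float = 0.0

    def timed(self, fn):
        # Run fn inside this span, recording wall-clock start/end
        self.start = time.monotonic()
        result = fn()
        self.end = time.monotonic()
        return result

def render(span: ToySpan, depth: int = 0) -> list:
    # Depth-first walk produces the indented rows of a trace timeline
    rows = ["  " * depth + f"{span.name} ({(span.end - span.start) * 1000:.1f} ms)"]
    for child in span.children:
        rows.extend(render(child, depth + 1))
    return rows

root = ToySpan("GET /orders/{id}")
db = ToySpan("db.query")
root.children.append(db)
root.timed(lambda: db.timed(lambda: time.sleep(0.01)))
rows = render(root)
# rows[0] is the handler span; rows[1] is the nested, indented DB span
```

The real SDK adds what the toy omits: IDs, attributes, context propagation, and export, but the tree-of-timed-intervals shape is the same.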
A realistic setup: Python services with OpenTelemetry
Let’s build a small example with two services and a Jaeger backend. We’ll simulate a user placing an order, which calls an order service and then a payment service.
Project structure:
tracing-demo/
├── docker-compose.yml
├── jaeger/
│   └── config/ (optional)
├── orders/
│   ├── Dockerfile
│   ├── requirements.txt
│   └── app.py
└── payments/
    ├── Dockerfile
    ├── requirements.txt
    └── app.py
We’ll use Flask for simplicity, but the patterns apply to FastAPI, Django, or any framework.
1) Dependencies
requirements.txt for both services:
flask==3.0.0
requests==2.31.0
opentelemetry-api==1.21.0
opentelemetry-sdk==1.21.0
opentelemetry-instrumentation-flask==0.42b0
opentelemetry-instrumentation-requests==0.42b0
opentelemetry-exporter-jaeger==1.21.0
Note: OpenTelemetry versions move quickly; pin to what works in your environment. In particular, 1.21.0 was the last release of the Jaeger Thrift exporter package; later releases expect you to export OTLP instead (Jaeger ingests OTLP natively since v1.35).
2) Orders service (entry point)
orders/app.py:
import os

import requests
from flask import Flask, jsonify
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize OpenTelemetry
resource = Resource.create({"service.name": "orders"})
provider = TracerProvider(resource=resource)
jaeger_endpoint = os.getenv("JAEGER_ENDPOINT", "http://jaeger:14268/api/traces")
jaeger_exporter = JaegerExporter(collector_endpoint=jaeger_endpoint)
provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
trace.set_tracer_provider(provider)

# Instrument libraries
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

app = Flask(__name__)
tracer = trace.get_tracer(__name__)

@app.route("/orders/<order_id>", methods=["GET"])
def get_order(order_id):
    # A parent span is created automatically by the Flask instrumentation
    with tracer.start_as_current_span("db.query") as span:
        span.set_attributes({
            "db.system": "postgres",
            "db.statement": "SELECT * FROM orders WHERE id = :id",
            "order.id": order_id,
        })
        # Simulate DB work; in real code, you would run the query here

    # Call the payment service; the requests instrumentation
    # auto-injects trace context headers into outgoing calls
    with tracer.start_as_current_span("call_payment_service"):
        payment_url = f"http://payments:5001/payments/verify/{order_id}"
        response = requests.get(payment_url, timeout=5)

    if response.status_code != 200:
        return jsonify({"error": "payment service failed"}), 500
    return jsonify({"order_id": order_id, "status": "processed"}), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
Key points:
- The Flask and Requests instrumentation automatically creates spans and propagates trace context using W3C Trace Context headers (traceparent).
- We added a manual span around a DB query to show how to record attributes and timing explicitly.
- The Jaeger exporter sends spans to Jaeger. In production, you may use OTLP (OpenTelemetry Protocol) to send to an OpenTelemetry Collector.
3) Payments service
payments/app.py:
import os

from flask import Flask, jsonify
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize OpenTelemetry
resource = Resource.create({"service.name": "payments"})
provider = TracerProvider(resource=resource)
jaeger_endpoint = os.getenv("JAEGER_ENDPOINT", "http://jaeger:14268/api/traces")
jaeger_exporter = JaegerExporter(collector_endpoint=jaeger_endpoint)
provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
trace.set_tracer_provider(provider)

FlaskInstrumentor().instrument()

app = Flask(__name__)
tracer = trace.get_tracer(__name__)

@app.route("/payments/verify/<order_id>", methods=["GET"])
def verify_payment(order_id):
    with tracer.start_as_current_span("db.query") as span:
        span.set_attributes({
            "db.system": "postgres",
            "db.statement": "SELECT * FROM payments WHERE order_id = :order_id",
            "order.id": order_id,
        })
        # Simulate DB work

    with tracer.start_as_current_span("external_payment_gateway") as span:
        # Simulate an external API call; in real code, call the
        # gateway and record status codes and errors
        span.set_attribute("http.url", "https://example-gateway.com/verify")

    return jsonify({"order_id": order_id, "verified": True}), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5001)
4) Docker Compose
docker-compose.yml:
version: "3.8"
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # UI
      - "14268:14268"   # HTTP collector
      - "6831:6831/udp" # UDP agent (optional)
  orders:
    build: ./orders
    environment:
      - JAEGER_ENDPOINT=http://jaeger:14268/api/traces
    ports:
      - "5000:5000"
    depends_on:
      - jaeger
  payments:
    build: ./payments
    environment:
      - JAEGER_ENDPOINT=http://jaeger:14268/api/traces
    ports:
      - "5001:5001"
    depends_on:
      - jaeger
orders/Dockerfile and payments/Dockerfile are straightforward:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "app.py"]
Run:
docker-compose up --build
Open Jaeger UI at http://localhost:16686. Trigger a request:
curl http://localhost:5000/orders/123
You should see a trace with spans for DB queries and the call from orders to payments.
What real-world instrumentation looks like
The example above is minimal. In production, you’ll need to handle a few additional concerns:
Context propagation beyond HTTP
If you use gRPC, the OpenTelemetry gRPC instrumentation propagates trace context automatically. If you use Kafka, you’ll propagate trace metadata in message headers. Here’s a conceptual example in Python using Kafka:
from opentelemetry import trace
from opentelemetry.propagate import inject
from kafka import KafkaProducer

tracer = trace.get_tracer(__name__)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

def send_order_event(order_id: str):
    with tracer.start_as_current_span("produce_order_event") as span:
        span.set_attribute("messaging.system", "kafka")
        span.set_attribute("messaging.destination", "orders")
        span.set_attribute("messaging.operation", "publish")
        # Inject trace context into a dict carrier, then convert it to
        # the list-of-tuples header format kafka-python expects
        carrier = {}
        inject(carrier)
        headers = [(key, value.encode()) for key, value in carrier.items()]
        producer.send("orders", key=order_id.encode(), value=b"order_created", headers=headers)
        producer.flush()
On the consumer side, extract context from headers and continue the trace:
from opentelemetry import trace
from opentelemetry.propagate import extract
from kafka import KafkaConsumer

tracer = trace.get_tracer(__name__)
consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092")

for msg in consumer:
    # kafka-python delivers headers as (key, bytes) tuples; rebuild a
    # dict carrier before extracting the trace context
    carrier = {key: value.decode() for key, value in (msg.headers or [])}
    ctx = extract(carrier)
    with tracer.start_as_current_span("consume_order_event", context=ctx) as span:
        span.set_attribute("messaging.system", "kafka")
        span.set_attribute("messaging.destination", "orders")
        span.set_attribute("messaging.operation", "process")
        # Process the message
Error handling and span status
Set span status to reflect errors, and record exceptions. In OpenTelemetry Python:
from opentelemetry.trace import Status, StatusCode

with tracer.start_as_current_span("call_payment_service") as span:
    try:
        response = requests.get(payment_url, timeout=5)
        response.raise_for_status()
    except requests.RequestException as e:
        span.set_status(Status(StatusCode.ERROR, str(e)))
        span.record_exception(e)
        raise
This makes error traces easier to find and filter.
Sampling strategies
Head-based sampling is simpler and reduces volume, but it may drop slow or error-prone traces you care about. Tail-based sampling solves that by evaluating full traces before deciding. In production, many teams run an OpenTelemetry Collector to handle sampling and routing. A minimal collector config (otel-collector-config.yaml) might look like:
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 500}

exporters:
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [jaeger]
You would deploy the collector alongside your services and point your SDK exporter at it over OTLP. This decouples instrumentation from backend details and allows centralized sampling policies. Two caveats: the tail_sampling processor ships in the Collector's contrib distribution, and recent Collector releases have dropped the dedicated jaeger exporter in favor of sending OTLP straight to Jaeger, which accepts it natively since v1.35.
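Pointing the SDKs at a collector does not have to mean code changes: OpenTelemetry standardizes configuration through environment variables. A sketch (the collector hostname and port are assumptions matching a typical compose setup; the variable names themselves are from the OpenTelemetry specification):

```shell
# Send OTLP to a local collector instead of exporting to Jaeger directly
export OTEL_SERVICE_NAME=orders
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317

# Head-sample 25% of root traces; child spans follow the parent's decision
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.25
```

Keeping endpoints and sampling in the environment means the same image runs in dev (100% sampling, local Jaeger) and prod (ratio sampling, central collector) without rebuilds.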
Strengths, weaknesses, and tradeoffs
Strengths
- Pinpoint latency and failures across services in one view.
- Provide context for logs and metrics via trace IDs and span attributes.
- Standardized via OpenTelemetry, reducing vendor lock-in.
- Tail-based sampling captures the “interesting” traces without drowning in volume.
Weaknesses
- Overhead: Tracing adds CPU, memory, and network costs. Sampling helps, but always test.
- Noise: Over-instrumenting can produce clutter and cost. Start with automatic instrumentation, then add manual spans where it matters.
- Data privacy: Be careful not to record PII in span attributes or logs. Use redaction at the collector level.
- Operational complexity: You need to run and maintain a tracing backend or pay for a SaaS. Jaeger is easy to start with but requires care at scale.
When it’s a good choice
- Microservice architectures with cross-service requests.
- Event-driven systems where context must be propagated through messages.
- Teams that need actionable visibility to ship faster and reduce MTTR.
When it’s less necessary
- Single monolith with minimal internal calls; logs and metrics might suffice.
- Extremely resource-constrained environments where tracing overhead is unacceptable (though you can sample aggressively).
- Scenarios where data sensitivity prevents any storage of request data.
Personal experience: lessons learned
A few things I learned the hard way:
- Start with automatic instrumentation. It gives you immediate spans for HTTP, database calls, and common libraries. Manual spans should target critical business logic and slow paths. Over-instrumenting early makes traces noisy.
- Don’t forget context propagation across asynchronous boundaries. The most common gap I’ve seen is message queues; if you don’t propagate headers, your trace breaks. A missing trace context in Kafka is like losing a breadcrumb trail.
- Choose the right sampling. Early on, we sampled 1% and missed rare errors. We switched to tail-based sampling (via a collector) to keep volume low while capturing slow and failed traces. That single change dramatically improved debugging speed.
- Attributes matter. Record high-cardinality attributes carefully (e.g., user IDs or order IDs) because they can affect backend storage costs. Keep attributes bounded and meaningful.
- Treat tracing as part of your domain model. Name spans after what the code does (“db.query,” “call_payment_service,” “process_kafka_message”) rather than generic labels. Future you will thank past you.
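The head-based ratio decision mentioned in the sampling lesson above fits in a few lines. This sketch uses a hash as a stand-in for how OpenTelemetry's TraceIdRatioBased sampler compares trace-ID bits against a threshold; the function name is invented for illustration:

```python
import hashlib

def head_sample(trace_id: str, rate: float) -> bool:
    # Hash the trace ID into [0, 1) and compare against the sampling rate.
    # Every service computes the same answer for the same trace ID, so a
    # trace is either kept whole or dropped whole -- never half-sampled.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) / 0xFFFFFFFF
    return bucket < rate
```

The key property is determinism per trace ID, not randomness per service; that is what keeps a sampled trace intact across hops. What head sampling cannot do is look at latency or errors, which is exactly the gap tail-based sampling fills.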
One moment stands out: a customer reported occasional timeouts for “checkout.” Logs showed success; metrics showed P99 latency creeping up. The trace revealed a single external call taking 15 seconds in one service, only when the user had a large number of saved payment methods. The fix was adding a timeout and cache. Tracing turned a vague complaint into a targeted fix.
Getting started: workflow and mental model
1) Decide what to trace
- External ingress points (HTTP/gRPC handlers, message consumers).
- Calls to databases and caches.
- Calls to other services.
- Any async boundaries (queues, streams).
- Expensive computations.
2) Set up your backend
- Start with Jaeger (local or Docker) for development.
- In production, consider a collector (OTel Collector) to buffer and process spans before sending to a backend (Jaeger/Tempo/Zipkin or SaaS).
- Plan retention and storage. Traces can be large; you’ll want TTLs and sampling.
3) Instrument your code
- Add OpenTelemetry SDK and auto-instrumentation libraries for your language.
- Ensure context propagation works for all outbound and inbound calls.
- Add manual spans where auto-instrumentation misses business logic.
- Record attributes that help debugging, not every possible field.
4) Integrate with logs and metrics
- Inject trace and span IDs into your logs (structured logging).
- Add span attributes that align with metric labels for correlation.
- Use the same resource labels across signals for consistent grouping.
5) Iterate and tune
- Review your sampling policy.
- Monitor tracing system resource usage.
- Periodically audit span names and attributes to keep them meaningful.
A minimal workflow in your repo might look like:
tracing-demo/
├── collector/
│   └── config.yaml
├── services/
│   ├── orders/
│   │   ├── app.py
│   │   └── Dockerfile
│   └── payments/
│       ├── app.py
│       └── Dockerfile
├── docker-compose.yml
└── README.md
In the README, document:
- How to run locally.
- Which environment variables control endpoints and sampling.
- How to view traces (Jaeger UI link).
- How to add a new span or attribute.
What makes distributed tracing stand out
- End-to-end visibility: Fewer finger-pointing sessions; you can see the whole path.
- Developer experience: Traces provide a timeline view that’s intuitive for debugging.
- Maintainability: With OpenTelemetry, you instrument once and keep backend flexibility.
- Actionable insights: Slow spans and error paths jump out; you can set alerts on specific attributes.
Compared to ad hoc logging, tracing saves hours. Compared to metrics, it saves guesswork. The combination is powerful: metrics tell you there’s a fire, logs tell you what burned, and tracing tells you where it started and how it spread.
Free learning resources
- OpenTelemetry documentation: https://opentelemetry.io/docs/
  - Authoritative, language-agnostic reference for concepts and SDKs.
- Jaeger documentation: https://www.jaegertracing.io/docs/
  - Great for a quick start with a backend and understanding trace storage.
- OpenTelemetry Python examples: https://github.com/open-telemetry/opentelemetry-python
  - Real-world code you can adapt for auto-instrumentation and manual spans.
- CNCF Distributed Tracing WG: https://contribute.cncf.io/about-groups/working-groups/distributed-tracing/
  - Background on standards and ecosystem.
For OpenTelemetry itself, https://opentelemetry.io is the canonical source; for Jaeger, https://jaegertracing.io is the official site.
Summary: who should use it and who might skip it
Use distributed tracing if:
- You run microservices or event-driven systems.
- You care about latency, reliability, and fast debugging.
- You already have metrics and logs but need the connective tissue.
Consider skipping or deferring if:
- Your system is a single, simple application with minimal internal calls.
- You have extreme constraints on resource usage or data privacy that prevent storing request data.
- Your team lacks the bandwidth to maintain a tracing backend; in that case, start with a managed SaaS or defer until you can support it.
Distributed tracing is a practical, high-signal tool. It won’t fix your code, but it will show you where to fix it. And when you’re staring at a production incident, a good trace is the calmest voice in the room.




