Monitoring Microservices with Distributed Tracing
Modern microservice architectures create complex request flows that are hard to debug without end-to-end visibility.

When you move from a single application to a microservice architecture, something that used to be a simple stack trace becomes a scattered conversation between services. A single user request might hop through an API gateway, an authentication service, a billing service, and a notification service before it returns. If that request slows down or fails, where do you even start? Logging tells you what happened at each service in isolation. Metrics tell you that something is wrong in aggregate. But tracing tells you the story of that one request, service call by service call, including the time spent in each hop and the context passed along the way.
I’ve been in war rooms where the dashboard screams “P99 latency is up,” but logs show everything “succeeded.” The bottleneck was a third-party API call that took 30 seconds inside one service, and we only found it by piecing together a trace that spanned five different teams. Distributed tracing isn’t a luxury in microservices; it’s the difference between guessing and knowing.
In this post, I’ll walk through why distributed tracing matters, how it works in practice, and how to add it to a realistic microservice setup. We’ll look at concrete patterns, code you can adapt, and tradeoffs I’ve learned the hard way.
Where distributed tracing fits today
Distributed tracing has matured from an academic concept to a production standard. The Cloud Native Computing Foundation’s OpenTelemetry project unified many previous efforts (like OpenTracing and OpenCensus) into a vendor-neutral set of APIs, SDKs, and protocols. That means you can instrument your code once and send traces to multiple backends, such as Jaeger, Tempo, Zipkin, or commercial platforms like Honeycomb, Datadog, or New Relic.
In real-world projects, tracing is used alongside logs and metrics to form a modern observability triad. It’s especially valuable in:
- Microservice systems with deep call graphs.
- Event-driven architectures using queues (Kafka, RabbitMQ).
- Serverless functions where you don’t control the runtime but still need request context.
- Mobile and web frontends that participate in the request path.
Who typically uses tracing? Platform teams provide the instrumentation libraries and collectors, while product teams add instrumentation to their services. SREs use traces to detect latency regressions; developers use them to debug specific user issues; support teams use them to validate reports from customers.
Compared to alternatives:
- Logging gives you event records; you can embed a correlation ID and grep for it, but reconstructing timelines is manual and brittle.
- Metrics give you aggregates; you’ll know if average latency increases but not which path caused it.
- Profiling shows you CPU/memory hotspots within a single process; tracing shows cross-service timing and propagation.
Tracing is not a replacement for the others; it’s a complement. If you only have logs and metrics, you’ll often know something is wrong and roughly what, but not why or where exactly.
Core concepts and capabilities
Distributed tracing is about recording a causal path through your system. Here are the essentials:
- Spans: A unit of work, like an HTTP handler, a database query, or a message processing loop. Spans have a start time, end time, and attributes (key-value pairs).
- Traces: A collection of spans forming a single request path. Every span belongs to a trace. A trace has a unique trace ID.
- Context propagation: How you pass trace and span IDs between services, typically via HTTP headers (traceparent, tracestate) or message metadata. Without propagation, you lose the thread between services.
- Sampling: Because tracing can be voluminous, you often sample a percentage of requests. Head-based sampling decides at the entry point; tail-based sampling decides after the full trace is complete (useful for capturing only slow or errored traces).
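To make context propagation concrete, here is the W3C traceparent header pulled apart with plain Python. This is a sketch for intuition only; in real services the OpenTelemetry propagator parses and injects this header for you. The header value is the example from the W3C Trace Context specification.

```python
# A traceparent header has four dash-separated hex fields:
# version, 16-byte trace ID, 8-byte parent span ID, and flags.
def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,   # shared by every span in the trace
        "span_id": span_id,     # the caller's span, i.e. our parent
        "sampled": bool(int(flags, 16) & 0x01),
    }

# Example value from the W3C Trace Context specification
ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

A downstream service reuses the trace ID, records the parent span ID as its parent, and honors the sampled flag so the whole trace is kept or dropped consistently.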
OpenTelemetry is the most common way to instrument modern systems. In code, you create spans around operations and propagate context automatically with instrumented libraries. In a microservice, you’ll often have:
- An entry service that receives external traffic.
- Mid-tier services that call each other via HTTP or gRPC.
- Downstream calls to databases, caches, or message queues.
A simple trace timeline might look like:
- Service A receives a GET /orders/{id} request.
  - Span: HTTP handler
  - Span: DB query
- Service A calls Service B via HTTP POST /payments/verify.
  - Span: HTTP client call
- Service B does work.
  - Span: DB query
  - Span: External API call
The trace viewer shows you a Gantt chart of these spans, highlighting latencies and attributes.
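As a mental model for that Gantt chart, a trace is just a tree of timed spans. The following toy (it is not the OpenTelemetry API, and the class and function names are invented for illustration) records wall-clock time per span and renders the indented rows a trace viewer draws:

```python
import time
from dataclasses import dataclass, field

@dataclass
class ToySpan:
    name: str
    children: list = field(default_factory=list)
    start: float = 0.0
    end: float = 0.0

    def timed(self, fn):
        # Run fn inside this span, recording wall-clock start/end
        self.start = time.monotonic()
        result = fn()
        self.end = time.monotonic()
        return result

def render(span: ToySpan, depth: int = 0) -> list:
    # Depth-first walk produces the indented rows of a trace timeline
    rows = ["  " * depth + f"{span.name} ({(span.end - span.start) * 1000:.1f} ms)"]
    for child in span.children:
        rows.extend(render(child, depth + 1))
    return rows

root = ToySpan("GET /orders/{id}")
db = ToySpan("db.query")
root.children.append(db)
root.timed(lambda: db.timed(lambda: time.sleep(0.01)))
rows = render(root)
# rows[0] is the handler span; rows[1] is the nested, indented DB span
```

The real SDK adds what the toy omits: IDs, attributes, context propagation, and export, but the tree-of-timed-intervals shape is the same.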
A realistic setup: Python services with OpenTelemetry
Let’s build a small example with two services and a Jaeger backend. We’ll simulate a user placing an order, which calls an order service and then a payment service.
Project structure:
tracing-demo/
├── docker-compose.yml
├── jaeger/
│   └── config/ (optional)
├── orders/
│   ├── Dockerfile
│   ├── requirements.txt
│   └── app.py
└── payments/
    ├── Dockerfile
    ├── requirements.txt
    └── app.py
We’ll use Flask for simplicity, but the patterns apply to FastAPI, Django, or any framework.
1) Dependencies
requirements.txt for both services:
flask==3.0.0
requests==2.31.0
opentelemetry-api==1.21.0
opentelemetry-sdk==1.21.0
opentelemetry-instrumentation-flask==0.42b0
opentelemetry-instrumentation-requests==0.42b0
opentelemetry-exporter-jaeger==1.21.0
Note: OpenTelemetry versions move quickly; pin to what works in your environment. In particular, 1.21.0 was the last release of the Jaeger Thrift exporter package; later releases expect you to export OTLP instead (Jaeger ingests OTLP natively since v1.35).
2) Orders service (entry point)
orders/app.py:
import os

import requests
from flask import Flask, jsonify
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize OpenTelemetry
resource = Resource.create({"service.name": "orders"})
provider = TracerProvider(resource=resource)
jaeger_endpoint = os.getenv("JAEGER_ENDPOINT", "http://jaeger:14268/api/traces")
jaeger_exporter = JaegerExporter(collector_endpoint=jaeger_endpoint)
provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
trace.set_tracer_provider(provider)

# Instrument libraries
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

app = Flask(__name__)
tracer = trace.get_tracer(__name__)

@app.route("/orders/<order_id>", methods=["GET"])
def get_order(order_id):
    # A parent span is created automatically by the Flask instrumentation
    with tracer.start_as_current_span("db.query") as span:
        span.set_attributes({
            "db.system": "postgres",
            "db.statement": "SELECT * FROM orders WHERE id = :id",
            "order.id": order_id,
        })
        # Simulate DB work; in real code, you would run the query here

    # Call the payment service; the requests instrumentation
    # auto-injects trace context headers into outgoing calls
    with tracer.start_as_current_span("call_payment_service"):
        payment_url = f"http://payments:5001/payments/verify/{order_id}"
        response = requests.get(payment_url, timeout=5)

    if response.status_code != 200:
        return jsonify({"error": "payment service failed"}), 500
    return jsonify({"order_id": order_id, "status": "processed"}), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
Key points:
- The Flask and Requests instrumentation automatically creates spans and propagates trace context using W3C Trace Context headers (traceparent).
- We added a manual span around a DB query to show how to record attributes and timing explicitly.
- The Jaeger exporter sends spans to Jaeger. In production, you may use OTLP (OpenTelemetry Protocol) to send to an OpenTelemetry Collector.
3) Payments service
payments/app.py:
import os

from flask import Flask, jsonify
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize OpenTelemetry
resource = Resource.create({"service.name": "payments"})
provider = TracerProvider(resource=resource)
jaeger_endpoint = os.getenv("JAEGER_ENDPOINT", "http://jaeger:14268/api/traces")
jaeger_exporter = JaegerExporter(collector_endpoint=jaeger_endpoint)
provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
trace.set_tracer_provider(provider)

FlaskInstrumentor().instrument()

app = Flask(__name__)
tracer = trace.get_tracer(__name__)

@app.route("/payments/verify/<order_id>", methods=["GET"])
def verify_payment(order_id):
    with tracer.start_as_current_span("db.query") as span:
        span.set_attributes({
            "db.system": "postgres",
            "db.statement": "SELECT * FROM payments WHERE order_id = :order_id",
            "order.id": order_id,
        })
        # Simulate DB work

    with tracer.start_as_current_span("external_payment_gateway") as span:
        # Simulate an external API call; in real code, call the
        # gateway and record status codes and errors
        span.set_attribute("http.url", "https://example-gateway.com/verify")

    return jsonify({"order_id": order_id, "verified": True}), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5001)
4) Docker Compose
docker-compose.yml:
version: "3.8"
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # UI
      - "14268:14268"   # HTTP collector
      - "6831:6831/udp" # UDP agent (optional)
  orders:
    build: ./orders
    environment:
      - JAEGER_ENDPOINT=http://jaeger:14268/api/traces
    ports:
      - "5000:5000"
    depends_on:
      - jaeger
  payments:
    build: ./payments
    environment:
      - JAEGER_ENDPOINT=http://jaeger:14268/api/traces
    ports:
      - "5001:5001"
    depends_on:
      - jaeger
orders/Dockerfile and payments/Dockerfile are straightforward:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "app.py"]
Run:
docker-compose up --build
Open Jaeger UI at http://localhost:16686. Trigger a request:
curl http://localhost:5000/orders/123
You should see a trace with spans for DB queries and the call from orders to payments.
What real-world instrumentation looks like
The example above is minimal. In production, you’ll need to handle a few additional concerns:
Context propagation beyond HTTP
If you use gRPC, the OpenTelemetry gRPC instrumentation propagates trace context automatically. If you use Kafka, you’ll propagate trace metadata in message headers. Here’s a conceptual example in Python using Kafka:
from opentelemetry import trace
from opentelemetry.propagate import inject
from kafka import KafkaProducer

tracer = trace.get_tracer(__name__)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

def send_order_event(order_id: str):
    with tracer.start_as_current_span("produce_order_event") as span:
        span.set_attribute("messaging.system", "kafka")
        span.set_attribute("messaging.destination", "orders")
        span.set_attribute("messaging.operation", "publish")
        # Inject trace context into a dict carrier, then convert it to
        # the list-of-tuples header format kafka-python expects
        carrier = {}
        inject(carrier)
        headers = [(key, value.encode()) for key, value in carrier.items()]
        producer.send("orders", key=order_id.encode(), value=b"order_created", headers=headers)
        producer.flush()
On the consumer side, extract context from headers and continue the trace:
from opentelemetry import trace
from opentelemetry.propagate import extract
from kafka import KafkaConsumer

tracer = trace.get_tracer(__name__)
consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092")

for msg in consumer:
    # kafka-python delivers headers as (key, bytes) tuples; rebuild a
    # dict carrier before extracting the trace context
    carrier = {key: value.decode() for key, value in (msg.headers or [])}
    ctx = extract(carrier)
    with tracer.start_as_current_span("consume_order_event", context=ctx) as span:
        span.set_attribute("messaging.system", "kafka")
        span.set_attribute("messaging.destination", "orders")
        span.set_attribute("messaging.operation", "process")
        # Process the message
Error handling and span status
Set span status to reflect errors, and record exceptions. In OpenTelemetry Python:
from opentelemetry.trace import Status, StatusCode

with tracer.start_as_current_span("call_payment_service") as span:
    try:
        response = requests.get(payment_url, timeout=5)
        response.raise_for_status()
    except requests.RequestException as e:
        span.set_status(Status(StatusCode.ERROR, str(e)))
        span.record_exception(e)
        raise
This makes error traces easier to find and filter.
Sampling strategies
Head-based sampling is simpler and reduces volume, but it may drop slow or error-prone traces you care about. Tail-based sampling solves that by evaluating full traces before deciding. In production, many teams run an OpenTelemetry Collector to handle sampling and routing. A minimal collector config (otel-collector-config.yaml) might look like:
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 500}

exporters:
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [jaeger]
You would deploy the collector alongside your services and point your SDK exporter at it over OTLP. This decouples instrumentation from backend details and allows centralized sampling policies. Two caveats: the tail_sampling processor ships in the Collector's contrib distribution, and recent Collector releases have dropped the dedicated jaeger exporter in favor of sending OTLP straight to Jaeger, which accepts it natively since v1.35.
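Pointing the SDKs at a collector does not have to mean code changes: OpenTelemetry standardizes configuration through environment variables. A sketch (the collector hostname and port are assumptions matching a typical compose setup; the variable names themselves are from the OpenTelemetry specification):

```shell
# Send OTLP to a local collector instead of exporting to Jaeger directly
export OTEL_SERVICE_NAME=orders
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317

# Head-sample 25% of root traces; child spans follow the parent's decision
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.25
```

Keeping endpoints and sampling in the environment means the same image runs in dev (100% sampling, local Jaeger) and prod (ratio sampling, central collector) without rebuilds.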
Strengths, weaknesses, and tradeoffs
Strengths
- Pinpoint latency and failures across services in one view.
- Provide context for logs and metrics via trace IDs and span attributes.
- Standardized via OpenTelemetry, reducing vendor lock-in.
- Tail-based sampling captures the “interesting” traces without drowning in volume.
Weaknesses
- Overhead: Tracing adds CPU, memory, and network costs. Sampling helps, but always test.
- Noise: Over-instrumenting can produce clutter and cost. Start with automatic instrumentation, then add manual spans where it matters.
- Data privacy: Be careful not to record PII in span attributes or logs. Use redaction at the collector level.
- Operational complexity: You need to run and maintain a tracing backend or pay for a SaaS. Jaeger is easy to start with but requires care at scale.
When it’s a good choice
- Microservice architectures with cross-service requests.
- Event-driven systems where context must be propagated through messages.
- Teams that need actionable visibility to ship faster and reduce MTTR.
When it’s less necessary
- Single monolith with minimal internal calls; logs and metrics might suffice.
- Extremely resource-constrained environments where tracing overhead is unacceptable (though you can sample aggressively).
- Scenarios where data sensitivity prevents any storage of request data.
Personal experience: lessons learned
A few things I learned the hard way:
- Start with automatic instrumentation. It gives you immediate spans for HTTP, database calls, and common libraries. Manual spans should target critical business logic and slow paths. Over-instrumenting early makes traces noisy.
- Don’t forget context propagation across asynchronous boundaries. The most common gap I’ve seen is message queues; if you don’t propagate headers, your trace breaks. A missing trace context in Kafka is like losing a breadcrumb trail.
- Choose the right sampling. Early on, we sampled 1% and missed rare errors. We switched to tail-based sampling (via a collector) to keep volume low while capturing slow and failed traces. That single change dramatically improved debugging speed.
- Attributes matter. Record high-cardinality attributes carefully (e.g., user IDs or order IDs) because they can affect backend storage costs. Keep attributes bounded and meaningful.
- Treat tracing as part of your domain model. Name spans after what the code does (“db.query,” “call_payment_service,” “process_kafka_message”) rather than generic labels. Future you will thank past you.
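The head-based ratio decision mentioned in the sampling lesson above fits in a few lines. This sketch uses a hash as a stand-in for how OpenTelemetry's TraceIdRatioBased sampler compares trace-ID bits against a threshold; the function name is invented for illustration:

```python
import hashlib

def head_sample(trace_id: str, rate: float) -> bool:
    # Hash the trace ID into [0, 1) and compare against the sampling rate.
    # Every service computes the same answer for the same trace ID, so a
    # trace is either kept whole or dropped whole -- never half-sampled.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) / 0xFFFFFFFF
    return bucket < rate
```

The key property is determinism per trace ID, not randomness per service; that is what keeps a sampled trace intact across hops. What head sampling cannot do is look at latency or errors, which is exactly the gap tail-based sampling fills.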
One moment stands out: a customer reported occasional timeouts for “checkout.” Logs showed success; metrics showed P99 latency creeping up. The trace revealed a single external call taking 15 seconds in one service, only when the user had a large number of saved payment methods. The fix was adding a timeout and cache. Tracing turned a vague complaint into a targeted fix.
Getting started: workflow and mental model
1) Decide what to trace
- External ingress points (HTTP/gRPC handlers, message consumers).
- Calls to databases and caches.
- Calls to other services.
- Any async boundaries (queues, streams).
- Expensive computations.
2) Set up your backend
- Start with Jaeger (local or Docker) for development.
- In production, consider a collector (OTel Collector) to buffer and process spans before sending to a backend (Jaeger/Tempo/Zipkin or SaaS).
- Plan retention and storage. Traces can be large; you’ll want TTLs and sampling.
3) Instrument your code
- Add OpenTelemetry SDK and auto-instrumentation libraries for your language.
- Ensure context propagation works for all outbound and inbound calls.
- Add manual spans where auto-instrumentation misses business logic.
- Record attributes that help debugging, not every possible field.
4) Integrate with logs and metrics
- Inject trace and span IDs into your logs (structured logging).
- Add span attributes that align with metric labels for correlation.
- Use the same resource labels across signals for consistent grouping.
5) Iterate and tune
- Review your sampling policy.
- Monitor tracing system resource usage.
- Periodically audit span names and attributes to keep them meaningful.
A minimal workflow in your repo might look like:
tracing-demo/
├── collector/
│   └── config.yaml
├── services/
│   ├── orders/
│   │   ├── app.py
│   │   └── Dockerfile
│   └── payments/
│       ├── app.py
│       └── Dockerfile
├── docker-compose.yml
└── README.md
In the README, document:
- How to run locally.
- Which environment variables control endpoints and sampling.
- How to view traces (Jaeger UI link).
- How to add a new span or attribute.
What makes distributed tracing stand out
- End-to-end visibility: Fewer finger-pointing sessions; you can see the whole path.
- Developer experience: Traces provide a timeline view that’s intuitive for debugging.
- Maintainability: With OpenTelemetry, you instrument once and keep backend flexibility.
- Actionable insights: Slow spans and error paths jump out; you can set alerts on specific attributes.
Compared to ad hoc logging, tracing saves hours. Compared to metrics, it saves guesswork. The combination is powerful: metrics tell you there’s a fire, logs tell you what burned, and tracing tells you where it started and how it spread.
Free learning resources
- OpenTelemetry documentation: https://opentelemetry.io/docs/
  - Authoritative, language-agnostic reference for concepts and SDKs.
- Jaeger documentation: https://www.jaegertracing.io/docs/
  - Great for a quick start with a backend and understanding trace storage.
- OpenTelemetry Python examples: https://github.com/open-telemetry/opentelemetry-python
  - Real-world code you can adapt for auto-instrumentation and manual spans.
- CNCF Distributed Tracing WG: https://contribute.cncf.io/about-groups/working-groups/distributed-tracing/
  - Background on standards and ecosystem.
For OpenTelemetry itself, https://opentelemetry.io is the canonical source; for Jaeger, https://jaegertracing.io is the official site.
Summary: who should use it and who might skip it
Use distributed tracing if:
- You run microservices or event-driven systems.
- You care about latency, reliability, and fast debugging.
- You already have metrics and logs but need the connective tissue.
Consider skipping or deferring if:
- Your system is a single, simple application with minimal internal calls.
- You have extreme constraints on resource usage or data privacy that prevent storing request data.
- Your team lacks the bandwidth to maintain a tracing backend; in that case, start with a managed SaaS or defer until you can support it.
Distributed tracing is a practical, high-signal tool. It won’t fix your code, but it will show you where to fix it. And when you’re staring at a production incident, a good trace is the calmest voice in the room.




