Making the Invisible Visible
It’s Friday afternoon. Everything seems fine — until it isn’t. A flood of customer complaints suddenly hits support: orders are timing out. Your dashboards show healthy CPU, low memory, and “green” service statuses. The logs? A maze of JSON lines and timestamps, none of which point to the real issue.
You know something is wrong — you just can’t see where.
This is where observability steps in. It’s the difference between hoping the system works and actually understanding it. Observability gives your system a voice — the ability to explain itself through metrics, logs, and traces.
To make that possible, your code needs a way to emit this data — that’s what instrumentation does. It’s the act of adding sensors to your system, either manually or automatically, so every request, error, and dependency can be measured and observed. Without instrumentation, observability would be like listening for a heartbeat without a stethoscope.
The Three Pillars of Observability
Observability rests on three complementary pillars. Each provides a unique view into your system’s health and behavior — and together, they transform guesswork into clarity.
1. Metrics — The System’s Vital Signs
Metrics are numbers that represent the state of your system over time. They show patterns: how many requests per second, average latency, memory usage, error rates.
They’re lightweight, structured, and great for detecting trends. A sudden spike in request duration or a drop in throughput tells you something’s off.
In the JVM and Spring ecosystem, metrics are often collected using Micrometer, a dimensional metrics library for JVM-based applications. Think of it as SLF4J for metrics: an abstraction that lets you instrument your code once and send those metrics to any backend, such as Prometheus, Datadog, or CloudWatch.
Micrometer comes built into Spring Boot Actuator, meaning your applications are automatically instrumented to report vital metrics without extra setup. Many third-party libraries also use Micrometer, giving you end-to-end visibility across your entire stack.
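As a rough illustration, here is a minimal sketch of custom Micrometer metrics. The class and metric names (`OrderMetrics`, `orders.placed`, `payment.call.duration`) are hypothetical; the `MeterRegistry` is the one Spring Boot Actuator auto-configures.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

import java.util.function.Supplier;

// Hypothetical helper: the metric names and tags are illustrative, not a fixed convention
@Component
public class OrderMetrics {

    private final MeterRegistry registry;

    public OrderMetrics(MeterRegistry registry) {
        this.registry = registry; // injected by Spring Boot Actuator's auto-configuration
    }

    public void countOrderPlaced(String status) {
        // Dimensional counter: one metric name, sliced by a "status" tag
        registry.counter("orders.placed", "status", status).increment();
    }

    public <T> T timePaymentCall(Supplier<T> call) {
        // Records how long each payment call takes (count, total time, max)
        return Timer.builder("payment.call.duration")
                .description("Latency of calls to the payment API")
                .register(registry)
                .record(call);
    }
}
```

Because the metrics are dimensional, the same counter can later be filtered by its status tag in Prometheus, Datadog, or any other backend.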
But metrics answer what is happening — not why.
2. Logs — The Narrative of Events
Logs are the system’s diary. They record discrete events: a login attempt, a failed API call, a timeout exception.
They’re detailed and descriptive — useful for understanding context. When you correlate logs with metrics, you can move from “The latency increased” to “The latency increased because the payment API threw repeated timeouts.”
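As an illustration, such a correlatable log event might be emitted like this. `PaymentClient` and the key-value fields are hypothetical names; the snippet assumes plain SLF4J, which Spring Boot ships by default.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PaymentClient {

    private static final Logger log = LoggerFactory.getLogger(PaymentClient.class);

    public void charge(String orderId, long amountCents) {
        try {
            callPaymentApi(orderId, amountCents); // hypothetical downstream HTTP call
        } catch (RuntimeException e) {
            // Key-value context turns a vague "something failed" into an event
            // you can correlate with metrics and traces for this specific order
            log.error("Payment API call failed orderId={} amountCents={}", orderId, amountCents, e);
            throw e;
        }
        log.info("Payment accepted orderId={} amountCents={}", orderId, amountCents);
    }

    private void callPaymentApi(String orderId, long amountCents) {
        // placeholder for the real payment integration
    }
}
```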
Logs tell the story, but without a connection to the request that spans multiple services, they can become noise. That’s where tracing comes in.
3. Traces — The Journey of a Request
Traces reveal how a single request travels through your distributed system — across microservices, queues, and databases.
A trace shows the complete journey of one operation from start to finish. Each stop along the way — every API call, query, or function — is recorded as a span.
Traces answer where and why something went wrong. They’re the bridge between metrics and logs, showing you the full path that led to the problem.
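When you need a span around your own code, one option is Micrometer’s Observation API (auto-configured in Spring Boot 3): it records a timer and, if a tracing bridge such as Micrometer Tracing is on the classpath, a span as well. The sketch below is illustrative; `OrderService` and the observation name `order.process` are made-up examples.

```java
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    private final ObservationRegistry observations;

    public OrderService(ObservationRegistry observations) {
        this.observations = observations; // auto-configured by Spring Boot 3
    }

    public void process(String orderId) {
        Observation.createNotStarted("order.process", observations)
                .lowCardinalityKeyValue("order.type", "standard") // becomes a tag on metric and span
                .highCardinalityKeyValue("order.id", orderId)     // attached to the span only
                .observe(() -> {
                    // The business logic runs inside the observation,
                    // so its duration is measured and a span wraps it
                    chargeAndShip(orderId);
                });
    }

    private void chargeAndShip(String orderId) {
        // placeholder for the real work
    }
}
```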
The Relationship Between a Trace and Spans
Think of a trace as the entire story of a request. Each span is a chapter in that story — a smaller operation that contributes to the whole.
A trace = a collection of spans, connected through parent-child relationships.
A Concrete Example
Imagine your system has three services:
- Client Service – receives the user’s request.
- Order Service – processes the order details.
- Payment Service – charges the user’s credit card.
Now, a user places an order. Behind the scenes, this happens:
Step 1: The Client Service receives the request
- A trace is created for this request.
- The root span begins: `client /placeOrder`.
- Timing starts when the request arrives and stops when the response is sent.
Step 2: The Client Service calls the Order Service
- A child span is created: `HTTP POST /orders`.
- The trace ID links it back to the root span.
- The Order Service adds its own span to the same trace.
Step 3: The Order Service calls the Payment Service
- Another span appears: `call PaymentService`.
- The Payment Service adds its own internal span: `processPayment`.
Each span includes:
- Start and end timestamps
- Metadata (like HTTP method, URL, or response code)
- The shared trace ID
- A parent span ID, defining relationships
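To make those pieces concrete in code, here is a rough sketch using the OpenTelemetry Java API, assuming an OpenTelemetry SDK has been configured. In a typical Spring Boot setup, auto-instrumentation creates the HTTP spans for you, so the manual calls below mirror the example purely for illustration.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class OrderHandler {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("order-service");

    public void handleOrder() {
        // Span for the incoming request; with context propagation in place,
        // it joins the trace started upstream by the Client Service
        Span orderSpan = tracer.spanBuilder("HTTP POST /orders")
                .setSpanKind(SpanKind.SERVER)
                .startSpan();
        orderSpan.setAttribute("http.method", "POST");

        try (Scope ignored = orderSpan.makeCurrent()) {
            // Because orderSpan is "current", this new span automatically
            // becomes its child and shares the same trace ID
            Span paymentSpan = tracer.spanBuilder("call PaymentService")
                    .setSpanKind(SpanKind.CLIENT)
                    .startSpan();
            try (Scope alsoIgnored = paymentSpan.makeCurrent()) {
                // ... call the Payment Service over HTTP ...
            } finally {
                paymentSpan.end(); // records the end timestamp of the child span
            }
        } finally {
            orderSpan.end();
        }
    }
}
```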
Visualized, the trace looks like this:
```
Trace ID: 9afc1234...
├── client /placeOrder (root span) [120ms]
│   ├── HTTP POST /orders [85ms]
│   │   └── call PaymentService [40ms]
│   │       └── processPayment [20ms]
│   └── sendResponseToUser [5ms]
```
Why It Matters
If the Payment Service slows down, that specific span lights up as the bottleneck. You instantly know where the delay occurred — without scanning thousands of logs or guessing which service misbehaved. This is the power of distributed tracing: it makes invisible latency visible.
Summary
Metrics, logs, and traces aren’t competitors — they’re collaborators.
- Metrics show you what is happening.
- Logs tell you what happened.
- Traces show you where and why.
| Concept | Description | Example |
| --- | --- | --- |
| Trace | The entire journey of one request across services | “Place order” request from start to finish |
| Span | A single operation within that journey | HTTP call to `PaymentService`, `processOrder` |
| Trace ID | Shared ID linking all spans | Same across all related spans |
| Parent-Child Relationship | Shows which operation called which | client → order → payment |
Happy coding! 💻