System Design for Software Engineers

Monitoring, Logging, and Tracing (Observability)

In a monolith, you can check one log file and one set of CPU/RAM metrics to see what's wrong. In microservices, a single user request might travel through 10 different services across 50 containers. If the request fails, where did it go wrong? Which service was slow?

Observability is the ability to understand the internal state of a system based on its external outputs. It is built on three pillars:

1. Metrics (Monitoring)

Quantitative data about your system over time.

Examples: CPU usage, requests per second, error rate, latency.
Tools: Prometheus, Grafana, Datadog.
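As a rough illustration of what "quantitative data over time" means in code, the sketch below keeps request counts and latency samples in memory. The `MetricsRegistry` class and its method names are hypothetical, invented for this example; a real service would use a client library for a backend such as Prometheus or Datadog instead.

```python
from collections import defaultdict


class MetricsRegistry:
    """Toy in-memory registry (hypothetical); real systems export to a metrics backend."""

    def __init__(self):
        self.counters = defaultdict(int)        # counts that only ever go up
        self.latencies_ms = defaultdict(list)   # raw latency samples per endpoint

    def observe_request(self, endpoint, latency_ms, ok):
        # Record one finished request: bump counters and keep its latency.
        self.counters[(endpoint, "requests")] += 1
        if not ok:
            self.counters[(endpoint, "errors")] += 1
        self.latencies_ms[endpoint].append(latency_ms)

    def error_rate(self, endpoint):
        total = self.counters[(endpoint, "requests")]
        return self.counters[(endpoint, "errors")] / total if total else 0.0


registry = MetricsRegistry()
registry.observe_request("/orders", 12.0, ok=True)
registry.observe_request("/orders", 250.0, ok=False)
registry.observe_request("/orders", 15.0, ok=True)
registry.observe_request("/orders", 11.0, ok=True)
print(registry.error_rate("/orders"))  # 0.25 (1 failure out of 4 requests)
```

The point of the sketch is that metrics are cheap aggregates: they say how often and how slow, not what happened to any single request.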

2. Logging

Discrete records of events that happened in your code.

The Challenge: Logs are scattered across hundreds of containers, so they must be shipped to one central store where they can be searched together. This is called Centralized Logging.
Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Fluentd, Splunk.
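Centralized logging tends to work best when each log line is structured, so the central store can index its fields. Below is a minimal sketch using Python's standard `logging` module. The `JsonFormatter` class and the field names `service` and `trace_id` are assumptions made for this example, not a required schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (hypothetical field names)."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        })


stream = io.StringIO()  # stand-in for stdout, which a log shipper would collect
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("inventory")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output out of the root logger

# Fields passed via `extra` become attributes on the log record.
logger.info("stock lookup finished", extra={"service": "inventory", "trace_id": "abc123"})

entry = json.loads(stream.getvalue())
print(entry["trace_id"])  # abc123
```

Because every line carries the same fields, a query such as "all lines for one trace_id" becomes a simple filter in the central store.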

3. Distributed Tracing

Tracks a single request as it moves through the entire system.

Correlation IDs: Every request is assigned a unique trace_id at the entry point (API Gateway). This ID is passed to every downstream service in the headers.
Tools: Jaeger, Zipkin, AWS X-Ray.
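The propagation idea can be sketched with plain function calls standing in for network hops. The header name `X-Trace-Id` and the three service functions are hypothetical; real tracing systems define their own header formats and also record timing for each hop.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, chosen for this sketch
seen = []  # (service, trace_id) pairs, recorded so the propagation can be inspected


def api_gateway(incoming_headers):
    headers = dict(incoming_headers)
    # Assign a trace_id at the entry point, unless the caller already sent one.
    headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    seen.append(("api-gateway", headers[TRACE_HEADER]))
    # Pass the same headers to every downstream call.
    auth_service(headers)
    inventory_service(headers)
    return headers[TRACE_HEADER]


def auth_service(headers):
    seen.append(("auth", headers[TRACE_HEADER]))


def inventory_service(headers):
    seen.append(("inventory", headers[TRACE_HEADER]))


trace_id = api_gateway({})
print(seen)  # three entries, all carrying the same trace_id
```

If any service forgets to forward the header, the trace breaks at that hop, which is why propagation is usually handled by shared middleware rather than by hand.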

Debugging a Slow Request with Tracing

Step 1: A Grafana dashboard shows that the 99th percentile latency for 'Place Order' has spiked to 5 seconds.

Step 2: Open Jaeger and search for traces with operation=place_order and duration > 4s.

Step 3: Jaeger displays a timeline. You see that the API Gateway took 10ms, the Auth Service took 50ms, and the Inventory Service took 4.5 seconds.

Step 4: Use the trace_id from the slow trace to search your centralized logs (Elasticsearch). You filter logs from the Inventory Service with that ID.

Step 5: The logs show that the Inventory Service was performing a sequential scan on a large table because an index was missing. You've found the bug!
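The walk-through above can be mimicked on in-memory data: pick the slowest span in a trace, then filter logs by that span's trace_id and service. The span durations come from the scenario; the trace_id value `t-42`, the dictionary layout, and the log messages are invented for illustration and do not reflect the Jaeger or Elasticsearch data formats.

```python
# Per-service time for one slow 'place_order' trace (values from the scenario).
spans = [
    {"trace_id": "t-42", "service": "api-gateway", "duration_ms": 10},
    {"trace_id": "t-42", "service": "auth", "duration_ms": 50},
    {"trace_id": "t-42", "service": "inventory", "duration_ms": 4500},
]

# Hypothetical centralized log entries from several requests and services.
logs = [
    {"trace_id": "t-42", "service": "auth", "message": "token verified"},
    {"trace_id": "t-41", "service": "inventory", "message": "stock lookup ok"},
    {"trace_id": "t-42", "service": "inventory", "message": "sequential scan on large table"},
]

# Tracing answers "where": the span with the largest duration.
slowest = max(spans, key=lambda span: span["duration_ms"])

# Logging answers "why": only lines from that service for that trace.
matching = [
    entry["message"]
    for entry in logs
    if entry["trace_id"] == slowest["trace_id"] and entry["service"] == slowest["service"]
]

print(slowest["service"])  # inventory
print(matching)            # ['sequential scan on large table']
```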

The Four Golden Signals

Google's SRE book recommends monitoring these four metrics for every service:

Latency: Time it takes to service a request.
Traffic: How much demand is being placed on your system.
Errors: The rate of requests that fail.
Saturation: How 'full' your service is (e.g., CPU/Memory usage).
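As a small sketch of how the four signals might be derived from raw data, the function below summarizes one time window of requests. The function name, the record fields, the use of HTTP 5xx as the failure criterion, and CPU cores as the saturation measure are all assumptions for this example.

```python
import statistics


def golden_signals(requests, window_seconds, cpu_used, cpu_capacity):
    """Summarize one time window of requests into the four signals (hypothetical helper)."""
    latencies = [r["latency_ms"] for r in requests]
    failed = sum(1 for r in requests if r["status"] >= 500)  # assumption: 5xx means failure
    return {
        "latency_ms_median": statistics.median(latencies),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failed / len(requests),
        "saturation": cpu_used / cpu_capacity,
    }


window = [
    {"latency_ms": 10, "status": 200},
    {"latency_ms": 20, "status": 200},
    {"latency_ms": 30, "status": 500},
    {"latency_ms": 40, "status": 200},
]
signals = golden_signals(window, window_seconds=2, cpu_used=3.0, cpu_capacity=4.0)
print(signals)
# {'latency_ms_median': 25.0, 'traffic_rps': 2.0, 'error_rate': 0.25, 'saturation': 0.75}
```

In practice latency is usually reported as several percentiles rather than a single median; the median is used here only to keep the sketch short.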

Common Mistakes

Log Overload: Logging every single variable on every request. This creates massive storage costs and turns finding a real error into a search for a needle in a haystack.
Ignoring the 99th Percentile (p99): Only looking at 'average' latency. An average of about 1 second could mean 90% of users see 10ms, but 10% see 10 seconds! The average describes nobody's actual experience, so always look at p95 and p99.
Manual Log Analysis: SSH-ing into individual servers to run tail -f. This is impossible to do at scale.
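To see why averages mislead, consider a hypothetical workload where 90 of 100 requests take 10ms and the other 10 take 10 seconds. The sketch below compares the mean with percentiles computed by the nearest-rank method. The `percentile` helper is written for this example and is one of several common percentile definitions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


latencies_ms = [10] * 90 + [10_000] * 10  # 90 fast requests, 10 very slow ones

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(mean_ms)  # 1009.0
print(p50)      # 10
print(p95)      # 10000
print(p99)      # 10000
```

The mean of about 1 second matches neither group: most users see 10ms, and the unlucky ones see 10 seconds, which only the high percentiles reveal.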

Recap

Metrics tell you if something is wrong.
Tracing tells you where it is wrong.
Logging tells you why it is wrong.
A Correlation ID is the thread that ties all three pillars together.

Knowledge Check

Question 1 (single choice)

Which pillar of observability is best for identifying which specific service in a chain of 10 microservices is causing a delay?

Centralized Logging

Metrics/Monitoring

Distributed Tracing

Static Code Analysis