In the world of distributed systems, understanding what’s happening across your services is both critical and challenging. As systems grow in complexity—spanning multiple services, data stores, and infrastructure components—traditional monitoring approaches fall short. This is where modern monitoring and observability practices come into play, providing the visibility needed to operate distributed systems with confidence.
This article explores the evolution from basic monitoring to comprehensive observability, providing practical guidance on implementing effective observability practices in distributed systems.
From Monitoring to Observability
While monitoring and observability are often used interchangeably, they represent different approaches to understanding system behavior.
Monitoring
Monitoring involves collecting and analyzing predefined metrics to determine system health. It answers known questions about system behavior:
- Is the service up or down?
- How much memory is being used?
- How many requests per second is the system handling?
Observability
Observability goes beyond monitoring by providing context and enabling exploration. It helps answer unanticipated questions about system behavior:
- Why is this specific user experiencing slow response times?
- What caused this unexpected spike in error rates?
- How did a change in one service affect the performance of another?
Observability is built on three pillars:
- Metrics: Numerical measurements collected over time
- Logs: Detailed records of events that occurred in the system
- Traces: Records of requests as they flow through distributed services
┌───────────────────────────────────────────────┐
│                                               │
│                 Observability                 │
│                                               │
├───────────────┬───────────────┬───────────────┤
│               │               │               │
│    Metrics    │     Logs      │    Traces     │
│               │               │               │
└───────────────┴───────────────┴───────────────┘
The Three Pillars of Observability
1. Metrics
Metrics are numerical measurements collected at regular intervals. They provide a high-level view of system behavior and performance.
Types of Metrics
- Counter: A cumulative metric that only increases (e.g., total requests)
- Gauge: A metric that can increase or decrease (e.g., memory usage)
- Histogram: Samples observations and counts them in configurable buckets (e.g., request duration)
- Summary: Similar to a histogram, but calculates configurable quantiles on the client side
Implementation Example: Prometheus Metrics in Go
package main

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Counter for total requests, labeled by method, endpoint, and status code
    requestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    // Histogram for request duration
    requestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)

// statusRecorder wraps http.ResponseWriter so the middleware can observe
// the status code the handler actually wrote.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (rec *statusRecorder) WriteHeader(code int) {
    rec.status = code
    rec.ResponseWriter.WriteHeader(code)
}

func instrumentHandler(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        // Track request duration
        start := time.Now()
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

        // Call the actual handler
        next(rec, r)

        // Record metrics after the handler completes
        duration := time.Since(start).Seconds()
        requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
        requestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rec.status)).Inc()
    }
}

func main() {
    // Example instrumented endpoint
    http.HandleFunc("/orders", instrumentHandler(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    }))

    // Expose the collected metrics for Prometheus to scrape
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}
Key Metrics for Distributed Systems
RED Metrics (for services):
- Rate: Requests per second
- Error rate: Failed requests per second
- Duration: Distribution of request latencies
USE Metrics (for resources):
- Utilization: Percentage of time the resource is busy
- Saturation: The amount of extra work the resource cannot service yet, often queued
- Errors: Count of error events
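As a sketch of how these map onto real instrumentation, the RED metrics can be expressed as PromQL queries, assuming the http_requests_total and http_request_duration_seconds metrics and labels from the Go example above:

# Rate: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Errors: failed (5xx) requests per second
sum(rate(http_requests_total{status=~"5.."}[5m]))

# Duration: 95th percentile request latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))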
2. Logs
Logs are timestamped records of discrete events that occurred in the system. They provide detailed context about specific events.
Structured Logging
In distributed systems, structured logging is essential for efficient log processing and analysis.
Implementation Example: Structured Logging in Node.js
const winston = require('winston');
const { format } = winston;

// Define the logger
const logger = winston.createLogger({
  level: 'info',
  format: format.combine(
    format.timestamp(),
    format.errors({ stack: true }),
    format.metadata(),
    format.json()
  ),
  defaultMeta: { service: 'user-service' },
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' })
  ]
});

// Example of logging in a service (userRepository is assumed to be defined elsewhere)
function createUser(userData) {
  logger.debug('Creating user', { userData });

  try {
    // User creation logic
    const user = userRepository.create(userData);

    logger.info('User created', {
      userId: user.id,
      email: user.email
    });

    return user;
  } catch (error) {
    logger.error('Failed to create user', {
      error: error.message,
      userData
    });
    throw error;
  }
}
3. Traces
Traces track requests as they flow through distributed services, providing visibility into the end-to-end request path.
Implementation Example: Distributed Tracing with OpenTelemetry in Python
from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
import requests

# Configure the tracer
resource = Resource(attributes={
    SERVICE_NAME: "order-service"
})
trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer(__name__)

# Configure the exporter (OTLP over gRPC to a collector such as Jaeger)
otlp_exporter = OTLPSpanExporter(endpoint="jaeger:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Create and instrument the Flask app
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

def get_order_from_db(order_id):
    # Placeholder for the real database lookup
    return {"id": order_id, "status": "shipped"}

def get_payment_details(order_id):
    # Placeholder call to a downstream payment service (URL is illustrative)
    return requests.get(f"http://payment-service/payments/{order_id}", timeout=2).json()

@app.route('/orders/<order_id>', methods=['GET'])
def get_order(order_id):
    with tracer.start_as_current_span("get_order_details") as span:
        span.set_attribute("order_id", order_id)

        # Get order from database
        order = get_order_from_db(order_id)

        # Get payment details
        payment = get_payment_details(order_id)

        # Combine and return
        result = {
            "order": order,
            "payment": payment
        }
        return result

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
Implementing Observability in Distributed Systems
1. Instrumentation
Instrumentation is the process of adding code to your application to collect observability data.
Automatic vs. Manual Instrumentation
- Automatic Instrumentation: Uses libraries or agents to instrument code without manual changes
- Manual Instrumentation: Requires explicit code changes to add instrumentation
Implementation Example: Spring Boot with Micrometer and Prometheus
// Spring Boot application with Micrometer
@SpringBootApplication
public class OrderServiceApplication {

    public static void main(String[] args) {
        SpringApplication.run(OrderServiceApplication.class, args);
    }

    @Bean
    public MeterRegistryCustomizer<MeterRegistry> metricsCommonTags(
            @Value("${spring.profiles.active:default}") String environment) {
        // Tag every metric with the application name and the active environment
        return registry -> registry.config().commonTags(
            "application", "order-service",
            "environment", environment
        );
    }
}

// Custom metrics in a service
@Service
public class OrderService {

    private final OrderRepository orderRepository;
    private final Counter orderCreatedCounter;
    private final Timer orderProcessingTimer;

    public OrderService(OrderRepository orderRepository, MeterRegistry registry) {
        this.orderRepository = orderRepository;
        this.orderCreatedCounter = registry.counter("orders.created");
        this.orderProcessingTimer = registry.timer("orders.processing.time");
    }

    public Order createOrder(OrderRequest request) {
        return orderProcessingTimer.record(() -> {
            // Order creation logic
            Order order = new Order();
            // ... set order properties

            // Save order
            Order savedOrder = orderRepository.save(order);

            // Record metrics
            orderCreatedCounter.increment();

            return savedOrder;
        });
    }
}
2. Correlation
Correlation connects data across the three pillars of observability, typically using correlation IDs.
Implementation Example: Correlation IDs in Spring Boot
// Request filter to extract or generate correlation IDs
@Component
public class CorrelationFilter extends OncePerRequestFilter {

    private static final String CORRELATION_ID_HEADER = "X-Correlation-ID";

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                    FilterChain filterChain) throws ServletException, IOException {
        // Extract or generate correlation ID
        String correlationId = request.getHeader(CORRELATION_ID_HEADER);
        if (correlationId == null || correlationId.isEmpty()) {
            correlationId = UUID.randomUUID().toString();
        }

        // Store in MDC for logging
        MDC.put("correlationId", correlationId);

        // Add to response headers
        response.addHeader(CORRELATION_ID_HEADER, correlationId);

        try {
            filterChain.doFilter(request, response);
        } finally {
            MDC.remove("correlationId");
        }
    }
}
3. Visualization and Alerting
Effective visualization and alerting are essential for deriving insights from observability data.
Implementation Example: Prometheus Alerting Rules
# prometheus-alerts.yml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for the last 5 minutes (current value: {{ $value | humanizePercentage }})"

      - alert: SlowResponses
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "Slow response times detected"
          description: "95th percentile of response time is above 1 second for service {{ $labels.service }}"
Observability Patterns for Distributed Systems
1. Health Checks
Health checks provide basic information about service availability and readiness.
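A minimal sketch, assuming a Flask service like the order-service above, is to expose separate liveness and readiness endpoints so orchestrators and load balancers can distinguish "the process is running" from "the service is ready to take traffic"; the /health/live and /health/ready paths and the check_database() helper are illustrative, not a standard API:

from flask import Flask, jsonify

app = Flask(__name__)

def check_database():
    # Placeholder; replace with a real connectivity check against the order database
    return True

@app.route('/health/live')
def liveness():
    # Liveness: the process is up and able to respond
    return jsonify({"status": "UP"}), 200

@app.route('/health/ready')
def readiness():
    # Readiness: dependencies required to serve traffic are reachable
    if check_database():
        return jsonify({"status": "UP"}), 200
    return jsonify({"status": "DOWN", "reason": "database unreachable"}), 503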
2. Service Level Objectives (SLOs)
SLOs define measurable targets for service reliability.
Implementation Example: SLO Monitoring with Prometheus
# prometheus-rules.yml
groups:
  - name: slo-rules
    rules:
      # Availability: share of requests that did not fail with a 5xx error
      - record: slo:request_availability:ratio_5m
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

      # Latency: 95th percentile request duration
      - record: slo:request_latency:p95_5m
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
3. Synthetic Monitoring
Synthetic monitoring simulates user interactions to proactively detect issues.
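A synthetic check can be as small as a scheduled script that exercises a key endpoint and records the outcome. The sketch below uses Python with the requests library; the probe URL and interval are placeholders to adapt:

import time
import logging
import requests

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
logger = logging.getLogger("synthetic-probe")

PROBE_URL = "https://example.com/orders/health"  # illustrative endpoint to exercise
INTERVAL_SECONDS = 60

def probe():
    start = time.monotonic()
    try:
        response = requests.get(PROBE_URL, timeout=5)
        duration = time.monotonic() - start
        logger.info("probe completed url=%s status=%s duration_seconds=%.3f",
                    PROBE_URL, response.status_code, duration)
        return response.ok
    except requests.RequestException as exc:
        logger.error("probe failed url=%s error=%s", PROBE_URL, exc)
        return False

if __name__ == "__main__":
    while True:
        probe()
        time.sleep(INTERVAL_SECONDS)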
Building an Observability Stack
1. Components of an Observability Stack
A complete observability stack typically includes:
- Collection: Agents and libraries that collect observability data
- Storage: Databases optimized for time-series data, logs, and traces
- Processing: Systems that process, aggregate, and analyze data
- Visualization: Dashboards and UIs for exploring data
- Alerting: Systems that notify operators of issues
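One illustrative way to wire such a stack together for local experimentation is Docker Compose; the image names below are the public defaults, while ports, volume paths, and settings are assumptions to adapt:

# docker-compose.yml - an illustrative local observability stack
version: "3.8"
services:
  prometheus:                      # metrics collection and alerting
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  loki:                            # log aggregation
    image: grafana/loki
    ports:
      - "3100:3100"
  jaeger:                          # trace collection and UI
    image: jaegertracing/all-in-one
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"              # Jaeger UI
      - "4317:4317"                # OTLP gRPC ingest (matches the tracing example above)
  grafana:                         # dashboards over metrics, logs, and traces
    image: grafana/grafana
    ports:
      - "3000:3000"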
2. Popular Observability Tools
Metrics
- Prometheus: Open-source metrics collection and alerting
- Grafana: Visualization platform for metrics
- Datadog: Commercial metrics platform with broad integration support
Logging
- Elasticsearch: Distributed search and analytics engine
- Logstash: Log processing pipeline
- Kibana: Visualization platform for Elasticsearch
- Loki: Horizontally-scalable, highly-available log aggregation system
Tracing
- Jaeger: End-to-end distributed tracing system
- Zipkin: Distributed tracing system
- OpenTelemetry: Vendor-neutral observability framework
Best Practices for Observability
1. Design for Observability
- Instrument code from the beginning
- Use consistent naming conventions
- Capture business-level metrics, not just technical ones
2. Focus on High-Cardinality Data
- Collect data that allows for detailed filtering and grouping
- Include user IDs, request IDs, and other contextual information
3. Implement Contextual Logging
- Include relevant context in logs
- Use structured logging formats
- Correlate logs with traces and metrics
4. Establish SLOs and SLIs
- Define clear Service Level Objectives
- Monitor Service Level Indicators
- Track error budgets (see the example rule below)
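As an illustration of error-budget tracking, the slo:request_availability:ratio_5m recording rule defined earlier can be combined with an assumed 99.9% availability target to compute a burn rate; values above 1 mean the budget is being spent faster than the target allows:

# prometheus-rules.yml (continued) - error budget burn for an assumed 99.9% availability SLO
groups:
  - name: error-budget-rules
    rules:
      - record: slo:error_budget_burn_rate:ratio_5m
        # (observed error ratio) / (allowed error ratio)
        expr: (1 - slo:request_availability:ratio_5m) / (1 - 0.999)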
5. Automate Response to Common Issues
- Implement automated remediation for known issues
- Use alerts to trigger automated responses
- Document manual response procedures for complex issues
Conclusion
Effective observability is essential for operating distributed systems reliably. By implementing the three pillars of observability—metrics, logs, and traces—and following the best practices outlined in this article, you can gain deep insights into your distributed systems, troubleshoot issues more effectively, and ensure optimal performance.
Remember that observability is not just about tools but also about culture. Foster a culture where teams value observability and incorporate it into their development practices from the beginning. With the right approach to observability, you can navigate the complexity of distributed systems with confidence and deliver reliable services to your users.