In the world of distributed systems, understanding what’s happening across your services is both critical and challenging. As systems grow in complexity—spanning multiple services, data stores, and infrastructure components—traditional monitoring approaches fall short. This is where modern monitoring and observability practices come into play, providing the visibility needed to operate distributed systems with confidence.
This article explores the evolution from basic monitoring to comprehensive observability, providing practical guidance on implementing effective observability practices in distributed systems.
From Monitoring to Observability
While monitoring and observability are often used interchangeably, they represent different approaches to understanding system behavior.
Monitoring
Monitoring involves collecting and analyzing predefined metrics to determine system health. It answers known questions about system behavior:
- Is the service up or down?
- How much memory is being used?
- How many requests per second is the system handling?
Observability
Observability goes beyond monitoring by providing context and enabling exploration. It helps answer unanticipated questions about system behavior:
- Why is this specific user experiencing slow response times?
- What caused this unexpected spike in error rates?
- How did a change in one service affect the performance of another?
Observability is built on three pillars:
- Metrics: Numerical measurements collected over time
- Logs: Detailed records of events that occurred in the system
- Traces: Records of requests as they flow through distributed services
┌───────────────────────────────────────────────┐
│                                               │
│                 Observability                 │
│                                               │
├───────────────┬───────────────┬───────────────┤
│               │               │               │
│    Metrics    │     Logs      │    Traces     │
│               │               │               │
└───────────────┴───────────────┴───────────────┘
The Three Pillars of Observability
1. Metrics
Metrics are numerical measurements collected at regular intervals. They provide a high-level view of system behavior and performance.
Types of Metrics
- Counter: A cumulative metric that only increases (e.g., total requests)
- Gauge: A metric that can increase or decrease (e.g., memory usage)
- Histogram: Samples observations and counts them in configurable buckets (e.g., request duration)
- Summary: Similar to a histogram, but calculates configurable quantiles on the client side
Implementation Example: Prometheus Metrics in Go
package main

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Counter for total requests, labeled by method, endpoint, and status code
    requestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    // Histogram for request duration
    requestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)

// statusRecorder wraps http.ResponseWriter so the middleware can observe
// the status code the handler actually wrote.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (rec *statusRecorder) WriteHeader(code int) {
    rec.status = code
    rec.ResponseWriter.WriteHeader(code)
}

func instrumentHandler(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        // Track request duration
        start := time.Now()
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

        // Call the actual handler
        next(rec, r)

        // Record metrics after the handler completes
        duration := time.Since(start).Seconds()
        requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
        requestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rec.status)).Inc()
    }
}

func main() {
    // Example instrumented endpoint
    http.HandleFunc("/orders", instrumentHandler(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    }))

    // Expose the collected metrics for Prometheus to scrape
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}
Key Metrics for Distributed Systems
RED Metrics (for services):
- Rate: Requests per second
- Error rate: Failed requests per second
- Duration: Distribution of request latencies
USE Metrics (for resources):
- Utilization: Percentage of time the resource is busy
- Saturation: The amount of extra work the resource cannot service yet, often queued
- Errors: Count of error events
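As a sketch of how these map onto real instrumentation, the RED metrics can be expressed as PromQL queries, assuming the http_requests_total and http_request_duration_seconds metrics and labels from the Go example above:

# Rate: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Errors: failed (5xx) requests per second
sum(rate(http_requests_total{status=~"5.."}[5m]))

# Duration: 95th percentile request latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))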
2. Logs
Logs are timestamped records of discrete events that occurred in the system. They provide detailed context about specific events.
Structured Logging
In distributed systems, structured logging is essential for efficient log processing and analysis.
Implementation Example: Structured Logging in Node.js
const winston = require('winston');
const { format } = winston;

// Define the logger
const logger = winston.createLogger({
  level: 'info',
  format: format.combine(
    format.timestamp(),
    format.errors({ stack: true }),
    format.metadata(),
    format.json()
  ),
  defaultMeta: { service: 'user-service' },
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' })
  ]
});

// Example of logging in a service (userRepository is assumed to be defined elsewhere)
function createUser(userData) {
  logger.debug('Creating user', { userData });

  try {
    // User creation logic
    const user = userRepository.create(userData);

    logger.info('User created', {
      userId: user.id,
      email: user.email
    });

    return user;
  } catch (error) {
    logger.error('Failed to create user', {
      error: error.message,
      userData
    });
    throw error;
  }
}
3. Traces
Traces track requests as they flow through distributed services, providing visibility into the end-to-end request path.
Implementation Example: Distributed Tracing with OpenTelemetry in Python
from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
import requests

# Configure the tracer
resource = Resource(attributes={
    SERVICE_NAME: "order-service"
})
trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer(__name__)

# Configure the exporter (OTLP over gRPC to a collector such as Jaeger)
otlp_exporter = OTLPSpanExporter(endpoint="jaeger:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Create and instrument the Flask app
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

def get_order_from_db(order_id):
    # Placeholder for the real database lookup
    return {"id": order_id, "status": "shipped"}

def get_payment_details(order_id):
    # Placeholder call to a downstream payment service (URL is illustrative)
    return requests.get(f"http://payment-service/payments/{order_id}", timeout=2).json()

@app.route('/orders/<order_id>', methods=['GET'])
def get_order(order_id):
    with tracer.start_as_current_span("get_order_details") as span:
        span.set_attribute("order_id", order_id)

        # Get order from database
        order = get_order_from_db(order_id)

        # Get payment details
        payment = get_payment_details(order_id)

        # Combine and return
        result = {
            "order": order,
            "payment": payment
        }
        return result

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
Implementing Observability in Distributed Systems
1. Instrumentation
Instrumentation is the process of adding code to your application to collect observability data.
Automatic vs. Manual Instrumentation
- Automatic Instrumentation: Uses libraries or agents to instrument code without manual changes
- Manual Instrumentation: Requires explicit code changes to add instrumentation
Implementation Example: Spring Boot with Micrometer and Prometheus
// Spring Boot application with Micrometer
@SpringBootApplication
public class OrderServiceApplication {

    public static void main(String[] args) {
        SpringApplication.run(OrderServiceApplication.class, args);
    }

    @Bean
    public MeterRegistryCustomizer<MeterRegistry> metricsCommonTags(
            @Value("${spring.profiles.active:default}") String environment) {
        // Tag every metric with the application name and the active environment
        return registry -> registry.config().commonTags(
            "application", "order-service",
            "environment", environment
        );
    }
}

// Custom metrics in a service
@Service
public class OrderService {

    private final OrderRepository orderRepository;
    private final Counter orderCreatedCounter;
    private final Timer orderProcessingTimer;

    public OrderService(OrderRepository orderRepository, MeterRegistry registry) {
        this.orderRepository = orderRepository;
        this.orderCreatedCounter = registry.counter("orders.created");
        this.orderProcessingTimer = registry.timer("orders.processing.time");
    }

    public Order createOrder(OrderRequest request) {
        return orderProcessingTimer.record(() -> {
            // Order creation logic
            Order order = new Order();
            // ... set order properties

            // Save order
            Order savedOrder = orderRepository.save(order);

            // Record metrics
            orderCreatedCounter.increment();

            return savedOrder;
        });
    }
}
2. Correlation
Correlation connects data across the three pillars of observability, typically using correlation IDs.
Implementation Example: Correlation IDs in Spring Boot
// Request filter to extract or generate correlation IDs
@Component
public class CorrelationFilter extends OncePerRequestFilter {

    private static final String CORRELATION_ID_HEADER = "X-Correlation-ID";

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                    FilterChain filterChain) throws ServletException, IOException {
        // Extract or generate correlation ID
        String correlationId = request.getHeader(CORRELATION_ID_HEADER);
        if (correlationId == null || correlationId.isEmpty()) {
            correlationId = UUID.randomUUID().toString();
        }

        // Store in MDC for logging
        MDC.put("correlationId", correlationId);

        // Add to response headers
        response.addHeader(CORRELATION_ID_HEADER, correlationId);

        try {
            filterChain.doFilter(request, response);
        } finally {
            MDC.remove("correlationId");
        }
    }
}
3. Visualization and Alerting
Effective visualization and alerting are essential for deriving insights from observability data.
Implementation Example: Prometheus Alerting Rules
# prometheus-alerts.yml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for the last 5 minutes (current value: {{ $value | humanizePercentage }})"

      - alert: SlowResponses
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "Slow response times detected"
          description: "95th percentile of response time is above 1 second for service {{ $labels.service }}"
Observability Patterns for Distributed Systems
1. Health Checks
Health checks provide basic information about service availability and readiness.
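A minimal sketch, assuming a Flask service like the order-service above, is to expose separate liveness and readiness endpoints so orchestrators and load balancers can distinguish "the process is running" from "the service is ready to take traffic"; the /health/live and /health/ready paths and the check_database() helper are illustrative, not a standard API:

from flask import Flask, jsonify

app = Flask(__name__)

def check_database():
    # Placeholder; replace with a real connectivity check against the order database
    return True

@app.route('/health/live')
def liveness():
    # Liveness: the process is up and able to respond
    return jsonify({"status": "UP"}), 200

@app.route('/health/ready')
def readiness():
    # Readiness: dependencies required to serve traffic are reachable
    if check_database():
        return jsonify({"status": "UP"}), 200
    return jsonify({"status": "DOWN", "reason": "database unreachable"}), 503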
2. Service Level Objectives (SLOs)
SLOs define measurable targets for service reliability.
Implementation Example: SLO Monitoring with Prometheus
# prometheus-rules.yml
groups:
  - name: slo-rules
    rules:
      # Availability: share of requests that did not fail with a 5xx error
      - record: slo:request_availability:ratio_5m
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

      # Latency: 95th percentile request duration
      - record: slo:request_latency:p95_5m
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
3. Synthetic Monitoring
Synthetic monitoring simulates user interactions to proactively detect issues.
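A synthetic check can be as small as a scheduled script that exercises a key endpoint and records the outcome. The sketch below uses Python with the requests library; the probe URL and interval are placeholders to adapt:

import time
import logging
import requests

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
logger = logging.getLogger("synthetic-probe")

PROBE_URL = "https://example.com/orders/health"  # illustrative endpoint to exercise
INTERVAL_SECONDS = 60

def probe():
    start = time.monotonic()
    try:
        response = requests.get(PROBE_URL, timeout=5)
        duration = time.monotonic() - start
        logger.info("probe completed url=%s status=%s duration_seconds=%.3f",
                    PROBE_URL, response.status_code, duration)
        return response.ok
    except requests.RequestException as exc:
        logger.error("probe failed url=%s error=%s", PROBE_URL, exc)
        return False

if __name__ == "__main__":
    while True:
        probe()
        time.sleep(INTERVAL_SECONDS)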
Building an Observability Stack
1. Components of an Observability Stack
A complete observability stack typically includes:
- Collection: Agents and libraries that collect observability data
- Storage: Databases optimized for time-series data, logs, and traces
- Processing: Systems that process, aggregate, and analyze data
- Visualization: Dashboards and UIs for exploring data
- Alerting: Systems that notify operators of issues
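One illustrative way to wire such a stack together for local experimentation is Docker Compose; the image names below are the public defaults, while ports, volume paths, and settings are assumptions to adapt:

# docker-compose.yml - an illustrative local observability stack
version: "3.8"
services:
  prometheus:                      # metrics collection and alerting
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  loki:                            # log aggregation
    image: grafana/loki
    ports:
      - "3100:3100"
  jaeger:                          # trace collection and UI
    image: jaegertracing/all-in-one
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"              # Jaeger UI
      - "4317:4317"                # OTLP gRPC ingest (matches the tracing example above)
  grafana:                         # dashboards over metrics, logs, and traces
    image: grafana/grafana
    ports:
      - "3000:3000"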
2. Popular Observability Tools
Metrics
- Prometheus: Open-source metrics collection and alerting
- Grafana: Visualization platform for metrics
- Datadog: Commercial metrics platform with broad integration support
Logging
- Elasticsearch: Distributed search and analytics engine
- Logstash: Log processing pipeline
- Kibana: Visualization platform for Elasticsearch
- Loki: Horizontally-scalable, highly-available log aggregation system
Tracing
- Jaeger: End-to-end distributed tracing system
- Zipkin: Distributed tracing system
- OpenTelemetry: Vendor-neutral observability framework
Best Practices for Observability
1. Design for Observability
- Instrument code from the beginning
- Use consistent naming conventions
- Capture business-level metrics, not just technical ones
2. Focus on High-Cardinality Data
- Collect data that allows for detailed filtering and grouping
- Include user IDs, request IDs, and other contextual information
3. Implement Contextual Logging
- Include relevant context in logs
- Use structured logging formats
- Correlate logs with traces and metrics
4. Establish SLOs and SLIs
- Define clear Service Level Objectives
- Monitor Service Level Indicators
- Track error budgets (see the example rule below)
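As an illustration of error-budget tracking, the slo:request_availability:ratio_5m recording rule defined earlier can be combined with an assumed 99.9% availability target to compute a burn rate; values above 1 mean the budget is being spent faster than the target allows:

# prometheus-rules.yml (continued) - error budget burn for an assumed 99.9% availability SLO
groups:
  - name: error-budget-rules
    rules:
      - record: slo:error_budget_burn_rate:ratio_5m
        # (observed error ratio) / (allowed error ratio)
        expr: (1 - slo:request_availability:ratio_5m) / (1 - 0.999)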
5. Automate Response to Common Issues
- Implement automated remediation for known issues
- Use alerts to trigger automated responses
- Document manual response procedures for complex issues
Conclusion
Effective observability is essential for operating distributed systems reliably. By implementing the three pillars of observability—metrics, logs, and traces—and following the best practices outlined in this article, you can gain deep insights into your distributed systems, troubleshoot issues more effectively, and ensure optimal performance.
Remember that observability is not just about tools but also about culture. Foster a culture where teams value observability and incorporate it into their development practices from the beginning. With the right approach to observability, you can navigate the complexity of distributed systems with confidence and deliver reliable services to your users.