In today’s digital landscape, reliability has become a critical differentiator for services and products. Users expect systems to be available, responsive, and correct—all the time. However, pursuing 100% reliability is not only prohibitively expensive but often unnecessary. This is where Service Level Objectives (SLOs) and Service Level Indicators (SLIs) come in, providing a framework to define, measure, and maintain appropriate reliability targets that balance user expectations with engineering costs.
This comprehensive guide explores the practical aspects of implementing SLOs and SLIs in your organization. We’ll cover selecting the right metrics, building the technical infrastructure to track them, and establishing the processes to act on the resulting data. Whether you’re just starting with reliability engineering or looking to refine your existing practices, this guide provides actionable insights to help you build more reliable services.
Understanding SLOs and SLIs: The Foundation
Before diving into implementation, let’s establish a clear understanding of SLOs and SLIs and how they relate to other service level concepts.
Key Definitions
Service Level Indicator (SLI): A quantitative measure of some aspect of the level of service provided. Examples include:
- Availability (the proportion of requests that are successful)
- Latency (the time it takes to respond to a request)
- Throughput (the number of requests processed per second)
- Error rate (the proportion of requests that fail)
- Durability (the proportion of stored data that is retrievable)
Service Level Objective (SLO): A target value or range for an SLI. For example:
- 99.9% of requests will be successful over a 30-day window
- 95% of requests will be processed in under 200ms over a 24-hour window
Service Level Agreement (SLA): A contract with users that includes consequences of meeting or missing SLOs. SLAs typically include:
- Financial penalties for missing targets
- Credits or refunds to customers
- Termination clauses for repeated violations
Error Budget: The allowed amount of unreliability derived from your SLO. For example:
- If your SLO is 99.9% availability, your error budget is 0.1% (or about 43 minutes per month)
- Error budgets provide a common language for engineering and business decisions
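To make the arithmetic concrete, here is a minimal Python sketch (assuming a 30-day window) that converts an availability SLO into an error budget expressed in minutes:
# Minimal sketch: converting an availability SLO into an error budget in minutes.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    return (1 - slo) * window_days * 24 * 60

for slo in (0.999, 0.9995):
    print(f"{slo:.2%} SLO -> {error_budget_minutes(slo):.1f} minutes of error budget per 30 days")
Run as-is, this prints 43.2 and 21.6 minutes, matching the figures used elsewhere in this guide.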
The Relationship Between SLIs, SLOs, SLAs, and Error Budgets
┌─────────────────────────────────────────────────────┐
│ SLA (Service Level Agreement)                       │
│ "We promise 99.9% availability or you get a refund" │
│                                                     │
│  ┌───────────────────────────────────────────┐      │
│  │ SLO (Service Level Objective)             │      │
│  │ "Our target is 99.95% availability"       │      │
│  │                                           │      │
│  │  ┌─────────────────────────────────────┐  │      │
│  │  │ SLI (Service Level Indicator)       │  │      │
│  │  │ "Our actual availability is 99.97%" │  │      │
│  │  └─────────────────────────────────────┘  │      │
│  └───────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────┘
Error Budget: 0.05% (100% - 99.95%) = ~22 minutes per month
Why SLOs Matter
SLOs provide several critical benefits:
- Set Clear Expectations: They define what “good enough” reliability means for your service
- Enable Data-Driven Decisions: They provide objective criteria for balancing reliability work against feature development
- Focus Engineering Efforts: They help teams prioritize work on the aspects of reliability that matter most
- Improve Communication: They create a common language for discussing reliability between technical and non-technical stakeholders
- Reduce Toil: They help avoid over-engineering by defining “good enough” reliability
Selecting the Right SLIs
The foundation of effective SLOs is choosing the right SLIs—metrics that truly reflect the user experience.
The Four Golden Signals
Google’s Site Reliability Engineering (SRE) book identifies four “golden signals” that are often good starting points for SLIs:
- Latency: How long it takes to serve a request
- Traffic: The demand placed on your system
- Errors: The rate of failed requests
- Saturation: How “full” your system is (e.g., memory usage, CPU utilization)
SLI Selection Criteria
When selecting SLIs, consider these criteria:
- User-Centric: Does the metric reflect what users actually experience?
- Controllable: Can your team influence this metric through their actions?
- Measurable: Can you collect this data reliably and at scale?
- Understandable: Is the metric intuitive for both technical and non-technical stakeholders?
- Proportional: Does the metric correlate with user happiness?
Common SLI Types by Service Category
Different types of services require different SLIs:
Web Services and APIs:
- Availability: % of successful requests
- Latency: Time to first byte or time to complete request
- Throughput: Requests per second
- Error rate: % of 5xx or 4xx responses
Data Processing Systems:
- Freshness: Age of most recent data
- Correctness: % of data that is accurate
- Coverage: % of expected data that is processed
- Throughput: Records processed per second
Storage Systems:
- Durability: % of data that can be retrieved without loss
- Availability: % of successful read/write operations
- Latency: Time to complete read/write operations
- Throughput: Operations per second
Example SLIs for a Web Application
# Example SLIs for a web application
slis:
  - name: Availability
    description: Proportion of successful HTTP requests
    metric: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
  - name: Latency
    description: Proportion of requests served faster than threshold
    metric: sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
  - name: Quality
    description: Proportion of valid responses (correct content)
    metric: sum(rate(http_response_validation_success[5m])) / sum(rate(http_response_validation_total[5m]))
Avoiding Common SLI Pitfalls
- Server-Side Only Metrics: Failing to capture the full user journey, including client-side performance
- Averages Over Percentiles: Using averages that hide the long tail of poor experiences (see the sketch after this list)
- Too Many Metrics: Creating too many SLIs, diluting focus and making prioritization difficult
- Vanity Metrics: Choosing metrics that look good but don’t correlate with user experience
- Unmeasurable Goals: Setting objectives that can’t be reliably measured
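To illustrate the averages-versus-percentiles pitfall, here is a small Python sketch with made-up latencies: the mean looks healthy while the 99th percentile reveals a painful tail.
# Minimal sketch: averages hide the long tail (latency values are made up).
import random
import statistics

random.seed(42)
# 98% of requests are fast (~50ms); 2% hit a slow path (~2000ms).
latencies_ms = ([random.gauss(50, 10) for _ in range(9800)]
                + [random.gauss(2000, 300) for _ in range(200)])

cuts = statistics.quantiles(latencies_ms, n=100)  # cuts[p-1] is roughly the p-th percentile

print(f"mean: {statistics.mean(latencies_ms):.0f}ms")  # ~89ms: looks fine
print(f"p50:  {cuts[49]:.0f}ms")                       # the typical user is fine
print(f"p95:  {cuts[94]:.0f}ms")
print(f"p99:  {cuts[98]:.0f}ms")                       # ~2000ms: 1 in 100 users waits ~2 seconds
This is why the latency SLIs above are expressed as a proportion of requests faster than a threshold rather than as an average.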
Setting Appropriate SLOs
Once you’ve selected your SLIs, the next step is setting appropriate SLO targets.
Approaches to Setting Initial SLOs
Historical Performance: Base SLOs on what you’ve actually achieved in the past
# Example: Calculating historical performance
import pandas as pd

# Load monitoring data
data = pd.read_csv('monitoring_data.csv')

# Calculate 99th percentile latency over past 30 days
p99_latency = data['request_latency_ms'].quantile(0.99)
print(f"99th percentile latency: {p99_latency}ms")

# Calculate availability percentage
availability = (data['status_code'] < 500).mean() * 100
print(f"Historical availability: {availability:.3f}%")
Customer Expectations: Base SLOs on what users expect or what competitors offer
Business Requirements: Base SLOs on what the business needs to succeed
Technical Constraints: Base SLOs on what is technically feasible
SLO Time Windows
SLOs need appropriate time windows to be meaningful:
Calendar Windows: Fixed time periods (day, week, month, quarter)
- Pros: Easy to understand, aligns with business reporting
- Cons: Can hide patterns, resets arbitrarily
Rolling Windows: Moving time periods (last 30 days, last 7 days)
- Pros: Provides continuous visibility, no arbitrary resets
- Cons: More complex to implement and explain
Sliding Windows: Multiple overlapping windows
- Pros: Captures both short and long-term trends
- Cons: Most complex to implement and explain
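The practical difference between calendar and rolling windows is easiest to see with a worked example. The sketch below uses hypothetical daily request counts and an incident that straddles a calendar boundary:
# Minimal sketch: calendar vs. rolling windows (daily counts are hypothetical).
# Each entry is (successful_requests, total_requests) for one day; an incident
# degrades days 28-32, right across the boundary between two 30-day blocks.
daily = [(999_500, 1_000_000)] * 60
for day in range(27, 32):          # zero-based indices 27-31 = days 28-32
    daily[day] = (985_000, 1_000_000)

def availability(days):
    good = sum(g for g, _ in days)
    total = sum(t for _, t in days)
    return good / total

# Calendar windows: the incident is split across the boundary, so neither
# block looks as bad as the incident really was.
print(f"calendar block 1: {availability(daily[:30]):.4%}")
print(f"calendar block 2: {availability(daily[30:]):.4%}")

# Rolling window: the trailing 30 days as of day 33 contain the whole incident.
print(f"rolling 30 days (at day 33): {availability(daily[3:33]):.4%}")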
Setting Multiple SLO Targets
Consider using multiple targets for the same SLI:
Latency SLO:
- 95% of requests < 100ms
- 99% of requests < 300ms
- 99.9% of requests < 1000ms
This approach:
- Captures the experience of most users (95%)
- Addresses the experience of edge cases (99.9%)
- Provides more nuanced reliability targets
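As a rough sketch of how these stacked targets can be evaluated together (the sample latencies below are purely illustrative):
# Minimal sketch: checking several latency targets for the same SLI at once.
latencies_ms = [42, 87, 95, 110, 150, 240, 310, 620, 980, 1500]  # toy sample

targets = [(100, 0.95), (300, 0.99), (1000, 0.999)]  # (threshold_ms, required_fraction)

for threshold, required in targets:
    fraction = sum(1 for l in latencies_ms if l < threshold) / len(latencies_ms)
    status = "OK" if fraction >= required else "MISS"
    print(f"< {threshold}ms: {fraction:.1%} measured, {required:.1%} required -> {status}")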
Aligning SLOs with Business Impact
Different services and features have different reliability requirements:
Critical Path: Core functionality that directly impacts revenue or user experience
- Example: Payment processing, authentication
- Typical SLO: 99.9% - 99.99% availability
Supporting Services: Important but not immediately visible to users
- Example: Analytics, recommendation engines
- Typical SLO: 99.5% - 99.9% availability
Non-Critical Services: Nice-to-have features
- Example: Avatar customization, preference settings
- Typical SLO: 99% - 99.5% availability
Example SLO Document
# Service Level Objectives for User Authentication Service
## Service Overview
The User Authentication Service handles user login, registration, and session management.
## SLO Definitions
### Availability SLO
- **Target**: 99.95% of requests successful
- **Window**: Rolling 30-day
- **SLI Definition**: Proportion of HTTP requests that return a status code other than 5xx
- **Error Budget**: 0.05% (21.6 minutes per month)
### Latency SLO
- **Target**: 95% of requests completed in under 200ms
- **Window**: Rolling 30-day
- **SLI Definition**: Proportion of requests that complete in less than 200ms
- **Error Budget**: 5% of requests can exceed 200ms
## Measurement Methodology
- Data collected via application logs and Prometheus metrics
- Excludes planned maintenance windows (communicated 7 days in advance)
- Measured at the load balancer level
## Stakeholders
- **Owner**: Authentication Team
- **Consumers**: All user-facing services
- **Escalation Contact**: [email protected]
## Review Schedule
This SLO will be reviewed quarterly or after significant architecture changes.
Implementing SLI Measurement
With SLIs and SLOs defined, the next step is implementing the technical infrastructure to measure them.
SLI Measurement Approaches
Request-Based Metrics: Measuring success/failure of individual requests
- Pros: Direct correlation with user experience
- Cons: May miss client-side issues
Window-Based Metrics: Measuring service health over time intervals
- Pros: Can capture broader patterns
- Cons: Less direct connection to individual user experiences
Synthetic Probes: Simulating user behavior to test the service (a minimal probe sketch follows this list)
- Pros: Proactive detection, consistent testing
- Cons: May not reflect real user conditions
Client Instrumentation: Measuring from the user’s perspective
- Pros: Captures true end-user experience
- Cons: Limited visibility into causes, privacy concerns
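For illustration, here is the minimal synthetic probe sketch mentioned above, in Python. The endpoint URL and probing interval are assumptions, and a real probe would export its results as metrics (for example, via a Pushgateway) rather than printing them:
# Minimal synthetic probe sketch; the target URL is hypothetical.
import time
import requests

PROBE_URL = "https://example.com/healthz"
TIMEOUT_S = 2.0

def probe_once():
    """Return (success, latency_seconds) for a single probe request."""
    start = time.monotonic()
    try:
        resp = requests.get(PROBE_URL, timeout=TIMEOUT_S)
        return resp.status_code < 500, time.monotonic() - start
    except requests.RequestException:
        return False, time.monotonic() - start

if __name__ == "__main__":
    while True:
        ok, latency = probe_once()
        # In a real probe, emit these as metrics instead of printing.
        print(f"success={ok} latency={latency:.3f}s")
        time.sleep(60)  # probe once per minute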
Implementing Request-Based SLIs
Example: Instrumenting an API with Prometheus and Go
package main

import (
    "log"
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: []float64{0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 2, 5},
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

// responseWriter wraps http.ResponseWriter so the status code can be recorded.
type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func newResponseWriter(w http.ResponseWriter) *responseWriter {
    return &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

func (rw *responseWriter) status() string {
    return strconv.Itoa(rw.statusCode)
}

func instrumentHandler(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        // Wrap the ResponseWriter to capture the status code
        wrapped := newResponseWriter(w)
        // Call the original handler
        next(wrapped, r)
        // Record metrics after the handler returns
        duration := time.Since(start).Seconds()
        httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
        httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, wrapped.status()).Inc()
    }
}

// handleUsers is a placeholder handler for the example.
func handleUsers(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    w.Write([]byte(`{"users": []}`))
}

func main() {
    // Set up API endpoints
    http.HandleFunc("/api/users", instrumentHandler(handleUsers))
    // Expose Prometheus metrics
    http.Handle("/metrics", promhttp.Handler())
    // Start server
    log.Fatal(http.ListenAndServe(":8080", nil))
}
Example: Prometheus Query for Availability SLI
# Availability SLI: Proportion of successful (non-5xx) requests
sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Latency SLI: Proportion of requests faster than 300ms
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
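If you also want to pull SLI values programmatically (for reports or error budget calculations), the same queries can be run against Prometheus’s HTTP API. A minimal Python sketch, assuming Prometheus is reachable at the address shown:
# Minimal sketch: fetching the availability SLI via Prometheus's /api/v1/query endpoint.
import requests

PROM_URL = "http://prometheus:9090"  # assumed address; adjust for your environment
QUERY = (
    'sum(rate(http_requests_total{status!~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    # Instant-query vector results carry the value as [timestamp, "value-as-string"].
    sli = float(result[0]["value"][1])
    print(f"Current availability SLI: {sli:.5f}")
else:
    print("Query returned no data")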
Building SLO Dashboards and Alerts
With SLI measurement in place, the next step is creating dashboards and alerts to make the data actionable.
SLO Dashboard Components
An effective SLO dashboard should include:
- Current SLI Values: Real-time or near-real-time SLI measurements
- SLO Targets: Clear indication of the target values
- Error Budget Consumption: Visual representation of error budget usage
- Historical Trends: SLI performance over time
- Burn Rate: Rate at which the error budget is being consumed
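Error budget consumption itself is simple arithmetic once you have the raw counts. A minimal Python sketch with hypothetical numbers, assuming a 99.9% availability SLO over a 30-day window:
# Minimal sketch: error budget consumption from raw request counts (numbers are hypothetical).
SLO = 0.999                    # 99.9% availability target
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window

total_requests = 120_000_000
failed_requests = 84_000

error_budget = 1 - SLO                                 # allowed failure fraction (0.001)
actual_error_rate = failed_requests / total_requests   # observed failure fraction
budget_consumed = actual_error_rate / error_budget     # fraction of the budget used

remaining_minutes = (1 - budget_consumed) * error_budget * WINDOW_MINUTES
print(f"Error rate:       {actual_error_rate:.5f}")
print(f"Budget consumed:  {budget_consumed:.1%}")
# Converting a request-based budget into minutes assumes roughly uniform traffic.
print(f"Budget remaining: ~{remaining_minutes:.1f} minutes of full outage")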
Example: Grafana Dashboard Configuration
# Grafana dashboard configuration (simplified)
dashboard:
  title: "API Service SLOs"
  panels:
    - title: "Availability SLO"
      type: "gauge"
      targets:
        - expr: sum(rate(http_requests_total{status!~"5.."}[1h])) / sum(rate(http_requests_total[1h]))
      thresholds:
        - value: 0.999
          color: "red"
        - value: 0.9995
          color: "yellow"
        - value: 0.9999
          color: "green"
    - title: "Availability Error Budget"
      type: "graph"
      targets:
        - expr: (1 - (sum(increase(http_requests_total{status!~"5.."}[30d])) / sum(increase(http_requests_total[30d])))) / 0.001
          legendFormat: "Error Budget Consumed"
      yaxes:
        - format: "percentunit"
          max: 1
    - title: "Latency SLO (95th Percentile)"
      type: "gauge"
      targets:
        - expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1h])) by (le))
      thresholds:
        - value: 0.3
          color: "red"
        - value: 0.2
          color: "yellow"
        - value: 0.1
          color: "green"
SLO Alerting Strategies
Effective SLO alerting balances prompt notification with avoiding alert fatigue:
Burn Rate Alerts: Alert based on how quickly error budget is being consumed
# Alert when burning 24x faster than allowed (exhausting 30-day budget in ~1 day)
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) > 24 * 0.001  # 0.001 = 1 - 0.999 (SLO target)
Multi-Window, Multi-Burn-Rate Alerts: Alert based on different time windows and burn rates
# Prometheus alert rules
groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorRateFast
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 14.4 * 0.001  # 14.4x burn rate = 2 days to exhaust 30-day budget
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected - fast burn"
          description: "Error budget burning 14.4x faster than allowed for 5 minutes"
      - alert: HighErrorRateSlow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 3 * 0.001  # 3x burn rate = 10 days to exhaust 30-day budget
        for: 60m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected - slow burn"
          description: "Error budget burning 3x faster than allowed for 60 minutes"
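The burn-rate multipliers in these rules map directly onto how quickly a 30-day budget would be exhausted, as the small sketch below shows:
# Minimal sketch: burn rate vs. time to exhaust a 30-day error budget.
WINDOW_DAYS = 30

for burn_rate in (1, 3, 14.4, 24):
    days_to_exhaustion = WINDOW_DAYS / burn_rate
    print(f"burn rate {burn_rate:>5}x -> budget exhausted in {days_to_exhaustion:.2f} days")
A burn rate of 1x means the budget lasts exactly the full window; anything above 1x is spending reliability faster than the SLO allows.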
Establishing SLO Processes and Culture
Technical implementation is only part of the SLO journey. Establishing the right processes and culture is equally important.
SLO Review Process
Regular SLO reviews help ensure your objectives remain relevant:
- Quarterly Reviews: Assess SLO performance and relevance
- Post-Incident Reviews: Evaluate SLO impact after major incidents
- Annual Strategic Reviews: Align SLOs with changing business priorities
Example: SLO Review Template
# Quarterly SLO Review: User Authentication Service
## SLO Performance Summary
- **Availability SLO**: 99.95% target, 99.97% achieved
- **Latency SLO**: 95% < 200ms target, 97.2% achieved
- **Error Budget**: 21.6 minutes allowed, 13.1 minutes consumed
## Key Observations
- Availability consistently exceeding target
- Latency improved after database index optimization
- Error budget consumption primarily from two incidents:
- Database failover issue on Jan 15 (8.2 minutes)
- Network partition on Feb 3 (4.9 minutes)
## Recommendations
- **Maintain current availability SLO**: 99.95% remains appropriate
- **Tighten latency SLO**: Consider increasing to 97% < 200ms
- **Add new SLO**: Consider adding correctness SLO for authentication decisions
## Action Items
- [ ] Update latency SLO in documentation
- [ ] Implement measurement for correctness SLO
- [ ] Schedule follow-up review in 3 months
Error Budget Policies
Error budget policies define what happens when error budgets are exhausted:
- Feature Freeze: Halt new feature development until reliability improves
- Reliability Investment: Allocate engineering time to reliability improvements
- Controlled Reduction: Reduce load or functionality to improve reliability
- SLO Adjustment: Reassess whether the SLO is appropriate
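These policies work best when they are written down and easy to evaluate. As a minimal Python sketch (thresholds and wording are illustrative, not a recommendation), a policy can even be encoded as an automated check:
# Minimal sketch of an error budget policy as an automated check (thresholds are illustrative).
def policy_decision(budget_consumed: float) -> str:
    """budget_consumed is the fraction of the window's error budget already used."""
    if budget_consumed >= 1.0:
        return "FEATURE FREEZE: budget exhausted; only reliability work ships"
    if budget_consumed >= 0.75:
        return "SLOW DOWN: prioritize reliability work alongside features"
    if budget_consumed >= 0.5:
        return "WATCH: review recent incidents at the next SLO review"
    return "OK: ship features as planned"

for consumed in (0.2, 0.6, 0.8, 1.1):
    print(f"{consumed:.0%} consumed -> {policy_decision(consumed)}")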
Conclusion: SLOs as a Journey
Implementing SLOs and SLIs is not a one-time project but an ongoing journey of refinement and improvement. As your services evolve, so too should your reliability objectives and measurement approaches.
Remember these key principles as you implement SLOs:
- Start Simple: Begin with a few critical SLOs and expand gradually
- Focus on Users: Always tie SLOs back to user experience
- Iterate and Improve: Regularly review and refine your SLOs
- Balance Reliability and Innovation: Use error budgets to guide the trade-off
- Build a Reliability Culture: Foster shared ownership of reliability across teams
By following these principles and implementing the practices outlined in this guide, you can create a reliability framework that helps your organization deliver services that meet user expectations while enabling sustainable innovation and growth.
SLOs provide a powerful framework for making reliability concrete, measurable, and actionable. They transform reliability from a vague aspiration to a specific engineering discipline with clear targets and tools. Whether you’re running a small startup or a large enterprise, implementing SLOs can help you deliver more reliable services while making better decisions about where to invest your engineering resources.