Chaos Engineering Practices: Building Resilient Systems Through Controlled Failure

In today’s complex distributed systems, failures are inevitable. Despite our best efforts at designing reliable architectures, unexpected interactions, edge cases, and environmental factors will eventually lead to outages. Chaos Engineering has emerged as a disciplined approach to identifying and addressing these potential failures before they impact users. By deliberately injecting controlled failures into systems, organizations can build confidence in their ability to withstand turbulent conditions in production.

This comprehensive guide explores practical chaos engineering practices that SRE teams can implement to improve system resilience, from basic experiments to advanced organizational strategies.


Understanding Chaos Engineering

Chaos Engineering is the practice of experimenting on a system by deliberately introducing failures to test its resilience and identify weaknesses. It’s not about creating chaos, but rather about creating controlled experiments that reveal how systems behave under stress.

Core Principles

  1. Start with a Steady State: Define what normal system behavior looks like
  2. Hypothesize about Failure Modes: Make educated guesses about what might break
  3. Introduce Real-world Events: Simulate actual failures that could occur
  4. Minimize Blast Radius: Limit the potential impact of experiments
  5. Learn and Improve: Use findings to enhance system resilience (a minimal sketch of an experiment following these steps appears below)
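
These principles map onto a simple experiment loop: capture a baseline, state a tolerance, inject a failure, watch for violations, and report what was learned. The sketch below is a minimal, hypothetical illustration in Python; get_error_rate, inject_failure, and rollback are placeholder callables you would wire up to your own monitoring and infrastructure, not part of any framework.

# Minimal chaos experiment loop illustrating the five principles above.
# All helpers passed in are hypothetical placeholders.
import time

def run_experiment(get_error_rate, inject_failure, rollback,
                   tolerance=0.01, duration_s=300, interval_s=10):
    # 1. Steady state: capture a baseline before touching anything
    baseline = get_error_rate()

    # 2. Hypothesis: during the failure, error rate stays within `tolerance` of baseline
    limit = baseline + tolerance
    print(f"Hypothesis: error rate stays below {limit:.4f}")

    # 3. Introduce a real-world event (kill a pod, add latency, block a dependency)
    inject_failure()
    observations = []
    try:
        deadline = time.time() + duration_s
        while time.time() < deadline:
            rate = get_error_rate()
            observations.append(rate)
            # 4. Minimize blast radius: abort as soon as the hypothesis is violated
            if rate > limit:
                print("Abort condition hit; stopping the experiment early")
                break
            time.sleep(interval_s)
    finally:
        # Always restore the system, even if the experiment is aborted
        rollback()

    # 5. Learn and improve: record whether the hypothesis held for the review
    held = all(rate <= limit for rate in observations)
    print("Hypothesis", "held" if held else "was falsified")
    return held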

The Evolution of Chaos Engineering

Chaos Engineering has evolved from simple failure testing to a sophisticated discipline:

  • 2010-2011: Netflix creates Chaos Monkey to randomly terminate instances
  • 2012-2014: Expansion to other failure modes (latency, network partitions)
  • 2015-2017: Formalization of principles and practices
  • 2018-2020: Adoption across industries and development of specialized tools
  • 2021-Present: Integration with observability, GitOps, and AI-driven experimentation

Building a Chaos Engineering Practice

1. Start with a Clear Purpose

Before running any chaos experiments, establish clear goals:

# Chaos Engineering Purpose Statement

Our chaos engineering practice aims to:

1. Validate that our systems can withstand common failure modes
2. Identify unknown failure modes before they affect customers
3. Build confidence in our system's resilience capabilities
4. Improve our incident response procedures through practice
5. Create a culture that embraces failure as a learning opportunity

Success Metrics:
- Reduction in customer-impacting incidents
- Decreased mean time to recovery (MTTR)
- Increased system availability
- Improved team confidence during actual incidents

2. Establish a Steady State

Define what “normal” looks like for your system:

# Example steady state monitoring in Python
import time
import statistics
from prometheus_client import start_http_server, Summary, Counter, Gauge

# Define metrics
REQUEST_LATENCY = Summary('request_latency_seconds', 'Request latency in seconds')
REQUEST_COUNT = Counter('request_count', 'Total request count')
ERROR_COUNT = Counter('error_count', 'Total error count')
ACTIVE_USERS = Gauge('active_users', 'Number of active users')

# Start Prometheus metrics server
start_http_server(8000)

def monitor_steady_state(duration_minutes=60, sample_interval_seconds=10):
    """
    Monitor system metrics to establish a steady state baseline
    
    Args:
        duration_minutes: How long to monitor
        sample_interval_seconds: How frequently to sample metrics
    
    Returns:
        Dictionary with steady state metrics
    """
    samples = {
        'latency': [],
        'requests_per_second': [],
        'error_rate': [],
        'active_users': []
    }
    
    start_time = time.time()
    end_time = start_time + (duration_minutes * 60)
    last_request_count = 0
    last_error_count = 0
    
    print(f"Monitoring steady state for {duration_minutes} minutes...")
    
    while time.time() < end_time:
        # Read current counter values (the underscore-prefixed attributes are
        # prometheus_client internals, used here only for illustration)
        current_request_count = REQUEST_COUNT._value.get()
        current_error_count = ERROR_COUNT._value.get()
        
        # Calculate per-interval metrics
        interval_requests = current_request_count - last_request_count
        interval_errors = current_error_count - last_error_count
        requests_per_second = interval_requests / sample_interval_seconds
        error_rate = 0 if interval_requests == 0 else (interval_errors / interval_requests) * 100
        
        # Average latency from the Prometheus summary (cumulative average since startup)
        latency = REQUEST_LATENCY._sum.get() / max(REQUEST_LATENCY._count.get(), 1)
        
        # Get active users
        active_users = ACTIVE_USERS._value.get()
        
        # Store samples
        samples['latency'].append(latency)
        samples['requests_per_second'].append(requests_per_second)
        samples['error_rate'].append(error_rate)
        samples['active_users'].append(active_users)
        
        # Update last values for the next interval
        last_request_count = current_request_count
        last_error_count = current_error_count
        
        # Wait for next sample
        time.sleep(sample_interval_seconds)
    
    # Calculate steady state statistics
    steady_state = {}
    for metric, values in samples.items():
        steady_state[metric] = {
            'mean': statistics.mean(values),
            'median': statistics.median(values),
            'p95': sorted(values)[int(len(values) * 0.95)],
            'p99': sorted(values)[int(len(values) * 0.99)],
            'min': min(values),
            'max': max(values),
            'stddev': statistics.stdev(values) if len(values) > 1 else 0
        }
    
    return steady_state
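
Once a baseline exists, it becomes the reference point for experiments. The following is a hypothetical helper that builds on the monitor_steady_state output above; it flags any metric that drifts more than a given relative tolerance from its baseline mean.

def check_against_baseline(baseline, current_metrics, tolerance=0.10):
    """
    Compare current readings against a steady state baseline.
    
    Args:
        baseline: Output of monitor_steady_state()
        current_metrics: Dict of metric name -> current value,
                         e.g. {'latency': 0.25, 'error_rate': 0.4}
        tolerance: Allowed relative deviation from the baseline mean (10% here)
    
    Returns:
        List of (metric, baseline_mean, current_value) tuples that deviate
    """
    deviations = []
    for metric, value in current_metrics.items():
        mean = baseline[metric]['mean']
        # Use an absolute tolerance when the baseline mean is zero
        allowed = abs(mean) * tolerance if mean else tolerance
        if abs(value - mean) > allowed:
            deviations.append((metric, mean, value))
    return deviations

An empty result means the system is still within its steady state; any entries are candidates for triggering an abort during an experiment.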

3. Form a Hypothesis

Create clear, testable hypotheses for your experiments:

# Example hypothesis document
experiment:
  name: "Database Primary Failure Experiment"
  description: "Test system resilience when the primary database instance fails"
  
  hypothesis:
    statement: "If the primary database instance fails, the system will automatically fail over to a replica with less than 10 seconds of write unavailability and no customer-visible errors"
    
    assumptions:
      - "Database replication is functioning correctly"
      - "Monitoring systems will detect the failure"
      - "Automated failover is properly configured"
    
    steady_state:
      - metric: "API success rate"
        condition: ">= 99.9%"
      - metric: "API p95 latency"
        condition: "< 500ms"
    
    method:
      - "Terminate the primary database instance"
      - "Monitor system behavior for 15 minutes"
      - "Verify automatic failover occurs"
    
    abort_conditions:
      - "Customer-facing error rate exceeds 1%"
      - "API latency exceeds 2000ms for more than 30 seconds"
      - "Failover does not complete within 5 minutes"

4. Start Small and Scale Gradually

Begin with simple experiments in non-production environments:

#!/bin/bash
# Simple chaos experiment: Test application resilience to API dependency failure

# Define variables
APP_NAMESPACE="my-application"
DEPENDENCY_SERVICE="payment-api"
EXPERIMENT_DURATION=300  # 5 minutes
CHECK_INTERVAL=10        # 10 seconds

# Ensure we're in the right context
kubectl config use-context staging

# Verify system is in steady state before experiment
echo "Verifying system steady state..."
ERROR_RATE=$(kubectl exec -n monitoring prometheus-0 -- \
  curl -s 'http://localhost:9090/api/v1/query?query=sum(rate(http_requests_total{status=~"5.."}[5m]))/sum(rate(http_requests_total[5m]))' | \
  jq '.data.result[0].value[1]' | tr -d '"')

if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
  echo "System not in steady state. Current error rate: $ERROR_RATE. Aborting experiment."
  exit 1
fi
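
# Safety net: remove the chaos network policy on exit, even if the script is interrupted
trap "kubectl delete networkpolicy chaos-block-$DEPENDENCY_SERVICE -n $APP_NAMESPACE --ignore-not-found" EXIT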

# Start experiment
echo "Starting chaos experiment: Simulating $DEPENDENCY_SERVICE failure"
echo "Experiment will run for $EXPERIMENT_DURATION seconds"

# Inject failure by adding a network policy that blocks traffic
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-block-$DEPENDENCY_SERVICE
  namespace: $APP_NAMESPACE
spec:
  podSelector:
    matchLabels:
      app: $DEPENDENCY_SERVICE
  policyTypes:
  - Ingress
  ingress: []
EOF

# Monitor system during experiment
echo "Monitoring system during experiment..."
start_time=$(date +%s)
end_time=$((start_time + EXPERIMENT_DURATION))

while [ $(date +%s) -lt $end_time ]; do
  current_time=$(date +%s)
  elapsed=$((current_time - start_time))
  
  # Check error rate
  ERROR_RATE=$(kubectl exec -n monitoring prometheus-0 -- \
    curl -s 'http://localhost:9090/api/v1/query?query=sum(rate(http_requests_total{status=~"5.."}[1m]))/sum(rate(http_requests_total[1m]))' | \
    jq '.data.result[0].value[1]' | tr -d '"')
  
  echo "[$elapsed s] Error rate: $ERROR_RATE"
  
  # Abort if error rate is too high
  if (( $(echo "$ERROR_RATE > 0.05" | bc -l) )); then
    echo "Error rate exceeded threshold (5%). Aborting experiment."
    break
  fi
  
  sleep $CHECK_INTERVAL
done

# Clean up and restore system
echo "Experiment complete. Restoring system..."
kubectl delete networkpolicy chaos-block-$DEPENDENCY_SERVICE -n $APP_NAMESPACE
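
Note that this approach assumes the cluster's CNI plugin actually enforces NetworkPolicy resources (as Calico or Cilium do). On a cluster without policy enforcement, the "failure" is silently never injected, which is itself worth verifying before trusting the experiment's results.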

5. Use Specialized Chaos Engineering Tools

Leverage purpose-built tools for more sophisticated experiments:

Chaos Toolkit Example:

# chaos-experiment.yaml
---
version: 1.0.0
title: What happens when our Redis cache becomes unavailable?
description: Verify that the application gracefully handles Redis cache failure
tags:
  - redis
  - cache
  - resilience

steady-state-hypothesis:
  title: Application is healthy
  probes:
    - name: api-returns-200
      type: probe
      tolerance: 200
      provider:
        type: http
        url: http://my-application-api/health
        method: GET
        timeout: 3

method:
  - type: action
    name: terminate-redis-pod
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: app=redis
        ns: default  # terminate_pods takes the target namespace via its 'ns' argument
        grace_period: 0
    pauses:
      after: 20

  - type: probe
    name: api-still-responds
    provider:
      type: http
      url: http://my-application-api/products
      method: GET
      timeout: 3
    tolerance:
      type: regex
      pattern: ".*"

rollbacks:
  - type: action
    name: restart-redis
    provider:
      type: process
      path: kubectl
      arguments: "apply -f k8s/redis-deployment.yaml"

Advanced Chaos Engineering Practices

1. Gamedays

Organize structured chaos engineering events:

# Chaos Engineering Gameday Plan

## Overview
- **Date**: September 15, 2025
- **Duration**: 4 hours (10:00 AM - 2:00 PM)
- **Participants**: SRE Team, Application Developers, Product Managers
- **Systems Under Test**: Payment Processing Pipeline

## Objectives
- Validate resilience of the payment processing system
- Test incident response procedures
- Identify potential improvements in monitoring and alerting
- Build team confidence in handling production incidents

## Schedule

### 10:00 - 10:30 AM: Introduction
- Review gameday objectives and rules
- Assign roles (Incident Commander, Communicator, etc.)
- Verify monitoring dashboards are accessible

### 10:30 - 11:30 AM: Scenario 1 - Database Failover
- Inject failure: Terminate primary database instance
- Expected behavior: Automatic failover to replica
- Success criteria: < 30 seconds of write unavailability, no failed payments

### 11:30 AM - 12:30 PM: Scenario 2 - Network Partition
- Inject failure: Network partition between payment service and auth service
- Expected behavior: Circuit breaking, graceful degradation
- Success criteria: Non-auth functions continue working, clear error messages

### 12:30 - 1:30 PM: Scenario 3 - Surprise Scenario
- Facilitator will introduce an unplanned failure scenario
- Team must detect, diagnose, and mitigate without prior knowledge

### 1:30 - 2:00 PM: Debrief
- Review findings from each scenario
- Document action items
- Celebrate successes and learning opportunities

2. Chaos in CI/CD Pipelines

Integrate chaos testing into your continuous integration pipeline:

# GitLab CI configuration with chaos testing
stages:
  - build
  - test
  - chaos-test
  - deploy

variables:
  KUBERNETES_NAMESPACE: ${CI_PROJECT_NAME}-${CI_COMMIT_REF_SLUG}

build:
  stage: build
  script:
    - docker build -t ${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHA} .
    - docker push ${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHA}

test:
  stage: test
  script:
    - go test ./... -v

deploy-test:
  stage: test
  script:
    - helm upgrade --install ${CI_PROJECT_NAME} ./charts/${CI_PROJECT_NAME} 
      --namespace ${KUBERNETES_NAMESPACE}
      --set image.tag=${CI_COMMIT_SHA}
      --wait --timeout 5m

chaos-cpu-stress:
  stage: chaos-test
  script:
    - kubectl apply -f chaos/cpu-stress.yaml
    - sleep 60  # Allow chaos to run
    - python3 ./scripts/verify_slos.py --scenario cpu-stress
    - kubectl delete -f chaos/cpu-stress.yaml
    - sleep 30  # Allow system to recover

chaos-network-latency:
  stage: chaos-test
  script:
    - kubectl apply -f chaos/network-latency.yaml
    - sleep 60  # Allow chaos to run
    - python3 ./scripts/verify_slos.py --scenario network-latency
    - kubectl delete -f chaos/network-latency.yaml
    - sleep 30  # Allow system to recover
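
The verify_slos.py script referenced above is not shown in the pipeline. A minimal sketch of what it might look like is below; the Prometheus URL, the PromQL queries, and the per-scenario thresholds are illustrative assumptions rather than values taken from the pipeline.

#!/usr/bin/env python3
# Hypothetical SLO verifier for the chaos-test stage
import argparse
import sys
import requests

PROMETHEUS_URL = "http://prometheus.monitoring:9090"  # assumed address, adjust for your cluster

# Per-scenario SLO thresholds (illustrative values)
SLO_THRESHOLDS = {
    "cpu-stress":      {"max_error_rate": 0.01, "max_p95_latency_s": 1.0},
    "network-latency": {"max_error_rate": 0.01, "max_p95_latency_s": 2.0},
}

def query(promql):
    """Run an instant query against the Prometheus HTTP API and return a float."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--scenario", required=True, choices=sorted(SLO_THRESHOLDS))
    args = parser.parse_args()
    slo = SLO_THRESHOLDS[args.scenario]

    error_rate = query(
        'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))')
    p95_latency = query(
        'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))')

    failures = []
    if error_rate > slo["max_error_rate"]:
        failures.append(f"error rate {error_rate:.4f} exceeds {slo['max_error_rate']}")
    if p95_latency > slo["max_p95_latency_s"]:
        failures.append(f"p95 latency {p95_latency:.2f}s exceeds {slo['max_p95_latency_s']}s")

    if failures:
        print(f"SLO violations during {args.scenario}: " + "; ".join(failures))
        sys.exit(1)
    print(f"All SLOs held during {args.scenario}")

if __name__ == "__main__":
    main()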

Building a Chaos Engineering Culture

1. Start with the Right Mindset

Chaos Engineering requires a blameless culture that values learning:

# Chaos Engineering Cultural Principles

## Blameless Learning
We focus on learning from failures, not assigning blame. We understand that complex systems fail in complex ways, and our goal is to improve system resilience, not find someone to blame.

## Embrace Failure
We recognize that failure is inevitable in complex systems. By embracing and learning from controlled failures, we build more resilient systems and teams.

## Data-Driven Decisions
We base our chaos experiments on data and hypotheses, not hunches. We measure the impact of our experiments and use that data to drive improvements.

## Incremental Progress
We start small and build our chaos engineering practice incrementally. We celebrate small wins and learn from setbacks.

## Shared Responsibility
Resilience is everyone's responsibility, not just operations or SRE. We involve all stakeholders in chaos engineering activities.

## Continuous Improvement
We continuously refine our chaos engineering practice based on what we learn. We regularly review and update our experiments, tools, and processes.

2. Establish Clear Governance

Create a framework for managing chaos experiments:

# Chaos Engineering Governance Framework

## Experiment Approval Process

### Low-Risk Experiments
- Definition: Experiments in non-production environments with no customer impact
- Approval: Team lead approval
- Documentation: Brief experiment plan in team wiki
- Notification: Team chat channel notification 1 hour before experiment

### Medium-Risk Experiments
- Definition: Experiments in production with minimal potential customer impact
- Approval: Engineering manager and SRE lead approval
- Documentation: Detailed experiment plan with abort criteria
- Notification: Company-wide notification 24 hours before experiment
- Scheduling: Only during business hours and outside peak traffic periods

### High-Risk Experiments
- Definition: Experiments in production with potential significant customer impact
- Approval: CTO and VP of Engineering approval
- Documentation: Comprehensive experiment plan with risk assessment
- Notification: Customer notification 72 hours before experiment
- Scheduling: Only during designated maintenance windows
- Oversight: An executive stakeholder must be available during the experiment

Conclusion: The Value of Chaos Engineering

Chaos Engineering is not about creating chaos—it’s about preventing it. By proactively identifying weaknesses in your systems through controlled experiments, you can build more resilient services that maintain reliability even when components fail.

The most successful chaos engineering practices share several common characteristics:

  1. They start small: Beginning with simple experiments in non-production environments
  2. They’re hypothesis-driven: Based on clear expectations about system behavior
  3. They’re incremental: Gradually increasing complexity and scope over time
  4. They’re collaborative: Involving multiple teams and stakeholders
  5. They’re educational: Focused on learning and improvement, not blame

As distributed systems continue to grow in complexity, chaos engineering will become an increasingly essential practice for organizations that prioritize reliability. By embracing controlled failure as a learning tool, you can build systems that are not just theoretically resilient, but proven to withstand the unpredictable nature of production environments.

Remember that chaos engineering is a journey, not a destination. Start small, learn continuously, and gradually build a culture that views failure as an opportunity for improvement rather than a crisis to be avoided.

Andrew

Andrew is a visionary software engineer and DevOps expert with a proven track record of delivering cutting-edge solutions that drive innovation at Ataiva.com. As a leader on numerous high-profile projects, Andrew brings his exceptional technical expertise and collaborative leadership skills to the table, fostering a culture of agility and excellence within the team. With a passion for architecting scalable systems, automating workflows, and empowering teams, Andrew is a sought-after authority in the field of software development and DevOps.
