In today’s complex distributed systems, failures are inevitable. Despite our best efforts at designing reliable architectures, unexpected interactions, edge cases, and environmental factors will eventually lead to outages. Chaos Engineering has emerged as a disciplined approach to identifying and addressing these potential failures before they impact users. By deliberately injecting controlled failures into systems, organizations can build confidence in their ability to withstand turbulent conditions in production.
This comprehensive guide explores practical chaos engineering practices that SRE teams can implement to improve system resilience, from basic experiments to advanced organizational strategies.
Understanding Chaos Engineering
Chaos Engineering is the practice of experimenting on a system by deliberately introducing failures to test its resilience and identify weaknesses. It’s not about creating chaos, but rather about creating controlled experiments that reveal how systems behave under stress.
Core Principles
- Start with a Steady State: Define what normal system behavior looks like
- Hypothesize about Failure Modes: Make educated guesses about what might break
- Introduce Real-world Events: Simulate actual failures that could occur
- Minimize Blast Radius: Limit the potential impact of experiments
- Learn and Improve: Use findings to enhance system resilience
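These principles map directly onto the shape of an individual experiment. The sketch below is a minimal illustration of that flow; the four callables (steady_state_ok, inject_failure, rollback, blast_radius_ok) are hypothetical stand-ins for whatever tooling you use to check metrics and inject faults:

# Skeleton of a single experiment run; the four callables are hypothetical
# stand-ins for your own metric checks, fault injection, and cleanup tooling.
def run_experiment(steady_state_ok, inject_failure, rollback, blast_radius_ok):
    # 1. Start with a steady state: refuse to run if the system is already degraded.
    if not steady_state_ok():
        return "aborted: system not in steady state"
    # 2. Hypothesis: the steady state should hold while the failure is active.
    inject_failure()  # 3. Introduce a real-world event
    try:
        # 4. Minimize blast radius: stop as soon as impact exceeds the agreed limit.
        if not blast_radius_ok():
            return "aborted: blast radius exceeded"
        # 5. Learn and improve: record whether the hypothesis held.
        return "hypothesis held" if steady_state_ok() else "hypothesis falsified"
    finally:
        rollback()  # always restore the system, whatever the outcome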
The Evolution of Chaos Engineering
Chaos Engineering has evolved from simple failure testing to a sophisticated discipline:
- 2010-2011: Netflix creates Chaos Monkey to randomly terminate instances
- 2012-2014: Expansion to other failure modes (latency, network partitions)
- 2015-2017: Formalization of principles and practices
- 2018-2020: Adoption across industries and development of specialized tools
- 2021-Present: Integration with observability, GitOps, and AI-driven experimentation
Building a Chaos Engineering Practice
1. Start with a Clear Purpose
Before running any chaos experiments, establish clear goals:
# Chaos Engineering Purpose Statement
Our chaos engineering practice aims to:
1. Validate that our systems can withstand common failure modes
2. Identify unknown failure modes before they affect customers
3. Build confidence in our system's resilience capabilities
4. Improve our incident response procedures through practice
5. Create a culture that embraces failure as a learning opportunity
Success Metrics:
- Reduction in customer-impacting incidents
- Decreased mean time to recovery (MTTR)
- Increased system availability
- Improved team confidence during actual incidents
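These success metrics only tell a consistent story if they are computed the same way every reporting period. A minimal sketch of deriving MTTR and availability from an incident log (the record format and the sample dates are purely illustrative):

# Sketch: deriving MTTR and availability from an incident log (illustrative data).
from datetime import datetime, timedelta

# Each entry records the start and end of customer impact for one incident.
incidents = [
    (datetime(2025, 6, 3, 14, 0), datetime(2025, 6, 3, 14, 25)),
    (datetime(2025, 6, 19, 9, 10), datetime(2025, 6, 19, 9, 52)),
]

period = timedelta(days=90)  # reporting window
downtime = sum(((end - start) for start, end in incidents), timedelta())
mttr_minutes = (downtime / len(incidents)) / timedelta(minutes=1) if incidents else 0.0
availability = 100 * (1 - downtime / period)

print(f"MTTR: {mttr_minutes:.1f} min, availability: {availability:.3f}%")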
2. Establish a Steady State
Define what “normal” looks like for your system:
# Example steady state monitoring in Python
import time
import statistics
from prometheus_client import start_http_server, Summary, Counter, Gauge

# Define metrics
REQUEST_LATENCY = Summary('request_latency_seconds', 'Request latency in seconds')
REQUEST_COUNT = Counter('request_count', 'Total request count')
ERROR_COUNT = Counter('error_count', 'Total error count')
ACTIVE_USERS = Gauge('active_users', 'Number of active users')

# Start Prometheus metrics server
start_http_server(8000)

def monitor_steady_state(duration_minutes=60, sample_interval_seconds=10):
    """
    Monitor system metrics to establish a steady state baseline

    Args:
        duration_minutes: How long to monitor
        sample_interval_seconds: How frequently to sample metrics

    Returns:
        Dictionary with steady state metrics
    """
    samples = {
        'latency': [],
        'requests_per_second': [],
        'error_rate': [],
        'active_users': []
    }

    start_time = time.time()
    end_time = start_time + (duration_minutes * 60)
    last_request_count = 0

    print(f"Monitoring steady state for {duration_minutes} minutes...")

    while time.time() < end_time:
        # Get current metrics (the leading-underscore attributes are
        # prometheus_client internals, read here for simplicity)
        current_request_count = REQUEST_COUNT._value.get()
        current_error_count = ERROR_COUNT._value.get()

        # Calculate derived metrics
        requests_per_second = (current_request_count - last_request_count) / sample_interval_seconds
        error_rate = 0 if current_request_count == 0 else (current_error_count / current_request_count) * 100

        # Average latency since startup, derived from the summary's running sum and count
        latency = REQUEST_LATENCY._sum.get() / max(REQUEST_LATENCY._count.get(), 1)

        # Get active users
        active_users = ACTIVE_USERS._value.get()

        # Store samples
        samples['latency'].append(latency)
        samples['requests_per_second'].append(requests_per_second)
        samples['error_rate'].append(error_rate)
        samples['active_users'].append(active_users)

        # Update last values
        last_request_count = current_request_count

        # Wait for next sample
        time.sleep(sample_interval_seconds)

    # Calculate steady state statistics
    steady_state = {}
    for metric, values in samples.items():
        sorted_values = sorted(values)
        steady_state[metric] = {
            'mean': statistics.mean(values),
            'median': statistics.median(values),
            'p95': sorted_values[int(len(values) * 0.95)],
            'p99': sorted_values[int(len(values) * 0.99)],
            'min': min(values),
            'max': max(values),
            'stddev': statistics.stdev(values) if len(values) > 1 else 0
        }

    return steady_state
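A typical workflow is to capture this baseline before any experiment and persist it so later measurements can be compared against it; for example:

# Capture a baseline before the experiment and keep it for later comparison.
import json

baseline = monitor_steady_state(duration_minutes=60, sample_interval_seconds=10)
with open("steady_state_baseline.json", "w") as f:
    json.dump(baseline, f, indent=2)

# During an experiment, a shorter sample can be checked against the baseline, e.g.:
# current = monitor_steady_state(duration_minutes=5)
# assert current['error_rate']['mean'] <= baseline['error_rate']['p95']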
3. Form a Hypothesis
Create clear, testable hypotheses for your experiments:
# Example hypothesis document
experiment:
  name: "Database Primary Failure Experiment"
  description: "Test system resilience when the primary database instance fails"
  hypothesis:
    statement: "If the primary database instance fails, the system will automatically fail over to a replica with less than 10 seconds of write unavailability and no customer-visible errors"
    assumptions:
      - "Database replication is functioning correctly"
      - "Monitoring systems will detect the failure"
      - "Automated failover is properly configured"
  steady_state:
    - metric: "API success rate"
      condition: ">= 99.9%"
    - metric: "API p95 latency"
      condition: "< 500ms"
  method:
    - "Terminate the primary database instance"
    - "Monitor system behavior for 15 minutes"
    - "Verify automatic failover occurs"
  abort_conditions:
    - "Customer-facing error rate exceeds 1%"
    - "API latency exceeds 2000ms for more than 30 seconds"
    - "Failover does not complete within 5 minutes"
4. Start Small and Scale Gradually
Begin with simple experiments in non-production environments:
#!/bin/bash
# Simple chaos experiment: Test application resilience to API dependency failure

# Define variables
APP_NAMESPACE="my-application"
DEPENDENCY_SERVICE="payment-api"
EXPERIMENT_DURATION=300  # 5 minutes
CHECK_INTERVAL=10        # 10 seconds

# Ensure we're in the right context
kubectl config use-context staging

# Verify system is in steady state before experiment
echo "Verifying system steady state..."
ERROR_RATE=$(kubectl exec -n monitoring prometheus-0 -- \
  curl -s 'http://localhost:9090/api/v1/query?query=sum(rate(http_requests_total{status=~"5.."}[5m]))/sum(rate(http_requests_total[5m]))' | \
  jq -r '.data.result[0].value[1]')

if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
  echo "System not in steady state. Current error rate: $ERROR_RATE. Aborting experiment."
  exit 1
fi

# Start experiment
echo "Starting chaos experiment: Simulating $DEPENDENCY_SERVICE failure"
echo "Experiment will run for $EXPERIMENT_DURATION seconds"

# Inject failure by adding a network policy that blocks traffic
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-block-$DEPENDENCY_SERVICE
  namespace: $APP_NAMESPACE
spec:
  podSelector:
    matchLabels:
      app: $DEPENDENCY_SERVICE
  policyTypes:
    - Ingress
  ingress: []
EOF

# Monitor system during experiment
echo "Monitoring system during experiment..."
start_time=$(date +%s)
end_time=$((start_time + EXPERIMENT_DURATION))

while [ $(date +%s) -lt $end_time ]; do
  current_time=$(date +%s)
  elapsed=$((current_time - start_time))

  # Check error rate
  ERROR_RATE=$(kubectl exec -n monitoring prometheus-0 -- \
    curl -s 'http://localhost:9090/api/v1/query?query=sum(rate(http_requests_total{status=~"5.."}[1m]))/sum(rate(http_requests_total[1m]))' | \
    jq -r '.data.result[0].value[1]')

  echo "[$elapsed s] Error rate: $ERROR_RATE"

  # Abort if error rate is too high
  if (( $(echo "$ERROR_RATE > 0.05" | bc -l) )); then
    echo "Error rate exceeded threshold (5%). Aborting experiment."
    break
  fi

  sleep $CHECK_INTERVAL
done

# Clean up and restore system
echo "Experiment complete. Restoring system..."
kubectl delete networkpolicy chaos-block-$DEPENDENCY_SERVICE -n $APP_NAMESPACE
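One refinement worth considering: if the script is interrupted between injecting the failure and the cleanup step (a Ctrl-C, a lost SSH session), the NetworkPolicy stays in place and the dependency remains blocked. Registering a shell `trap ... EXIT` immediately after applying the policy, pointing at the same `kubectl delete networkpolicy` command, guarantees the block is removed however the script terminates.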
5. Use Specialized Chaos Engineering Tools
Leverage purpose-built tools for more sophisticated experiments:
Chaos Toolkit Example:
# chaos-experiment.yaml
---
version: 1.0.0
title: What happens when our Redis cache becomes unavailable?
description: Verify that the application gracefully handles Redis cache failure
tags:
  - redis
  - cache
  - resilience
steady-state-hypothesis:
  title: Application is healthy
  probes:
    - name: api-returns-200
      type: probe
      tolerance: 200
      provider:
        type: http
        url: http://my-application-api/health
        method: GET
        timeout: 3
method:
  - type: action
    name: terminate-redis-pod
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: app=redis
        ns: default  # chaosk8s passes the namespace as "ns"
        grace_period: 0
    pauses:
      after: 20
  - type: probe
    name: api-still-responds
    provider:
      type: http
      url: http://my-application-api/products
      method: GET
      timeout: 3
    tolerance:
      type: regex
      pattern: ".*"
rollbacks:
  - type: action
    name: restart-redis
    provider:
      type: process
      path: kubectl
      arguments: "apply -f k8s/redis-deployment.yaml"
Advanced Chaos Engineering Practices
1. Gamedays
Organize structured chaos engineering events:
# Chaos Engineering Gameday Plan
## Overview
- **Date**: September 15, 2025
- **Duration**: 4 hours (10:00 AM - 2:00 PM)
- **Participants**: SRE Team, Application Developers, Product Managers
- **Systems Under Test**: Payment Processing Pipeline
## Objectives
- Validate resilience of the payment processing system
- Test incident response procedures
- Identify potential improvements in monitoring and alerting
- Build team confidence in handling production incidents
## Schedule
### 10:00 - 10:30 AM: Introduction
- Review gameday objectives and rules
- Assign roles (Incident Commander, Communicator, etc.)
- Verify monitoring dashboards are accessible
### 10:30 - 11:30 AM: Scenario 1 - Database Failover
- Inject failure: Terminate primary database instance
- Expected behavior: Automatic failover to replica
- Success criteria: < 30 seconds of write unavailability, no failed payments
### 11:30 AM - 12:30 PM: Scenario 2 - Network Partition
- Inject failure: Network partition between payment service and auth service
- Expected behavior: Circuit breaking, graceful degradation
- Success criteria: Non-auth functions continue working, clear error messages
### 12:30 - 1:30 PM: Scenario 3 - Surprise Scenario
- Facilitator will introduce an unplanned failure scenario
- Team must detect, diagnose, and mitigate without prior knowledge
### 1:30 - 2:00 PM: Debrief
- Review findings from each scenario
- Document action items
- Celebrate successes and learning opportunities
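Scenario 2 assumes the payment service degrades gracefully when the auth service is unreachable, which in practice usually means a circuit breaker in the calling code. A minimal sketch of the pattern, not tied to any particular library or to the real payment service:

# Minimal circuit breaker sketch: stop calling a failing dependency and degrade gracefully.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_seconds=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, func, fallback):
        # While open, short-circuit to the fallback until the reset timeout passes.
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = func()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # open the circuit
            return fallback()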
2. Chaos in CI/CD Pipelines
Integrate chaos testing into your continuous integration pipeline:
# GitLab CI configuration with chaos testing
stages:
  - build
  - test
  - chaos-test
  - deploy

variables:
  KUBERNETES_NAMESPACE: ${CI_PROJECT_NAME}-${CI_COMMIT_REF_SLUG}

build:
  stage: build
  script:
    - docker build -t ${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHA} .
    - docker push ${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHA}

test:
  stage: test
  script:
    - go test ./... -v

deploy-test:
  stage: test
  script:
    - helm upgrade --install ${CI_PROJECT_NAME} ./charts/${CI_PROJECT_NAME}
      --namespace ${KUBERNETES_NAMESPACE}
      --set image.tag=${CI_COMMIT_SHA}
      --wait --timeout 5m

chaos-cpu-stress:
  stage: chaos-test
  script:
    - kubectl apply -f chaos/cpu-stress.yaml
    - sleep 60  # Allow chaos to run
    - python3 ./scripts/verify_slos.py --scenario cpu-stress
    - kubectl delete -f chaos/cpu-stress.yaml
    - sleep 30  # Allow system to recover

chaos-network-latency:
  stage: chaos-test
  script:
    - kubectl apply -f chaos/network-latency.yaml
    - sleep 60  # Allow chaos to run
    - python3 ./scripts/verify_slos.py --scenario network-latency
    - kubectl delete -f chaos/network-latency.yaml
    - sleep 30  # Allow system to recover
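The `verify_slos.py` script referenced in these jobs is not shown above; a minimal sketch of what it might look like, where the Prometheus address, the query, and the per-scenario thresholds are assumptions to adapt to your own SLOs:

# Sketch of scripts/verify_slos.py: fail the CI job if SLOs are breached during chaos.
# The Prometheus URL, query, and thresholds below are illustrative assumptions.
import argparse
import json
import sys
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus.monitoring:9090"
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)
MAX_ERROR_RATE = {"cpu-stress": 0.01, "network-latency": 0.02}  # per-scenario SLOs

def query_prometheus(query):
    url = f"{PROMETHEUS_URL}/api/v1/query?query={urllib.parse.quote(query)}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        result = json.load(resp)["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--scenario", required=True)
    args = parser.parse_args()

    error_rate = query_prometheus(ERROR_RATE_QUERY)
    threshold = MAX_ERROR_RATE.get(args.scenario, 0.01)
    print(f"scenario={args.scenario} error_rate={error_rate:.4f} threshold={threshold}")
    sys.exit(0 if error_rate <= threshold else 1)

if __name__ == "__main__":
    main()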
Building a Chaos Engineering Culture
1. Start with the Right Mindset
Chaos Engineering requires a blameless culture that values learning:
# Chaos Engineering Cultural Principles
## Blameless Learning
We focus on learning from failures, not assigning blame. We understand that complex systems fail in complex ways, and our goal is to improve system resilience, not find someone to blame.
## Embrace Failure
We recognize that failure is inevitable in complex systems. By embracing and learning from controlled failures, we build more resilient systems and teams.
## Data-Driven Decisions
We base our chaos experiments on data and hypotheses, not hunches. We measure the impact of our experiments and use that data to drive improvements.
## Incremental Progress
We start small and build our chaos engineering practice incrementally. We celebrate small wins and learn from setbacks.
## Shared Responsibility
Resilience is everyone's responsibility, not just operations or SRE. We involve all stakeholders in chaos engineering activities.
## Continuous Improvement
We continuously refine our chaos engineering practice based on what we learn. We regularly review and update our experiments, tools, and processes.
2. Establish Clear Governance
Create a framework for managing chaos experiments:
# Chaos Engineering Governance Framework
## Experiment Approval Process
### Low-Risk Experiments
- Definition: Experiments in non-production environments with no customer impact
- Approval: Team lead approval
- Documentation: Brief experiment plan in team wiki
- Notification: Team chat channel notification 1 hour before experiment
### Medium-Risk Experiments
- Definition: Experiments in production with minimal potential customer impact
- Approval: Engineering manager and SRE lead approval
- Documentation: Detailed experiment plan with abort criteria
- Notification: Company-wide notification 24 hours before experiment
- Scheduling: Only during business hours and outside peak traffic periods
### High-Risk Experiments
- Definition: Experiments in production with potential significant customer impact
- Approval: CTO and VP of Engineering approval
- Documentation: Comprehensive experiment plan with risk assessment
- Notification: Customer notification 72 hours before experiment
- Scheduling: Only during designated maintenance windows
- Monitoring: Executive stakeholder must be available during experiment
Conclusion: The Value of Chaos Engineering
Chaos Engineering is not about creating chaos—it’s about preventing it. By proactively identifying weaknesses in your systems through controlled experiments, you can build more resilient services that maintain reliability even when components fail.
The most successful chaos engineering practices share several common characteristics:
- They start small: Beginning with simple experiments in non-production environments
- They’re hypothesis-driven: Based on clear expectations about system behavior
- They’re incremental: Gradually increasing complexity and scope over time
- They’re collaborative: Involving multiple teams and stakeholders
- They’re educational: Focused on learning and improvement, not blame
As distributed systems continue to grow in complexity, chaos engineering will become an increasingly essential practice for organizations that prioritize reliability. By embracing controlled failure as a learning tool, you can build systems that are not just theoretically resilient, but proven to withstand the unpredictable nature of production environments.
Remember that chaos engineering is a journey, not a destination. Start small, learn continuously, and gradually build a culture that views failure as an opportunity for improvement rather than a crisis to be avoided.