Capacity planning is a critical discipline for Site Reliability Engineering (SRE) teams responsible for maintaining reliable, performant systems at scale. As organizations increasingly rely on digital services, the ability to accurately forecast resource needs, plan for growth, and efficiently allocate infrastructure becomes essential for both reliability and cost management.
This comprehensive guide explores capacity planning methodologies, metrics, forecasting techniques, and implementation strategies specifically tailored for SRE teams. Whether you’re managing on-premises infrastructure, cloud resources, or hybrid environments, this guide will help you develop a robust capacity planning practice that ensures your systems can handle expected and unexpected demands while optimizing resource utilization.
Understanding Capacity Planning for SRE
Before diving into specific methodologies, let’s establish what capacity planning means in the context of Site Reliability Engineering.
What is Capacity Planning?
Capacity planning is the process of determining the resources required to meet expected workloads while maintaining service level objectives (SLOs). For SRE teams, this involves:
- Forecasting demand: Predicting future workload based on historical data and business projections
- Resource modeling: Understanding how workload translates to resource requirements
- Capacity allocation: Provisioning appropriate resources across services and regions
- Performance analysis: Ensuring systems meet performance targets under expected load
- Cost optimization: Balancing reliability requirements with infrastructure costs
Why Capacity Planning Matters for SRE
Effective capacity planning directly impacts several key aspects of reliability engineering:
- Reliability: Ensuring sufficient capacity to handle expected and unexpected loads
- Performance: Maintaining response times and throughput under varying conditions
- Cost efficiency: Avoiding over-provisioning while maintaining reliability
- Incident prevention: Proactively addressing capacity issues before they cause outages
- Scalability: Supporting business growth without service degradation
The Capacity Planning Lifecycle
Capacity planning is not a one-time activity but a continuous process:
┌─────────────────┐
│                 │
│  Collect Data   │
│                 │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│                 │
│ Analyze Trends  │
│                 │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│                 │
│ Forecast Demand │
│                 │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│                 │
│ Model Resource  │
│  Requirements   │
│                 │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│                 │
│  Plan Capacity  │
│                 │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│                 │
│    Implement    │
│     Changes     │
│                 │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│                 │
│   Monitor and   │
│    Validate     │
│                 │
└────────┬────────┘
         │
         └─────────────► (Back to Collect Data)
Key Metrics for Capacity Planning
Effective capacity planning relies on tracking and analyzing the right metrics.
Resource Utilization Metrics
These metrics measure how much of your available resources are being used:
CPU Utilization: Percentage of CPU capacity being used
- Target: Typically 60-80% for headroom
- Formula:
(CPU time used / CPU time available) * 100%
Memory Utilization: Percentage of memory being used
- Target: Typically 70-85% for headroom
- Formula:
(Memory used / Total memory) * 100%
Disk Utilization: Percentage of storage capacity being used
- Target: Typically <80% for performance reasons
- Formula:
(Disk space used / Total disk space) * 100%
Network Utilization: Percentage of network bandwidth being used
- Target: Typically <70% to avoid congestion
- Formula:
(Network traffic / Network capacity) * 100%
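As a minimal sketch, these formulas can be wired into a simple headroom check; the measurements and targets below are hypothetical:
def utilization(used, total):
    """Return utilization as a percentage."""
    return (used / total) * 100

# Hypothetical measurements; targets taken from the guidance above (upper bounds)
measurements = {
    "cpu":     {"used": 6.2,  "total": 8.0,  "target": 80},  # cores
    "memory":  {"used": 26.0, "total": 32.0, "target": 85},  # GiB
    "disk":    {"used": 410,  "total": 500,  "target": 80},  # GiB
    "network": {"used": 5.5,  "total": 10.0, "target": 70},  # Gbps
}

for resource, m in measurements.items():
    pct = utilization(m["used"], m["total"])
    status = "OK" if pct <= m["target"] else "over target"
    print(f"{resource}: {pct:.1f}% (target <= {m['target']}%) -> {status}")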
Performance Metrics
These metrics measure how well your system is performing:
Latency: Time taken to process a request
- Target: Depends on SLOs (e.g., p95 < 200ms)
- Formula:
Time request completed - Time request received
Throughput: Number of requests processed per unit time
- Target: Depends on system requirements
- Formula:
Number of requests / Time period
Error Rate: Percentage of requests that result in errors
- Target: Typically <0.1% for critical services
- Formula:
(Number of errors / Total requests) * 100%
Saturation: Extent to which a resource has more work than it can handle
- Target: Avoid sustained saturation (a persistently non-zero queue depth signals overload)
- Formula: Varies by resource (e.g., queue depth, thread pool utilization)
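A minimal sketch of deriving these metrics from a window of request records; the records and window length are hypothetical:
import statistics

window_seconds = 60
requests = [  # hypothetical request records for the window
    {"latency_ms": 120, "error": False},
    {"latency_ms": 95,  "error": False},
    {"latency_ms": 310, "error": True},
    {"latency_ms": 180, "error": False},
    {"latency_ms": 140, "error": False},
]

latencies = [r["latency_ms"] for r in requests]
p95_latency = statistics.quantiles(latencies, n=20)[18]   # 95th percentile
throughput = len(requests) / window_seconds                # requests per second
error_rate = sum(r["error"] for r in requests) / len(requests) * 100

print(f"p95 latency: {p95_latency:.0f} ms")
print(f"throughput: {throughput:.2f} requests/s")
print(f"error rate: {error_rate:.1f}%")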
Business Metrics
These metrics connect technical capacity to business outcomes:
User Growth: Rate of increase in user base
- Formula:
(Current users - Previous users) / Previous users * 100%
Transaction Volume: Number of business transactions
- Formula:
Sum of transactions in time period
Feature Adoption: Usage of specific features
- Formula:
Number of feature uses / Total user sessions
Seasonal Patterns: Cyclical variations in demand
- Formula: Typically analyzed with time series decomposition
Cost Metrics
These metrics help optimize the financial aspects of capacity:
Cost per Request: Infrastructure cost divided by request count
- Formula:
Total infrastructure cost / Number of requests
Cost per User: Infrastructure cost divided by user count
- Formula:
Total infrastructure cost / Number of users
Resource Efficiency: Business value generated per unit of resource
- Formula:
Business value metric / Resource consumption
Utilization Efficiency: Actual utilization vs. provisioned capacity
- Formula:
Average utilization / Provisioned capacity
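A minimal sketch applying these cost formulas to one month of spend and usage data (all figures hypothetical):
monthly_cost_usd = 42_000            # total infrastructure cost
monthly_requests = 1_800_000_000
monthly_active_users = 250_000
avg_utilization = 0.55               # average observed utilization
provisioned_capacity = 1.0           # normalized provisioned capacity

cost_per_request = monthly_cost_usd / monthly_requests
cost_per_user = monthly_cost_usd / monthly_active_users
utilization_efficiency = avg_utilization / provisioned_capacity

print(f"cost per request: ${cost_per_request:.6f}")
print(f"cost per user: ${cost_per_user:.2f}")
print(f"utilization efficiency: {utilization_efficiency:.0%}")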
Demand Forecasting Techniques
Accurate demand forecasting is the foundation of effective capacity planning.
Time Series Analysis
Time series analysis examines historical data to identify patterns and project future demand:
Moving Averages: Smooths out short-term fluctuations
def moving_average(data, window):
    return [sum(data[i:i + window]) / window for i in range(len(data) - window + 1)]
Exponential Smoothing: Gives more weight to recent observations
def exponential_smoothing(data, alpha):
    result = [data[0]]
    for i in range(1, len(data)):
        result.append(alpha * data[i] + (1 - alpha) * result[i - 1])
    return result
Seasonal Decomposition: Separates time series into trend, seasonal, and residual components
from statsmodels.tsa.seasonal import seasonal_decompose

def decompose_time_series(data, period):
    result = seasonal_decompose(data, model='multiplicative', period=period)
    return result.trend, result.seasonal, result.resid
ARIMA Models: Combines autoregression, differencing, and moving averages
from statsmodels.tsa.arima.model import ARIMA

def arima_forecast(data, order, steps):
    model = ARIMA(data, order=order)
    model_fit = model.fit()
    forecast = model_fit.forecast(steps=steps)
    return forecast
Machine Learning Approaches
Machine learning can capture complex patterns and incorporate multiple variables:
Linear Regression: Models relationship between demand and influencing factors
from sklearn.linear_model import LinearRegression

def linear_regression_forecast(X, y, X_future):
    model = LinearRegression()
    model.fit(X, y)
    return model.predict(X_future)
Random Forest: Captures non-linear relationships and feature interactions
from sklearn.ensemble import RandomForestRegressor

def random_forest_forecast(X, y, X_future):
    model = RandomForestRegressor(n_estimators=100)
    model.fit(X, y)
    return model.predict(X_future)
LSTM Networks: Deep learning approach for complex sequential patterns
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def create_lstm_model(input_shape):
    model = Sequential()
    model.add(LSTM(50, return_sequences=True, input_shape=input_shape))
    model.add(LSTM(50))
    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mse')
    return model
Growth Modeling
Growth modeling helps predict long-term capacity needs based on business trajectories:
Linear Growth: Constant increase over time
y(t) = a * t + b
Exponential Growth: Growth proportional to current size
y(t) = a * e^(b*t)
Logistic Growth: S-shaped curve with saturation
y(t) = L / (1 + e^(-k*(t-t0)))
Gompertz Growth: Asymmetric S-shaped growth
y(t) = L * e^(-b*e^(-c*t))
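These curves can be fit to historical data to estimate where growth is heading. Below is a minimal sketch that fits the logistic model above with SciPy on synthetic user counts; all values are hypothetical:
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, L, k, t0):
    """Logistic growth: L / (1 + exp(-k * (t - t0)))."""
    return L / (1 + np.exp(-k * (t - t0)))

# Hypothetical history: 24 months of user counts with noise
months = np.arange(24)
rng = np.random.default_rng(42)
users = logistic(months, L=1_000_000, k=0.4, t0=12) + rng.normal(0, 10_000, 24)

# Fit the model and project 12 months ahead
params, _ = curve_fit(logistic, months, users, p0=[users.max() * 2, 0.5, 12.0])
future = np.arange(24, 36)
projection = logistic(future, *params)

print(f"estimated ceiling L: {params[0]:,.0f} users")
print(f"projected users in month 36: {projection[-1]:,.0f}")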
Scenario-Based Forecasting
Scenario-based forecasting considers multiple possible futures:
- Base Case: Expected growth under normal conditions
- Best Case: Optimistic scenario (e.g., viral adoption)
- Worst Case: Conservative scenario (e.g., market downturn)
- Stress Case: Extreme but plausible scenario (e.g., 10x traffic spike)
Example Scenario Planning Table:
| Scenario | User Growth | Request Growth | Data Growth | Probability |
|---|---|---|---|---|
| Base Case | 5% monthly | 8% monthly | 10% monthly | 60% |
| Best Case | 15% monthly | 20% monthly | 25% monthly | 10% |
| Worst Case | 2% monthly | 3% monthly | 5% monthly | 20% |
| Stress Case | 200% spike | 300% spike | 150% spike | 10% |
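One way to use such a table is to compute a probability-weighted growth rate for baseline planning while sizing headroom against the stress case. The weighting below is a sketch, not a standard method:
# Growth scenarios from the table above; the stress case is handled separately
# because it describes a short-lived spike rather than sustained monthly growth.
scenarios = {
    "base":  {"request_growth": 0.08, "probability": 0.60},
    "best":  {"request_growth": 0.20, "probability": 0.10},
    "worst": {"request_growth": 0.03, "probability": 0.20},
}
stress_spike_multiple = 3.0  # 300% request spike from the stress case

total_p = sum(s["probability"] for s in scenarios.values())
expected_growth = sum(
    s["request_growth"] * s["probability"] for s in scenarios.values()
) / total_p

print(f"probability-weighted monthly request growth: {expected_growth:.1%}")
print(f"stress-case spike to absorb: {stress_spike_multiple:.0f}x baseline traffic")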
Resource Modeling
Resource modeling translates demand forecasts into specific infrastructure requirements.
Workload Characterization
Before modeling resources, characterize your workload:
- Request Types: Different operations with varying resource needs
- Request Distribution: How requests are distributed over time
- Resource Consumption: CPU, memory, disk, network per request type
- Dependencies: How services interact and depend on each other
Example Workload Profile:
{
  "service": "payment-processing",
  "request_types": {
    "create_payment": {
      "cpu_ms": 120,
      "memory_mb": 64,
      "disk_io_kb": 5,
      "network_io_kb": 2,
      "percentage": 60
    },
    "verify_payment": {
      "cpu_ms": 80,
      "memory_mb": 48,
      "disk_io_kb": 2,
      "network_io_kb": 1,
      "percentage": 30
    },
    "refund_payment": {
      "cpu_ms": 150,
      "memory_mb": 72,
      "disk_io_kb": 8,
      "network_io_kb": 2,
      "percentage": 10
    }
  },
  "peak_to_average_ratio": 2.5,
  "dependencies": [
    {"service": "user-service", "calls_per_request": 0.8},
    {"service": "inventory-service", "calls_per_request": 0.5},
    {"service": "notification-service", "calls_per_request": 1.0}
  ]
}
Resource Estimation Models
Several approaches can be used to estimate resource requirements:
Linear Scaling: Resources scale linearly with load
Resources = Base resources + (Load * Scaling factor)
Queueing Theory: Models systems as networks of queues
Utilization = Arrival rate / (Number of servers * Service rate)
Average queue length ≈ Utilization / (1 - Utilization)   (single-queue M/M/1 approximation)
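As a minimal sketch, these formulas can size a server pool, assuming Poisson arrivals and a single aggregate queue; the rates below are hypothetical:
import math

arrival_rate = 1200.0   # requests per second
service_rate = 50.0     # requests per second a single server can handle
num_servers = 30

utilization = arrival_rate / (num_servers * service_rate)
if utilization >= 1:
    print("Saturated: arrivals exceed total service capacity")
else:
    avg_in_system = utilization / (1 - utilization)  # M/M/1-style approximation
    print(f"utilization: {utilization:.0%}")
    print(f"approx. average requests in system: {avg_in_system:.1f}")

# Smallest pool that keeps utilization under a 70% target
target_utilization = 0.70
min_servers = math.ceil(arrival_rate / (service_rate * target_utilization))
print(f"servers needed for <= {target_utilization:.0%} utilization: {min_servers}")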
Simulation: Mimics system behavior under various conditions
import numpy as np

def simulate_system(arrival_rate, service_rate, num_servers, duration):
    # Simplified simulation example
    servers = [0] * num_servers
    queue = []
    total_wait = 0
    served = 0
    for t in range(duration):
        # New arrivals
        new_arrivals = np.random.poisson(arrival_rate)
        queue.extend([t] * new_arrivals)
        # Service completions
        for i in range(num_servers):
            if servers[i] <= t and queue:
                arrival_time = queue.pop(0)
                wait_time = t - arrival_time
                total_wait += wait_time
                servers[i] = t + np.random.exponential(1 / service_rate)
                served += 1
    avg_wait = total_wait / served if served > 0 else 0
    return avg_wait, len(queue)
Load Testing: Empirical measurement of resource needs
def analyze_load_test(results):
    cpu_per_rps = []
    memory_per_rps = []
    for test in results:
        cpu_per_rps.append(test['cpu_utilization'] / test['requests_per_second'])
        memory_per_rps.append(test['memory_utilization'] / test['requests_per_second'])
    return {
        'avg_cpu_per_rps': sum(cpu_per_rps) / len(cpu_per_rps),
        'avg_memory_per_rps': sum(memory_per_rps) / len(memory_per_rps)
    }
Capacity Models
Capacity models combine forecasts with resource estimates:
Static Capacity Model: Fixed resources based on peak demand
def static_capacity_model(peak_rps, resources_per_rps, headroom_factor=1.5):
    return {
        'cpu': peak_rps * resources_per_rps['cpu'] * headroom_factor,
        'memory': peak_rps * resources_per_rps['memory'] * headroom_factor,
        'disk': peak_rps * resources_per_rps['disk'] * headroom_factor,
        'network': peak_rps * resources_per_rps['network'] * headroom_factor
    }
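A usage sketch of the static model with hypothetical per-RPS figures (the units mirror the load-test-derived numbers used later in this guide):
# Hypothetical per-RPS resource costs measured from load testing
resources_per_rps = {
    'cpu': 0.0002,     # CPU cores per request/s
    'memory': 0.5,     # MB per request/s
    'disk': 0.01,      # IOPS per request/s
    'network': 0.005,  # Mbps per request/s
}

plan = static_capacity_model(peak_rps=12_000, resources_per_rps=resources_per_rps)
for resource, amount in plan.items():
    print(f"{resource}: {amount:,.1f}")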
Dynamic Capacity Model: Adjusts resources based on actual demand
def dynamic_capacity_model(current_rps, forecast_rps, resources_per_rps,
                           min_headroom=1.2, max_headroom=2.0,
                           scale_up_threshold=0.7, scale_down_threshold=0.3):
    # Calculate headroom based on forecast confidence
    # (calculate_forecast_confidence is an assumed helper returning a value in [0, 1])
    forecast_confidence = calculate_forecast_confidence(current_rps, forecast_rps)
    headroom = min_headroom + (max_headroom - min_headroom) * (1 - forecast_confidence)
    # Calculate target capacity
    target_capacity = forecast_rps * resources_per_rps * headroom
    # Determine if scaling is needed
    current_utilization = current_rps / (target_capacity / resources_per_rps)
    if current_utilization > scale_up_threshold:
        action = "scale_up"
    elif current_utilization < scale_down_threshold:
        action = "scale_down"
    else:
        action = "maintain"
    return {
        'target_capacity': target_capacity,
        'action': action,
        'headroom': headroom
    }
Implementing Capacity Planning
Let’s explore how to implement capacity planning in practice.
Capacity Planning Process
A structured capacity planning process includes:
Data Collection
- Gather historical usage data
- Collect business projections
- Document system dependencies
- Measure resource consumption
Analysis and Forecasting
- Identify trends and patterns
- Generate demand forecasts
- Model resource requirements
- Create capacity plans
Implementation
- Provision resources according to plan
- Configure auto-scaling policies
- Implement capacity alerts
- Document capacity decisions
Monitoring and Adjustment
- Track actual vs. forecast usage
- Measure forecast accuracy
- Adjust models based on observations
- Update capacity plans regularly
Capacity Planning Tools
Several tools can assist with capacity planning:
Monitoring Systems
- Prometheus + Grafana
- Datadog
- New Relic
- Dynatrace
Forecasting Tools
- Prophet (Facebook)
- StatsModels (Python)
- TensorFlow Time Series
- Amazon Forecast
Resource Modeling
- Custom simulation tools
- Queueing calculators
- Load testing frameworks (JMeter, Locust)
- Cloud provider calculators
Capacity Management
- Kubernetes Cluster Autoscaler
- AWS Auto Scaling
- Terraform for infrastructure as code
- Custom capacity management systems
Example: Capacity Planning for a Web Service
Let’s walk through a capacity planning example for a web service:
Step 1: Collect and analyze historical data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
# Load historical data
data = pd.read_csv('request_data.csv', parse_dates=['timestamp'])
data.set_index('timestamp', inplace=True)
# Resample to hourly data
hourly_data = data['requests'].resample('H').sum()
# Analyze seasonality
result = seasonal_decompose(hourly_data, model='multiplicative', period=24*7) # Weekly seasonality
# Plot components
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(12, 10))
result.observed.plot(ax=ax1, title='Observed')
result.trend.plot(ax=ax2, title='Trend')
result.seasonal.plot(ax=ax3, title='Seasonality')
result.resid.plot(ax=ax4, title='Residuals')
plt.tight_layout()
plt.savefig('seasonality_analysis.png')
Step 2: Forecast future demand
from prophet import Prophet  # the package was previously published as 'fbprophet'
# Prepare data for Prophet
prophet_data = pd.DataFrame({
    'ds': hourly_data.index,
    'y': hourly_data.values
})
# Create and fit model
model = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=True,
    daily_seasonality=True,
    changepoint_prior_scale=0.05
)
model.fit(prophet_data)
# Make future dataframe
future = model.make_future_dataframe(periods=24*30, freq='H') # Forecast 30 days
# Forecast
forecast = model.predict(future)
# Plot forecast
fig = model.plot(forecast)
plt.title('Request Forecast')
plt.ylabel('Requests per Hour')
plt.savefig('request_forecast.png')
# Extract peak forecast
peak_forecast = forecast['yhat_upper'].max()
Step 3: Model resource requirements
# Resource requirements per request (from load testing)
resources_per_request = {
    'cpu_cores': 0.0002,    # CPU cores per request
    'memory_mb': 0.5,       # MB of memory per request
    'disk_iops': 0.01,      # Disk IOPS per request
    'network_mbps': 0.005   # Mbps per request
}
# Calculate resource needs for peak forecast
peak_resources = {
    'cpu_cores': peak_forecast * resources_per_request['cpu_cores'],
    'memory_mb': peak_forecast * resources_per_request['memory_mb'],
    'disk_iops': peak_forecast * resources_per_request['disk_iops'],
    'network_mbps': peak_forecast * resources_per_request['network_mbps']
}
# Add headroom (50%)
headroom_factor = 1.5
capacity_plan = {k: v * headroom_factor for k, v in peak_resources.items()}
print("Capacity Plan:")
for resource, amount in capacity_plan.items():
    print(f"- {resource}: {amount:.2f}")
Step 4: Translate to infrastructure
# Instance types and their resources
instance_types = {
    'small': {
        'cpu_cores': 2,
        'memory_mb': 4096,
        'cost_per_hour': 0.05
    },
    'medium': {
        'cpu_cores': 4,
        'memory_mb': 8192,
        'cost_per_hour': 0.10
    },
    'large': {
        'cpu_cores': 8,
        'memory_mb': 16384,
        'cost_per_hour': 0.20
    }
}
import math  # needed for math.ceil below

# Calculate instances needed
def calculate_instances(capacity_plan, instance_type):
    specs = instance_types[instance_type]
    cpu_instances = math.ceil(capacity_plan['cpu_cores'] / specs['cpu_cores'])
    memory_instances = math.ceil(capacity_plan['memory_mb'] / specs['memory_mb'])
    return max(cpu_instances, memory_instances)
# Calculate for each instance type
instance_counts = {
    instance_type: calculate_instances(capacity_plan, instance_type)
    for instance_type in instance_types
}
# Calculate costs
instance_costs = {
    instance_type: count * instance_types[instance_type]['cost_per_hour'] * 24 * 30
    for instance_type, count in instance_counts.items()
}
# Find most cost-effective option
most_cost_effective = min(instance_costs, key=instance_costs.get)
print(f"Most cost-effective option: {instance_counts[most_cost_effective]} {most_cost_effective} instances")
print(f"Monthly cost: ${instance_costs[most_cost_effective]:.2f}")
Step 5: Implement capacity plan
# Kubernetes deployment with HPA
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
spec:
  replicas: 10  # Initial capacity
  selector:
    matchLabels:
      app: web-service
  template:
    metadata:
      labels:
        app: web-service
    spec:
      containers:
      - name: web-service
        image: web-service:1.0.0
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-service
  minReplicas: 5
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
Advanced Capacity Planning Strategies
As your systems mature, consider these advanced strategies:
Multi-Region Capacity Planning
Planning capacity across multiple regions requires additional considerations:
- Regional Traffic Distribution: How traffic is distributed geographically
- Failover Scenarios: Capacity needed during regional failures
- Data Replication: Impact of data synchronization on capacity
- Latency Requirements: How latency affects regional deployment
Example Multi-Region Capacity Plan:
regions:
  us-east:
    normal_traffic_percentage: 40
    peak_rps: 5000
    instances:
      baseline: 20
      peak: 30
      failover: 50  # Can handle us-west failure
  us-west:
    normal_traffic_percentage: 30
    peak_rps: 3750
    instances:
      baseline: 15
      peak: 25
      failover: 45  # Can handle us-east failure
  eu-central:
    normal_traffic_percentage: 20
    peak_rps: 2500
    instances:
      baseline: 10
      peak: 15
      failover: 20  # Not a failover region
  ap-southeast:
    normal_traffic_percentage: 10
    peak_rps: 1250
    instances:
      baseline: 5
      peak: 10
      failover: 15  # Not a failover region
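A minimal sketch for validating such a plan: check that a surviving region's failover allocation can absorb the failed region's peak traffic. The per-instance capacity below is a hypothetical load-test figure:
regions = {
    "us-east": {"peak_rps": 5000, "failover_instances": 50},
    "us-west": {"peak_rps": 3750, "failover_instances": 45},
}
rps_per_instance = 200  # hypothetical capacity of one instance

# If us-west fails, us-east must absorb both regions' peak traffic
combined_peak = regions["us-east"]["peak_rps"] + regions["us-west"]["peak_rps"]
failover_capacity = regions["us-east"]["failover_instances"] * rps_per_instance

print(f"combined peak during us-west failure: {combined_peak} rps")
print(f"us-east failover capacity: {failover_capacity} rps")
print("sufficient" if failover_capacity >= combined_peak else "insufficient: add instances")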
Predictive Auto-Scaling
Implement auto-scaling based on predictions rather than just current metrics:
def predictive_scaling(historical_data, forecast_horizon=24):
    """Generate scaling schedule based on predictions."""
    # Train forecasting model (train_forecasting_model and
    # calculate_required_instances are assumed helper functions)
    model = train_forecasting_model(historical_data)
    # Generate hourly predictions
    predictions = model.predict(horizon=forecast_horizon)
    # Convert predictions to scaling schedule
    scaling_schedule = []
    for hour, prediction in enumerate(predictions):
        required_instances = calculate_required_instances(prediction)
        scaling_schedule.append({
            'hour': hour,
            'instances': required_instances
        })
    return scaling_schedule
Capacity Risk Management
Manage capacity risks through systematic analysis:
Risk Identification: Identify potential capacity risks
- Unexpected traffic spikes
- Resource exhaustion
- Dependency failures
- Infrastructure outages
Risk Assessment: Evaluate likelihood and impact
- Probability of occurrence
- Potential service impact
- Detection capability
- Recovery time
Risk Mitigation: Implement strategies to reduce risk
- Overprovisioning critical components
- Implementing circuit breakers
- Designing graceful degradation
- Creating contingency plans
Example Risk Assessment Matrix:
| Risk | Likelihood | Impact | Risk Score | Mitigation |
|---|---|---|---|---|
| Traffic spike (2x) | High | Medium | High | Auto-scaling, rate limiting |
| Database overload | Medium | High | High | Read replicas, connection pooling |
| CDN failure | Low | High | Medium | Multi-CDN strategy, local caching |
| Region outage | Low | Critical | High | Multi-region deployment, failover testing |
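One way to make these scores reproducible is to map the qualitative levels to numbers; the weighting below is an assumption chosen to reproduce the matrix above:
LIKELIHOOD = {"Low": 1, "Medium": 2, "High": 3}
IMPACT = {"Low": 1, "Medium": 2, "High": 3, "Critical": 5}

def risk_score(likelihood, impact):
    score = LIKELIHOOD[likelihood] * IMPACT[impact]
    if score >= 10:
        return "Critical"
    if score >= 5:
        return "High"
    if score >= 3:
        return "Medium"
    return "Low"

risks = [
    ("Traffic spike (2x)", "High", "Medium"),
    ("Database overload", "Medium", "High"),
    ("CDN failure", "Low", "High"),
    ("Region outage", "Low", "Critical"),
]
for name, likelihood, impact in risks:
    print(f"{name}: {risk_score(likelihood, impact)}")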
Continuous Capacity Optimization
Implement a continuous optimization process:
Regular Capacity Reviews: Schedule periodic reviews
- Weekly for short-term adjustments
- Monthly for medium-term planning
- Quarterly for long-term strategy
Automated Efficiency Analysis: Identify optimization opportunities
- Underutilized resources
- Over-provisioned services
- Cost anomalies
- Performance bottlenecks
Feedback Loops: Improve forecasting and planning
- Track forecast accuracy
- Document capacity decisions
- Analyze incident capacity factors
- Update models with new data
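A minimal sketch for tracking forecast accuracy, using mean absolute percentage error (MAPE) over a review period; the figures are hypothetical:
def mape(actual, forecast):
    """Mean absolute percentage error; lower is better."""
    return sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual) * 100

# Hypothetical daily request volumes (millions) for one weekly review
actual = [10.2, 11.0, 10.8, 12.5, 13.1, 12.9, 11.7]
forecast = [10.0, 10.6, 11.2, 12.0, 13.5, 12.4, 11.9]

error = mape(actual, forecast)
print(f"forecast MAPE: {error:.1f}%")
if error > 15:
    print("Forecast is drifting: retrain the model with recent data.")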
Capacity Planning Challenges and Solutions
Let’s address common challenges in capacity planning:
Challenge 1: Unpredictable Growth
Problem: Business growth doesn’t follow historical patterns.
Solutions:
- Implement scenario-based planning
- Maintain flexible infrastructure (cloud, containers)
- Create contingency plans for rapid scaling
- Establish early warning indicators
Challenge 2: Complex Dependencies
Problem: Service dependencies create cascading capacity requirements.
Solutions:
- Map service dependencies comprehensively
- Model capacity needs across the entire system
- Implement circuit breakers and fallbacks
- Test dependency failure scenarios
Challenge 3: Cost Constraints
Problem: Balancing reliability with cost efficiency.
Solutions:
- Implement tiered capacity strategies
- Use spot/preemptible instances for non-critical workloads
- Optimize resource utilization through better scheduling
- Implement cost allocation and chargeback
Challenge 4: Legacy Systems
Problem: Older systems with limited scalability.
Solutions:
- Identify and address bottlenecks
- Implement caching and offloading strategies
- Plan gradual modernization
- Create isolation boundaries around legacy components
Conclusion: Building a Capacity Planning Practice
Effective capacity planning is essential for SRE teams to maintain reliable, performant systems while optimizing costs. By implementing a structured approach to forecasting demand, modeling resource requirements, and planning capacity, you can ensure your infrastructure scales appropriately with your business needs.
Remember that capacity planning is not a one-time activity but a continuous process that improves over time. Start with the basics—collecting good data, establishing clear metrics, and creating simple models—then gradually incorporate more sophisticated techniques as your practice matures.
The most successful capacity planning practices combine quantitative analysis with engineering judgment, business context, and continuous learning. By following the methodologies and strategies outlined in this guide, you can build a capacity planning practice that supports your reliability goals while making efficient use of your infrastructure resources.