# Site Reliability Engineering (SRE)

## SRE Overview

## SLIs, SLOs, and SLAs

### Defining Good SLIs

### SLO Example
```yaml
# SLO definition
service: order-api
slos:
  - name: availability
    description: "Order API should be available"
    sli:
      type: availability
      metric: |
        sum(rate(http_requests_total{status!~"5.."}[5m])) /
        sum(rate(http_requests_total[5m]))
    objective: 99.9%
    window: 30d
  - name: latency
    description: "Order API should be fast"
    sli:
      type: latency
      metric: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
    objective: 200ms  # P99 < 200ms
    window: 30d
```
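A rough sketch of how such an SLO might be checked in practice, assuming a Prometheus server at a placeholder URL (the server address and the pass/fail handling below are illustrative assumptions, not part of the definition above):

```python
# Illustrative sketch only: query the availability SLI from Prometheus and
# compare it to the objective. The server URL is an assumed placeholder.
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumed Prometheus address

AVAILABILITY_QUERY = (
    'sum(rate(http_requests_total{status!~"5.."}[5m])) / '
    'sum(rate(http_requests_total[5m]))'
)


def query_instant(query: str) -> float:
    """Run an instant query via the Prometheus HTTP API and return one scalar."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=5
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1])  # value is [timestamp, "value"]


availability = query_instant(AVAILABILITY_QUERY)
objective = 0.999  # 99.9% availability objective from the SLO above
print(f"Availability: {availability:.4%} (objective {objective:.1%})")
if availability < objective:
    print("SLO at risk: prioritize reliability work over new features")
```

In a real setup this check would run over the full 30-day SLO window (for example with a `[30d]` range in the query) rather than the 5-minute rate used in the SLI definition.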
## Error Budgets

### Error Budget Calculation
```python
# Error budget calculation
def calculate_error_budget(slo_target, window_days):
    """
    Calculate the error budget in minutes.
    99.9% SLO over 30 days = 43.2 minutes of allowed downtime.
    """
    total_minutes = window_days * 24 * 60
    error_budget_percent = 1 - slo_target
    error_budget_minutes = total_minutes * error_budget_percent
    return error_budget_minutes


# 99.9% SLO
budget = calculate_error_budget(0.999, 30)
print(f"Error budget: {budget:.1f} minutes")  # 43.2 minutes


# Track consumption
def error_budget_remaining(slo_target, window_days, actual_uptime):
    total_budget = calculate_error_budget(slo_target, window_days)
    consumed = (1 - actual_uptime) * window_days * 24 * 60
    return total_budget - consumed
```
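For example (with illustrative numbers), a 30-day window measured at 99.95% uptime against the 99.9% SLO consumes half of the budget:

```python
# Illustrative usage of the helpers above with assumed measurements.
budget = calculate_error_budget(0.999, 30)             # 43.2 minutes per month
remaining = error_budget_remaining(0.999, 30, 0.9995)  # 99.95% measured uptime
print(f"Total budget: {budget:.1f} min, remaining: {remaining:.1f} min")
# Total budget: 43.2 min, remaining: 21.6 min
```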
## Toil

## Incident Management

### Incident Lifecycle

### Incident Roles

### Severity Levels

## Postmortems

### Blameless Postmortem Template
```markdown
# Incident Postmortem: [Title]
## Incident Summary
- **Date**: 2024-01-15
- **Duration**: 45 minutes (10:15 - 11:00 UTC)
- **Severity**: SEV 2
- **Impact**: 15% of users experienced checkout failures
- **Detection**: Automated alert (error rate > 1%)
## Timeline
| Time (UTC) | Event |
|------------|-------|
| 10:15 | Alert fired: checkout error rate > 1% |
| 10:18 | On-call engineer acknowledges |
| 10:25 | Root cause identified: database connection pool exhausted |
| 10:35 | Mitigation: Increased connection pool size |
| 10:45 | Monitoring shows recovery |
| 11:00 | Incident resolved |
## Root Cause
Database connection pool was sized for normal traffic. A marketing
campaign drove 3x normal traffic, exhausting connections.
## Contributing Factors
- Connection pool not auto-scaling
- No alerting on connection pool utilization
- Load testing didn't cover this scenario
## What Went Well
- Alert fired promptly
- Quick identification of root cause
- Team collaboration was effective
## What Could Be Improved
- Earlier detection of connection pool saturation
- Automated scaling of database connections
- Better load testing for campaigns
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Add connection pool utilization alert | @alice | 2024-01-22 | Done |
| Implement connection pool auto-scaling | @bob | 2024-02-01 | In Progress |
| Update load testing for campaigns | @carol | 2024-01-29 | Open |
## Lessons Learned
Always coordinate with marketing on campaign timing to prepare for traffic spikes.
```
## On-Call Best Practices

## Common Interview Questions
- SLI vs SLO vs SLA?
  - SLI: metric measuring service behavior
  - SLO: target value for the SLI
  - SLA: contract with consequences
- What is an error budget?
  - Allowed unreliability (100% - SLO)
  - Balances reliability vs velocity
  - When exhausted, focus on reliability
- What is toil?
  - Manual, repetitive, automatable work
  - Scales with service size
  - Target < 50% of time on toil
- Blameless postmortem principles?
  - Focus on systems, not people
  - Learn from failures
  - Create actionable improvements
- How to reduce on-call burden?
  - Improve alert quality
  - Automate responses
  - Reduce toil
  - Better runbooks