# Site Reliability Engineering (SRE)

## SRE Overview

## SLIs, SLOs, and SLAs

### Defining Good SLIs

### SLO Example
```yaml
# SLO definition
service: order-api
slos:
  - name: availability
    description: "Order API should be available"
    sli:
      type: availability
      metric: |
        sum(rate(http_requests_total{status!~"5.."}[5m])) /
        sum(rate(http_requests_total[5m]))
    objective: 99.9%
    window: 30d
  - name: latency
    description: "Order API should be fast"
    sli:
      type: latency
      metric: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
    objective: 200ms  # P99 < 200ms
    window: 30d
```
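A rough sketch of how such an SLO might be checked in practice, assuming a Prometheus server at a placeholder URL (the server address and the pass/fail handling below are illustrative assumptions, not part of the definition above):

```python
# Illustrative sketch only: query the availability SLI from Prometheus and
# compare it to the objective. The server URL is an assumed placeholder.
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumed Prometheus address

AVAILABILITY_QUERY = (
    'sum(rate(http_requests_total{status!~"5.."}[5m])) / '
    'sum(rate(http_requests_total[5m]))'
)


def query_instant(query: str) -> float:
    """Run an instant query via the Prometheus HTTP API and return one scalar."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=5
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1])  # value is [timestamp, "value"]


availability = query_instant(AVAILABILITY_QUERY)
objective = 0.999  # 99.9% availability objective from the SLO above
print(f"Availability: {availability:.4%} (objective {objective:.1%})")
if availability < objective:
    print("SLO at risk: prioritize reliability work over new features")
```

In a real setup this check would run over the full 30-day SLO window (for example with a `[30d]` range in the query) rather than the 5-minute rate used in the SLI definition.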
## Error Budgets

### Error Budget Calculation
```python
# Error budget calculation
def calculate_error_budget(slo_target, window_days):
    """
    Calculate the error budget in minutes.
    99.9% SLO over 30 days = 43.2 minutes of allowed downtime.
    """
    total_minutes = window_days * 24 * 60
    error_budget_percent = 1 - slo_target
    error_budget_minutes = total_minutes * error_budget_percent
    return error_budget_minutes


# 99.9% SLO
budget = calculate_error_budget(0.999, 30)
print(f"Error budget: {budget:.1f} minutes")  # 43.2 minutes


# Track consumption
def error_budget_remaining(slo_target, window_days, actual_uptime):
    total_budget = calculate_error_budget(slo_target, window_days)
    consumed = (1 - actual_uptime) * window_days * 24 * 60
    return total_budget - consumed
```
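For example (with illustrative numbers), a 30-day window measured at 99.95% uptime against the 99.9% SLO consumes half of the budget:

```python
# Illustrative usage of the helpers above with assumed measurements.
budget = calculate_error_budget(0.999, 30)             # 43.2 minutes per month
remaining = error_budget_remaining(0.999, 30, 0.9995)  # 99.95% measured uptime
print(f"Total budget: {budget:.1f} min, remaining: {remaining:.1f} min")
# Total budget: 43.2 min, remaining: 21.6 min
```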
## Toil

## Incident Management

### Incident Lifecycle

### Incident Roles

### Severity Levels

## Postmortems

### Blameless Postmortem Template
```markdown
# Incident Postmortem: [Title]
## Incident Summary
- **Date**: 2024-01-15
- **Duration**: 45 minutes (10:15 - 11:00 UTC)
- **Severity**: SEV 2
- **Impact**: 15% of users experienced checkout failures
- **Detection**: Automated alert (error rate > 1%)
## Timeline
| Time (UTC) | Event |
|------------|-------|
| 10:15 | Alert fired: checkout error rate > 1% |
| 10:18 | On-call engineer acknowledges |
| 10:25 | Root cause identified: database connection pool exhausted |
| 10:35 | Mitigation: Increased connection pool size |
| 10:45 | Monitoring shows recovery |
| 11:00 | Incident resolved |
## Root Cause
Database connection pool was sized for normal traffic. A marketing
campaign drove 3x normal traffic, exhausting connections.
## Contributing Factors
- Connection pool not auto-scaling
- No alerting on connection pool utilization
- Load testing didn't cover this scenario
## What Went Well
- Alert fired promptly
- Quick identification of root cause
- Team collaboration was effective
## What Could Be Improved
- Earlier detection of connection pool saturation
- Automated scaling of database connections
- Better load testing for campaigns
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Add connection pool utilization alert | @alice | 2024-01-22 | Done |
| Implement connection pool auto-scaling | @bob | 2024-02-01 | In Progress |
| Update load testing for campaigns | @carol | 2024-01-29 | Open |
## Lessons Learned
Always coordinate with marketing on campaign timing to prepare for traffic spikes.
```
## On-Call Best Practices

## Common Interview Questions
- SLI vs SLO vs SLA?
  - SLI: metric measuring service behavior
  - SLO: target value for the SLI
  - SLA: contract with consequences
- What is an error budget?
  - Allowed unreliability (100% - SLO)
  - Balances reliability vs velocity
  - When exhausted, focus on reliability
- What is toil?
  - Manual, repetitive, automatable work
  - Scales with service size
  - Target < 50% of time on toil
- Blameless postmortem principles?
  - Focus on systems, not people
  - Learn from failures
  - Create actionable improvements
- How to reduce on-call burden?
  - Improve alert quality
  - Automate responses
  - Reduce toil
  - Better runbooks