
Site Reliability Engineering (SRE)


SRE Overview

SRE Core Principles


SLIs, SLOs, and SLAs

SLI, SLO, and SLA Relationship

An SLI is a metric that measures service behavior, an SLO is the target value for that SLI, and an SLA is a contract that attaches consequences to missing the SLO. Each layer builds on the one below it.

Defining Good SLIs

Good SLIs reflect what users actually experience. A common pattern is a ratio of good events to total events, such as the fraction of non-5xx responses in the availability metric below.

Choosing SLIs

Pick a small number of SLIs per service rather than measuring everything; for a request-driven API like the example that follows, availability and latency usually cover the user experience.

SLO Example

# SLO definition
service: order-api
slos:
  - name: availability
    description: "Order API should be available"
    sli:
      type: availability
      metric: |
        sum(rate(http_requests_total{status!~"5.."}[5m])) /
        sum(rate(http_requests_total[5m]))
    objective: 99.9%
    window: 30d

  - name: latency
    description: "Order API should be fast"
    sli:
      type: latency
      metric: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
    objective: 200ms  # P99 < 200ms
    window: 30d
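
To see how a definition like this gets evaluated, here is a minimal Python sketch that queries a Prometheus server for the availability SLI above and compares it to the 99.9% objective. The server address is a placeholder, and the sketch assumes the standard Prometheus /api/v1/query HTTP endpoint; adapt both to your environment.

# Minimal sketch: evaluate the availability SLI against its SLO.
# PROM_URL is a hypothetical address; /api/v1/query is Prometheus's
# standard instant-query endpoint.
import requests

PROM_URL = "http://prometheus:9090"
SLI_QUERY = (
    'sum(rate(http_requests_total{status!~"5.."}[5m])) / '
    "sum(rate(http_requests_total[5m]))"
)
SLO_TARGET = 0.999  # 99.9% availability objective

def current_availability():
    """Return the availability SLI as a fraction between 0 and 1."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": SLI_QUERY})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1])  # each result holds a [timestamp, value] pair

if __name__ == "__main__":
    sli = current_availability()
    status = "meeting" if sli >= SLO_TARGET else "violating"
    print(f"Availability: {sli:.4%} ({status} the {SLO_TARGET:.1%} SLO)")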

Error Budgets

An error budget is the amount of unreliability a service is allowed: 100% minus the SLO. While budget remains, the team can spend it on feature velocity; once it is exhausted, work shifts to reliability.

Error Budget Calculation

# Error budget calculation
def calculate_error_budget(slo_target, window_days):
    """
    Calculate error budget in minutes.

    99.9% SLO over 30 days = 43.2 minutes allowed downtime
    """
    total_minutes = window_days * 24 * 60
    error_budget_percent = 1 - slo_target
    error_budget_minutes = total_minutes * error_budget_percent
    return error_budget_minutes

# 99.9% SLO
budget = calculate_error_budget(0.999, 30)
print(f"Error budget: {budget:.1f} minutes")  # 43.2 minutes

# Track consumption: actual_uptime is a fraction, e.g. 0.9995
def error_budget_remaining(slo_target, window_days, actual_uptime):
    """Return the minutes of error budget left in the window."""
    total_budget = calculate_error_budget(slo_target, window_days)
    consumed = (1 - actual_uptime) * window_days * 24 * 60
    return total_budget - consumed
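
A related quantity used for alerting is the burn rate: how fast the budget is being consumed relative to a spend that would exhaust it exactly at the end of the window. A burn rate of 1.0 means the budget lasts the full window; higher values mean it runs out early. A minimal sketch reusing the functions above (the example numbers are illustrative):

# Burn rate: observed error rate divided by the allowed error rate.
# At 1.0, the budget is exhausted exactly when the window ends.
def burn_rate(slo_target, observed_error_rate):
    allowed_error_rate = 1 - slo_target
    return observed_error_rate / allowed_error_rate

# Example: 99.9% SLO while 1.4% of requests are failing
rate = burn_rate(0.999, 0.014)
print(f"Burn rate: {rate:.1f}x")  # 14.0x -- a 30-day budget gone in ~2 days

# Illustrative policy check: stop shipping features once the budget is spent
if error_budget_remaining(0.999, 30, actual_uptime=0.9985) <= 0:
    print("Error budget exhausted: prioritize reliability work")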

Toil

Toil is manual, repetitive, automatable operational work that scales with service size and produces no lasting value. SRE practice targets keeping toil under 50% of engineering time so the rest can go toward automation and lasting improvements, as the sketch below illustrates.
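
To make that 50% target measurable, teams track what fraction of time goes to toil. A small hypothetical helper (the categories and hours are illustrative):

# Hypothetical toil tracker; categories and hours below are made-up examples.
TOIL_TARGET = 0.50  # guideline: keep toil under 50% of engineering time

def toil_fraction(hours_by_category, toil_categories):
    """Return the fraction of logged hours spent on toil."""
    toil_hours = sum(hours_by_category[c] for c in toil_categories)
    return toil_hours / sum(hours_by_category.values())

week = {"tickets": 12, "manual deploys": 6, "pages": 4, "automation work": 18}
fraction = toil_fraction(week, ["tickets", "manual deploys", "pages"])
print(f"Toil: {fraction:.0%} of time")  # 55% -- over the target
if fraction > TOIL_TARGET:
    print("Over the toil budget: invest in automation")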


Incident Management

Incident Lifecycle

A typical incident moves through detection, triage, mitigation, resolution, and finally a postmortem; the timeline in the postmortem example below follows exactly this shape.

Incident Roles

Larger incidents assign explicit roles: an incident commander who coordinates the response, a communications lead who keeps stakeholders informed, and operations responders who investigate and execute the fix.

Severity Levels

Severity levels set response urgency and communication expectations, from SEV 1 (critical, full outage) down to minor issues; the postmortem below classifies a partial checkout outage affecting 15% of users as SEV 2. A sketch of how such a scheme might be encoded follows.
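
Teams often encode severity levels in tooling so paging behavior is consistent. A hypothetical sketch (the level definitions and response targets are illustrative, not a standard):

# Hypothetical severity scheme; levels and response targets are illustrative.
SEVERITIES = {
    "SEV1": {"summary": "Full outage, all users affected", "page": True, "respond_within_min": 5},
    "SEV2": {"summary": "Major degradation, subset of users", "page": True, "respond_within_min": 15},
    "SEV3": {"summary": "Minor issue, workaround exists", "page": False, "respond_within_min": 240},
}

def should_page(severity):
    """Return True if this severity warrants paging the on-call engineer."""
    return SEVERITIES[severity]["page"]

# The checkout incident in the postmortem below was a SEV 2: 15% of users
# affected, and the on-call engineer was paged and acknowledged within 3 minutes.
print(should_page("SEV2"))  # True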


Postmortems

Blameless Postmortem Template

# Incident Postmortem: [Title]

## Incident Summary
- **Date**: 2024-01-15
- **Duration**: 45 minutes (10:15 - 11:00 UTC)
- **Severity**: SEV 2
- **Impact**: 15% of users experienced checkout failures
- **Detection**: Automated alert (error rate > 1%)

## Timeline
| Time (UTC) | Event |
|------------|-------|
| 10:15 | Alert fired: checkout error rate > 1% |
| 10:18 | On-call engineer acknowledges |
| 10:25 | Root cause identified: database connection pool exhausted |
| 10:35 | Mitigation: Increased connection pool size |
| 10:45 | Monitoring shows recovery |
| 11:00 | Incident resolved |

## Root Cause
Database connection pool was sized for normal traffic. A marketing
campaign drove 3x normal traffic, exhausting connections.

## Contributing Factors
- Connection pool not auto-scaling
- No alerting on connection pool utilization
- Load testing didn't cover this scenario

## What Went Well
- Alert fired promptly
- Quick identification of root cause
- Team collaboration was effective

## What Could Be Improved
- Earlier detection of connection pool saturation
- Automated scaling of database connections
- Better load testing for campaigns

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Add connection pool utilization alert | @alice | 2024-01-22 | Done |
| Implement connection pool auto-scaling | @bob | 2024-02-01 | In Progress |
| Update load testing for campaigns | @carol | 2024-01-29 | Open |

## Lessons Learned
Always coordinate with marketing on campaign timing to
prepare for traffic spikes.

On-Call Best Practices

Sustainable on-call rests on high-quality alerts (every page should be actionable), automated responses where possible, up-to-date runbooks, and steady reduction of toil.


Common Interview Questions

  1. SLI vs SLO vs SLA?
     - SLI: Metric measuring service behavior
     - SLO: Target value for the SLI
     - SLA: Contract with consequences

  2. What is an error budget?
     - Allowed unreliability (100% - SLO)
     - Balances reliability vs velocity
     - When exhausted, focus on reliability

  3. What is toil?
     - Manual, repetitive, automatable work
     - Scales with service size
     - Target < 50% time on toil

  4. Blameless postmortem principles?
     - Focus on systems, not people
     - Learn from failures
     - Create actionable improvements

  5. How to reduce on-call burden?
     - Improve alert quality
     - Automate responses
     - Reduce toil
     - Better runbooks