Chaos Engineering

What is Chaos Engineering?

[Figure: Chaos Engineering Overview]

Chaos Engineering Process

[Figure: Chaos Experiment Process]

Types of Chaos Experiments

Infrastructure Failures

[Figure: Infrastructure Chaos]

Network Failures

[Figure: Network Chaos]

Application Failures

[Figure: Application Chaos]

Chaos Engineering Tools

Chaos Monkey (Netflix)

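# Chaos Monkey for Spring Boot configuration (application.yml).
# This is the Spring Boot library inspired by Netflix's Chaos Monkey, not the Netflix tool itself.
# It also requires the "chaos-monkey" Spring profile to be active.
chaos:
  monkey:
    enabled: true
    assaults:
      level: 5  # attack every 5th request (1 = every request)
      latencyActive: true
      latencyRangeStart: 1000  # minimum injected latency, in ms
      latencyRangeEnd: 3000    # maximum injected latency, in ms
      exceptionsActive: true
      killApplicationActive: false
    watcher:
      # Spring bean types to instrument
      service: true
      controller: true
      repository: true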
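
To make the `level` and latency range settings above more concrete, here is a minimal Python sketch of the idea. It assumes a simple deterministic rule (every `level`-th request is attacked) and a uniform delay between the range bounds. The function names are hypothetical, and the real library may select requests differently, so read this as an illustration of the semantics rather than its implementation.

```python
import random


def should_attack(request_number: int, level: int) -> bool:
    """Return True for every `level`-th request (level=1 attacks every request)."""
    if level < 1:
        raise ValueError("level must be >= 1")
    return request_number % level == 0


def latency_ms(range_start: int, range_end: int, rng: random.Random) -> int:
    """Pick a delay (in ms) uniformly from the configured range, bounds included."""
    return rng.randint(range_start, range_end)


def simulate(requests: int, level: int, range_start: int, range_end: int,
             seed: int = 0) -> list[int]:
    """Return the injected delay in ms for each request; 0 means no attack."""
    rng = random.Random(seed)
    delays = []
    for n in range(1, requests + 1):
        if should_attack(n, level):
            delays.append(latency_ms(range_start, range_end, rng))
        else:
            delays.append(0)
    return delays


if __name__ == "__main__":
    # With level=5, only the 5th and 10th requests receive a delay.
    print(simulate(requests=10, level=5, range_start=1000, range_end=3000))
```

With `level: 5` and ten requests, only two requests are delayed, each by somewhere between 1 and 3 seconds. That is a useful sanity check when sizing timeouts for the services under test.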

Litmus Chaos (Kubernetes)

# Litmus ChaosExperiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["delete", "list", "get"]
    image: litmuschaos/go-runner:latest
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "30"
      - name: CHAOS_INTERVAL
        value: "10"
      - name: FORCE
        value: "false"

---
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-service-chaos
spec:
  engineState: active  # set to "stop" to halt the experiment
  appinfo:
    appns: production
    applabel: app=order-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
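
The relationship between `TOTAL_CHAOS_DURATION` and `CHAOS_INTERVAL` can be sketched as follows. This assumes a plain "delete, then wait one interval" loop that stops once the total duration is reached. The helper name is hypothetical, and real runs also depend on how long pods take to terminate and become ready again.

```python
def deletion_offsets(total_chaos_duration: int, chaos_interval: int) -> list[int]:
    """Seconds after experiment start at which a pod deletion would be triggered.

    Assumes one deletion per interval until the total duration is used up.
    """
    if total_chaos_duration <= 0 or chaos_interval <= 0:
        raise ValueError("duration and interval must be positive")
    return list(range(0, total_chaos_duration, chaos_interval))


if __name__ == "__main__":
    # Values from the manifests above: 30 s total, 10 s between deletions.
    print(deletion_offsets(30, 10))  # [0, 10, 20]
```

Under these assumptions, the values in the manifests above give roughly three deletions per run.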

Gremlin

# Gremlin CLI examples

# CPU attack
gremlin attack cpu \
  --length 60 \
  --percent 80 \
  --target-tags "service=order-service,env=production"

# Latency attack
gremlin attack latency \
  --length 120 \
  --ms 500 \
  --target-tags "service=payment-service"

# Blackhole attack (network partition)
gremlin attack blackhole \
  --length 60 \
  --hostnames "database.internal" \
  --target-tags "service=order-service"

Chaos Toolkit

# experiment.yaml
title: "Verify order service handles database failure"
description: "When database is unavailable, orders should be queued"

steady-state-hypothesis:
  title: "Order processing is normal"
  probes:
    - type: probe
      name: "orders-processed-per-minute"
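      tolerance:
        type: range
        target: body      # compare against the HTTP response body
        range: [90, 110]  # the body must be a number within these bounds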
      provider:
        type: http
        url: "http://metrics/orders-processed"

method:
  - type: action
    name: "terminate-database"
    provider:
      type: python
      module: chaosaws.rds.actions
      func: stop_db_instance
      arguments:
        db_instance_identifier: "orders-db"
    pauses:
      after: 60

  - type: probe
    name: "check-queue-size"
    provider:
      type: http
      url: "http://metrics/queue-size"

rollbacks:
  - type: action
    name: "restart-database"
    provider:
      type: python
      module: chaosaws.rds.actions
      func: start_db_instance
      arguments:
        db_instance_identifier: "orders-db"
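
The flow the experiment file describes (check the steady state, inject the failure, check again, then roll back) can be sketched in a few lines of Python. This is not Chaos Toolkit's own code: the function names and result strings are hypothetical, and the probe, method, and rollback are passed in as plain callables.

```python
from typing import Callable, Sequence


def within_range(value: float, bounds: Sequence[float]) -> bool:
    """Range tolerance: the value must lie between the two bounds, inclusive."""
    low, high = bounds
    return low <= value <= high


def run_experiment(probe: Callable[[], float],
                   bounds: Sequence[float],
                   method: Callable[[], None],
                   rollback: Callable[[], None]) -> str:
    """Run one experiment and report whether the steady state held."""
    # 1. Do not inject failure into a system that is already unhealthy.
    if not within_range(probe(), bounds):
        return "aborted: steady state not met before the experiment"
    try:
        # 2. Inject the failure, then 3. re-check the hypothesis.
        method()
        deviated = not within_range(probe(), bounds)
    finally:
        # 4. Always attempt to restore the system, even if the method raised.
        rollback()
    return "deviated" if deviated else "completed"


if __name__ == "__main__":
    readings = iter([100, 40])  # orders per minute before and after the failure
    result = run_experiment(
        probe=lambda: next(readings),
        bounds=(90, 110),
        method=lambda: print("stopping orders-db"),
        rollback=lambda: print("starting orders-db"),
    )
    print(result)
```

Checking the hypothesis before the method runs is what keeps the experiment from making an already degraded system worse.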

Game Days

[Figure: Game Days]

Best Practices

[Figure: Chaos Engineering Best Practices]

Maturity Model

[Figure: Chaos Engineering Maturity]

Common Interview Questions

1. What is chaos engineering?
   - Running controlled experiments to test a system's resilience
   - Building confidence in how the system behaves under failure
   - Discovering weaknesses proactively, before they cause incidents
2. How does chaos engineering differ from testing?
   - Testing verifies known, expected behaviors
   - Chaos engineering discovers unknown failure modes
   - Chaos experiments run in production or production-like environments
3. How do you start with chaos engineering?
   - Start in staging
   - Keep the blast radius small
   - State a clear hypothesis
   - Define monitoring and abort conditions up front
4. Which experiments should you run first?
   - Instance termination
   - Network latency
   - Dependency failures
   - Start with the most critical services
5. What is a game day?
   - A scheduled chaos exercise
   - Team-wide participation
   - Practice for incident response
   - A chance to learn and improve
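
Two of the starting points above, a small blast radius and explicit abort conditions, can be sketched in Python as follows. All names here are hypothetical: `inject` stands in for whatever fault-injection call is in use, and `error_rate` for a query against the monitoring system.

```python
import math
import random
from typing import Callable, Sequence


def pick_targets(instances: Sequence[str], max_fraction: float,
                 rng: random.Random) -> list[str]:
    """Choose at most `max_fraction` of the instances, but at least one."""
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    if not instances:
        return []
    count = max(1, math.floor(len(instances) * max_fraction))
    return rng.sample(list(instances), count)


def run_with_abort(targets: Sequence[str],
                   inject: Callable[[str], None],
                   error_rate: Callable[[], float],
                   abort_threshold: float) -> list[str]:
    """Inject failure one target at a time; stop once the error rate is too high.

    Returns the targets that were actually hit.
    """
    hit = []
    for target in targets:
        if error_rate() > abort_threshold:
            break  # abort condition reached: stop expanding the experiment
        inject(target)
        hit.append(target)
    return hit


if __name__ == "__main__":
    fleet = [f"order-service-{i}" for i in range(10)]
    targets = pick_targets(fleet, max_fraction=0.2, rng=random.Random(1))
    hit = run_with_abort(targets, inject=lambda t: print("terminating", t),
                         error_rate=lambda: 0.01, abort_threshold=0.05)
    print(len(hit), "instance(s) affected")
```

Capping the target set before the run and re-checking the error rate before each injection keeps a bad hypothesis from turning into an outage.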
