
Chaos Engineering


What is Chaos Engineering?

[Figure: Chaos Engineering Overview]


Chaos Engineering Process

[Figure: Chaos Experiment Process]


Types of Chaos Experiments

Infrastructure Failures

[Figure: Infrastructure Chaos]

Network Failures

[Figure: Network Chaos]

Application Failures

[Figure: Application Chaos]


Chaos Engineering Tools

Chaos Monkey (Netflix)

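# Chaos Monkey configuration (application.yml)
# Note: these keys follow Chaos Monkey for Spring Boot, a separate project
# inspired by Netflix's Chaos Monkey. It also requires the "chaos-monkey"
# Spring profile to be active.
chaos:
  monkey:
    enabled: true
    assaults:
      level: 5  # roughly 1 in 5 requests is attacked (1 = every request); allowed range depends on version
      latencyActive: true
      latencyRangeStart: 1000  # minimum injected latency, in ms
      latencyRangeEnd: 3000    # maximum injected latency, in ms
      exceptionsActive: true
      killApplicationActive: false
    watcher:
      # Bean types whose public methods are eligible for assaults
      service: true
      controller: true
      repository: true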
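The effect of the level and latency range settings can be sketched in a few lines of Python. This is a simplified, deterministic model and not the library's implementation: it assumes level N means "attack one in every N requests", and the helper names `should_attack` and `pick_latency_ms` are hypothetical.

```python
import random


def should_attack(request_number: int, level: int) -> bool:
    """Select every `level`-th request (assumed meaning of the level setting)."""
    if level < 1:
        raise ValueError("level must be >= 1")
    return request_number % level == 0


def pick_latency_ms(range_start: int, range_end: int, rng: random.Random) -> int:
    """Pick an injected delay uniformly from [range_start, range_end] milliseconds."""
    return rng.randint(range_start, range_end)


rng = random.Random(42)  # seeded so the sketch is repeatable
attacked = [n for n in range(1, 21) if should_attack(n, level=5)]
delays = [pick_latency_ms(1000, 3000, rng) for _ in attacked]

print(attacked)  # [5, 10, 15, 20]
```

With level 5 and twenty requests, four requests are delayed, each by somewhere between one and three seconds.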

Litmus Chaos (Kubernetes)

# Litmus ChaosExperiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["delete", "list", "get"]
    image: litmuschaos/go-runner:latest
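    env:
      - name: TOTAL_CHAOS_DURATION  # total experiment time, in seconds
        value: "30"
      - name: CHAOS_INTERVAL        # delay between successive pod deletions, in seconds
        value: "10"
      - name: FORCE                 # "true" deletes pods without a graceful shutdown
        value: "false"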

---
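apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-service-chaos
spec:
  engineState: "active"  # set to "stop" to halt the experiment
  appinfo:
    appns: production              # namespace of the target application
    applabel: "app=order-service"  # label selector for the target pods
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Overrides the default set in the ChaosExperiment
            - name: TOTAL_CHAOS_DURATION
              value: "30"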
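To see how the duration and interval settings interact, here is a rough Python sketch of the resulting deletion schedule. It assumes pods are deleted at the start of each interval until the total duration elapses, which is a simplification of what the experiment runner does. The function `deletion_times` is hypothetical.

```python
from typing import List


def deletion_times(total_duration_s: int, interval_s: int) -> List[int]:
    """Return the offsets (in seconds) at which a pod deletion would start."""
    if total_duration_s <= 0 or interval_s <= 0:
        raise ValueError("duration and interval must be positive")
    return list(range(0, total_duration_s, interval_s))


# TOTAL_CHAOS_DURATION=30 and CHAOS_INTERVAL=10, as in the manifests above
schedule = deletion_times(30, 10)
print(schedule)  # [0, 10, 20]
```

Under these assumptions the target deployment sees three deletions in 30 seconds, so it needs enough replicas and a fast enough restart to keep serving traffic.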

Gremlin

# Gremlin CLI examples (illustrative; confirm flag names against your installed CLI version)

# CPU attack
gremlin attack cpu \
  --length 60 \
  --percent 80 \
  --target-tags "service=order-service,env=production"

# Latency attack
gremlin attack latency \
  --length 120 \
  --delay 500 \
  --target-tags "service=payment-service"

# Blackhole attack (network partition)
gremlin attack blackhole \
  --length 60 \
  --hostnames "database.internal" \
  --target-tags "service=order-service"

Chaos Toolkit

# experiment.yaml
title: "Verify order service handles database failure"
description: "When database is unavailable, orders should be queued"

steady-state-hypothesis:
  title: "Order processing is normal"
  probes:
    - type: probe
      name: "orders-processed-per-minute"
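      tolerance:
        type: range
        range: [90, 110]  # lower and upper bounds
        target: body      # compare the HTTP response body; assumes the endpoint returns a bare number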
      provider:
        type: http
        url: "http://metrics/orders-processed"

method:
  - type: action
    name: "terminate-database"
    provider:
      type: python
      module: chaosaws.rds.actions
      func: stop_db_instance
      arguments:
        db_instance_identifier: "orders-db"
    pauses:
      after: 60

  - type: probe
    name: "check-queue-size"
    provider:
      type: http
      url: "http://metrics/queue-size"

rollbacks:
  - type: action
    name: "restart-database"
    provider:
      type: python
      module: chaosaws.rds.actions
      func: start_db_instance
      arguments:
        db_instance_identifier: "orders-db"
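The experiment above follows a fixed flow: check the steady state, inject the failure, check again, then roll back. Here is a minimal Python sketch of that flow, assuming a single numeric probe. It does not use Chaos Toolkit's own code. `run_experiment`, `within_tolerance`, `FakeOrderSystem`, and the status strings are all hypothetical names.

```python
from typing import Callable, Dict, Tuple

Bounds = Tuple[float, float]


def within_tolerance(value: float, bounds: Bounds) -> bool:
    """Range tolerance: the probe value must lie between the two bounds."""
    low, high = bounds
    return low <= value <= high


def run_experiment(
    probe: Callable[[], float],
    action: Callable[[], None],
    rollback: Callable[[], None],
    bounds: Bounds,
) -> Dict[str, object]:
    """Steady-state check, fault injection, re-check, then rollback."""
    before = probe()
    if not within_tolerance(before, bounds):
        # Never inject faults into a system that is already unhealthy.
        return {"status": "aborted", "before": before, "after": None}
    try:
        action()
        after = probe()
    finally:
        rollback()  # always attempt to restore the system
    status = "passed" if within_tolerance(after, bounds) else "deviated"
    return {"status": status, "before": before, "after": after}


class FakeOrderSystem:
    """Stand-in for the real service: throughput drops when the database is down."""

    def __init__(self) -> None:
        self.db_up = True

    def orders_per_minute(self) -> float:
        return 100.0 if self.db_up else 40.0

    def stop_db(self) -> None:
        self.db_up = False

    def start_db(self) -> None:
        self.db_up = True


system = FakeOrderSystem()
result = run_experiment(
    system.orders_per_minute, system.stop_db, system.start_db, (90, 110)
)
print(result["status"])  # deviated
```

A "deviated" result is the useful outcome here: it shows that the fake system does not sustain order throughput when the database is down, which is the kind of weakness the experiment is meant to expose.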

Game Days

[Figure: Game Days]


Best Practices

[Figure: Chaos Engineering Best Practices]


Maturity Model

[Figure: Chaos Engineering Maturity]


Common Interview Questions

  1. What is chaos engineering?
     - Running controlled experiments to test a system's resilience
     - Building confidence in how the system behaves under failure
     - Discovering weaknesses proactively

  2. How does chaos engineering differ from testing?
     - Testing verifies known behaviors
     - Chaos engineering discovers unknown failure modes
     - Chaos experiments run in production-like environments

  3. How do you start with chaos engineering?
     - Start in staging
     - Keep the blast radius small
     - State a clear hypothesis
     - Set up monitoring and define abort conditions

  4. Which experiments should you run first?
     - Instance termination
     - Network latency
     - Dependency failures
     - Start with the most critical services

  5. What is a game day?
     - A scheduled chaos exercise
     - Team-wide participation
     - Practice for incident response
     - An opportunity to learn and improve
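Two of the starting points listed above, a small blast radius and explicit abort conditions, can be made concrete with a short Python sketch. It is illustrative only: `pick_targets`, `first_abort_step`, the instance names, and the 5% error-rate threshold are all assumptions, not part of any tool.

```python
import random
from typing import Iterable, List, Optional, Sequence


def pick_targets(
    instances: Sequence[str], blast_radius: float, rng: random.Random
) -> List[str]:
    """Choose a random subset covering at most `blast_radius` of the fleet (at least one)."""
    if not 0 < blast_radius <= 1:
        raise ValueError("blast_radius must be in (0, 1]")
    count = max(1, int(len(instances) * blast_radius))
    return rng.sample(list(instances), count)


def first_abort_step(
    error_rates: Iterable[float], abort_threshold: float
) -> Optional[int]:
    """Return the first monitoring step whose error rate breaches the threshold."""
    for step, rate in enumerate(error_rates):
        if rate > abort_threshold:
            return step
    return None  # the experiment ran to completion


instances = [f"order-service-{i}" for i in range(10)]
targets = pick_targets(instances, blast_radius=0.2, rng=random.Random(7))

# Error rates sampled once per monitoring step during the experiment
abort_step = first_abort_step([0.01, 0.02, 0.08, 0.03], abort_threshold=0.05)

print(len(targets))  # 2
print(abort_step)    # 2
```

With a 20% blast radius only two of the ten instances are affected, and the experiment stops at the third sample, where the error rate first exceeds the threshold.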
