
Observability


Three Pillars of Observability


Metrics

Metric Types
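The core metric types (counter, gauge, histogram) can be sketched in plain Java. This is a minimal illustration, not a metrics library; a real service would use a client such as Micrometer or the Prometheus Java client:

```java
import java.util.concurrent.atomic.AtomicLong;

// Counter: monotonically increasing (e.g. requests served, bytes sent).
class Counter {
    private final AtomicLong value = new AtomicLong();
    void inc() { value.incrementAndGet(); }
    long get() { return value.get(); }
}

// Gauge: a value that can go up or down (e.g. in-flight requests, queue depth).
class Gauge {
    private final AtomicLong value = new AtomicLong();
    void set(long v) { value.set(v); }
    long get() { return value.get(); }
}

// Histogram: observations counted into cumulative buckets (e.g. latency).
class Histogram {
    static final double[] BOUNDS = {0.1, 0.5, 1.0, Double.POSITIVE_INFINITY};
    final long[] counts = new long[BOUNDS.length];
    void observe(double v) {
        for (int i = 0; i < BOUNDS.length; i++) {
            if (v <= BOUNDS[i]) counts[i]++;  // cumulative, as in Prometheus
        }
    }
}
```

Counters answer rate questions, gauges answer "right now" questions, and histograms let you compute percentiles without storing every sample.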

RED Method (Request-driven services)
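For a request-driven service, the RED signals (Rate, Errors, Duration) can be tracked roughly as below. A plain-Java sketch with hypothetical names; real systems derive rate and percentiles from exported counters and histograms rather than raw samples:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class RedMetrics {
    private long requests;                                     // Rate: count over a window
    private long errors;                                       // Errors: failed requests
    private final List<Double> durations = new ArrayList<>();  // Duration: latencies (s)

    void record(boolean failed, double seconds) {
        requests++;
        if (failed) errors++;
        durations.add(seconds);
    }

    double errorRatio() {
        return requests == 0 ? 0.0 : (double) errors / requests;
    }

    // Naive percentile over raw samples; production systems use histograms.
    double percentile(double q) {
        List<Double> sorted = new ArrayList<>(durations);
        Collections.sort(sorted);
        int idx = (int) Math.ceil(q * sorted.size()) - 1;
        return sorted.get(Math.max(idx, 0));
    }
}
```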

USE Method (Resource-focused)
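USE is applied per resource. As an illustration (plain Java, using a hypothetical thread pool as the resource): utilization is the fraction of capacity in use, saturation is the amount of work queued behind it, and errors are failures of the resource itself:

```java
class PoolUse {
    final int poolSize;   // capacity of the resource
    int active;           // threads currently busy
    int queued;           // tasks waiting for a thread
    long errors;          // e.g. rejected executions

    PoolUse(int poolSize) { this.poolSize = poolSize; }

    double utilization() { return (double) active / poolSize; }  // U: fraction busy
    int saturation()     { return queued; }                      // S: queued work
    long errorCount()    { return errors; }                      // E: resource errors
}
```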

Four Golden Signals (Google SRE)


Logging

Structured Logging

// Unstructured (bad)
logger.info("User " + userId + " placed order " + orderId + " for $" + amount);

// Structured (good)
logger.info("Order placed",
    kv("user_id", userId),
    kv("order_id", orderId),
    kv("amount", amount),
    kv("currency", "USD")
);

// JSON output
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO",
  "message": "Order placed",
  "user_id": "usr_123",
  "order_id": "ord_456",
  "amount": 99.99,
  "currency": "USD",
  "service": "order-service",
  "trace_id": "abc123",
  "span_id": "def456"
}

Log Levels

Correlation IDs

// MDC (Mapped Diagnostic Context)
public class CorrelationFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) resp;

        String traceId = request.getHeader("X-Trace-Id");
        if (traceId == null) {
            traceId = UUID.randomUUID().toString();
        }

        MDC.put("trace_id", traceId);
        response.setHeader("X-Trace-Id", traceId);

        try {
            chain.doFilter(req, resp);
        } finally {
            MDC.clear();  // avoid leaking the id to the next request on this thread
        }
    }
}

// All logs automatically include trace_id
// logback.xml pattern includes %X{trace_id}
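As a concrete example of that pattern (a sketch; adapt to your encoder), `%X{trace_id}` pulls the value the filter placed into the MDC into every log line:

```xml
<!-- logback.xml: each line carries the MDC trace_id set by CorrelationFilter -->
<encoder>
  <pattern>%d{ISO8601} %-5level [%X{trace_id}] %logger{36} - %msg%n</pattern>
</encoder>
```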

Distributed Tracing

Trace Structure

OpenTelemetry

// Initialize tracer
Tracer tracer = GlobalOpenTelemetry.getTracer("order-service");

public Order getOrder(String orderId) {
    Span span = tracer.spanBuilder("getOrder")
        .setSpanKind(SpanKind.SERVER)
        .setAttribute("order.id", orderId)
        .startSpan();

    try (Scope scope = span.makeCurrent()) {
        // Your business logic
        Order order = orderRepository.findById(orderId);

        span.setAttribute("order.status", order.getStatus());
        span.setAttribute("order.total", order.getTotal());

        return order;
    } catch (Exception e) {
        span.recordException(e);
        span.setStatus(StatusCode.ERROR, e.getMessage());
        throw e;
    } finally {
        span.end();
    }
}

// Propagate context to downstream services
public void callPaymentService(Order order) {
    Span span = tracer.spanBuilder("callPaymentService")
        .setSpanKind(SpanKind.CLIENT)
        .startSpan();

    try (Scope scope = span.makeCurrent()) {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://payment-service/charge"))
            .header("traceparent", getTraceparent())  // W3C Trace Context
            .POST(...)
            .build();
        // ...
    } finally {
        span.end();
    }
}

Alerting

Alert Design

# Prometheus alerting rule
groups:
  - name: api-alerts
    rules:
      # Symptom-based alert (what users experience)
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook: "https://wiki/runbooks/high-error-rate"

      # Latency alert
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency > 1s"

      # Saturation alert
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
          node_memory_MemTotal_bytes > 0.9
        for: 10m
        labels:
          severity: warning

Alert Best Practices


Dashboards

Dashboard Design

Service Overview Dashboard


Tools Ecosystem

Observability Stack


Common Interview Questions

  1. Three pillars of observability?
     • Metrics: numeric, aggregated (the "what")
     • Logs: textual events (the "why")
     • Traces: request flow (the "where")

  2. RED vs USE methods?
     • RED: request-driven services (Rate, Errors, Duration)
     • USE: resource-focused (Utilization, Saturation, Errors)

  3. What is distributed tracing?
     • Tracking requests across service boundaries
     • Traces are made up of spans
     • Relies on context propagation

  4. Good alerting practices?
     • Alert on symptoms
     • Make alerts actionable, with runbooks
     • Use appropriate severity levels

  5. Structured vs unstructured logging?
     • Structured: JSON, queryable, consistent fields
     • Unstructured: free text, hard to parse