Observability
Three Pillars of Observability
Metrics
Metric Types
RED Method (Request-driven services)
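RED stands for Rate, Errors, and Duration. As a rough illustration, the sketch below keeps those three numbers in memory for a single endpoint. The class `RedMetrics` and its method names are hypothetical and not part of any library; a production service would normally export counters and histograms through a metrics library instead.

```java
import java.util.concurrent.atomic.LongAdder;

/** Minimal in-memory RED recorder for one endpoint (illustrative only). */
class RedMetrics {
    private final LongAdder requests = new LongAdder();           // Rate: total requests seen
    private final LongAdder errors = new LongAdder();             // Errors: failed requests
    private final LongAdder totalDurationNanos = new LongAdder(); // Duration: summed latency

    /** Record one finished request. */
    void record(long durationNanos, boolean failed) {
        requests.increment();
        if (failed) {
            errors.increment();
        }
        totalDurationNanos.add(durationNanos);
    }

    /** Requests per second, given the length of the observation window. */
    double ratePerSecond(double windowSeconds) {
        return requests.sum() / windowSeconds;
    }

    /** Fraction of requests that failed; 0.0 when nothing was recorded. */
    double errorRatio() {
        long total = requests.sum();
        return total == 0 ? 0.0 : (double) errors.sum() / total;
    }

    /** Mean latency in milliseconds; 0.0 when nothing was recorded. */
    double meanDurationMillis() {
        long total = requests.sum();
        return total == 0 ? 0.0 : totalDurationNanos.sum() / 1_000_000.0 / total;
    }

    public static void main(String[] args) {
        RedMetrics metrics = new RedMetrics();
        metrics.record(10_000_000L, false);
        metrics.record(30_000_000L, true);
        System.out.println("error ratio: " + metrics.errorRatio());
        System.out.println("mean ms: " + metrics.meanDurationMillis());
    }
}
```

A mean hides tail latency, so duration is usually tracked as a histogram and queried by percentile rather than averaged like this.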
USE Method (Resource-focused)
Four Golden Signals (Google SRE)
Logging
Structured Logging
// Unstructured (bad): values are baked into free text and must be regex-parsed later
logger.info("User " + userId + " placed order " + orderId + " for $" + amount);

// Structured (good): every value is a named, queryable field
// kv(...) is assumed to be StructuredArguments.kv from logstash-logback-encoder
logger.info("Order placed",
    kv("user_id", userId),
    kv("order_id", orderId),
    kv("amount", amount),
    kv("currency", "USD")
);

// JSON output
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO",
  "message": "Order placed",
  "user_id": "usr_123",
  "order_id": "ord_456",
  "amount": 99.99,
  "currency": "USD",
  "service": "order-service",
  "trace_id": "abc123",
  "span_id": "def456"
}
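To make the mapping from key-value pairs to a JSON log line concrete, here is a minimal standard-library sketch. `StructuredLog.toJson` is a hypothetical helper written for this page; a real service would leave this to its logging encoder. It emits only `level`, `message`, and the supplied fields, and does not handle nested objects or non-finite numbers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative JSON log-line builder; not a replacement for a real encoder. */
class StructuredLog {

    /** Serialises level, message and flat fields into one JSON object, in insertion order. */
    static String toJson(String level, String message, Map<String, Object> fields) {
        Map<String, Object> event = new LinkedHashMap<>();
        event.put("level", level);
        event.put("message", message);
        event.putAll(fields);

        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, Object> entry : event.entrySet()) {
            if (sb.length() > 1) {
                sb.append(',');
            }
            sb.append('"').append(escape(entry.getKey())).append("\":");
            Object value = entry.getValue();
            if (value == null || value instanceof Number || value instanceof Boolean) {
                sb.append(value); // numbers, booleans and null are written unquoted
            } else {
                sb.append('"').append(escape(String.valueOf(value))).append('"');
            }
        }
        return sb.append('}').toString();
    }

    /** Escapes quotes, backslashes and control characters so one event stays on one line. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("user_id", "usr_123");
        fields.put("amount", 99.99);
        System.out.println(toJson("INFO", "Order placed", fields));
    }
}
```

Escaping newlines matters beyond validity: it keeps user-supplied values from injecting what looks like a second log event.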
Log Levels
Correlation IDs
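// MDC (Mapped Diagnostic Context)
import java.io.IOException;
import java.util.UUID;
import org.slf4j.MDC;
// Servlet types come from jakarta.servlet.* (javax.servlet.* on older containers)

public class CorrelationFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        // Cast to the HTTP types so headers can be read and written
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) resp;

        // Reuse the caller's trace ID, or generate one at the edge
        String traceId = request.getHeader("X-Trace-Id");
        if (traceId == null || traceId.isEmpty()) {
            traceId = UUID.randomUUID().toString();
        }
        MDC.put("trace_id", traceId);
        response.setHeader("X-Trace-Id", traceId);
        try {
            chain.doFilter(req, resp);
        } finally {
            // Pooled threads are reused, so always remove the ID after the request
            MDC.remove("trace_id");
        }
    }
}
// All logs automatically include trace_id
// logback.xml pattern includes %X{trace_id}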
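The cleanup in the `finally` block matters because MDC stores its values per thread, and servlet containers reuse threads across requests. The sketch below imitates that with a plain `ThreadLocal` and a single-thread pool; `MdcLeakDemo` and `traceIdSeenBySecondRequest` are hypothetical names used only for this illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Shows how a per-thread value leaks into the next request when it is not cleared. */
class MdcLeakDemo {
    // Stand-in for MDC: one value per thread
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    /**
     * Runs two "requests" on the same pooled thread and returns the trace ID
     * that the second request observes before setting its own.
     */
    static String traceIdSeenBySecondRequest(boolean cleanUp) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Runnable firstRequest = () -> {
                TRACE_ID.set("trace-1");
                try {
                    // ... handle the first request ...
                } finally {
                    if (cleanUp) {
                        TRACE_ID.remove();
                    }
                }
            };
            pool.submit(firstRequest).get();

            Callable<String> secondRequest = TRACE_ID::get;
            return pool.submit(secondRequest).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("without cleanup: " + traceIdSeenBySecondRequest(false));
        System.out.println("with cleanup:    " + traceIdSeenBySecondRequest(true));
    }
}
```

Without cleanup, the second request inherits the first request's ID, which is exactly the kind of mislabelled log line that makes correlation IDs untrustworthy.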
Distributed Tracing
Trace Structure
OpenTelemetry
// Initialize tracer
Tracer tracer = GlobalOpenTelemetry.getTracer("order-service");

public Order getOrder(String orderId) {
    Span span = tracer.spanBuilder("getOrder")
        .setSpanKind(SpanKind.SERVER)
        .setAttribute("order.id", orderId)
        .startSpan();
    try (Scope scope = span.makeCurrent()) {
        // Your business logic
        Order order = orderRepository.findById(orderId);
        span.setAttribute("order.status", order.getStatus());
        span.setAttribute("order.total", order.getTotal());
        return order;
    } catch (Exception e) {
        span.recordException(e);
        span.setStatus(StatusCode.ERROR, e.getMessage());
        throw e;
    } finally {
        span.end();
    }
}
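// Propagate context to downstream services
public void callPaymentService(Order order) {
    Span span = tracer.spanBuilder("callPaymentService")
        .setSpanKind(SpanKind.CLIENT)
        .startSpan();
    try (Scope scope = span.makeCurrent()) {
        HttpRequest.Builder builder = HttpRequest.newBuilder()
            .uri(URI.create("http://payment-service/charge"))
            .POST(...);
        // Inject the current context as headers. With the W3C Trace Context
        // propagator configured, this writes the traceparent header.
        GlobalOpenTelemetry.getPropagators().getTextMapPropagator()
            .inject(Context.current(), builder, HttpRequest.Builder::header);
        HttpRequest request = builder.build();
        // ...
    } finally {
        span.end();
    }
}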
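The `traceparent` header defined by W3C Trace Context has the form `version-traceid-parentid-flags`: a 2-digit version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and 2 hex digits of flags whose lowest bit means "sampled". The following sketch parses and rebuilds version `00` headers only. The `Traceparent` class is a hypothetical illustration; in practice the OpenTelemetry propagator does this work.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for version-00 W3C traceparent headers. */
class Traceparent {
    private static final Pattern FORMAT =
        Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String traceId;
    final String parentSpanId;
    final boolean sampled;

    Traceparent(String traceId, String parentSpanId, boolean sampled) {
        this.traceId = traceId;
        this.parentSpanId = parentSpanId;
        this.sampled = sampled;
    }

    /** Returns the parsed header, or null if it is missing or malformed. */
    static Traceparent parse(String header) {
        Matcher m = FORMAT.matcher(header == null ? "" : header.trim());
        if (!m.matches()) {
            return null;
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        // All-zero IDs are invalid according to the specification
        if (traceId.matches("0+") || spanId.matches("0+")) {
            return null;
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) != 0;
        return new Traceparent(traceId, spanId, sampled);
    }

    /** Serialises back to header form; only the sampled flag is preserved. */
    String toHeader() {
        return "00-" + traceId + "-" + parentSpanId + "-" + (sampled ? "01" : "00");
    }

    public static void main(String[] args) {
        Traceparent tp = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(tp.traceId + " sampled=" + tp.sampled);
    }
}
```

Every service in the call chain keeps the trace ID and substitutes its own span ID as the parent ID, which is how spans from different services are stitched into one trace.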
Alerting
Alert Design
# Prometheus alerting rule
groups:
  - name: api-alerts
    rules:
      # Symptom-based alert (what users experience)
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          # $value is a ratio (0.01 = 1%), so format it as a percentage
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook: "https://wiki/runbooks/high-error-rate"
      # Latency alert
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency > 1s"
      # Saturation alert
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
          node_memory_MemTotal_bytes > 0.9
        for: 10m
        labels:
          severity: warning
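The latency rule relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw observations. The sketch below is a simplified rendering of that idea for classic buckets: find the bucket containing the target rank, then interpolate linearly inside it. `HistogramQuantile.estimate` is a hypothetical helper, and it omits details of the real function such as input validation and native histograms.

```java
/** Simplified quantile estimate over cumulative histogram buckets (illustrative only). */
class HistogramQuantile {

    /**
     * @param q                quantile in [0, 1]
     * @param upperBounds      bucket upper bounds ("le"), ascending, last one +Infinity
     * @param cumulativeCounts observations less than or equal to each upper bound
     * @return estimated value, or NaN when there are too few buckets or no observations
     */
    static double estimate(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        if (n < 2 || cumulativeCounts.length != n) {
            return Double.NaN;
        }
        double total = cumulativeCounts[n - 1];
        if (total == 0) {
            return Double.NaN;
        }
        double rank = q * total;

        // Find the first bucket whose cumulative count reaches the rank
        int b = 0;
        while (b < n - 1 && cumulativeCounts[b] < rank) {
            b++;
        }
        // Rank falls in the +Inf bucket: the best available answer is the
        // highest finite bound
        if (b == n - 1) {
            return upperBounds[n - 2];
        }
        double lower = (b == 0) ? 0.0 : upperBounds[b - 1];
        double countBelow = (b == 0) ? 0.0 : cumulativeCounts[b - 1];
        double inBucket = cumulativeCounts[b] - countBelow;
        if (inBucket == 0) {
            return lower;
        }
        // Assume observations are spread evenly within the bucket
        return lower + (upperBounds[b] - lower) * ((rank - countBelow) / inBucket);
    }

    public static void main(String[] args) {
        double[] bounds = {0.1, 0.5, 1.0, Double.POSITIVE_INFINITY};
        double[] counts = {50, 90, 99, 100};
        System.out.println(estimate(0.75, bounds, counts));
    }
}
```

Because of the interpolation, the accuracy of a P99 alert depends on bucket boundaries: a threshold such as 1s is only meaningful if a bucket bound sits at or near 1s.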
Alert Best Practices
Dashboards
Dashboard Design
Tools Ecosystem
Common Interview Questions
- Three pillars of observability?
  - Metrics: Numeric, aggregated (what)
  - Logs: Textual events (why)
  - Traces: Request flow (where)
- RED vs USE methods?
  - RED: Request-driven (Rate, Errors, Duration)
  - USE: Resource-focused (Utilization, Saturation, Errors)
- What is distributed tracing?
  - Tracking requests across services
  - Traces contain spans
  - Uses context propagation
- Good alerting practices?
  - Alert on symptoms
  - Actionable with runbooks
  - Appropriate severity levels
- Structured vs unstructured logging?
  - Structured: JSON, queryable, consistent
  - Unstructured: Text, hard to parse