Webhook Delivery System¶
Problem Statement¶
Design a webhook delivery system that reliably delivers event notifications to customer endpoints. The system should guarantee at-least-once delivery, handle failures gracefully, and scale to millions of webhooks per day.
Requirements¶
Functional Requirements¶
- Deliver webhooks for various event types
- Support customer endpoint configuration
- Guarantee at-least-once delivery
- Retry failed deliveries with exponential backoff
- Support webhook signing for security
- Provide delivery logs and debugging tools
- Allow customers to filter events
- Support webhook replay/resend
Non-Functional Requirements¶
- Reliability: At-least-once delivery guarantee
- Latency: < 30 seconds for initial delivery attempt
- Scalability: Handle millions of webhooks/day
- Ordering: Best-effort ordering per resource
- Observability: Complete delivery audit trail
High-Level Architecture¶
Core Components¶
1. Event Processor¶
- Consumes platform events
- Filters events per customer subscription
- Enriches event with additional data
- Routes to delivery queue
2. Webhook Config Service¶
- Manages endpoint configurations
- Stores event subscriptions
- Handles webhook secrets
- Manages enabled/disabled state
3. Delivery Workers¶
- Process delivery queue
- Make HTTP requests to endpoints
- Handle timeouts and errors
- Schedule retries
4. Retry Scheduler¶
- Manages retry backoff
- Re-enqueues failed webhooks
- Tracks retry attempts
- Handles dead letter queue
5. Delivery Log Service¶
- Records all delivery attempts
- Stores request/response details
- Enables debugging and replay
- Provides delivery analytics
Data Models¶
Webhook Endpoint Configuration¶
CREATE TABLE webhook_endpoints (
id UUID PRIMARY KEY,
customer_id UUID NOT NULL,
-- Endpoint details
url VARCHAR(2048) NOT NULL,
description TEXT,
-- Security
secret VARCHAR(255) NOT NULL, -- For signing
api_version VARCHAR(20), -- API version for payload format
-- Subscriptions
enabled_events TEXT[] NOT NULL, -- ['payment.succeeded', 'invoice.*']
-- Status
status VARCHAR(20) DEFAULT 'enabled', -- enabled, disabled, deleted
-- Settings
connect_timeout_ms INT DEFAULT 5000,
read_timeout_ms INT DEFAULT 30000,
-- Metadata
metadata JSONB,
created_at TIMESTAMP NOT NULL,
updated_at TIMESTAMP NOT NULL,
FOREIGN KEY (customer_id) REFERENCES customers(id)
);
CREATE INDEX idx_webhook_endpoints_customer ON webhook_endpoints(customer_id, status);
Webhook Event¶
CREATE TABLE webhook_events (
id UUID PRIMARY KEY,
customer_id UUID NOT NULL,
endpoint_id UUID NOT NULL,
-- Event details
event_type VARCHAR(100) NOT NULL, -- payment.succeeded
event_id UUID NOT NULL, -- Original event ID
payload JSONB NOT NULL,
api_version VARCHAR(20),
-- Delivery status
status VARCHAR(20) NOT NULL, -- pending, delivered, failed, dead
attempt_count INT DEFAULT 0,
max_attempts INT DEFAULT 5,
-- Scheduling
created_at TIMESTAMP NOT NULL,
scheduled_at TIMESTAMP NOT NULL, -- Next attempt time
delivered_at TIMESTAMP,
failed_at TIMESTAMP,
-- Idempotency
idempotency_key VARCHAR(255) UNIQUE,
FOREIGN KEY (endpoint_id) REFERENCES webhook_endpoints(id)
);
CREATE INDEX idx_webhook_events_status ON webhook_events(status, scheduled_at);
CREATE INDEX idx_webhook_events_customer ON webhook_events(customer_id, created_at DESC);
Delivery Attempt Log¶
CREATE TABLE webhook_delivery_attempts (
id UUID PRIMARY KEY,
event_id UUID NOT NULL,
attempt_number INT NOT NULL,
-- Request
request_url VARCHAR(2048),
request_headers JSONB,
request_body TEXT,
-- Response
response_status INT,
response_headers JSONB,
response_body TEXT, -- First 10KB
-- Timing
started_at TIMESTAMP NOT NULL,
completed_at TIMESTAMP,
duration_ms INT,
-- Result
success BOOLEAN NOT NULL,
error_type VARCHAR(50), -- timeout, connection_error, http_error
error_message TEXT,
FOREIGN KEY (event_id) REFERENCES webhook_events(id)
);
CREATE INDEX idx_delivery_attempts_event ON webhook_delivery_attempts(event_id, attempt_number);
Webhook Payload¶
Standard Payload Format¶
{
"id": "evt_1234567890",
"object": "event",
"api_version": "2024-01-01",
"created": 1640000000,
"type": "payment.succeeded",
"livemode": true,
"pending_webhooks": 2,
"request": {
"id": "req_abc123",
"idempotency_key": "key_xyz789"
},
"data": {
"object": {
"id": "pay_123",
"object": "payment",
"amount": 10000,
"currency": "usd",
"status": "succeeded",
"customer": "cus_456"
},
"previous_attributes": {
"status": "pending"
}
}
}
Signature Generation¶
public class WebhookSigner {
public static String sign(String payload, String secret, long timestamp) {
String signedPayload = timestamp + "." + payload;
Mac mac = Mac.getInstance("HmacSHA256");
SecretKeySpec keySpec = new SecretKeySpec(secret.getBytes(), "HmacSHA256");
mac.init(keySpec);
byte[] hash = mac.doFinal(signedPayload.getBytes(UTF_8));
String signature = "v1=" + bytesToHex(hash);
return signature;
}
public static String buildHeader(String payload, String secret) {
long timestamp = System.currentTimeMillis() / 1000;
String signature = sign(payload, secret, timestamp);
return "t=" + timestamp + ",v1=" + signature;
}
}
// HTTP Headers sent:
// Stripe-Signature: t=1640000000,v1=abc123def456...
// Or custom:
// X-Webhook-Signature: t=1640000000,v1=abc123def456...
Delivery Flow¶
Happy Path¶
Retry Flow¶
public class RetryStrategy {
private static final int[] RETRY_DELAYS_SECONDS = {
60, // 1 minute
300, // 5 minutes
1800, // 30 minutes
7200, // 2 hours
18000, // 5 hours
36000, // 10 hours
36000 // 10 hours
};
public Instant getNextRetryTime(int attemptCount) {
if (attemptCount >= RETRY_DELAYS_SECONDS.length) {
return null; // No more retries
}
int delay = RETRY_DELAYS_SECONDS[attemptCount];
// Add jitter (±10%)
delay = delay + (int)(delay * (Math.random() * 0.2 - 0.1));
return Instant.now().plusSeconds(delay);
}
}
Delivery Worker¶
Worker Implementation¶
@Service
public class WebhookDeliveryWorker {
private final WebhookEventRepository eventRepository;
private final DeliveryAttemptRepository attemptRepository;
private final HttpClient httpClient;
private final RetryStrategy retryStrategy;
@Scheduled(fixedDelay = 100) // Poll every 100ms
public void processDeliveryQueue() {
List<WebhookEvent> events = eventRepository.findPendingEvents(
Instant.now(),
100 // Batch size
);
for (WebhookEvent event : events) {
processEvent(event);
}
}
private void processEvent(WebhookEvent event) {
WebhookEndpoint endpoint = endpointRepository.findById(event.getEndpointId());
if (endpoint.getStatus() != EndpointStatus.ENABLED) {
event.setStatus(EventStatus.SKIPPED);
eventRepository.save(event);
return;
}
DeliveryResult result = attemptDelivery(event, endpoint);
if (result.isSuccess()) {
handleSuccess(event, result);
} else {
handleFailure(event, result);
}
}
private DeliveryResult attemptDelivery(WebhookEvent event, WebhookEndpoint endpoint) {
String payload = buildPayload(event);
String signature = WebhookSigner.buildHeader(payload, endpoint.getSecret());
DeliveryAttempt attempt = DeliveryAttempt.builder()
.eventId(event.getId())
.attemptNumber(event.getAttemptCount() + 1)
.requestUrl(endpoint.getUrl())
.requestBody(payload)
.startedAt(Instant.now())
.build();
try {
HttpResponse response = httpClient.post(endpoint.getUrl())
.header("Content-Type", "application/json")
.header("X-Webhook-Signature", signature)
.header("X-Webhook-ID", event.getId().toString())
.body(payload)
.connectTimeout(endpoint.getConnectTimeoutMs())
.readTimeout(endpoint.getReadTimeoutMs())
.execute();
attempt.setResponseStatus(response.getStatusCode());
attempt.setResponseBody(truncate(response.getBody(), 10240));
attempt.setCompletedAt(Instant.now());
attempt.setSuccess(response.getStatusCode() >= 200 && response.getStatusCode() < 300);
if (!attempt.isSuccess()) {
attempt.setErrorType("http_error");
attempt.setErrorMessage("HTTP " + response.getStatusCode());
}
return DeliveryResult.fromAttempt(attempt);
} catch (ConnectTimeoutException e) {
attempt.setErrorType("connect_timeout");
attempt.setErrorMessage(e.getMessage());
attempt.setSuccess(false);
return DeliveryResult.failed(attempt, e);
} catch (ReadTimeoutException e) {
attempt.setErrorType("read_timeout");
attempt.setErrorMessage(e.getMessage());
attempt.setSuccess(false);
return DeliveryResult.failed(attempt, e);
} catch (Exception e) {
attempt.setErrorType("connection_error");
attempt.setErrorMessage(e.getMessage());
attempt.setSuccess(false);
return DeliveryResult.failed(attempt, e);
} finally {
attemptRepository.save(attempt);
}
}
private void handleSuccess(WebhookEvent event, DeliveryResult result) {
event.setStatus(EventStatus.DELIVERED);
event.setDeliveredAt(Instant.now());
event.setAttemptCount(event.getAttemptCount() + 1);
eventRepository.save(event);
}
private void handleFailure(WebhookEvent event, DeliveryResult result) {
event.setAttemptCount(event.getAttemptCount() + 1);
Instant nextRetry = retryStrategy.getNextRetryTime(event.getAttemptCount());
if (nextRetry != null) {
event.setStatus(EventStatus.PENDING);
event.setScheduledAt(nextRetry);
} else {
event.setStatus(EventStatus.DEAD);
event.setFailedAt(Instant.now());
// Move to dead letter queue
deadLetterService.enqueue(event);
}
eventRepository.save(event);
}
}
Endpoint Health Management¶
Circuit Breaker¶
public class EndpointCircuitBreaker {
private static final int FAILURE_THRESHOLD = 10;
private static final Duration OPEN_DURATION = Duration.ofMinutes(30);
@Cacheable("endpoint-health")
public EndpointHealth getHealth(UUID endpointId) {
// Check recent delivery attempts
int recentFailures = attemptRepository.countRecentFailures(
endpointId,
Instant.now().minus(Duration.ofHours(1))
);
if (recentFailures >= FAILURE_THRESHOLD) {
return EndpointHealth.UNHEALTHY;
}
return EndpointHealth.HEALTHY;
}
// When circuit is open:
// - Queue webhooks but delay delivery
// - Periodically test with single webhook
// - Close circuit when test succeeds
}
Automatic Endpoint Disabling¶
public void checkEndpointHealth(WebhookEndpoint endpoint) {
// Get failure rate over last 24 hours
DeliveryStats stats = statsService.getStats(endpoint.getId(), Duration.ofHours(24));
if (stats.getFailureRate() > 0.95 && stats.getTotalAttempts() > 100) {
// 95%+ failure rate with significant volume
endpoint.setStatus(EndpointStatus.DISABLED);
endpoint.setDisabledReason("High failure rate: " + stats.getFailureRate());
// Notify customer
notificationService.sendEndpointDisabledNotification(endpoint);
endpointRepository.save(endpoint);
}
}
Event Filtering & Routing¶
Subscription Matching¶
public class EventRouter {
public List<WebhookEndpoint> findMatchingEndpoints(PlatformEvent event) {
List<WebhookEndpoint> endpoints = endpointRepository
.findByCustomerId(event.getCustomerId());
return endpoints.stream()
.filter(e -> e.getStatus() == EndpointStatus.ENABLED)
.filter(e -> matchesEventType(e.getEnabledEvents(), event.getType()))
.collect(toList());
}
private boolean matchesEventType(List<String> patterns, String eventType) {
for (String pattern : patterns) {
if (pattern.equals("*")) {
return true; // All events
}
if (pattern.endsWith(".*")) {
// Wildcard: payment.* matches payment.succeeded, payment.failed
String prefix = pattern.substring(0, pattern.length() - 1);
if (eventType.startsWith(prefix)) {
return true;
}
}
if (pattern.equals(eventType)) {
return true;
}
}
return false;
}
}
Ordering Guarantees¶
Best-Effort Ordering¶
Customer-Facing Features¶
Webhook Logs API¶
// GET /v1/webhooks/events?limit=10
{
"object": "list",
"data": [
{
"id": "evt_123",
"type": "payment.succeeded",
"created": 1640000000,
"status": "delivered",
"endpoint_id": "we_456",
"attempt_count": 1,
"next_retry_at": null,
"last_attempt": {
"status": 200,
"response_time_ms": 150,
"attempted_at": 1640000001
}
}
]
}
Webhook Test & Replay¶
// Test endpoint with sample payload
@PostMapping("/v1/webhook_endpoints/{id}/test")
public TestResult testEndpoint(@PathVariable UUID id) {
WebhookEndpoint endpoint = endpointRepository.findById(id);
// Create test event
WebhookEvent testEvent = WebhookEvent.builder()
.endpointId(id)
.eventType("test.webhook")
.payload(createTestPayload())
.build();
DeliveryResult result = deliveryService.attemptDelivery(testEvent, endpoint);
return TestResult.builder()
.success(result.isSuccess())
.statusCode(result.getStatusCode())
.responseBody(result.getResponseBody())
.durationMs(result.getDurationMs())
.build();
}
// Replay failed webhook
@PostMapping("/v1/webhook_events/{id}/retry")
public WebhookEvent retryEvent(@PathVariable UUID id) {
WebhookEvent event = eventRepository.findById(id);
event.setStatus(EventStatus.PENDING);
event.setScheduledAt(Instant.now());
return eventRepository.save(event);
}
Scalability¶
Horizontal Scaling¶
Batching for High-Volume Customers¶
// For customers with many endpoints or high event volume
public void processBatch(List<WebhookEvent> events) {
// Group by endpoint URL (allows connection reuse)
Map<String, List<WebhookEvent>> byEndpoint = events.stream()
.collect(groupingBy(e -> e.getEndpoint().getUrl()));
for (Map.Entry<String, List<WebhookEvent>> entry : byEndpoint.entrySet()) {
try (HttpConnection conn = connectionPool.get(entry.getKey())) {
for (WebhookEvent event : entry.getValue()) {
deliverWithConnection(event, conn);
}
}
}
}
Monitoring¶
Key Metrics¶
| Metric | Description | Alert Threshold |
|---|---|---|
| Delivery success rate | % successful first attempts | < 95% |
| P99 delivery latency | Time from event to delivery | > 30s |
| Queue depth | Pending webhooks | > 10K |
| Retry rate | % of events needing retry | > 10% |
| Dead letter rate | % of events failing all retries | > 1% |
Dashboard¶
Technology Choices¶
| Component | Technology Options |
|---|---|
| Event Bus | Kafka, AWS SNS |
| Delivery Queue | Kafka, SQS, RabbitMQ |
| Database | PostgreSQL |
| HTTP Client | OkHttp, Apache HttpClient |
| Scheduling | Kafka, SQS delayed messages |
Interview Discussion Points¶
- How do you guarantee at-least-once delivery?
-
Persistent queue, ack after success, retries with backoff
-
How do you handle slow/unresponsive endpoints?
-
Timeouts, circuit breaker, endpoint health monitoring
-
How do you ensure ordering?
-
Kafka partitioning, best-effort ordering, sequence numbers
-
How do you handle malicious endpoints?
-
IP blocklist, rate limiting outbound, resource limits
-
How do you scale for millions of webhooks?
-
Horizontal scaling, partitioning, connection pooling
-
How do you debug delivery failures?
- Complete delivery logs, request/response capture, replay feature