
Webhook Delivery System

Problem Statement

Design a webhook delivery system that reliably delivers event notifications to customer endpoints. The system should guarantee at-least-once delivery, handle failures gracefully, and scale to millions of webhooks per day.


Requirements

Functional Requirements

  • Deliver webhooks for various event types
  • Support customer endpoint configuration
  • Guarantee at-least-once delivery
  • Retry failed deliveries with exponential backoff
  • Support webhook signing for security
  • Provide delivery logs and debugging tools
  • Allow customers to filter events
  • Support webhook replay/resend

Non-Functional Requirements

  • Reliability: At-least-once delivery guarantee
  • Latency: < 30 seconds for initial delivery attempt
  • Scalability: Handle millions of webhooks/day
  • Ordering: Best-effort ordering per resource
  • Observability: Complete delivery audit trail

High-Level Architecture

[Figure: Webhook Delivery System - High-Level Architecture]


Core Components

1. Event Processor

  • Consumes platform events
  • Filters events per customer subscription
  • Enriches event with additional data
  • Routes to delivery queue

2. Webhook Config Service

  • Manages endpoint configurations
  • Stores event subscriptions
  • Handles webhook secrets
  • Manages enabled/disabled state

3. Delivery Workers

  • Process delivery queue
  • Make HTTP requests to endpoints
  • Handle timeouts and errors
  • Schedule retries

4. Retry Scheduler

  • Manages retry backoff
  • Re-enqueues failed webhooks
  • Tracks retry attempts
  • Handles dead letter queue

5. Delivery Log Service

  • Records all delivery attempts
  • Stores request/response details
  • Enables debugging and replay
  • Provides delivery analytics

Data Models

Webhook Endpoint Configuration

CREATE TABLE webhook_endpoints (
    id                  UUID PRIMARY KEY,
    customer_id         UUID NOT NULL,

    -- Endpoint details
    url                 VARCHAR(2048) NOT NULL,
    description         TEXT,

    -- Security
    secret              VARCHAR(255) NOT NULL,      -- For signing
    api_version         VARCHAR(20),                -- API version for payload format

    -- Subscriptions
    enabled_events      TEXT[] NOT NULL,            -- ['payment.succeeded', 'invoice.*']

    -- Status
    status              VARCHAR(20) DEFAULT 'enabled',  -- enabled, disabled, deleted

    -- Settings
    connect_timeout_ms  INT DEFAULT 5000,
    read_timeout_ms     INT DEFAULT 30000,

    -- Metadata
    metadata            JSONB,
    created_at          TIMESTAMP NOT NULL,
    updated_at          TIMESTAMP NOT NULL,

    FOREIGN KEY (customer_id) REFERENCES customers(id)
);

CREATE INDEX idx_webhook_endpoints_customer ON webhook_endpoints(customer_id, status);

Webhook Event

CREATE TABLE webhook_events (
    id                  UUID PRIMARY KEY,
    customer_id         UUID NOT NULL,
    endpoint_id         UUID NOT NULL,

    -- Event details
    event_type          VARCHAR(100) NOT NULL,      -- payment.succeeded
    event_id            UUID NOT NULL,              -- Original event ID
    payload             JSONB NOT NULL,
    api_version         VARCHAR(20),

    -- Delivery status
    status              VARCHAR(20) NOT NULL,       -- pending, delivered, failed, dead, skipped
    attempt_count       INT DEFAULT 0,
    max_attempts        INT DEFAULT 5,

    -- Scheduling
    created_at          TIMESTAMP NOT NULL,
    scheduled_at        TIMESTAMP NOT NULL,         -- Next attempt time
    delivered_at        TIMESTAMP,
    failed_at           TIMESTAMP,

    -- Idempotency
    idempotency_key     VARCHAR(255) UNIQUE,

    FOREIGN KEY (endpoint_id) REFERENCES webhook_endpoints(id)
);

CREATE INDEX idx_webhook_events_status ON webhook_events(status, scheduled_at);
CREATE INDEX idx_webhook_events_customer ON webhook_events(customer_id, created_at DESC);
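
Because delivery is at-least-once, the same platform event can reach the event processor more than once. One way to use the UNIQUE idempotency_key column is to derive the key deterministically from the original event ID and the endpoint ID, so a duplicate insert is rejected instead of creating a second delivery. The sketch below is an assumption, not part of the original design: the key format and the class names IdempotencyKeys and DeliveryDeduplicator are hypothetical, and the in-memory set only stands in for the database constraint.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

final class IdempotencyKeys {
    /** Same event routed to the same endpoint always yields the same key (73 characters). */
    static String forDelivery(UUID eventId, UUID endpointId) {
        return eventId + ":" + endpointId;
    }
}

/** In-memory stand-in for the UNIQUE constraint on webhook_events.idempotency_key. */
final class DeliveryDeduplicator {
    private final Set<String> seenKeys = ConcurrentHashMap.newKeySet();

    /** Returns true the first time a key is offered and false for every duplicate. */
    boolean tryEnqueue(UUID eventId, UUID endpointId) {
        return seenKeys.add(IdempotencyKeys.forDelivery(eventId, endpointId));
    }
}
```

In a real deployment the same effect would come from catching the unique-constraint violation on insert (or an insert that ignores conflicts) rather than from an in-process set.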

Delivery Attempt Log

CREATE TABLE webhook_delivery_attempts (
    id                  UUID PRIMARY KEY,
    event_id            UUID NOT NULL,
    attempt_number      INT NOT NULL,

    -- Request
    request_url         VARCHAR(2048),
    request_headers     JSONB,
    request_body        TEXT,

    -- Response
    response_status     INT,
    response_headers    JSONB,
    response_body       TEXT,                       -- First 10KB

    -- Timing
    started_at          TIMESTAMP NOT NULL,
    completed_at        TIMESTAMP,
    duration_ms         INT,

    -- Result
    success             BOOLEAN NOT NULL,
    error_type          VARCHAR(50),                -- timeout, connection_error, http_error
    error_message       TEXT,

    FOREIGN KEY (event_id) REFERENCES webhook_events(id)
);

CREATE INDEX idx_delivery_attempts_event ON webhook_delivery_attempts(event_id, attempt_number);

Webhook Payload

Standard Payload Format

{
    "id": "evt_1234567890",
    "object": "event",
    "api_version": "2024-01-01",
    "created": 1640000000,
    "type": "payment.succeeded",
    "livemode": true,
    "pending_webhooks": 2,
    "request": {
        "id": "req_abc123",
        "idempotency_key": "key_xyz789"
    },
    "data": {
        "object": {
            "id": "pay_123",
            "object": "payment",
            "amount": 10000,
            "currency": "usd",
            "status": "succeeded",
            "customer": "cus_456"
        },
        "previous_attributes": {
            "status": "pending"
        }
    }
}

Signature Generation

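import java.nio.charset.StandardCharsets;
import java.security.InvalidKeyException;
import java.security.NoSuchAlgorithmException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class WebhookSigner {

    private static final String ALGORITHM = "HmacSHA256";

    // Returns the hex-encoded HMAC-SHA256 of "<timestamp>.<payload>".
    // The "v1=" scheme prefix is added once, in buildHeader.
    public static String sign(String payload, String secret, long timestamp) {
        String signedPayload = timestamp + "." + payload;
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), ALGORITHM));
            return bytesToHex(mac.doFinal(signedPayload.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException | InvalidKeyException e) {
            throw new IllegalStateException("Unable to compute webhook signature", e);
        }
    }

    // Builds the header value "t=<unix seconds>,v1=<signature>".
    public static String buildHeader(String payload, String secret) {
        long timestamp = System.currentTimeMillis() / 1000;
        return "t=" + timestamp + ",v1=" + sign(payload, secret, timestamp);
    }

    private static String bytesToHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}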

// HTTP Headers sent:
// Stripe-Signature: t=1640000000,v1=abc123def456...
// Or custom:
// X-Webhook-Signature: t=1640000000,v1=abc123def456...
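
On the receiving side, a customer would typically recompute the HMAC over "<timestamp>.<payload>" and compare it with the v1 value from the header. The following is a minimal sketch of that check, assuming the header format shown above. The class name WebhookVerifier and the 300-second tolerance are assumptions; the source does not specify a replay window.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Receiver-side check for a "t=<unix seconds>,v1=<hex HMAC>" header (hypothetical helper). */
final class WebhookVerifier {

    // Assumed replay window; not taken from the source.
    static final long TOLERANCE_SECONDS = 300;

    static boolean verify(String header, String payload, String secret, long nowSeconds) {
        String timestamp = null;
        String signature = null;
        for (String part : header.split(",")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length != 2) continue;
            if (kv[0].equals("t")) timestamp = kv[1];
            if (kv[0].equals("v1")) signature = kv[1];
        }
        if (timestamp == null || signature == null) return false;

        long sentAt;
        try {
            sentAt = Long.parseLong(timestamp);
        } catch (NumberFormatException e) {
            return false;
        }
        // Reject stale or future-dated requests to limit replay of captured payloads.
        if (Math.abs(nowSeconds - sentAt) > TOLERANCE_SECONDS) return false;

        String expected = hmacHex(timestamp + "." + payload, secret);
        // Constant-time comparison avoids leaking how many characters matched.
        return MessageDigest.isEqual(
            expected.getBytes(StandardCharsets.UTF_8),
            signature.getBytes(StandardCharsets.UTF_8));
    }

    static String hmacHex(String data, String secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            StringBuilder hex = new StringBuilder();
            for (byte b : mac.doFinal(data.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }
}
```

The check must run against the raw request body exactly as received. Re-serializing parsed JSON can change whitespace or key order and break the signature.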

Delivery Flow

Happy Path

[Figure: Webhook Delivery Flow]

Retry Flow

Retry Strategy

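import java.time.Instant;
import java.util.concurrent.ThreadLocalRandom;

public class RetryStrategy {

    // Delay before each retry; the schedule escalates and then caps at 10 hours
    private static final int[] RETRY_DELAYS_SECONDS = {
        60,         // 1 minute
        300,        // 5 minutes
        1800,       // 30 minutes
        7200,       // 2 hours
        18000,      // 5 hours
        36000,      // 10 hours
        36000       // 10 hours
    };

    // attemptCount is the number of attempts already made (1 after the first failure).
    // Returns the time of the next retry, or null once all retries are exhausted.
    public Instant getNextRetryTime(int attemptCount) {
        // The first failure maps to index 0, so the 1-minute delay is not skipped
        int retryIndex = Math.max(0, attemptCount - 1);
        if (retryIndex >= RETRY_DELAYS_SECONDS.length) {
            return null;  // No more retries
        }

        int delay = RETRY_DELAYS_SECONDS[retryIndex];
        // Add jitter (±10%) so retries to a recovering endpoint do not arrive in lockstep
        double jitter = ThreadLocalRandom.current().nextDouble(-0.1, 0.1);
        delay += (int) (delay * jitter);

        return Instant.now().plusSeconds(delay);
    }
}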
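
To see how long an event stays retryable, the delays in RETRY_DELAYS_SECONDS above can be summed: 60 + 300 + 1800 + 7200 + 18000 + 36000 + 36000 = 99,360 seconds, or about 27 hours 36 minutes before jitter. The sketch below computes the cumulative schedule. The class name RetrySchedule is hypothetical, and time spent inside each attempt (timeouts) is ignored.

```java
import java.time.Duration;

final class RetrySchedule {
    // Same base delays as RETRY_DELAYS_SECONDS, without jitter.
    static final int[] DELAYS_SECONDS = {60, 300, 1800, 7200, 18000, 36000, 36000};

    /** Time since the first failed attempt at which retry number retryNumber (1-based) fires. */
    static Duration elapsedAtRetry(int retryNumber) {
        if (retryNumber < 1 || retryNumber > DELAYS_SECONDS.length) {
            throw new IllegalArgumentException("retryNumber out of range: " + retryNumber);
        }
        long total = 0;
        for (int i = 0; i < retryNumber; i++) {
            total += DELAYS_SECONDS[i];
        }
        return Duration.ofSeconds(total);
    }

    /** Total window between the first failure and the last retry. */
    static Duration totalWindow() {
        return elapsedAtRetry(DELAYS_SECONDS.length);
    }
}
```

With ±10% jitter on every delay, the total window falls between roughly 24.8 and 30.4 hours.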

Delivery Worker

Worker Implementation

@Service
public class WebhookDeliveryWorker {

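    private final WebhookEventRepository eventRepository;
    private final WebhookEndpointRepository endpointRepository;
    private final DeliveryAttemptRepository attemptRepository;
    private final HttpClient httpClient;
    private final RetryStrategy retryStrategy;
    private final DeadLetterService deadLetterService;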

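    @Scheduled(fixedDelay = 100)  // Poll every 100ms
    public void processDeliveryQueue() {
        // With more than one worker, findPendingEvents must claim the rows it returns
        // (for example with SELECT ... FOR UPDATE SKIP LOCKED in PostgreSQL); otherwise
        // two workers can pick up and deliver the same event concurrently.
        List<WebhookEvent> events = eventRepository.findPendingEvents(
            Instant.now(),
            100  // Batch size
        );

        for (WebhookEvent event : events) {
            processEvent(event);
        }
    }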

    private void processEvent(WebhookEvent event) {
        WebhookEndpoint endpoint = endpointRepository.findById(event.getEndpointId());

        if (endpoint.getStatus() != EndpointStatus.ENABLED) {
            event.setStatus(EventStatus.SKIPPED);
            eventRepository.save(event);
            return;
        }

        DeliveryResult result = attemptDelivery(event, endpoint);

        if (result.isSuccess()) {
            handleSuccess(event, result);
        } else {
            handleFailure(event, result);
        }
    }

    private DeliveryResult attemptDelivery(WebhookEvent event, WebhookEndpoint endpoint) {
        String payload = buildPayload(event);
        String signature = WebhookSigner.buildHeader(payload, endpoint.getSecret());

        DeliveryAttempt attempt = DeliveryAttempt.builder()
            .eventId(event.getId())
            .attemptNumber(event.getAttemptCount() + 1)
            .requestUrl(endpoint.getUrl())
            .requestBody(payload)
            .startedAt(Instant.now())
            .build();

        try {
            HttpResponse response = httpClient.post(endpoint.getUrl())
                .header("Content-Type", "application/json")
                .header("X-Webhook-Signature", signature)
                .header("X-Webhook-ID", event.getId().toString())
                .body(payload)
                .connectTimeout(endpoint.getConnectTimeoutMs())
                .readTimeout(endpoint.getReadTimeoutMs())
                .execute();

            attempt.setResponseStatus(response.getStatusCode());
            attempt.setResponseBody(truncate(response.getBody(), 10240));
            attempt.setCompletedAt(Instant.now());
            attempt.setSuccess(response.getStatusCode() >= 200 && response.getStatusCode() < 300);

            if (!attempt.isSuccess()) {
                attempt.setErrorType("http_error");
                attempt.setErrorMessage("HTTP " + response.getStatusCode());
            }

            return DeliveryResult.fromAttempt(attempt);

        } catch (ConnectTimeoutException e) {
            attempt.setErrorType("connect_timeout");
            attempt.setErrorMessage(e.getMessage());
            attempt.setSuccess(false);
            return DeliveryResult.failed(attempt, e);

        } catch (ReadTimeoutException e) {
            attempt.setErrorType("read_timeout");
            attempt.setErrorMessage(e.getMessage());
            attempt.setSuccess(false);
            return DeliveryResult.failed(attempt, e);

        } catch (Exception e) {
            attempt.setErrorType("connection_error");
            attempt.setErrorMessage(e.getMessage());
            attempt.setSuccess(false);
            return DeliveryResult.failed(attempt, e);

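        } finally {
            // The failure paths never set completedAt, so stamp it before persisting
            if (attempt.getCompletedAt() == null) {
                attempt.setCompletedAt(Instant.now());
            }
            attemptRepository.save(attempt);
        }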
    }

    private void handleSuccess(WebhookEvent event, DeliveryResult result) {
        event.setStatus(EventStatus.DELIVERED);
        event.setDeliveredAt(Instant.now());
        event.setAttemptCount(event.getAttemptCount() + 1);
        eventRepository.save(event);
    }

    private void handleFailure(WebhookEvent event, DeliveryResult result) {
        event.setAttemptCount(event.getAttemptCount() + 1);

        Instant nextRetry = retryStrategy.getNextRetryTime(event.getAttemptCount());

        if (nextRetry != null) {
            event.setStatus(EventStatus.PENDING);
            event.setScheduledAt(nextRetry);
        } else {
            event.setStatus(EventStatus.DEAD);
            event.setFailedAt(Instant.now());
            // Move to dead letter queue
            deadLetterService.enqueue(event);
        }

        eventRepository.save(event);
    }
}

Endpoint Health Management

Circuit Breaker

public class EndpointCircuitBreaker {

    private static final int FAILURE_THRESHOLD = 10;
    private static final Duration OPEN_DURATION = Duration.ofMinutes(30);

    @Cacheable("endpoint-health")
    public EndpointHealth getHealth(UUID endpointId) {
        // Check recent delivery attempts
        int recentFailures = attemptRepository.countRecentFailures(
            endpointId,
            Instant.now().minus(Duration.ofHours(1))
        );

        if (recentFailures >= FAILURE_THRESHOLD) {
            return EndpointHealth.UNHEALTHY;
        }

        return EndpointHealth.HEALTHY;
    }

    // When circuit is open:
    // - Queue webhooks but delay delivery
    // - Periodically test with single webhook
    // - Close circuit when test succeeds
}
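
The open/test/close behavior described in the comments above can be sketched as a small state machine. This is an illustration under stated assumptions, not the original implementation: the class name SimpleCircuitBreaker is hypothetical, it counts consecutive failures rather than failures in the last hour, and the caller passes in the current time to keep it testable.

```java
import java.time.Duration;
import java.time.Instant;

final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration openDuration;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    SimpleCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    /** Returns true if a delivery may be attempted at the given time. */
    synchronized boolean allowRequest(Instant now) {
        if (state == State.OPEN && !now.isBefore(openedAt.plus(openDuration))) {
            state = State.HALF_OPEN;  // let exactly one probe through
            return true;
        }
        // While HALF_OPEN, further requests wait for the probe result.
        return state == State.CLOSED;
    }

    /** A successful delivery (including a successful probe) closes the circuit. */
    synchronized void recordSuccess() {
        state = State.CLOSED;
        consecutiveFailures = 0;
    }

    /** A failed probe, or too many failures in a row, opens the circuit. */
    synchronized void recordFailure(Instant now) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = now;
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A worker would call allowRequest before each delivery and, when it returns false, push the event's scheduled_at forward instead of counting the skipped delivery as a failed attempt.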

Automatic Endpoint Disabling

public void checkEndpointHealth(WebhookEndpoint endpoint) {
    // Get failure rate over last 24 hours
    DeliveryStats stats = statsService.getStats(endpoint.getId(), Duration.ofHours(24));

    if (stats.getFailureRate() > 0.95 && stats.getTotalAttempts() > 100) {
        // 95%+ failure rate with significant volume
        endpoint.setStatus(EndpointStatus.DISABLED);
        endpoint.setDisabledReason("High failure rate: " + stats.getFailureRate());

        // Notify customer
        notificationService.sendEndpointDisabledNotification(endpoint);

        endpointRepository.save(endpoint);
    }
}

Event Filtering & Routing

Subscription Matching

public class EventRouter {

    public List<WebhookEndpoint> findMatchingEndpoints(PlatformEvent event) {
        List<WebhookEndpoint> endpoints = endpointRepository
            .findByCustomerId(event.getCustomerId());

        return endpoints.stream()
            .filter(e -> e.getStatus() == EndpointStatus.ENABLED)
            .filter(e -> matchesEventType(e.getEnabledEvents(), event.getType()))
            .collect(toList());
    }

    private boolean matchesEventType(List<String> patterns, String eventType) {
        for (String pattern : patterns) {
            if (pattern.equals("*")) {
                return true;  // All events
            }
            if (pattern.endsWith(".*")) {
                // Wildcard: payment.* matches payment.succeeded, payment.failed
                String prefix = pattern.substring(0, pattern.length() - 1);
                if (eventType.startsWith(prefix)) {
                    return true;
                }
            }
            if (pattern.equals(eventType)) {
                return true;
            }
        }
        return false;
    }
}

Ordering Guarantees

Best-Effort Ordering

[Figure: Ordering Strategy]
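
A common way to get best-effort ordering per resource is to route every event for one resource to the same queue partition and attach a sequence number so receivers can discard stale updates. The sketch below illustrates both halves. The class names are hypothetical, String.hashCode only stands in for the queue's own partitioner, and the per-resource sequence number is an assumption that does not appear in the payload format above.

```java
import java.util.HashMap;
import java.util.Map;

final class OrderingKeys {
    /** Maps a resource ID to a partition in [0, partitionCount); same resource, same partition. */
    static int partitionFor(String resourceId, int partitionCount) {
        return Math.floorMod(resourceId.hashCode(), partitionCount);
    }
}

/** Receiver-side guard that ignores events older than the latest one already applied. */
final class SequenceTracker {
    private final Map<String, Long> lastSeen = new HashMap<>();

    /** Returns true and records the sequence if it is newer than anything seen for the resource. */
    synchronized boolean accept(String resourceId, long sequence) {
        Long previous = lastSeen.get(resourceId);
        if (previous != null && sequence <= previous) {
            return false;  // stale or duplicate event
        }
        lastSeen.put(resourceId, sequence);
        return true;
    }
}
```

Ordering remains best-effort because a retried event is rescheduled behind newer events for the same resource, which is why the receiver-side check matters.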


Customer-Facing Features

Webhook Logs API

// GET /v1/webhooks/events?limit=10
{
    "object": "list",
    "data": [
        {
            "id": "evt_123",
            "type": "payment.succeeded",
            "created": 1640000000,
            "status": "delivered",
            "endpoint_id": "we_456",
            "attempt_count": 1,
            "next_retry_at": null,
            "last_attempt": {
                "status": 200,
                "response_time_ms": 150,
                "attempted_at": 1640000001
            }
        }
    ]
}

Webhook Test & Replay

// Test endpoint with sample payload
@PostMapping("/v1/webhook_endpoints/{id}/test")
public TestResult testEndpoint(@PathVariable UUID id) {
    WebhookEndpoint endpoint = endpointRepository.findById(id);

    // Create test event
    WebhookEvent testEvent = WebhookEvent.builder()
        .endpointId(id)
        .eventType("test.webhook")
        .payload(createTestPayload())
        .build();

    DeliveryResult result = deliveryService.attemptDelivery(testEvent, endpoint);

    return TestResult.builder()
        .success(result.isSuccess())
        .statusCode(result.getStatusCode())
        .responseBody(result.getResponseBody())
        .durationMs(result.getDurationMs())
        .build();
}

// Replay failed webhook
@PostMapping("/v1/webhook_events/{id}/retry")
public WebhookEvent retryEvent(@PathVariable UUID id) {
    WebhookEvent event = eventRepository.findById(id);

    event.setStatus(EventStatus.PENDING);
    event.setScheduledAt(Instant.now());

    return eventRepository.save(event);
}

Scalability

Horizontal Scaling

[Figure: Scaling Strategy]

Batching for High-Volume Customers

// For customers with many endpoints or high event volume
public void processBatch(List<WebhookEvent> events) {
    // Group by endpoint URL (allows connection reuse)
    Map<String, List<WebhookEvent>> byEndpoint = events.stream()
        .collect(groupingBy(e -> e.getEndpoint().getUrl()));

    for (Map.Entry<String, List<WebhookEvent>> entry : byEndpoint.entrySet()) {
        try (HttpConnection conn = connectionPool.get(entry.getKey())) {
            for (WebhookEvent event : entry.getValue()) {
                deliverWithConnection(event, conn);
            }
        }
    }
}

Monitoring

Key Metrics

Metric                | Description                     | Alert Threshold
----------------------|---------------------------------|----------------
Delivery success rate | % successful first attempts     | < 95%
P99 delivery latency  | Time from event to delivery     | > 30s
Queue depth           | Pending webhooks                | > 10K
Retry rate            | % of events needing retry       | > 10%
Dead letter rate      | % of events failing all retries | > 1%

Dashboard

[Figure: Webhook Delivery Dashboard]


Technology Choices

Component      | Technology Options
---------------|----------------------------
Event Bus      | Kafka, AWS SNS
Delivery Queue | Kafka, SQS, RabbitMQ
Database       | PostgreSQL
HTTP Client    | OkHttp, Apache HttpClient
Scheduling     | Kafka, SQS delayed messages

Interview Discussion Points

  1. How do you guarantee at-least-once delivery?
     • Persistent queue, ack after success, retries with backoff

  2. How do you handle slow/unresponsive endpoints?
     • Timeouts, circuit breaker, endpoint health monitoring

  3. How do you ensure ordering?
     • Kafka partitioning, best-effort ordering, sequence numbers

  4. How do you handle malicious endpoints?
     • IP blocklist, rate limiting outbound, resource limits

  5. How do you scale for millions of webhooks?
     • Horizontal scaling, partitioning, connection pooling

  6. How do you debug delivery failures?
     • Complete delivery logs, request/response capture, replay feature
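
For the malicious-endpoint question, the IP blocklist is usually about stopping customer-supplied URLs from pointing delivery workers at internal services. A minimal sketch of such a check follows, using only java.net.InetAddress classification methods. The class name EndpointAddressPolicy is hypothetical and the set of blocked ranges is an assumption; a production list would likely be broader (for example IPv6 unique local addresses, which these methods do not cover).

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class EndpointAddressPolicy {

    /** True if a delivery worker should refuse to connect to this address. */
    static boolean isBlocked(InetAddress address) {
        return address.isLoopbackAddress()     // 127.0.0.0/8, ::1
            || address.isAnyLocalAddress()     // 0.0.0.0, ::
            || address.isLinkLocalAddress()    // 169.254.0.0/16 (cloud metadata), fe80::/10
            || address.isSiteLocalAddress()    // 10/8, 172.16/12, 192.168/16
            || address.isMulticastAddress();
    }

    /** Convenience overload for IP literals; literals are parsed without a DNS lookup. */
    static boolean isBlocked(String ipLiteral) {
        try {
            return isBlocked(InetAddress.getByName(ipLiteral));
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Not a valid IP literal: " + ipLiteral, e);
        }
    }
}
```

The check is only effective if it is applied to the address the worker actually connects to, after DNS resolution and after each redirect, since a hostname can resolve to a different address between validation and delivery.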