News Aggregator (like Google News)¶
Quick Reference Guide for System Design Interviews
Problem Statement¶
Design a news aggregation service that crawls news from multiple sources, deduplicates similar stories, ranks them by relevance, and presents personalized feeds to users.
Requirements¶
Functional Requirements¶
- Aggregate news from 10,000+ sources
- Deduplicate similar articles
- Categorize articles (politics, sports, tech, etc.)
- Personalized news feed
- Trending/breaking news section
- Search articles
Non-Functional Requirements¶
- Freshness: New articles within 15 minutes
- Availability: 99.99%
- Scale: 10M articles/day, 100M users
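The stated scale targets translate into a rough write-load and storage estimate. A back-of-envelope sketch (the 3x peak factor and ~10 KB metadata size per article are assumptions, not from this guide):

```python
# Rough capacity estimate from the stated targets.
# Peak factor and per-article size are illustrative assumptions.
ARTICLES_PER_DAY = 10_000_000
SECONDS_PER_DAY = 86_400

avg_article_writes_per_sec = ARTICLES_PER_DAY / SECONDS_PER_DAY   # ~116/s
peak_article_writes_per_sec = avg_article_writes_per_sec * 3      # assumed 3x peak

# Assume ~10 KB of metadata per article (title, summary, entities;
# full content stored separately in blob storage).
bytes_per_article = 10 * 1024
storage_per_year_tb = ARTICLES_PER_DAY * 365 * bytes_per_article / 1024**4

print(f"avg writes/s: {avg_article_writes_per_sec:.0f}")        # ~116
print(f"metadata storage/year: {storage_per_year_tb:.1f} TB")   # ~34 TB
```

Even at peak, article ingestion is modest; the hard part of the scale is the 100M-user read path, which pushes the design toward precomputed or cached feeds.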
High-Level Architecture¶
Crawler Service¶
Content Deduplication¶
Categorization & NLP¶
Ranking & Personalization¶
Trending Detection¶
Data Models¶
-- Sources (news outlets)
CREATE TABLE sources (
    source_id UUID PRIMARY KEY,
    name VARCHAR(255),
    domain VARCHAR(255) UNIQUE,
    feed_url VARCHAR(500),
    quality_score DECIMAL(3,2),  -- 0.0 to 1.0
    category VARCHAR(50),
    crawl_interval INT,          -- minutes
    last_crawled TIMESTAMP,
    is_active BOOLEAN DEFAULT TRUE
);
-- Articles
CREATE TABLE articles (
    article_id UUID PRIMARY KEY,
    source_id UUID REFERENCES sources(source_id),
    cluster_id UUID,             -- Story cluster
    url VARCHAR(2000) UNIQUE,
    title VARCHAR(500),
    summary TEXT,
    content TEXT,
    author VARCHAR(200),
    image_url VARCHAR(500),
    published_at TIMESTAMP,
    crawled_at TIMESTAMP,
    categories VARCHAR(100)[],
    entities JSONB,
    simhash BIGINT               -- For dedup
);

-- PostgreSQL (implied by UUID/JSONB/array types) does not allow INDEX
-- clauses inside CREATE TABLE; indexes are created separately.
CREATE INDEX idx_published ON articles (published_at DESC);
CREATE INDEX idx_cluster ON articles (cluster_id);
CREATE INDEX idx_simhash ON articles (simhash);
-- User reading history
CREATE TABLE user_reads (
    user_id UUID,
    article_id UUID,
    read_at TIMESTAMP,
    time_spent_ms INT,
    clicked_from VARCHAR(50),    -- 'feed', 'search', 'trending'
    PRIMARY KEY (user_id, article_id)
);
-- User preferences
CREATE TABLE user_preferences (
    user_id UUID PRIMARY KEY,
    preferred_categories VARCHAR(50)[],
    followed_topics VARCHAR(100)[],
    blocked_sources UUID[]
);
Interview Discussion Points¶
- How do you crawl 10,000+ sources efficiently?
  - Distributed crawlers
  - Priority-based scheduling
  - Respect rate limits
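Priority-based scheduling can be sketched with a min-heap keyed by each source's next due time, derived from its `crawl_interval`. A minimal sketch (class and method names are illustrative; a real system adds robots.txt handling, per-domain rate limiting, and distributed work sharding):

```python
import heapq
import time

class CrawlScheduler:
    """Min-heap of sources ordered by next due crawl time.

    High-priority outlets get a short interval and therefore surface
    at the top of the heap more often.
    """
    def __init__(self):
        self._heap = []  # entries: (next_due_ts, source_id, interval_sec)

    def add_source(self, source_id, interval_sec, now=None):
        now = time.time() if now is None else now
        heapq.heappush(self._heap, (now, source_id, interval_sec))

    def next_due(self, now=None):
        """Pop the most overdue source and reschedule it; None if nothing is due."""
        now = time.time() if now is None else now
        if not self._heap or self._heap[0][0] > now:
            return None
        due, source_id, interval = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (due + interval, source_id, interval))
        return source_id

sched = CrawlScheduler()
sched.add_source("top-outlet", interval_sec=300, now=0)   # crawl every 5 min
sched.add_source("small-blog", interval_sec=3600, now=0)  # crawl hourly
```

Workers repeatedly call `next_due()` and sleep briefly when it returns `None`; sharding the heap by domain hash distributes the load across crawler nodes.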
- How do you deduplicate articles?
  - SimHash/MinHash for near-duplicates
  - Cluster similar stories together
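The `simhash` column in the articles table supports this: two articles are near-duplicates when their 64-bit fingerprints differ in only a few bits. A minimal SimHash sketch over word tokens (the token hash function and the distance threshold are illustrative choices):

```python
import hashlib

def simhash64(text: str) -> int:
    """Minimal 64-bit SimHash over whitespace tokens (illustrative, not production)."""
    v = [0] * 64
    for token in text.lower().split():
        # Per-token 64-bit hash; each bit votes +1/-1 on the fingerprint bit.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(64):
            v[i] += 1 if (h >> i) & 1 else -1
    out = 0
    for i in range(64):
        if v[i] > 0:
            out |= 1 << i
    return out

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def near_duplicate(a: str, b: str, max_dist: int = 3) -> bool:
    # A threshold of ~3 differing bits is a common starting point, tuned empirically.
    return hamming(simhash64(a), simhash64(b)) <= max_dist
```

At scale, candidate pairs are found without all-pairs comparison by splitting the 64-bit fingerprint into bands and indexing each band (the indexed `simhash` column supports exact-band lookups); near-duplicates then share a `cluster_id`.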
- How do you detect trending topics?
  - Sliding window counts
  - Spike detection vs baseline
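The two bullets combine naturally: keep per-topic counts in fixed time windows, and flag a topic when the current window spikes above its historical baseline. A sketch (the spike factor, history length, and minimum count are illustrative thresholds):

```python
from collections import deque

class TrendDetector:
    """Sliding-window topic counter with a simple spike-vs-baseline test.

    Keeps per-topic counts for the last `history` closed windows; a topic
    is trending when the current window exceeds `factor` times the baseline
    average (thresholds illustrative, tuned in practice).
    """
    def __init__(self, history: int = 24, factor: float = 3.0, min_count: int = 10):
        self.history = history
        self.factor = factor
        self.min_count = min_count
        self.windows = {}   # topic -> deque of past window counts
        self.current = {}   # topic -> count in the open window

    def record(self, topic: str):
        self.current[topic] = self.current.get(topic, 0) + 1

    def roll_window(self):
        """Close the current window and push its counts into history."""
        for topic in set(self.windows) | set(self.current):
            dq = self.windows.setdefault(topic, deque(maxlen=self.history))
            dq.append(self.current.get(topic, 0))
        self.current = {}

    def is_trending(self, topic: str) -> bool:
        count = self.current.get(topic, 0)
        if count < self.min_count:
            return False
        past = self.windows.get(topic)
        if not past:
            return True  # brand-new topic with high volume
        baseline = sum(past) / len(past)
        return count >= self.factor * max(baseline, 1.0)
```

The `min_count` floor prevents rare topics from "trending" off a tiny baseline; at scale the counts would live in a stream processor or Redis rather than in-process dicts.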
- How do you personalize the feed?
  - User reading history
  - Category preferences
  - Collaborative filtering
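The first two signals can be combined in a simple content-based score; a sketch (weights and affinity values are illustrative, and the collaborative-filtering term is omitted here — in practice it would be another weighted signal):

```python
def personalization_score(article_categories, user_category_affinity,
                          source_quality, w_affinity=0.7, w_quality=0.3):
    """Blend category affinity (from reading history) with source quality.

    Weights are illustrative; production systems learn them and add
    collaborative-filtering signals from similar users.
    """
    if article_categories:
        affinity = max(user_category_affinity.get(c, 0.0) for c in article_categories)
    else:
        affinity = 0.0
    return w_affinity * affinity + w_quality * source_quality

# user_category_affinity would be derived from the user_reads table,
# e.g. the share of recent reads per category (hypothetical values below).
affinity = {"tech": 0.8, "sports": 0.1}
score = personalization_score(["tech"], affinity, source_quality=0.9)  # 0.83
```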
- How do you ensure freshness?
  - Frequent crawling of top sources
  - Push notifications from news APIs
  - Time decay in ranking
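Time decay in ranking is often implemented as an exponential half-life applied to the base relevance score; a sketch (the 6-hour half-life is an illustrative choice — breaking news might use a much shorter one):

```python
def decayed_score(base_score: float, age_hours: float,
                  half_life_hours: float = 6.0) -> float:
    """Exponential time decay: the score halves every `half_life_hours`."""
    return base_score * 0.5 ** (age_hours / half_life_hours)

decayed_score(1.0, age_hours=0)    # 1.0  (just published)
decayed_score(1.0, age_hours=6)    # 0.5  (one half-life old)
decayed_score(1.0, age_hours=12)   # 0.25
```

Because the decay depends only on age, it can be applied at query time on top of a precomputed base score rather than rewriting stored scores as articles age.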
- How do you handle fake news?
  - Source quality scoring
  - Cross-reference multiple sources
  - Flag low-authority sources
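Cross-referencing falls out of the story clustering: a cluster covered by many independent, high-quality sources is more credible than one reported by a single low-authority outlet. A sketch using the `quality_score` from the sources table (thresholds and labels are illustrative):

```python
def cluster_credibility(articles, quality_by_source,
                        min_sources=2, min_avg_quality=0.5):
    """Flag a story cluster as low-confidence when it is covered by few
    independent sources or only by low-quality ones (thresholds illustrative).

    `articles` is a list of (article_id, source_id) pairs in one cluster;
    `quality_by_source` maps source_id -> quality_score (0.0 to 1.0).
    """
    sources = {source_id for _, source_id in articles}
    if not sources:
        return "low_confidence"
    avg_quality = sum(quality_by_source.get(s, 0.0) for s in sources) / len(sources)
    if len(sources) < min_sources or avg_quality < min_avg_quality:
        return "low_confidence"
    return "corroborated"
```

A low-confidence label would typically demote the cluster in ranking or attach a warning, rather than remove it outright.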