News Aggregator (like Google News)

Quick Reference Guide for System Design Interviews


Problem Statement

Design a news aggregation service that crawls news from multiple sources, deduplicates similar stories, ranks them by relevance, and presents personalized feeds to users.


Requirements

Functional Requirements

  • Aggregate news from 10,000+ sources
  • Deduplicate similar articles
  • Categorize articles (politics, sports, tech, etc.)
  • Personalized news feed
  • Trending/breaking news section
  • Search articles

Non-Functional Requirements

  • Freshness: New articles within 15 minutes
  • Availability: 99.99%
  • Scale: 10M articles/day, 100M users
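
The scale numbers above translate into a useful back-of-envelope estimate. A minimal sketch (the ~50 KB average article size is an assumption for illustration, not from the requirements):

```python
# Back-of-envelope capacity estimate from the requirements above.
ARTICLES_PER_DAY = 10_000_000
SECONDS_PER_DAY = 86_400
AVG_ARTICLE_BYTES = 50 * 1024  # assumed ~50 KB of text + metadata per article

ingest_rate = ARTICLES_PER_DAY / SECONDS_PER_DAY  # write throughput
daily_storage_gb = ARTICLES_PER_DAY * AVG_ARTICLE_BYTES / 1024**3

print(f"Ingest rate:   {ingest_rate:.0f} articles/sec")
print(f"Daily storage: {daily_storage_gb:.0f} GB/day")
```

Roughly 116 articles/sec of sustained writes and about half a terabyte of raw content per day, before replication.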

High-Level Architecture


Crawler Service


Content Deduplication


Categorization & NLP


Ranking & Personalization


Trending Detection


Data Models

-- Sources (news outlets)
CREATE TABLE sources (
    source_id       UUID PRIMARY KEY,
    name            VARCHAR(255),
    domain          VARCHAR(255) UNIQUE,
    feed_url        VARCHAR(500),
    quality_score   DECIMAL(3,2),  -- 0.0 to 1.0
    category        VARCHAR(50),
    crawl_interval  INT,  -- minutes
    last_crawled    TIMESTAMP,
    is_active       BOOLEAN DEFAULT TRUE
);

-- Articles
-- Articles
CREATE TABLE articles (
    article_id      UUID PRIMARY KEY,
    source_id       UUID REFERENCES sources(source_id),
    cluster_id      UUID,  -- Story cluster
    url             VARCHAR(2000) UNIQUE,
    title           VARCHAR(500),
    summary         TEXT,
    content         TEXT,
    author          VARCHAR(200),
    image_url       VARCHAR(500),
    published_at    TIMESTAMP,
    crawled_at      TIMESTAMP,
    categories      VARCHAR(100)[],
    entities        JSONB,
    simhash         BIGINT  -- For dedup
);

CREATE INDEX idx_published ON articles (published_at DESC);
CREATE INDEX idx_cluster   ON articles (cluster_id);
CREATE INDEX idx_simhash   ON articles (simhash);

-- User reading history
CREATE TABLE user_reads (
    user_id         UUID,
    article_id      UUID,
    read_at         TIMESTAMP,
    time_spent_ms   INT,
    clicked_from    VARCHAR(50),  -- 'feed', 'search', 'trending'

    PRIMARY KEY (user_id, article_id)
);

-- User preferences
CREATE TABLE user_preferences (
    user_id         UUID PRIMARY KEY,
    preferred_categories VARCHAR(50)[],
    followed_topics VARCHAR(100)[],
    blocked_sources UUID[]
);

Interview Discussion Points

  1. How do you crawl 10,000+ sources efficiently?
     • Distributed crawlers, partitioned by domain
     • Priority-based scheduling (busy, high-quality sources first)
     • Respect per-site rate limits and robots.txt
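
The priority-based scheduling above can be sketched with a min-heap keyed by each source's next due time; the `CrawlScheduler` name and its interface are illustrative assumptions, not a prescribed design:

```python
import heapq
import time

class CrawlScheduler:
    """Priority-based crawl scheduler sketch: sources sit in a min-heap
    keyed by next due time, so frequently-updated sources surface first
    while per-source intervals enforce polite rate limits."""

    def __init__(self):
        self._heap = []  # entries: (next_crawl_ts, source_id, interval_sec)

    def add_source(self, source_id, crawl_interval_sec, now=None):
        now = now if now is not None else time.time()
        heapq.heappush(self._heap, (now, source_id, crawl_interval_sec))

    def next_due(self, now=None):
        """Return the next source that is due (rescheduling it), or None."""
        now = now if now is not None else time.time()
        if self._heap and self._heap[0][0] <= now:
            due_ts, source_id, interval = heapq.heappop(self._heap)
            heapq.heappush(self._heap, (due_ts + interval, source_id, interval))
            return source_id
        return None
```

In a distributed deployment each crawler worker would pull from a shared queue (e.g. sharded by domain hash) rather than a local heap, but the scheduling logic is the same.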

  2. How do you deduplicate articles?
     • SimHash/MinHash for near-duplicate detection
     • Cluster similar stories together under one cluster_id
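
SimHash, mentioned above, maps each article to a short fingerprint such that near-duplicate texts land at small Hamming distance. A minimal sketch (whitespace tokenization and the distance threshold are simplifying assumptions; production systems typically hash shingles/n-grams):

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash over whitespace tokens: sum per-bit +1/-1 votes from
    each token's hash, then keep the sign of each bit position."""
    vector = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

# Articles whose fingerprints are within a small Hamming distance
# (e.g. <= 3 bits) are treated as the same story cluster.
```

The 64-bit fingerprint stored in the `simhash` column of the articles table supports exactly this comparison.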

  3. How do you detect trending topics?
     • Sliding-window counts per topic/entity
     • Spike detection against a historical baseline
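
The sliding-window spike detection above can be sketched as follows; the window size and spike ratio are assumptions to be tuned:

```python
from collections import deque

class TrendDetector:
    """Spike-detection sketch: compare a topic's current mention count
    against the average of recent time buckets (the baseline); flag as
    trending when the ratio exceeds a threshold."""

    def __init__(self, window_size=6, spike_ratio=3.0):
        self.windows = deque(maxlen=window_size)  # counts per time bucket
        self.spike_ratio = spike_ratio

    def add_window(self, count):
        self.windows.append(count)

    def is_trending(self, current_count):
        if not self.windows:
            return False
        baseline = sum(self.windows) / len(self.windows)
        # max(..., 1.0) avoids flagging topics with a near-zero baseline
        # on a single stray mention.
        return current_count >= self.spike_ratio * max(baseline, 1.0)
```

One detector instance per tracked topic/entity, fed from the categorization pipeline's extracted entities, is a common arrangement.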

  4. How do you personalize the feed?
     • User reading history (articles read, time spent)
     • Category and topic preferences
     • Collaborative filtering across similar users
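
A simple scoring function ties the personalization signals above to the schema: source quality as the base, a boost for preferred categories, and exponential time decay. The field names mirror the tables earlier; the multipliers and half-life are illustrative assumptions:

```python
import math
import time

def score_article(article, user, now=None, half_life_hours=6.0):
    """Personalized ranking sketch: quality * preference boost * freshness.
    `article` and `user` are dicts shaped like the tables above."""
    now = now if now is not None else time.time()
    age_hours = max(0.0, (now - article["published_at"]) / 3600)
    # Exponential decay: score halves every `half_life_hours`.
    freshness = math.exp(-math.log(2) * age_hours / half_life_hours)

    if article["source_id"] in user.get("blocked_sources", []):
        return 0.0
    score = article.get("quality_score", 0.5)
    if set(article["categories"]) & set(user["preferred_categories"]):
        score *= 1.5  # assumed boost for category match
    return score * freshness
```

Collaborative filtering would add a further term learned from `user_reads` of similar users; that model lives outside this sketch.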

  5. How do you ensure freshness?
     • Crawl top sources more frequently (adaptive intervals)
     • Push notifications/webhooks from news APIs where available
     • Time decay in ranking so stale stories drop off

  6. How do you handle fake news?
     • Source quality scoring
     • Cross-reference stories across multiple independent sources
     • Flag or downrank low-authority, single-source stories
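
The cross-referencing idea above can be sketched as a per-cluster credibility score: a story corroborated by several quality sources is trusted more than a single-source story from a low-quality outlet. The thresholds are assumptions:

```python
def cluster_credibility(source_quality_scores, min_sources=3):
    """Credibility sketch for a story cluster. Inputs are the quality_score
    values (0.0-1.0) of the sources that published the story. Returns
    (credibility, flagged): flagged clusters are downranked or reviewed."""
    if not source_quality_scores:
        return 0.0, True
    avg_quality = sum(source_quality_scores) / len(source_quality_scores)
    # Corroboration saturates at 1.0 once min_sources outlets report it.
    corroboration = min(1.0, len(source_quality_scores) / min_sources)
    credibility = avg_quality * corroboration
    flagged = credibility < 0.3  # assumed review threshold
    return credibility, flagged
```

This uses only the `quality_score` and `cluster_id` fields already in the data model, which is why clustering duplicates first matters for trust, not just for feed cleanliness.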