News Aggregator (like Google News)¶
Quick Reference Guide for System Design Interviews
Problem Statement¶
Design a news aggregation service that crawls news from multiple sources, deduplicates similar stories, ranks them by relevance, and presents personalized feeds to users.
Requirements¶
Functional Requirements¶
- Aggregate news from 10,000+ sources
- Deduplicate similar articles
- Categorize articles (politics, sports, tech, etc.)
- Personalized news feed
- Trending/breaking news section
- Search articles
Non-Functional Requirements¶
- Freshness: New articles within 15 minutes
- Availability: 99.99%
- Scale: 10M articles/day, 100M users
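The stated scale targets translate into a rough write-load and storage estimate. A back-of-envelope sketch (the 3x peak factor and ~10 KB metadata size per article are assumptions, not from this guide):

```python
# Rough capacity estimate from the stated targets.
# Peak factor and per-article size are illustrative assumptions.
ARTICLES_PER_DAY = 10_000_000
SECONDS_PER_DAY = 86_400

avg_article_writes_per_sec = ARTICLES_PER_DAY / SECONDS_PER_DAY   # ~116/s
peak_article_writes_per_sec = avg_article_writes_per_sec * 3      # assumed 3x peak

# Assume ~10 KB of metadata per article (title, summary, entities;
# full content stored separately in blob storage).
bytes_per_article = 10 * 1024
storage_per_year_tb = ARTICLES_PER_DAY * 365 * bytes_per_article / 1024**4

print(f"avg writes/s: {avg_article_writes_per_sec:.0f}")        # ~116
print(f"metadata storage/year: {storage_per_year_tb:.1f} TB")   # ~34 TB
```

Even at peak, article ingestion is modest; the hard part of the scale is the 100M-user read path, which pushes the design toward precomputed or cached feeds.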
High-Level Architecture¶
Crawler Service¶
Content Deduplication¶
Categorization & NLP¶
Ranking & Personalization¶
Trending Detection¶
Data Models¶
-- Sources (news outlets)
CREATE TABLE sources (
    source_id UUID PRIMARY KEY,
    name VARCHAR(255),
    domain VARCHAR(255) UNIQUE,
    feed_url VARCHAR(500),
    quality_score DECIMAL(3,2),  -- 0.0 to 1.0
    category VARCHAR(50),
    crawl_interval INT,          -- minutes
    last_crawled TIMESTAMP,
    is_active BOOLEAN DEFAULT TRUE
);
-- Articles
CREATE TABLE articles (
    article_id UUID PRIMARY KEY,
    source_id UUID REFERENCES sources(source_id),
    cluster_id UUID,             -- Story cluster
    url VARCHAR(2000) UNIQUE,
    title VARCHAR(500),
    summary TEXT,
    content TEXT,
    author VARCHAR(200),
    image_url VARCHAR(500),
    published_at TIMESTAMP,
    crawled_at TIMESTAMP,
    categories VARCHAR(100)[],
    entities JSONB,
    simhash BIGINT               -- For dedup
);

-- PostgreSQL (implied by UUID/JSONB/array types) does not allow INDEX
-- clauses inside CREATE TABLE; indexes are created separately.
CREATE INDEX idx_published ON articles (published_at DESC);
CREATE INDEX idx_cluster ON articles (cluster_id);
CREATE INDEX idx_simhash ON articles (simhash);
-- User reading history
CREATE TABLE user_reads (
    user_id UUID,
    article_id UUID,
    read_at TIMESTAMP,
    time_spent_ms INT,
    clicked_from VARCHAR(50),    -- 'feed', 'search', 'trending'
    PRIMARY KEY (user_id, article_id)
);
-- User preferences
CREATE TABLE user_preferences (
    user_id UUID PRIMARY KEY,
    preferred_categories VARCHAR(50)[],
    followed_topics VARCHAR(100)[],
    blocked_sources UUID[]
);
Interview Discussion Points¶
- How do you crawl 10,000+ sources efficiently?
  - Distributed crawlers
  - Priority-based scheduling
  - Respect rate limits
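Priority-based scheduling can be sketched with a min-heap keyed by each source's next due time, derived from its `crawl_interval`. A minimal sketch (class and method names are illustrative; a real system adds robots.txt handling, per-domain rate limiting, and distributed work sharding):

```python
import heapq
import time

class CrawlScheduler:
    """Min-heap of sources ordered by next due crawl time.

    High-priority outlets get a short interval and therefore surface
    at the top of the heap more often.
    """
    def __init__(self):
        self._heap = []  # entries: (next_due_ts, source_id, interval_sec)

    def add_source(self, source_id, interval_sec, now=None):
        now = time.time() if now is None else now
        heapq.heappush(self._heap, (now, source_id, interval_sec))

    def next_due(self, now=None):
        """Pop the most overdue source and reschedule it; None if nothing is due."""
        now = time.time() if now is None else now
        if not self._heap or self._heap[0][0] > now:
            return None
        due, source_id, interval = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (due + interval, source_id, interval))
        return source_id

sched = CrawlScheduler()
sched.add_source("top-outlet", interval_sec=300, now=0)   # crawl every 5 min
sched.add_source("small-blog", interval_sec=3600, now=0)  # crawl hourly
```

Workers repeatedly call `next_due()` and sleep briefly when it returns `None`; sharding the heap by domain hash distributes the load across crawler nodes.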
- How do you deduplicate articles?
  - SimHash/MinHash for near-duplicates
  - Cluster similar stories together
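The `simhash` column in the articles table supports this: two articles are near-duplicates when their 64-bit fingerprints differ in only a few bits. A minimal SimHash sketch over word tokens (the token hash function and the distance threshold are illustrative choices):

```python
import hashlib

def simhash64(text: str) -> int:
    """Minimal 64-bit SimHash over whitespace tokens (illustrative, not production)."""
    v = [0] * 64
    for token in text.lower().split():
        # Per-token 64-bit hash; each bit votes +1/-1 on the fingerprint bit.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(64):
            v[i] += 1 if (h >> i) & 1 else -1
    out = 0
    for i in range(64):
        if v[i] > 0:
            out |= 1 << i
    return out

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def near_duplicate(a: str, b: str, max_dist: int = 3) -> bool:
    # A threshold of ~3 differing bits is a common starting point, tuned empirically.
    return hamming(simhash64(a), simhash64(b)) <= max_dist
```

At scale, candidate pairs are found without all-pairs comparison by splitting the 64-bit fingerprint into bands and indexing each band (the indexed `simhash` column supports exact-band lookups); near-duplicates then share a `cluster_id`.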
- How do you detect trending topics?
  - Sliding window counts
  - Spike detection vs baseline
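The two bullets combine naturally: keep per-topic counts in fixed time windows, and flag a topic when the current window spikes above its historical baseline. A sketch (the spike factor, history length, and minimum count are illustrative thresholds):

```python
from collections import deque

class TrendDetector:
    """Sliding-window topic counter with a simple spike-vs-baseline test.

    Keeps per-topic counts for the last `history` closed windows; a topic
    is trending when the current window exceeds `factor` times the baseline
    average (thresholds illustrative, tuned in practice).
    """
    def __init__(self, history: int = 24, factor: float = 3.0, min_count: int = 10):
        self.history = history
        self.factor = factor
        self.min_count = min_count
        self.windows = {}   # topic -> deque of past window counts
        self.current = {}   # topic -> count in the open window

    def record(self, topic: str):
        self.current[topic] = self.current.get(topic, 0) + 1

    def roll_window(self):
        """Close the current window and push its counts into history."""
        for topic in set(self.windows) | set(self.current):
            dq = self.windows.setdefault(topic, deque(maxlen=self.history))
            dq.append(self.current.get(topic, 0))
        self.current = {}

    def is_trending(self, topic: str) -> bool:
        count = self.current.get(topic, 0)
        if count < self.min_count:
            return False
        past = self.windows.get(topic)
        if not past:
            return True  # brand-new topic with high volume
        baseline = sum(past) / len(past)
        return count >= self.factor * max(baseline, 1.0)
```

The `min_count` floor prevents rare topics from "trending" off a tiny baseline; at scale the counts would live in a stream processor or Redis rather than in-process dicts.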
- How do you personalize the feed?
  - User reading history
  - Category preferences
  - Collaborative filtering
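The first two signals can be combined in a simple content-based score; a sketch (weights and affinity values are illustrative, and the collaborative-filtering term is omitted here — in practice it would be another weighted signal):

```python
def personalization_score(article_categories, user_category_affinity,
                          source_quality, w_affinity=0.7, w_quality=0.3):
    """Blend category affinity (from reading history) with source quality.

    Weights are illustrative; production systems learn them and add
    collaborative-filtering signals from similar users.
    """
    if article_categories:
        affinity = max(user_category_affinity.get(c, 0.0) for c in article_categories)
    else:
        affinity = 0.0
    return w_affinity * affinity + w_quality * source_quality

# user_category_affinity would be derived from the user_reads table,
# e.g. the share of recent reads per category (hypothetical values below).
affinity = {"tech": 0.8, "sports": 0.1}
score = personalization_score(["tech"], affinity, source_quality=0.9)  # 0.83
```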
- How do you ensure freshness?
  - Frequent crawling of top sources
  - Push notifications from news APIs
  - Time decay in ranking
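Time decay in ranking is often implemented as an exponential half-life applied to the base relevance score; a sketch (the 6-hour half-life is an illustrative choice — breaking news might use a much shorter one):

```python
def decayed_score(base_score: float, age_hours: float,
                  half_life_hours: float = 6.0) -> float:
    """Exponential time decay: the score halves every `half_life_hours`."""
    return base_score * 0.5 ** (age_hours / half_life_hours)

decayed_score(1.0, age_hours=0)    # 1.0  (just published)
decayed_score(1.0, age_hours=6)    # 0.5  (one half-life old)
decayed_score(1.0, age_hours=12)   # 0.25
```

Because the decay depends only on age, it can be applied at query time on top of a precomputed base score rather than rewriting stored scores as articles age.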
- How do you handle fake news?
  - Source quality scoring
  - Cross-reference multiple sources
  - Flag low-authority sources
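Cross-referencing falls out of the story clustering: a cluster covered by many independent, high-quality sources is more credible than one reported by a single low-authority outlet. A sketch using the `quality_score` from the sources table (thresholds and labels are illustrative):

```python
def cluster_credibility(articles, quality_by_source,
                        min_sources=2, min_avg_quality=0.5):
    """Flag a story cluster as low-confidence when it is covered by few
    independent sources or only by low-quality ones (thresholds illustrative).

    `articles` is a list of (article_id, source_id) pairs in one cluster;
    `quality_by_source` maps source_id -> quality_score (0.0 to 1.0).
    """
    sources = {source_id for _, source_id in articles}
    if not sources:
        return "low_confidence"
    avg_quality = sum(quality_by_source.get(s, 0.0) for s in sources) / len(sources)
    if len(sources) < min_sources or avg_quality < min_avg_quality:
        return "low_confidence"
    return "corroborated"
```

A low-confidence label would typically demote the cluster in ranking or attach a warning, rather than remove it outright.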