Introduction
A web crawler is a program that automatically traverses the web by downloading web pages and following links from one page to another. It is used to index the web for search engines, collect data for research, or monitor websites for changes.
This post provides a detailed walkthrough of designing a scalable web crawler system that can efficiently crawl billions of web pages while adhering to politeness policies and handling failures gracefully. This is a common system design interview question that tests your understanding of distributed systems, queue management, rate limiting, and data storage at scale.
Table of Contents
- Problem Statement
- Requirements
- Capacity Estimation
- Core Entities
- API
- Data Flow
- Database Design
- High-Level Design
- Deep Dive
- Summary
Problem Statement
Design a web crawler system that can:
- Crawl the web starting from a given set of seed URLs
- Extract text data from each web page and store it for later processing
- Follow links to discover new pages to crawl
- Handle billions of web pages efficiently
- Adhere to politeness policies (robots.txt, rate limiting)
- Handle failures gracefully and resume crawling without losing progress
Out of Scope:
- Processing of the extracted text data (e.g., training LLMs)
- Handling non-text data (images, videos, etc.)
- Handling dynamic content (JavaScript-rendered pages)
- Handling authentication (login-required pages)
Requirements
Functional Requirements
Core Requirements:
- Crawl URLs: Start from seed URLs and crawl discovered pages
- Extract Text Data: Extract text content from HTML pages
- Follow Links: Discover and queue new URLs from crawled pages
- Store Data: Store extracted text data for later processing
- Track Progress: Track which URLs have been crawled and which are pending
Out of Scope:
- Text data processing (LLM training, indexing, ranking)
- Non-text content handling
- Dynamic content rendering
- Authentication handling
Non-Functional Requirements
Scale Assumptions:
- 10 billion pages on the web
- Average page size: 2MB
- Time constraint: Complete crawling in 5 days
- Total raw data: ~20PB (10B pages × 2MB)
Core Requirements:
- Fault Tolerance: Handle failures gracefully and resume without losing progress
- Politeness: Adhere to robots.txt and avoid overloading servers
- Efficiency: Complete crawling within 5 days
- Scalability: Handle 10 billion pages
- Deduplication: Avoid crawling the same URL multiple times
- Rate Limiting: Respect per-domain rate limits
Out of Scope:
- Security (protecting from malicious actors)
- Cost optimization
- Legal compliance and privacy regulations
Capacity Estimation
Traffic Estimates
- Total pages to crawl: 10 billion
- Average page size: 2MB
- Time constraint: 5 days
- Required throughput: 10B pages / (5 days × 24 hours × 3600 seconds) = ~23,148 pages/second
- Bandwidth requirement: 23,148 pages/sec × 2MB = ~46GB/s
Storage Estimates
Text Data:
- 10B pages × 200KB (extracted text) = 2PB
- With compression (50%): 1PB
Metadata:
- 10B URLs × 100 bytes = 1TB
- Crawl history: 10B × 50 bytes = 500GB
- Total metadata: ~1.5TB
Total Storage: ~1PB of compressed text plus ~1.5TB of metadata
Bandwidth Estimates
- Download bandwidth: ~46GB/s (full 2MB HTML pages)
- Upload bandwidth: ~4.6GB/s (extracted text, ~200KB per page, to blob storage)
- Total bandwidth: ~50GB/s
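These figures can be sanity-checked in a few lines of Python; all inputs are the assumptions stated above (10B pages, 2MB raw HTML, ~200KB extracted text, 5 days):

# Back-of-the-envelope check of the capacity estimates (inputs are the assumptions above).
PAGES = 10_000_000_000
SECONDS = 5 * 24 * 3600                                  # 432,000 seconds
PAGE_SIZE_MB = 2                                         # raw HTML per page
TEXT_SIZE_KB = 200                                       # extracted text per page

pages_per_sec = PAGES / SECONDS                          # ~23,148 pages/second
download_gb_s = pages_per_sec * PAGE_SIZE_MB / 1000      # ~46 GB/s of HTML downloads
upload_gb_s = pages_per_sec * TEXT_SIZE_KB / 1_000_000   # ~4.6 GB/s of text uploads
print(f"{pages_per_sec:,.0f} pages/s, {download_gb_s:.0f} GB/s down, {upload_gb_s:.1f} GB/s up")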
Core Entities
URL
- Attributes: url_id, url_string, domain, status, priority, discovered_at, crawled_at
- Relationships: Links to other URLs, belongs to domain
Domain
- Attributes: domain_id, domain_name, robots_txt, rate_limit, last_crawled_at
- Relationships: Contains URLs, has rate limits
Page
- Attributes: page_id, url_id, content_hash, text_data_url, links_count, created_at
- Relationships: Belongs to URL, contains links to other URLs
Crawl Job
- Attributes: job_id, seed_urls, status, started_at, completed_at
- Relationships: Contains URLs to crawl
API
1. Start Crawl Job
POST /api/v1/crawl/start
Parameters:
- seed_urls: array of seed URLs
- max_pages: maximum pages to crawl (optional)
Response:
- job_id: unique job identifier
- status: "started"
2. Get Crawl Status
GET /api/v1/crawl/{job_id}/status
Parameters:
- job_id: job ID
Response:
- status: "running", "completed", "failed"
- pages_crawled: number of pages crawled
- pages_remaining: number of pages remaining
- progress: percentage complete
3. Get Crawled Data
GET /api/v1/crawl/{job_id}/data
Parameters:
- job_id: job ID
- limit: number of pages to return (default: 100)
- offset: pagination offset (optional)
Response:
- pages: array of page objects
- total_pages: total number of pages crawled
Data Flow
The high-level data flow for our web crawler:
- Seed URLs → Add to frontier queue
- Frontier Queue → Dequeue URL for crawling
- DNS Resolution → Resolve domain name to IP address
- Fetch HTML → Download HTML from web server
- Parse HTML → Extract text data and links
- Store Text → Save extracted text to blob storage
- Extract Links → Discover new URLs and add to frontier queue
- Repeat → Continue until all URLs are crawled
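A single-worker sketch of this loop; the helper functions are placeholders for the components described in the rest of this post:

def crawl_loop(frontier):
    # Placeholder helpers: resolve_dns, fetch_html, parse_html, store_text, seen_before
    while not frontier.empty():
        url = frontier.dequeue()            # 2. take the next URL from the frontier
        ip = resolve_dns(url)               # 3. DNS resolution (cached)
        html = fetch_html(url, ip)          # 4. download the page
        text, links = parse_html(html)      # 5. extract text and outgoing links
        store_text(url, text)               # 6. persist text to blob storage
        for link in links:                  # 7. discover new URLs
            if not seen_before(link):
                frontier.enqueue(link)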
High-Level Design
┌─────────────────────────────────────────────────────────┐
│ System Boundary │
│ │
│ ┌──────────────┐ │
│ │ Frontier │ ──── Queue of URLs to crawl │
│ │ Queue │ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Crawler │ ──── Fetches pages, extracts data │
│ │ Workers │ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ DNS │ ──── Resolves domain names │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Parser │ ──── Extracts text and links │
│ │ Workers │ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Blob Store │ ──── Stores extracted text data │
│ │ (S3) │ │
│ └──────────────┘ │
│ │
│ ┌──────────────┐ │
│ │ Metadata DB │ ──── Tracks crawled URLs, status │
│ └──────────────┘ │
│ │
└───────────────────────────────────────────────────────────┘
│
▼
┌─────────────────┐
│ Web Servers │ ──── External web pages
└─────────────────┘
Database Design
Schema Design
URLs Table:
CREATE TABLE urls (
  url_id BIGINT PRIMARY KEY AUTO_INCREMENT,
  url_hash CHAR(64) UNIQUE NOT NULL,   -- hash of the normalized URL; a full-length unique index on url_string would exceed InnoDB key-size limits
  url_string VARCHAR(2048) NOT NULL,
  domain VARCHAR(255) NOT NULL,
  status ENUM('pending', 'crawling', 'completed', 'failed') DEFAULT 'pending',
  priority INT DEFAULT 0,
  discovered_at TIMESTAMP,
  crawled_at TIMESTAMP,
  INDEX idx_domain_status (domain, status),
  INDEX idx_status_priority (status, priority DESC)
);
Domains Table:
CREATE TABLE domains (
  domain_id BIGINT PRIMARY KEY AUTO_INCREMENT,
  domain_name VARCHAR(255) UNIQUE NOT NULL,
  robots_txt TEXT,
  rate_limit_per_second INT DEFAULT 1,
  last_crawled_at TIMESTAMP,
  INDEX idx_domain_name (domain_name)
);
Pages Table:
CREATE TABLE pages (
  page_id BIGINT PRIMARY KEY AUTO_INCREMENT,
  url_id BIGINT NOT NULL,
  content_hash VARCHAR(64) NOT NULL,
  text_data_url VARCHAR(512) NOT NULL,
  links_count INT DEFAULT 0,
  created_at TIMESTAMP,
  INDEX idx_url_id (url_id),
  INDEX idx_content_hash (content_hash),
  FOREIGN KEY (url_id) REFERENCES urls(url_id)
);
Crawl Jobs Table:
CREATE TABLE crawl_jobs (
  job_id BIGINT PRIMARY KEY AUTO_INCREMENT,
  seed_urls JSON NOT NULL,
  status ENUM('pending', 'running', 'completed', 'failed') DEFAULT 'pending',
  pages_crawled INT DEFAULT 0,
  started_at TIMESTAMP,
  completed_at TIMESTAMP,
  INDEX idx_status (status)
);
Database Sharding Strategy
Shard by Domain:
- URLs and pages grouped by domain
- Enables efficient domain-based rate limiting
- Reduces cross-shard queries
Deep Dive
Component Design
1. Frontier Queue
The frontier queue stores URLs that need to be crawled. It should support:
- High throughput: Millions of URLs per second
- Deduplication: Avoid duplicate URLs
- Prioritization: Crawl important URLs first
- Persistence: Survive system failures
Technology Options:
Kafka:
- ✅ High throughput (millions of messages/second)
- ✅ Persistence and durability
- ✅ Multiple consumer groups
- ✅ Built-in partitioning
- ❌ More complex setup
- ❌ Overkill for simple use cases
Redis:
- ✅ Simple and fast
- ✅ Built-in data structures (sets, sorted sets)
- ✅ Good for deduplication
- ❌ Limited persistence options
- ❌ Memory constraints
SQS:
- ✅ Fully managed
- ✅ Simple to use
- ✅ Built-in retry logic
- ❌ Lower throughput than Kafka
- ❌ Less control over prioritization
Recommendation: Use Kafka for high-scale crawling (10B pages) due to its high throughput and persistence capabilities.
Queue Structure:
- Partition by domain so every URL from a given domain lands on the same partition (key messages by domain); a single consumer then owns that domain and can enforce politeness (see the sketch below)
- Use priority queues for important URLs
- Implement deduplication using Bloom filters or hash sets
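A minimal sketch of producing into such a frontier topic, assuming the kafka-python client and a topic named "frontier"; keying messages by the domain makes Kafka hash every URL of a domain to the same partition:

from urllib.parse import urlparse
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafka:9092")   # assumed broker address

def enqueue_url(url: str):
    domain = urlparse(url).netloc
    # Same key -> same partition -> a single consumer enforces this domain's rate limit
    producer.send("frontier", key=domain.encode("utf-8"), value=url.encode("utf-8"))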
2. Crawler Workers
Crawler workers fetch HTML pages from web servers. They need to:
- Respect robots.txt: Check robots.txt before crawling
- Rate Limiting: Limit requests per domain
- Handle Failures: Retry failed requests
- Timeout Handling: Set appropriate timeouts
Architecture:
Crawler Worker Pool
├── Worker 1 → Domain A (rate limited)
├── Worker 2 → Domain B (rate limited)
├── Worker 3 → Domain C (rate limited)
└── ...
Rate Limiting Strategy:
- Maintain per-domain rate limiters
- Use token bucket or sliding window algorithms
- Default: 1 request per second per domain
- Respect robots.txt crawl-delay directives
Robots.txt Handling:
- Fetch and parse robots.txt for each domain
- Cache robots.txt rules (TTL: 24 hours)
- Check rules before crawling each URL
- Respect User-Agent requirements
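A minimal robots.txt check using Python's standard-library parser; in the real system the parsed rules would be cached per domain (for example in Redis) with the 24-hour TTL mentioned above, and the user agent string is an assumption:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                              # fetch and parse robots.txt

if rp.can_fetch("MyCrawlerBot/1.0", "https://example.com/some/page"):
    delay = rp.crawl_delay("MyCrawlerBot/1.0") or 1    # fall back to the 1 req/s default
    # ...schedule the fetch, leaving `delay` seconds between requests to this domain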
Failure Handling:
- Retry transient failures (5xx errors) with exponential backoff
- Skip permanent failures (4xx errors)
- Track failure rates per domain
- Blacklist domains with high failure rates
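A sketch of that retry policy using the requests library; the retry count and timeout values are assumptions:

import time
import requests

def fetch_with_retries(url, max_retries=3, timeout=10):
    for attempt in range(max_retries + 1):
        try:
            resp = requests.get(url, timeout=timeout)
            if resp.status_code < 400:
                return resp.text
            if resp.status_code < 500:
                return None                    # 4xx: permanent failure, skip the URL
        except requests.RequestException:
            pass                               # network error/timeout: treat as transient
        time.sleep(2 ** attempt)               # exponential backoff: 1s, 2s, 4s, ...
    return None                                # exhausted retries: mark as failed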
3. DNS Resolution
DNS resolution is a critical bottleneck. We need to:
- Cache DNS lookups: Avoid repeated DNS queries
- Handle DNS failures: Retry with backoff
- Avoid overloading DNS servers: Rate limit DNS queries
DNS Caching Strategy:
- Cache DNS responses (TTL from DNS record)
- Use in-memory cache (Redis) for frequently accessed domains
- Fallback to DNS server if cache miss
- Handle DNS failures gracefully
Implementation:
- Use DNS caching library (e.g., dns-cache)
- Cache in Redis with TTL from DNS response
- Maintain DNS resolver pool for parallel lookups
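A sketch of the cache using redis-py and the standard socket resolver; the fixed TTL here is an assumption (real code would honor the TTL from the DNS record):

import socket
import redis

dns_cache = redis.Redis(host="localhost", port=6379)

def resolve(domain: str, ttl_seconds: int = 3600) -> str:
    cached = dns_cache.get(f"dns:{domain}")
    if cached:
        return cached.decode("utf-8")          # cache hit
    ip = socket.gethostbyname(domain)          # cache miss: ask the resolver
    dns_cache.setex(f"dns:{domain}", ttl_seconds, ip)
    return ip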
4. Parser Workers
Parser workers extract text data and links from HTML:
- HTML Parsing: Parse HTML and extract text content
- Link Extraction: Find all links (href attributes)
- Normalization: Normalize URLs (remove fragments, resolve relative URLs)
- Filtering: Filter out unwanted URLs (mailto:, javascript:, etc.)
Text Extraction:
- Remove HTML tags
- Extract text content
- Clean and normalize text
- Handle encoding (UTF-8, etc.)
Link Extraction:
- Parse <a href="..."> tags
- Extract absolute and relative URLs
- Normalize URLs (resolve relative URLs against the page URL)
- Filter URLs (remove duplicates and invalid URLs)
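A sketch of text and link extraction with BeautifulSoup (the parser suggested in the technology summary below); urljoin resolves relative hrefs against the page URL:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

def parse_page(base_url: str, html: str):
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ", strip=True)           # strip tags, keep text
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.startswith(("mailto:", "javascript:", "#")):  # filter unwanted schemes
            continue
        links.append(urljoin(base_url, href))                 # resolve relative URLs
    return text, links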
URL Normalization:
- Convert to absolute URLs
- Remove fragments (#)
- Remove default ports (80, 443)
- Convert to lowercase
- Remove trailing slashes (optional)
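One possible normalization routine using the standard library; exactly which rules to apply (for example trailing-slash removal) is a policy choice, as noted above:

from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.port and parts.port not in (80, 443):   # keep only non-default ports
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"             # optional trailing-slash removal
    return urlunsplit((parts.scheme, host, path, parts.query, ""))  # drop the fragment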
5. Blob Storage (S3)
Store extracted text data in blob storage:
- Scalability: Handle petabytes of data
- Durability: 99.999999999% (11 9’s) durability
- Cost: Low-cost storage for large files
- Access: Easy retrieval for downstream processing
Storage Structure:
s3://crawler-data/
├── domain1.com/
│ ├── page1.html.txt
│ ├── page2.html.txt
│ └── ...
├── domain2.com/
│ └── ...
Metadata:
- Store metadata in separate database (not in S3)
- Include: URL, crawl timestamp, file path, size, etc.
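A sketch of writing one page's text to S3 with boto3, following the key layout above; the bucket name matches that layout, while the key pattern and function name are assumptions:

import boto3

s3 = boto3.client("s3")

def store_text(domain: str, page_name: str, text: str) -> str:
    key = f"{domain}/{page_name}.txt"
    s3.put_object(Bucket="crawler-data", Key=key, Body=text.encode("utf-8"))
    return f"s3://crawler-data/{key}"      # recorded in the metadata DB as file_path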
6. Metadata Database
Track crawled URLs and their status:
- URL Status: Pending, Crawling, Completed, Failed
- Crawl History: Timestamp, status, error messages
- Deduplication: Track seen URLs
- Statistics: Crawl progress, success rates
Metadata schema (keyed by a hash of the normalized URL, which doubles as the deduplication key):
CREATE TABLE urls (
  url_hash VARCHAR(64) PRIMARY KEY,
  url TEXT NOT NULL,
  domain VARCHAR(255) NOT NULL,
  status ENUM('pending', 'crawling', 'completed', 'failed') NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  crawled_at TIMESTAMP,
  retry_count INT DEFAULT 0,
  error_message TEXT,
  file_path VARCHAR(512),
  file_size BIGINT,
  INDEX idx_domain (domain),
  INDEX idx_status (status),
  INDEX idx_crawled_at (crawled_at)
);
CREATE TABLE crawl_history (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  url_hash VARCHAR(64) NOT NULL,
  status ENUM('crawling', 'completed', 'failed') NOT NULL,
  timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  error_message TEXT,
  FOREIGN KEY (url_hash) REFERENCES urls(url_hash),
  INDEX idx_url_hash (url_hash),
  INDEX idx_timestamp (timestamp)
);
Technology Choice:
- PostgreSQL: For structured metadata and queries
- Cassandra: For high-scale write-heavy workloads
- DynamoDB: For fully managed solution
Recommendation: Start with PostgreSQL for structured metadata and progress queries; if URL-status writes become the bottleneck at 10B-page scale, move that table to Cassandra or DynamoDB.
Scalability and Performance
Throughput Calculation
Target: Crawl 10 billion pages in 5 days
Calculation:
- Total pages: 10B
- Time: 5 days = 432,000 seconds
- Required throughput: 10B / 432,000 ≈ 23,148 pages/second
With overhead and failures:
- Assume 50% overhead (DNS, retries, etc.)
- Target: ~35,000 pages/second
Scaling Strategy
Horizontal Scaling:
- Scale crawler workers: 1,000+ workers
- Scale parser workers: 500+ workers
- Scale Kafka partitions: 100+ partitions
- Scale database: Shard by domain or URL hash
Bottlenecks:
- DNS Resolution: Cache aggressively, use DNS pools
- Network Bandwidth: Distribute across regions
- Database Writes: Batch writes, use write-optimized DB
- Queue Throughput: Partition Kafka topics by domain
Deduplication
Bloom Filter:
- Use Bloom filter for fast duplicate detection
- False positives acceptable (will skip some URLs)
- Memory efficient: ~10 bits per URL
- For 10B URLs: ~12.5GB memory
Hash Set:
- Use distributed hash set (Redis) for exact deduplication
- More memory intensive but accurate
- Use for critical URLs
Hybrid Approach:
- Use the Bloom filter as a fast first-pass check
- When the Bloom filter reports a possible duplicate, confirm against the exact hash set
- Balances memory usage against accuracy (see the sketch below)
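A toy illustration of the hybrid approach with a hand-rolled Bloom filter and an in-process set; the sizes are toy values, and in production the exact set would live in a sharded store such as Redis:

import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1_000_000, num_hashes: int = 7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bloom = BloomFilter()
exact = set()                               # stand-in for a distributed exact set

def is_new_url(url: str) -> bool:
    # Bloom filter says "definitely new" or "maybe seen"; only the maybes hit the exact set
    if bloom.might_contain(url) and url in exact:
        return False
    bloom.add(url)
    exact.add(url)
    return True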
Politeness and Rate Limiting
Per-Domain Rate Limiting:
- Default: 1 request per second per domain
- Respect robots.txt crawl-delay
- Use token bucket algorithm
- Track rate limiters per domain
Robots.txt Caching:
- Cache robots.txt rules (TTL: 24 hours)
- Store in Redis or database
- Parse and apply rules before crawling
Implementation:
import time

class RateLimiter:
    """Token-bucket rate limiter for a single domain."""

    def __init__(self, domain, requests_per_second=1):
        self.domain = domain
        self.requests_per_second = requests_per_second
        self.tokens = requests_per_second       # bucket starts full
        self.last_update = time.time()

    def acquire(self):
        # Refill tokens based on the time elapsed since the last call,
        # capped at the bucket capacity (requests_per_second)
        now = time.time()
        elapsed = now - self.last_update
        self.tokens = min(
            self.requests_per_second,
            self.tokens + elapsed * self.requests_per_second
        )
        self.last_update = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
Fault Tolerance
Failure Scenarios
- Crawler Worker Failure: Worker crashes mid-crawl
- Network Failure: Connection timeout or DNS failure
- Web Server Failure: 5xx errors from web servers
- Database Failure: Metadata DB unavailable
- Queue Failure: Kafka partition unavailable
Handling Strategies
Crawler Worker Failures:
- Use idempotent operations
- Track progress in database
- Re-queue URLs on worker failure
- Implement heartbeat mechanism
Network Failures:
- Retry with exponential backoff
- Skip after max retries
- Track failure rates per domain
- Blacklist problematic domains
Database Failures:
- Use database replication
- Implement retry logic
- Use eventual consistency where possible
- Batch writes to reduce load
Queue Failures:
- Use Kafka replication (3 replicas)
- Implement consumer group rebalancing
- Handle partition leader changes
Idempotency
URL Processing:
- Check if URL already crawled before processing
- Use database transaction to mark URL as crawling
- Store results atomically
- Handle duplicate processing gracefully
Implementation:
def crawl_url(url):
    url_hash = hash_url(url)

    # Skip URLs that have already been crawled
    if is_crawled(url_hash):
        return

    # Mark as crawling (atomic operation); if this fails, another worker owns the URL
    if not mark_as_crawling(url_hash):
        return

    try:
        # Fetch and parse
        html = fetch_html(url)
        text = extract_text(html)
        links = extract_links(html)

        # Store results and enqueue newly discovered links
        store_text(url, text)
        add_links_to_queue(links)
        mark_as_completed(url_hash)
    except Exception as e:
        mark_as_failed(url_hash, str(e))
        raise
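The crucial piece is mark_as_crawling. A sketch as a single conditional UPDATE against the metadata schema above, assuming a PostgreSQL connection via psycopg2, so only one worker can flip a URL from 'pending' to 'crawling':

def mark_as_crawling(conn, url_hash: str) -> bool:
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE urls SET status = 'crawling' "
            "WHERE url_hash = %s AND status = 'pending'",
            (url_hash,),
        )
        conn.commit()
        return cur.rowcount == 1      # False: another worker already claimed this URL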
Advanced Considerations
1. URL Prioritization
Priority Queue:
- Prioritize important URLs (high PageRank, popular domains)
- Use multiple priority levels
- Implement priority queue in Kafka or Redis
Strategies:
- PageRank: Prioritize URLs with high PageRank
- Domain Popularity: Prioritize popular domains
- Recency: Prioritize recently updated pages
- User Signals: Prioritize based on user clicks/views
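A minimal priority frontier on a Redis sorted set, where the score is whatever ranking signal is chosen (the PageRank-style score here is an assumption):

import redis

r = redis.Redis()

def push(url: str, priority: float):
    r.zadd("frontier:priority", {url: priority})     # higher score = crawled sooner

def pop_highest():
    popped = r.zpopmax("frontier:priority")          # [(url_bytes, score)] or []
    return popped[0][0].decode("utf-8") if popped else None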
2. Crawler Traps
Detection:
- Track URL depth (number of hops from seed)
- Detect URL patterns (session IDs, timestamps)
- Limit depth per domain
- Detect cycles in URL graph
Prevention:
- Set maximum depth limit (e.g., 10 hops)
- Filter URLs with suspicious patterns
- Use canonical URLs
- Track URL patterns per domain
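An illustrative trap filter combining a depth cap with a pattern check; the regex, depth limit, and parameter threshold are all assumptions to tune per deployment:

import re
from urllib.parse import urlparse, parse_qs

MAX_DEPTH = 10
SUSPICIOUS_PARAMS = re.compile(r"(sessionid|sid|phpsessid|jsessionid)", re.IGNORECASE)

def should_crawl(url: str, depth: int) -> bool:
    if depth > MAX_DEPTH:                            # too many hops from the seed
        return False
    query = urlparse(url).query
    if SUSPICIOUS_PARAMS.search(query):              # session-ID-style trap signature
        return False
    if len(parse_qs(query)) > 10:                    # suspiciously many query parameters
        return False
    return True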
3. Distributed Crawling
Domain-Based Partitioning:
- Partition URLs by domain
- Each worker handles specific domains
- Ensures politeness (one worker per domain)
- Simplifies rate limiting
Geographic Distribution:
- Deploy crawlers in multiple regions
- Crawl from region closest to web server
- Reduce latency and bandwidth costs
- Handle regional restrictions
4. Monitoring and Observability
Key Metrics:
- Crawl rate (pages/second)
- Success rate (% of successful crawls)
- Failure rate by domain
- Queue depth (pending URLs)
- Storage usage
- Worker utilization
Alerting:
- Low crawl rate
- High failure rate
- Queue backup
- Worker failures
- Storage capacity
Tools:
- Prometheus for metrics
- Grafana for dashboards
- ELK stack for logs
- PagerDuty for alerts
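A sketch of exposing the key metrics with the prometheus_client library; the metric names are assumptions:

from prometheus_client import Counter, Gauge, start_http_server

pages_crawled = Counter("crawler_pages_total", "Pages crawled", ["status"])
frontier_depth = Gauge("crawler_frontier_depth", "Pending URLs in the frontier")

start_http_server(9100)    # serves /metrics for Prometheus to scrape

# Inside the worker loop:
#   pages_crawled.labels(status="success").inc()
#   frontier_depth.set(current_queue_depth)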
Technology Stack Summary
| Component | Technology | Rationale |
|---|---|---|
| Frontier Queue | Kafka | High throughput, persistence, partitioning |
| Crawler Workers | Python/Go | HTTP clients, async processing |
| DNS Resolution | DNS Cache (Redis) | Fast lookups, caching |
| Parser Workers | Python (BeautifulSoup) | HTML parsing, text extraction |
| Blob Storage | S3 | Scalable, durable, cost-effective |
| Metadata DB | PostgreSQL/Cassandra | Structured queries, high-scale writes |
| Rate Limiting | Redis | Token bucket, per-domain limits |
| Monitoring | Prometheus + Grafana | Metrics and dashboards |
Interview Tips
For Mid-Level Engineers
Focus Areas:
- High-level architecture
- Basic components (queue, workers, storage)
- Simple scaling strategies
- Basic fault tolerance
Expected Depth:
- Understand data flow
- Know basic technologies (Kafka, S3, PostgreSQL)
- Discuss simple rate limiting
- Handle basic failures
For Senior Engineers
Focus Areas:
- Detailed component design
- Advanced scaling strategies
- Politeness and rate limiting details
- Fault tolerance and idempotency
- Performance optimization
Expected Depth:
- Deep dive into queue technology choices
- Detailed rate limiting implementation
- DNS caching strategies
- Database sharding strategies
- Throughput calculations
For Staff+ Engineers
Focus Areas:
- System-wide optimizations
- Advanced distributed systems concepts
- Cost optimization
- Operational excellence
- Trade-off analysis
Expected Depth:
- Multiple deep dives (3+ areas)
- Innovative solutions
- Real-world experience
- Complex problem-solving
- Strategic thinking
What Interviewers Look For
Distributed Systems Skills
- Queue Management
- High-throughput queue design
- Kafka/message queue usage
- Priority queues
- Red Flags: Simple queue, no prioritization, bottlenecks
- Deduplication Strategy
- Bloom filter usage
- Hash-based deduplication
- Efficient duplicate detection
- Red Flags: No deduplication, inefficient approach, memory issues
- Politeness & Rate Limiting
- Robots.txt handling
- Domain-based rate limiting
- Respectful crawling
- Red Flags: No politeness, aggressive crawling, no rate limits
Problem-Solving Approach
- Scale Thinking
- Billions of pages
- Efficient storage
- Distributed crawling
- Red Flags: Single-threaded, small-scale, no distribution
- Fault Tolerance
- Handle failures gracefully
- Retry mechanisms
- Checkpointing
- Red Flags: No failure handling, no retry, data loss
- Edge Cases
- Infinite loops
- Malformed URLs
- Dynamic content
- Red Flags: Ignoring edge cases, crashes
System Design Skills
- Storage Design
- Efficient URL storage
- Content storage (S3)
- Metadata management
- Red Flags: Database for content, inefficient storage
- Worker Design
- Distributed workers
- Load balancing
- Resource management
- Red Flags: Single worker, no distribution
- DNS Optimization
- DNS caching
- Efficient resolution
- Red Flags: No caching, slow DNS
Communication Skills
- Architecture Explanation
- Clear component design
- Data flow understanding
- Red Flags: Unclear design, no flow
- Trade-off Analysis
- Storage vs speed
- Freshness vs efficiency
- Red Flags: No trade-offs, dogmatic choices
Meta-Specific Focus
- Distributed Systems Expertise
- Queue management
- Worker coordination
- Key: Show distributed systems knowledge
- Efficiency Optimization
- Deduplication strategies
- Performance optimization
- Key: Demonstrate optimization skills
Conclusion
Designing a web crawler requires careful consideration of:
- Scalability: Handle billions of pages efficiently
- Politeness: Respect robots.txt and rate limits
- Fault Tolerance: Handle failures gracefully
- Deduplication: Avoid crawling duplicates
- Performance: Meet time constraints
Key design decisions:
- Kafka for high-throughput queue management
- Domain-based partitioning for politeness
- Aggressive DNS caching for performance
- Bloom filters + hash sets for deduplication
- S3 for scalable blob storage
- PostgreSQL/Cassandra for metadata tracking
The system should be designed to scale horizontally, handle failures gracefully, and respect web server resources while efficiently crawling the web.