Introduction

Designing a URL shortener is a classic system design interview question at Meta that focuses on data modeling, storage design, hashing algorithms, and API design. Unlike infrastructure-heavy distributed system design questions, Meta’s version emphasizes:

  • Data structures and algorithms: Efficient encoding/decoding
  • Data modeling: Database schema design
  • Uniqueness constraints: Handling collisions
  • Storage choice: Relational vs NoSQL trade-offs
  • Caching strategies: Fast lookups
  • API design: Clean, efficient interfaces

This guide covers the design of a URL shortener with a focus on these aspects, including collision handling, custom URLs, and efficient storage strategies.

Table of Contents

  1. Problem Statement
  2. Requirements
  3. Data Modeling
  4. Hashing and Encoding
  5. API Design
  6. Storage Design
  7. Collision Handling
  8. Custom URLs
  9. Caching Strategy
  10. Deep Dive
  11. What Interviewers Look For
  12. Summary

Problem Statement

Design a URL shortener service that:

  1. Shortens long URLs to short, shareable links
  2. Redirects short URLs to original long URLs
  3. Handles collisions when generating short codes
  4. Supports custom short URLs (optional)
  5. Tracks analytics (click count, timestamps)
  6. Provides fast lookups (< 10ms)

Scale Requirements:

  • Support 100M+ URL mappings
  • Handle 10K-100K requests per second
  • Short code length: 6-8 characters
  • Storage: Efficient space utilization
  • Lookup latency: < 10ms
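
A quick back-of-the-envelope check (rough arithmetic with assumed averages, not measured figures) confirms that 6-character Base62 codes comfortably cover this keyspace and that raw storage stays modest:

# Keyspace: 62 possible characters per position
keyspace_6 = 62 ** 6          # ~5.68e10 distinct 6-char codes
keyspace_7 = 62 ** 7          # ~3.52e12 distinct 7-char codes
urls = 100_000_000            # 100M mappings from the requirements

print(f"6-char headroom: {keyspace_6 / urls:.0f}x")            # ~568x

# Storage: assume ~500 bytes per row (long URL, code, metadata, indexes)
bytes_per_row = 500
print(f"Approx. storage: {urls * bytes_per_row / 1e9:.0f} GB")  # ~50 GB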

Requirements

Functional Requirements

  1. Shorten URL: Convert long URL to short code
  2. Expand URL: Retrieve long URL from short code
  3. Custom URLs: Allow users to specify custom short codes
  4. Analytics: Track click counts, timestamps
  5. Validation: Validate URL format
  6. Expiration: Optional TTL for URLs

Non-Functional Requirements

Performance:

  • Short code generation: < 1ms
  • URL lookup: < 10ms
  • High throughput: 10K+ ops/sec

Reliability:

  • No data loss
  • Handle collisions
  • Validate URLs

Storage:

  • Efficient space usage
  • Fast lookups
  • Support 100M+ mappings

Data Modeling

Database Schema

Option 1: Relational Database (MySQL/PostgreSQL)

CREATE TABLE url_mappings (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    short_code VARCHAR(8) UNIQUE NOT NULL,
    long_url TEXT NOT NULL,
    custom BOOLEAN DEFAULT FALSE,
    user_id BIGINT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expires_at TIMESTAMP NULL,
    click_count BIGINT DEFAULT 0,
    INDEX idx_short_code (short_code),
    INDEX idx_user_id (user_id),
    INDEX idx_expires_at (expires_at)
);

CREATE TABLE url_analytics (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    short_code VARCHAR(8) NOT NULL,
    clicked_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    ip_address VARCHAR(45),
    user_agent TEXT,
    referrer TEXT,
    INDEX idx_short_code (short_code),
    INDEX idx_clicked_at (clicked_at)
);

Option 2: Key-Value Store (Redis)

# Redis structure
# Key: short_code -> Value: long_url
# Key: short_code:analytics -> Hash: {clicks, created_at, expires_at}
# Key: long_url:hash -> Value: short_code (for deduplication)
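
A hedged sketch of driving this layout from Python with the redis-py client (host details and exact key names are assumptions; the reverse key could also be keyed on a hash of the long URL to bound key size):

import redis
from typing import Optional

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def store_mapping(short_code: str, long_url: str, ttl_seconds: Optional[int] = None):
    r.set(short_code, long_url, ex=ttl_seconds)           # forward mapping for redirects
    r.set(f"{long_url}:hash", short_code)                 # reverse mapping for deduplication
    r.hset(f"{short_code}:analytics", mapping={"clicks": 0})

def record_click(short_code: str):
    r.hincrby(f"{short_code}:analytics", "clicks", 1)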

Data Model Classes

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class URLMapping:
    id: Optional[int]
    short_code: str
    long_url: str
    custom: bool
    user_id: Optional[int]
    created_at: datetime
    expires_at: Optional[datetime]
    click_count: int
    
    def is_expired(self) -> bool:
        if self.expires_at is None:
            return False
        return datetime.now() > self.expires_at

@dataclass
class URLAnalytics:
    id: Optional[int]
    short_code: str
    clicked_at: datetime
    ip_address: Optional[str]
    user_agent: Optional[str]
    referrer: Optional[str]

Hashing and Encoding

Base62 Encoding

Character Set: 0-9, a-z, A-Z (62 characters)

BASE62_CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(num: int) -> str:
    """Encode number to base62 string."""
    if num == 0:
        return BASE62_CHARS[0]
    
    result = []
    while num > 0:
        result.append(BASE62_CHARS[num % 62])
        num //= 62
    
    return ''.join(reversed(result))

def decode_base62(encoded: str) -> int:
    """Decode base62 string to number."""
    num = 0
    for char in encoded:
        num = num * 62 + BASE62_CHARS.index(char)
    return num
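
A quick round-trip check with illustrative values:

code = encode_base62(125)             # 125 = 2*62 + 1 -> "21"
assert code == "21"
assert decode_base62(code) == 125

# An 8-digit database id still fits in 5 base62 characters
assert len(encode_base62(99_999_999)) == 5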

Hash-Based Approach

import hashlib
import base64

def generate_short_code(long_url: str, length: int = 6) -> str:
    """Generate short code using MD5 hash."""
    # Hash the URL
    hash_bytes = hashlib.md5(long_url.encode()).digest()
    
    # Encode to base64 and take first N characters
    encoded = base64.urlsafe_b64encode(hash_bytes).decode('ascii')
    
    # Remove padding and special characters, take first N chars
    encoded = encoded.rstrip('=').replace('-', '').replace('_', '')
    
    return encoded[:length]
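
Because the code is a pure function of the URL, repeated requests for the same URL deduplicate naturally, while two different URLs can still collide on the truncated prefix (addressed under Collision Handling below). A small usage sketch:

url = "https://example.com/some/very/long/path?with=query"
assert generate_short_code(url) == generate_short_code(url)   # deterministic: same URL, same code
print(generate_short_code(url))                               # a 6-character alphanumeric code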

Counter-Based Approach

class URLShortener:
    def __init__(self):
        self.counter = 0  # In production, use database sequence
    
    def generate_short_code(self, long_url: str) -> str:
        """Generate short code from counter."""
        self.counter += 1
        return encode_base62(self.counter)
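
In a deployed system the in-memory counter would typically be replaced by an atomic shared counter. A hedged sketch using Redis INCR (client details assumed) yields globally unique ids with no collision checks; the trade-off is that sequential codes are guessable unless an offset or keyed shuffle is layered on top:

import redis

r = redis.Redis(host="localhost", port=6379)

def next_short_code() -> str:
    # INCR is atomic, so concurrent app servers never hand out the same id
    sequence_num = r.incr("url_shortener:counter")
    return encode_base62(sequence_num)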

Hash + Timestamp Approach

import hashlib
import time

def generate_short_code(long_url: str, length: int = 6) -> str:
    """Generate short code using hash + timestamp."""
    # Combine URL with timestamp for uniqueness
    timestamp = int(time.time() * 1000)  # milliseconds
    combined = f"{long_url}{timestamp}"
    
    # Hash
    hash_bytes = hashlib.md5(combined.encode()).digest()
    
    # Convert to base62
    num = int.from_bytes(hash_bytes[:4], byteorder='big')
    code = encode_base62(num)
    
    # Ensure minimum length
    while len(code) < length:
        code = '0' + code
    
    return code[:length]

API Design

REST API

from flask import Flask, request, jsonify, redirect
from datetime import datetime, timedelta
from typing import Optional

app = Flask(__name__)

class URLShortenerService:
    def __init__(self, db, cache):
        self.db = db
        self.cache = cache
    
    def shorten_url(self, long_url: str, custom_code: Optional[str] = None,
                   user_id: Optional[int] = None, expires_in_days: Optional[int] = None) -> str:
        """Shorten URL and return short code."""
        # Validate URL
        if not self._validate_url(long_url):
            raise ValueError("Invalid URL")
        
        # Check if custom code provided
        if custom_code:
            if not self._is_valid_code(custom_code):
                raise ValueError("Invalid custom code")
            if self._code_exists(custom_code):
                raise ValueError("Custom code already exists")
            short_code = custom_code
        else:
            # Generate short code
            short_code = self._generate_unique_code(long_url)
        
        # Calculate expiration
        expires_at = None
        if expires_in_days:
            expires_at = datetime.now() + timedelta(days=expires_in_days)
        
        # Store mapping
        mapping = URLMapping(
            id=None,
            short_code=short_code,
            long_url=long_url,
            custom=(custom_code is not None),
            user_id=user_id,
            created_at=datetime.now(),
            expires_at=expires_at,
            click_count=0
        )
        
        self.db.save_mapping(mapping)
        self.cache.set(short_code, long_url, ttl=expires_in_days * 86400 if expires_in_days else None)
        
        return short_code
    
    def expand_url(self, short_code: str) -> Optional[str]:
        """Retrieve long URL from short code."""
        # Check cache first
        long_url = self.cache.get(short_code)
        if long_url:
            return long_url
        
        # Check database
        mapping = self.db.get_mapping(short_code)
        if not mapping:
            return None
        
        if mapping.is_expired():
            return None
        
        # Update click count
        self.db.increment_click_count(short_code)
        
        # Cache result
        self.cache.set(short_code, mapping.long_url)
        
        return mapping.long_url
    
    def _validate_url(self, url: str) -> bool:
        """Validate URL format."""
        try:
            from urllib.parse import urlparse
            result = urlparse(url)
            return all([result.scheme, result.netloc])
        except Exception:
            return False
    
    def _is_valid_code(self, code: str) -> bool:
        """Validate custom code format."""
        if len(code) < 4 or len(code) > 8:
            return False
        return all(c in BASE62_CHARS for c in code)
    
    def _code_exists(self, code: str) -> bool:
        """Check if code already exists."""
        return self.db.code_exists(code)
    
    def _generate_unique_code(self, long_url: str) -> str:
        """Generate unique short code."""
        max_attempts = 10
        for _ in range(max_attempts):
            code = generate_short_code(long_url)
            if not self._code_exists(code):
                return code
        raise RuntimeError("Failed to generate unique code")

@app.route('/api/shorten', methods=['POST'])
def shorten():
    data = request.json
    long_url = data.get('url')
    custom_code = data.get('custom_code')
    user_id = data.get('user_id')
    expires_in_days = data.get('expires_in_days')
    
    try:
        service = URLShortenerService(db, cache)
        short_code = service.shorten_url(long_url, custom_code, user_id, expires_in_days)
        return jsonify({
            'short_code': short_code,
            'short_url': f'https://short.ly/{short_code}'
        }), 201
    except ValueError as e:
        return jsonify({'error': str(e)}), 400

@app.route('/<short_code>', methods=['GET'])
def redirect_to_long_url(short_code):
    service = URLShortenerService(db, cache)
    long_url = service.expand_url(short_code)
    
    if not long_url:
        return jsonify({'error': 'URL not found'}), 404
    
    # Track analytics
    service.track_click(short_code, request.remote_addr, request.headers.get('User-Agent'))
    
    return redirect(long_url, code=302)
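
A hedged client-side sketch of the two endpoints using the requests library (short.ly is the illustrative domain from the handler above):

import requests

# Shorten
resp = requests.post("https://short.ly/api/shorten",
                     json={"url": "https://example.com/some/long/path"})
short_url = resp.json()["short_url"]

# Expand: the service answers with a 302 and a Location header
resp = requests.get(short_url, allow_redirects=False)
print(resp.status_code, resp.headers["Location"])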

Storage Design

Relational Database (Primary Choice)

Pros:

  • ACID guarantees
  • Easy to query analytics
  • Support for complex relationships
  • Well-understood

Cons:

  • May need sharding at very large scale
  • Slightly slower than key-value stores

Schema Design:

-- Primary table for URL mappings
CREATE TABLE url_mappings (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    short_code VARCHAR(8) UNIQUE NOT NULL,
    long_url TEXT NOT NULL,
    -- Indexes for fast lookups
    INDEX idx_short_code (short_code)
);

-- Analytics table (separate for performance)
CREATE TABLE url_clicks (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    short_code VARCHAR(8) NOT NULL,
    clicked_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_short_code_time (short_code, clicked_at)
);

Key-Value Store (Alternative)

Pros:

  • Very fast lookups
  • Simple structure
  • Good for caching

Cons:

  • Limited query capabilities
  • Harder to do analytics
  • May need separate database for analytics

Collision Handling

Strategy 1: Retry with Different Input

def _generate_unique_code(self, long_url: str) -> str:
    """Generate unique code with collision handling."""
    max_attempts = 10
    salt = 0
    
    for attempt in range(max_attempts):
        # Add salt to make hash different
        salted_url = f"{long_url}{salt}{time.time()}"
        code = generate_short_code(salted_url, length=6)
        
        if not self._code_exists(code):
            return code
        
        salt += 1
    
    raise RuntimeError("Failed to generate unique code after retries")

Strategy 2: Increase Code Length

def _generate_unique_code(self, long_url: str, base_length: int = 6) -> str:
    """Generate code, increasing length on collision."""
    length = base_length
    max_length = 10
    
    while length <= max_length:
        code = generate_short_code(long_url, length=length)
        if not self._code_exists(code):
            return code
        length += 1
    
    raise RuntimeError("Failed to generate unique code")

Strategy 3: Database Sequence

def _generate_unique_code(self) -> str:
    """Generate code from database sequence (guaranteed unique)."""
    # Get next sequence number from database
    sequence_num = self.db.get_next_sequence()
    return encode_base62(sequence_num)

Strategy 4: Rely on the Database Unique Constraint

def _generate_unique_code(self, long_url: str) -> str:
    """Generate code with database-level uniqueness check."""
    while True:
        # Assumes generate_short_code is salted/timestamped so a retry
        # yields a different candidate; a pure hash of the same URL
        # would loop forever on collision.
        code = generate_short_code(long_url)
        
        # Try to insert, handle duplicate key error
        try:
            self.db.insert_mapping(code, long_url)
            return code
        except DuplicateKeyError:
            # Collision detected, retry
            continue
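
DuplicateKeyError above is a placeholder for whatever the database driver raises on a unique-constraint violation. A concrete, self-contained sketch of the same insert-and-catch pattern using sqlite3 from the standard library:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE url_mappings (short_code TEXT UNIQUE, long_url TEXT)")

def try_insert(short_code: str, long_url: str) -> bool:
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO url_mappings VALUES (?, ?)", (short_code, long_url))
        return True
    except sqlite3.IntegrityError:
        return False  # collision: caller generates a new candidate and retries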

Custom URLs

Implementation

def shorten_url(self, long_url: str, custom_code: Optional[str] = None) -> str:
    """Shorten URL with optional custom code."""
    if custom_code:
        # Validate custom code
        if not self._is_valid_custom_code(custom_code):
            raise ValueError("Invalid custom code format")
        
        # Check availability
        if self._code_exists(custom_code):
            raise ValueError("Custom code already taken")
        
        # Use custom code
        short_code = custom_code
        is_custom = True
    else:
        # Generate random code
        short_code = self._generate_unique_code(long_url)
        is_custom = False
    
    # Store mapping
    self._store_mapping(short_code, long_url, is_custom)
    return short_code

def _is_valid_custom_code(self, code: str) -> bool:
    """Validate custom code format."""
    # Length: 4-8 characters
    if len(code) < 4 or len(code) > 8:
        return False
    
    # Only alphanumeric and hyphens
    import re
    if not re.match(r'^[a-zA-Z0-9-]+$', code):
        return False
    
    # Reserved words
    reserved = ['api', 'admin', 'www', 'mail', 'ftp']
    if code.lower() in reserved:
        return False
    
    return True

Caching Strategy

Multi-Level Caching

class CacheManager:
    def __init__(self):
        self.l1_cache = {}  # In-memory LRU cache
        self.l2_cache = redis_client  # Redis cache
        self.l1_max_size = 10000
    
    def get(self, short_code: str) -> Optional[str]:
        # L1 cache (fastest)
        if short_code in self.l1_cache:
            return self.l1_cache[short_code]
        
        # L2 cache (Redis)
        long_url = self.l2_cache.get(f"url:{short_code}")
        if long_url:
            # Populate L1 cache
            self._add_to_l1(short_code, long_url)
            return long_url
        
        return None
    
    def set(self, short_code: str, long_url: str, ttl: Optional[int] = None):
        # Set in both caches
        self._add_to_l1(short_code, long_url)
        self.l2_cache.setex(f"url:{short_code}", ttl or 86400, long_url)
    
    def _add_to_l1(self, short_code: str, long_url: str):
        """Add to L1 cache with LRU eviction."""
        if len(self.l1_cache) >= self.l1_max_size:
            # Remove oldest (simple FIFO for demo)
            oldest = next(iter(self.l1_cache))
            del self.l1_cache[oldest]
        
        self.l1_cache[short_code] = long_url
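
The FIFO eviction above is deliberately simplified; a sketch of true LRU behaviour using OrderedDict (entries move to the end on access, and the front entry is evicted) could replace the plain dict:

from collections import OrderedDict
from typing import Optional

class L1Cache:
    def __init__(self, max_size: int = 10000):
        self.max_size = max_size
        self.entries: OrderedDict[str, str] = OrderedDict()

    def get(self, short_code: str) -> Optional[str]:
        if short_code not in self.entries:
            return None
        self.entries.move_to_end(short_code)     # mark as most recently used
        return self.entries[short_code]

    def set(self, short_code: str, long_url: str):
        self.entries[short_code] = long_url
        self.entries.move_to_end(short_code)
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)     # evict least recently used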

Deep Dive

URL Validation

def validate_url(url: str) -> bool:
    """Comprehensive URL validation."""
    try:
        from urllib.parse import urlparse
        result = urlparse(url)
        
        # Must have scheme and netloc
        if not result.scheme or not result.netloc:
            return False
        
        # Scheme must be http or https
        if result.scheme not in ['http', 'https']:
            return False
        
        # Check for valid domain
        if '.' not in result.netloc:
            return False
        
        return True
    except Exception:
        return False

Analytics Tracking

def track_click(self, short_code: str, ip_address: str, user_agent: str):
    """Track URL click for analytics."""
    analytics = URLAnalytics(
        id=None,
        short_code=short_code,
        clicked_at=datetime.now(),
        ip_address=ip_address,
        user_agent=user_agent,
        referrer=None
    )
    
    # Async write to avoid blocking
    self.analytics_queue.put(analytics)
    
    # Increment click count
    self.db.increment_click_count(short_code)
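
The analytics queue implies a consumer. A minimal sketch of a background worker that drains it and batch-writes to the analytics store (queue.Queue from the standard library; save_analytics_batch is an assumed bulk-insert helper on the db layer):

import queue
import threading

def analytics_worker(analytics_queue: queue.Queue, db, batch_size: int = 100):
    batch = []
    while True:
        try:
            batch.append(analytics_queue.get(timeout=1))
        except queue.Empty:
            pass
        # Flush on a full batch, or when the queue goes idle with items pending
        if len(batch) >= batch_size or (batch and analytics_queue.empty()):
            db.save_analytics_batch(batch)
            batch = []

# Started once at application startup:
# threading.Thread(target=analytics_worker, args=(analytics_queue, db), daemon=True).start()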

Expiration Handling

def cleanup_expired_urls(self):
    """Background job to clean up expired URLs."""
    expired_mappings = self.db.get_expired_mappings()
    
    for mapping in expired_mappings:
        # Remove from cache
        self.cache.delete(mapping.short_code)
        
        # Optionally: Archive or delete from database
        # self.db.delete_mapping(mapping.short_code)
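
Scheduling is left open above; one lightweight option (a sketch, assuming the method lives on the service object) is a daemon thread that runs the job periodically, with cron or a task queue being the more common production choice:

import threading
import time

def run_cleanup_periodically(service, interval_seconds: int = 3600):
    def loop():
        while True:
            service.cleanup_expired_urls()
            time.sleep(interval_seconds)
    threading.Thread(target=loop, daemon=True).start()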

What Interviewers Look For

Technical Skills

  1. Data Modeling Ability
    • Can you design a proper database schema?
    • Do you understand relationships and constraints?
    • Can you choose appropriate indexes?
    • Red Flags: Missing indexes, no uniqueness constraints, poor normalization
  2. Algorithm Knowledge
    • Understanding of hashing and encoding (Base62, Base64)
    • Collision handling strategies
    • Trade-offs between different approaches
    • Red Flags: No collision handling, naive approaches
  3. System Design Thinking
    • Storage choice reasoning (relational vs NoSQL)
    • Caching strategy justification
    • API design considerations
    • Red Flags: No reasoning for choices, over-engineering

Problem-Solving Approach

  1. Clarifying Questions
    • Do you ask about requirements (custom URLs, analytics, expiration)?
    • Do you clarify scale and constraints?
    • Red Flags: Jumping to solution without understanding
  2. Handling Edge Cases
    • Collision scenarios
    • Invalid URLs
    • Custom URL conflicts
    • Red Flags: Ignoring edge cases, no error handling
  3. Trade-off Analysis
    • Can you discuss pros/cons of different approaches?
    • Do you justify your choices?
    • Red Flags: No trade-off discussion, dogmatic choices

Communication Skills

  1. Clear Explanation
    • Can you explain your design clearly?
    • Do you use appropriate terminology?
    • Red Flags: Unclear explanations, jargon without context
  2. Code Quality
    • Clean, readable code examples
    • Proper error handling
    • Red Flags: Sloppy code, no error handling

Interview Focus

  1. Data Structures & Algorithms
    • Emphasis on efficient algorithms
    • Understanding of time/space complexity
    • Key: Show strong CS fundamentals
  2. Practical Implementation
    • Can you implement the solution?
    • Do you consider real-world constraints?
    • Key: Balance theory with practicality

Summary

Key Takeaways

  1. Data Modeling: Relational schema with proper indexes
  2. Encoding: Base62 encoding for short codes
  3. Collision Handling: Retry with salt, increase length, or use sequence
  4. Custom URLs: Validation and availability checking
  5. Caching: Multi-level caching for fast lookups
  6. Storage: Relational DB for ACID, key-value for speed

Common Interview Questions

  1. How would you handle collisions?
    • Retry with different input (salt)
    • Increase code length
    • Use database sequence
    • Rely on the database unique constraint (insert and catch duplicates)
  2. How would you store short-long URL mapping?
    • Relational DB: url_mappings table with indexes
    • Key-value: Redis with short_code as key
    • Hybrid: DB for persistence, cache for speed
  3. How would you allow for custom short URLs?
    • Validate format (length, characters, reserved words)
    • Check availability before assignment
    • Mark as custom in database
    • Enforce uniqueness constraint

Design Principles

  1. Efficiency: Fast lookups with caching
  2. Uniqueness: Handle collisions properly
  3. Flexibility: Support custom URLs
  4. Reliability: Validate inputs, handle errors
  5. Analytics: Track usage for insights

Understanding URL shortener design is crucial for Meta interviews focusing on:

  • Data modeling and schema design
  • Hashing and encoding algorithms
  • Collision handling strategies
  • Storage trade-offs
  • API design