Introduction

CamelCamelCamel is a price tracking service built specifically for Amazon. It records historical prices, renders price charts, sends price drop alerts, and offers browser extensions for one-click product tracking. The system tracks tens of millions of Amazon products, stores billions of price points, and serves millions of users daily.

This post provides a detailed walkthrough of designing CamelCamelCamel’s architecture, covering Amazon-specific scraping, ASIN-based product tracking, historical price storage, price chart generation, browser extension integration, email alerts, and handling Amazon’s rate limits and anti-scraping measures. This is a common system design interview question that tests your understanding of distributed systems, web scraping, time-series data, browser extensions, and notification systems.

Table of Contents

  1. Problem Statement
  2. Requirements
  3. Capacity Estimation
  4. Core Entities
  5. API
  6. Data Flow
  7. Database Design
  8. High-Level Design
  9. Deep Dive
  10. Summary

Problem Statement

Design CamelCamelCamel, a price tracking service for Amazon products with the following features:

  1. Track Amazon products by ASIN (Amazon Standard Identification Number)
  2. Store historical price data
  3. Display price charts (1 day, 1 week, 1 month, 3 months, 1 year, all time)
  4. Send price drop email alerts
  5. Browser extension for easy product tracking
  6. Product search and discovery
  7. Price history API access
  8. Support multiple Amazon marketplaces (US, UK, CA, etc.)
  9. Track different product conditions (new, used, refurbished)
  10. Handle Amazon’s rate limits and anti-scraping measures

Scale Requirements:

  • 20 million+ users
  • 50 million+ tracked products (ASINs)
  • 500 million+ price checks per day
  • Peak: 30,000 price checks per second
  • Average products per user: 15
  • Popular products: 500,000+ trackers
  • Historical data: 5+ years per product
  • Must respect Amazon’s rate limits
  • Price check frequency: Every 1-6 hours per product

Requirements

Functional Requirements

Core Features:

  1. Product Tracking: Add products by ASIN or Amazon URL
  2. Price Monitoring: Continuously monitor Amazon product prices
  3. Historical Price Storage: Store all price points with timestamps
  4. Price Charts: Display interactive price charts (multiple time ranges)
  5. Price Drop Alerts: Email alerts when price drops below threshold
  6. Browser Extension: Chrome/Firefox extension for easy tracking
  7. Product Search: Search products by name, ASIN, or URL
  8. Multi-Marketplace: Support US, UK, CA, DE, FR, etc.
  9. Product Conditions: Track new, used, refurbished prices separately
  10. Price History API: Public API for price history data

Out of Scope:

  • Product reviews and ratings
  • Product recommendations
  • Coupon code management
  • Mobile app (focus on web and browser extension)
  • Payment processing
  • User authentication (assume existing auth system)

Non-Functional Requirements

  1. Availability: 99.9% uptime
  2. Reliability: No missed price drops, accurate historical data
  3. Performance:
    • Price check: < 3 seconds
    • Price chart rendering: < 500ms
    • Alert delivery: < 10 minutes after price drop
    • Browser extension response: < 200ms
  4. Scalability: Handle 30K+ price checks per second
  5. Accuracy: Accurate price detection, handle Amazon’s dynamic pricing
  6. Rate Limiting: Respect Amazon’s rate limits strictly
  7. Data Retention: 5+ years of historical price data

Capacity Estimation

Traffic Estimates

  • Total Users: 20 million
  • Daily Active Users (DAU): 2 million
  • Tracked Products (ASINs): 50 million
  • Price Checks per Day: 500 million
  • Average Checks per Product: 10 per day
  • Peak Price Check Rate: 30,000 per second
  • Normal Price Check Rate: 5,000 per second
  • Email Alerts per Day: 5 million
  • Browser Extension Requests: 10 million per day
  • New Products Added per Day: 500,000

Storage Estimates

Product Data:

  • 50M products × 2KB = 100GB
  • Product metadata, ASINs, URLs, marketplace info

Price History (Time-Series):

  • 500M price checks/day × 150 bytes = 75GB/day
  • 30-day retention (hot): ~2.25TB
  • 1-year retention (warm): ~27TB
  • 5-year archive (cold): ~135TB

User Tracking Preferences:

  • 20M users × 15 products × 300 bytes = 90GB

Email Alerts:

  • 5M alerts/day × 1KB = 5GB/day
  • 30-day retention: ~150GB

Browser Extension Cache:

  • 10M requests/day × 500 bytes = 5GB/day
  • 7-day retention: ~35GB

Total Storage: ~165TB across tiers (the ~135TB five-year price-history archive dominates)

Bandwidth Estimates

Normal Traffic:

  • 5,000 price checks/sec × 15KB = 75MB/s = 600Mbps
  • Amazon scraping responses

Peak Traffic:

  • 30,000 price checks/sec × 15KB = 450MB/s = 3.6Gbps

Browser Extension:

  • 10M requests/day × 2KB = 20GB/day = ~230KB/s = ~2Mbps

Email Delivery:

  • 5M emails/day × 5KB = 25GB/day = ~290KB/s = ~2Mbps

Total Peak: ~3.6Gbps
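
As a sanity check, the storage and bandwidth arithmetic above fits in a few lines of Python; the per-item byte sizes (150 bytes per price point, 15KB per scrape response) are the assumptions stated in the estimates.

checks_per_day = 500_000_000
price_point_bytes = 150

daily_gb = checks_per_day * price_point_bytes / 1e9   # 75 GB/day
hot_tb = daily_gb * 30 / 1e3                          # ~2.25 TB (30 days)
warm_tb = daily_gb * 365 / 1e3                        # ~27 TB (1 year)
cold_tb = daily_gb * 365 * 5 / 1e3                    # ~137 TB (5 years)

peak_mb_s = 30_000 * 15_000 / 1e6                     # 450 MB/s at peak
peak_gbps = peak_mb_s * 8 / 1e3                       # 3.6 Gbps

print(f"{daily_gb:.0f} GB/day | hot {hot_tb:.2f} TB | warm {warm_tb:.1f} TB | "
      f"cold {cold_tb:.0f} TB | peak {peak_gbps:.1f} Gbps")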

Core Entities

Product

  • product_id (UUID)
  • asin (VARCHAR, unique)
  • marketplace (us, uk, ca, de, fr, etc.)
  • product_url (VARCHAR)
  • product_name (VARCHAR)
  • brand (VARCHAR)
  • category (VARCHAR)
  • current_price_new (DECIMAL)
  • current_price_used (DECIMAL)
  • current_price_refurbished (DECIMAL)
  • currency (VARCHAR)
  • availability (in_stock, out_of_stock, unknown)
  • last_checked_at (TIMESTAMP)
  • check_frequency_hours (INT)
  • tracker_count (INT)
  • first_tracked_at (TIMESTAMP)
  • created_at (TIMESTAMP)
  • updated_at (TIMESTAMP)

Price Point

  • price_id (UUID)
  • product_id (UUID)
  • asin (VARCHAR)
  • marketplace (VARCHAR)
  • condition (new, used, refurbished)
  • price (DECIMAL)
  • currency (VARCHAR)
  • availability (VARCHAR)
  • seller_type (amazon, third_party)
  • checked_at (TIMESTAMP)
  • source (scraper, api)

User Tracking

  • tracking_id (UUID)
  • user_id (UUID)
  • product_id (UUID)
  • asin (VARCHAR)
  • marketplace (VARCHAR)
  • condition (new, used, refurbished)
  • target_price (DECIMAL, optional)
  • percentage_drop (DECIMAL, optional)
  • alert_enabled (BOOLEAN)
  • last_notified_at (TIMESTAMP)
  • last_notified_price (DECIMAL)
  • created_at (TIMESTAMP)
  • updated_at (TIMESTAMP)

Price Alert

  • alert_id (UUID)
  • user_id (UUID)
  • product_id (UUID)
  • tracking_id (UUID)
  • old_price (DECIMAL)
  • new_price (DECIMAL)
  • price_drop (DECIMAL)
  • percentage_drop (DECIMAL)
  • condition (VARCHAR)
  • status (pending, sent, failed)
  • sent_at (TIMESTAMP)
  • created_at (TIMESTAMP)

User

  • user_id (UUID)
  • email (VARCHAR)
  • email_verified (BOOLEAN)
  • notification_preferences (JSON)
  • browser_extension_installed (BOOLEAN)
  • created_at (TIMESTAMP)
  • updated_at (TIMESTAMP)

API

1. Add Product to Track

POST /api/v1/products/track
Request:
{
  "asin": "B08XYZ123",
  "marketplace": "us",
  "condition": "new",
  "target_price": 99.99,
  "alert_enabled": true
}

Response:
{
  "tracking_id": "uuid",
  "product_id": "uuid",
  "asin": "B08XYZ123",
  "product_name": "Example Product",
  "current_price": 149.99,
  "target_price": 99.99,
  "price_chart_url": "/charts/B08XYZ123",
  "status": "tracking"
}

2. Get Product Price History

GET /api/v1/products/{asin}/history?marketplace=us&condition=new&range=1year
Response:
{
  "asin": "B08XYZ123",
  "product_name": "Example Product",
  "marketplace": "us",
  "condition": "new",
  "price_history": [
    {
      "price": 149.99,
      "checked_at": "2025-11-13T10:00:00Z",
      "availability": "in_stock"
    },
    {
      "price": 159.99,
      "checked_at": "2025-11-12T10:00:00Z",
      "availability": "in_stock"
    }
  ],
  "statistics": {
    "current_price": 149.99,
    "lowest_price": 129.99,
    "highest_price": 179.99,
    "average_price": 149.50,
    "price_drops": 5,
    "price_increases": 3
  },
  "chart_data_url": "/api/v1/products/B08XYZ123/chart?range=1year"
}

3. Get Price Chart Data

GET /api/v1/products/{asin}/chart?marketplace=us&condition=new&range=1year
Response:
{
  "asin": "B08XYZ123",
  "range": "1year",
  "data_points": [
    {"date": "2025-11-13", "price": 149.99},
    {"date": "2025-11-12", "price": 159.99}
  ],
  "statistics": {
    "min": 129.99,
    "max": 179.99,
    "avg": 149.50
  }
}

4. Browser Extension API

GET /api/v1/browser/product?url=https://amazon.com/dp/B08XYZ123
Response:
{
  "asin": "B08XYZ123",
  "product_name": "Example Product",
  "current_price": 149.99,
  "lowest_price": 129.99,
  "highest_price": 179.99,
  "price_trend": "decreasing",
  "is_tracked": true,
  "tracking_id": "uuid",
  "chart_url": "/charts/B08XYZ123"
}

5. Search Products

GET /api/v1/products/search?q=laptop&marketplace=us&limit=20
Response:
{
  "products": [
    {
      "asin": "B08XYZ123",
      "product_name": "Laptop Example",
      "current_price": 999.99,
      "lowest_price": 899.99,
      "price_trend": "stable",
      "product_url": "https://amazon.com/dp/B08XYZ123"
    }
  ],
  "total": 100,
  "limit": 20,
  "offset": 0
}

6. Get User Trackings

GET /api/v1/users/{user_id}/trackings?limit=20&offset=0
Response:
{
  "trackings": [
    {
      "tracking_id": "uuid",
      "product": {
        "asin": "B08XYZ123",
        "product_name": "Example Product",
        "current_price": 149.99,
        "price_change": -10.00,
        "percentage_change": -6.25,
        "lowest_price": 129.99
      },
      "target_price": 99.99,
      "alert_enabled": true,
      "is_active": true
    }
  ],
  "total": 50,
  "limit": 20,
  "offset": 0
}

Data Flow

Product Addition Flow (Browser Extension)

  1. User Clicks Extension:
    • User visits Amazon product page
    • Clicks browser extension icon
    • Extension extracts ASIN from URL
  2. Product Lookup:
    • Extension sends ASIN to API Gateway
    • API Gateway routes to Product Service
    • Product Service:
      • Checks if product exists
      • If new, creates product record
      • Fetches initial price from Amazon Scraper
      • Returns product info to extension
  3. Display Product Info:
    • Extension displays:
      • Current price
      • Price history chart
      • Lowest/highest prices
      • Track button
  4. User Tracks Product:
    • User clicks “Track” button
    • Extension sends tracking request
    • Product Service creates user tracking record
    • Returns confirmation

Price Monitoring Flow

  1. Scheduled Price Check:
    • Price Monitor Scheduler identifies products to check
    • Determines check frequency (see the scheduler sketch after this flow):
      • Popular products (many trackers): Every 1 hour
      • Medium popularity: Every 3 hours
      • Low popularity: Every 6 hours
  2. Price Check Execution:
    • Price Monitor Service:
      • Gets product details (ASIN, marketplace)
      • Queues price check job to Message Queue
      • Amazon Scraper Worker picks up job
  3. Amazon Scraping:
    • Scraper Worker:
      • Constructs Amazon URL from ASIN and marketplace
      • Uses proxy rotation and user agent rotation
      • Respects rate limits (tracks requests per IP)
      • Fetches product page
      • Extracts prices for all conditions (new, used, refurbished)
      • Handles errors (retry, fallback, CAPTCHA)
  4. Price Storage:
    • Scraper Worker:
      • Stores price points in Time-Series Database
      • Updates product current prices
      • Publishes price update event
  5. Price Drop Detection:
    • Price Drop Detector:
      • Compares new price with previous price
      • Checks against user target prices
      • Identifies price drops
      • Creates alert records
  6. Alert Delivery:
    • Alert Service:
      • Gets pending alerts
      • Sends email alerts
      • Updates alert status
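
A minimal sketch of the dynamic-frequency scheduling in step 1; the tier thresholds and the `db`/`queue` interfaces are illustrative assumptions, not part of the design above.

from datetime import datetime, timedelta

def check_frequency_hours(tracker_count):
    # Popularity tiers from step 1; exact thresholds are assumptions
    if tracker_count >= 10_000:
        return 1      # popular: every hour
    if tracker_count >= 100:
        return 3      # medium: every 3 hours
    return 6          # low: every 6 hours

def schedule_price_checks(db, queue):
    # Enqueue a check for each product whose last check is older than
    # its popularity-based interval (db/queue interfaces are assumed)
    for product in db.get_products_due_for_check():
        interval = timedelta(hours=check_frequency_hours(product.tracker_count))
        if product.last_checked_at is None or \
                datetime.now() - product.last_checked_at >= interval:
            queue.publish('price_check_jobs', {
                'asin': product.asin,
                'marketplace': product.marketplace,
            })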

Price Chart Generation Flow

  1. User Requests Chart:
    • User visits product page
    • Selects time range (1 day, 1 week, 1 month, etc.)
    • Client requests chart data
  2. Chart Data Retrieval:
    • Chart Service:
      • Queries Time-Series Database for price history
      • Filters by time range and condition
      • Aggregates data points (if needed for long ranges)
      • Calculates statistics (min, max, avg)
  3. Chart Rendering:
    • Chart Service returns data points
    • Client renders interactive chart
    • Displays price trends and statistics

Database Design

Schema Design

Products Table:

CREATE TABLE products (
    product_id UUID PRIMARY KEY,
    asin VARCHAR(20) NOT NULL,
    marketplace VARCHAR(10) NOT NULL,
    product_url VARCHAR(1000) NOT NULL,
    product_name VARCHAR(500) NOT NULL,
    brand VARCHAR(200),
    category VARCHAR(200),
    current_price_new DECIMAL(10, 2) NULL,
    current_price_used DECIMAL(10, 2) NULL,
    current_price_refurbished DECIMAL(10, 2) NULL,
    currency VARCHAR(10) DEFAULT 'USD',
    availability VARCHAR(50) DEFAULT 'unknown',
    last_checked_at TIMESTAMP NULL,
    check_frequency_hours INT DEFAULT 6,
    tracker_count INT DEFAULT 0,
    first_tracked_at TIMESTAMP DEFAULT NOW(),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    INDEX idx_marketplace (marketplace),
    INDEX idx_last_checked (last_checked_at),
    INDEX idx_tracker_count (tracker_count),
    UNIQUE KEY uk_asin_marketplace (asin, marketplace)  -- also serves ASIN-only lookups
);

Price Points Table (Time-Series, Sharded by ASIN):

CREATE TABLE price_points_0 (
    price_id UUID PRIMARY KEY,
    product_id UUID NOT NULL,
    asin VARCHAR(20) NOT NULL,
    marketplace VARCHAR(10) NOT NULL,
    `condition` VARCHAR(20) NOT NULL,  -- backticked: CONDITION is reserved in MySQL
    price DECIMAL(10, 2) NOT NULL,
    currency VARCHAR(10) DEFAULT 'USD',
    availability VARCHAR(50),
    seller_type VARCHAR(50),
    checked_at TIMESTAMP NOT NULL,
    source VARCHAR(50) DEFAULT 'scraper',
    INDEX idx_checked_at (checked_at DESC),
    INDEX idx_product_condition (product_id, `condition`),
    INDEX idx_asin_condition_checked (asin, marketplace, `condition`, checked_at DESC)
);
-- Similar tables: price_points_1, price_points_2, ..., price_points_N

User Trackings Table:

CREATE TABLE user_trackings (
    tracking_id UUID PRIMARY KEY,
    user_id UUID NOT NULL,
    product_id UUID NOT NULL,
    asin VARCHAR(20) NOT NULL,
    marketplace VARCHAR(10) NOT NULL,
    `condition` VARCHAR(20) NOT NULL,
    target_price DECIMAL(10, 2) NULL,
    percentage_drop DECIMAL(5, 2) NULL,
    alert_enabled BOOLEAN DEFAULT TRUE,
    last_notified_at TIMESTAMP NULL,
    last_notified_price DECIMAL(10, 2) NULL,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    INDEX idx_user_id (user_id),
    INDEX idx_product_id (product_id),
    INDEX idx_asin (asin),
    INDEX idx_alert_enabled (alert_enabled),
    INDEX idx_user_alert (user_id, alert_enabled),
    UNIQUE KEY uk_user_product_condition (user_id, product_id, `condition`)
);

Price Alerts Table:

CREATE TABLE price_alerts (
    alert_id UUID PRIMARY KEY,
    user_id UUID NOT NULL,
    product_id UUID NOT NULL,
    tracking_id UUID NOT NULL,
    asin VARCHAR(20) NOT NULL,
    old_price DECIMAL(10, 2) NOT NULL,
    new_price DECIMAL(10, 2) NOT NULL,
    price_drop DECIMAL(10, 2) NOT NULL,
    percentage_drop DECIMAL(5, 2) NOT NULL,
    `condition` VARCHAR(20) NOT NULL,
    status VARCHAR(50) DEFAULT 'pending',
    sent_at TIMESTAMP NULL,
    created_at TIMESTAMP DEFAULT NOW(),
    INDEX idx_user_id (user_id),
    INDEX idx_status (status),
    INDEX idx_created_at (created_at DESC),
    INDEX idx_user_status (user_id, status)
);

Users Table:

CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    email VARCHAR(100) UNIQUE NOT NULL,
    email_verified BOOLEAN DEFAULT FALSE,
    notification_preferences JSON,
    browser_extension_installed BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

Database Sharding Strategy

Price Points Table Sharding:

  • Shard by a stable hash of ASIN + marketplace (modulo sharding)
  • 1000 shards: shard_id = hash(asin + marketplace) % 1000
  • All price points for a product in same shard
  • Enables efficient price history queries
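
As a sketch, this shard routing is a pure function; Python's built-in hash() is randomized per process, so a stable digest is used instead.

import hashlib

NUM_SHARDS = 1000

def price_points_shard(asin, marketplace):
    # Stable hash of asin + marketplace -> shard id in [0, NUM_SHARDS)
    key = f"{asin}:{marketplace}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_SHARDS

# e.g. price_points_shard("B08XYZ123", "us") -> routes to table price_points_<id>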

Time-Series Optimization:

  • Use time-series database (TimescaleDB, InfluxDB) for price points
  • Partition by time (monthly partitions)
  • Efficient range queries for charts
  • Automatic data retention policies
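
If TimescaleDB is the chosen store, the monthly partitioning, compression, and retention above can be declared as follows; a sketch through a generic DB-API cursor (the DDL earlier in this section is MySQL-flavored, so a Postgres variant of the table is assumed).

def setup_price_points_hypertable(cursor):
    # Monthly chunks on checked_at; note TimescaleDB requires unique
    # indexes (including the PK) to contain the partitioning column,
    # so the PK would become (price_id, checked_at)
    cursor.execute("""
        SELECT create_hypertable('price_points', 'checked_at',
                                 chunk_time_interval => INTERVAL '1 month');
    """)
    # Compress chunks older than 90 days, segmented per product
    cursor.execute("""
        ALTER TABLE price_points SET (
            timescaledb.compress,
            timescaledb.compress_segmentby = 'asin, marketplace'
        );
    """)
    cursor.execute(
        "SELECT add_compression_policy('price_points', INTERVAL '90 days');")
    # Drop raw chunks beyond the 5-year retention window
    cursor.execute(
        "SELECT add_retention_policy('price_points', INTERVAL '5 years');")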

Shard Key Selection:

  • asin + marketplace ensures all prices for a product are in same shard
  • Enables efficient queries for product price history
  • Prevents cross-shard queries for single product

Replication:

  • Each shard replicated 3x for high availability
  • Master-replica setup for read scaling
  • Writes go to master, reads can go to replicas

High-Level Design

┌─────────────┐
│   Client    │
│  (Web App)  │
└──────┬──────┘
       │
       │ HTTP
       │
┌──────▼──────────────────────────────────────────────┐
│        API Gateway / Load Balancer                  │
│        - Rate Limiting                              │
│        - Request Routing                            │
└──────┬──────────────────────────────────────────────┘
       │
       ├──────────────┬──────────────┬──────────────┬──────────────┐
       │              │              │              │              │
┌──────▼──────┐ ┌─────▼──────┐ ┌─────▼──────┐ ┌─────▼──────┐ ┌─────▼──────┐
│  Product    │ │  Price     │ │  Chart     │ │  Alert     │ │  Browser   │
│  Service    │ │  Monitor   │ │  Service   │ │  Service   │ │  Extension │
│             │ │  Service   │ │            │ │            │ │  API       │
└──────┬──────┘ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘
       │              │              │              │              │
┌──────▼──────────────▼──────────────▼──────────────▼──────────────▼──────┐
│              Message Queue (Kafka)                                      │
│              - Price check jobs                                         │
│              - Price update events                                      │
│              - Alert jobs                                               │
└──────┬──────────────────────────────────────────────────────────────────┘
       │
┌──────▼──────────────────────────────────────────────────────────────────┐
│         Amazon Scraper Service                                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                               │
│  │ US       │  │ UK       │  │ CA       │                               │
│  │ Scraper  │  │ Scraper  │  │ Scraper  │                               │
│  └──────────┘  └──────────┘  └──────────┘                               │
│                                                                         │
│  ┌──────────────────────────────────────────────────────┐               │
│  │ Proxy Pool Manager                                   │               │
│  │ - IP Rotation                                        │               │
│  │ - Rate Limit Management                              │               │
│  │ - CAPTCHA Handling                                   │               │
│  └──────────────────────────────────────────────────────┘               │
└──────┬──────────────────────────────────────────────────────────────────┘
       │
┌──────▼──────────────────────────────────────────────────────────────────┐
│         Price Drop Detector                                             │
│         - Compare prices                                                │
│         - Check thresholds                                              │
│         - Create alerts                                                 │
└──────┬──────────────────────────────────────────────────────────────────┘
       │
┌──────▼──────────────────────────────────────────────────────────────────┐
│         Database Cluster                                                │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                               │
│  │ Products │  │ User     │  │ Price    │                               │
│  │ DB       │  │ Trackings│  │ Alerts   │                               │
│  │          │  │ DB       │  │ DB       │                               │
│  └──────────┘  └──────────┘  └──────────┘                               │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│         Time-Series Database (TimescaleDB/InfluxDB)                     │
│         - Price points (sharded by ASIN)                                │
│         - Time-based partitioning                                       │
│         - Efficient range queries                                       │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│              Cache Layer (Redis)                                        │
│  - Product metadata                                                     │
│  - Current prices                                                       │
│  - Rate limit counters                                                  │
│  - Popular products                                                     │
│  - Chart data (cached)                                                  │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│         Email Service                                                   │
│         - Send price drop alerts                                        │
│         - Email templates                                               │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│         Browser Extension Backend                                       │
│         - Product lookup                                                │
│         - Quick price display                                           │
│         - Tracking management                                           │
└─────────────────────────────────────────────────────────────────────────┘

Deep Dive

Component Design

1. Amazon Scraper Service

Responsibilities:

  • Scrape Amazon product pages by ASIN
  • Extract prices for all conditions (new, used, refurbished)
  • Handle multiple marketplaces
  • Respect rate limits and avoid blocking
  • Rotate proxies and user agents

Key Design Decisions:

  • ASIN-Based: Use ASIN for product identification
  • Multi-Condition: Track new, used, refurbished separately
  • Multi-Marketplace: Support US, UK, CA, DE, FR, etc.
  • Proxy Rotation: Rotate IPs to avoid blocking
  • Rate Limiting: Strict rate limit enforcement

Implementation:

import requests
from bs4 import BeautifulSoup
from datetime import datetime

class AmazonScraperService:
    def __init__(self):
        # ProxyPool, RateLimiter, and UserAgentRotator are internal helpers;
        # a ProxyPool sketch follows this listing
        self.proxy_pool = ProxyPool()
        self.rate_limiter = RateLimiter()
        self.user_agent_rotator = UserAgentRotator()
    
    def scrape_product(self, asin, marketplace='us'):
        # Get proxy for marketplace
        proxy = self.proxy_pool.get_proxy(marketplace)
        
        # Check rate limit
        if not self.rate_limiter.can_scrape(marketplace, proxy):
            raise RateLimitError("Rate limit exceeded")
        
        try:
            # Construct Amazon URL
            url = self.build_amazon_url(asin, marketplace)
            
            # Make request
            response = requests.get(
                url,
                proxies={'http': proxy, 'https': proxy},
                headers={
                    'User-Agent': self.user_agent_rotator.get_random(),
                    'Accept': 'text/html,application/xhtml+xml',
                    'Accept-Language': 'en-US,en;q=0.9'
                },
                timeout=10,
                cookies=self.get_amazon_cookies(marketplace)
            )
            
            if response.status_code == 503:
                # CAPTCHA or blocking
                raise CAPTCHAError("CAPTCHA detected")
            
            if response.status_code != 200:
                raise ScrapingError(f"HTTP {response.status_code}")
            
            # Parse HTML
            soup = BeautifulSoup(response.content, 'html.parser')
            
            # Extract product name
            product_name = self.extract_product_name(soup)
            
            # Extract prices for all conditions
            prices = {
                'new': self.extract_price(soup, 'new'),
                'used': self.extract_price(soup, 'used'),
                'refurbished': self.extract_price(soup, 'refurbished')
            }
            
            # Extract availability
            availability = self.extract_availability(soup)
            
            # Update rate limit
            self.rate_limiter.record_scrape(marketplace, proxy)
            
            return {
                'asin': asin,
                'marketplace': marketplace,
                'product_name': product_name,
                'prices': prices,
                'availability': availability,
                'checked_at': datetime.now()
            }
        except CAPTCHAError:
            # Mark proxy as bad
            self.proxy_pool.mark_bad(proxy)
            raise
        except ScrapingError:
            # Mark the proxy bad; the job is retried upstream with a new proxy
            self.proxy_pool.mark_bad(proxy)
            raise
        finally:
            self.proxy_pool.return_proxy(proxy)
    
    def extract_price(self, soup, condition):
        # Different selectors for different conditions
        selectors = {
            'new': ['#priceblock_ourprice', '#priceblock_dealprice', '.a-price-whole'],
            'used': ['#usedBuySection .a-price-whole'],
            'refurbished': ['#renewedBuySection .a-price-whole']
        }
        
        for selector in selectors.get(condition, []):
            element = soup.select_one(selector)
            if element:
                price_text = element.get_text()
                price = self.parse_price(price_text)
                if price:
                    return price
        
        return None
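
The `ProxyPool`, `RateLimiter`, and `UserAgentRotator` used above are internal helpers, not library APIs. A minimal in-memory sketch of the `ProxyPool` interface the scraper relies on (a production pool would be Redis-backed with per-marketplace lease management):

import random

class ProxyPool:
    def __init__(self, proxies_by_marketplace):
        # e.g. {'us': ['http://1.2.3.4:8080', ...], 'uk': [...]} (assumed shape)
        self.proxies = proxies_by_marketplace
        self.bad = set()

    def get_proxy(self, marketplace):
        candidates = [p for p in self.proxies.get(marketplace, [])
                      if p not in self.bad]
        if not candidates:
            raise RuntimeError("Proxy pool exhausted")
        return random.choice(candidates)

    def mark_bad(self, proxy):
        # Quarantine a blocked or CAPTCHA'd proxy
        self.bad.add(proxy)

    def return_proxy(self, proxy):
        # No-op here; a real pool would release a lease/connection slot
        pass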

2. Price Chart Service

Responsibilities:

  • Generate price chart data for different time ranges
  • Aggregate data points for long ranges
  • Calculate statistics (min, max, avg)
  • Cache chart data

Key Design Decisions:

  • Time-Series Database: Use TimescaleDB for efficient queries
  • Data Aggregation: Aggregate data for long ranges
  • Caching: Cache chart data for popular products
  • Multiple Ranges: Support 1 day, 1 week, 1 month, 3 months, 1 year, all time

Implementation:

import json
from datetime import datetime, timedelta

class PriceChartService:
    def __init__(self):
        self.timeseries_db = TimescaleDB()  # assumed thin client wrapper
        self.redis = Redis()
    
    def get_chart_data(self, asin, marketplace, condition, time_range):
        # Check cache
        cache_key = f"chart:{asin}:{marketplace}:{condition}:{time_range}"
        cached = self.redis.get(cache_key)
        if cached:
            return json.loads(cached)
        
        # Calculate time window
        end_time = datetime.now()
        start_time = self.calculate_start_time(end_time, time_range)
        
        # Choose a bucket size so long ranges are aggregated down to a
        # bounded number of points (per-range aggregation)
        bucket = {
            '1day': '5 minutes', '1week': '1 hour', '1month': '6 hours',
            '3months': '1 day', '1year': '1 day', 'all': '1 week'
        }.get(time_range, '1 day')
        
        # Query time-series database
        query = """
            SELECT 
                time_bucket(%s::interval, checked_at) AS bucket,
                AVG(price) AS avg_price,
                MIN(price) AS min_price,
                MAX(price) AS max_price
            FROM price_points
            WHERE asin = %s 
                AND marketplace = %s 
                AND condition = %s
                AND checked_at >= %s
                AND checked_at <= %s
            GROUP BY bucket
            ORDER BY bucket ASC
        """
        
        data_points = self.timeseries_db.query(
            query,
            (bucket, asin, marketplace, condition, start_time, end_time)
        )
        
        # Calculate statistics
        all_prices = [dp['avg_price'] for dp in data_points]
        statistics = {
            'min': min(all_prices) if all_prices else None,
            'max': max(all_prices) if all_prices else None,
            'avg': sum(all_prices) / len(all_prices) if all_prices else None
        }
        
        result = {
            'asin': asin,
            'marketplace': marketplace,
            'condition': condition,
            'time_range': time_range,
            'data_points': data_points,
            'statistics': statistics
        }
        
        # Cache for 1 hour; default=str serializes datetime/Decimal values
        self.redis.setex(cache_key, 3600, json.dumps(result, default=str))
        
        return result
    
    def calculate_start_time(self, end_time, time_range):
        ranges = {
            '1day': timedelta(days=1),
            '1week': timedelta(weeks=1),
            '1month': timedelta(days=30),
            '3months': timedelta(days=90),
            '1year': timedelta(days=365),
            'all': timedelta(days=3650)  # 10 years
        }
        
        delta = ranges.get(time_range, timedelta(days=30))
        return end_time - delta

3. Browser Extension Backend

Responsibilities:

  • Handle extension requests
  • Extract ASIN from Amazon URLs
  • Return product info quickly
  • Support quick tracking

Key Design Decisions:

  • Fast Response: Sub-200ms response time
  • ASIN Extraction: Parse Amazon URLs
  • Cached Data: Return cached product data
  • Quick Tracking: One-click tracking

Implementation:

import re

class BrowserExtensionAPI:
    def get_product_info(self, amazon_url):
        # Extract ASIN from URL
        asin = self.extract_asin(amazon_url)
        marketplace = self.extract_marketplace(amazon_url)
        
        if not asin:
            raise ValueError("Invalid Amazon URL")
        
        # Get product from cache or database
        product = self.get_product(asin, marketplace)
        
        if not product:
            # Product not tracked yet
            return {
                'asin': asin,
                'marketplace': marketplace,
                'is_tracked': False,
                'message': 'Product not tracked yet'
            }
        
        # Get price history statistics
        stats = self.get_price_statistics(asin, marketplace)
        
        # Determine price trend
        trend = self.calculate_trend(stats)
        
        return {
            'asin': asin,
            'product_name': product.product_name,
            'current_price': product.current_price_new,
            'lowest_price': stats['lowest'],
            'highest_price': stats['highest'],
            'average_price': stats['average'],
            'price_trend': trend,
            'is_tracked': True,
            'chart_url': f"/charts/{asin}?marketplace={marketplace}"
        }
    
    def extract_asin(self, url):
        # Extract ASIN from various Amazon URL formats
        patterns = [
            r'/dp/([A-Z0-9]{10})',
            r'/gp/product/([A-Z0-9]{10})',
            r'/product/([A-Z0-9]{10})',
            r'asin=([A-Z0-9]{10})'
        ]
        
        for pattern in patterns:
            match = re.search(pattern, url)
            if match:
                return match.group(1)
        
        return None
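
Both `get_price_statistics` and `calculate_trend` are assumed helpers. A plausible sketch of `calculate_trend`, assuming the stats dict also carries the current price under a `current` key:

    def calculate_trend(self, stats, tolerance=0.02):
        # Method of BrowserExtensionAPI: compare the current price to the
        # recent average; moves within +/-2% count as 'stable' (assumed heuristic)
        current = stats.get('current')
        average = stats.get('average')
        if not current or not average:
            return 'unknown'
        change = (current - average) / average
        if change <= -tolerance:
            return 'decreasing'
        if change >= tolerance:
            return 'increasing'
        return 'stable'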

Detailed Design

Handling Amazon Rate Limits

Challenge: Amazon has strict rate limits and anti-scraping measures

Solution:

  • Proxy Rotation: Large proxy pool with rotation
  • Rate Limiting: Track requests per IP per hour
  • Request Spacing: Add delays between requests
  • User Agent Rotation: Rotate user agents
  • Cookie Management: Use session cookies
  • CAPTCHA Handling: Detect and handle CAPTCHAs

Implementation:

class AmazonRateLimiter:
    def __init__(self):
        self.redis = Redis()
        self.rate_limits = {
            'us': {'requests': 50, 'window': 3600},  # 50 per hour per IP
            'uk': {'requests': 50, 'window': 3600},
            'ca': {'requests': 50, 'window': 3600}
        }
    
    def can_scrape(self, marketplace, proxy_ip):
        # Advisory read; record_scrape's atomic INCR is the authoritative
        # counter, so brief over-admission under concurrency is possible
        key = f"amazon_rate:{marketplace}:{proxy_ip}"
        limit = self.rate_limits[marketplace]
        
        count = self.redis.get(key)
        if count and int(count) >= limit['requests']:
            return False
        
        return True
    
    def record_scrape(self, marketplace, proxy_ip):
        key = f"amazon_rate:{marketplace}:{proxy_ip}"
        limit = self.rate_limits[marketplace]
        
        count = self.redis.incr(key)
        if count == 1:
            self.redis.expire(key, limit['window'])

Price Drop Detection with Conditions

Challenge: Detect price drops for different product conditions

Solution:

  • Condition-Specific Tracking: Track each condition separately
  • Condition-Specific Alerts: Alert based on tracked condition
  • Price Comparison: Compare with condition-specific previous price

Implementation:

from datetime import datetime, timedelta

def detect_price_drops(product_id, asin, marketplace, new_prices):
    # Get all active trackings for this product
    trackings = get_active_trackings(product_id)
    
    alerts = []
    
    for tracking in trackings:
        condition = tracking.condition
        new_price = new_prices.get(condition)
        
        if not new_price:
            continue  # Price not available for this condition
        
        # Get previous price for this condition
        previous_price = get_previous_price(asin, marketplace, condition)
        
        if not previous_price:
            continue  # No previous price
        
        # Calculate price change
        price_drop = previous_price - new_price
        percentage_drop = (price_drop / previous_price) * 100
        
        # Check if should alert
        should_alert = False
        
        # Check target price
        if tracking.target_price and new_price <= tracking.target_price:
            should_alert = True
        
        # Check percentage drop
        if tracking.percentage_drop and percentage_drop >= tracking.percentage_drop:
            should_alert = True
        
        # Check if already notified recently
        if tracking.last_notified_at:
            time_since = datetime.now() - tracking.last_notified_at
            if time_since < timedelta(hours=6):
                should_alert = False  # Don't notify too frequently
        
        if should_alert and tracking.alert_enabled:
            # Create alert
            alert = PriceAlert(
                user_id=tracking.user_id,
                product_id=product_id,
                tracking_id=tracking.tracking_id,
                asin=asin,
                old_price=previous_price,
                new_price=new_price,
                price_drop=price_drop,
                percentage_drop=percentage_drop,
                condition=condition,
                status='pending'
            )
            alert.save()
            alerts.append(alert)
            
            # Update tracking
            tracking.last_notified_at = datetime.now()
            tracking.last_notified_price = new_price
            tracking.save()
    
    return alerts
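
The Alert Service described in the Price Monitoring Flow drains these pending alerts. A minimal worker sketch; the `db`/`email_client` interfaces and the `render_alert_email` template helper are assumptions:

from datetime import datetime

def deliver_pending_alerts(db, email_client, batch_size=500):
    # Fetch pending alerts, send emails, and record the outcome
    for alert in db.get_alerts(status='pending', limit=batch_size):
        try:
            user = db.get_user(alert.user_id)
            email_client.send(
                to=user.email,
                subject=f"Price drop: {alert.asin} now {alert.new_price:.2f}",
                body=render_alert_email(alert),  # assumed template helper
            )
            alert.status = 'sent'
            alert.sent_at = datetime.now()
        except Exception:
            alert.status = 'failed'  # picked up later by a retry sweep
        alert.save()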

Historical Data Retention

Challenge: Store 5+ years of price data efficiently

Solution:

  • Time-Series Database: Use TimescaleDB for efficient storage
  • Time Partitioning: Partition by month
  • Data Archival: Archive old data to cold storage
  • Data Compression: Compress old partitions

Implementation:

import json
from datetime import datetime, timedelta
from uuid import uuid4

def chunks(items, size):
    # Yield successive fixed-size batches from a list
    for i in range(0, len(items), size):
        yield items[i:i + size]

class PriceHistoryManager:
    def __init__(self):
        self.timeseries_db = TimescaleDB()
        self.s3 = S3Client()
    
    def store_price_point(self, price_point):
        # Insert into time-series database
        self.timeseries_db.insert('price_points', price_point)
    
    def archive_old_data(self):
        # Archive data older than 1 year to S3
        cutoff_date = datetime.now() - timedelta(days=365)
        
        # Query old data
        old_data = self.timeseries_db.query(
            "SELECT * FROM price_points WHERE checked_at < %s",
            (cutoff_date,)
        )
        
        # Upload to S3
        for batch in chunks(old_data, 10000):
            key = f"price_history/{cutoff_date.year}/{cutoff_date.month}/batch_{uuid4()}.json"
            self.s3.upload(key, json.dumps(batch))
        
        # Delete from database
        self.timeseries_db.delete(
            "DELETE FROM price_points WHERE checked_at < %s",
            (cutoff_date,)
        )

Scalability Considerations

Horizontal Scaling

Scraper Service:

  • Stateless workers, horizontally scalable
  • Distribute scraping across multiple servers
  • Use message queue for job distribution
  • Shared proxy pool across workers

Chart Service:

  • Stateless service, horizontally scalable
  • Cache chart data in Redis
  • Use read replicas for time-series database

Caching Strategy

Redis Cache:

  • Product Metadata: TTL 1 hour
  • Current Prices: TTL 10 minutes
  • Chart Data: TTL 1 hour
  • Rate Limit Counters: TTL 1 hour

Cache Invalidation:

  • Invalidate on price update
  • Invalidate on product update
  • Use cache-aside pattern
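
A sketch of the cache-aside read path with invalidation on price update, using the TTLs listed above; the `db` interface is an assumption and `redis` is a standard redis-py client:

import json

PRODUCT_TTL = 3600      # product metadata: 1 hour
PRICE_TTL = 600         # current prices: 10 minutes

def get_product_cached(redis, db, asin, marketplace):
    key = f"product:{asin}:{marketplace}"
    cached = redis.get(key)
    if cached:
        return json.loads(cached)                    # cache hit
    product = db.get_product(asin, marketplace)      # cache miss: read DB
    if product:
        redis.setex(key, PRODUCT_TTL, json.dumps(product, default=str))
    return product

def on_price_update(redis, asin, marketplace):
    # Invalidate product metadata and any cached charts for this product
    redis.delete(f"product:{asin}:{marketplace}")
    for key in redis.scan_iter(f"chart:{asin}:{marketplace}:*"):
        redis.delete(key)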

Security Considerations

Anti-Scraping Measures

  • Proxy Rotation: Rotate IPs frequently
  • User Agent Rotation: Rotate user agents
  • Request Spacing: Add delays between requests
  • CAPTCHA Handling: Detect and handle CAPTCHAs
  • Cookie Management: Use session cookies

Data Privacy

  • User Data: Encrypt sensitive user data
  • Email Addresses: Secure email storage
  • Tracking Data: Anonymize tracking data for analytics

Monitoring & Observability

Key Metrics

System Metrics:

  • Price check rate (checks/second)
  • Scraping success rate
  • Scraping latency (p50, p95, p99)
  • Proxy failure rate
  • Rate limit hits
  • CAPTCHA rate

Business Metrics:

  • Total tracked products
  • Total price checks
  • Price drops detected
  • Email alerts sent
  • Browser extension usage
  • User engagement

Logging

  • Structured Logging: JSON logs for parsing
  • Price Events: Log all price changes
  • Scraping Events: Log scraping attempts and results
  • Alert Events: Log alert creation and delivery
  • Error Logging: Log errors with context

Alerting

  • High Failure Rate: Alert if scraping failure rate > 5%
  • Rate Limit Hits: Alert on frequent rate limit hits
  • Proxy Exhaustion: Alert if proxy pool depleted
  • CAPTCHA Rate: Alert if CAPTCHA rate > 10%
  • Alert Delivery Failures: Alert on high alert failure rate

Trade-offs and Optimizations

Trade-offs

1. Check Frequency: Frequent vs Infrequent

  • Frequent: More accurate, higher cost
  • Infrequent: Lower cost, less accurate
  • Decision: Dynamic frequency based on popularity

2. Scraping: Direct vs API

  • Direct: More control, risk of blocking
  • API: More reliable, may have costs
  • Decision: Direct scraping with extensive proxy rotation

3. Historical Data: Full vs Sampled

  • Full: More accurate, higher storage
  • Sampled: Lower storage, less accurate
  • Decision: Full data with archival to S3

4. Chart Data: Real-Time vs Cached

  • Real-Time: Always accurate, higher load
  • Cached: Lower load, may be stale
  • Decision: Cached with 1-hour TTL

Optimizations

1. Intelligent Scheduling

  • Prioritize popular products
  • Adjust frequency based on price volatility
  • Reduce unnecessary checks

2. Batch Price Checks

  • Batch multiple products per proxy session
  • Reduce overhead
  • Improve throughput

3. Data Compression

  • Compress old price data
  • Use columnar storage for time-series
  • Reduce storage costs

4. Chart Data Aggregation

  • Pre-aggregate chart data for common ranges
  • Store aggregated data
  • Reduce query time
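
With TimescaleDB, this pre-aggregation maps naturally to a continuous aggregate. A sketch of a daily rollup and its refresh policy, executed through a generic DB-API cursor:

def create_daily_rollup(cursor):
    # Daily min/max/avg per product, maintained incrementally by TimescaleDB
    cursor.execute("""
        CREATE MATERIALIZED VIEW price_points_daily
        WITH (timescaledb.continuous) AS
        SELECT asin, marketplace, condition,
               time_bucket('1 day', checked_at) AS day,
               MIN(price) AS min_price,
               MAX(price) AS max_price,
               AVG(price) AS avg_price
        FROM price_points
        GROUP BY asin, marketplace, condition, day;
    """)
    # Refresh the rollup hourly for the trailing window
    cursor.execute("""
        SELECT add_continuous_aggregate_policy('price_points_daily',
            start_offset => INTERVAL '3 days',
            end_offset   => INTERVAL '1 hour',
            schedule_interval => INTERVAL '1 hour');
    """)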

What Interviewers Look For

Web Scraping Skills

  1. Amazon-Specific Scraping
    • ASIN-based tracking
    • Multi-condition tracking (new/used/refurbished)
    • Rate limit respect
    • Red Flags: No ASIN, single condition, rate violations
  2. Anti-Scraping Measures
    • Proxy rotation
    • CAPTCHA handling
    • User-agent rotation
    • Red Flags: No rotation, frequent blocking, no handling
  3. Scraping Reliability
    • Retry logic
    • Failure recovery
    • Red Flags: No retry, no recovery, data loss

Time-Series Data Skills

  1. Historical Price Storage
    • Time-series database
    • Efficient queries
    • 5+ years retention
    • Red Flags: Relational DB, inefficient queries, short retention
  2. Price Chart Generation
    • Data aggregation
    • Multiple time ranges
    • Red Flags: No aggregation, slow queries, single range
  3. Data Compression
    • Old data compression
    • Columnar storage
    • Red Flags: No compression, high storage costs

Distributed Systems Skills

  1. Price Monitoring Pipeline
    • Continuous monitoring
    • Dynamic frequency
    • Red Flags: Fixed frequency, inefficient, poor scheduling
  2. Scalability Design
    • Horizontal scaling
    • Database sharding
    • Red Flags: Vertical scaling, no sharding, bottlenecks
  3. Browser Extension Integration
    • Fast product lookup
    • Seamless tracking
    • Red Flags: Slow lookup, poor UX, no integration

Problem-Solving Approach

  1. Amazon Rate Limiting
    • Strict enforcement
    • Proxy rotation
    • Intelligent scheduling
    • Red Flags: No rate limiting, violations, blocking
  2. Edge Cases
    • Product unavailable
    • Price format changes
    • ASIN changes
    • Red Flags: Ignoring edge cases, no handling
  3. Trade-off Analysis
    • Frequency vs accuracy
    • Storage vs retention
    • Red Flags: No trade-offs, dogmatic choices

System Design Skills

  1. Component Design
    • Scraper service
    • Chart service
    • Alert service
    • Red Flags: Monolithic, unclear boundaries
  2. Email Alert System
    • Reliable delivery
    • Price drop detection
    • Red Flags: Unreliable, missed drops, no retry
  3. Data Storage
    • Time-series DB
    • Proper indexing
    • Red Flags: Wrong DB, missing indexes, poor queries

Communication Skills

  1. Scraping Strategy Explanation
    • Can explain ASIN tracking
    • Understands rate limiting
    • Red Flags: No understanding, vague
  2. Time-Series Explanation
    • Can explain data storage
    • Understands aggregation
    • Red Flags: No understanding, vague

Meta-Specific Focus

  1. Web Scraping Expertise
    • Amazon-specific knowledge
    • Anti-scraping handling
    • Key: Show scraping expertise
  2. Time-Series Data Expertise
    • Efficient storage
    • Query optimization
    • Key: Demonstrate data expertise

Summary

Designing CamelCamelCamel requires careful consideration of:

  1. Amazon Scraping: Efficient scraping with proxy rotation and rate limiting
  2. ASIN-Based Tracking: Use ASIN for product identification
  3. Multi-Condition Tracking: Track new, used, refurbished separately
  4. Historical Price Storage: Time-series database for 5+ years of data
  5. Price Charts: Efficient chart generation with data aggregation
  6. Browser Extension: Fast product lookup and tracking
  7. Email Alerts: Reliable price drop notification delivery
  8. Rate Limiting: Strict enforcement of Amazon’s rate limits
  9. Scalability: Handle 30K+ price checks per second
  10. Data Retention: Efficient storage and archival of historical data

Key architectural decisions:

  • ASIN-Based Product Identification for Amazon products
  • Time-Series Database for efficient price history storage
  • Multi-Condition Price Tracking for new, used, refurbished
  • Proxy Rotation to avoid blocking
  • Dynamic Check Frequency based on product popularity
  • Browser Extension Integration for easy product tracking
  • Email Alert System for price drop notifications
  • Horizontal Scaling for all services

The system handles 30,000 price checks per second, tracks 50 million Amazon products, stores 5+ years of historical data, and sends millions of email alerts daily while respecting Amazon’s rate limits and ensuring accurate price tracking.