Introduction

This post walks through the design of an embedded report event system: a system that captures user events, processes them, and generates embedded reports (dashboards, analytics, insights) for display inside customer applications. It is a common system design question at companies that provide analytics and reporting capabilities to their users.

Table of Contents

  1. Problem Statement
  2. Requirements
  3. Capacity Estimation
  4. Core Entities
  5. API
  6. Data Flow
  7. Database Design
  8. High-Level Design
  9. Deep Dive
  10. Conclusion

Problem Statement

Design an embedded report event system that:

  1. Event Collection: Capture user events from various sources (web, mobile, API)
  2. Event Processing: Process events in real-time and batch modes
  3. Report Generation: Generate embedded reports (dashboards, charts, analytics)
  4. Real-Time Updates: Support real-time report updates as events occur
  5. Historical Data: Store and query historical event data
  6. Scalability: Handle millions of events per day
  7. Multi-Tenancy: Support multiple customers/organizations with data isolation

Describe the system architecture, components, technology choices, and how to handle scalability, reliability, and real-time processing.

Requirements

Functional Requirements

Core Features:

  1. Event Collection
    • Capture events from web applications, mobile apps, APIs
    • Support multiple event types (page views, clicks, conversions, custom events)
    • Event validation and schema enforcement
    • High-volume event ingestion
  2. Event Processing
    • Real-time event processing
    • Batch event processing
    • Event aggregation and transformation
    • Event enrichment (adding metadata, user context)
  3. Report Generation
    • Generate embedded reports (dashboards, charts, tables)
    • Support multiple report types (time-series, cohort, funnel, retention)
    • Custom report creation
    • Report templates and widgets
  4. Real-Time Updates
    • Real-time dashboard updates
    • Live metrics and KPIs
    • WebSocket/SSE for real-time data streaming
  5. Historical Data
    • Store historical event data
    • Support time-range queries
    • Data retention policies
    • Efficient querying and aggregation
  6. Multi-Tenancy
    • Data isolation per customer/organization
    • Per-tenant data access controls
    • Tenant-specific configurations

Non-Functional Requirements

Scale:

  • 100+ million events per day
  • 10,000+ concurrent report viewers
  • 1000+ customers/organizations
  • Tens of terabytes of historical data

Performance:

  • Event ingestion latency: < 100ms
  • Real-time report updates: < 1 second
  • Historical report generation: < 5 seconds
  • Query response time: < 2 seconds

Reliability:

  • 99.9% uptime
  • No event loss
  • Data consistency
  • Fault tolerance

Availability:

  • Multi-region deployment
  • Automatic failover
  • Graceful degradation

Clarifying Questions

Scale:

  • Q: How many events per day?
  • A: 100+ million events per day, with peaks up to 10 million events per hour

Event Types:

  • Q: What types of events?
  • A: Page views, clicks, conversions, custom business events, user actions

Report Types:

  • Q: What types of reports?
  • A: Time-series charts, cohort analysis, funnel analysis, retention reports, custom dashboards

Real-Time Requirements:

  • Q: How real-time should updates be?
  • A: Real-time dashboards should update within 1 second of events

Data Retention:

  • Q: How long to retain data?
  • A: 2 years for detailed events, longer for aggregated data

Multi-Tenancy:

  • Q: How many customers?
  • A: 1000+ customers, each with their own events and reports

Capacity Estimation

Storage Estimates

Event Data:

  • Each event: ~500 bytes (event data + metadata)
  • 100M events/day × 500 bytes = 50GB/day
  • With 2-year retention: 50GB/day × 730 days ≈ 36.5TB
  • With compression (roughly 3-4×): ~10TB

Aggregated Data:

  • Aggregated metrics: ~1KB per aggregation
  • 1000 customers × 100 metrics × 365 days = 36.5M aggregations/year
  • Total: ~36.5GB/year

Report Definitions:

  • Each report: ~10KB
  • 1000 customers × 100 reports = 100K reports
  • Total: ~1GB

Throughput Estimates

Write QPS:

  • 100M events/day ≈ 1,157 events/second on average
  • Peak: 10M events/hour ≈ 2,778 events/second
  • Design for peak writes: ~3,000 QPS

Read QPS:

  • Report queries: 10,000 concurrent viewers
  • Average: 1 query per second per viewer
  • Total reads: ~10,000 QPS

Real-Time Updates:

  • Real-time dashboard updates: 1,000 concurrent dashboards
  • Updates: 1 update per second per dashboard
  • Total: ~1,000 updates/second

Network Bandwidth

  • Event ingestion: 3,000 QPS × 500 bytes = 1.5MB/s
  • Report queries: 10,000 QPS × 10KB = 100MB/s
  • Real-time updates: 1,000 updates/s × 5KB = 5MB/s
  • Total: ~107MB/s, dominated by report reads (sanity-checked below)
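
These numbers are easy to sanity-check. A few lines of plain Python reproduce the storage and throughput math above:

# Back-of-envelope check of the capacity estimates
EVENTS_PER_DAY = 100_000_000
BYTES_PER_EVENT = 500
RETENTION_DAYS = 2 * 365

daily_bytes = EVENTS_PER_DAY * BYTES_PER_EVENT       # 50 GB/day
retained_tb = daily_bytes * RETENTION_DAYS / 1e12    # ~36.5 TB raw
avg_write_qps = EVENTS_PER_DAY / 86_400              # ~1,157 events/s
peak_write_qps = 10_000_000 / 3_600                  # ~2,778 events/s

print(f"daily ingest: {daily_bytes / 1e9:.0f} GB/day")
print(f"2-year raw storage: {retained_tb:.1f} TB")
print(f"avg/peak writes: {avg_write_qps:,.0f} / {peak_write_qps:,.0f} QPS")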

Core Entities

Event

  • Attributes: event_id, tenant_id, event_type, user_id, properties, timestamp, created_at
  • Relationships: Belongs to tenant, processed into reports

Report

  • Attributes: report_id, tenant_id, report_name, report_type, widgets, created_at, updated_at
  • Relationships: Belongs to tenant, contains widgets, generated from events

Widget

  • Attributes: widget_id, report_id, widget_type, query, visualization_config, created_at
  • Relationships: Belongs to report, queries event data

Tenant

  • Attributes: tenant_id, tenant_name, plan_tier, storage_quota, created_at
  • Relationships: Has events, has reports, has users

API

Event Collection API

POST /api/v1/events
Authorization: Bearer {token}
Content-Type: application/json

{
  "event_type": "page_view",
  "user_id": "user-123",
  "properties": {
    "page_url": "/products"
  }
}

Response: 202 Accepted
{
  "event_id": "uuid",
  "status": "accepted"
}
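
For illustration, a minimal client call against this endpoint might look like the following (Python with the requests library; the host and token are placeholders):

import requests

API_URL = "https://api.example.com/api/v1/events"  # placeholder host
TOKEN = "YOUR_API_TOKEN"                           # placeholder credential

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "event_type": "page_view",
        "user_id": "user-123",
        "properties": {"page_url": "/products"},
    },
    timeout=5,
)
resp.raise_for_status()
print(resp.json())  # {"event_id": "...", "status": "accepted"}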

Report Generation API

GET /api/v1/reports/{report_id}?start_time=2025-11-01&end_time=2025-11-04

Response: 200 OK
{
  "report_id": "uuid",
  "report_name": "Traffic Overview",
  "widgets": [
    {
      "widget_id": "uuid",
      "type": "time_series",
      "data": {
        "series": [
          {"time": "2025-11-01T00:00:00Z", "value": 1000},
          {"time": "2025-11-02T00:00:00Z", "value": 1200}
        ]
      }
    }
  ]
}

Embedding API

GET /api/v1/reports/{report_id}/embed?width=800&height=600

Response: 200 OK
{
  "embed_code": "<iframe src='...'></iframe>",
  "embed_url": "https://embed.example.com/report?token=...",
  "javascript_sdk": "<script src='...'></script>"
}

Data Flow

Event Ingestion Flow

  1. Client sends event → API Gateway
  2. API Gateway → Event Collection Service
  3. Event Collection Service validates event
  4. Event Collection Service → Message Queue (Kafka)
  5. Message Queue → Event Processing Service (real-time)
  6. Event Processing Service → Time-Series Database (store event)
  7. Event Processing Service → Cache (update real-time metrics)
  8. Response (202 Accepted) returned to the client once the event is enqueued (step 4); steps 5-7 continue asynchronously

Report Generation Flow

  1. Client requests report → API Gateway
  2. API Gateway → Report Service
  3. Report Service checks cache for pre-computed report
  4. If cache miss:
    • Report Service → Query Service (query event data)
    • Query Service → Time-Series Database (aggregate events)
    • Query Service → Report Service (return aggregated data)
    • Report Service → Cache (cache report)
  5. Report Service → API Gateway (return report)
  6. API Gateway → Client (return report)

Real-Time Update Flow

  1. New event processed → Event Processing Service
  2. Event Processing Service → Cache (update metrics)
  3. Event Processing Service → Message Queue (report update event)
  4. Message Queue → Real-Time Service
  5. Real-Time Service → WebSocket/SSE (push update to clients)
  6. Clients receive updated report data
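
As a concrete sketch of steps 3-5, assuming Redis Pub/Sub as the update transport (consistent with the WebSocket architecture described in the deep dive; channel names are illustrative):

import json
import redis

r = redis.Redis()

# Processor side (step 3): publish a report-update event
def publish_update(tenant_id, metrics):
    r.publish(f"tenant:{tenant_id}:updates", json.dumps(metrics))

# Real-time service side (steps 4-5): forward updates to connected clients
def forward_updates(tenant_id, send_to_clients):
    pubsub = r.pubsub()
    pubsub.subscribe(f"tenant:{tenant_id}:updates")
    for message in pubsub.listen():
        if message["type"] == "message":
            # send_to_clients stands in for a WebSocket/SSE broadcast
            send_to_clients(json.loads(message["data"]))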

Database Design

Schema Design

Events Table:

CREATE TABLE events (
    event_id VARCHAR(36) PRIMARY KEY,
    tenant_id VARCHAR(36) NOT NULL,
    event_type VARCHAR(100) NOT NULL,
    user_id VARCHAR(36),
    properties JSON,
    timestamp TIMESTAMP NOT NULL,
    created_at TIMESTAMP,
    INDEX idx_tenant_timestamp (tenant_id, timestamp),
    INDEX idx_event_type (event_type),
    INDEX idx_user_id (user_id)
);

Reports Table:

CREATE TABLE reports (
    report_id VARCHAR(36) PRIMARY KEY,
    tenant_id VARCHAR(36) NOT NULL,
    report_name VARCHAR(255) NOT NULL,
    report_type VARCHAR(50) NOT NULL,
    widgets JSON,
    created_at TIMESTAMP,
    updated_at TIMESTAMP,
    INDEX idx_tenant_id (tenant_id),
    FOREIGN KEY (tenant_id) REFERENCES tenants(tenant_id)
);

Widgets Table:

CREATE TABLE widgets (
    widget_id VARCHAR(36) PRIMARY KEY,
    report_id VARCHAR(36) NOT NULL,
    widget_type VARCHAR(50) NOT NULL,
    query JSON,
    visualization_config JSON,
    created_at TIMESTAMP,
    FOREIGN KEY (report_id) REFERENCES reports(report_id),
    INDEX idx_report_id (report_id)
);

Tenants Table:

CREATE TABLE tenants (
    tenant_id VARCHAR(36) PRIMARY KEY,
    tenant_name VARCHAR(255) NOT NULL,
    plan_tier VARCHAR(50),
    storage_quota BIGINT,
    created_at TIMESTAMP
);

Database Sharding Strategy

Shard by Tenant ID:

  • Tenant data, events, and reports on same shard
  • Enables efficient tenant queries
  • Use consistent hashing for distribution (sketched below)
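
A minimal consistent-hash ring shows how tenant IDs map to shards, with only a fraction of tenants moving when a shard is added or removed (virtual-node count and shard names below are illustrative):

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, shards, vnodes=100):
        # Many virtual nodes per physical shard even out the distribution
        self.ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, tenant_id):
        # First virtual node clockwise from the tenant's hash
        idx = bisect.bisect(self.keys, self._hash(tenant_id)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["shard-1", "shard-2", "shard-3"])
print(ring.shard_for("customer-123"))  # stable shard assignment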

Time-Series Database:

  • Use specialized time-series DB (InfluxDB, TimescaleDB)
  • Partition by time for efficient time-range queries
  • Optimize for time-series aggregations

High-Level Design

System Components

┌─────────────────────────────────────────────────────────────┐
│                    Client Applications                       │
│  (Web Apps, Mobile Apps, APIs - Event Sources)              │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                    Event Collection Layer                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │  Event       │  │  Event       │  │  Event       │      │
│  │  Gateway     │  │  Validator   │  │  Enricher    │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                    Event Processing Layer                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │  Real-Time   │  │  Batch       │  │  Aggregation │      │
│  │  Processor   │  │  Processor   │  │  Service     │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└──────────────────────┬──────────────────────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│   Message    │ │   Time-Series│ │  Aggregated  │
│   Queue      │ │   Database   │ │   Data Store │
│   (Kafka)    │ │  (InfluxDB)  │ │  (PostgreSQL)│
└──────────────┘ └──────────────┘ └──────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                    Report Generation Layer                   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │  Report      │  │  Query       │  │  Cache       │      │
│  │  Generator   │  │  Engine      │  │  Service     │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                    Embedding & Delivery Layer                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │  Embedding   │  │  Real-Time   │  │  API         │      │
│  │  Service     │  │  Streamer    │  │  Gateway     │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                    Client Applications                       │
│  (Customer Applications - Embedded Reports)                   │
└─────────────────────────────────────────────────────────────┘

Core Services

  1. Event Gateway: Receives events from clients, validates, routes
  2. Event Validator: Validates event schema and data
  3. Event Enricher: Adds metadata, user context, timestamps
  4. Real-Time Processor: Processes events in real-time for live dashboards
  5. Batch Processor: Processes events in batches for historical reports
  6. Aggregation Service: Aggregates events into metrics
  7. Report Generator: Generates reports from processed data
  8. Query Engine: Executes queries on historical data
  9. Embedding Service: Generates embeddable report widgets
  10. Real-Time Streamer: Streams real-time updates via WebSocket/SSE

Data Stores

  1. Message Queue (Kafka): Event streaming and processing
  2. Time-Series Database (InfluxDB/TimescaleDB): Raw event storage
  3. Aggregated Data Store (PostgreSQL): Aggregated metrics and reports
  4. Cache (Redis): Report cache and real-time data
  5. Object Storage (S3): Report exports and archives

Deep Dive

Event Schema

Event Structure:

{
  "event_id": "uuid",
  "tenant_id": "customer-123",
  "event_type": "page_view",
  "timestamp": "2025-11-04T10:00:00Z",
  "user_id": "user-456",
  "session_id": "session-789",
  "properties": {
    "page_url": "/products",
    "page_title": "Products",
    "referrer": "https://google.com",
    "device_type": "desktop",
    "browser": "Chrome",
    "location": {
      "country": "US",
      "city": "San Francisco"
    }
  },
  "metadata": {
    "ip_address": "192.168.1.1",
    "user_agent": "...",
    "sdk_version": "1.2.3"
  }
}

Database Schema

Events Table (Time-Series):

CREATE TABLE events (
    event_id UUID NOT NULL,
    tenant_id VARCHAR(255) NOT NULL,
    event_type VARCHAR(100) NOT NULL,
    timestamp TIMESTAMP NOT NULL,
    user_id VARCHAR(255),
    session_id VARCHAR(255),
    properties JSONB,
    metadata JSONB,

    -- In PostgreSQL the primary key of a partitioned table
    -- must include the partition key
    PRIMARY KEY (event_id, timestamp)
) PARTITION BY RANGE (timestamp);

-- Indexes are created separately (inline INDEX is not valid PostgreSQL)
CREATE INDEX idx_tenant_timestamp ON events (tenant_id, timestamp);
CREATE INDEX idx_event_type ON events (event_type);
CREATE INDEX idx_user_id ON events (user_id);
CREATE INDEX idx_session_id ON events (session_id);

-- Partition by month
CREATE TABLE events_2025_11 PARTITION OF events
    FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');
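
Monthly partitions must exist before events arrive, so a small scheduled job can create the next month's partition ahead of time. A sketch using psycopg2 (the connection string and scheduling mechanism are assumptions):

from datetime import date

import psycopg2

def create_next_month_partition(dsn):
    today = date.today()
    start = date(today.year + (today.month == 12), today.month % 12 + 1, 1)
    end = date(start.year + (start.month == 12), start.month % 12 + 1, 1)
    name = f"events_{start.year}_{start.month:02d}"
    # psycopg2 interpolates parameters client-side, so they work in DDL
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF events "
            f"FOR VALUES FROM (%s) TO (%s)",
            (start.isoformat(), end.isoformat()),
        )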

Aggregated Metrics Table:

CREATE TABLE aggregated_metrics (
    metric_id UUID PRIMARY KEY,
    tenant_id VARCHAR(255) NOT NULL,
    metric_name VARCHAR(100) NOT NULL,
    metric_type VARCHAR(50), -- 'counter', 'gauge', 'histogram'
    time_bucket TIMESTAMP NOT NULL, -- start of the bucket (hourly, daily)
    bucket_type VARCHAR(20), -- 'hour', 'day', 'week', 'month'
    value DECIMAL(20, 4),
    dimensions JSONB, -- filters/dimensions

    UNIQUE (tenant_id, metric_name, time_bucket, bucket_type, dimensions)
);

CREATE INDEX idx_tenant_metric_time ON aggregated_metrics (tenant_id, metric_name, time_bucket);

Reports Table:

CREATE TABLE reports (
    report_id UUID PRIMARY KEY,
    tenant_id VARCHAR(255) NOT NULL,
    report_name VARCHAR(255) NOT NULL,
    report_type VARCHAR(50), -- 'time_series', 'cohort', 'funnel', 'retention'
    config JSONB, -- report configuration
    created_at TIMESTAMP NOT NULL,
    updated_at TIMESTAMP NOT NULL
);

CREATE INDEX idx_tenant_id ON reports (tenant_id);

Report Widgets Table:

CREATE TABLE report_widgets (
    widget_id UUID PRIMARY KEY,
    report_id UUID REFERENCES reports(report_id),
    widget_type VARCHAR(50), -- 'chart', 'table', 'metric'
    config JSONB, -- widget configuration
    position INTEGER, -- position in dashboard
    created_at TIMESTAMP NOT NULL
);

Event Collection Service

API Design:

POST /api/v1/events
Content-Type: application/json
X-Tenant-ID: customer-123

{
  "event_type": "page_view",
  "user_id": "user-456",
  "session_id": "session-789",
  "properties": {
    "page_url": "/products",
    "page_title": "Products"
  }
}

Response: 202 Accepted
{
  "event_id": "uuid",
  "status": "accepted",
  "timestamp": "2025-11-04T10:00:00Z"
}

Implementation:

import uuid
from datetime import datetime, timezone

class EventCollectionService:
    def collect_event(self, event_data, tenant_id):
        # Validate event
        validated_event = self.validator.validate(event_data, tenant_id)
        
        # Enrich event
        enriched_event = self.enricher.enrich(validated_event, tenant_id)
        
        # Add tenant isolation
        enriched_event['tenant_id'] = tenant_id
        enriched_event['event_id'] = str(uuid.uuid4())
        enriched_event['timestamp'] = datetime.now(timezone.utc).isoformat()
        
        # Publish to Kafka
        self.kafka.produce('events', {
            'key': tenant_id,  # Partition by tenant
            'value': enriched_event
        })
        
        # Return event ID
        return enriched_event['event_id']

Real-Time Processing Service

Implementation:

class RealTimeProcessor:
    def process_event(self, event):
        tenant_id = event['tenant_id']
        event_type = event['event_type']
        
        # Update real-time metrics
        self.update_realtime_metrics(tenant_id, event_type, event)
        
        # Update cache for dashboards
        self.update_dashboard_cache(tenant_id, event)
        
        # Trigger real-time updates
        self.trigger_realtime_updates(tenant_id, event)
    
    def update_realtime_metrics(self, tenant_id, event_type, event):
        # Update counters
        key = f"metrics:{tenant_id}:{event_type}:realtime"
        self.redis.incr(key)
        
        # Update time-windowed metrics
        time_window = self.get_current_time_window()  # e.g., last hour
        window_key = f"metrics:{tenant_id}:{event_type}:{time_window}"
        self.redis.incr(window_key)
        self.redis.expire(window_key, 3600)  # 1 hour TTL
    
    def update_dashboard_cache(self, tenant_id, event):
        # Update cached dashboard data
        dashboard_key = f"dashboard:{tenant_id}:realtime"
        self.redis.hincrby(dashboard_key, event['event_type'], 1)
        self.redis.expire(dashboard_key, 300)  # 5 minute TTL
    
    def trigger_realtime_updates(self, tenant_id, event):
        # Publish to WebSocket channel
        self.websocket.publish(f"tenant:{tenant_id}:updates", {
            'event_type': event['event_type'],
            'timestamp': event['timestamp'],
            'metrics': self.get_realtime_metrics(tenant_id)
        })

Batch Processing Service

Implementation:

class BatchProcessor:
    def process_batch(self, events):
        # Group events by tenant and time bucket
        grouped = self.group_events(events)
        
        for (tenant_id, time_bucket), event_group in grouped.items():
            # Aggregate events
            aggregations = self.aggregate_events(event_group)
            
            # Store aggregated metrics
            self.store_aggregated_metrics(tenant_id, time_bucket, aggregations)
            
            # Update report cache
            self.update_report_cache(tenant_id, aggregations)
    
    def aggregate_events(self, events):
        aggregations = {}
        
        # Count by event type
        for event in events:
            event_type = event['event_type']
            aggregations[f"{event_type}_count"] = aggregations.get(f"{event_type}_count", 0) + 1
        
        # Calculate other metrics (unique users, sessions, etc.)
        unique_users = len(set(e['user_id'] for e in events if e.get('user_id')))
        aggregations['unique_users'] = unique_users
        
        return aggregations
    
    def store_aggregated_metrics(self, tenant_id, time_bucket, aggregations):
        for metric_name, value in aggregations.items():
            self.db.insert_aggregated_metric(
                tenant_id=tenant_id,
                metric_name=metric_name,
                time_bucket=time_bucket,
                value=value,
                bucket_type='hour'
            )

Report Generation Service

Implementation:

class ReportGenerator:
    def generate_report(self, report_id, tenant_id, time_range):
        # Get report configuration
        report = self.db.get_report(report_id, tenant_id)
        
        # Generate report data
        report_data = {}
        
        for widget in report['widgets']:
            widget_data = self.generate_widget_data(
                widget, tenant_id, time_range
            )
            report_data[widget['widget_id']] = widget_data
        
        return report_data
    
    def generate_widget_data(self, widget, tenant_id, time_range):
        widget_type = widget['widget_type']
        config = widget['config']
        
        if widget_type == 'time_series':
            return self.generate_timeseries_data(
                config, tenant_id, time_range
            )
        elif widget_type == 'cohort':
            return self.generate_cohort_data(
                config, tenant_id, time_range
            )
        elif widget_type == 'funnel':
            return self.generate_funnel_data(
                config, tenant_id, time_range
            )
        elif widget_type == 'metric':
            return self.generate_metric_data(
                config, tenant_id, time_range
            )
        else:
            raise ValueError(f"Unsupported widget type: {widget_type}")
    
    def generate_timeseries_data(self, config, tenant_id, time_range):
        # Query aggregated metrics
        metrics = self.db.query_aggregated_metrics(
            tenant_id=tenant_id,
            metric_name=config['metric'],
            start_time=time_range['start'],
            end_time=time_range['end'],
            bucket_type=config.get('bucket_type', 'hour')
        )
        
        # Format for chart
        return {
            'data': [
                {'time': m['time_bucket'], 'value': m['value']}
                for m in metrics
            ],
            'type': 'line'
        }

Embedding Service

Implementation:

from urllib.parse import urlencode

class EmbeddingService:
    def generate_embed_code(self, report_id, tenant_id, options):
        # Generate embed URL
        embed_url = self.generate_embed_url(report_id, tenant_id, options)
        
        # Generate iframe embed code
        embed_code = f"""
        <iframe
            src="{embed_url}"
            width="{options.get('width', 800)}"
            height="{options.get('height', 600)}"
            frameborder="0"
            allowfullscreen>
        </iframe>
        """
        
        return {
            'embed_code': embed_code,
            'embed_url': embed_url,
            'javascript_sdk': self.generate_sdk_code(report_id, tenant_id)
        }
    
    def generate_embed_url(self, report_id, tenant_id, options):
        base_url = self.config.embed_base_url
        params = {
            'report_id': report_id,
            'tenant_id': tenant_id,
            'token': self.generate_embed_token(report_id, tenant_id),
            **options
        }
        
        return f"{base_url}/embed?{urlencode(params)}"

Real-Time Streaming Service

WebSocket Implementation:

class WebSocketService:
    def handle_connection(self, websocket, tenant_id, report_id):
        # Authenticate the connection before subscribing
        if not self.authenticate(websocket, tenant_id):
            websocket.close()
            return
        
        # Subscribe to the tenant/report update channel
        channel = f"tenant:{tenant_id}:report:{report_id}"
        self.subscribe(websocket, channel)
        
        # Send initial data so the dashboard renders immediately
        initial_data = self.get_report_data(report_id, tenant_id)
        websocket.send_json(initial_data)
        
        # Forward updates until the client disconnects
        try:
            while True:
                message = self.wait_for_update(channel)
                if message:
                    websocket.send_json(message)
        except WebSocketDisconnect:  # raised by the framework (e.g., Starlette/FastAPI)
            pass
        finally:
            self.unsubscribe(websocket, channel)

Technology Choices

Message Queue: Apache Kafka

Why Kafka:

  • High Throughput: Handle millions of events per second
  • Durability: Events persisted to disk
  • Partitioning: Partition by tenant for isolation
  • Consumer Groups: Multiple consumers for processing
  • Replay Capability: Replay events for reprocessing

Topic Design:

  • events: Raw events (partitioned by tenant_id)
  • events-processed: Processed events
  • metrics-updates: Metric updates
  • report-updates: Report update events
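
As a concrete example of producing to the events topic with tenant-keyed partitioning, a sketch using the confluent-kafka client (the broker address is a placeholder):

import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker

def publish_event(event):
    # Keying by tenant_id keeps each tenant's events ordered within a partition
    producer.produce(
        "events",
        key=event["tenant_id"],
        value=json.dumps(event).encode(),
    )

# Call producer.flush() on shutdown to drain any buffered messages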

Time-Series Database: InfluxDB or TimescaleDB

Why Time-Series DB:

  • Optimized for Time-Series: Efficient time-range queries
  • Data Compression: Automatic compression
  • High Write Throughput: Handle high event volumes
  • Retention Policies: Automatic data retention

Why InfluxDB:

  • Built specifically for time-series
  • Excellent compression
  • Good query performance

Why TimescaleDB:

  • PostgreSQL-based (familiar SQL)
  • Better for complex queries
  • Good for hybrid workloads

Aggregated Data Store: PostgreSQL

Why PostgreSQL:

  • ACID Compliance: Strong consistency for reports
  • Complex Queries: Support for complex aggregations
  • JSONB Support: Flexible schema for report configs
  • Mature Ecosystem: Proven reliability

Cache: Redis

Why Redis:

  • Low Latency: Sub-millisecond access
  • Data Structures: Rich data structures (hashes, sorted sets)
  • Pub/Sub: Real-time updates
  • Persistence: Optional persistence

Caching Strategy:

  • Report Cache: Cache generated reports (TTL: 5 minutes)
  • Real-Time Metrics: Cache real-time metrics (TTL: 1 minute)
  • Dashboard Cache: Cache dashboard data (TTL: 1 minute)
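
A report cache along these lines is a straightforward cache-aside wrapper over Redis. A sketch (the key format is illustrative; the TTL follows the strategy above):

import json
import redis

r = redis.Redis()
REPORT_TTL = 300  # seconds; 5 minutes, per the strategy above

def get_report_cached(report_id, tenant_id, time_range, generate_fn):
    key = f"report:{tenant_id}:{report_id}:{time_range['start']}:{time_range['end']}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit
    # Cache miss: generate the report, then store it with a TTL
    report = generate_fn(report_id, tenant_id, time_range)
    r.setex(key, REPORT_TTL, json.dumps(report))
    return report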

Scalability Design

Horizontal Scaling

Event Collection:

  • Multiple event gateway instances
  • Load balancer (round-robin, least connections)
  • Auto-scaling based on event rate

Event Processing:

  • Multiple real-time processors
  • Kafka consumer groups for parallel processing
  • Multiple batch processors

Report Generation:

  • Multiple report generator instances
  • Query engine scaling
  • Cache layer scaling

Data Partitioning

Events Partitioning:

  • Partition by tenant_id (Kafka partitioning)
  • Time-based partitioning (monthly partitions)
  • Partition pruning for queries

Aggregated Metrics Partitioning:

  • Partition by tenant_id and time_bucket
  • Monthly partitions for historical data
  • Hot/cold data separation

Caching Strategy

Multi-Level Caching:

  1. L1 Cache: In-memory cache in application
  2. L2 Cache: Redis (distributed cache)
  3. CDN: Static report assets

Cache Invalidation:

  • Time-based expiration (TTL)
  • Event-driven invalidation
  • Manual cache invalidation API

Real-Time Updates

WebSocket Architecture

Client ←→ WebSocket Server ←→ Redis Pub/Sub ←→ Event Processors

Implementation:

  • WebSocket server for client connections
  • Redis Pub/Sub for event distribution
  • Event processors publish to Redis channels
  • WebSocket server subscribes and broadcasts to clients

Server-Sent Events (SSE) Alternative

For simpler use cases:

  • HTTP-based (easier than WebSocket)
  • One-way communication (server to client)
  • Automatic reconnection
  • Lower overhead
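
A minimal SSE endpoint can be built on any HTTP framework; the sketch below uses Flask with Redis Pub/Sub as the update source (route and channel names are illustrative):

import redis
from flask import Flask, Response

app = Flask(__name__)
r = redis.Redis()

@app.route("/stream/<tenant_id>")
def stream(tenant_id):
    def event_stream():
        pubsub = r.pubsub()
        pubsub.subscribe(f"tenant:{tenant_id}:updates")
        for message in pubsub.listen():
            if message["type"] == "message":
                # SSE frames are "data: <payload>" followed by a blank line
                yield f"data: {message['data'].decode()}\n\n"
    return Response(event_stream(), mimetype="text/event-stream")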

Multi-Tenancy

Data Isolation

Tenant Isolation Strategies:

  1. Database-Level Isolation
    • Separate database per tenant (strongest isolation)
    • Tenant ID in all tables (shared database)
  2. Application-Level Isolation
    • All queries filtered by tenant_id
    • Tenant context in all operations
  3. Hybrid Approach
    • Tenant ID in all tables
    • Row-level security policies
    • Application-level filtering

Implementation:

class TenantContext:
    def __init__(self, tenant_id):
        self.tenant_id = tenant_id
    
    def query_events(self, filters):
        # Always filter by tenant_id
        filters['tenant_id'] = self.tenant_id
        return self.db.query('events', filters)

Security

Authentication:

  • API keys per tenant
  • OAuth tokens
  • JWT tokens with tenant claim

Authorization:

  • Tenant-based access control
  • Role-based access within tenant
  • Report-level permissions

Monitoring and Observability

Key Metrics

Event Metrics:

  • Events ingested per second
  • Events processed per second
  • Event processing latency
  • Event loss rate

Report Metrics:

  • Report generation latency
  • Report cache hit rate
  • Report query performance
  • Concurrent report viewers

System Metrics:

  • API latency (p50, p95, p99)
  • Database query latency
  • Cache hit rate
  • Error rates

Alerting

Critical Alerts:

  • Event ingestion failure
  • Event processing lag
  • Report generation failures
  • Database connection issues

Warning Alerts:

  • High event processing latency
  • Low cache hit rate
  • High error rates
  • Resource utilization

Trade-offs and Design Decisions

Real-Time vs. Batch Processing

Decision: Both

  • Real-Time: For live dashboards and alerts
  • Batch: For historical reports and complex aggregations

Trade-off: Complexity vs. flexibility

Time-Series DB vs. Relational DB

Decision: Both

  • Time-Series DB: For raw event storage
  • Relational DB: For aggregated metrics and reports

Trade-off: Specialized storage vs. query flexibility

WebSocket vs. Polling

Decision: WebSocket for real-time, polling fallback

  • WebSocket: Real-time updates, lower latency
  • Polling: Fallback for WebSocket-unavailable scenarios

Trade-off: Complexity vs. real-time capabilities

Failure Scenarios

Event Collection Failure

Scenario: Event gateway fails

Handling:

  1. Multiple gateway instances (redundancy)
  2. Load balancer health checks
  3. Client retry with exponential backoff (see the sketch after this list)
  4. Dead letter queue for failed events
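
Client-side retry (item 3) could be as simple as the following sketch; it assumes the send is idempotent, e.g., via a client-generated event ID, so retries cannot double-count events:

import random
import time

def send_with_retry(send_fn, event, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return send_fn(event)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up; hand the event to a dead letter queue
            # Exponential backoff with jitter: ~1s, 2s, 4s, ... capped at 30s
            delay = min(2 ** attempt, 30)
            time.sleep(delay * random.uniform(0.5, 1.5))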

Event Processing Lag

Scenario: Event processing falls behind

Handling:

  1. Scale processors horizontally
  2. Increase Kafka consumer instances
  3. Alert on lag threshold
  4. Prioritize critical events

Database Failure

Scenario: Database unavailable

Handling:

  1. Database replication (primary + replicas)
  2. Automatic failover
  3. Cache serving stale data temporarily
  4. Graceful degradation

What Interviewers Look For

Event-Driven Architecture Skills

  1. Event Collection
    • Scalable ingestion
    • Event validation
    • Multi-tenant support
    • Red Flags: Low throughput, no validation, tenant mixing
  2. Event Processing
    • Real-time processing
    • Batch processing
    • Stream processing
    • Red Flags: No real-time, slow processing, bottlenecks
  3. Report Generation
    • Efficient queries
    • Caching strategy
    • Incremental updates
    • Red Flags: Slow queries, no caching, full recompute

Distributed Systems Skills

  1. Time-Series Data
    • Efficient storage
    • Time-based partitioning
    • Retention policies
    • Red Flags: Wrong storage, no partitioning, high costs
  2. Multi-Tenancy
    • Data isolation
    • Security
    • Resource quotas
    • Red Flags: No isolation, security issues, no quotas
  3. Real-Time Updates
    • WebSocket/SSE
    • Pub/Sub architecture
    • Red Flags: Polling, high latency, no real-time

Problem-Solving Approach

  1. Scale Challenges
    • High event volume
    • Many tenants
    • Real-time requirements
    • Red Flags: No scale consideration, bottlenecks, poor performance
  2. Edge Cases
    • Event loss
    • Processing failures
    • Database failures
    • Red Flags: Ignoring edge cases, no handling
  3. Trade-off Analysis
    • Real-time vs batch
    • Consistency vs performance
    • Red Flags: No trade-offs, dogmatic choices

System Design Skills

  1. Component Design
    • Collection service
    • Processing service
    • Reporting service
    • Red Flags: Monolithic, unclear boundaries
  2. Caching Strategy
    • Multi-level caching
    • Cache invalidation
    • Red Flags: No caching, poor strategy, stale data
  3. Data Partitioning
    • Tenant-based
    • Time-based
    • Red Flags: No partitioning, poor queries, bottlenecks

Communication Skills

  1. Event Architecture Explanation
    • Can explain event flow
    • Understands processing
    • Red Flags: No understanding, vague explanations
  2. Scale Explanation
    • Can explain scaling strategies
    • Understands bottlenecks
    • Red Flags: No understanding, vague

Meta-Specific Focus

  1. Event-Driven Systems Expertise
    • Event architecture knowledge
    • Real-time processing
    • Key: Show event-driven systems expertise
  2. Multi-Tenant Design
    • Data isolation
    • Security focus
    • Key: Demonstrate multi-tenant expertise

Conclusion

Designing an embedded report event system requires:

  1. Event Collection: Scalable event ingestion with validation
  2. Event Processing: Real-time and batch processing pipelines
  3. Report Generation: Efficient report generation with caching
  4. Real-Time Updates: WebSocket/SSE for live dashboards
  5. Historical Data: Time-series database for efficient queries
  6. Multi-Tenancy: Data isolation and security
  7. Scalability: Horizontal scaling of all components
  8. Reliability: Fault tolerance and graceful degradation

Key Design Principles:

  • Separation of Concerns: Separate collection, processing, and reporting
  • Caching Strategy: Multi-level caching for performance
  • Data Partitioning: Partition by tenant and time
  • Real-Time First: Design for real-time with batch fallback
  • Tenant Isolation: Strong data isolation and security

This system design demonstrates understanding of event-driven architectures, real-time processing, time-series data, and multi-tenant systems—all critical for building production-grade analytics and reporting systems.