Introduction
Quickset with UEI is a smart home integration platform that connects Quickset TV mounting solutions with UEI (Universal Electronics Inc) remote controls and smart home devices. The system enables users to control their TV mounts, entertainment systems, and smart home devices through a unified platform, supporting device discovery, real-time control, automation, and voice integration.
This post provides a detailed walkthrough of designing Quickset with UEI, covering IoT device management, device discovery and pairing, real-time control protocols, automation rules, cloud synchronization, and integration with voice assistants. This is a system design interview question that tests your understanding of IoT systems, device management, real-time communication, automation, and handling millions of connected devices.
Table of Contents
- Problem Statement
- Requirements
- Capacity Estimation
- Core Entities
- API
- Data Flow
- Database Design
- High-Level Design
- Deep Dive
- Summary
Problem Statement
Design Quickset with UEI, a smart home integration platform with the following features:
- Device discovery and pairing for Quickset mounts and UEI devices
- Real-time control of TV mounts (tilt, swivel, extend)
- Remote control functionality for entertainment devices via UEI
- Device grouping and scene creation
- Automation rules and scheduling
- Voice assistant integration (Alexa, Google Assistant)
- Mobile app for control and monitoring
- Cloud synchronization and backup
- Multi-user support with permissions
- Firmware updates and device management
Scale Requirements:
- 50 million+ registered users
- 200 million+ connected devices (mounts, remotes, smart devices)
- 10 million+ concurrent active sessions
- 100K+ commands per second
- Must support real-time control with < 200ms latency
- Must handle device discovery in < 5 seconds
- Global deployment across multiple regions
Requirements
Functional Requirements
Core Features:
- Device Discovery: Automatically discover Quickset mounts and UEI devices on local network
- Device Pairing: Secure pairing and registration of devices
- Mount Control: Control TV mount position (tilt, swivel, extend, retract)
- Remote Control: Send IR/RF commands via UEI remotes to entertainment devices
- Device Groups: Group multiple devices for coordinated control
- Scenes: Create and execute scenes (e.g., “Movie Night” - adjust mount, turn on TV, dim lights)
- Automation: Create rules based on time, events, or conditions
- Scheduling: Schedule device actions at specific times
- Voice Control: Integration with Alexa, Google Assistant, Siri
- Mobile App: iOS and Android apps for control
- Multi-User: Support multiple users per household with permissions
- Firmware Updates: Over-the-air firmware updates for devices
- Device Status: Real-time status monitoring and notifications
- History & Analytics: Track device usage and energy consumption
Out of Scope:
- Device manufacturing (assume devices exist)
- Payment processing (assume existing)
- User authentication (assume existing OAuth)
- Video streaming (focus on control)
Non-Functional Requirements
- Availability: 99.9% uptime
- Reliability: No command loss, device state consistency
- Performance:
  - Command latency: < 200ms for local devices, < 500ms for cloud
  - Device discovery: < 5 seconds
  - Scene execution: < 2 seconds
  - App responsiveness: < 100ms
- Scalability: Handle 200M+ devices, 10M+ concurrent sessions
- Consistency: Eventual consistency for device state, strong for critical commands
- Security: End-to-end encryption, secure device pairing, authentication
- Offline Support: Local control works offline, cloud sync when online
Capacity Estimation
Traffic Estimates
Assumptions:
- 50 million registered users
- 10 million daily active users (20% DAU rate)
- Average user has 4 devices
- Each device sends 10 status updates per hour
- Each user sends 20 commands per day
- Peak traffic: 3x average
Read Traffic:
- Status queries: 10M users × 4 devices × 10 updates/hour = 400M updates/hour = 111K updates/sec
- Device list queries: 10M users × 5 queries/day = 50M queries/day = 580 queries/sec
- Scene queries: 10M users × 2 queries/day = 20M queries/day = 231 queries/sec
- Total reads: ~112K reads/sec (peak: ~336K reads/sec)
Write Traffic:
- Commands: 10M users × 20 commands/day = 200M commands/day = 2.3K commands/sec
- Device registrations: 1M new devices/day = 12 registrations/sec
- Scene creations: 1M scenes/day = 12 scenes/sec
- Automation rules: 500K rules/day = 6 rules/sec
- Total writes: ~2.3K writes/sec (peak: ~7K writes/sec)
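The traffic figures above are just multiplications and unit conversions. A minimal sketch that reproduces them from the stated assumptions (the constant and function names are illustrative, not part of any real codebase):

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400

# Assumptions taken from the list above.
DAILY_ACTIVE_USERS = 10_000_000
DEVICES_PER_USER = 4
STATUS_UPDATES_PER_DEVICE_PER_HOUR = 10
COMMANDS_PER_USER_PER_DAY = 20
PEAK_FACTOR = 3


def per_second(events: float, period_seconds: int) -> float:
    """Convert an event count over a period into an average per-second rate."""
    return events / period_seconds


# Read side
status_rate = per_second(
    DAILY_ACTIVE_USERS * DEVICES_PER_USER * STATUS_UPDATES_PER_DEVICE_PER_HOUR,
    SECONDS_PER_HOUR,
)
device_list_rate = per_second(DAILY_ACTIVE_USERS * 5, SECONDS_PER_DAY)
scene_query_rate = per_second(DAILY_ACTIVE_USERS * 2, SECONDS_PER_DAY)
total_reads = status_rate + device_list_rate + scene_query_rate

# Write side
command_rate = per_second(DAILY_ACTIVE_USERS * COMMANDS_PER_USER_PER_DAY, SECONDS_PER_DAY)
registration_rate = per_second(1_000_000, SECONDS_PER_DAY)
scene_creation_rate = per_second(1_000_000, SECONDS_PER_DAY)
rule_creation_rate = per_second(500_000, SECONDS_PER_DAY)
total_writes = command_rate + registration_rate + scene_creation_rate + rule_creation_rate

if __name__ == "__main__":
    print(f"reads:  {total_reads:,.0f}/sec avg, {total_reads * PEAK_FACTOR:,.0f}/sec peak")
    print(f"writes: {total_writes:,.0f}/sec avg, {total_writes * PEAK_FACTOR:,.0f}/sec peak")
```

Keeping the assumptions as named constants makes it easy to re-run the estimate when an interviewer changes one input, such as the DAU rate or the peak factor.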
Storage Estimates
Assumptions:
- Device metadata: 1KB per device
- Command history: 100 bytes per command, keep 90 days
- Scene data: 5KB per scene
- Automation rules: 2KB per rule
- User preferences: 2KB per user
- Firmware: 10MB per device type
Storage Calculations:
- Device metadata: 200M devices × 1KB = 200GB
- Command history: 200M commands/day × 90 days × 100 bytes = 1.8TB
- Scenes: 50M scenes × 5KB = 250GB
- Automation rules: 20M rules × 2KB = 40GB
- User preferences: 50M users × 2KB = 100GB
- Firmware: 100 device types × 10MB = 1GB
- Total storage: ~2.4TB (command history adds ~20GB/day until the 90-day retention window fills)
Bandwidth Estimates
Assumptions:
- Command: 500 bytes
- Status update: 200 bytes
- Device discovery: 2KB
- Firmware update: 10MB
Bandwidth Calculations:
- Commands: 2.3K commands/sec × 500 bytes = 1.15MB/sec = 9.2Mbps
- Status updates: 111K updates/sec × 200 bytes = 22.2MB/sec = 177.6Mbps
- Device discovery: 12 discoveries/sec × 2KB = 24KB/sec = 0.2Mbps
- Firmware updates: 100K updates/day × 10MB = 1TB/day = 92.6Mbps average
- Total bandwidth: ~280Mbps average, ~1Gbps peak
Core Entities
User
- UserID, Email, Name, CreatedAt, LastLogin, Preferences
Device
- DeviceID, UserID, DeviceType (Mount/Remote/SmartDevice), Model, SerialNumber, MACAddress, IPAddress, FirmwareVersion, Status (Online/Offline), LastSeen, Location, Capabilities
Mount
- DeviceID, MountType (Fixed/Tilt/Swivel/FullMotion), MaxWeight, MaxTVSize, CurrentPosition (TiltAngle, SwivelAngle, Extension), Presets
Remote
- DeviceID, RemoteType (IR/RF/Bluetooth), SupportedDevices, BatteryLevel, LastUsed
DeviceGroup
- GroupID, UserID, Name, DeviceIDs[], CreatedAt
Scene
- SceneID, UserID, Name, Actions[], CreatedAt, LastExecuted
AutomationRule
- RuleID, UserID, Name, Trigger (Time/Event/Condition), Actions[], Enabled, CreatedAt
Command
- CommandID, UserID, DeviceID, CommandType, Parameters, Status (Pending/Success/Failed), Timestamp, ResponseTime
FirmwareUpdate
- UpdateID, DeviceType, Version, FileURL, ReleaseNotes, ReleasedAt, RolloutPercentage
API
Device Management
POST /api/v1/devices/discover
Response: { devices: [Device] }
POST /api/v1/devices/pair
Body: { deviceId, pairingCode }
Response: { deviceId, accessToken }
GET /api/v1/devices
Response: { devices: [Device] }
GET /api/v1/devices/{deviceId}
Response: { device: Device }
PUT /api/v1/devices/{deviceId}
Body: { name, location }
Response: { device: Device }
DELETE /api/v1/devices/{deviceId}
Response: { success: true }
Mount Control
POST /api/v1/devices/{deviceId}/mount/position
Body: { tilt, swivel, extension }
Response: { success: true, position: {...} }
GET /api/v1/devices/{deviceId}/mount/position
Response: { position: { tilt, swivel, extension } }
POST /api/v1/devices/{deviceId}/mount/preset
Body: { presetName, position }
Response: { success: true }
POST /api/v1/devices/{deviceId}/mount/preset/{presetName}/recall
Response: { success: true }
Remote Control
POST /api/v1/devices/{deviceId}/remote/send
Body: { command, deviceType, repeat }
Response: { success: true }
POST /api/v1/devices/{deviceId}/remote/macro
Body: { commands: [{ command, delay }] }
Response: { success: true }
Scenes & Automation
POST /api/v1/scenes
Body: { name, actions: [{ deviceId, action, parameters }] }
Response: { scene: Scene }
POST /api/v1/scenes/{sceneId}/execute
Response: { success: true, executionTime }
GET /api/v1/scenes
Response: { scenes: [Scene] }
POST /api/v1/automations
Body: { name, trigger, actions, enabled }
Response: { automation: AutomationRule }
GET /api/v1/automations
Response: { automations: [AutomationRule] }
Device Groups
POST /api/v1/groups
Body: { name, deviceIds: [] }
Response: { group: DeviceGroup }
POST /api/v1/groups/{groupId}/control
Body: { action, parameters }
Response: { success: true }
Data Flow
Device Discovery Flow
- User opens app and taps “Discover Devices”
- App sends discovery request to Cloud Service
- Cloud Service broadcasts discovery message via MQTT to local gateways
- Local gateway (hub/router) performs mDNS/Bonjour discovery on local network
- Devices respond with device info (type, model, capabilities)
- Gateway aggregates responses and sends to Cloud Service
- Cloud Service returns discovered devices to app
- User selects device to pair
- App initiates pairing flow with device
- Device generates pairing code and displays it
- User enters pairing code in app
- App sends pairing request with code to Cloud Service
- Cloud Service validates code and creates device registration
- Device receives confirmation and establishes secure connection
- App receives device registration confirmation
Command Execution Flow
- User sends command via mobile app (e.g., “Tilt mount 15 degrees”)
- App sends command to Cloud Service API
- Cloud Service validates user permissions and device ownership
- Cloud Service checks if device is online (local or cloud)
- Local Path (device on same network):
  - Cloud Service routes command to Local Gateway via MQTT
  - Gateway forwards command to device via local protocol (WiFi/Bluetooth/Zigbee)
  - Device executes command and sends status update
  - Gateway forwards status to Cloud Service
  - Cloud Service updates device state in database
  - Cloud Service sends status update to app via WebSocket
- Cloud Path (device not on local network):
  - Cloud Service sends command directly to device via MQTT/CoAP
  - Device executes command and sends status update
  - Cloud Service updates device state
  - Cloud Service sends status update to app
- App receives status update and updates UI
Scene Execution Flow
- User taps scene button in app (e.g., “Movie Night”)
- App sends scene execution request to Cloud Service
- Cloud Service retrieves scene definition from database
- Cloud Service validates all devices in scene are accessible
- Cloud Service executes scene actions in parallel:
  - For each action, follows command execution flow
  - Tracks execution status for each action
- Cloud Service aggregates results and sends to app
- App displays execution status
Automation Rule Trigger Flow
- Automation engine checks triggers periodically (time-based) or listens for events (event-based)
- When trigger condition is met:
  - Automation engine retrieves rule from database
  - Validates rule is enabled and conditions are met
  - Executes rule actions (similar to scene execution)
  - Logs execution result
  - If rule has notifications enabled, sends notification to user
Database Design
Schema Design
users
CREATE TABLE users (
    user_id VARCHAR(36) PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    name VARCHAR(255),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_login TIMESTAMP,
    preferences JSON,
    INDEX idx_email (email)
);
devices
CREATE TABLE devices (
    device_id VARCHAR(36) PRIMARY KEY,
    user_id VARCHAR(36) NOT NULL,
    device_type ENUM('MOUNT', 'REMOTE', 'SMART_DEVICE') NOT NULL,
    model VARCHAR(100),
    serial_number VARCHAR(100) UNIQUE,
    mac_address VARCHAR(17) UNIQUE,
    ip_address VARCHAR(45),
    firmware_version VARCHAR(20),
    status ENUM('ONLINE', 'OFFLINE', 'UPDATING') DEFAULT 'OFFLINE',
    last_seen TIMESTAMP,
    location VARCHAR(100),
    capabilities JSON,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users(user_id),
    INDEX idx_user_id (user_id),
    INDEX idx_status (status),
    INDEX idx_last_seen (last_seen)
);
mounts
CREATE TABLE mounts (
    device_id VARCHAR(36) PRIMARY KEY,
    mount_type ENUM('FIXED', 'TILT', 'SWIVEL', 'FULL_MOTION') NOT NULL,
    max_weight DECIMAL(5,2),
    max_tv_size INT,
    current_tilt DECIMAL(5,2),
    current_swivel DECIMAL(5,2),
    current_extension DECIMAL(5,2),
    presets JSON,
    FOREIGN KEY (device_id) REFERENCES devices(device_id) ON DELETE CASCADE
);
remotes
CREATE TABLE remotes (
    device_id VARCHAR(36) PRIMARY KEY,
    remote_type ENUM('IR', 'RF', 'BLUETOOTH') NOT NULL,
    supported_devices JSON,
    battery_level INT,
    last_used TIMESTAMP,
    FOREIGN KEY (device_id) REFERENCES devices(device_id) ON DELETE CASCADE
);
device_groups
CREATE TABLE device_groups (
    group_id VARCHAR(36) PRIMARY KEY,
    user_id VARCHAR(36) NOT NULL,
    name VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users(user_id),
    INDEX idx_user_id (user_id)
);
CREATE TABLE device_group_members (
    group_id VARCHAR(36),
    device_id VARCHAR(36),
    PRIMARY KEY (group_id, device_id),
    FOREIGN KEY (group_id) REFERENCES device_groups(group_id) ON DELETE CASCADE,
    FOREIGN KEY (device_id) REFERENCES devices(device_id) ON DELETE CASCADE
);
scenes
CREATE TABLE scenes (
    scene_id VARCHAR(36) PRIMARY KEY,
    user_id VARCHAR(36) NOT NULL,
    name VARCHAR(255) NOT NULL,
    actions JSON NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_executed TIMESTAMP,
    execution_count INT DEFAULT 0,
    FOREIGN KEY (user_id) REFERENCES users(user_id),
    INDEX idx_user_id (user_id)
);
automation_rules
CREATE TABLE automation_rules (
    rule_id VARCHAR(36) PRIMARY KEY,
    user_id VARCHAR(36) NOT NULL,
    name VARCHAR(255) NOT NULL,
    trigger_type ENUM('TIME', 'EVENT', 'CONDITION') NOT NULL,
    trigger_config JSON NOT NULL,
    actions JSON NOT NULL,
    enabled BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_triggered TIMESTAMP,
    trigger_count INT DEFAULT 0,
    FOREIGN KEY (user_id) REFERENCES users(user_id),
    INDEX idx_user_id (user_id),
    INDEX idx_enabled (enabled),
    INDEX idx_trigger_type (trigger_type)
);
commands
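CREATE TABLE commands (
    command_id VARCHAR(36) NOT NULL,
    user_id VARCHAR(36) NOT NULL,
    device_id VARCHAR(36),
    command_type VARCHAR(50) NOT NULL,
    parameters JSON,
    status ENUM('PENDING', 'SUCCESS', 'FAILED', 'TIMEOUT') DEFAULT 'PENDING',
    timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    response_time INT,
    error_message TEXT,
    -- MySQL requires the partitioning column in every unique key, and
    -- partitioned InnoDB tables cannot declare foreign keys, so the
    -- user_id and device_id references are enforced by the application.
    PRIMARY KEY (command_id, timestamp),
    INDEX idx_user_id (user_id),
    INDEX idx_device_id (device_id),
    INDEX idx_timestamp (timestamp),
    INDEX idx_status (status)
) PARTITION BY RANGE (UNIX_TIMESTAMP(timestamp)) (
    -- Monthly partitions; the names and boundaries below are illustrative.
    PARTITION p2024_01 VALUES LESS THAN (UNIX_TIMESTAMP('2024-02-01 00:00:00')),
    PARTITION p_future VALUES LESS THAN (MAXVALUE)
);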
firmware_updates
CREATE TABLE firmware_updates (
    update_id VARCHAR(36) PRIMARY KEY,
    device_type VARCHAR(50) NOT NULL,
    version VARCHAR(20) NOT NULL,
    file_url VARCHAR(500),
    release_notes TEXT,
    released_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    rollout_percentage INT DEFAULT 0,
    status ENUM('DRAFT', 'ROLLING_OUT', 'COMPLETED', 'ROLLED_BACK') DEFAULT 'DRAFT',
    INDEX idx_device_type (device_type),
    INDEX idx_status (status)
);
Database Sharding Strategy
Sharding by UserID:
- Shard key: user_id (hash-based)
- Number of shards: 1000 (supports 50M users, ~50K users per shard)
- Benefits:
  - User data co-located (devices, scenes, automations)
  - Efficient queries for user-specific data
  - Easy to scale horizontally
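Hash-based sharding by user_id can be sketched in a few lines. The function below is an illustration only: the choice of SHA-256 and the name `shard_for_user` are assumptions, and any stable hash would serve.

```python
import hashlib

NUM_SHARDS = 1000  # shard count stated above


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index in [0, num_shards).

    A cryptographic hash is used only because it is stable across processes
    and machines (unlike Python's built-in hash(), which is salted per run).
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because a user's devices, scenes, and automations all carry the same user_id, they land on the same shard. The cost of plain modulo hashing is that changing the shard count remaps most users; consistent hashing or a lookup directory is the usual way around that.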
Partitioning for Time-Series Data:
- commands table partitioned by timestamp (monthly partitions)
- Benefits:
  - Efficient queries for recent commands
  - Easy to archive old data
  - Better query performance
Read Replicas:
- 3 read replicas per shard for read scaling
- Read replicas handle status queries, device lists
- Write to primary, read from replicas
High-Level Design
┌─────────────┐
│ Mobile Apps │ (iOS, Android)
└──────┬──────┘
       │ HTTPS/WebSocket
       │
┌──────▼──────────────────────────────────────────────┐
│             API Gateway / Load Balancer             │
└──────┬──────────────────────────────────────────────┘
       │
       ├─────────────────┬─────────────────┐
       │                 │                 │
┌──────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
│    API      │   │  WebSocket  │   │   Voice     │
│   Service   │   │   Service   │   │   Service   │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       └─────────┬───────┴─────────────────┘
                 │
    ┌────────────▼────────────┐
    │  Message Queue (Kafka)  │
    └────────────┬────────────┘
                 │
    ┌────────────┴────────────┐
    │                         │
┌───▼──────────┐      ┌───────▼────────┐
│   Device     │      │  Automation    │
│   Service    │      │   Service      │
└───┬──────────┘      └────────────────┘
    │
    ├──────────────────┬──────────────────┐
    │                  │                  │
┌───▼──────────┐   ┌───▼──────────┐   ┌───▼──────────┐
│   Local      │   │   Cloud      │   │  Firmware    │
│   Gateway    │   │   Gateway    │   │  Service     │
│   Service    │   │   Service    │   │              │
└───┬──────────┘   └───┬──────────┘   └───┬──────────┘
    │                  │                  │
    │ MQTT/CoAP        │ MQTT/CoAP        │
    │                  │                  │
┌───▼──────────┐   ┌───▼──────────┐   ┌───▼──────────┐
│   Local      │   │   Cloud      │   │  Devices     │
│   Devices    │   │   Devices    │   │  (via        │
│   (WiFi/     │   │   (MQTT/     │   │  Gateway)    │
│   Bluetooth) │   │   CoAP)      │   │              │
└──────────────┘   └──────────────┘   └──────────────┘

┌─────────────────────────────────────────────────────┐
│                     Data Layer                      │
├─────────────────────────────────────────────────────┤
│ SQL Database (Sharded)   │ Redis Cache │ S3         │
│ - Users, Devices         │ - Device    │ - Firmware │
│ - Scenes, Automations    │   State     │            │
│ - Commands (partitioned) │ - Sessions  │            │
└─────────────────────────────────────────────────────┘
Deep Dive
Component Design
1. API Service
Responsibilities:
- Handle HTTP requests from mobile apps
- Authentication and authorization
- Request validation and rate limiting
- Route requests to appropriate services
- Aggregate responses from multiple services
Technology:
- REST API: Node.js/Go/Python
- API Gateway: AWS API Gateway / Kong
- Rate Limiting: Redis-based rate limiter
- Authentication: JWT tokens
Key Features:
- Request validation and sanitization
- Rate limiting per user (1000 requests/minute)
- Request/response logging
- Circuit breakers for downstream services
- Response caching for device lists
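The per-user limit of 1000 requests/minute can be illustrated with a sliding-window counter. This is an in-process sketch under stated assumptions: the design above calls for a Redis-based limiter shared across API servers, and the class name `SlidingWindowLimiter` is hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` calls per key within any `window_seconds` span."""

    def __init__(
        self,
        limit: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self._clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False
        hits.append(now)
        return True


# Per-user API limit from the list above.
api_limiter = SlidingWindowLimiter(limit=1000, window_seconds=60)
```

A call such as `api_limiter.allow(user_id)` would gate each request. In a multi-server deployment the same logic needs shared state, which is why the post places it in Redis.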
2. WebSocket Service
Responsibilities:
- Maintain persistent connections with mobile apps
- Push real-time device status updates
- Push notification events (automation triggers, device offline)
- Handle connection management and reconnection
Technology:
- WebSocket server: Node.js with Socket.io / Go with Gorilla WebSocket
- Connection management: Redis for connection state
- Message queue: Kafka for event distribution
Key Features:
- Connection pooling and load balancing
- Heartbeat mechanism (ping every 30 seconds)
- Automatic reconnection handling
- Message queuing for offline clients
- Room-based subscriptions (user-specific rooms)
Scalability:
- Horizontal scaling with sticky sessions
- Redis pub/sub for cross-server communication
- Connection limit: 10K connections per server
3. Device Service
Responsibilities:
- Device registration and management
- Device discovery coordination
- Device state management
- Command routing (local vs cloud)
- Device health monitoring
Technology:
- Microservice: Go/Java
- State store: Redis
- Message queue: Kafka
Key Features:
- Device registry with capabilities
- Device state cache (Redis)
- Command queue management
- Device health checks (ping every 60 seconds)
- Offline detection (mark offline after 5 minutes of no response)
Device State Management:
- In-memory cache (Redis) for fast access
- Database for persistence
- Cache invalidation on state updates
- TTL-based expiration for offline devices
4. Local Gateway Service
Responsibilities:
- Bridge between cloud and local network devices
- Handle local device discovery (mDNS/Bonjour)
- Route commands to local devices
- Aggregate device status updates
- Handle local network protocols (WiFi, Bluetooth, Zigbee)
Technology:
- Gateway software: Python/Go
- Local protocols: mDNS, CoAP, MQTT
- Network discovery: Bonjour, UPnP
Key Features:
- Automatic local network discovery
- Protocol translation (cloud MQTT → local protocols)
- Local command execution (low latency)
- Device status aggregation
- Offline operation support
Architecture:
- Runs on user’s router/hub or dedicated gateway device
- Connects to cloud via MQTT
- Maintains local device registry
- Handles local device authentication
5. Cloud Gateway Service
Responsibilities:
- Handle devices connected directly to cloud (not on local network)
- MQTT/CoAP message broker
- Device-to-cloud communication
- Command delivery to cloud-connected devices
Technology:
- MQTT Broker: Mosquitto / AWS IoT Core / HiveMQ
- CoAP Server: CoAP library
- Message queue: Kafka
Key Features:
- MQTT topic management (per device/user)
- QoS levels (0, 1, 2)
- Retained messages for device state
- Last Will and Testament for offline detection
- Message persistence
MQTT Topics:
- devices/{deviceId}/commands - Commands to device
- devices/{deviceId}/status - Status updates from device
- users/{userId}/events - User events (automations, notifications)
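The topic layout above lends itself to small helper functions. The sketch below builds the three topics and includes a simplified matcher for MQTT's `+` (single-level) and `#` (multi-level) wildcards, so a service can subscribe to a filter such as `devices/+/status`. The helper names are illustrative, and the matcher ignores special cases such as `$`-prefixed topics.

```python
def command_topic(device_id: str) -> str:
    return f"devices/{device_id}/commands"


def status_topic(device_id: str) -> str:
    return f"devices/{device_id}/status"


def user_events_topic(user_id: str) -> str:
    return f"users/{user_id}/events"


def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if `topic` matches an MQTT-style filter with + and # wildcards."""
    filter_parts = topic_filter.split("/")
    topic_parts = topic.split("/")
    for i, part in enumerate(filter_parts):
        if part == "#":
            # Multi-level wildcard: matches the parent level and everything below it.
            return True
        if i >= len(topic_parts):
            return False
        if part != "+" and part != topic_parts[i]:
            return False
    return len(filter_parts) == len(topic_parts)
```

Keeping commands and status on separate topics lets a device subscribe only to its own command topic while backend services subscribe to status across all devices with a single wildcard filter.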
6. Automation Service
Responsibilities:
- Evaluate automation rule triggers
- Execute automation actions
- Schedule time-based automations
- Handle event-based triggers
- Log automation executions
Technology:
- Microservice: Python/Go
- Scheduler: Cron / Quartz / Temporal
- Event stream: Kafka
Key Features:
- Time-based trigger evaluation (cron expressions)
- Event-based trigger listening (Kafka consumers)
- Condition evaluation engine
- Action execution orchestration
- Execution logging and monitoring
Trigger Types:
- Time-based: Cron expressions, specific times
- Event-based: Device state changes, user actions
- Condition-based: Device state conditions, sensor readings
Scalability:
- Distributed scheduler (leader election)
- Partition automation rules by user
- Parallel execution of independent automations
7. Firmware Service
Responsibilities:
- Manage firmware versions
- Coordinate firmware rollouts
- Handle firmware downloads
- Track update status
- Rollback failed updates
Technology:
- Microservice: Go/Python
- Storage: S3 for firmware files
- CDN: CloudFront for distribution
Key Features:
- Staged rollouts (1% → 10% → 50% → 100%)
- A/B testing support
- Rollback mechanism
- Update status tracking
- Delta updates (only send changes)
Rollout Strategy:
- Canary deployment: 1% → monitor → 10% → monitor → full rollout
- Device filtering: by model, region, firmware version
- Automatic rollback on high failure rate (>5%)
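One way to implement the staged rollout is to hash each device into a stable bucket from 0 to 99 and compare it with the current rollout percentage, so a device admitted at 1% stays admitted at 10% and beyond. The following is a sketch under that assumption; the function names and the use of 0 as a rollback signal are hypothetical.

```python
import hashlib

ROLLOUT_STAGES = [1, 10, 50, 100]   # percentages from the staged rollout above
FAILURE_THRESHOLD = 0.05            # automatic rollback above a 5% failure rate


def rollout_bucket(device_id: str, update_id: str) -> int:
    """Stable bucket in [0, 100) for this device and this particular update."""
    digest = hashlib.sha256(f"{update_id}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_eligible(device_id: str, update_id: str, rollout_percentage: int) -> bool:
    return rollout_bucket(device_id, update_id) < rollout_percentage


def next_stage(current_percentage: int, attempted: int, failed: int) -> int:
    """Return the next rollout percentage, or 0 to signal a rollback."""
    if attempted > 0 and failed / attempted > FAILURE_THRESHOLD:
        return 0
    for stage in ROLLOUT_STAGES:
        if stage > current_percentage:
            return stage
    return 100
```

Including the update ID in the hash means a different 1% of devices acts as the canary for each release, rather than the same devices absorbing every early failure.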
Detailed Design
Device Discovery Process
Step 1: Local Network Discovery
- User initiates discovery from mobile app
- App sends discovery request to API Service
- API Service checks if user has local gateway
- If gateway exists:
  - API Service sends discovery command to Local Gateway via MQTT
  - Gateway performs mDNS/Bonjour scan on local network
  - Gateway sends discovered devices to API Service
- If no gateway:
  - API Service performs cloud-based discovery (slower)
  - Checks device registry for unpaired devices in user’s area
Step 2: Device Pairing
- User selects device from discovered list
- Device displays pairing code (6 digits)
- User enters code in app
- App sends pairing request: POST /api/v1/devices/pair { deviceId, pairingCode }
- API Service validates pairing code with device
- Device generates access token and returns to API Service
- API Service creates device record in database
- API Service establishes MQTT subscription for device
- App receives device registration confirmation
Security:
- Pairing codes expire after 5 minutes
- One-time use pairing codes
- TLS encryption for pairing process
- Device authentication via certificates
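The first two security properties (5-minute expiry and one-time use) can be sketched as a small registry. This is an in-memory illustration with hypothetical names (`PairingRegistry`, `issue`, `redeem`); a real deployment would keep the pending codes in a shared store and pair this with the rate limiting on attempts mentioned later in the post.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Callable

PAIRING_CODE_TTL_SECONDS = 5 * 60  # codes expire after 5 minutes


@dataclass
class PendingCode:
    code: str
    expires_at: float


class PairingRegistry:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._pending: dict[str, PendingCode] = {}

    def issue(self, device_id: str) -> str:
        """Generate a fresh 6-digit code, replacing any earlier one for the device."""
        code = f"{secrets.randbelow(10**6):06d}"
        self._pending[device_id] = PendingCode(code, self._clock() + PAIRING_CODE_TTL_SECONDS)
        return code

    def redeem(self, device_id: str, code: str) -> bool:
        entry = self._pending.get(device_id)
        if entry is None:
            return False
        if self._clock() >= entry.expires_at:
            del self._pending[device_id]          # expired: discard
            return False
        if not secrets.compare_digest(entry.code, code):
            return False                          # wrong code: keep entry, caller rate-limits
        del self._pending[device_id]              # one-time use: consume on success
        return True
```

The code is drawn from `secrets` rather than `random` so it is not predictable, and the comparison is constant-time to avoid leaking how many leading digits matched.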
Command Execution
Local Command Path (Optimized):
- User sends command via app
- API Service checks device location (local vs cloud)
- If device is local:
  - API Service sends command to Local Gateway via MQTT
  - Gateway forwards to device via local protocol (WiFi/Bluetooth)
  - Device executes command (< 50ms latency)
  - Device sends status update to gateway
  - Gateway forwards to API Service via MQTT
  - API Service updates device state in Redis and database
  - API Service pushes status update to app via WebSocket
- Total latency: < 200ms
Cloud Command Path:
- User sends command via app
- API Service routes to Cloud Gateway
- Cloud Gateway publishes command to MQTT topic: devices/{deviceId}/commands
- Device receives command via MQTT
- Device executes command
- Device publishes status to: devices/{deviceId}/status
- Cloud Gateway forwards to API Service
- API Service updates state and pushes to app
- Total latency: < 500ms
Command Reliability:
- MQTT QoS 1 (at least once delivery)
- Command acknowledgment required
- Retry mechanism (3 retries, exponential backoff)
- Command timeout (5 seconds)
- Command status tracking in database
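The retry policy above (3 retries with exponential backoff) might look like the sketch below. The `send` callable is a stand-in assumed to publish the command and return True only if the device acknowledges within the 5-second timeout; `base_delay` and the exception name are illustrative choices, not values from the design.

```python
import time
from typing import Callable


class CommandFailed(Exception):
    """Raised when a command is not acknowledged after all retries."""


def send_with_retry(
    send: Callable[[], bool],
    max_retries: int = 3,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Try once, then retry up to `max_retries` times. Returns attempts used."""
    for attempt in range(max_retries + 1):
        if send():
            return attempt + 1
        if attempt < max_retries:
            # Backoff doubles each time: base, 2x base, 4x base, ...
            sleep(base_delay * (2 ** attempt))
    raise CommandFailed(f"no acknowledgment after {max_retries + 1} attempts")
```

Because QoS 1 is at-least-once, a retried command can arrive more than once. Devices should therefore treat the command ID as an idempotency key and ignore duplicates, which matters for relative commands such as "tilt 15 degrees".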
Scene Execution
Parallel Execution:
- User triggers scene
- API Service retrieves scene definition
- API Service validates all devices are accessible
- API Service executes actions in parallel:
  - Creates command tasks for each action
  - Executes commands concurrently
  - Tracks execution status
- API Service aggregates results
- API Service sends execution summary to app
Error Handling:
- If device is offline: skip action, log error
- If command fails: retry once, then mark failed
- Partial success: return success with failed actions list
- Rollback: optional rollback on critical failures
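Parallel execution with the error handling above (skip offline devices, retry once, report partial success) can be sketched with `asyncio.gather`. The `send` and `is_online` callables are hypothetical stand-ins for the command path and the device state lookup; rollback is left out.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Action:
    device_id: str
    command: str


async def run_action(
    action: Action,
    send: Callable[[Action], Awaitable[bool]],
    is_online: Callable[[str], bool],
) -> tuple[str, str]:
    if not is_online(action.device_id):
        return action.device_id, "skipped"       # offline: skip the action
    for _ in range(2):                           # first attempt plus one retry
        try:
            if await send(action):
                return action.device_id, "succeeded"
        except Exception:
            pass                                 # treat errors like a failed attempt
    return action.device_id, "failed"


async def execute_scene(
    actions: list[Action],
    send: Callable[[Action], Awaitable[bool]],
    is_online: Callable[[str], bool],
) -> dict[str, list[str]]:
    """Run all actions concurrently and group device IDs by outcome."""
    outcomes = await asyncio.gather(*(run_action(a, send, is_online) for a in actions))
    summary: dict[str, list[str]] = {"succeeded": [], "failed": [], "skipped": []}
    for device_id, status in outcomes:
        summary[status].append(device_id)
    return summary
```

Since `gather` returns results in the order the actions were passed, the summary preserves scene order. One slow device does not delay the others, which helps keep scene execution inside the 2-second target.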
Automation Rule Engine
Time-Based Triggers:
- Distributed cron scheduler
- Evaluates rules every minute
- Executes matching rules
- Handles timezone correctly
Event-Based Triggers:
- Kafka consumer listens to device events
- Filters events matching trigger conditions
- Evaluates additional conditions
- Executes actions if all conditions met
Condition Evaluation:
- Supports: equals, greater than, less than, contains, regex
- Boolean logic: AND, OR, NOT
- Device state conditions
- Time-based conditions
- User presence conditions
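The operators and boolean logic listed above fit a small recursive evaluator. The nested-dictionary rule format below (`and`/`or`/`not` nodes with `field`/`op`/`value` leaves) is an assumed representation for illustration; the post does not specify how trigger_config is structured.

```python
import re
from typing import Any, Callable

# Leaf operators named in the list above. Operand types are assumed compatible.
OPERATORS: dict[str, Callable[[Any, Any], bool]] = {
    "equals": lambda actual, expected: actual == expected,
    "greater_than": lambda actual, expected: actual > expected,
    "less_than": lambda actual, expected: actual < expected,
    "contains": lambda actual, expected: expected in actual,
    "regex": lambda actual, pattern: re.search(pattern, str(actual)) is not None,
}


def evaluate(condition: dict, state: dict) -> bool:
    """Evaluate a condition tree against a flat mapping of current state values."""
    if "and" in condition:
        return all(evaluate(child, state) for child in condition["and"])
    if "or" in condition:
        return any(evaluate(child, state) for child in condition["or"])
    if "not" in condition:
        return not evaluate(condition["not"], state)
    actual = state.get(condition["field"])
    if actual is None:
        return False  # unknown state never satisfies a leaf condition
    return OPERATORS[condition["op"]](actual, condition["value"])


rule = {
    "and": [
        {"field": "tv.power", "op": "equals", "value": "on"},
        {"or": [
            {"field": "mount.tilt", "op": "greater_than", "value": 10},
            {"field": "mount.tilt", "op": "less_than", "value": 0},
        ]},
        {"not": {"field": "scene.active", "op": "regex", "value": "^Sleep"}},
    ]
}
```

Calling `evaluate(rule, current_state)` returns a single boolean, so the same function can serve time-based, event-based, and condition-based triggers once their inputs are expressed as state values.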
Execution:
- Similar to scene execution
- Logs all executions
- Rate limiting (max 10 executions per rule per hour)
- Failure notifications
Scalability Considerations
Horizontal Scaling
API Service:
- Stateless design enables horizontal scaling
- Load balancer distributes requests
- Session state in Redis (not server memory)
- Auto-scaling based on CPU/memory metrics
WebSocket Service:
- Sticky sessions for connection affinity
- Redis pub/sub for cross-server communication
- Connection limit: 10K per server
- Auto-scaling based on connection count
Device Service:
- Partitioned by user ID
- Stateless design
- Cache layer (Redis) for device state
- Database read replicas for scaling reads
Message Queue:
- Kafka partitions for parallel processing
- Partition by user ID for ordering
- Consumer groups for parallel consumption
- Auto-scaling consumers
Caching Strategy
Redis Cache Layers:
- Device State Cache: TTL 5 minutes
  - Key: device:{deviceId}:state
  - Value: JSON device state
- User Device List Cache: TTL 1 minute
  - Key: user:{userId}:devices
  - Value: List of device IDs
- Scene Cache: TTL 5 minutes
  - Key: scene:{sceneId}
  - Value: Scene definition
- Session Cache: TTL 24 hours
  - Key: session:{sessionId}
  - Value: User session data
Cache Invalidation:
- Device state: invalidate on state update
- Device list: invalidate on device add/remove
- Scene: invalidate on scene update
- Write-through cache for critical data
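The interaction between TTLs and explicit invalidation can be shown with a small in-memory stand-in for Redis. The `TTLCache` class and `get_device_state` helper are hypothetical; in the real design these would be Redis reads and writes with an expiry, followed by a key deletion when state changes.

```python
import time
from typing import Any, Callable, Optional

DEVICE_STATE_TTL_SECONDS = 5 * 60  # device state cache TTL from the list above


class TTLCache:
    """Minimal in-memory cache with per-entry expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]   # lazily evict expired entries
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._entries[key] = (self._clock() + ttl_seconds, value)

    def invalidate(self, key: str) -> None:
        self._entries.pop(key, None)


def device_state_key(device_id: str) -> str:
    return f"device:{device_id}:state"


def get_device_state(cache: TTLCache, device_id: str, load_from_db: Callable[[str], Any]) -> Any:
    """Read-through lookup: serve from cache, fall back to the database on a miss."""
    key = device_state_key(device_id)
    state = cache.get(key)
    if state is None:
        state = load_from_db(device_id)
        cache.set(key, state, DEVICE_STATE_TTL_SECONDS)
    return state
```

The TTL acts as a safety net: even if an invalidation message is lost, a stale entry survives for at most five minutes.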
Database Optimization
Indexing:
- User ID indexes for user queries
- Device ID indexes for device queries
- Timestamp indexes for time-range queries
- Composite indexes for common query patterns
Query Optimization:
- Avoid N+1 queries (batch loading)
- Use database connection pooling
- Read from replicas for non-critical queries
- Pagination for large result sets
Partitioning:
- Commands table partitioned by timestamp (monthly)
- Archive old partitions to cold storage
- Reduces query time for recent data
Security Considerations
Authentication & Authorization
User Authentication:
- OAuth 2.0 / JWT tokens
- Token expiration: 24 hours
- Refresh tokens: 30 days
- Multi-factor authentication (optional)
Device Authentication:
- Certificate-based authentication
- Device certificates issued during pairing
- Certificate rotation every 90 days
- Revocation list for compromised devices
Authorization:
- Role-based access control (RBAC)
- User permissions: Owner, Admin, Guest
- Device-level permissions
- Scene/automation permissions
Data Encryption
In Transit:
- TLS 1.3 for all API communication
- MQTT over TLS (MQTTS)
- WebSocket over TLS (WSS)
- End-to-end encryption for sensitive commands
At Rest:
- Database encryption (AES-256)
- Encrypted backups
- Key management via AWS KMS / HashiCorp Vault
Device Security
Pairing Security:
- Time-limited pairing codes (5 minutes)
- One-time use codes
- Rate limiting on pairing attempts
- IP-based restrictions
Command Security:
- Command signing (HMAC)
- Replay attack prevention (nonces)
- Command validation on device
- Rate limiting per device
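Command signing and replay prevention can be combined in one envelope: the sender attaches a random nonce and an HMAC over the command plus nonce, and the device rejects bad signatures and any nonce it has already seen. The envelope layout and names below are assumptions for illustration, built on Python's standard `hmac` module.

```python
import hashlib
import hmac
import json
import secrets


def _canonical(envelope: dict) -> bytes:
    """Serialize deterministically so signer and verifier hash identical bytes."""
    return json.dumps(envelope, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_command(key: bytes, command: dict) -> dict:
    envelope = {"command": command, "nonce": secrets.token_hex(16)}
    signature = hmac.new(key, _canonical(envelope), hashlib.sha256).hexdigest()
    return {**envelope, "signature": signature}


class CommandVerifier:
    """Device-side check: valid HMAC and a nonce that has not been used before."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._seen_nonces: set[str] = set()

    def verify(self, envelope: dict) -> bool:
        unsigned = {k: v for k, v in envelope.items() if k != "signature"}
        expected = hmac.new(self._key, _canonical(unsigned), hashlib.sha256).hexdigest()
        provided = str(envelope.get("signature", ""))
        # Constant-time comparison avoids leaking how much of the signature matched.
        if not hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8")):
            return False
        nonce = unsigned.get("nonce")
        if not isinstance(nonce, str) or nonce in self._seen_nonces:
            return False  # replayed or malformed
        self._seen_nonces.add(nonce)
        return True
```

This sketch remembers every nonce forever, which a constrained device cannot do. A common refinement is to add a timestamp to the signed envelope and keep nonces only for a short acceptance window.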
Firmware Security:
- Signed firmware updates
- Secure boot verification
- Rollback protection
- Update authentication
Network Security
Local Network:
- Device isolation (VLAN)
- Firewall rules
- Intrusion detection
- Network segmentation
Cloud Security:
- DDoS protection (Cloudflare)
- Rate limiting
- IP whitelisting for gateways
- VPN for gateway connections
Monitoring & Observability
Metrics
System Metrics:
- Request rate (QPS)
- Error rate (4xx, 5xx)
- Latency (p50, p95, p99)
- Throughput
Device Metrics:
- Online device count
- Command success rate
- Average command latency
- Device discovery time
- Firmware update success rate
Business Metrics:
- Daily active users
- Devices per user
- Commands per user
- Scene executions
- Automation triggers
Monitoring Tools:
- Prometheus for metrics collection
- Grafana for visualization
- CloudWatch / Datadog for cloud metrics
- Custom dashboards
Logging
Log Levels:
- ERROR: Failures, exceptions
- WARN: Degraded performance, retries
- INFO: Important events, state changes
- DEBUG: Detailed debugging information
Log Aggregation:
- Centralized logging (ELK Stack / Splunk)
- Structured logging (JSON)
- Log retention: 30 days
- Search and analysis capabilities
Key Log Events:
- Device discovery and pairing
- Command executions
- Scene executions
- Automation triggers
- Firmware updates
- Error conditions
Alerting
Critical Alerts:
- Service downtime
- High error rate (>1%)
- High latency (p95 > 1s)
- Database connection failures
- Message queue backlog
Warning Alerts:
- Elevated error rate (>0.5%)
- High latency (p95 > 500ms)
- Low device online rate
- High command failure rate
Alert Channels:
- PagerDuty for critical alerts
- Slack for warnings
- Email for informational alerts
- SMS for on-call engineers
Distributed Tracing
Tracing:
- OpenTelemetry / Jaeger
- Trace requests across services
- Identify bottlenecks
- Debug distributed issues
Trace Points:
- API request entry
- Service calls
- Database queries
- Message queue operations
- External API calls
Trade-offs and Optimizations
Local vs Cloud Control
Trade-off:
- Local control: Lower latency (< 200ms) but requires gateway
- Cloud control: Higher latency (< 500ms) but works everywhere
Optimization:
- Prefer local control when gateway available
- Fallback to cloud control
- Hybrid approach: local for commands, cloud for state sync
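The routing preference described above reduces to a short decision function. The inputs and the `QUEUED` outcome (for devices that are reachable by neither path, matching the command queuing mentioned later) are assumptions about how the Device Service would model reachability.

```python
from enum import Enum


class Route(Enum):
    LOCAL = "local"     # via the Local Gateway on the user's network
    CLOUD = "cloud"     # directly via the Cloud Gateway
    QUEUED = "queued"   # device unreachable; hold the command for later


def choose_route(
    gateway_online: bool,
    device_behind_gateway: bool,
    device_cloud_connected: bool,
) -> Route:
    """Prefer the low-latency local path and fall back to the cloud path."""
    if gateway_online and device_behind_gateway:
        return Route.LOCAL
    if device_cloud_connected:
        return Route.CLOUD
    return Route.QUEUED
```

Making the fallback order explicit in one place keeps the latency targets testable: only commands routed as `Route.LOCAL` are held to the < 200ms budget.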
Consistency vs Availability
Trade-off:
- Strong consistency: Slower, more complex
- Eventual consistency: Faster, simpler
Decision:
- Eventual consistency for device state (acceptable delay)
- Strong consistency for critical commands (mount position)
- Conflict resolution for concurrent updates
Caching vs Freshness
Trade-off:
- Aggressive caching: Better performance, stale data
- Less caching: Fresh data, higher latency
Optimization:
- Cache device state (TTL 5 minutes)
- Invalidate on updates
- Use WebSocket for real-time updates
- Cache device lists (TTL 1 minute)
Batch vs Individual Commands
Trade-off:
- Individual commands: Simpler, higher overhead
- Batch commands: More efficient, more complex
Optimization:
- Batch commands in scenes
- Individual commands for single actions
- Command queuing for offline devices
Synchronous vs Asynchronous Processing
Trade-off:
- Synchronous: Simpler, blocks request
- Asynchronous: More complex, non-blocking
Decision:
- Synchronous for simple commands
- Asynchronous for scenes and automations
- Background processing for firmware updates
What Interviewers Look For
IoT Systems Skills
- Device Communication
  - Hybrid local/cloud architecture
  - MQTT protocol
  - Low-latency local control
  - Red Flags: Cloud-only, high latency, no offline support
- Device Discovery
  - mDNS/Bonjour for local discovery
  - Cloud-based fallback
  - Red Flags: No discovery, manual setup, poor UX
- Offline Support
  - Local gateway functionality
  - Offline command execution
  - Red Flags: No offline, network required, poor UX
Real-Time Systems Skills
- WebSocket Architecture
  - Real-time app updates
  - Bidirectional communication
  - Red Flags: Polling, high latency, no real-time
- Command Processing
  - Low-latency local commands
  - Reliable delivery
  - Red Flags: High latency, unreliable, no retry
- State Synchronization
  - Device state consistency
  - Real-time updates
  - Red Flags: Stale state, inconsistent, no sync
Distributed Systems Skills
- Microservices Architecture
  - Service decomposition
  - Clear boundaries
  - Red Flags: Monolithic, unclear boundaries, tight coupling
- Scalability Design
  - Horizontal scaling
  - Database sharding
  - Message queues
  - Red Flags: Vertical scaling, no sharding, bottlenecks
- Fault Tolerance
  - Device failure handling
  - Graceful degradation
  - Red Flags: No fault tolerance, system failure, poor UX
Problem-Solving Approach
- Hybrid Architecture
  - Local for performance
  - Cloud for accessibility
  - Red Flags: Single approach, no optimization, poor trade-offs
- Edge Cases
  - Network failures
  - Device offline
  - Command failures
  - Red Flags: Ignoring edge cases, no handling
- Trade-off Analysis
  - Latency vs consistency
  - Local vs cloud
  - Red Flags: No trade-offs, dogmatic choices
System Design Skills
- Component Design
  - API Gateway
  - Device service
  - Automation service
  - Red Flags: Monolithic, unclear boundaries
- Security Design
  - End-to-end encryption
  - Device authentication
  - Secure pairing
  - Red Flags: No encryption, insecure, no auth
- Automation System
  - Rule engine
  - Event-based triggers
  - Red Flags: No automation, manual only, poor UX
Communication Skills
- Architecture Explanation
  - Can explain hybrid architecture
  - Understands local/cloud trade-offs
  - Red Flags: No understanding, vague
- Protocol Explanation
  - Can explain MQTT/WebSocket
  - Understands device communication
  - Red Flags: No understanding, vague
Meta-Specific Focus
- Real-Time Systems Expertise
  - WebSocket knowledge
  - Low-latency design
  - Key: Show real-time systems knowledge
- Hybrid Architecture Expertise
  - Local/cloud optimization
  - Offline-first design
  - Key: Demonstrate architecture expertise
Summary
Key Takeaways:
- Architecture: Microservices architecture with API Gateway, WebSocket service, Device service, Local/Cloud gateways, Automation service, and Firmware service
- Device Communication: Hybrid approach with local gateway for low-latency local control and cloud gateway for remote access
- Scalability: Horizontal scaling with stateless services, caching (Redis), database sharding, and message queues (Kafka)
- Real-Time Updates: WebSocket for app updates, MQTT for device communication, Redis pub/sub for service communication
- Device Discovery: mDNS/Bonjour for local discovery, cloud-based discovery as fallback
- Automation: Distributed rule engine with time-based, event-based, and condition-based triggers
- Security: End-to-end encryption, certificate-based device authentication, secure pairing, command signing
- Reliability: Command retries, offline support, device health monitoring, graceful degradation
Design Highlights:
- Low Latency: Local gateway enables < 200ms command latency
- Offline Support: Local control works offline, cloud sync when online
- Scalability: Handles 200M+ devices, 10M+ concurrent sessions
- Reliability: 99.9% uptime, no command loss, device state consistency
- Security: End-to-end encryption, secure device pairing, authentication
Common Interview Topics Covered:
- IoT device management
- Real-time communication (WebSocket, MQTT)
- Device discovery and pairing
- Automation and scheduling
- Microservices architecture
- Caching strategies
- Database sharding
- Message queues
- Security and encryption
- Monitoring and observability
This design demonstrates how to build a scalable, reliable, and secure IoT platform that connects millions of devices while maintaining low latency and high availability. The hybrid local/cloud architecture optimizes for both performance and accessibility.