Introduction

A file storage and synchronization system enables users to store files in the cloud and access them from multiple devices with automatic synchronization. Systems like Dropbox and Google Drive handle petabytes of data, millions of users, and billions of file operations per day, requiring efficient storage, real-time synchronization, conflict resolution, and bandwidth optimization.

This post provides a detailed walkthrough of designing a scalable file storage and sync system, covering key architectural decisions, delta synchronization, conflict resolution, versioning, and scalability challenges. It is one of the most common system design interview questions, testing your understanding of distributed systems, storage systems, real-time synchronization, and handling massive scale.

Table of Contents

  1. Problem Statement
  2. Requirements
  3. Capacity Estimation
  4. Core Entities
  5. API
  6. Data Flow
  7. Database Design
  8. High-Level Design
  9. Deep Dive
  10. Summary

Problem Statement

Design a file storage and synchronization system similar to Dropbox or Google Drive with the following features:

  1. File upload and download
  2. File synchronization across multiple devices
  3. Real-time sync (changes propagate quickly)
  4. Conflict resolution when same file edited on multiple devices
  5. File versioning and history
  6. File sharing and collaboration
  7. Folder hierarchy and organization
  8. Search functionality
  9. Delta sync (only sync changed parts)
  10. Offline access with sync when online

Scale Requirements:

  • 500 million+ users
  • 100 million+ daily active users
  • 1 billion+ files
  • 10 petabytes+ total storage
  • 100 million+ file operations per day
  • Peak: 1 million concurrent users
  • Average file size: 10MB
  • Large files: Up to 10GB

Requirements

Functional Requirements

Core Features:

  1. File Management: Upload, download, delete, rename files
  2. Folder Management: Create, delete, organize folders
  3. Synchronization: Sync files across devices automatically
  4. Versioning: Keep file history and allow rollback
  5. Sharing: Share files/folders with other users
  6. Collaboration: Multiple users edit same file
  7. Search: Search files by name, content, metadata
  8. Offline Access: Access files offline, sync when online
  9. Delta Sync: Only sync changed parts of files
  10. Conflict Resolution: Handle conflicts when same file edited simultaneously

Out of Scope:

  • Real-time collaborative editing (like Google Docs)
  • Advanced file preview/editing
  • Video streaming
  • Mobile app (focus on web API)

Non-Functional Requirements

  1. Availability: 99.9% uptime
  2. Reliability: No data loss, guaranteed file storage
  3. Performance:
    • Upload: < 5 seconds for 10MB file
    • Download: < 3 seconds for 10MB file
    • Sync latency: < 1 second for small changes
  4. Scalability: Handle 1B+ files, 10PB+ storage
  5. Consistency: Strong consistency for metadata, eventual for sync
  6. Durability: 99.999999999% (11 9’s) durability
  7. Bandwidth Efficiency: Minimize data transfer with delta sync

Capacity Estimation

Traffic Estimates

  • Total Users: 500 million
  • Daily Active Users (DAU): 100 million
  • File Operations per Day (writes: uploads, edits, deletes, moves): 100 million
  • Uploads per Day: 50 million
  • Downloads per Day: 200 million
  • Sync Operations per Day: 500 million
  • Peak Concurrent Users: 1 million
  • Peak QPS: 10,000 operations/second
  • Normal QPS: 1,000 operations/second

Storage Estimates

File Storage:

  • 1 billion files × 10MB average = 10PB
  • With replication (3x): 30PB
  • With versioning (5 versions on average, stored as chunk-level deltas rather than full copies): ~50PB

Metadata Storage:

  • 1B files × 1KB metadata = 1TB
  • User metadata: 500M users × 2KB = 1TB
  • Sharing metadata: 100M shares × 500 bytes = 50GB
  • Total metadata: ~2TB

Total Storage: ~50PB (dominated by file chunks; metadata is negligible)

Bandwidth Estimates

Upload Bandwidth:

  • 50M uploads/day × 10MB = 500TB/day = 5.8GB/s average
  • Peak: 10K uploads/sec × 10MB = 100GB/s = 800Gbps

Download Bandwidth:

  • 200M downloads/day × 10MB = 2PB/day = 23GB/s average
  • Peak: 10K downloads/sec × 10MB = 100GB/s = 800Gbps

Sync Bandwidth:

  • 500M sync ops/day × 1MB (delta) = 500TB/day = 5.8GB/s average
  • Peak: 10K sync ops/sec × 1MB = 10GB/s = 80Gbps

Total Peak Bandwidth: ~1.68Tbps
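
These numbers follow from simple back-of-the-envelope arithmetic. The sketch below re-derives the averages and the peak from the estimates above, so the assumptions (10MB average file, 1MB average delta, roughly 10K peak operations/second per category) are easy to adjust:

# Back-of-the-envelope bandwidth math using the estimates above.
SECONDS_PER_DAY = 86_400

uploads_per_day = 50_000_000
downloads_per_day = 200_000_000
sync_ops_per_day = 500_000_000
avg_file_mb = 10   # average file size
avg_delta_mb = 1   # average delta-sync payload

def avg_gbps(ops_per_day, mb_per_op):
    # ops/day * MB per op -> MB/s -> GB/s -> Gbps
    return ops_per_day * mb_per_op / SECONDS_PER_DAY / 1000 * 8

def peak_gbps(ops_per_sec, mb_per_op):
    return ops_per_sec * mb_per_op * 8 / 1000

print(round(avg_gbps(uploads_per_day, avg_file_mb)))     # ~46 Gbps upload average
print(round(avg_gbps(downloads_per_day, avg_file_mb)))   # ~185 Gbps download average
print(round(avg_gbps(sync_ops_per_day, avg_delta_mb)))   # ~46 Gbps sync average
print(round(peak_gbps(10_000, 10) * 2 + peak_gbps(10_000, 1)))  # ~1680 Gbps ≈ 1.68 Tbps peak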

Core Entities

User

  • user_id (UUID)
  • email
  • username
  • password_hash
  • storage_quota (bytes)
  • storage_used (bytes)
  • created_at
  • updated_at

File

  • file_id (UUID)
  • user_id
  • name
  • path (full path including folders)
  • parent_folder_id
  • file_type (file, folder)
  • size (bytes)
  • content_hash (SHA-256)
  • mime_type
  • version (INT)
  • created_at
  • updated_at
  • deleted_at

File Version

  • version_id (UUID)
  • file_id
  • version_number (INT)
  • content_hash
  • size
  • created_at
  • created_by (user_id)

File Chunk

  • chunk_id (UUID)
  • file_id
  • version_id
  • chunk_index (INT)
  • chunk_hash (SHA-256)
  • size (bytes)
  • storage_url (S3 URL)

Device

  • device_id (UUID)
  • user_id
  • device_name
  • device_type (desktop, mobile, web)
  • last_sync_at
  • sync_token (for incremental sync)

Sync Event

  • event_id (UUID)
  • file_id
  • user_id
  • device_id
  • event_type (create, update, delete, move, rename)
  • timestamp
  • metadata (JSON)

Share

  • share_id (UUID)
  • file_id
  • owner_id
  • shared_with_id (user_id)
  • permission (read, write, admin)
  • created_at
  • expires_at

API

1. Upload File

POST /api/v1/files/upload
Request: multipart/form-data
  - file: file data
  - path: /folder/file.txt
  - parent_folder_id: uuid

Response:
{
  "file_id": "uuid",
  "name": "file.txt",
  "path": "/folder/file.txt",
  "size": 1048576,
  "version": 1,
  "created_at": "2025-11-13T10:00:00Z"
}

2. Download File

GET /api/v1/files/{file_id}/download
Response: file stream with appropriate headers

3. Get File Metadata

GET /api/v1/files/{file_id}
Response:
{
  "file_id": "uuid",
  "name": "file.txt",
  "path": "/folder/file.txt",
  "size": 1048576,
  "version": 5,
  "mime_type": "text/plain",
  "created_at": "2025-11-13T10:00:00Z",
  "updated_at": "2025-11-13T15:30:00Z"
}

4. Sync Files (Get Changes)

GET /api/v1/sync?device_id=uuid&sync_token=token&limit=100
Response:
{
  "sync_token": "new_token",
  "changes": [
    {
      "file_id": "uuid",
      "event_type": "update",
      "path": "/folder/file.txt",
      "version": 6,
      "timestamp": "2025-11-13T15:30:00Z"
    }
  ],
  "has_more": false
}

5. Update File (Delta Sync)

POST /api/v1/files/{file_id}/update
Request:
{
  "chunks": [
    {
      "chunk_index": 0,
      "chunk_hash": "abc123",
      "data": "base64_encoded_data"
    }
  ],
  "new_content_hash": "def456"
}

Response:
{
  "file_id": "uuid",
  "version": 6,
  "updated_at": "2025-11-13T15:30:00Z"
}

6. Share File

POST /api/v1/files/{file_id}/share
Request:
{
  "shared_with_id": "user_uuid",
  "permission": "read"
}

Response:
{
  "share_id": "uuid",
  "file_id": "uuid",
  "shared_with_id": "user_uuid",
  "permission": "read",
  "created_at": "2025-11-13T10:00:00Z"
}

7. Get File Versions

GET /api/v1/files/{file_id}/versions
Response:
{
  "versions": [
    {
      "version_id": "uuid",
      "version_number": 5,
      "size": 1048576,
      "created_at": "2025-11-13T15:30:00Z",
      "created_by": "user_uuid"
    }
  ]
}

8. Restore File Version

POST /api/v1/files/{file_id}/restore
Request:
{
  "version_number": 3
}

Response:
{
  "file_id": "uuid",
  "version": 6,
  "restored_from_version": 3
}
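
For illustration, here is a minimal client for the upload and sync endpoints above, written against a hypothetical host with the Python requests library (the authentication flow, error handling, and retries are omitted):

import requests

BASE_URL = "https://api.example.com/api/v1"  # hypothetical host
TOKEN = "jwt-token"                          # obtained from the auth flow (not shown)
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# 1. Upload a file (multipart/form-data, as in the API above)
with open("report.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/files/upload",
        headers=HEADERS,
        files={"file": f},
        data={"path": "/docs/report.pdf", "parent_folder_id": "folder-uuid"},
    )
file_id = resp.json()["file_id"]

# 2. Ask for changes since the last sync token
resp = requests.get(
    f"{BASE_URL}/sync",
    headers=HEADERS,
    params={"device_id": "device-uuid", "sync_token": "previous-token", "limit": 100},
)
for change in resp.json()["changes"]:
    print(change["event_type"], change["path"], change["version"])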

Data Flow

File Upload Flow

  1. Client Uploads File:
    • Client sends file to API Gateway
    • API Gateway routes to File Service
    • File Service validates file and user quota
  2. File Processing:
    • File Service:
      • Chunks file into blocks (e.g., 4MB chunks)
      • Computes hash for each chunk
      • Checks for duplicate chunks (deduplication)
      • Uploads unique chunks to Object Storage (S3)
      • Creates file metadata record
  3. Metadata Storage:
    • File Service stores file metadata in Metadata DB
    • Creates file version record
    • Updates user storage quota
    • Publishes sync event to Message Queue
  4. Sync Notification:
    • Sync Service consumes sync event
    • Notifies other devices of file change
    • Updates sync tokens

File Download Flow

  1. Client Requests Download:
    • Client requests file download
    • File Service retrieves file metadata
    • Gets chunk list from metadata
  2. Chunk Retrieval:
    • File Service retrieves chunks from Object Storage
    • Assembles chunks in order
    • Streams file to client
  3. Caching:
    • CDN caches popular files
    • Reduces load on object storage
    • Improves download speed

Sync Flow

  1. Client Requests Sync:
    • Client sends sync request with device_id and sync_token
    • Sync Service queries changes since last sync
  2. Change Detection:
    • Sync Service:
      • Queries sync events since sync_token
      • Filters events for user’s files
      • Returns list of changes
  3. Client Processes Changes:
    • Client receives changes
    • Downloads updated files
    • Applies changes locally
    • Updates sync_token
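
A minimal client-side sync loop along these lines, assuming the GET /api/v1/sync endpoint above; load_sync_token, save_sync_token, and apply_change are hypothetical local helpers:

import time
import requests

def sync_loop(base_url, device_id, headers, poll_interval=30):
    # Load the last token we stored locally; None triggers a full first sync
    sync_token = load_sync_token(device_id)  # hypothetical local helper

    while True:
        resp = requests.get(
            f"{base_url}/sync",
            headers=headers,
            params={"device_id": device_id, "sync_token": sync_token, "limit": 100},
        )
        body = resp.json()

        for change in body["changes"]:
            apply_change(change)  # hypothetical: download/update/delete the local copy

        # Persist the new token so the next request is incremental
        sync_token = body["sync_token"]
        save_sync_token(device_id, sync_token)  # hypothetical local helper

        # Keep draining pages if there are more changes; otherwise wait
        if not body["has_more"]:
            time.sleep(poll_interval)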

Delta Sync Flow

  1. File Changed Locally:
    • Client detects file change
    • Computes file chunks and hashes
    • Compares with server version
  2. Delta Calculation:
    • Client identifies changed chunks
    • Uploads only changed chunks
    • Sends chunk metadata to server
  3. Server Processing:
    • File Service receives delta update
    • Merges chunks with existing file
    • Creates new file version
    • Publishes sync event

Database Design

Schema Design

Users Table:

CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    email VARCHAR(100) UNIQUE NOT NULL,
    username VARCHAR(50) UNIQUE,
    password_hash VARCHAR(255) NOT NULL,
    storage_quota BIGINT DEFAULT 10737418240, -- 10GB default
    storage_used BIGINT DEFAULT 0,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    INDEX idx_email (email)
);

Files Table:

CREATE TABLE files (
    file_id UUID PRIMARY KEY,
    user_id UUID NOT NULL,
    name VARCHAR(500) NOT NULL,
    path VARCHAR(2000) NOT NULL,
    parent_folder_id UUID,
    file_type ENUM('file', 'folder') NOT NULL,
    size BIGINT DEFAULT 0,
    content_hash VARCHAR(64),
    mime_type VARCHAR(100),
    version INT DEFAULT 1,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    deleted_at TIMESTAMP NULL,
    INDEX idx_user_id (user_id),
    INDEX idx_path (path),
    INDEX idx_parent_folder (parent_folder_id),
    INDEX idx_user_path (user_id, path),
    INDEX idx_deleted_at (deleted_at)
);

File Versions Table:

CREATE TABLE file_versions (
    version_id UUID PRIMARY KEY,
    file_id UUID NOT NULL,
    version_number INT NOT NULL,
    content_hash VARCHAR(64) NOT NULL,
    size BIGINT NOT NULL,
    created_at TIMESTAMP DEFAULT NOW(),
    created_by UUID,
    INDEX idx_file_id (file_id),
    INDEX idx_file_version (file_id, version_number),
    UNIQUE KEY uk_file_version (file_id, version_number)
);

File Chunks Table:

CREATE TABLE file_chunks (
    chunk_id UUID PRIMARY KEY,
    file_id UUID NOT NULL,
    version_id UUID NOT NULL,
    chunk_index INT NOT NULL,
    chunk_hash VARCHAR(64) NOT NULL,
    size BIGINT NOT NULL,
    storage_url VARCHAR(500) NOT NULL,
    INDEX idx_file_version (file_id, version_id),
    INDEX idx_chunk_hash (chunk_hash),
    INDEX idx_file_chunk (file_id, chunk_index)
);

Devices Table:

CREATE TABLE devices (
    device_id UUID PRIMARY KEY,
    user_id UUID NOT NULL,
    device_name VARCHAR(200),
    device_type VARCHAR(50),
    last_sync_at TIMESTAMP,
    sync_token VARCHAR(100),
    created_at TIMESTAMP DEFAULT NOW(),
    INDEX idx_user_id (user_id),
    INDEX idx_sync_token (sync_token)
);

Sync Events Table:

CREATE TABLE sync_events (
    event_id UUID PRIMARY KEY,
    file_id UUID NOT NULL,
    user_id UUID NOT NULL,
    device_id UUID,
    event_type ENUM('create', 'update', 'delete', 'move', 'rename') NOT NULL,
    timestamp TIMESTAMP DEFAULT NOW(),
    metadata JSON,
    INDEX idx_user_timestamp (user_id, timestamp DESC),
    INDEX idx_file_id (file_id),
    INDEX idx_timestamp (timestamp DESC)
);

Shares Table:

CREATE TABLE shares (
    share_id UUID PRIMARY KEY,
    file_id UUID NOT NULL,
    owner_id UUID NOT NULL,
    shared_with_id UUID NOT NULL,
    permission ENUM('read', 'write', 'admin') DEFAULT 'read',
    created_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP NULL,
    INDEX idx_file_id (file_id),
    INDEX idx_shared_with (shared_with_id),
    INDEX idx_owner (owner_id)
);

Database Sharding Strategy

Files Table Sharding:

  • Shard by user_id with a stable hash function
  • 1,000 logical shards: shard_id = hash(user_id) % 1000, with logical shards mapped to physical nodes (or use consistent hashing to avoid large-scale remapping when the cluster grows)
  • All files for a user live in the same shard
  • Enables efficient per-user queries
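
A small sketch of the shard-routing logic, assuming 1,000 logical shards and an illustrative mapping from logical shard to physical database host:

import hashlib

NUM_LOGICAL_SHARDS = 1000

def shard_for_user(user_id: str) -> int:
    # Use a stable hash (not Python's built-in hash(), which varies per process)
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_LOGICAL_SHARDS

def connection_for_user(user_id: str, shard_map: dict):
    # shard_map: logical shard id -> physical database host (illustrative)
    shard_id = shard_for_user(user_id)
    return shard_map[shard_id]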

Sync Events Sharding:

  • Shard by user_id (same as files)
  • Enables efficient sync queries
  • Time-based partitioning for cleanup

Shard Key Selection:

  • user_id ensures all user data in same shard
  • Enables efficient queries for user’s files
  • Prevents cross-shard queries for single user

Replication:

  • Each shard replicated 3x for high availability
  • Master-replica setup for read scaling
  • Writes go to master, reads can go to replicas

High-Level Design

┌─────────────┐
│   Client    │
│  (Web App)  │
└──────┬──────┘
       │
       │ HTTP/HTTPS
       │
┌──────▼──────────────────────────────────────────────┐
│        API Gateway / Load Balancer                  │
│        - Rate Limiting                              │
│        - Request Routing                            │
└──────┬──────────────────────────────────────────────┘
       │
       ├──────────────┬──────────────┬──────────────┐
       │              │              │              │
┌──────▼──────┐ ┌─────▼──────┐ ┌─────▼──────┐ ┌─────▼──────┐
│   File      │ │   Sync     │ │  Share     │ │  Search    │
│   Service   │ │   Service  │ │  Service   │ │  Service   │
└──────┬──────┘ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘
       │              │              │              │
       │              │              │              │
┌──────▼──────────────▼──────────────▼──────────────▼──────┐
│              Message Queue (Kafka)                       │
│              - File events                               │
│              - Sync events                               │
│              - Share events                              │
└──────┬───────────────────────────────────────────────────┘
       │
       │
┌──────▼─────────────────────────────────────────────────────┐
│         Database Cluster (Sharded by user_id)              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                  │
│  │ Shard 0  │  │ Shard 1  │  │ Shard N  │                  │
│  │ Files    │  │ Files    │  │ Files    │                  │
│  │ Metadata │  │ Metadata │  │ Metadata │                  │
│  └──────────┘  └──────────┘  └──────────┘                  │
│                                                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                  │
│  │ Sync     │  │ Shares   │  │ Users    │                  │
│  │ Events   │  │ DB       │  │ DB       │                  │
│  └──────────┘  └──────────┘  └──────────┘                  │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│              Object Storage (S3)                           │
│  - File chunks                                             │
│  - Deduplication                                           │
│  - Versioning                                              │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│              Cache Layer (Redis)                           │
│  - File metadata cache                                     │
│  - Sync token cache                                        │
│  - Chunk hash cache (deduplication)                        │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│              CDN                                           │
│  - Popular file caching                                    │
│  - Download acceleration                                   │
└────────────────────────────────────────────────────────────┘

Deep Dive

Component Design

1. File Service

Responsibilities:

  • Handle file upload/download
  • Manage file metadata
  • Chunk files for storage
  • Handle deduplication
  • Manage file versions

Key Design Decisions:

  • Chunking: Split files into fixed-size chunks (4MB)
  • Deduplication: Store chunks by hash, reuse across files
  • Versioning: Create new version on each update
  • Metadata: Store metadata separately from file content

Scalability:

  • Stateless service, horizontally scalable
  • Connection pooling for database
  • Async processing for large files

2. Sync Service

Responsibilities:

  • Track file changes
  • Provide sync API for clients
  • Manage sync tokens
  • Handle conflict detection
  • Notify devices of changes

Key Design Decisions:

  • Sync Events: Store all file changes as events
  • Sync Tokens: Use tokens for incremental sync
  • Conflict Detection: Detect simultaneous edits
  • Real-Time: Use WebSocket for real-time notifications

Implementation:

def get_sync_changes(user_id, device_id, sync_token, limit=100):
    # Get events since the last sync (or everything on a first sync)
    if sync_token:
        events = get_events_since_token(user_id, sync_token)
    else:
        # First sync: get all files
        events = get_all_file_events(user_id)

    # Shape events into change records for the client
    changes = []
    for event in events[:limit]:
        change = {
            'file_id': event.file_id,
            'event_type': event.event_type,
            'path': get_file_path(event.file_id),
            'version': get_file_version(event.file_id),
            'timestamp': event.timestamp
        }
        changes.append(change)

    # Generate a new sync token pointing at the last processed event
    new_sync_token = generate_sync_token(user_id, device_id)

    return {
        'sync_token': new_sync_token,
        'changes': changes,
        'has_more': len(events) > limit
    }

3. Chunking and Deduplication

Challenge: Store files efficiently and avoid duplicate storage

Solution:

  • Fixed-Size Chunking: Split files into fixed-size chunks (4MB)
  • Content-Based Hashing: Hash each chunk (SHA-256)
  • Deduplication: Store chunks by hash, reuse across files
  • Chunk Index: Maintain index of chunks per file

Implementation:

from hashlib import sha256

def upload_file(file_data, user_id, path):
    # Split the file into fixed-size 4MB chunks
    chunks = chunk_file(file_data, chunk_size=4*1024*1024)

    # Compute hashes and upload only chunks not already stored
    chunk_hashes = []

    for chunk in chunks:
        chunk_hash = sha256(chunk).hexdigest()

        # Check if an identical chunk already exists (deduplication)
        existing_chunk = get_chunk_by_hash(chunk_hash)

        if existing_chunk:
            # Reuse existing chunk
            chunk_url = existing_chunk.storage_url
        else:
            # Upload new chunk and record its metadata
            chunk_url = upload_to_s3(chunk, chunk_hash)
            store_chunk_metadata(chunk_hash, chunk_url, len(chunk))

        chunk_hashes.append(chunk_hash)

    # Create file metadata referencing the ordered chunk list
    file_id = create_file_metadata(
        user_id=user_id,
        path=path,
        size=len(file_data),
        content_hash=sha256(file_data).hexdigest(),
        chunks=chunk_hashes
    )

    return file_id

4. Delta Sync (Incremental Sync)

Challenge: Sync only changed parts of files

Solution:

  • Chunk-Based Delta: Compare chunks between versions
  • Rolling Hash: A rolling hash enables efficient comparison when using content-defined chunking; with fixed-size chunks, comparing per-chunk SHA-256 hashes is sufficient
  • Upload Only Changes: Upload only changed chunks
  • Merge on Server: Merge chunks on server side

Implementation:

def delta_sync(file_id, new_file_data, device_id):
    # Get the chunk list for the current server-side version
    current_version = get_file_version(file_id)
    current_chunks = get_file_chunks(file_id, current_version)

    # Chunk the new file contents
    new_chunks = chunk_file(new_file_data, chunk_size=4*1024*1024)
    new_chunk_hashes = [sha256(chunk).hexdigest() for chunk in new_chunks]

    # Find chunks that are new or differ from the current version
    changed_chunks = []
    for i, (new_hash, new_chunk) in enumerate(zip(new_chunk_hashes, new_chunks)):
        if i >= len(current_chunks) or new_hash != current_chunks[i].chunk_hash:
            changed_chunks.append({
                'index': i,
                'hash': new_hash,
                'data': new_chunk
            })

    # Upload only the changed chunks
    chunk_urls = []
    for chunk_info in changed_chunks:
        chunk_url = upload_to_s3(chunk_info['data'], chunk_info['hash'])
        chunk_urls.append((chunk_info['index'], chunk_url))

    # Create a new version that reuses unchanged chunks; merge_chunks also
    # drops trailing chunks if the file shrank
    new_version = create_file_version(
        file_id=file_id,
        chunks=merge_chunks(current_chunks, chunk_urls, len(new_chunks)),
        content_hash=sha256(new_file_data).hexdigest()
    )

    # Publish a sync event so other devices learn about the change
    publish_sync_event(file_id, 'update', device_id)

    return new_version

5. Conflict Resolution

Challenge: Handle conflicts when same file edited on multiple devices

Solution:

  • Last-Write-Wins: Default strategy (simple)
  • Version Vectors: Track versions per device
  • Conflict Detection: Detect simultaneous edits
  • Conflict Files: Create conflict copies when detected

Implementation:

def handle_file_update(file_id, new_version, device_id):
    current_file = get_file(file_id)

    # Check for conflicts: the client reports which server version its edit
    # was based on (base_version). If the server has moved past that version,
    # another device changed the file in the meantime.
    if current_file.version != new_version.base_version:
        # Conflict detected: create a conflict copy instead of overwriting
        conflict_file = create_conflict_file(
            file_id=file_id,
            conflicting_version=new_version,
            device_id=device_id
        )
        
        # Notify user of conflict
        notify_user_conflict(file_id, conflict_file)
        
        return {
            'status': 'conflict',
            'conflict_file_id': conflict_file.file_id
        }
    
    # No conflict: update file
    updated_file = update_file(file_id, new_version)
    
    return {
        'status': 'success',
        'file_id': file_id,
        'version': updated_file.version
    }

Detailed Design

File Chunking Strategy

Fixed-Size Chunking:

  • Chunk size: 4MB (balance between overhead and efficiency)
  • Simple and fast
  • Good for random access
  • Less efficient for insertions

Content-Defined Chunking (CDC):

  • Variable-size chunks based on content
  • Better deduplication
  • More complex
  • Better for insertions

Decision: Use fixed-size chunking for simplicity and performance
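
A minimal fixed-size chunker consistent with this decision (the helper names match the pseudocode used elsewhere in this post):

from hashlib import sha256

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB

def chunk_file(data: bytes, chunk_size: int = CHUNK_SIZE):
    # Split an in-memory byte string into fixed-size chunks; the last chunk may be smaller.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def chunk_hashes(data: bytes):
    # Content hash per chunk, used for deduplication and delta sync
    return [sha256(chunk).hexdigest() for chunk in chunk_file(data)]

For very large files (up to 10GB), the same loop would read fixed-size blocks from a file handle instead of holding the whole file in memory.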

Deduplication Strategy

Chunk-Level Deduplication:

  • Store chunks by hash
  • Multiple files can share chunks
  • Significant storage savings (30-50%)
  • Reduces storage costs

File-Level Deduplication:

  • Store files by hash
  • Entire file deduplication
  • Simpler but less efficient
  • Used for small files

Implementation:

def get_or_upload_chunk(chunk_data):
    chunk_hash = sha256(chunk_data).hexdigest()
    
    # Check cache first
    cached_chunk = cache.get(f"chunk:{chunk_hash}")
    if cached_chunk:
        return cached_chunk['url']
    
    # Check database (a content-addressed chunk table keyed by hash,
    # separate from the per-file file_chunks table)
    existing_chunk = db.query("SELECT storage_url FROM chunks WHERE chunk_hash = ?", chunk_hash)
    if existing_chunk:
        cache.set(f"chunk:{chunk_hash}", {'url': existing_chunk.storage_url})
        return existing_chunk.storage_url
    
    # Upload new chunk
    chunk_url = upload_to_s3(chunk_data, chunk_hash)
    
    # Store metadata
    db.execute(
        "INSERT INTO chunks (chunk_hash, storage_url, size) VALUES (?, ?, ?)",
        chunk_hash, chunk_url, len(chunk_data)
    )
    
    cache.set(f"chunk:{chunk_hash}", {'url': chunk_url})
    return chunk_url

Versioning Strategy

Snapshot Versioning:

  • Store full file for each version
  • Simple but storage-intensive
  • Fast restore

Delta Versioning:

  • Store only changes between versions
  • Storage-efficient
  • More complex restore

Decision: Use chunk-based versioning (store new chunks, reference old chunks)
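
Restoring an old version under this scheme is cheap: a version is just an ordered list of chunk hashes, so a restore creates a new head version that points back at previously stored chunks. A sketch reusing the helpers from the earlier pseudocode (get_version is an assumed metadata lookup):

def restore_version(file_id, target_version_number, device_id):
    # The chunk list recorded for the target version; the chunks themselves
    # are still in object storage, so nothing needs to be re-uploaded.
    target = get_version(file_id, target_version_number)        # assumed lookup
    target_chunks = get_file_chunks(file_id, target_version_number)

    # Restoring creates a NEW head version that points at the old chunks,
    # so intermediate history is preserved (matching the restore API above).
    new_version = create_file_version(
        file_id=file_id,
        chunks=target_chunks,
        content_hash=target.content_hash
    )

    # Other devices pick the restore up through the normal sync path
    publish_sync_event(file_id, 'update', device_id)

    return new_version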

Sync Token Management

Challenge: Efficiently track sync state per device

Solution:

  • Sync Tokens: Opaque tokens representing sync state
  • Event-Based: Token represents last processed event ID
  • Incremental: Only return changes since token
  • Expiration: Tokens expire after inactivity

Implementation:

import base64
import json
from datetime import datetime, timezone

def generate_sync_token(user_id, device_id):
    # Get the most recent event for this user
    last_event = get_last_sync_event(user_id)

    # Create a token capturing the last processed event and a timestamp
    token_data = {
        'user_id': user_id,
        'device_id': device_id,
        'last_event_id': last_event.event_id if last_event else None,
        'timestamp': datetime.now(timezone.utc).isoformat()
    }

    # Encode as an opaque token
    token = base64.urlsafe_b64encode(json.dumps(token_data).encode()).decode()

    # Persist the token on the device record
    update_device_sync_token(device_id, token)

    return token

def get_events_since_token(user_id, sync_token):
    # Decode the opaque token
    token_data = json.loads(base64.urlsafe_b64decode(sync_token))
    last_event_id = token_data.get('last_event_id')

    if last_event_id:
        # Get events after the last processed event. This assumes event_id is
        # monotonically increasing (e.g., a sequence or time-ordered ID such as
        # a ULID), not a random UUID.
        events = db.query(
            "SELECT * FROM sync_events WHERE user_id = ? AND event_id > ? ORDER BY event_id",
            user_id, last_event_id
        )
    else:
        # First sync: return all events for the user
        events = db.query(
            "SELECT * FROM sync_events WHERE user_id = ? ORDER BY event_id",
            user_id
        )

    return events

Scalability Considerations

Horizontal Scaling

File Service:

  • Stateless service, horizontally scalable
  • Load balancer distributes requests
  • Connection pooling for database
  • Async processing for large files

Sync Service:

  • Stateless service, horizontally scalable
  • Shared database for sync events
  • WebSocket connections for real-time sync
  • Connection pooling

Storage Scaling

Object Storage:

  • S3 for file chunks
  • Automatic scaling
  • CDN for popular files
  • Lifecycle policies for old versions

Database:

  • Shard by user_id
  • Read replicas for read scaling
  • Connection pooling
  • Index optimization

Caching Strategy

Redis Cache:

  • File Metadata: TTL 1 hour
  • Chunk Hashes: TTL 24 hours (for deduplication)
  • Sync Tokens: TTL 1 hour
  • User Quota: TTL 5 minutes

Cache Invalidation:

  • Invalidate on file updates
  • Invalidate on sync events
  • Use cache-aside pattern
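
A cache-aside read path for file metadata might look like the sketch below, using the redis-py client; key names, TTLs, and the db_get_file_metadata helper are illustrative:

import json
import redis

cache = redis.Redis(host="localhost", port=6379)
METADATA_TTL_SECONDS = 3600  # 1 hour, matching the TTL above

def get_file_metadata(file_id):
    # 1. Try the cache first
    cached = cache.get(f"file:meta:{file_id}")
    if cached:
        return json.loads(cached)

    # 2. Fall back to the metadata database, then populate the cache
    metadata = db_get_file_metadata(file_id)  # hypothetical DB accessor
    cache.setex(f"file:meta:{file_id}", METADATA_TTL_SECONDS, json.dumps(metadata))
    return metadata

def invalidate_file_metadata(file_id):
    # Called on every file update / sync event for this file
    cache.delete(f"file:meta:{file_id}")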

Security Considerations

Authentication & Authorization

  • JWT Tokens: Use JWT for API authentication
  • File Access Control: Verify user owns file or has share permission
  • Rate Limiting: Limit requests per user/IP
  • Quota Enforcement: Enforce storage quotas
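
A sketch of the per-request authorization check combining ownership and share permissions; get_share and the permission ranking are assumptions, not part of the schema above:

PERMISSION_RANK = {"read": 1, "write": 2, "admin": 3}

def check_file_access(user_id, file_id, required_permission="read"):
    file = get_file(file_id)

    # Owners can do anything with their own files
    if file.user_id == user_id:
        return True

    # Otherwise the file (or one of its parent folders) must be shared with
    # the user; walking up the folder tree is omitted here for brevity.
    share = get_share(file_id, shared_with_id=user_id)  # hypothetical lookup
    if share is None or (share.expires_at and share.expires_at < now()):
        return False

    return PERMISSION_RANK[share.permission] >= PERMISSION_RANK[required_permission]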

Data Security

  • Encryption at Rest: Encrypt files in object storage
  • Encryption in Transit: TLS for all communications
  • Access Control: Fine-grained permissions
  • Audit Logging: Log all file operations

File Sharing Security

  • Share Links: Generate secure share links
  • Permission Levels: Read, write, admin permissions
  • Expiration: Share links expire after time
  • Access Logging: Track share access
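
One way to make share links both tamper-proof and expiring is to sign them, for example with an HMAC. A minimal sketch (the secret key handling and URL format are assumptions):

import base64
import hashlib
import hmac
import time

SECRET_KEY = b"server-side-secret"  # assumption: loaded from a secrets manager

def create_share_link(share_id: str, ttl_seconds: int = 7 * 24 * 3600) -> str:
    expires_at = int(time.time()) + ttl_seconds
    payload = f"{share_id}:{expires_at}"
    sig = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    token = base64.urlsafe_b64encode(f"{payload}:{sig}".encode()).decode()
    return f"https://files.example.com/s/{token}"  # hypothetical URL format

def verify_share_token(token: str):
    # Returns the share_id if the token is valid and unexpired, else None
    share_id, expires_at, sig = base64.urlsafe_b64decode(token).decode().rsplit(":", 2)
    payload = f"{share_id}:{expires_at}"
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected) or int(expires_at) < time.time():
        return None  # tampered or expired
    return share_id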

Monitoring & Observability

Key Metrics

System Metrics:

  • Upload/download rate (files/second)
  • Storage usage (total, per user)
  • Sync latency (p50, p95, p99)
  • Chunk deduplication rate
  • Cache hit rate

Business Metrics:

  • Total files stored
  • Total storage used
  • Active users
  • Files shared
  • Storage savings from deduplication

Logging

  • Structured Logging: JSON logs for parsing
  • File Operations: Log uploads, downloads, deletes
  • Sync Operations: Log sync events
  • Error Logging: Log errors with context

Alerting

  • High Storage Usage: Alert if storage > 80% capacity
  • High Latency: Alert if p99 latency > 5 seconds
  • High Error Rate: Alert if error rate > 1%
  • Sync Failures: Alert on sync failures

Trade-offs and Optimizations

Trade-offs

1. Chunk Size: Small vs Large

  • Small (1MB): Better deduplication, more overhead
  • Large (8MB): Less overhead, worse deduplication
  • Decision: 4MB as a balance between the two

2. Sync Frequency: Real-Time vs Batch

  • Real-Time: Lower latency, higher cost
  • Batch: Higher latency, lower cost
  • Decision: Real-time with batching optimization

3. Versioning: Many vs Few Versions

  • Many: Better history, more storage
  • Few: Less storage, less history
  • Decision: Keep 5 versions, archive older

4. Deduplication: Aggressive vs Conservative

  • Aggressive: More storage savings, more computation
  • Conservative: Less computation, less savings
  • Decision: Aggressive chunk-level deduplication

Optimizations

1. Chunk Deduplication

  • Cache chunk hashes
  • Batch chunk lookups
  • Reduce storage costs by 30-50%

2. Delta Sync

  • Only sync changed chunks
  • Reduce bandwidth by 80-90%
  • Faster sync for large files

3. CDN Caching

  • Cache popular files
  • Reduce load on object storage
  • Improve download speed

4. Compression

  • Compress chunks before storage
  • Reduce storage by 30-50%
  • Trade-off: CPU usage
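
Compression composes cleanly with chunking: hash the raw chunk first (so deduplication is unaffected by compression), then compress before upload and skip compression when it does not help. A sketch with zlib:

import zlib
from hashlib import sha256

def prepare_chunk_for_upload(chunk: bytes):
    # Hash the raw chunk so identical content always dedupes to the same key,
    # regardless of compression.
    chunk_hash = sha256(chunk).hexdigest()

    # Compress before storing; skip it when it does not actually shrink the
    # chunk (already-compressed media like JPEG/MP4 often grows slightly).
    compressed = zlib.compress(chunk, level=6)
    if len(compressed) < len(chunk):
        return chunk_hash, compressed, True   # store compressed
    return chunk_hash, chunk, False           # store as-is

# On download, decompress only chunks flagged as compressed:
# data = zlib.decompress(blob) if is_compressed else blob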

What Interviewers Look For

Storage Optimization Skills

  1. Chunking and Deduplication
    • Chunk-based storage
    • Chunk-level deduplication
    • Storage efficiency
    • Red Flags: File-level storage, no deduplication, inefficient
  2. Delta Sync
    • Only sync changed chunks
    • Bandwidth optimization
    • Red Flags: Full file sync, no delta, high bandwidth
  3. Versioning
    • Chunk-based versioning
    • Efficient history tracking
    • Red Flags: File-level versioning, inefficient, high storage

Distributed Systems Skills

  1. Conflict Resolution
    • Simultaneous edit handling
    • Merge strategies
    • Red Flags: No conflict handling, data loss, overwrites
  2. Sync Mechanism
    • Event-based sync
    • Incremental updates
    • Token-based sync
    • Red Flags: Polling, full sync, no incremental
  3. Scalability Design
    • Horizontal scaling
    • Database sharding
    • Object storage
    • Red Flags: Vertical scaling, no sharding, file system storage

Problem-Solving Approach

  1. Storage Efficiency
    • 30-50% storage savings
    • Deduplication strategy
    • Red Flags: No optimization, high storage costs
  2. Edge Cases
    • Concurrent edits
    • Network failures
    • Large files
    • Red Flags: Ignoring edge cases, no handling
  3. Trade-off Analysis
    • Storage vs computation
    • Latency vs bandwidth
    • Red Flags: No trade-offs, dogmatic choices

System Design Skills

  1. Component Design
    • Sync service
    • Storage service
    • Conflict resolution service
    • Red Flags: Monolithic, unclear boundaries
  2. Database Design
    • Proper indexing
    • Sharding strategy
    • Red Flags: Missing indexes, no sharding
  3. CDN Integration
    • Popular file caching
    • Efficient delivery
    • Red Flags: No CDN, slow delivery

Communication Skills

  1. Deduplication Explanation
    • Can explain chunking
    • Understands deduplication
    • Red Flags: No understanding, vague
  2. Sync Algorithm Explanation
    • Can explain delta sync
    • Understands versioning
    • Red Flags: No understanding, vague

Meta-Specific Focus

  1. Storage Optimization Expertise
    • Deduplication knowledge
    • Efficient storage design
    • Key: Show storage expertise
  2. Correctness Focus
    • Conflict resolution
    • Data integrity
    • Key: Demonstrate reliability focus

Summary

Designing a file storage and sync system at scale requires careful consideration of:

  1. Chunking and Deduplication: Efficient storage with chunk-level deduplication
  2. Delta Sync: Only sync changed parts for bandwidth efficiency
  3. Versioning: Track file history with chunk-based versioning
  4. Conflict Resolution: Handle simultaneous edits gracefully
  5. Sync Mechanism: Event-based sync with tokens for incremental updates
  6. Scalability: Horizontal scaling with sharding and caching
  7. Storage Efficiency: Deduplication saves 30-50% storage
  8. Real-Time Sync: Sub-second sync latency for small changes

Key architectural decisions:

  • Chunk-Based Storage (4MB chunks) for efficiency
  • Chunk-Level Deduplication for storage savings
  • Delta Sync for bandwidth optimization
  • Event-Based Sync with tokens for incremental updates
  • Sharded Database by user_id for scalability
  • Object Storage (S3) for file chunks
  • CDN for popular file caching

The system handles 1 billion files, 10 petabytes of storage, and 100 million file operations per day with efficient deduplication and delta sync.