Introduction
Object storage (blob storage) is a fundamental building block of modern cloud infrastructure. Systems like Amazon S3, Azure Blob Storage, and Google Cloud Storage handle exabytes of data, billions of objects, and millions of requests per second while maintaining 99.999999999% (11 9’s) durability.
This post walks through the design of a scalable blob storage system that handles massive scale, ensures data durability, and provides high availability, all while keeping latency and cost in check.
Table of Contents
- Requirements
- Capacity Estimation
- Core Entities
- API
- Data Flow
- Database Design
- High-Level Design
- Deep Dive
Requirements
Functional Requirements
- Object Operations
- PUT: Upload objects (files) with metadata
- GET: Download objects
- DELETE: Delete objects
- LIST: List objects in a bucket/prefix
- HEAD: Get object metadata without downloading
- Bucket Management
- Create/delete buckets (containers)
- List buckets
- Set bucket policies and ACLs
- Metadata Management
- Store object metadata (size, content-type, etag, timestamps)
- Custom metadata (key-value pairs)
- Versioning support
- Access Control
- Authentication (API keys, IAM)
- Authorization (ACLs, bucket policies)
- Signed URLs for temporary access
Non-Functional Requirements
- Durability: 99.999999999% (11 9’s) - if you store 10 million objects, expect to lose a single object about once every 10,000 years
- Availability: 99.99% uptime (4 9’s)
- Scalability: Handle exabytes of data, billions of objects
- Throughput: Millions of requests per second
- Latency: < 100ms for metadata operations, < 1s for object retrieval
- Cost Efficiency: Optimize storage and bandwidth costs
- Consistency: Eventually consistent (or strong consistency option)
Capacity Estimation
Storage Scale
- Total Storage: 100+ exabytes
- Objects: 100+ billion objects
- Average Object Size: 1MB (ranges from bytes to 5TB)
- Daily Uploads: 1 billion objects = 1PB/day
- Daily Downloads: 10 billion objects = 10PB/day
Request Scale
- Read Requests: ~115K/second average (10B downloads/day ÷ 86,400s); provision for ~1M/second at peak
- Write Requests: ~12K/second average (1B uploads/day); provision for ~100K/second at peak
- Metadata Operations: roughly 2x the data-request rate, since every PUT/GET touches metadata and LIST/HEAD are metadata-only
User Scale
- Active Users: 10 million
- Concurrent Users: 1 million
- Buckets: 100 million buckets
Core Entities
Bucket
- Attributes: bucket_id, bucket_name, owner_id, region, created_at, policy, acl
- Relationships: Contains objects, belongs to user
Object
- Attributes: object_id, bucket_id, key, size, content_type, etag, version_id, created_at, updated_at
- Relationships: Belongs to bucket, has metadata, has versions
Object Metadata
- Attributes: metadata_id, object_id, key, value, created_at
- Relationships: Belongs to object
Object Version
- Attributes: version_id, object_id, version_number, is_latest, created_at
- Relationships: Belongs to object
API
PUT Object (Upload)
PUT /{bucket}/{key}
Headers:
- Content-Type: application/json
- Content-Length: 1024
- x-amz-meta-custom-key: custom-value
Body: [object data]
Response: 200 OK
Headers:
- ETag: "d41d8cd98f00b204e9800998ecf8427e"
- x-amz-version-id: version-id
GET Object (Download)
GET /{bucket}/{key}
Headers:
- Range: bytes=0-1023 (optional)
Response: 200 OK
Headers:
- Content-Type: application/json
- Content-Length: 1024
- ETag: "d41d8cd98f00b204e9800998ecf8427e"
- Last-Modified: Wed, 01 Jan 2025 12:00:00 GMT
Body: [object data]
DELETE Object
DELETE /{bucket}/{key}
Response: 204 No Content
LIST Objects
GET /{bucket}?prefix=photos/&max-keys=1000&marker=photos/photo-100.jpg
Response: 200 OK
{
"Contents": [
{
"Key": "photos/photo-1.jpg",
"Size": 1024,
"LastModified": "2025-01-01T12:00:00Z",
"ETag": "etag-1"
}
],
"IsTruncated": false,
"NextMarker": null
}
HEAD Object (Get Metadata)
HEAD /{bucket}/{key}
Response: 200 OK
Headers:
- Content-Type: application/json
- Content-Length: 1024
- ETag: "d41d8cd98f00b204e9800998ecf8427e"
- Last-Modified: Wed, 01 Jan 2025 12:00:00 GMT
- x-amz-meta-custom-key: custom-value
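To make the API concrete, here is a minimal client sketch using Python's requests library; the endpoint, bucket, key, and file name are illustrative, and real requests would also carry an authentication signature (see Security):

import requests

BASE = "https://storage.example.com"  # hypothetical endpoint
bucket, key = "photos", "cats/cat1.jpg"

# PUT: upload an object with one custom metadata header
with open("cat1.jpg", "rb") as f:
    resp = requests.put(f"{BASE}/{bucket}/{key}", data=f,
                        headers={"Content-Type": "image/jpeg",
                                 "x-amz-meta-custom-key": "custom-value"})
etag = resp.headers["ETag"]

# GET: fetch only the first KiB with a Range request
resp = requests.get(f"{BASE}/{bucket}/{key}",
                    headers={"Range": "bytes=0-1023"})
first_kib = resp.content

# HEAD: metadata only, no body transferred
meta = requests.head(f"{BASE}/{bucket}/{key}").headers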
Data Flow
PUT Object (Upload) Flow
- Client uploads object → API Gateway
- API Gateway → Metadata Service (validate bucket, permissions)
- Metadata Service → Storage Service (get storage node locations)
- Storage Service → Storage Nodes (write object data)
- Storage Nodes → Replication Service (replicate to other nodes)
- Storage Service → Metadata Service (confirm write success)
- Metadata Service → Metadata Database (store object metadata)
- Metadata Service → API Gateway (return success with ETag)
- API Gateway → Client (return response)
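A condensed sketch of this write path; metadata_svc, storage_svc, and md5_hex are hypothetical helpers, and the point is the ordering: data must be durably written to a quorum before metadata is committed:

def put_object(bucket, key, data, user):
    # 1-2. Validate bucket and permissions before touching storage
    meta = metadata_svc.validate(bucket, key, user)        # hypothetical client
    # 3. Pick target nodes (e.g., via the consistent-hash ring)
    nodes = storage_svc.pick_nodes(bucket, key, replicas=3)
    # 4-6. Write to the nodes; require a quorum of acks (2 of 3)
    acks = [node.write(bucket, key, data) for node in nodes]
    if sum(acks) < 2:
        raise IOError("write quorum not reached")
    # 7. Only now persist metadata pointing at the stored replicas
    etag = md5_hex(data)                                   # hypothetical helper
    metadata_svc.commit(bucket, key, etag, [node.id for node in nodes])
    return etag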
GET Object (Download) Flow
- Client requests object → API Gateway
- API Gateway → Metadata Service (get object metadata)
- Metadata Service → Metadata Database (query metadata)
- Metadata Service → Storage Service (get object location)
- Storage Service → Storage Nodes (read object data)
- Storage Nodes → Storage Service (return object data)
- Storage Service → API Gateway (stream object data)
- API Gateway → Client (return object)
LIST Objects Flow
- Client requests list → API Gateway
- API Gateway → Metadata Service
- Metadata Service → Metadata Database (query objects by prefix)
- Metadata Database → Metadata Service (return object list)
- Metadata Service → API Gateway (return paginated results)
- API Gateway → Client (return object list)
Database Design
Schema Design
Buckets Table:
CREATE TABLE buckets (
bucket_id VARCHAR(36) PRIMARY KEY,
bucket_name VARCHAR(255) UNIQUE NOT NULL,
owner_id VARCHAR(36) NOT NULL,
region VARCHAR(50) NOT NULL,
created_at TIMESTAMP,
policy JSON,
acl JSON,
INDEX idx_owner_id (owner_id)
-- bucket_name is already indexed by its UNIQUE constraint
);
Objects Table:
CREATE TABLE objects (
object_id VARCHAR(36) PRIMARY KEY,
bucket_id VARCHAR(36) NOT NULL,
`key` VARCHAR(1024) NOT NULL, -- backticked: KEY is a reserved word in MySQL
size BIGINT NOT NULL,
content_type VARCHAR(255),
etag VARCHAR(64) NOT NULL,
version_id VARCHAR(36),
is_latest BOOLEAN DEFAULT TRUE,
created_at TIMESTAMP,
updated_at TIMESTAMP,
UNIQUE KEY unique_bucket_key_version (bucket_id, `key`, version_id),
INDEX idx_bucket_key (bucket_id, `key`(255)), -- prefix index; also serves LIST-by-prefix scans
FOREIGN KEY (bucket_id) REFERENCES buckets(bucket_id)
);
Object Metadata Table:
CREATE TABLE object_metadata (
metadata_id VARCHAR(36) PRIMARY KEY,
object_id VARCHAR(36) NOT NULL,
meta_key VARCHAR(255) NOT NULL,
meta_value TEXT,
created_at TIMESTAMP,
INDEX idx_object_id (object_id),
FOREIGN KEY (object_id) REFERENCES objects(object_id)
);
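For illustration, a LIST-by-prefix call against this schema becomes an index range scan on (bucket_id, `key`); the bucket ID and prefix below are placeholder values:

SELECT `key`, size, etag, updated_at
FROM objects
WHERE bucket_id = 'b-123'
  AND `key` LIKE 'photos/%'
  AND is_latest = TRUE
ORDER BY `key`
LIMIT 1000;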
Database Sharding Strategy
Shard by Bucket ID:
- Buckets and their objects on same shard
- Enables efficient bucket operations
- Use consistent hashing for distribution
Partition Objects Table:
- Partition by bucket_id for large buckets
- Enables parallel queries within bucket
High-Level Design
┌─────────────────────────────────────────┐
│           Client Applications           │
│       (REST API, SDK, CLI Tools)        │
└──────────────┬──────────────────────────┘
               │
               │ HTTPS
               │
┌──────────────▼──────────────────────────┐
│       Load Balancer / API Gateway       │
└──────────────┬──────────────────────────┘
               │
       ┌───────┴───────┐
       │               │
┌──────▼──────┐  ┌─────▼──────┐
│  Metadata   │  │  Storage   │
│  Service    │  │  Service   │
│ (Metadata   │  │  (Object   │
│  DB)        │  │  Storage)  │
└──────┬──────┘  └─────┬──────┘
       │               │
       │               │
┌──────▼──────┐  ┌─────▼──────┐
│  Metadata   │  │  Storage   │
│  Database   │  │   Nodes    │
│ (Cassandra) │  │(Distributed│
│             │  │   Disks)   │
└─────────────┘  └────────────┘
Core Components
- API Gateway / Load Balancer: Routes requests, handles authentication
- Metadata Service: Manages object metadata, bucket information
- Storage Service: Handles actual object storage and retrieval
- Metadata Database: Stores metadata (Cassandra, DynamoDB)
- Storage Nodes: Physical storage servers with disks
- Replication Service: Ensures data durability through replication
- Monitoring & Analytics: System health and usage metrics
Deep Dive
Component Design
1. API Gateway / Load Balancer
Responsibilities:
- Request routing to appropriate services
- Authentication and authorization
- Rate limiting and throttling
- Request/response transformation
- SSL/TLS termination
Features:
- RESTful API (S3-compatible)
- Request signing (HMAC-SHA256)
- CORS support
- Request logging
Technology:
- AWS API Gateway / Azure API Management
- Envoy Proxy / NGINX
- Custom API service
API Endpoints:
PUT /{bucket}/{key} # Upload object
GET /{bucket}/{key} # Download object
DELETE /{bucket}/{key} # Delete object
HEAD /{bucket}/{key} # Get metadata
GET /{bucket}?prefix={prefix} # List objects
2. Metadata Service
Responsibilities:
- Object metadata management
- Bucket management
- Access control (ACLs, policies)
- Versioning
- Lifecycle management
Metadata Stored:
- Object key (path)
- Bucket name
- Size (bytes)
- Content-Type
- ETag (MD5 hash)
- Creation timestamp
- Last modified timestamp
- Storage location (node IDs)
- Replication status
- Custom metadata (key-value pairs)
- Access control lists (ACLs)
API Operations:
- Create/update/delete object metadata
- List objects by prefix
- Get bucket information
- Manage ACLs and policies
Technology:
- Microservice architecture
- Stateless design
- Horizontal scaling
- Caching layer (Redis)
3. Metadata Database
Database Choice: Cassandra / DynamoDB
Why:
- High write throughput
- Horizontal scaling
- Low latency reads
- No single point of failure
- Tunable consistency
Schema Design (Cassandra):
Objects Table:
CREATE TABLE objects (
bucket_name TEXT,
object_key TEXT,
version_id UUID,
size_bytes BIGINT,
content_type TEXT,
etag TEXT,
created_at TIMESTAMP,
modified_at TIMESTAMP,
storage_nodes LIST<TEXT>, -- Node IDs storing replicas
replication_status TEXT,
custom_metadata MAP<TEXT, TEXT>,
acl_permissions MAP<TEXT, TEXT>,
PRIMARY KEY ((bucket_name, object_key), version_id)
) WITH CLUSTERING ORDER BY (version_id DESC);
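Because versions are clustered in descending order inside the (bucket_name, object_key) partition, fetching an object's newest version is a cheap single-partition read (assuming version_id is a timeuuid so its ordering tracks creation time):

-- Latest version of one object: first row of its partition
SELECT * FROM objects
WHERE bucket_name = 'photos' AND object_key = 'cats/cat1.jpg'
LIMIT 1;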
Buckets Table:
CREATE TABLE buckets (
bucket_name TEXT PRIMARY KEY,
owner_id TEXT,
created_at TIMESTAMP,
region TEXT,
versioning_enabled BOOLEAN,
lifecycle_policy TEXT,
acl_permissions MAP<TEXT, TEXT>,
custom_metadata MAP<TEXT, TEXT>
);
Indexes:
- Secondary index on prefix for listing
- Materialized views for common queries
- Partition by bucket_name for locality
Consistency:
- Write: QUORUM (2 of 3 replicas)
- Read: QUORUM for strong consistency
- Read: ONE for eventual consistency (faster)
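The arithmetic behind these settings: with replication factor N = 3, QUORUM writes (W = 2) plus QUORUM reads (R = 2) give R + W = 4 > N, so every read quorum overlaps every write quorum and is guaranteed to include the latest acknowledged write. Reading at ONE gives R + W = 3 = N; overlap is no longer guaranteed, which is why it is faster but only eventually consistent.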
Why Cassandra? Can We Use SQL Instead?
This is a critical design decision. Let’s analyze both options:
Why Cassandra is Preferred for Blob Storage Metadata:
- Write-Intensive Workload
  - Scale: ~100K+ metadata writes/second at peak (see Capacity Estimation)
  - Cassandra: optimized for high write throughput
    - Append-only writes (no updates in place)
    - No locking overhead
    - Sequential disk writes
  - SQL: struggles with sustained high write throughput
    - Row-level locking creates contention
    - Index maintenance overhead
    - Transaction log writes add latency
- Horizontal Scaling
  - Cassandra: built-in horizontal scaling
    - Add nodes; data distribution is automatic
    - No manual sharding required
    - Near-linear scalability
  - SQL: vertical scaling plus complex sharding
    - Sharding requires application logic
    - Cross-shard queries are complex
    - Rebalancing is difficult
- No Complex Joins
  - Blob storage queries are simple lookups
    - Get object by bucket + key
    - List objects by prefix
    - No joins needed
  - Cassandra: well suited to these queries
    - Fast key-based lookups
    - Efficient prefix queries
  - SQL: joins are powerful but unnecessary overhead here; simpler queries mean better performance
- Flexible Schema
  - Metadata varies by object type
    - Custom metadata (key-value pairs)
    - Different fields for different objects
  - Cassandra: schema flexibility
    - MAP, LIST, SET collection types
    - Easy to add new fields
  - SQL: rigid schema
    - Schema changes are expensive
    - EAV pattern needed for flexible metadata
- Multi-Region Support
  - Cassandra: multi-region replication
    - Built-in multi-datacenter support
    - Tunable consistency per region
  - SQL: complex multi-region setup
    - Requires external replication solutions
    - Cross-region transactions are slow
When SQL Databases Can Work:
SQL databases (PostgreSQL, MySQL) can be used for blob storage metadata in certain scenarios:
✅ Suitable When:
- Smaller Scale: < 1M writes/second
- Strong Consistency Required: Need ACID transactions
- Complex Queries: Need joins, aggregations, analytics
- Team Familiarity: Team more comfortable with SQL
- Existing Infrastructure: Already using SQL databases
SQL Schema Design (PostgreSQL):
-- Objects table
CREATE TABLE objects (
bucket_name VARCHAR(255) NOT NULL,
object_key VARCHAR(1024) NOT NULL,
version_id UUID NOT NULL,
size_bytes BIGINT NOT NULL,
content_type VARCHAR(255),
etag VARCHAR(64),
created_at TIMESTAMP NOT NULL,
modified_at TIMESTAMP NOT NULL,
storage_nodes TEXT[], -- Array of node IDs
replication_status VARCHAR(50),
custom_metadata JSONB, -- Flexible metadata
acl_permissions JSONB,
PRIMARY KEY (bucket_name, object_key, version_id)
);
-- Indexes for common queries
CREATE INDEX idx_objects_bucket_key ON objects(bucket_name, object_key);
CREATE INDEX idx_objects_prefix ON objects(bucket_name, object_key text_pattern_ops);
CREATE INDEX idx_objects_created ON objects(created_at);
-- Buckets table
CREATE TABLE buckets (
bucket_name VARCHAR(255) PRIMARY KEY,
owner_id VARCHAR(255) NOT NULL,
created_at TIMESTAMP NOT NULL,
region VARCHAR(50),
versioning_enabled BOOLEAN DEFAULT FALSE,
lifecycle_policy JSONB,
acl_permissions JSONB,
custom_metadata JSONB
);
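The text_pattern_ops index above exists precisely so that prefix listings can use an index range scan; an illustrative paginated LIST query (values are placeholders):

SELECT object_key, size_bytes, etag, modified_at
FROM objects
WHERE bucket_name = 'photos'
  AND object_key LIKE 'cats/%'
  AND object_key > 'cats/cat-100.jpg'  -- marker from the previous page
ORDER BY object_key
LIMIT 1000;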
SQL Optimization Strategies:
- Partitioning
-- Partition by bucket_name for better performance
CREATE TABLE objects ( ... ) PARTITION BY HASH (bucket_name);
CREATE TABLE objects_bucket_0 PARTITION OF objects
  FOR VALUES WITH (MODULUS 4, REMAINDER 0);
- Read Replicas
- Use read replicas for LIST operations
- Primary handles writes
- Replicas handle reads
- Connection Pooling
- Use connection pools (PgBouncer)
- Reduce connection overhead
- Caching
- Cache frequently accessed metadata in Redis
- Reduce database load
Trade-offs: SQL vs Cassandra
| Aspect | Cassandra | SQL (PostgreSQL) |
|---|---|---|
| Write Throughput | Very High (10M+/sec) | High (1M/sec) |
| Read Latency | Low (< 10ms) | Low (< 10ms) |
| Horizontal Scaling | Easy (add nodes) | Complex (sharding) |
| Consistency | Tunable (eventual/strong) | Strong (ACID) |
| Complex Queries | Limited | Excellent (joins, SQL) |
| Schema Flexibility | High | Low (rigid) |
| Multi-Region | Built-in | Complex setup |
| Operational Complexity | Medium | Low (familiar) |
| Cost | Lower (commodity hardware) | Higher (vertical scaling) |
Hybrid Approach:
Some systems use both:
- Cassandra for Metadata
- High write throughput
- Simple lookups
- Horizontal scaling
- SQL for Analytics/Reporting
- Complex queries
- Aggregations
- Business intelligence
- Sync Between Systems
- ETL pipeline syncs data
- Cassandra → SQL for analytics
- Use appropriate tool for each use case
Recommendation:
- For Large-Scale Blob Storage (S3-like): Use Cassandra
- High write throughput requirement
- Simple query patterns
- Need horizontal scaling
- Multi-region support
- For Smaller Scale or Strong Consistency: Use SQL (PostgreSQL)
- < 1M writes/second
- Need ACID transactions
- Complex queries needed
- Team familiarity with SQL
- For Best of Both Worlds: Use Hybrid
- Cassandra for metadata operations
- SQL for analytics/reporting
- Sync between systems
Real-World Examples:
- Amazon S3: metadata layer is proprietary; generally described as purpose-built distributed key-value (NoSQL-style) storage
- Azure Blob Storage: keeps metadata in Azure Storage's own internal partitioned table layer (per Microsoft's published architecture), not an off-the-shelf SQL deployment
- Google Cloud Storage: generally described as storing metadata in Google's internal NoSQL infrastructure (Bigtable-lineage systems)
Most large-scale blob storage systems use NoSQL databases for metadata due to write throughput and scalability requirements.
4. Storage Service
Responsibilities:
- Object storage and retrieval
- Replication coordination
- Data integrity verification
- Storage node management
Architecture:
- Stateless service layer
- Routes requests to storage nodes
- Handles replication logic
- Manages erasure coding
Operations:
- PUT: Store object, trigger replication
- GET: Retrieve object from storage nodes
- DELETE: Mark for deletion, garbage collection
- Replication: Copy objects to multiple nodes
5. Storage Nodes
Physical Architecture:
- Servers with multiple disks (HDD/SSD)
- Network attached storage
- Distributed across data centers
Node Responsibilities:
- Store object chunks/blocks
- Serve read requests
- Report health status
- Participate in replication
Storage Organization:
/storage/
├── node-001/
│ ├── bucket1/
│ │ ├── object1/chunk1
│ │ ├── object1/chunk2
│ │ └── object2/chunk1
│ └── bucket2/
└── node-002/
└── ...
Object Storage Format:
- Objects split into chunks (e.g., 64MB chunks)
- Chunks stored on different nodes
- Erasure coding for durability
- Checksums for integrity
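A minimal sketch of the chunking step, assuming 64 MB chunks and SHA-256 chunk checksums (both choices are illustrative):

import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, illustrative

def split_into_chunks(path):
    """Yield (index, checksum, bytes) for each fixed-size chunk of a file."""
    with open(path, "rb") as f:
        index = 0
        while chunk := f.read(CHUNK_SIZE):
            yield index, hashlib.sha256(chunk).hexdigest(), chunk
            index += 1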
6. Data Durability & Replication
Replication Strategies
1. Multi-Copy Replication
- Store N copies of each object (typically 3)
- Copies stored on different nodes/data centers
- Simple to implement and understand
- Storage overhead: Nx (3x for 3 copies)
Example:
Object → Replicate to 3 nodes
├─ Node A (Data Center 1)
├─ Node B (Data Center 2)
└─ Node C (Data Center 3)
2. Erasure Coding (EC)
- Split object into K data chunks
- Generate M parity chunks
- Store K+M chunks across nodes
- Can recover from M failures
- Storage overhead: (K+M)/K (e.g., 1.5x for 6+3)
Example (6+3 Erasure Coding):
Object (60MB) → Split into:
- 6 data chunks (10MB each)
- 3 parity chunks (10MB each)
- Store 9 chunks across 9 nodes
- Can lose any 3 chunks and still recover
- Storage overhead: 1.5x (vs 3x for replication)
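Production systems use Reed-Solomon codes for the parity math. As a much-simplified illustration of the recovery idea, the toy sketch below builds a single XOR parity chunk (a 6+1 scheme that survives one lost chunk, not the three losses of 6+3):

def xor_parity(chunks):
    """Single XOR parity over equal-length data chunks (toy 6+1 scheme)."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return bytes(parity)

def recover_missing(chunks_present, parity):
    """Rebuild the single missing chunk by XOR-ing parity with survivors."""
    missing = bytearray(parity)
    for chunk in chunks_present:
        for i, b in enumerate(chunk):
            missing[i] ^= b
    return bytes(missing)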
3. Hybrid Approach
- Hot data: 3x replication (fast access)
- Warm data: Erasure coding (cost efficient)
- Cold data: Erasure coding + compression
Replication Process
Write Path:
1. Client uploads object
2. Storage service receives object
3. Split into chunks (if using EC)
4. Write chunks to primary nodes
5. Trigger replication to secondary nodes
6. Wait for quorum (e.g., 2 of 3 replicas)
7. Update metadata with storage locations
8. Return success to client
Read Path:
1. Client requests object
2. Metadata service returns storage locations
3. Storage service reads from nearest node
4. If node unavailable, read from replica
5. Verify checksum
6. Return object to client
Failure Handling:
- Detect failed nodes (health checks)
- Trigger re-replication from healthy replicas
- Maintain replication factor
- Background repair process
7. Consistency Model
Eventual Consistency (Default)
Behavior:
- Writes may take time to propagate
- Reads may see stale data temporarily
- Eventually all replicas converge
- Best for high availability and performance
Use Cases:
- Most object storage operations
- Non-critical data
- High-throughput scenarios
Example:
Write to Node A → Replicate to B, C
Read from Node B (may see old data)
Eventually all nodes have latest data
Strong Consistency (Optional)
Behavior:
- Read-after-write consistency
- All reads see latest write
- Higher latency
- Lower throughput
Implementation:
- Read from quorum (2 of 3 replicas)
- Compare versions/timestamps
- Return latest version
Use Cases:
- Critical data
- When consistency is required
- Lower throughput acceptable
Scalability Design
Horizontal Scaling
Metadata Service:
- Stateless design
- Load balancer distributes requests
- Scale by adding instances
- No shared state
Storage Nodes:
- Add nodes to increase capacity
- Automatic data rebalancing
- Consistent hashing for object placement
- No single bottleneck
Metadata Database:
- Cassandra: Add nodes, automatic sharding
- Partition by bucket_name
- Replication across data centers
Partitioning Strategy
Object Placement:
- Consistent hashing based on object key
- Ensures even distribution
- Minimal rebalancing on node addition/removal
Example:
Hash(object_key) → position on hash ring → first node clockwise
Object "photos/img1.jpg" → ring position → Node 5
Object "photos/img2.jpg" → ring position → Node 2
(A plain Hash(key) % num_nodes scheme would remap nearly every object when a node is added or removed; consistent hashing moves only ~1/N of the keys.)
Metadata Partitioning:
- Partition by (bucket_name, object_key)
- Ensures metadata locality
- Efficient prefix queries
Load Balancing
Strategies:
- Round Robin: Distribute evenly
- Least Connections: Route to least busy
- Geographic: Route to nearest data center
- Consistent Hashing: Same object → same node
Multi-Level:
- Global load balancer (DNS)
- Regional load balancers
- Service-level load balancers
Performance Optimization
Caching Strategy
1. Metadata Cache (Redis)
- Cache frequently accessed metadata
- TTL: 5-15 minutes
- Invalidate on updates
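A cache-aside sketch with redis-py; the connection details and 10-minute TTL are illustrative, and get_from_db stands in for the metadata-database lookup:

import json, redis

r = redis.Redis(host="localhost", port=6379)  # illustrative connection

def get_object_metadata(bucket, key):
    cache_key = f"meta:{bucket}:{key}"
    cached = r.get(cache_key)
    if cached is not None:
        return json.loads(cached)              # cache hit
    meta = get_from_db(bucket, key)            # hypothetical DB lookup
    r.setex(cache_key, 600, json.dumps(meta))  # cache for 10 minutes
    return meta

def on_metadata_update(bucket, key):
    r.delete(f"meta:{bucket}:{key}")           # invalidate on writes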
2. CDN (Content Delivery Network)
- Cache popular objects at edge
- Reduce latency for global users
- Cache-Control headers
3. Object Cache
- Cache hot objects in memory
- LRU eviction policy
- Reduce disk I/O
Compression
Storage Compression:
- Compress objects before storage
- Gzip, Snappy, LZ4
- Trade-off: CPU vs storage cost
Transfer Compression:
- Compress during transfer
- Reduce bandwidth costs
- Client can request compression
Parallel Operations
Multi-Part Upload:
- Split large objects into parts
- Upload parts in parallel
- Merge parts on server
- Resume interrupted uploads
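A client-side sketch of parallel multi-part upload using a thread pool; initiate_upload, upload_part, and complete_upload are hypothetical calls against the multipart API, and the 8 MB part size is illustrative:

from concurrent.futures import ThreadPoolExecutor

PART_SIZE = 8 * 1024 * 1024  # 8 MB parts, illustrative

def read_parts(path):
    with open(path, "rb") as f:
        number = 1
        while data := f.read(PART_SIZE):
            yield number, data
            number += 1

def multipart_upload(path, bucket, key):
    upload_id = initiate_upload(bucket, key)           # hypothetical API
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(upload_part, upload_id, n, data)
                   for n, data in read_parts(path)]
        etags = [f.result() for f in futures]          # per-part ETags
    complete_upload(upload_id, etags)                  # server merges parts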
Parallel Reads:
- Read from multiple replicas
- Combine chunks in parallel
- Faster retrieval
Network Optimization
Bandwidth Management:
- Throttling per user/bucket
- Quality of Service (QoS)
- Prioritize critical operations
Connection Pooling:
- Reuse connections
- Reduce connection overhead
- Keep-alive connections
Data Integrity
Checksums
ETag (Entity Tag):
- MD5 hash of object content
- Verify integrity on upload/download
- Detect corruption
Process:
Upload: Client calculates MD5 → Server verifies
Download: Server sends ETag → Client verifies
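Client-side ETag computation is a streaming MD5 over the object bytes (note that for multi-part uploads, S3-style ETags are not a plain MD5 of the whole object); a sketch:

import hashlib

def compute_etag(path, block=1024 * 1024):
    """MD5 of the file contents, read in 1 MB blocks to bound memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(block):
            md5.update(chunk)
    return md5.hexdigest()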
Storage-Level Checksums:
- Checksum per chunk
- Verify on read
- Background verification
Corruption Detection
Background Scans:
- Periodically verify all objects
- Compare checksums
- Trigger repair if corruption detected
Repair Process:
- Detect corrupted chunk
- Recover from other replicas
- Re-replicate to maintain durability
Cost Optimization
Storage Tiers
1. Hot Storage
- Frequently accessed data
- Low latency (SSD)
- Higher cost
- Use case: Active data
2. Warm Storage
- Occasionally accessed data
- Medium latency (HDD)
- Medium cost
- Use case: Archives
3. Cold Storage
- Rarely accessed data
- High latency (tape/glacier)
- Low cost
- Use case: Long-term backup
Lifecycle Policies:
Hot → Warm (after 30 days)
Warm → Cold (after 90 days)
Delete (after 1 year)
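Expressed as a policy document (the JSON shape below is illustrative, loosely modeled on S3 lifecycle configuration; WARM/COLD are placeholder storage-class names):

{
  "Rules": [{
    "Prefix": "",
    "Status": "Enabled",
    "Transitions": [
      { "Days": 30, "StorageClass": "WARM" },
      { "Days": 90, "StorageClass": "COLD" }
    ],
    "Expiration": { "Days": 365 }
  }]
}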
Erasure Coding
Cost Savings:
- 3x replication: 3x storage cost
- 6+3 erasure coding: 1.5x storage cost
- 50% cost reduction
Trade-off:
- Higher CPU for encoding/decoding
- Slower recovery (need to reconstruct)
- Good for warm/cold data
Compression
Storage Savings:
- Compress objects before storage
- 30-70% reduction typical
- Trade-off: CPU cost
Deduplication:
- Detect duplicate objects
- Store once, reference multiple times
- Significant savings for backups
Security
Authentication
Methods:
- API Keys: Access key + secret key
- IAM: Identity and Access Management
- OAuth 2.0: Token-based
- Federated Identity: SAML, OpenID Connect
Request Signing:
- HMAC-SHA256 signature
- Include timestamp (prevent replay)
- Include request parameters
Authorization
Access Control Lists (ACLs):
- Per-object permissions
- Read, write, delete permissions
- User/group-based
Bucket Policies:
- JSON policy documents
- Fine-grained permissions
- IP-based restrictions
Signed URLs:
- Temporary access URLs
- Expiration time
- No authentication needed
- Use case: Public sharing
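A sketch of generating such a URL with HMAC-SHA256; the query-parameter names and string-to-sign are simplified assumptions (real schemes such as AWS SigV4 sign a fuller canonical request):

import hashlib, hmac, time
from urllib.parse import urlencode

def sign_url(base, bucket, key, secret, expires_in=3600):
    expires = int(time.time()) + expires_in
    to_sign = f"GET\n/{bucket}/{key}\n{expires}"       # simplified string-to-sign
    sig = hmac.new(secret.encode(), to_sign.encode(),
                   hashlib.sha256).hexdigest()
    qs = urlencode({"Expires": expires, "Signature": sig})
    return f"{base}/{bucket}/{key}?{qs}"

# The server recomputes the signature and rejects expired or tampered URLs.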
Encryption
Encryption at Rest:
- AES-256 encryption
- Per-object encryption keys
- Key management service (KMS)
Encryption in Transit:
- TLS 1.3 for HTTPS
- End-to-end encryption option
- Certificate management
Key Management:
- Centralized key service
- Key rotation
- Audit logging
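Per-object keys are typically implemented as envelope encryption: each object is encrypted with its own data key, and that data key is stored wrapped (encrypted) by a KMS-held master key. A sketch using the cryptography package's Fernet as a stand-in cipher (Fernet is AES-128-CBC with HMAC, not the AES-256 mode a production system would use):

from cryptography.fernet import Fernet

master = Fernet(Fernet.generate_key())         # held by the KMS in practice

def encrypt_object(plaintext: bytes):
    data_key = Fernet.generate_key()           # one data key per object
    ciphertext = Fernet(data_key).encrypt(plaintext)
    wrapped_key = master.encrypt(data_key)     # store alongside the object
    return ciphertext, wrapped_key

def decrypt_object(ciphertext: bytes, wrapped_key: bytes):
    data_key = master.decrypt(wrapped_key)     # unwrap via KMS master key
    return Fernet(data_key).decrypt(ciphertext)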
Monitoring & Observability
Key Metrics
Performance:
- Request latency (p50, p95, p99)
- Throughput (requests/second)
- Error rates
- Cache hit ratio
Storage:
- Total storage used
- Storage by tier
- Replication status
- Node capacity
Durability:
- Replication factor
- Failed replications
- Corruption events
- Repair progress
Cost:
- Storage costs by tier
- Bandwidth costs
- Request costs
- Cost per GB stored
Logging
Structured Logging:
- All API requests logged
- Include: user, bucket, key, timestamp, latency
- Centralized logging (ELK stack)
Audit Logging:
- All data access logged
- Compliance requirements
- Security monitoring
Alerting
Critical Alerts:
- Service downtime
- High error rates
- Storage capacity thresholds
- Replication failures
- Security breaches
Disaster Recovery
Backup Strategy
Metadata Backups:
- Daily snapshots
- Point-in-time recovery
- Cross-region backups
Object Backups:
- Replication across regions
- Versioning enabled
- Immutable backups
Recovery Procedures
RTO (Recovery Time Objective):
- Target: < 1 hour
- Automated failover
- Health checks
RPO (Recovery Point Objective):
- Target: < 15 minutes
- Continuous replication
- Transaction logs
Multi-Region Deployment
Active-Active:
- Multiple regions serving traffic
- Data replicated across regions
- Automatic failover
Active-Passive:
- Primary region active
- Secondary region standby
- Failover on primary failure
Technology Stack
| Component | Technology | Rationale |
|---|---|---|
| API Gateway | Envoy / NGINX | High performance, routing |
| Metadata DB | Cassandra / DynamoDB | High write throughput, scalability |
| Cache | Redis Cluster | Fast metadata caching |
| Storage Nodes | Custom / Ceph | Distributed storage |
| Load Balancer | HAProxy / AWS ELB | Request distribution |
| CDN | CloudFront / Cloudflare | Global content delivery |
| Monitoring | Prometheus + Grafana | Metrics and visualization |
| Logging | ELK Stack | Centralized logging |
Comparison with Real Systems
Amazon S3
Similarities:
- Object storage model
- RESTful API
- High durability (11 9’s)
- Eventual consistency
- Lifecycle policies
Differences:
- S3 uses proprietary storage layer
- More advanced features (Lambda triggers, etc.)
- Global infrastructure
Azure Blob Storage
Similarities:
- Blob storage model
- Tiered storage
- Access tiers (Hot, Cool, Archive)
Differences:
- Integrated with Azure ecosystem
- Different API design
Google Cloud Storage
Similarities:
- Object storage
- Multi-region support
- Strong consistency option
Differences:
- Different API design
- Integrated with GCP
Future Enhancements
- Advanced Features
- Object versioning
- Object locking (WORM)
- Cross-region replication
- Event notifications
- Performance
- Multi-part upload optimization
- Transfer acceleration
- Intelligent tiering
- Integration
- Database backups
- Data lake integration
- Analytics integration
- AI/ML
- Content analysis
- Automatic tagging
- Anomaly detection
What Interviewers Look For
Storage Systems Skills
- Object Storage Architecture
- Metadata vs. data separation
- Consistent hashing
- Multi-copy replication
- Red Flags: No separation, poor distribution, single copy
- Durability Design
- 11 9’s durability target
- Erasure coding
- Multi-region replication
- Red Flags: No durability strategy, single region, no redundancy
- Scalability Design
- Horizontal scaling
- Partitioning strategy
- Load distribution
- Red Flags: Vertical scaling, bottlenecks, poor distribution
Distributed Systems Skills
- Consistency Models
- Eventual consistency (default)
- Strong consistency (optional)
- Red Flags: Wrong consistency model, no understanding
- Replication Strategy
- Multi-copy for hot data
- Erasure coding for cold data
- Red Flags: Single strategy, no optimization, high costs
- Metadata Management
- High write throughput
- Efficient queries
- Red Flags: Wrong database, slow queries, bottlenecks
Problem-Solving Approach
- Cost Optimization
- Storage tiers
- Erasure coding
- Compression
- Red Flags: No optimization, high costs, single tier
- Edge Cases
- Node failures
- Network partitions
- Data corruption
- Red Flags: Ignoring failures, no recovery, data loss
- Trade-off Analysis
- Durability vs. cost
- Consistency vs. performance
- Red Flags: No trade-offs, dogmatic choices
System Design Skills
- Component Design
- Metadata service
- Storage service
- API gateway
- Red Flags: Monolithic, unclear boundaries
- Performance Optimization
- Multi-layer caching
- CDN integration
- Parallel operations
- Red Flags: No caching, slow operations, no optimization
- Security Design
- Encryption at rest
- Encryption in transit
- Access control
- Red Flags: No encryption, insecure, no access control
Communication Skills
- Architecture Explanation
- Can explain metadata/data separation
- Understands durability strategies
- Red Flags: No understanding, vague explanations
- Trade-off Justification
- Explains design decisions
- Discusses alternatives
- Red Flags: No justification, no alternatives
Meta-Specific Focus
- Storage Systems Expertise
- Deep understanding of object storage
- Durability and consistency
- Key: Show storage systems expertise
- Scale Thinking
- Exabytes of data
- Billions of objects
- Key: Demonstrate scale expertise
Conclusion
Designing a blob storage system like S3 requires careful consideration of:
- Architecture: Separate metadata and storage layers
- Durability: Multi-copy replication or erasure coding
- Scalability: Horizontal scaling, consistent hashing
- Performance: Caching, CDN, parallel operations
- Cost: Storage tiers, erasure coding, compression
- Security: Encryption, authentication, authorization
Key Design Decisions:
- Metadata DB: Cassandra for high write throughput
- Replication: 3x replication for hot data, erasure coding for warm/cold
- Consistency: Eventual consistency (default), strong consistency (optional)
- Caching: Multi-layer caching (Redis, CDN)
- Storage: Distributed nodes with consistent hashing
The system is designed to scale to exabytes of data while maintaining high durability, availability, and performance at reasonable cost.