Introduction
Designing a key-value store for a local system with specific hardware constraints (2TB disk, 8 CPU cores, 32GB RAM) requires careful consideration of memory management, disk I/O optimization, and efficient data structures. Scaling this system horizontally across multiple nodes introduces challenges in sharding, replication, consistency, and distributed coordination.
This post provides a detailed walkthrough of designing a key-value store for a single-node system and then scaling it horizontally, covering storage architecture, indexing strategies, memory management, sharding algorithms, replication strategies, consistency models, and distributed coordination. This is a common system design interview question that tests your understanding of storage systems, distributed systems, database internals, and scalability patterns.
Table of Contents
- Problem Statement
- Requirements
- Capacity Estimation
- Core Entities
- API
- Data Flow
- Database Design
- High-Level Design
- Deep Dive
- Summary
Problem Statement
Design a key-value store for a local system with the following hardware constraints:
Single Node Specifications:
- Storage: 2TB hard drive
- CPU: 8 cores
- Memory: 32GB RAM
Requirements:
- Store key-value pairs efficiently
- Support fast reads and writes
- Handle more data than memory capacity
- Support range queries
- Handle concurrent operations
- Provide durability guarantees
- Support TTL (time-to-live) for keys
- Support batch operations
Horizontal Scaling Requirements:
- Scale across multiple nodes
- Distribute data evenly
- Handle node failures
- Maintain consistency
- Support replication
- Handle rebalancing
Scale Requirements:
- Single Node: 1 billion+ key-value pairs
- Total Data: 1TB+ (fits on the 2TB disk)
- Average Value Size: 1KB
- Peak Operations: 100,000 ops/second per node
- Read Latency: < 1ms (memory), < 10ms (disk)
- Write Latency: < 5ms (memory), < 50ms (disk)
- Horizontal Scaling: 10-1000 nodes
Requirements
Functional Requirements
Core Features:
- Basic Operations: Get, Put, Delete key-value pairs
- Range Queries: Query by key range
- TTL Support: Expire keys after TTL
- Batch Operations: Batch get/put/delete
- Durability: Persist data to disk
- Concurrency: Handle concurrent operations
- Compaction: Compact old data files
Horizontal Scaling Features:
- Sharding: Distribute data across nodes
- Replication: Replicate data for availability
- Consistency: Maintain consistency across replicas
- Rebalancing: Rebalance data on node addition/removal
- Failure Handling: Handle node failures gracefully
- Load Balancing: Distribute requests across nodes
Out of Scope:
- Secondary indexes (focus on primary key operations)
- Transactions (focus on single-key operations)
- Complex queries (focus on key-value operations)
- Multi-region deployment (focus on single region)
Non-Functional Requirements
Single Node:
- Performance:
- Read: < 1ms (memory), < 10ms (disk)
- Write: < 5ms (memory), < 50ms (disk)
- Durability: All writes persisted to disk
- Memory Efficiency: Efficient memory usage
- Disk Efficiency: Efficient disk usage
Horizontal Scaling:
- Availability: 99.9% uptime
- Consistency: Eventual or strong consistency
- Scalability: Linear scaling with nodes
- Performance: Maintain low latency with scaling
Capacity Estimation
Traffic Estimates
Single Node:
- Total Keys: 1 billion
- Average Value Size: 1KB
- Total Data: 1TB
- Peak Operations: 100,000 ops/second
- Read/Write Ratio: 80/20
- Reads: 80,000/sec
- Writes: 20,000/sec
Multi-Node (100 nodes):
- Total Keys: 100 billion
- Total Data: 100TB
- Peak Operations: 10 million ops/second
- Per Node: 100,000 ops/second
Storage Estimates
Single Node:
- Data: 1TB
- Indexes: 10GB (1% of data)
- Metadata: 1GB
- WAL (Write-Ahead Log): 10GB
- Total: ~1.02TB (fits comfortably on the 2TB disk)
Multi-Node (100 nodes):
- Unique Data per Node: 1TB
- Replication Factor 3: ~3TB stored per node
- Total Storage: 300TB across the cluster (100TB unique × 3)
- Note: 3TB per node exceeds the 2TB disk, so with replication factor 3 each node can realistically hold only ~600GB of unique data, or the cluster needs roughly 3x as many nodes
Bandwidth Estimates
Single Node:
- Disk I/O: 100,000 ops/sec × 1KB = 100MB/s
- Memory Bandwidth: 100,000 ops/sec × 1KB = 100MB/s
Multi-Node:
- Network I/O: 10M ops/sec × 1KB = 10GB/s aggregate (~100MB/s per node)
- Replication Traffic: writes are ~20% of traffic (2GB/s); shipping each write to 2 additional replicas adds ~4GB/s of inter-node traffic
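These figures are simple to sanity-check. A short back-of-envelope script mirroring the arithmetic above (decimal units, same ratio and size assumptions as the estimates):

# Back-of-envelope check of the capacity numbers above
KB, MB, GB = 1000, 1000**2, 1000**3
ops_per_node = 100_000              # peak ops/second per node
value_size = 1 * KB                 # average value size (1KB)
read_ratio, write_ratio = 0.8, 0.2  # 80/20 read/write split
replication_factor = 3
nodes = 100

reads_per_sec = ops_per_node * read_ratio             # 80,000 reads/sec per node
writes_per_sec = ops_per_node * write_ratio           # 20,000 writes/sec per node
node_bandwidth = ops_per_node * value_size            # 100 MB/s per node

cluster_bandwidth = nodes * node_bandwidth             # 10 GB/s aggregate
write_bandwidth = cluster_bandwidth * write_ratio      # 2 GB/s of writes
replication_traffic = write_bandwidth * (replication_factor - 1)  # 4 GB/s extra

print(f"per node: {reads_per_sec:.0f} reads/s, {writes_per_sec:.0f} writes/s, "
      f"{node_bandwidth / MB:.0f} MB/s")
print(f"cluster: {cluster_bandwidth / GB:.0f} GB/s total, "
      f"{replication_traffic / GB:.0f} GB/s replication traffic")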
Core Entities
Key-Value Entry
- key (BINARY/VARCHAR, primary key)
- value (BINARY/BLOB)
- value_size (INT, bytes)
- timestamp (TIMESTAMP)
- ttl (INT, seconds, optional)
- version (INT, for MVCC)
Index Entry
- key (BINARY/VARCHAR)
- file_id (INT)
- offset (BIGINT, byte offset in file)
- size (INT, value size)
Data File
- file_id (INT)
- file_path (VARCHAR)
- file_size (BIGINT)
- key_count (INT)
- created_at (TIMESTAMP)
- compacted (BOOLEAN)
Node Metadata
- node_id (VARCHAR)
- address (VARCHAR, IP:port)
- shard_range_start (VARCHAR)
- shard_range_end (VARCHAR)
- status (active, inactive, recovering)
- last_heartbeat (TIMESTAMP)
Replica Group
- group_id (VARCHAR)
- shard_key_range (VARCHAR)
- primary_node_id (VARCHAR)
- replica_node_ids (JSON array)
- replication_factor (INT)
API
1. Put Key-Value
PUT /api/v1/kv/{key}
Request Body:
{
"value": "base64_encoded_value",
"ttl": 3600 // optional, seconds
}
Response:
{
"key": "user:123",
"status": "success",
"version": 1
}
2. Get Key-Value
GET /api/v1/kv/{key}
Response:
{
"key": "user:123",
"value": "base64_encoded_value",
"version": 1,
"ttl": 3600,
"expires_at": "2025-11-14T10:00:00Z"
}
3. Delete Key-Value
DELETE /api/v1/kv/{key}
Response:
{
"key": "user:123",
"status": "deleted"
}
4. Range Query
GET /api/v1/kv/range?start={start_key}&end={end_key}&limit=100
Response:
{
"keys": [
{
"key": "user:100",
"value": "...",
"version": 1
},
{
"key": "user:101",
"value": "...",
"version": 1
}
],
"total": 100,
"limit": 100
}
5. Batch Put
POST /api/v1/kv/batch
Request Body:
{
"entries": [
{"key": "user:1", "value": "..."},
{"key": "user:2", "value": "..."}
]
}
Response:
{
"succeeded": 2,
"failed": 0
}
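For reference, here is what calling these endpoints might look like from a small Python client; the base URL and the use of the requests library are illustrative assumptions, not part of the design.

import base64
import requests

BASE_URL = "http://localhost:8080/api/v1/kv"  # assumed server address

# Put a key with a 1-hour TTL
value = base64.b64encode(b"hello world").decode()
requests.put(f"{BASE_URL}/user:123", json={"value": value, "ttl": 3600})

# Get it back
resp = requests.get(f"{BASE_URL}/user:123")
print(base64.b64decode(resp.json()["value"]))  # b'hello world'

# Range query over a key range
resp = requests.get(f"{BASE_URL}/range",
                    params={"start": "user:100", "end": "user:200", "limit": 100})

# Batch put
requests.post(f"{BASE_URL}/batch", json={"entries": [
    {"key": "user:1", "value": base64.b64encode(b"a").decode()},
    {"key": "user:2", "value": base64.b64encode(b"b").decode()},
]})

# Delete
requests.delete(f"{BASE_URL}/user:123")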
Data Flow
Write Flow (Single Node)
- Client Writes Key-Value:
  - Client sends PUT request
  - API Server receives the request
- WAL Append (WAL Writer):
  - Appends the entry to the Write-Ahead Log (WAL) before acknowledging
  - Ensures durability: the write can be replayed if the node crashes before the buffer is flushed
- Memory Write (Write Buffer):
  - Stores the key-value in the in-memory write buffer (MemTable)
  - Updates the in-memory index
  - Returns success to the API Server
- SSTable Flush (Compaction Manager, asynchronous):
  - Periodically flushes the full buffer to a new SSTable file
  - Creates the new SSTable file and updates the index
- Response:
  - API Server returns success
  - Client continues
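A condensed sketch of that write path, assuming the MemoryManager and WriteAheadLog components sketched later in this post; the class name SingleNodeStore is illustrative.

class SingleNodeStore:
    """Illustrative single-node write path: WAL append for durability,
    then a MemTable insert; flushing to SSTables happens in the background."""
    def __init__(self, wal, memory_manager):
        self.wal = wal                 # WriteAheadLog (sketched in Database Design below)
        self.memory = memory_manager   # MemoryManager (sketched in the Deep Dive below)

    def put(self, key, value):
        # 1. Durability: append to the WAL before acknowledging the client
        self.wal.append('PUT', key, value)
        # 2. Fast path: insert into the in-memory write buffer and index;
        #    the MemoryManager flushes the buffer to a new SSTable once it fills up
        self.memory.put(key, value)
        return {'key': key, 'status': 'success'}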
Read Flow (Single Node)
- Client Reads Key-Value:
  - Client sends GET request
  - API Server receives the request
- Memory Lookup (Read Path):
  - Checks the write buffer first
  - Checks the read cache and in-memory index
  - Returns immediately if found
- Disk Lookup (SSTable Reader):
  - Uses the index to find the right SSTable file
  - Searches SSTable files, newest first
  - Reads the value from disk
- Response:
  - API Server returns the value
  - Client receives the data
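A matching sketch of the read path, including the lazy TTL check that the requirements call for but the flow above leaves implicit; the ttl_index structure and the ReadPath class name are assumptions.

import time

class ReadPath:
    """Illustrative lookup order: write buffer -> read cache -> SSTables,
    with lazy TTL expiry on read. Sketch only."""
    def __init__(self, memory_manager, disk_storage, ttl_index=None):
        self.memory = memory_manager
        self.disk = disk_storage
        self.ttl_index = ttl_index or {}   # key -> absolute expiry timestamp (assumed structure)

    def get(self, key):
        # Lazily expire keys whose TTL has passed
        expires_at = self.ttl_index.get(key)
        if expires_at is not None and time.time() >= expires_at:
            return None
        # 1. Memory first (write buffer, then LRU cache)
        value = self.memory.get(key)
        if value is not None:
            return value
        # 2. Fall back to SSTables on disk, newest file first
        value = self.disk.read(key)
        if value is not None:
            self.memory.read_cache.put(key, value)  # populate cache for next time
        return value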
Write Flow (Multi-Node)
- Client Writes Key-Value:
  - Client sends PUT request
  - Load Balancer routes to the appropriate node
- Shard Selection (Router):
  - Hashes the key to determine the shard
  - Routes to the primary node for that shard
- Primary Write (Primary Node):
  - Writes to local storage
  - Updates the local index
- Replication (Primary Node):
  - Replicates to replica nodes
  - Waits for acknowledgments
  - Returns success
- Response:
  - Primary Node returns success
  - Client receives confirmation
Read Flow (Multi-Node)
- Client Reads Key-Value:
  - Client sends GET request
  - Load Balancer routes the request
- Shard Selection (Router):
  - Hashes the key to determine the shard
  - Routes to any node in the replica group
- Node Read (Node):
  - Reads from local storage
  - Returns the value
- Response:
  - Node returns the value
  - Client receives the data
Database Design
Schema Design
Key-Value Storage (SSTable Format):
File Structure:
- Header (metadata, version, key_count)
- Data Blocks (key-value pairs, sorted by key)
- Index Block (key -> offset mapping)
- Footer (checksum, magic number)
In-Memory Index:
# B-Tree or Hash Table
{
"key": {
"file_id": 1,
"offset": 1024,
"size": 1024,
"timestamp": "2025-11-13T10:00:00Z"
}
}
WAL (Write-Ahead Log) Format:
Entry Format:
- Operation Type (PUT/DELETE)
- Key Length (4 bytes)
- Key (variable)
- Value Length (4 bytes)
- Value (variable)
- Timestamp (8 bytes)
- Checksum (4 bytes)
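A minimal append-side sketch of this entry format, matching the WriteAheadLog class referenced later in DiskStorage; the one-byte op code and the fsync-per-append policy are assumptions.

import os
import struct
import time
import zlib

class WriteAheadLog:
    """Appends length-prefixed, checksummed entries to a log file. Sketch only."""
    OP_PUT, OP_DELETE = 1, 2

    def __init__(self, path):
        self.path = path
        self.file = open(path, 'ab')

    def append(self, op, key, value=b''):
        key_bytes = key.encode() if isinstance(key, str) else key
        value_bytes = value.encode() if isinstance(value, str) else value
        op_code = self.OP_PUT if op == 'PUT' else self.OP_DELETE
        timestamp = int(time.time() * 1000)
        # Body: op (1, assumed width) + key_len (4) + key + value_len (4) + value + timestamp (8)
        body = (struct.pack('<BI', op_code, len(key_bytes)) + key_bytes +
                struct.pack('<I', len(value_bytes)) + value_bytes +
                struct.pack('<Q', timestamp))
        checksum = zlib.crc32(body)
        self.file.write(body + struct.pack('<I', checksum))
        # fsync so the write survives a crash before the client is acknowledged
        self.file.flush()
        os.fsync(self.file.fileno())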
Database Sharding Strategy
Consistent Hashing:
- Hash key to determine shard
- Distribute keys evenly across nodes
- Handle node addition/removal gracefully
Shard Assignment:
- Each node responsible for key range
- Virtual nodes for better distribution
- Rebalance on node changes
High-Level Design
Single Node Architecture
┌─────────────────────────────────────────────────────────────┐
│ API Server │
│ - Request handling │
│ - Concurrency control │
└────────────────────┬────────────────────────────────────────┘
│
│
┌─────────────────────▼───────────────────────────────────────┐
│ Memory Layer (32GB RAM) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Write Buffer │ │ Read Cache │ │ Index │ │
│ │ (MemTable) │ │ (LRU) │ │ (B-Tree) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────┬───────────────────────────────────────┘
│
│ Flush / Read
│
┌─────────────────────▼───────────────────────────────────────┐
│ Disk Layer (2TB) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ WAL │ │ SSTables │ │ Index Files │ │
│ │ (Write Log) │ │ (Immutable) │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└──────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Compaction Manager │
│ - Merge SSTables │
│ - Remove duplicates │
│ - Optimize storage │
└─────────────────────────────────────────────────────────────┘
Multi-Node Architecture
┌─────────────┐
│ Client │
└──────┬──────┘
│
│ HTTP
│
┌──────▼──────────────────────────────────────────────┐
│ Load Balancer │
│ - Request routing │
└──────┬──────────────────────────────────────────────┘
│
│
┌──────▼──────────────────────────────────────────────┐
│ Router / Coordinator │
│ - Consistent hashing │
│ - Shard mapping │
│ - Node discovery │
└──────┬──────────────────────────────────────────────┘
│
│
├──────────────┬──────────────┬──────────────┐
│ │ │ │
┌──────▼──────┐ ┌─────▼──────┐ ┌─────▼──────┐ ┌─────▼──────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │ │ Node N │
│ │ │ │ │ │ │ │
│ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │
│ │ Memory │ │ │ │ Memory │ │ │ │ Memory │ │ │ │ Memory │ │
│ │ (32GB) │ │ │ │ (32GB) │ │ │ │ (32GB) │ │ │ │ (32GB) │ │
│ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │
│ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │
│ │ Disk │ │ │ │ Disk │ │ │ │ Disk │ │ │ │ Disk │ │
│ │ (2TB) │ │ │ │ (2TB) │ │ │ │ (2TB) │ │ │ │ (2TB) │ │
│ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │
└──────┬──────┘ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘
│ │ │ │
│ │ │ │
└──────────────┴──────────────┴──────────────┘
│
│ Replication
│
┌─────────────────────▼───────────────────────────────────────┐
│ Replication Layer │
│ - Primary-Replica replication │
│ - Consistency guarantees │
└─────────────────────────────────────────────────────────────┘
Deep Dive
Component Design
1. Memory Management (Single Node)
Responsibilities:
- Manage 32GB RAM efficiently
- Store hot data in memory
- Evict cold data to disk
- Maintain indexes in memory
Key Design Decisions:
- Write Buffer: 1-2GB for writes
- Read Cache: 10-20GB LRU cache
- Index: 5-10GB for indexes
- Eviction: LRU for cache, flush for buffer
Implementation:
import time

class MemoryManager:
    def __init__(self, disk_manager=None, total_memory_gb=32):
        self.total_memory = total_memory_gb * 1024 * 1024 * 1024  # bytes
        self.write_buffer_size = 2 * 1024 * 1024 * 1024   # 2GB write buffer (MemTable)
        self.read_cache_size = 20 * 1024 * 1024 * 1024    # 20GB LRU read cache
        self.index_size = 10 * 1024 * 1024 * 1024         # 10GB budget for the index
        self.disk_manager = disk_manager    # flush target, e.g. DiskStorage below
        self.write_buffer = {}              # MemTable: key -> {'value', 'timestamp'}
        self.read_cache = LRUCache(capacity=self.read_cache_size)  # LRU cache (sketch below)
        self.index = BTree()                # in-memory index (conceptual; any ordered map works)

    def get_buffer_size(self):
        """Approximate bytes held by the write buffer"""
        return sum(len(k) + len(v['value']) for k, v in self.write_buffer.items())

    def put(self, key, value):
        """Put key-value in write buffer"""
        # Flush first if the buffer is full
        if self.get_buffer_size() >= self.write_buffer_size:
            self.flush_buffer()
        # Add to buffer
        self.write_buffer[key] = {
            'value': value,
            'timestamp': time.time()
        }
        # Update index: the freshest copy lives in the buffer
        self.index.put(key, {
            'in_memory': True,
            'buffer': True
        })

    def get(self, key):
        """Get key-value from memory"""
        # Check write buffer first (always the newest version)
        if key in self.write_buffer:
            value = self.write_buffer[key]['value']
            self.read_cache.put(key, value)
            return value
        # Check read cache
        cached = self.read_cache.get(key)
        if cached is not None:
            return cached
        # Not in memory, return None (caller falls back to disk)
        return None

    def flush_buffer(self):
        """Flush write buffer to disk as a new SSTable"""
        if not self.write_buffer:
            return
        # DiskStorage.write_sstable sorts keys and builds the per-file index
        data = {key: entry['value'] for key, entry in self.write_buffer.items()}
        self.disk_manager.write_sstable(data)
        # Clear buffer
        self.write_buffer.clear()
        # Repoint index entries at the new SSTable (assumed helper)
        self.update_index_after_flush(data)
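The LRUCache used by the read cache above is not spelled out in this post; a minimal byte-budgeted sketch built on collections.OrderedDict might look like this.

from collections import OrderedDict

class LRUCache:
    """Minimal byte-budgeted LRU cache for the read path. Sketch only; the
    capacity is a byte budget to match MemoryManager.read_cache_size."""
    def __init__(self, capacity):
        self.capacity = capacity          # max bytes of cached values
        self.current_size = 0
        self.entries = OrderedDict()      # key -> value (bytes or str)

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)     # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.current_size -= len(self.entries.pop(key))
        self.entries[key] = value
        self.current_size += len(value)
        # Evict least recently used entries until back under budget
        while self.current_size > self.capacity and self.entries:
            _, evicted = self.entries.popitem(last=False)
            self.current_size -= len(evicted)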
2. Disk Storage (Single Node)
Responsibilities:
- Store data efficiently on 2TB disk
- Organize data in SSTables
- Maintain WAL for durability
- Support efficient reads
Key Design Decisions:
- SSTable Format: Immutable sorted files
- WAL: Write-Ahead Log for durability
- Compaction: Merge SSTables periodically
- Index Files: Separate index files for fast lookup
Implementation:
import json
import os
import struct

HEADER_SIZE = 1024  # fixed-size header block; data blocks start right after it

class DiskStorage:
    def __init__(self, disk_path, max_disk_size=2 * 1024 * 1024 * 1024 * 1024):
        self.disk_path = disk_path
        self.max_disk_size = max_disk_size  # 2TB
        self.sstables = []  # List of SSTable files
        self.wal = WriteAheadLog(os.path.join(disk_path, 'wal'))

    def write_sstable(self, data):
        """Write data to new SSTable"""
        # Sort data by key
        sorted_data = sorted(data.items(), key=lambda x: x[0])
        # Create SSTable file
        sstable_id = len(self.sstables)
        sstable_path = os.path.join(self.disk_path, f'sstable_{sstable_id}.sst')
        # Write SSTable
        with open(sstable_path, 'wb') as f:
            # Write fixed-size header: version, key_count, data_start
            header = struct.pack('<IIQ', 1, len(sorted_data), HEADER_SIZE)
            f.write(header)
            f.write(b'\x00' * (HEADER_SIZE - len(header)))  # pad the header block
            # Write data blocks
            offset = HEADER_SIZE
            index = {}
            for key, value in sorted_data:
                key_bytes = key.encode() if isinstance(key, str) else key
                value_bytes = value.encode() if isinstance(value, str) else value
                # Entry format: key_len (4) + key + value_len (4) + value
                entry = struct.pack('<I', len(key_bytes)) + key_bytes + \
                        struct.pack('<I', len(value_bytes)) + value_bytes
                f.write(entry)
                # Index maps the (string) key to its byte offset in this file
                index[key if isinstance(key, str) else key.decode()] = offset
                offset += len(entry)
            # Write index block at the end (JSON for simplicity)
            index_offset = offset
            f.write(json.dumps(index).encode())
            # Write footer with index offset
            f.write(struct.pack('<Q', index_offset))
        # Add to SSTable list
        self.sstables.append({
            'id': sstable_id,
            'path': sstable_path,
            'key_count': len(sorted_data),
            'size': os.path.getsize(sstable_path)
        })

    def read(self, key):
        """Read key from SSTables"""
        # Search SSTables in reverse order (newest first)
        for sstable in reversed(self.sstables):
            value = self.read_from_sstable(sstable, key)
            if value is not None:
                return value
        return None

    def read_from_sstable(self, sstable, key):
        """Read key from specific SSTable"""
        with open(sstable['path'], 'rb') as f:
            # Read header
            header_data = f.read(HEADER_SIZE)
            version, key_count, data_start = struct.unpack('<IIQ', header_data[:16])
            # Read footer to get the index offset
            f.seek(0, 2)
            file_size = f.tell()
            f.seek(file_size - 8)
            index_offset = struct.unpack('<Q', f.read(8))[0]
            # Read index block (everything between index_offset and the footer)
            f.seek(index_offset)
            index = json.loads(f.read(file_size - 8 - index_offset).decode())
            # Check if key exists in this file
            if key not in index:
                return None
            # Seek to the entry and read it back
            f.seek(index[key])
            key_len = struct.unpack('<I', f.read(4))[0]
            f.read(key_len)  # skip the key
            value_len = struct.unpack('<I', f.read(4))[0]
            return f.read(value_len)
3. Sharding (Multi-Node)
Responsibilities:
- Distribute keys across nodes
- Handle node addition/removal
- Rebalance data when needed
- Route requests to correct node
Key Design Decisions:
- Consistent Hashing: Use consistent hashing for sharding
- Virtual Nodes: Use virtual nodes for better distribution
- Replication: Replicate each shard to multiple nodes
- Rebalancing: Rebalance on node changes
Implementation:
import hashlib

class ShardingManager:
    def __init__(self, nodes, replication_factor=3):
        self.nodes = list(nodes)
        self.replication_factor = replication_factor
        self.virtual_nodes_per_node = 100
        self.ring = ConsistentHashRing()
        # Add nodes to ring (one ring point per virtual node)
        for node in self.nodes:
            for i in range(self.virtual_nodes_per_node):
                self.ring.add_node(f"{node['id']}:{i}", node)

    def get_shard_nodes(self, key):
        """Get the primary and replica nodes responsible for a key"""
        key_hash = self.hash_key(key)
        # Primary node: first ring point clockwise from the key's hash
        primary_node = self.ring.get_node(key_hash)
        # Replicas: next distinct physical nodes walking clockwise on the ring
        replica_nodes = []
        seen_ids = {primary_node['id']}
        position = key_hash
        max_steps = len(self.nodes) * self.virtual_nodes_per_node
        for _ in range(max_steps):
            if len(replica_nodes) >= self.replication_factor - 1:
                break
            position, node = self.ring.get_next_node(position)
            if node['id'] not in seen_ids:
                seen_ids.add(node['id'])
                replica_nodes.append(node)
        return {
            'primary': primary_node,
            'replicas': replica_nodes
        }

    def hash_key(self, key):
        """Hash key to a ring position (must match ConsistentHashRing.hash)"""
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

    def add_node(self, node):
        """Add new node to cluster"""
        # Add virtual nodes
        for i in range(self.virtual_nodes_per_node):
            self.ring.add_node(f"{node['id']}:{i}", node)
        self.nodes.append(node)
        # Trigger rebalancing
        self.rebalance()

    def remove_node(self, node_id):
        """Remove node from cluster"""
        # Remove virtual nodes
        for i in range(self.virtual_nodes_per_node):
            self.ring.remove_node(f"{node_id}:{i}")
        self.nodes = [n for n in self.nodes if n['id'] != node_id]
        # Trigger rebalancing
        self.rebalance()

    def rebalance(self):
        """Rebalance data across nodes"""
        # Identify keys whose owner changed (assumed helper)
        keys_to_move = self.identify_keys_to_move()
        # Stream each key from its old owner to its new owner (assumed helper)
        for key, old_node, new_node in keys_to_move:
            self.move_key(key, old_node, new_node)
4. Replication (Multi-Node)
Responsibilities:
- Replicate data across nodes
- Maintain consistency
- Handle replica failures
- Elect new primary on failure
Key Design Decisions:
- Primary-Replica: One primary, multiple replicas
- Synchronous Replication: Wait for majority acknowledgment
- Quorum: Require majority for writes
- Leader Election: Elect new primary on failure
Implementation:
import base64
import requests

class ReplicationError(Exception):
    pass

class ReadError(Exception):
    pass

class ReplicationManager:
    def __init__(self, sharding_manager):
        self.sharding_manager = sharding_manager
        self.replication_factor = 3
        self.quorum = (self.replication_factor // 2) + 1  # majority, e.g. 2 of 3

    def replicate_write(self, key, value, shard_nodes):
        """Replicate write to all replicas"""
        primary = shard_nodes['primary']
        replicas = shard_nodes['replicas']
        # Write to primary
        if not self.write_to_node(primary, key, value):
            raise ReplicationError("Primary write failed")
        # Replicate to replicas (sequential here; real systems fan out in parallel)
        replica_results = []
        for replica in replicas:
            try:
                replica_results.append(self.write_to_node(replica, key, value))
            except Exception:
                # A failed replica does not fail the write as long as quorum holds
                replica_results.append(False)
        # Check quorum (the primary counts as one successful write)
        successful_writes = 1 + sum(1 for r in replica_results if r)
        if successful_writes < self.quorum:
            raise ReplicationError("Quorum not reached")
        return True

    def write_to_node(self, node, key, value):
        """Write to a specific node over HTTP"""
        response = requests.put(
            f"http://{node['address']}/api/v1/kv/{key}",
            json={'value': base64.b64encode(value).decode()},
            timeout=5
        )
        return response.status_code == 200

    def read_with_consistency(self, key, shard_nodes, consistency_level='strong'):
        """Read with the requested consistency level"""
        if consistency_level == 'strong':
            # Read from primary
            return self.read_from_node(shard_nodes['primary'], key)
        # Eventual consistency: read from any node in the replica group
        nodes = [shard_nodes['primary']] + shard_nodes['replicas']
        for node in nodes:
            try:
                return self.read_from_node(node, key)
            except Exception:
                continue
        raise ReadError("All nodes failed")

    def read_from_node(self, node, key):
        """Read from a specific node over HTTP"""
        response = requests.get(
            f"http://{node['address']}/api/v1/kv/{key}",
            timeout=5
        )
        if response.status_code == 200:
            return base64.b64decode(response.json()['value'])
        raise ReadError(f"Read failed: {response.status_code}")
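The design calls for electing a new primary when one fails but does not show a mechanism. A minimal heartbeat-based failover sketch follows; the timeout and promotion rule are illustrative assumptions, and production systems usually delegate this to Raft, Paxos, or a coordinator such as ZooKeeper or etcd.

import time

class FailoverMonitor:
    """Promotes a replica when the primary misses heartbeats. Sketch only."""
    def __init__(self, replica_group, heartbeat_timeout=5.0):
        self.replica_group = replica_group        # {'primary': ..., 'replicas': [...]}
        self.heartbeat_timeout = heartbeat_timeout
        self.last_heartbeat = {}                  # node_id -> last heartbeat timestamp

    def record_heartbeat(self, node_id):
        self.last_heartbeat[node_id] = time.time()

    def check_primary(self):
        primary = self.replica_group['primary']
        last_seen = self.last_heartbeat.get(primary['id'], 0)
        if time.time() - last_seen > self.heartbeat_timeout:
            self.promote_replica()

    def promote_replica(self):
        # Promote the first healthy replica; a real election would also compare
        # replication offsets so the most up-to-date replica wins
        for replica in self.replica_group['replicas']:
            last_seen = self.last_heartbeat.get(replica['id'], 0)
            if time.time() - last_seen <= self.heartbeat_timeout:
                old_primary = self.replica_group['primary']
                self.replica_group['primary'] = replica
                self.replica_group['replicas'].remove(replica)
                self.replica_group['replicas'].append(old_primary)
                return replica
        raise RuntimeError("No healthy replica available for promotion")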
Detailed Design
Consistent Hashing Implementation
Challenge: Distribute keys evenly across nodes
Solution:
- Consistent Hashing Ring: Hash nodes and keys to ring
- Virtual Nodes: Multiple virtual nodes per physical node
- Even Distribution: Better key distribution
Implementation:
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self):
        self.ring = {}          # hash -> node mapping
        self.sorted_keys = []   # sorted hash positions

    def add_node(self, node_id, node_data):
        """Add one ring point; callers (e.g. ShardingManager) add one point per virtual node"""
        hash_key = self.hash(node_id)
        self.ring[hash_key] = node_data
        bisect.insort(self.sorted_keys, hash_key)

    def remove_node(self, node_id):
        """Remove the ring point for a (virtual) node id"""
        hash_key = self.hash(node_id)
        if hash_key in self.ring:
            del self.ring[hash_key]
            self.sorted_keys.remove(hash_key)

    def get_node(self, key_hash):
        """Get node for key hash (first ring point clockwise)"""
        if not self.sorted_keys:
            return None
        idx = bisect.bisect_left(self.sorted_keys, key_hash)
        if idx == len(self.sorted_keys):
            idx = 0  # wrap around to the first node
        return self.ring[self.sorted_keys[idx]]

    def get_next_node(self, key_hash):
        """Return (position, node) of the next ring point strictly after key_hash"""
        idx = bisect.bisect_right(self.sorted_keys, key_hash)
        if idx == len(self.sorted_keys):
            idx = 0
        pos = self.sorted_keys[idx]
        return pos, self.ring[pos]

    def hash(self, value):
        """Hash a string to a position on the ring"""
        return int(hashlib.md5(value.encode()).hexdigest(), 16)
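A quick usage example showing how virtual nodes and lookups fit together; the node ids and addresses are made up.

ring = ConsistentHashRing()
nodes = [{'id': 'node-1', 'address': '10.0.0.1:8080'},
         {'id': 'node-2', 'address': '10.0.0.2:8080'},
         {'id': 'node-3', 'address': '10.0.0.3:8080'}]

# 100 virtual nodes per physical node for smoother key distribution
for node in nodes:
    for i in range(100):
        ring.add_node(f"{node['id']}:{i}", node)

key_hash = ring.hash("user:123")
owner = ring.get_node(key_hash)
print(owner['id'])  # the physical node responsible for "user:123"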
Compaction Strategy
Challenge: Manage disk space efficiently
Solution:
- Leveled Compaction: Organize SSTables in levels
- Merge Strategy: Merge SSTables at same level
- Size Limits: Limit SSTable size per level
Implementation:
import os

class CompactionManager:
    def __init__(self, disk_storage):
        self.disk_storage = disk_storage
        self.level_sizes = [10, 100, 1000, 10000]  # max SSTables per level

    def compact(self):
        """Compact SSTables level by level"""
        # Group SSTables by level (assumed helper)
        levels = self.group_by_level()
        # Compact each level that has grown too large
        for level, sstables in enumerate(levels):
            if len(sstables) > self.level_sizes[level]:
                self.compact_level(level, sstables)

    def compact_level(self, level, sstables):
        """Merge the SSTables at one level into the next level"""
        # Merge oldest-to-newest so newer values overwrite older duplicates
        merged_data = {}
        for sstable in sorted(sstables, key=lambda s: s['id']):
            merged_data.update(self.read_sstable(sstable))  # assumed helper
        # Write the merged SSTable first, then remove the old files,
        # so a crash mid-compaction never loses data
        self.disk_storage.write_sstable(merged_data)
        for sstable in sstables:
            os.remove(sstable['path'])
Scalability Considerations
Horizontal Scaling
Adding Nodes:
- Add node to consistent hash ring
- Identify keys to move
- Move keys to new node
- Update routing table (see the key-movement sketch after these lists)
Removing Nodes:
- Identify keys on node
- Move keys to other nodes
- Remove node from ring
- Update routing table
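To make "identify keys to move" concrete, here is a small sketch that compares ring ownership before and after a node joins, reusing the ConsistentHashRing from the Deep Dive; the helper name and key sample are illustrative.

def keys_to_move_on_join(ring, existing_keys, new_node, virtual_nodes=100):
    """Return (key, old_owner, new_owner) for keys whose owner changes
    when new_node joins the ring. Illustrative helper, not the post's API."""
    # Record current ownership for every key we track
    before = {key: ring.get_node(ring.hash(key))['id'] for key in existing_keys}
    # Add the new node's virtual nodes to the ring (mutates the ring)
    for i in range(virtual_nodes):
        ring.add_node(f"{new_node['id']}:{i}", new_node)
    # Any key whose owner changed must be streamed to the new node
    moved = []
    for key in existing_keys:
        new_owner = ring.get_node(ring.hash(key))['id']
        if new_owner != before[key]:
            moved.append((key, before[key], new_owner))
    return moved

# With consistent hashing only roughly 1/N of the keys move when the N-th
# node joins, instead of nearly all keys under naive modulo sharding.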
Performance Optimization
Single Node:
- Memory: Use memory efficiently
- Disk I/O: Minimize disk seeks
- Compression: Compress values
- Batch Operations: Batch writes
Multi-Node:
- Load Balancing: Distribute load evenly
- Caching: Cache frequently accessed data
- Network Optimization: Minimize network hops
- Parallel Processing: Process requests in parallel
Security Considerations
Data Security
- Encryption: Encrypt data at rest
- Access Control: Control access to nodes
- Network Security: Secure node communication
- Audit Logging: Log all operations
Monitoring & Observability
Key Metrics
Single Node:
- Memory usage
- Disk usage
- Read/write latency
- Cache hit rate
- SSTable count
Multi-Node:
- Node health
- Replication lag
- Shard distribution
- Request routing
- Rebalancing status
Trade-offs and Optimizations
Trade-offs
1. Consistency: Strong vs Eventual
- Strong: Higher latency, better consistency
- Eventual: Lower latency, but reads may briefly return stale data
- Decision: Configurable per operation
2. Replication: Synchronous vs Asynchronous
- Synchronous: Better consistency, higher latency
- Asynchronous: Lower latency, but a small window of un-replicated writes can be lost on failure
- Decision: Synchronous for critical data
3. Compaction: Aggressive vs Lazy
- Aggressive: Less disk space, higher CPU
- Lazy: More disk space, lower CPU
- Decision: Balanced approach
Optimizations
1. Compression
- Compress values before storage
- Reduce disk usage
- Trade CPU for storage
2. Bloom Filters
- Use bloom filters for SSTables
- Reduce unnecessary disk reads
- Improve read performance (see the Bloom filter sketch after this list)
3. Read-Ahead
- Prefetch data for range queries
- Improve sequential read performance
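Here is the Bloom filter sketch referenced above, showing how a per-SSTable filter short-circuits reads; the filter size and double-hashing scheme are illustrative choices, not the post's prescription.

import hashlib

class BloomFilter:
    """Fixed-size Bloom filter: a 'no' answer is definite, a 'yes' may be a false positive."""
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two independent hash values
        digest = hashlib.md5(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], 'little')
        h2 = int.from_bytes(digest[8:], 'little') | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

# Build one filter per SSTable at flush/compaction time; on reads, skip any
# SSTable whose filter says the key is definitely absent.
def read_with_bloom(sstables, filters, key, read_from_sstable):
    for sstable in reversed(sstables):            # newest first
        if not filters[sstable['id']].might_contain(key):
            continue                              # definitely not in this file
        value = read_from_sstable(sstable, key)
        if value is not None:
            return value
    return None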
What Interviewers Look For
Storage Systems Knowledge
- Storage Architecture Understanding
- SSTable format design
- Write buffer (memtable) usage
- Compaction strategies
- Red Flags: No understanding of storage formats, poor design
- Indexing Strategies
- Sparse index design
- Bloom filter usage
- Efficient key lookup
- Red Flags: Inefficient indexing, no optimization
- Memory Management
- Efficient memory usage
- Write buffer management
- Cache strategies
- Red Flags: Memory leaks, inefficient usage
Distributed Systems Skills
- Sharding Design
- Consistent hashing understanding
- Virtual nodes usage
- Rebalancing strategies
- Red Flags: Poor sharding, no rebalancing
- Replication
- Primary-replica design
- Consistency models
- Failure handling
- Red Flags: No replication strategy, poor consistency
- Consistency Trade-offs
- Strong vs eventual consistency
- Quorum-based reads/writes
- CAP theorem understanding
- Red Flags: No consistency understanding, wrong choices
Problem-Solving Approach
- Scalability Thinking
- Horizontal scaling design
- Bottleneck identification
- Performance optimization
- Red Flags: No scalability thinking, vertical scaling only
- Edge Cases
- Node failures
- Network partitions
- Data rebalancing
- Red Flags: Ignoring failures, no handling
- Trade-off Analysis
- Consistency vs availability
- Performance vs durability
- Red Flags: No trade-off discussion, dogmatic choices
Code Quality
- Implementation Correctness
- Correct storage logic
- Proper compaction
- Red Flags: Bugs, incorrect logic
- Error Handling
- Failure scenarios
- Recovery mechanisms
- Red Flags: No error handling, no recovery
Meta-Specific Focus
- Storage Systems Expertise
- Deep understanding of storage
- Database internals knowledge
- Key: Show systems expertise
- Distributed Systems Knowledge
- Sharding and replication
- Consistency models
- Key: Demonstrate distributed systems understanding
Summary
Designing a key-value store for a local system and scaling it horizontally requires careful consideration of:
Single Node:
- Memory Management: Efficient use of 32GB RAM
- Disk Storage: Efficient storage on 2TB disk
- Write Buffer: Fast in-memory writes
- SSTable Format: Immutable sorted files
- Compaction: Merge SSTables periodically
- Indexing: Fast key lookup
Horizontal Scaling:
- Sharding: Consistent hashing for distribution
- Replication: Primary-replica replication
- Consistency: Configurable consistency levels
- Rebalancing: Handle node addition/removal
- Failure Handling: Graceful node failure handling
- Load Balancing: Distribute requests evenly
Key architectural decisions:
- SSTable Format for efficient disk storage
- Write Buffer for fast writes
- Consistent Hashing for sharding
- Primary-Replica Replication for availability
- Quorum-Based Writes for consistency
- Leveled Compaction for disk management
- Virtual Nodes for even distribution
- Horizontal Scaling by adding nodes
The resulting design targets 100,000 operations per second per node, stores 1TB+ of data per node, scales roughly linearly as nodes are added, aims for 99.9% availability, and serves cached reads in under a millisecond.