Introduction
MongoDB is a popular NoSQL document database that stores data in flexible, JSON-like documents. It’s designed for scalability, performance, and developer productivity. Understanding MongoDB is essential for system design interviews and building modern applications that require flexible schemas and horizontal scaling.
This guide covers:
- MongoDB Fundamentals: Document model, collections, and BSON
- Sharding: Horizontal scaling with automatic sharding
- Replication: Replica sets for high availability
- Indexing: Index types and optimization
- Aggregation Pipeline: Complex data processing
- Best Practices: Schema design, performance, and security
What is MongoDB?
MongoDB is a NoSQL document database that:
- Document Storage: Stores data as BSON documents (JSON-like)
- Flexible Schema: Schema-less, easy to evolve
- Horizontal Scaling: Built-in sharding
- High Availability: Replica sets with automatic failover
- Rich Query Language: Powerful querying and aggregation
Key Concepts
Database: Container for collections
Collection: Group of documents (similar to table)
Document: Record stored as BSON (similar to row)
Field: Key-value pair in a document (similar to column)
BSON: Binary JSON format
Replica Set: Group of MongoDB servers with replication
Shard: Partition of data in a sharded cluster
Index: Data structure for fast lookups
Architecture
High-Level Architecture
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Client │────▶│ Client │────▶│ Client │
│ Application │ │ Application │ │ Application │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└────────────────────┴────────────────────┘
│
▼
┌─────────────────────────┐
│ MongoDB Driver │
│ (Connection Pool) │
└──────┬──────────────────┘
│
▼
┌─────────────────────────┐
│ MongoDB Cluster │
│ │
│ ┌──────────┐ │
│ │ Primary │ │
│ │ Node │ │
│ └────┬─────┘ │
│ │ │
│ ┌────┴─────┐ │
│ │ Secondary│ │
│ │ Nodes │ │
│ └──────────┘ │
│ │
│ ┌───────────────────┐ │
│ │ Replica Set │ │
│ │ (Collections) │ │
│ └───────────────────┘ │
└─────────────────────────┘
Explanation:
- Client Applications: Applications that connect to MongoDB to store and retrieve documents (e.g., web applications, microservices).
- MongoDB Driver: Client library that manages connections, connection pooling, and query execution.
- MongoDB Cluster: A collection of MongoDB nodes working together. Can be a standalone server, replica set, or sharded cluster.
- Primary Node: The node in a replica set that handles all write operations.
- Secondary Nodes: Nodes that replicate data from the primary and can handle read operations.
- Replica Set (Collections): Groups of documents stored in databases. Collections are distributed across nodes in a sharded cluster.
Core Architecture
┌─────────────────────────────────────────────────────────┐
│ MongoDB Application │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ MongoDB Driver │ │
│ │ (Connection Pooling, Query Execution) │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ MongoDB Server (mongod) │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Storage │ │ Indexes │ │ │
│ │ │ Engine │ │ │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Query │ │ Aggregation│ │ │
│ │ │ Engine │ │ Pipeline │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └──────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
Document Model
Document Structure
Example Document:
{
"_id": ObjectId("507f1f77bcf86cd799439011"),
"name": "John Doe",
"email": "john@example.com",
"age": 30,
"address": {
"street": "123 Main St",
"city": "New York",
"zip": "10001"
},
"tags": ["developer", "mongodb"],
"created_at": ISODate("2024-01-01T00:00:00Z")
}
Key Features:
- Embedded Documents: Nested objects
- Arrays: Lists of values
- Flexible Schema: Different documents can have different fields
- _id Field: Unique identifier (auto-generated if not provided)
Collections
Collection Types:
- Regular Collections: Standard collections
- Capped Collections: Fixed-size collections (FIFO)
- Time-Series Collections: Optimized for time-series data
Example:
// Create collection
db.users.insertOne({
name: "John Doe",
email: "john@example.com"
});
// Query collection
db.users.find({ email: "john@example.com" });
Sharding
Sharded Cluster Architecture
┌─────────────────────────────────────────────────────────┐
│ Application │
└────────────────────┬────────────────────────────────────┘
│
┌────────────▼────────────┐
│ Mongos (Router) │
│ (Query Routing) │
└────────────┬────────────┘
│
┌────────────┴────────────┐
│ │
┌───────▼────────┐ ┌─────────▼──────────┐
│ Config Server │ │ Config Server │
│ (Replica Set) │ │ (Replica Set) │
└────────────────┘ └────────────────────┘
│ │
└────────────┬────────────┘
│
┌────────────┴────────────┐
│ │
┌───────▼────────┐ ┌─────────▼──────────┐
│ Shard 1 │ │ Shard 2 │
│ (Replica Set) │ │ (Replica Set) │
└────────────────┘ └────────────────────┘
Shard Key
Shard Key Selection:
// Shard by user_id
sh.shardCollection("mydb.users", { user_id: 1 });
// Compound shard key
sh.shardCollection("mydb.orders", { user_id: 1, order_date: 1 });
Shard Key Requirements:
- High cardinality
- Even distribution
- Supports common queries
- Avoids hotspots
Sharding Strategies:
1. Range-Based Sharding:
// Shard by date range
Shard 1: 2024-01-01 to 2024-06-30
Shard 2: 2024-07-01 to 2024-12-31
2. Hash-Based Sharding:
// Shard by hash of user_id
sh.shardCollection("mydb.users", { user_id: "hashed" });
Sharding Operations
Enable Sharding:
// Enable sharding on database
sh.enableSharding("mydb");
// Shard collection
sh.shardCollection("mydb.users", { user_id: 1 });
Balancing:
- Automatic chunk migration
- Balancer process
- Configurable thresholds
Replication
Replica Set Architecture
Primary (Write)
│
├──→ Secondary 1 (Read)
├──→ Secondary 2 (Read)
└──→ Arbiter (Voting Only)
Replica Set Configuration
Initialize Replica Set:
// On primary
rs.initiate({
_id: "rs0",
members: [
{ _id: 0, host: "mongodb1:27017" },
{ _id: 1, host: "mongodb2:27017" },
{ _id: 2, host: "mongodb3:27017", arbiterOnly: true }
]
});
Read Preferences:
// Read from primary (default)
db.collection.find().readPref("primary");
// Read from secondary
db.collection.find().readPref("secondary");
// Read from nearest
db.collection.find().readPref("nearest");
Write Concern:
// Write to primary only
db.collection.insertOne({ name: "John" }, { w: 1 });
// Write to primary and wait for 2 secondaries
db.collection.insertOne({ name: "John" }, { w: 3 });
// Write with timeout
db.collection.insertOne({ name: "John" }, { w: "majority", wtimeout: 5000 });
Failover
Automatic Failover:
- Heartbeat mechanism
- Election process
- Primary promotion
- Client reconnection
Election Process:
- Detect primary failure
- Secondary nodes vote
- Elect new primary
- Update configuration
- Clients reconnect
Indexing
Index Types
Single Field Index:
// Create index
db.users.createIndex({ email: 1 });
// Query uses index
db.users.find({ email: "john@example.com" });
Compound Index:
// Create compound index
db.users.createIndex({ name: 1, email: 1 });
// Query uses index
db.users.find({ name: "John", email: "john@example.com" });
Multikey Index:
// Index on array field
db.users.createIndex({ tags: 1 });
// Query uses index
db.users.find({ tags: "developer" });
Text Index:
// Create text index
db.articles.createIndex({ title: "text", content: "text" });
// Text search
db.articles.find({ $text: { $search: "mongodb guide" } });
Geospatial Index:
// Create 2dsphere index
db.places.createIndex({ location: "2dsphere" });
// Geospatial query
db.places.find({
location: {
$near: {
$geometry: { type: "Point", coordinates: [-73.97, 40.77] },
$maxDistance: 1000
}
}
});
Index Optimization
Index Selection:
- MongoDB automatically selects best index
- Use explain() to see index usage
- Create indexes for common queries
Index Best Practices:
- Index frequently queried fields
- Use compound indexes for multi-field queries
- Avoid over-indexing (slows writes)
- Monitor index usage
Example:
// Explain query
db.users.find({ email: "john@example.com" }).explain("executionStats");
// Check index usage
db.users.getIndexes();
Aggregation Pipeline
Pipeline Stages
$match:
db.orders.aggregate([
{ $match: { status: "completed" } }
]);
$group:
db.orders.aggregate([
{ $group: {
_id: "$user_id",
total: { $sum: "$amount" },
count: { $sum: 1 }
}}
]);
$project:
db.users.aggregate([
{ $project: {
name: 1,
email: 1,
age: 1
}}
]);
$sort:
db.orders.aggregate([
{ $sort: { created_at: -1 } }
]);
$limit:
db.orders.aggregate([
{ $limit: 10 }
]);
Complex Aggregation
Example:
db.orders.aggregate([
// Match completed orders
{ $match: { status: "completed" } },
// Group by user
{ $group: {
_id: "$user_id",
total_amount: { $sum: "$amount" },
order_count: { $sum: 1 }
}},
// Join with users collection
{ $lookup: {
from: "users",
localField: "_id",
foreignField: "_id",
as: "user"
}},
// Unwind user array
{ $unwind: "$user" },
// Project fields
{ $project: {
user_name: "$user.name",
total_amount: 1,
order_count: 1
}},
// Sort by total amount
{ $sort: { total_amount: -1 } },
// Limit to top 10
{ $limit: 10 }
]);
Transactions
Multi-Document Transactions
ACID Transactions:
const session = client.startSession();
try {
session.startTransaction();
await accounts.updateOne(
{ _id: account1 },
{ $inc: { balance: -100 } },
{ session }
);
await accounts.updateOne(
{ _id: account2 },
{ $inc: { balance: 100 } },
{ session }
);
await session.commitTransaction();
} catch (error) {
await session.abortTransaction();
throw error;
} finally {
session.endSession();
}
Transaction Limitations:
- Replica sets (not standalone)
- Sharded clusters (limited)
- Performance overhead
- Timeout limits
Schema Design
Embedding vs Referencing
Embedding (Denormalization):
// Embed comments in post
{
_id: ObjectId("..."),
title: "Post Title",
content: "Post content",
comments: [
{ user: "John", text: "Great post!" },
{ user: "Jane", text: "Thanks!" }
]
}
When to Embed:
- One-to-few relationships
- Frequently accessed together
- Small documents
- No independent access needed
Referencing (Normalization):
// Reference comments
{
_id: ObjectId("..."),
title: "Post Title",
content: "Post content",
comment_ids: [ObjectId("..."), ObjectId("...")]
}
When to Reference:
- One-to-many relationships
- Large documents
- Independent access needed
- Frequently updated
Schema Patterns
1. One-to-Many:
// Embed for small arrays
{
user_id: 1,
orders: [order1, order2, order3]
}
// Reference for large arrays
{
user_id: 1,
order_ids: [id1, id2, id3, ...]
}
2. Many-to-Many:
// Reference both sides
{
user_id: 1,
tag_ids: [tag1, tag2, tag3]
}
{
tag_id: 1,
user_ids: [user1, user2, user3]
}
3. Tree Structure:
// Parent references
{
_id: ObjectId("..."),
name: "Node",
parent_id: ObjectId("...")
}
// Child references
{
_id: ObjectId("..."),
name: "Node",
children: [ObjectId("..."), ObjectId("...")]
}
Performance Characteristics
Maximum Read & Write Throughput
Single Node (Replica Set Primary):
- Max Read Throughput:
- Simple queries (indexed): 10K-50K queries/sec
- Complex queries (aggregations): 1K-10K queries/sec
- With read replicas: 10K-50K queries/sec per replica (linear scaling)
- Max Write Throughput:
- Simple inserts: 5K-20K writes/sec
- Updates: 3K-15K writes/sec
- Bulk inserts: 50K-200K documents/sec
Sharded Cluster:
- Max Read Throughput: 10K-50K queries/sec per shard (linear scaling)
- Max Write Throughput: 5K-20K writes/sec per shard (linear scaling)
- Example: 10-shard cluster can handle 50K-200K queries/sec and 50K-200K writes/sec total
Factors Affecting Throughput:
- Hardware (CPU, RAM, disk I/O, SSD vs HDD)
- Document size (larger documents = lower throughput)
- Index usage and complexity
- Write concern level (w:1 vs w:majority)
- Shard key distribution
- Replication lag
- Connection pooling
- WiredTiger cache size
Optimized Configuration:
- Max Read Throughput: 50K-100K queries/sec (with proper indexing, sharding, and read replicas)
- Max Write Throughput: 20K-50K writes/sec (with optimized sharding and write concern)
Horizontal Scaling:
- Read Scaling: Add read replicas or shards for linear read scaling
- Write Scaling: Add shards for linear write scaling
Performance Optimization
Query Optimization
Use Indexes:
// Create index
db.users.createIndex({ email: 1 });
// Query uses index
db.users.find({ email: "john@example.com" });
Projection:
// Only return needed fields
db.users.find({ email: "john@example.com" }, { name: 1, email: 1 });
Limit Results:
// Limit result set
db.users.find().limit(10);
Avoid Large Arrays:
// Use $slice for large arrays
db.posts.find({}, { comments: { $slice: 10 } });
Write Optimization
Bulk Operations:
// Bulk insert
db.users.insertMany([
{ name: "John", email: "john@example.com" },
{ name: "Jane", email: "jane@example.com" }
]);
Ordered vs Unordered:
// Unordered (faster, continues on error)
db.users.insertMany([...], { ordered: false });
Write Concern:
// Lower write concern for performance
db.users.insertOne({ name: "John" }, { w: 1 });
Security
Authentication
Create User:
use admin;
db.createUser({
user: "admin",
pwd: "password",
roles: ["root"]
});
Enable Authentication:
// mongod.conf
security:
authorization: enabled
Authorization
Roles:
- read: Read-only access
- readWrite: Read and write access
- dbAdmin: Database administration
- userAdmin: User management
- clusterAdmin: Cluster administration
Example:
db.createUser({
user: "app_user",
pwd: "password",
roles: [
{ role: "readWrite", db: "mydb" }
]
});
Encryption
Data at Rest:
- Encryption at storage level
- Encrypted filesystem
Data in Transit:
- TLS/SSL connections
- Encrypted replication
Best Practices
Schema Design
- Choose Embedding vs Referencing:
- Embed for small, frequently accessed data
- Reference for large, independent data
- Use Appropriate Data Types:
- ObjectId for references
- ISODate for dates
- Number for numeric values
- Index Strategically:
- Index frequently queried fields
- Use compound indexes for multi-field queries
Performance
- Use Indexes:
- Create indexes for common queries
- Monitor index usage
- Remove unused indexes
- Optimize Queries:
- Use projection to limit fields
- Use limit() to reduce results
- Avoid large array operations
- Sharding:
- Choose good shard key
- Monitor shard distribution
- Rebalance when needed
High Availability
- Replica Sets:
- Minimum 3 nodes
- Use odd number of nodes
- Configure read preferences
- Backup:
- Regular backups
- Test restore procedures
- Monitor backup status
What Interviewers Look For
NoSQL Understanding
- Document Model
- Understanding of document structure
- Embedding vs referencing
- Schema flexibility
- Red Flags: No document understanding, wrong patterns, rigid schema
- Query Language
- MongoDB query syntax
- Aggregation pipeline
- Index usage
- Red Flags: Poor queries, no aggregation, no indexes
- Scalability
- Sharding strategies
- Replication setup
- Horizontal scaling
- Red Flags: No sharding, single node, no replication
Problem-Solving Approach
- Schema Design
- Embedding vs referencing decisions
- Index design
- Performance optimization
- Red Flags: Poor schema, no indexes, no optimization
- Scalability
- Shard key selection
- Replication configuration
- Performance tuning
- Red Flags: Wrong shard key, no replication, poor performance
System Design Skills
- Database Architecture
- Sharded cluster design
- Replica set configuration
- High availability
- Red Flags: Single node, no sharding, no HA
- Trade-off Analysis
- Consistency vs availability
- Embedding vs referencing
- Read vs write optimization
- Red Flags: No trade-offs, dogmatic choices, no analysis
Communication Skills
- Clear Explanation
- Explains MongoDB concepts
- Discusses trade-offs
- Justifies design decisions
- Red Flags: Unclear explanations, no justification, confusing
Meta-Specific Focus
- NoSQL Expertise
- Deep MongoDB knowledge
- Document model understanding
- Scalability patterns
- Key: Demonstrate NoSQL expertise
- System Design Skills
- Can design MongoDB architecture
- Understands scaling challenges
- Makes informed trade-offs
- Key: Show practical NoSQL design skills
Summary
MongoDB Key Points:
- Document Database: Flexible JSON-like documents
- Horizontal Scaling: Built-in sharding
- High Availability: Replica sets with automatic failover
- Rich Querying: Powerful query language and aggregation
- Flexible Schema: Schema-less, easy to evolve
- Performance: Indexing, query optimization, caching
Common Use Cases:
- Content management systems
- Real-time analytics
- Mobile applications
- IoT data storage
- User profiles and sessions
- Catalog and product data
Best Practices:
- Choose appropriate embedding vs referencing
- Index frequently queried fields
- Use replica sets for HA
- Shard for horizontal scaling
- Optimize queries with projection and limits
- Monitor performance metrics
- Implement proper security
MongoDB is a powerful NoSQL database that excels at handling flexible schemas, horizontal scaling, and developer productivity.