Introduction

MongoDB is a popular NoSQL document database that stores data in flexible, JSON-like documents. It’s designed for scalability, performance, and developer productivity. Understanding MongoDB is essential for system design interviews and building modern applications that require flexible schemas and horizontal scaling.

This guide covers:

  • MongoDB Fundamentals: Document model, collections, and BSON
  • Sharding: Horizontal scaling with automatic sharding
  • Replication: Replica sets for high availability
  • Indexing: Index types and optimization
  • Aggregation Pipeline: Complex data processing
  • Best Practices: Schema design, performance, and security

What is MongoDB?

MongoDB is a NoSQL document database that:

  • Document Storage: Stores data as BSON documents (JSON-like)
  • Flexible Schema: Schema-less, easy to evolve
  • Horizontal Scaling: Built-in sharding
  • High Availability: Replica sets with automatic failover
  • Rich Query Language: Powerful querying and aggregation

Key Concepts

Database: Container for collections

Collection: Group of documents (similar to table)

Document: Record stored as BSON (similar to row)

Field: Key-value pair in a document (similar to column)

BSON: Binary JSON format

Replica Set: Group of MongoDB servers with replication

Shard: Partition of data in a sharded cluster

Index: Data structure for fast lookups

Architecture

High-Level Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │────▶│   Client    │────▶│   Client    │
│ Application │     │ Application │     │ Application │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                    │                    │
       └────────────────────┴────────────────────┘
                            │
                            ▼
              ┌─────────────────────────┐
              │   MongoDB Driver        │
              │   (Connection Pool)    │
              └──────┬──────────────────┘
                     │
                     ▼
              ┌─────────────────────────┐
              │   MongoDB Cluster      │
              │                         │
              │  ┌──────────┐           │
              │  │  Primary │           │
              │  │  Node    │           │
              │  └────┬─────┘           │
              │       │                 │
              │  ┌────┴─────┐           │
              │  │ Secondary│           │
              │  │  Nodes   │           │
              │  └──────────┘           │
              │                         │
              │  ┌───────────────────┐  │
              │  │  Replica Set      │  │
              │  │  (Collections)    │  │
              │  └───────────────────┘  │
              └─────────────────────────┘

Explanation:

  • Client Applications: Applications that connect to MongoDB to store and retrieve documents (e.g., web applications, microservices).
  • MongoDB Driver: Client library that manages connections, connection pooling, and query execution.
  • MongoDB Cluster: A collection of MongoDB nodes working together. Can be a standalone server, replica set, or sharded cluster.
  • Primary Node: The node in a replica set that handles all write operations.
  • Secondary Nodes: Nodes that replicate data from the primary and can handle read operations.
  • Replica Set (Collections): Groups of documents stored in databases. Collections are distributed across nodes in a sharded cluster.

Core Architecture

┌─────────────────────────────────────────────────────────┐
│              MongoDB Application                        │
│                                                          │
│  ┌──────────────────────────────────────────────────┐   │
│  │         MongoDB Driver                          │   │
│  │  (Connection Pooling, Query Execution)          │   │
│  └──────────────────────────────────────────────────┘   │
│                          │                               │
│  ┌──────────────────────────────────────────────────┐   │
│  │         MongoDB Server (mongod)                  │   │
│  │                                                  │   │
│  │  ┌──────────────┐  ┌──────────────┐            │   │
│  │  │   Storage    │  │   Indexes    │            │   │
│  │  │   Engine      │  │              │            │   │
│  │  └──────────────┘  └──────────────┘            │   │
│  │                                                  │   │
│  │  ┌──────────────┐  ┌──────────────┐            │   │
│  │  │   Query      │  │   Aggregation│            │   │
│  │  │   Engine     │  │   Pipeline   │            │   │
│  │  └──────────────┘  └──────────────┘            │   │
│  └──────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────┘

Document Model

Document Structure

Example Document:

{
  "_id": ObjectId("507f1f77bcf86cd799439011"),
  "name": "John Doe",
  "email": "john@example.com",
  "age": 30,
  "address": {
    "street": "123 Main St",
    "city": "New York",
    "zip": "10001"
  },
  "tags": ["developer", "mongodb"],
  "created_at": ISODate("2024-01-01T00:00:00Z")
}

Key Features:

  • Embedded Documents: Nested objects
  • Arrays: Lists of values
  • Flexible Schema: Different documents can have different fields
  • _id Field: Unique identifier (auto-generated if not provided)

Collections

Collection Types:

  • Regular Collections: Standard collections
  • Capped Collections: Fixed-size collections (FIFO)
  • Time-Series Collections: Optimized for time-series data

Example:

// Create collection
db.users.insertOne({
  name: "John Doe",
  email: "john@example.com"
});

// Query collection
db.users.find({ email: "john@example.com" });

Sharding

Sharded Cluster Architecture

┌─────────────────────────────────────────────────────────┐
│                    Application                           │
└────────────────────┬────────────────────────────────────┘
                     │
        ┌────────────▼────────────┐
        │    Mongos (Router)      │
        │  (Query Routing)         │
        └────────────┬────────────┘
                     │
        ┌────────────┴────────────┐
        │                         │
┌───────▼────────┐      ┌─────────▼──────────┐
│  Config Server │      │   Config Server    │
│  (Replica Set) │      │   (Replica Set)     │
└────────────────┘      └────────────────────┘
        │                         │
        └────────────┬────────────┘
                     │
        ┌────────────┴────────────┐
        │                         │
┌───────▼────────┐      ┌─────────▼──────────┐
│  Shard 1       │      │   Shard 2          │
│  (Replica Set) │      │   (Replica Set)     │
└────────────────┘      └────────────────────┘

Shard Key

Shard Key Selection:

// Shard by user_id
sh.shardCollection("mydb.users", { user_id: 1 });

// Compound shard key
sh.shardCollection("mydb.orders", { user_id: 1, order_date: 1 });

Shard Key Requirements:

  • High cardinality
  • Even distribution
  • Supports common queries
  • Avoids hotspots

Sharding Strategies:

1. Range-Based Sharding:

// Shard by date range
Shard 1: 2024-01-01 to 2024-06-30
Shard 2: 2024-07-01 to 2024-12-31

2. Hash-Based Sharding:

// Shard by hash of user_id
sh.shardCollection("mydb.users", { user_id: "hashed" });

Sharding Operations

Enable Sharding:

// Enable sharding on database
sh.enableSharding("mydb");

// Shard collection
sh.shardCollection("mydb.users", { user_id: 1 });

Balancing:

  • Automatic chunk migration
  • Balancer process
  • Configurable thresholds

Replication

Replica Set Architecture

Primary (Write)
    │
    ├──→ Secondary 1 (Read)
    ├──→ Secondary 2 (Read)
    └──→ Arbiter (Voting Only)

Replica Set Configuration

Initialize Replica Set:

// On primary
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongodb1:27017" },
    { _id: 1, host: "mongodb2:27017" },
    { _id: 2, host: "mongodb3:27017", arbiterOnly: true }
  ]
});

Read Preferences:

// Read from primary (default)
db.collection.find().readPref("primary");

// Read from secondary
db.collection.find().readPref("secondary");

// Read from nearest
db.collection.find().readPref("nearest");

Write Concern:

// Write to primary only
db.collection.insertOne({ name: "John" }, { w: 1 });

// Write to primary and wait for 2 secondaries
db.collection.insertOne({ name: "John" }, { w: 3 });

// Write with timeout
db.collection.insertOne({ name: "John" }, { w: "majority", wtimeout: 5000 });

Failover

Automatic Failover:

  • Heartbeat mechanism
  • Election process
  • Primary promotion
  • Client reconnection

Election Process:

  1. Detect primary failure
  2. Secondary nodes vote
  3. Elect new primary
  4. Update configuration
  5. Clients reconnect

Indexing

Index Types

Single Field Index:

// Create index
db.users.createIndex({ email: 1 });

// Query uses index
db.users.find({ email: "john@example.com" });

Compound Index:

// Create compound index
db.users.createIndex({ name: 1, email: 1 });

// Query uses index
db.users.find({ name: "John", email: "john@example.com" });

Multikey Index:

// Index on array field
db.users.createIndex({ tags: 1 });

// Query uses index
db.users.find({ tags: "developer" });

Text Index:

// Create text index
db.articles.createIndex({ title: "text", content: "text" });

// Text search
db.articles.find({ $text: { $search: "mongodb guide" } });

Geospatial Index:

// Create 2dsphere index
db.places.createIndex({ location: "2dsphere" });

// Geospatial query
db.places.find({
  location: {
    $near: {
      $geometry: { type: "Point", coordinates: [-73.97, 40.77] },
      $maxDistance: 1000
    }
  }
});

Index Optimization

Index Selection:

  • MongoDB automatically selects best index
  • Use explain() to see index usage
  • Create indexes for common queries

Index Best Practices:

  • Index frequently queried fields
  • Use compound indexes for multi-field queries
  • Avoid over-indexing (slows writes)
  • Monitor index usage

Example:

// Explain query
db.users.find({ email: "john@example.com" }).explain("executionStats");

// Check index usage
db.users.getIndexes();

Aggregation Pipeline

Pipeline Stages

$match:

db.orders.aggregate([
  { $match: { status: "completed" } }
]);

$group:

db.orders.aggregate([
  { $group: {
    _id: "$user_id",
    total: { $sum: "$amount" },
    count: { $sum: 1 }
  }}
]);

$project:

db.users.aggregate([
  { $project: {
    name: 1,
    email: 1,
    age: 1
  }}
]);

$sort:

db.orders.aggregate([
  { $sort: { created_at: -1 } }
]);

$limit:

db.orders.aggregate([
  { $limit: 10 }
]);

Complex Aggregation

Example:

db.orders.aggregate([
  // Match completed orders
  { $match: { status: "completed" } },
  
  // Group by user
  { $group: {
    _id: "$user_id",
    total_amount: { $sum: "$amount" },
    order_count: { $sum: 1 }
  }},
  
  // Join with users collection
  { $lookup: {
    from: "users",
    localField: "_id",
    foreignField: "_id",
    as: "user"
  }},
  
  // Unwind user array
  { $unwind: "$user" },
  
  // Project fields
  { $project: {
    user_name: "$user.name",
    total_amount: 1,
    order_count: 1
  }},
  
  // Sort by total amount
  { $sort: { total_amount: -1 } },
  
  // Limit to top 10
  { $limit: 10 }
]);

Transactions

Multi-Document Transactions

ACID Transactions:

const session = client.startSession();

try {
  session.startTransaction();
  
  await accounts.updateOne(
    { _id: account1 },
    { $inc: { balance: -100 } },
    { session }
  );
  
  await accounts.updateOne(
    { _id: account2 },
    { $inc: { balance: 100 } },
    { session }
  );
  
  await session.commitTransaction();
} catch (error) {
  await session.abortTransaction();
  throw error;
} finally {
  session.endSession();
}

Transaction Limitations:

  • Replica sets (not standalone)
  • Sharded clusters (limited)
  • Performance overhead
  • Timeout limits

Schema Design

Embedding vs Referencing

Embedding (Denormalization):

// Embed comments in post
{
  _id: ObjectId("..."),
  title: "Post Title",
  content: "Post content",
  comments: [
    { user: "John", text: "Great post!" },
    { user: "Jane", text: "Thanks!" }
  ]
}

When to Embed:

  • One-to-few relationships
  • Frequently accessed together
  • Small documents
  • No independent access needed

Referencing (Normalization):

// Reference comments
{
  _id: ObjectId("..."),
  title: "Post Title",
  content: "Post content",
  comment_ids: [ObjectId("..."), ObjectId("...")]
}

When to Reference:

  • One-to-many relationships
  • Large documents
  • Independent access needed
  • Frequently updated

Schema Patterns

1. One-to-Many:

// Embed for small arrays
{
  user_id: 1,
  orders: [order1, order2, order3]
}

// Reference for large arrays
{
  user_id: 1,
  order_ids: [id1, id2, id3, ...]
}

2. Many-to-Many:

// Reference both sides
{
  user_id: 1,
  tag_ids: [tag1, tag2, tag3]
}

{
  tag_id: 1,
  user_ids: [user1, user2, user3]
}

3. Tree Structure:

// Parent references
{
  _id: ObjectId("..."),
  name: "Node",
  parent_id: ObjectId("...")
}

// Child references
{
  _id: ObjectId("..."),
  name: "Node",
  children: [ObjectId("..."), ObjectId("...")]
}

Performance Characteristics

Maximum Read & Write Throughput

Single Node (Replica Set Primary):

  • Max Read Throughput:
    • Simple queries (indexed): 10K-50K queries/sec
    • Complex queries (aggregations): 1K-10K queries/sec
    • With read replicas: 10K-50K queries/sec per replica (linear scaling)
  • Max Write Throughput:
    • Simple inserts: 5K-20K writes/sec
    • Updates: 3K-15K writes/sec
    • Bulk inserts: 50K-200K documents/sec

Sharded Cluster:

  • Max Read Throughput: 10K-50K queries/sec per shard (linear scaling)
  • Max Write Throughput: 5K-20K writes/sec per shard (linear scaling)
  • Example: 10-shard cluster can handle 50K-200K queries/sec and 50K-200K writes/sec total

Factors Affecting Throughput:

  • Hardware (CPU, RAM, disk I/O, SSD vs HDD)
  • Document size (larger documents = lower throughput)
  • Index usage and complexity
  • Write concern level (w:1 vs w:majority)
  • Shard key distribution
  • Replication lag
  • Connection pooling
  • WiredTiger cache size

Optimized Configuration:

  • Max Read Throughput: 50K-100K queries/sec (with proper indexing, sharding, and read replicas)
  • Max Write Throughput: 20K-50K writes/sec (with optimized sharding and write concern)

Horizontal Scaling:

  • Read Scaling: Add read replicas or shards for linear read scaling
  • Write Scaling: Add shards for linear write scaling

Performance Optimization

Query Optimization

Use Indexes:

// Create index
db.users.createIndex({ email: 1 });

// Query uses index
db.users.find({ email: "john@example.com" });

Projection:

// Only return needed fields
db.users.find({ email: "john@example.com" }, { name: 1, email: 1 });

Limit Results:

// Limit result set
db.users.find().limit(10);

Avoid Large Arrays:

// Use $slice for large arrays
db.posts.find({}, { comments: { $slice: 10 } });

Write Optimization

Bulk Operations:

// Bulk insert
db.users.insertMany([
  { name: "John", email: "john@example.com" },
  { name: "Jane", email: "jane@example.com" }
]);

Ordered vs Unordered:

// Unordered (faster, continues on error)
db.users.insertMany([...], { ordered: false });

Write Concern:

// Lower write concern for performance
db.users.insertOne({ name: "John" }, { w: 1 });

Security

Authentication

Create User:

use admin;
db.createUser({
  user: "admin",
  pwd: "password",
  roles: ["root"]
});

Enable Authentication:

// mongod.conf
security:
  authorization: enabled

Authorization

Roles:

  • read: Read-only access
  • readWrite: Read and write access
  • dbAdmin: Database administration
  • userAdmin: User management
  • clusterAdmin: Cluster administration

Example:

db.createUser({
  user: "app_user",
  pwd: "password",
  roles: [
    { role: "readWrite", db: "mydb" }
  ]
});

Encryption

Data at Rest:

  • Encryption at storage level
  • Encrypted filesystem

Data in Transit:

  • TLS/SSL connections
  • Encrypted replication

Best Practices

Schema Design

  1. Choose Embedding vs Referencing:
    • Embed for small, frequently accessed data
    • Reference for large, independent data
  2. Use Appropriate Data Types:
    • ObjectId for references
    • ISODate for dates
    • Number for numeric values
  3. Index Strategically:
    • Index frequently queried fields
    • Use compound indexes for multi-field queries

Performance

  1. Use Indexes:
    • Create indexes for common queries
    • Monitor index usage
    • Remove unused indexes
  2. Optimize Queries:
    • Use projection to limit fields
    • Use limit() to reduce results
    • Avoid large array operations
  3. Sharding:
    • Choose good shard key
    • Monitor shard distribution
    • Rebalance when needed

High Availability

  1. Replica Sets:
    • Minimum 3 nodes
    • Use odd number of nodes
    • Configure read preferences
  2. Backup:
    • Regular backups
    • Test restore procedures
    • Monitor backup status

What Interviewers Look For

NoSQL Understanding

  1. Document Model
    • Understanding of document structure
    • Embedding vs referencing
    • Schema flexibility
    • Red Flags: No document understanding, wrong patterns, rigid schema
  2. Query Language
    • MongoDB query syntax
    • Aggregation pipeline
    • Index usage
    • Red Flags: Poor queries, no aggregation, no indexes
  3. Scalability
    • Sharding strategies
    • Replication setup
    • Horizontal scaling
    • Red Flags: No sharding, single node, no replication

Problem-Solving Approach

  1. Schema Design
    • Embedding vs referencing decisions
    • Index design
    • Performance optimization
    • Red Flags: Poor schema, no indexes, no optimization
  2. Scalability
    • Shard key selection
    • Replication configuration
    • Performance tuning
    • Red Flags: Wrong shard key, no replication, poor performance

System Design Skills

  1. Database Architecture
    • Sharded cluster design
    • Replica set configuration
    • High availability
    • Red Flags: Single node, no sharding, no HA
  2. Trade-off Analysis
    • Consistency vs availability
    • Embedding vs referencing
    • Read vs write optimization
    • Red Flags: No trade-offs, dogmatic choices, no analysis

Communication Skills

  1. Clear Explanation
    • Explains MongoDB concepts
    • Discusses trade-offs
    • Justifies design decisions
    • Red Flags: Unclear explanations, no justification, confusing

Meta-Specific Focus

  1. NoSQL Expertise
    • Deep MongoDB knowledge
    • Document model understanding
    • Scalability patterns
    • Key: Demonstrate NoSQL expertise
  2. System Design Skills
    • Can design MongoDB architecture
    • Understands scaling challenges
    • Makes informed trade-offs
    • Key: Show practical NoSQL design skills

Summary

MongoDB Key Points:

  • Document Database: Flexible JSON-like documents
  • Horizontal Scaling: Built-in sharding
  • High Availability: Replica sets with automatic failover
  • Rich Querying: Powerful query language and aggregation
  • Flexible Schema: Schema-less, easy to evolve
  • Performance: Indexing, query optimization, caching

Common Use Cases:

  • Content management systems
  • Real-time analytics
  • Mobile applications
  • IoT data storage
  • User profiles and sessions
  • Catalog and product data

Best Practices:

  • Choose appropriate embedding vs referencing
  • Index frequently queried fields
  • Use replica sets for HA
  • Shard for horizontal scaling
  • Optimize queries with projection and limits
  • Monitor performance metrics
  • Implement proper security

MongoDB is a powerful NoSQL database that excels at handling flexible schemas, horizontal scaling, and developer productivity.