Introduction

Apache Storm is a distributed real-time computation system for processing unbounded streams of data. It’s designed for high-throughput, low-latency stream processing. Understanding Storm is essential for system design interviews involving real-time analytics and event processing.

This guide covers:

  • Storm Fundamentals: Topologies, spouts, bolts, and tuples
  • Stream Grouping: Shuffle, fields, all, and global grouping
  • Reliability: Guaranteed message processing and acking
  • Scaling: Parallelism and resource allocation
  • Best Practices: Topology design, error handling, and performance

What is Apache Storm?

Apache Storm is a stream processing framework that:

  • Real-Time Processing: Processes streams in real-time
  • Distributed: Runs across multiple nodes
  • Fault Tolerant: Automatic failure recovery
  • Scalable: Horizontal scaling
  • Guaranteed Processing: At-least-once or exactly-once semantics

Key Concepts

Topology: Graph of computation (DAG)

Spout: Source of streams (reads from external systems)

Bolt: Processing unit (transforms, filters, aggregates)

Tuple: Unit of data flowing through topology

Stream: Sequence of tuples

Stream Grouping: How tuples are distributed to bolts

Architecture

High-Level Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Data      │────▶│   Data      │────▶│   Data      │
│   Source    │     │   Source    │     │   Source    │
│  (Kafka)    │     │  (Kinesis)  │     │  (Files)    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                    │                    │
       └────────────────────┴────────────────────┘
                            │
                            ▼
              ┌─────────────────────────┐
              │   Storm Nimbus          │
              │   (Master/Coordinator)  │
              └──────┬──────────────────┘
                     │
                     ▼
              ┌─────────────────────────┐
              │   Supervisor Nodes      │
              │   (Workers)             │
              │                         │
              │  ┌──────────┐           │
              │  │ Spouts   │           │
              │  │ (Input)  │           │
              │  └────┬─────┘           │
              │       │                 │
              │  ┌────┴─────┐           │
              │  │  Bolts   │           │
              │  │(Process) │           │
              │  └──────────┘           │
              └──────┬──────────────────┘
                     │
       ┌─────────────┴─────────────┐
       │                           │
┌──────▼──────┐           ┌───────▼──────┐
│   Data      │           │   Data       │
│   Sink      │           │   Sink       │
│ (Database)  │           │  (Kafka)     │
└─────────────┘           └─────────────┘

Explanation:

  • Data Sources: Systems that produce streaming data (e.g., Kafka, Kinesis, file systems, databases).
  • Storm Nimbus: Master node that coordinates the cluster, submits topologies, and monitors execution.
  • Supervisor Nodes: Worker nodes that execute tasks. Each supervisor runs worker processes that execute spouts and bolts.
  • Spouts: Sources of data streams that read from external systems and emit tuples.
  • Bolts: Processing units that consume tuples, perform transformations, and emit new tuples.
  • Data Sinks: Systems that consume processed data (e.g., databases, message queues, file systems).

Core Architecture

┌─────────────────────────────────────────────────────────┐
│              Storm Cluster                               │
│                                                           │
│  ┌──────────────────────────────────────────────────┐    │
│  │         Nimbus (Master)                          │    │
│  │  (Topology Submission, Coordination)               │    │
│  └──────────────────────────────────────────────────┘    │
│                          │                                │
│  ┌──────────────────────────────────────────────────┐    │
│  │         ZooKeeper                                 │    │
│  │  (Coordination, State Management)                  │    │
│  └──────────────────────────────────────────────────┘    │
│                          │                                │
│  ┌──────────────────────────────────────────────────┐    │
│  │         Supervisor Nodes                          │    │
│  │  (Worker Processes, Task Execution)                 │    │
│  └──────────────────────────────────────────────────┘    │
└───────────────────────────────────────────────────────────┘

Topology

Basic Topology

Java:

TopologyBuilder builder = new TopologyBuilder();

// Spout
builder.setSpout("spout", new RandomSentenceSpout(), 1);

// Bolt
builder.setBolt("split", new SplitSentenceBolt(), 2)
       .shuffleGrouping("spout");

builder.setBolt("count", new WordCountBolt(), 2)
       .fieldsGrouping("split", new Fields("word"));

Config config = new Config();
StormSubmitter.submitTopology("word-count", config, builder.createTopology());

Python:

from streamparse import Topology

class WordCountTopology(Topology):
    spout = RandomSentenceSpout.spec(par=1)
    split = SplitSentenceBolt.spec(par=2, inputs={spout: ShuffleGrouping()})
    count = WordCountBolt.spec(par=2, inputs={split: FieldsGrouping('word')})

Spouts

Basic Spout

Java:

public class RandomSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random random;
    
    @Override
    public void open(Map conf, TopologyContext context, 
                     SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
    }
    
    @Override
    public void nextTuple() {
        String[] sentences = {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away"
        };
        String sentence = sentences[random.nextInt(sentences.length)];
        collector.emit(new Values(sentence));
    }
    
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}

Kafka Spout

Java:

SpoutConfig spoutConfig = new SpoutConfig(
    new ZkHosts("localhost:2181"),
    "topic-name",
    "/kafka-storm",
    "spout-id"
);

KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
builder.setSpout("kafka-spout", kafkaSpout, 1);

Bolts

Basic Bolt

Java:

public class SplitSentenceBolt extends BaseRichBolt {
    private OutputCollector collector;
    
    @Override
    public void prepare(Map conf, TopologyContext context, 
                        OutputCollector collector) {
        this.collector = collector;
    }
    
    @Override
    public void execute(Tuple tuple) {
        String sentence = tuple.getStringByField("sentence");
        String[] words = sentence.split(" ");
        
        for (String word : words) {
            collector.emit(new Values(word));
        }
        
        collector.ack(tuple);
    }
    
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

Aggregation Bolt

Java:

public class WordCountBolt extends BaseRichBolt {
    private OutputCollector collector;
    private Map<String, Integer> counts = new HashMap<>();
    
    @Override
    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        counts.put(word, counts.getOrDefault(word, 0) + 1);
        
        collector.emit(new Values(word, counts.get(word)));
        collector.ack(tuple);
    }
    
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

Stream Grouping

Shuffle Grouping

Random Distribution:

builder.setBolt("bolt", new MyBolt(), 2)
       .shuffleGrouping("spout");

Fields Grouping

Group by Field:

builder.setBolt("bolt", new MyBolt(), 2)
       .fieldsGrouping("spout", new Fields("user_id"));

All Grouping

Broadcast to All:

builder.setBolt("bolt", new MyBolt(), 2)
       .allGrouping("spout");

Global Grouping

Single Target:

builder.setBolt("bolt", new MyBolt(), 2)
       .globalGrouping("spout");

Reliability

Acknowledgment

Ack Tuple:

@Override
public void execute(Tuple tuple) {
    // Process tuple
    processTuple(tuple);
    
    // Acknowledge
    collector.ack(tuple);
}

Fail Tuple:

@Override
public void execute(Tuple tuple) {
    try {
        processTuple(tuple);
        collector.ack(tuple);
    } catch (Exception e) {
        collector.fail(tuple);
    }
}

Anchoring

Anchor Tuple:

@Override
public void execute(Tuple tuple) {
    String word = tuple.getStringByField("word");
    
    // Emit with anchor
    collector.emit(tuple, new Values(word.toUpperCase()));
    collector.ack(tuple);
}

Performance Optimization

Parallelism

Set Parallelism:

builder.setSpout("spout", new MySpout(), 4);  // 4 executors
builder.setBolt("bolt", new MyBolt(), 8);     // 8 executors

Resource Allocation

Set Resources:

Config config = new Config();
config.setNumWorkers(4);
config.setMaxSpoutPending(1000);

Backpressure

Handle Backpressure:

if (collector.getPendingCount() > threshold) {
    // Slow down or buffer
}

Best Practices

1. Topology Design

  • Keep topologies simple
  • Use appropriate parallelism
  • Minimize network hops
  • Design for failure

2. Reliability

  • Implement proper acking
  • Handle failures gracefully
  • Use anchoring for tracking
  • Monitor tuple processing

3. Performance

  • Tune parallelism
  • Optimize stream grouping
  • Minimize serialization
  • Monitor backpressure

4. Error Handling

  • Catch exceptions
  • Fail tuples appropriately
  • Implement retry logic
  • Log errors

What Interviewers Look For

Stream Processing Understanding

  1. Storm Concepts
    • Understanding of topologies, spouts, bolts
    • Stream grouping
    • Reliability mechanisms
    • Red Flags: No Storm understanding, wrong concepts, no reliability
  2. Real-Time Processing
    • Low-latency processing
    • Throughput optimization
    • Backpressure handling
    • Red Flags: No latency awareness, poor throughput, no backpressure
  3. Fault Tolerance
    • Acknowledgment mechanisms
    • Failure recovery
    • Exactly-once semantics
    • Red Flags: No acking, poor recovery, no semantics

Problem-Solving Approach

  1. Topology Design
    • Spout and bolt organization
    • Stream grouping selection
    • Parallelism tuning
    • Red Flags: Poor organization, wrong grouping, no parallelism
  2. Reliability Design
    • Acknowledgment strategy
    • Failure handling
    • Exactly-once processing
    • Red Flags: No acking, poor handling, no exactly-once

System Design Skills

  1. Stream Processing Architecture
    • Storm cluster design
    • Topology organization
    • Resource allocation
    • Red Flags: No architecture, poor organization, no resources
  2. Scalability
    • Horizontal scaling
    • Parallelism tuning
    • Performance optimization
    • Red Flags: No scaling, poor parallelism, no optimization

Communication Skills

  1. Clear Explanation
    • Explains Storm concepts
    • Discusses trade-offs
    • Justifies design decisions
    • Red Flags: Unclear explanations, no justification, confusing

Meta-Specific Focus

  1. Stream Processing Expertise
    • Understanding of real-time processing
    • Storm mastery
    • Performance optimization
    • Key: Demonstrate stream processing expertise
  2. System Design Skills
    • Can design stream processing systems
    • Understands real-time challenges
    • Makes informed trade-offs
    • Key: Show practical stream processing design skills

Summary

Apache Storm Key Points:

  • Real-Time Processing: Processes streams in real-time
  • Topology-Based: Graph of spouts and bolts
  • Fault Tolerant: Automatic failure recovery
  • Guaranteed Processing: At-least-once or exactly-once
  • Scalable: Horizontal scaling with parallelism

Common Use Cases:

  • Real-time analytics
  • Event processing
  • Stream aggregation
  • ETL pipelines
  • Fraud detection
  • Alerting systems

Best Practices:

  • Design simple topologies
  • Use appropriate stream grouping
  • Implement proper acking
  • Tune parallelism
  • Handle backpressure
  • Monitor performance
  • Design for failure

Apache Storm is a powerful framework for building real-time stream processing applications with guaranteed message processing.