Introduction
Apache Storm is a distributed real-time computation system for processing unbounded streams of data. It’s designed for high-throughput, low-latency stream processing. Understanding Storm is essential for system design interviews involving real-time analytics and event processing.
This guide covers:
- Storm Fundamentals: Topologies, spouts, bolts, and tuples
- Stream Grouping: Shuffle, fields, all, and global grouping
- Reliability: Guaranteed message processing and acking
- Scaling: Parallelism and resource allocation
- Best Practices: Topology design, error handling, and performance
What is Apache Storm?
Apache Storm is a stream processing framework that is:
- Real-Time: Processes streams with low latency as events arrive
- Distributed: Runs across multiple nodes
- Fault Tolerant: Automatically recovers from worker and node failures
- Scalable: Scales horizontally by adding nodes and increasing parallelism
- Reliable: Guarantees at-least-once processing via acking (exactly-once is available through the Trident API)
Key Concepts
Topology: Graph of computation (DAG)
Spout: Source of streams (reads from external systems)
Bolt: Processing unit (transforms, filters, aggregates)
Tuple: Unit of data flowing through topology
Stream: Sequence of tuples
Stream Grouping: How tuples are distributed to bolts
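The pieces above compose into a DAG. As a language-neutral mental model (plain Python, no Storm involved), the classic word count flows spout → split bolt → count bolt like this:

```python
def spout():
    # Source of the stream: emits sentences (tuples)
    yield "the cow jumped over the moon"

def split_bolt(sentence):
    # One tuple in, many tuples out
    for word in sentence.split():
        yield word

def count_bolt(words):
    # Stateful aggregation keyed by word
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

counts = count_bolt(w for s in spout() for w in split_bolt(s))
# "the" appears twice in the sentence
```

In Storm the same three stages run as separate, parallel components connected by stream groupings, rather than as one in-process pipeline.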
Architecture
High-Level Architecture
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Data │────▶│ Data │────▶│ Data │
│ Source │ │ Source │ │ Source │
│ (Kafka) │ │ (Kinesis) │ │ (Files) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└────────────────────┴────────────────────┘
│
▼
┌─────────────────────────┐
│ Storm Nimbus │
│ (Master/Coordinator) │
└──────┬──────────────────┘
│
▼
┌─────────────────────────┐
│ Supervisor Nodes │
│ (Workers) │
│ │
│ ┌──────────┐ │
│ │ Spouts │ │
│ │ (Input) │ │
│ └────┬─────┘ │
│ │ │
│ ┌────┴─────┐ │
│ │ Bolts │ │
│ │(Process) │ │
│ └──────────┘ │
└──────┬──────────────────┘
│
┌─────────────┴─────────────┐
│ │
┌──────▼──────┐ ┌───────▼──────┐
│ Data │ │ Data │
│ Sink │ │ Sink │
│ (Database) │ │ (Kafka) │
└─────────────┘ └─────────────┘
Explanation:
- Data Sources: Systems that produce streaming data (e.g., Kafka, Kinesis, file systems, databases).
- Storm Nimbus: Master daemon that distributes topology code, assigns tasks to supervisors, and monitors for failures. Nimbus is control plane only; the data itself flows from sources directly into spouts running on the supervisor nodes.
- Supervisor Nodes: Worker nodes that execute tasks. Each supervisor runs worker processes that execute spouts and bolts.
- Spouts: Sources of data streams that read from external systems and emit tuples.
- Bolts: Processing units that consume tuples, perform transformations, and emit new tuples.
- Data Sinks: Systems that consume processed data (e.g., databases, message queues, file systems).
Core Architecture
┌─────────────────────────────────────────────────────────┐
│ Storm Cluster │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Nimbus (Master) │ │
│ │ (Topology Submission, Coordination) │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ ZooKeeper │ │
│ │ (Coordination, State Management) │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Supervisor Nodes │ │
│ │ (Worker Processes, Task Execution) │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Topology
Basic Topology
Java:
TopologyBuilder builder = new TopologyBuilder();
// Spout
builder.setSpout("spout", new RandomSentenceSpout(), 1);
// Bolt
builder.setBolt("split", new SplitSentenceBolt(), 2)
.shuffleGrouping("spout");
builder.setBolt("count", new WordCountBolt(), 2)
.fieldsGrouping("split", new Fields("word"));
Config config = new Config();
StormSubmitter.submitTopology("word-count", config, builder.createTopology());
Python (using the third-party streamparse library):
from streamparse import Grouping, Topology

class WordCountTopology(Topology):
    spout = RandomSentenceSpout.spec(par=1)
    split = SplitSentenceBolt.spec(par=2, inputs={spout: Grouping.SHUFFLE})
    count = WordCountBolt.spec(par=2, inputs={split: Grouping.fields('word')})
Spouts
Basic Spout
Java:
public class RandomSentenceSpout extends BaseRichSpout {
private SpoutOutputCollector collector;
private Random random;
@Override
public void open(Map conf, TopologyContext context,
SpoutOutputCollector collector) {
this.collector = collector;
this.random = new Random();
}
@Override
public void nextTuple() {
    Utils.sleep(100); // back off briefly so an empty source doesn't busy-spin
    String[] sentences = {
        "the cow jumped over the moon",
        "an apple a day keeps the doctor away"
    };
    String sentence = sentences[random.nextInt(sentences.length)];
    // Emitting with a message ID opts this tuple into ack/fail tracking
    collector.emit(new Values(sentence), UUID.randomUUID().toString());
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("sentence"));
}
}
Kafka Spout
Java (legacy storm-kafka module):
SpoutConfig spoutConfig = new SpoutConfig(
    new ZkHosts("localhost:2181"),  // ZooKeeper, used here for offset storage
    "topic-name",                   // Kafka topic to consume
    "/kafka-storm",                 // ZooKeeper root path for offsets
    "spout-id"                      // consumer id
);
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
builder.setSpout("kafka-spout", kafkaSpout, 1);
Note: newer Storm releases ship storm-kafka-client instead, where the spout is built from KafkaSpoutConfig.builder(bootstrapServers, topic).
Bolts
Basic Bolt
Java:
public class SplitSentenceBolt extends BaseRichBolt {
private OutputCollector collector;
@Override
public void prepare(Map conf, TopologyContext context,
OutputCollector collector) {
this.collector = collector;
}
@Override
public void execute(Tuple tuple) {
    String sentence = tuple.getStringByField("sentence");
    for (String word : sentence.split(" ")) {
        // Anchor the new tuple to its input so a downstream failure replays it
        collector.emit(tuple, new Values(word));
    }
    collector.ack(tuple);
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
}
Aggregation Bolt
Java:
public class WordCountBolt extends BaseRichBolt {
    private OutputCollector collector;
    private Map<String, Integer> counts = new HashMap<>();
    @Override
    public void prepare(Map conf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector; // required; otherwise execute() hits a null collector
    }
    @Override
    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        counts.put(word, counts.getOrDefault(word, 0) + 1);
        collector.emit(tuple, new Values(word, counts.get(word)));
        collector.ack(tuple);
    }
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word", "count"));
}
}
Stream Grouping
Shuffle Grouping
Random, roughly even distribution across the bolt's tasks:
builder.setBolt("bolt", new MyBolt(), 2)
.shuffleGrouping("spout");
Fields Grouping
Group by field value (tuples with the same value always route to the same task):
builder.setBolt("bolt", new MyBolt(), 2)
.fieldsGrouping("spout", new Fields("user_id"));
All Grouping
Broadcast every tuple to all of the bolt's tasks:
builder.setBolt("bolt", new MyBolt(), 2)
.allGrouping("spout");
Global Grouping
Route the entire stream to a single task (the one with the lowest id):
builder.setBolt("bolt", new MyBolt(), 2)
.globalGrouping("spout");
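The practical difference between these groupings can be sketched outside Storm. This toy routing function assumes a simple hash-mod scheme; Storm's actual task assignment differs in detail, but the key property is the same: a fields grouping sends equal keys to the same task, while shuffle spreads load with no per-key affinity.

```python
def fields_grouping(key, num_tasks):
    # Equal keys always map to the same task index, which is what
    # lets a downstream bolt keep consistent per-key state.
    return hash(key) % num_tasks

def shuffle_grouping(sequence_no, num_tasks):
    # Round-robin stand-in for shuffle: load spreads evenly,
    # but no task can assume it sees every tuple for a given key.
    return sequence_no % num_tasks

# Same word, same task -- every time
assert fields_grouping("apple", 4) == fields_grouping("apple", 4)
```

This is why WordCountBolt above must receive its input via fieldsGrouping("split", new Fields("word")): with shuffle, two tasks could each hold a partial count for the same word.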
Reliability
Acknowledgment
Ack Tuple:
@Override
public void execute(Tuple tuple) {
// Process tuple
processTuple(tuple);
// Acknowledge
collector.ack(tuple);
}
Fail Tuple:
@Override
public void execute(Tuple tuple) {
try {
processTuple(tuple);
collector.ack(tuple);
} catch (Exception e) {
collector.fail(tuple);
}
}
Anchoring
Anchor Tuple:
@Override
public void execute(Tuple tuple) {
String word = tuple.getStringByField("word");
// Emit with anchor
collector.emit(tuple, new Values(word.toUpperCase()));
collector.ack(tuple);
}
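Storm can track millions of in-flight tuple trees cheaply because the acker keeps only a single 64-bit XOR per tree: every anchored tuple id and every ack is XORed in, and the tree is complete when the value returns to zero. A toy model of that idea (the ids here are illustrative, not Storm's actual values):

```python
import random

class AckerSketch:
    """Toy model of Storm's acker: one XOR accumulator per tuple tree."""
    def __init__(self):
        self.val = 0

    def track(self, tuple_id):
        # Called when a tuple is emitted (anchored) into the tree
        self.val ^= tuple_id

    def ack(self, tuple_id):
        # Called when that tuple finishes processing
        self.val ^= tuple_id

    def tree_complete(self):
        # XOR returns to zero only once every tracked id was acked
        return self.val == 0

acker = AckerSketch()
ids = [random.getrandbits(64) for _ in range(3)]
for i in ids:
    acker.track(i)
for i in ids:
    acker.ack(i)
assert acker.tree_complete()
```

If the XOR never reaches zero within the message timeout, Storm calls fail() on the spout, which is what makes replay (at-least-once) possible.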
Performance Optimization
Parallelism
Set Parallelism:
builder.setSpout("spout", new MySpout(), 4); // 4 executors
builder.setBolt("bolt", new MyBolt(), 8); // 8 executors
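The numbers passed to setSpout/setBolt are executor (thread) counts, and those threads are spread across the topology's worker processes. A quick sketch of an even spread (this mirrors Storm's default even scheduler only approximately):

```python
def executors_per_worker(total_executors, num_workers):
    # Spread executors as evenly as possible across worker processes,
    # giving the remainder one extra executor each.
    base, extra = divmod(total_executors, num_workers)
    return [base + (1 if i < extra else 0) for i in range(num_workers)]

# 4 spout executors + 8 bolt executors over 4 workers -> 3 threads each
print(executors_per_worker(12, 4))
```

Rebalancing a live topology changes these counts without redeploying, so it is common to start conservatively and scale executors up under load.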
Resource Allocation
Set Resources:
Config config = new Config();
config.setNumWorkers(4);         // worker processes allocated to the topology
config.setMaxSpoutPending(1000); // cap on unacked tuples in flight per spout
Backpressure
Storm does not expose a pending-tuple count to bolt code; backpressure is handled by the framework. The simplest lever is max spout pending, which pauses a spout's nextTuple() calls once that many emitted tuples are still unacked:
config.setMaxSpoutPending(1000);
Since version 1.0, Storm also includes an automatic backpressure mechanism that throttles spouts when bolt receive queues pass a high-water mark.
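The throttling behavior of max spout pending can be modeled in a few lines (hypothetical classes, not Storm's API):

```python
class ThrottledSpout:
    """Sketch of max-spout-pending throttling: stop emitting while
    too many tuples are still unacked."""
    def __init__(self, max_pending):
        self.max_pending = max_pending
        self.pending = set()   # ids of emitted-but-unacked tuples
        self.next_id = 0

    def next_tuple(self):
        if len(self.pending) >= self.max_pending:
            return None        # throttled: framework skips this spout
        self.next_id += 1
        self.pending.add(self.next_id)
        return self.next_id

    def ack(self, tuple_id):
        self.pending.discard(tuple_id)  # frees capacity to emit again

spout = ThrottledSpout(max_pending=2)
a, b = spout.next_tuple(), spout.next_tuple()
assert spout.next_tuple() is None   # at the limit, emission pauses
spout.ack(a)
assert spout.next_tuple() is not None
```

Setting the limit too low starves the topology; too high risks overwhelming slow bolts, so it is typically tuned against observed bolt capacity.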
Best Practices
1. Topology Design
- Keep topologies simple
- Use appropriate parallelism
- Minimize network hops
- Design for failure
2. Reliability
- Implement proper acking
- Handle failures gracefully
- Use anchoring for tracking
- Monitor tuple processing
3. Performance
- Tune parallelism
- Optimize stream grouping
- Minimize serialization
- Monitor backpressure
4. Error Handling
- Catch exceptions
- Fail tuples appropriately
- Implement retry logic
- Log errors
What Interviewers Look For
Stream Processing Understanding
- Storm Concepts
  - Understanding of topologies, spouts, and bolts
  - Stream grouping selection
  - Reliability mechanisms
  - Red Flags: no Storm understanding, confused concepts, no reliability story
- Real-Time Processing
  - Low-latency processing
  - Throughput optimization
  - Backpressure handling
  - Red Flags: no latency awareness, poor throughput, no backpressure plan
- Fault Tolerance
  - Acknowledgment mechanisms
  - Failure recovery
  - Exactly-once semantics
  - Red Flags: no acking, poor recovery, no processing guarantees
Problem-Solving Approach
- Topology Design
  - Spout and bolt organization
  - Stream grouping selection
  - Parallelism tuning
  - Red Flags: poor organization, wrong grouping, no parallelism tuning
- Reliability Design
  - Acknowledgment strategy
  - Failure handling
  - Exactly-once processing
  - Red Flags: no acking, poor failure handling, no exactly-once story
System Design Skills
- Stream Processing Architecture
  - Storm cluster design
  - Topology organization
  - Resource allocation
  - Red Flags: no architecture, poor organization, no resource planning
- Scalability
  - Horizontal scaling
  - Parallelism tuning
  - Performance optimization
  - Red Flags: no scaling story, poor parallelism, no optimization
Communication Skills
- Clear Explanation
  - Explains Storm concepts
  - Discusses trade-offs
  - Justifies design decisions
  - Red Flags: unclear explanations, no justification, confusing delivery
Meta-Specific Focus
- Stream Processing Expertise
  - Understanding of real-time processing
  - Storm mastery
  - Performance optimization
  - Key: demonstrate stream processing expertise
- System Design Skills
  - Can design stream processing systems
  - Understands real-time challenges
  - Makes informed trade-offs
  - Key: show practical stream processing design skills
Summary
Apache Storm Key Points:
- Real-Time Processing: Processes streams in real-time
- Topology-Based: Graph of spouts and bolts
- Fault Tolerant: Automatic failure recovery
- Guaranteed Processing: At-least-once by default; exactly-once via the Trident API
- Scalable: Horizontal scaling with parallelism
Common Use Cases:
- Real-time analytics
- Event processing
- Stream aggregation
- ETL pipelines
- Fraud detection
- Alerting systems
Best Practices:
- Design simple topologies
- Use appropriate stream grouping
- Implement proper acking
- Tune parallelism
- Handle backpressure
- Monitor performance
- Design for failure
Apache Storm is a powerful framework for building real-time stream processing applications with guaranteed message processing.