Introduction
Apache HBase is a distributed, scalable, NoSQL database built on top of Hadoop HDFS. It provides random, real-time read/write access to big data and is modeled after Google’s Bigtable. Understanding HBase is essential for system design interviews involving big data storage and time-series data.
This guide covers:
- HBase Fundamentals: Column families, rows, cells, and versioning
- Data Model: Row keys, column qualifiers, and timestamps
- Architecture: Regions, RegionServers, and HMaster
- Performance: Row key design, compaction, and optimization
- Best Practices: Schema design, access patterns, and monitoring
What is Apache HBase?
Apache HBase is a NoSQL database with the following characteristics:
- Column-Family Storage: Data organized in column families
- HDFS Integration: Built on Hadoop Distributed File System
- Random Access: Fast random read/write access
- Scalability: Handles billions of rows and millions of columns
- Consistency: Strong consistency per row
Key Concepts
Table: Collection of rows
Row: Identified by row key
Column Family: Group of columns
Column Qualifier: Column name within family
Cell: Intersection of row, column family, and column qualifier; stores timestamped, versioned values
Region: Contiguous range of rows; the unit of distribution across RegionServers
RegionServer: Server that manages regions
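A cell is fully addressed by (row key, column family, column qualifier, timestamp), and HBase can retain several timestamped versions per cell. A minimal sketch of reading multiple versions, assuming a Table handle like the one opened in the Java API section below:
Get get = new Get(Bytes.toBytes("user123"));
get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));
get.setMaxVersions(3); // return up to three timestamped versions of the cell
Result result = table.get(get);
for (Cell cell : result.rawCells()) {
    System.out.println(cell.getTimestamp() + " -> " +
        Bytes.toString(CellUtil.cloneValue(cell)));
}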
Architecture
High-Level Architecture
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Client │────▶│ Client │────▶│ Client │
│ Application │ │ Application │ │ Application │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└────────────────────┴────────────────────┘
│
│ HBase API
│
▼
┌─────────────────────────┐
│ HBase Cluster │
│ │
│ ┌──────────┐ │
│ │ HMaster │ │
│ │(Metadata)│ │
│ └────┬─────┘ │
│ │ │
│ ┌────┴─────┐ │
│ │ Region │ │
│ │ Servers │ │
│ └──────────┘ │
│ │
│ ┌───────────────────┐ │
│ │ HDFS │ │
│ │ (Storage) │ │
│ └───────────────────┘ │
└─────────────────────────┘
Explanation:
- Client Applications: Applications that use HBase to store and retrieve large-scale structured data (e.g., big data applications, analytics platforms).
- HBase Cluster: Distributed NoSQL database built on top of Hadoop HDFS for storing large amounts of sparse data.
- HMaster: Manages metadata, coordinates region assignments, and handles cluster administration.
- Region Servers: Serve data for a set of regions. Each region server handles read/write requests for the regions it serves.
- HDFS (Storage): Hadoop Distributed File System that provides the underlying storage layer for HBase data.
Core Architecture
┌─────────────────────────────────────────────────────────┐
│ HBase Client │
│ (HBase API, Connection Pooling) │
└────────────────────┬────────────────────────────────────┘
│
┌────────────┴────────────┐
│ Zookeeper │
│ (Coordination, Metadata)│
└────────────┬────────────┘
│
┌────────────┴────────────┐
│ HMaster │
│ (Metadata, Coordination)│
└────────────┬────────────┘
│
┌────────────┴────────────┐
│ │
┌───────▼────────┐ ┌─────────▼──────────┐
│ RegionServer 1 │ │ RegionServer 2 │
│ (Regions) │ │ (Regions) │
└───────┬────────┘ └─────────┬──────────┘
│ │
└────────────┬────────────┘
│
┌────────────▼────────────┐
│ HDFS │
│ (Data Storage) │
└─────────────────────────┘
Data Model
Table Structure
Table: users
Row Key: user123
Column Family: info
Column Qualifier: name → Value: "John Doe"
Column Qualifier: email → Value: "john@example.com"
Column Family: activity
Column Qualifier: last_login → Value: "2024-01-01"
Column Qualifier: login_count → Value: "42"
Row Key Design
Good Row Keys:
- Salting: Add prefix for distribution (see the sketch below)
  Row Key: hash(user_id) + user_id
- Reversing: Reverse timestamp for time-series
  Row Key: reverse(timestamp) + user_id
- Composite: Combine multiple fields
  Row Key: user_id + timestamp
Bad Row Keys:
- Sequential IDs (hotspotting)
- Timestamps alone (hotspotting)
- Overly long keys (the full key is stored with every cell, inflating storage)
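A minimal sketch of the salting and reversing strategies, assuming 16 salt buckets that match the table's pre-split regions (the bucket count and variable names are illustrative):
// Salting: a two-digit bucket prefix spreads sequential IDs across regions
int SALT_BUCKETS = 16; // assumption: matches the number of pre-split regions
String userId = "user123";
int bucket = (userId.hashCode() & Integer.MAX_VALUE) % SALT_BUCKETS;
String saltedKey = String.format("%02d_%s", bucket, userId); // e.g. "07_user123"
// Reversing: newest rows sort first for time-series scans
long ts = System.currentTimeMillis();
String reversedKey = (Long.MAX_VALUE - ts) + "_" + userId;
The trade-off of salting: a scan must fan out across all buckets, so it buys write distribution at the cost of read complexity.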
Column Families
Design Principles:
- Keep column families small (2-3 recommended)
- Group related columns together
- Consider access patterns
- Avoid too many column families
Example:
// Create a table with two column families (HBase 1.x API);
// assumes an open Connection (see the Java API section below)
Admin admin = connection.getAdmin();
HTableDescriptor table = new HTableDescriptor(TableName.valueOf("users"));
table.addFamily(new HColumnDescriptor("info"));
table.addFamily(new HColumnDescriptor("activity"));
admin.createTable(table);
HBase Operations
Java API
Put (Write):
Connection connection = ConnectionFactory.createConnection();
Table table = connection.getTable(TableName.valueOf("users"));
Put put = new Put(Bytes.toBytes("user123")); // row key
put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
    Bytes.toBytes("John Doe"));
put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
    Bytes.toBytes("john@example.com"));
table.put(put);
table.close();
connection.close();
Get (Read):
Get get = new Get(Bytes.toBytes("user123"));
get.addFamily(Bytes.toBytes("info"));
Result result = table.get(get);
byte[] name = result.getValue(Bytes.toBytes("info"),
    Bytes.toBytes("name")); // null if the cell does not exist
Scan (Range Query):
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("info"));
scan.setStartRow(Bytes.toBytes("user100"));
scan.setStopRow(Bytes.toBytes("user200"));
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
// Process result
}
scanner.close();
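setStartRow and setStopRow are deprecated in HBase 2.x in favor of the fluent equivalents:
Scan scan = new Scan()
    .withStartRow(Bytes.toBytes("user100"))
    .withStopRow(Bytes.toBytes("user200"));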
Delete:
Delete delete = new Delete(Bytes.toBytes("user123"));
// addColumn deletes only the most recent version of the cell;
// use addColumns to delete all versions
delete.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));
table.delete(delete);
Python API (HappyBase)
Operations:
import happybase
connection = happybase.Connection('localhost')
table = connection.table('users')
# Put (row keys, qualifiers, and values are bytes in Python 3)
table.put(b'user123', {
    b'info:name': b'John Doe',
    b'info:email': b'john@example.com'
})
# Get
row = table.row(b'user123')
print(row[b'info:name'])
# Scan
for key, data in table.scan(row_prefix=b'user'):
    print(key, data)
Regions and Sharding
Region Splitting
Automatic Splitting:
- Regions split when size exceeds threshold
- Default: 10GB per region
- Split at row key midpoint
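The size threshold is the hbase.hregion.max.filesize setting; a minimal hbase-site.xml override (10 GB shown, the default in recent releases):
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>10737418240</value> <!-- 10 GB in bytes -->
</property>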
Manual Splitting:
hbase shell> split 'users', 'user500'
Region Distribution
Pre-splitting:
// Create the table pre-split into four regions so initial writes
// spread across RegionServers instead of hammering a single one
byte[][] splits = new byte[][]{
    Bytes.toBytes("user100"),
    Bytes.toBytes("user200"),
    Bytes.toBytes("user300")
};
admin.createTable(tableDescriptor, splits);
Performance Characteristics
Maximum Read & Write Throughput
Single RegionServer:
- Max Write Throughput: 10K-50K writes/sec (depends on row size and column families)
- Max Read Throughput: 5K-25K reads/sec (depends on data locality and caching)
Cluster (Horizontal Scaling):
- Max Write Throughput: 10K-50K writes/sec per RegionServer (scales roughly linearly when row keys distribute load evenly)
- Max Read Throughput: 5K-25K reads/sec per RegionServer (scales roughly linearly)
- Example: a 100-RegionServer cluster can handle roughly 1M-5M writes/sec and 500K-2.5M reads/sec in aggregate
Factors Affecting Throughput:
- Row key design (hotspotting reduces throughput)
- Region distribution (even distribution = better throughput)
- MemStore size and flush frequency
- Block cache hit rate (higher cache hits = faster reads)
- HDFS replication factor
- Compaction strategy
- Network latency
- Number of RegionServers
Optimized Configuration:
- Max Write Throughput: 50K-100K writes/sec per RegionServer (with optimized row keys and memstore settings)
- Max Read Throughput: 25K-50K reads/sec per RegionServer (with high block cache hit rate and proper row key design)
Performance Optimization
Row Key Design
Avoid Hotspotting:
// Bad: Sequential IDs
Row Key: 1, 2, 3, 4, ...
// Good: Salted
Row Key: hash(user_id) + user_id
Time-Series Data:
// Bad: Timestamp first
Row Key: timestamp + user_id
// Good: Reversed timestamp
Row Key: Long.MAX_VALUE - timestamp + user_id
Caching
Block Cache:
Get get = new Get(Bytes.toBytes("user123"));
get.setCacheBlocks(true);
Column Family Cache:
HColumnDescriptor family = new HColumnDescriptor("info");
family.setBlockCacheEnabled(true);
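Block cache keeps HFile blocks in RegionServer memory; scanner caching is a separate, client-side knob controlling how many rows each RPC returns. A sketch with illustrative values:
Scan scan = new Scan();
scan.setCaching(500); // rows fetched per RPC: fewer round trips, more client memory
scan.setBatch(100);   // cap on columns per Result, useful for very wide rows
ResultScanner scanner = table.getScanner(scan);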
Bloom Filters
Enable Bloom Filter:
HColumnDescriptor family = new HColumnDescriptor("info");
family.setBloomFilterType(BloomType.ROW);
Benefits:
- Skips HFiles that cannot contain the target row, reducing disk I/O
- Faster point lookups (Gets)
Trade-off:
- Small per-HFile memory and storage overhead
Compaction
Major Compaction:
hbase shell> major_compact 'users'
Compaction Types:
- Minor: Merges a subset of smaller HFiles into fewer, larger ones
- Major: Rewrites all HFiles in a store into one, dropping deleted and expired cells
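Major compactions can also be triggered from the Java Admin API. Because they rewrite every HFile, operators often run them during off-peak windows rather than relying on the automatic trigger:
admin.majorCompact(TableName.valueOf("users"));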
Schema Design
Time-Series Data
Schema:
Row Key: reverse(timestamp) + metric_id
Column Family: metrics
Column Qualifier: value
Column Qualifier: tags
Example:
// Row key for time-series: newest rows sort first.
// Long.MAX_VALUE - epoch millis stays 19 digits, so lexicographic
// order matches numeric order without extra padding.
long timestamp = System.currentTimeMillis();
String metricId = "cpu.load"; // illustrative metric identifier
String rowKey = String.valueOf(Long.MAX_VALUE - timestamp) +
    "_" + metricId;
Wide Tables
Many Columns:
Row Key: user_id
Column Family: attributes
Column Qualifier: attr_1, attr_2, ..., attr_N
Tall Tables
Many Rows:
Row Key: user_id + timestamp
Column Family: events
Column Qualifier: event_type
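A sketch of appending one event to the tall-table layout above, assuming a pre-created table with an events column family (the event name is illustrative):
long ts = System.currentTimeMillis();
Put put = new Put(Bytes.toBytes("user123_" + ts)); // row key: user_id + timestamp
put.addColumn(Bytes.toBytes("events"), Bytes.toBytes("event_type"),
    Bytes.toBytes("login"));
table.put(put);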
Best Practices
1. Row Key Design
- Avoid sequential keys
- Use salting for distribution
- Reverse timestamps for time-series
- Keep keys short
2. Column Families
- Limit to 2-3 families
- Group related columns
- Consider access patterns
- Avoid too many families
3. Performance
- Enable bloom filters
- Use appropriate caching
- Monitor compaction
- Optimize scans
4. Monitoring
- Monitor region distribution
- Track read/write performance
- Monitor HDFS usage
- Watch for hotspots
What Interviewers Look For
NoSQL Database Understanding
- HBase Concepts
  - Understanding of column-family model
  - Row key design
  - Region architecture
  - Red Flags: No HBase understanding, wrong model, poor row keys
- Big Data Storage
  - HDFS integration
  - Scalability patterns
  - Performance optimization
  - Red Flags: No HDFS understanding, poor scalability, no optimization
- Schema Design
  - Column family design
  - Row key strategies
  - Access patterns
  - Red Flags: Poor schema, wrong row keys, no access patterns
Problem-Solving Approach
- Row Key Design
  - Avoid hotspotting
  - Distribution strategies
  - Time-series patterns
  - Red Flags: Sequential keys, hotspotting, poor distribution
- Performance Optimization
  - Caching strategies
  - Bloom filters
  - Compaction tuning
  - Red Flags: No optimization, poor performance, no tuning
System Design Skills
- Big Data Architecture
  - HBase cluster design
  - Region distribution
  - HDFS integration
  - Red Flags: No architecture, poor design, no HDFS
- Scalability
  - Horizontal scaling
  - Region management
  - Performance tuning
  - Red Flags: No scaling, poor region management, no tuning
Communication Skills
- Clear Explanation
  - Explains HBase concepts
  - Discusses trade-offs
  - Justifies design decisions
  - Red Flags: Unclear explanations, no justification, confusing answers
Meta-Specific Focus
- Big Data Expertise
  - Understanding of big data storage
  - HBase mastery
  - Performance optimization
  - Key: Demonstrate big data expertise
- System Design Skills
  - Can design big data systems
  - Understands scalability challenges
  - Makes informed trade-offs
  - Key: Show practical big data design skills
Summary
HBase Key Points:
- Column-Family Storage: Data organized in column families
- HDFS Integration: Built on Hadoop Distributed File System
- Random Access: Fast random read/write
- Scalability: Handles billions of rows
- Row Key Design: Critical for performance and distribution
Common Use Cases:
- Time-series data
- Sensor data
- Log storage
- User activity tracking
- Real-time analytics
- Big data storage
Best Practices:
- Design row keys carefully
- Limit column families (2-3)
- Use salting for distribution
- Enable bloom filters
- Monitor region distribution
- Optimize for access patterns
Apache HBase is a powerful NoSQL database for storing and accessing large-scale structured data with random access patterns.