Introduction

Apache HBase is a distributed, scalable, NoSQL database built on top of Hadoop HDFS. It provides random, real-time read/write access to big data and is modeled after Google’s Bigtable. Understanding HBase is essential for system design interviews involving big data storage and time-series data.

This guide covers:

  • HBase Fundamentals: Column families, rows, cells, and versioning
  • Data Model: Row keys, column qualifiers, and timestamps
  • Architecture: Regions, RegionServers, and HMaster
  • Performance: Row key design, compaction, and optimization
  • Best Practices: Schema design, access patterns, and monitoring

What is Apache HBase?

Apache HBase is a NoSQL database with the following characteristics:

  • Column-Family Storage: Data organized in column families
  • HDFS Integration: Built on Hadoop Distributed File System
  • Random Access: Fast random read/write access
  • Scalability: Handles billions of rows and millions of columns
  • Consistency: Strong consistency per row

Key Concepts

Table: Collection of rows

Row: Identified by row key

Column Family: Group of columns

Column Qualifier: Column name within family

Cell: Intersection of a row, column family, and column qualifier; each cell holds one or more timestamped versions of a value

Region: Partition of table data

RegionServer: Server that manages regions
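
Conceptually, an HBase table behaves like a sorted, sparse, multi-dimensional map from (row key, column, timestamp) to value. A minimal sketch of that logical model using plain Java collections (illustrative only, not the HBase API):

import java.util.TreeMap;

// Logical model only: row key -> "family:qualifier" -> timestamp -> value.
// Rows are sorted by key; absent cells simply have no entry (sparse).
public class LogicalModel {
    public static void main(String[] args) {
        TreeMap<String, TreeMap<String, TreeMap<Long, String>>> table = new TreeMap<>();
        table.computeIfAbsent("user123", r -> new TreeMap<>())
             .computeIfAbsent("info:name", c -> new TreeMap<>())
             .put(System.currentTimeMillis(), "John Doe");
        System.out.println(table);
    }
}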

Architecture

High-Level Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │     │   Client    │     │   Client    │
│ Application │     │ Application │     │ Application │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                           │
                           │ HBase API
                           ▼
              ┌─────────────────────────┐
              │   HBase Cluster         │
              │                         │
              │  ┌──────────┐           │
              │  │ HMaster  │           │
              │  │(Metadata)│           │
              │  └────┬─────┘           │
              │       │                 │
              │  ┌────┴─────┐           │
              │  │ Region   │           │
              │  │ Servers  │           │
              │  └──────────┘           │
              │                         │
              │  ┌───────────────────┐  │
              │  │  HDFS             │  │
              │  │  (Storage)        │  │
              │  └───────────────────┘  │
              └─────────────────────────┘

Explanation:

  • Client Applications: Applications that use HBase to store and retrieve large-scale structured data (e.g., big data applications, analytics platforms).
  • HBase Cluster: Distributed NoSQL database built on top of Hadoop HDFS for storing large amounts of sparse data.
  • HMaster: Manages metadata, coordinates region assignments, and handles cluster administration.
  • RegionServers: Each RegionServer hosts a set of regions and serves all read/write requests for the rows in those regions.
  • HDFS (Storage): Hadoop Distributed File System that provides the underlying storage layer for HBase data.

Core Architecture

┌─────────────────────────────────────────────────────────┐
│                       HBase Client                       │
│              (HBase API, Connection Pooling)             │
└────────────────────────────┬────────────────────────────┘
                             │
              ┌──────────────┴──────────────┐
              │          ZooKeeper          │
              │  (Coordination, Metadata)   │
              └──────────────┬──────────────┘
                             │
              ┌──────────────┴──────────────┐
              │           HMaster           │
              │ (Region Assignment, Admin)  │
              └──────────────┬──────────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
     ┌────────▼────────┐           ┌────────▼────────┐
     │ RegionServer 1  │           │ RegionServer 2  │
     │    (Regions)    │           │    (Regions)    │
     └────────┬────────┘           └────────┬────────┘
              │                             │
              └──────────────┬──────────────┘
                             │
              ┌──────────────▼──────────────┐
              │            HDFS             │
              │       (Data Storage)        │
              └─────────────────────────────┘

Note: this stack shows coordination, not the data path. Clients consult ZooKeeper only to locate the hbase:meta table, cache region locations, and then read and write directly against RegionServers; the HMaster handles DDL and region assignment and is not on the per-request path.

Data Model

Table Structure

Table: users
Row Key: user123
Column Family: info
  Column Qualifier: name → Value: "John Doe"
  Column Qualifier: email → Value: "john@example.com"
Column Family: activity
  Column Qualifier: last_login → Value: "2024-01-01"
  Column Qualifier: login_count → Value: "42"

Row Key Design

Good Row Keys:

  • Salting: Add a hash prefix for distribution (see the sketch after the bad-keys list below)
    Row Key: hash(user_id) + user_id
    
  • Reversing: Reverse timestamp for time-series
    Row Key: reverse(timestamp) + user_id
    
  • Composite: Combine multiple fields
    Row Key: user_id + timestamp
    

Bad Row Keys:

  • Sequential IDs (all writes hit one region: hotspotting)
  • Timestamps alone (monotonically increasing: hotspotting)
  • Overly long keys (the row key is stored with every cell, so long keys inflate storage)
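
A minimal sketch of the salting idea, assuming writes should spread across a fixed number of pre-split buckets (the bucket count and key format here are illustrative choices, not HBase requirements):

import java.nio.charset.StandardCharsets;

public class SaltedKeys {
    // Should match the number of pre-split regions so each bucket maps to one region
    static final int BUCKETS = 16;

    // Prefix the natural key with a stable hash bucket so sequential user IDs
    // no longer land on the same region
    static byte[] saltedKey(String userId) {
        int bucket = Math.abs(userId.hashCode() % BUCKETS);
        return String.format("%02d_%s", bucket, userId)
                     .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(new String(saltedKey("user123"), StandardCharsets.UTF_8));
    }
}

The trade-off: point reads must recompute the salt, and range scans across time must fan out over all buckets.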

Column Families

Design Principles:

  • Keep the number of column families small (2-3 at most); every extra family adds flush and compaction overhead
  • Group columns that are read and written together into the same family
  • Design families around your access patterns

Example:

// Create table with column families
// (pre-2.0 admin API; HTableDescriptor/HColumnDescriptor are deprecated in HBase 2.x)
HTableDescriptor table = new HTableDescriptor(TableName.valueOf("users"));
table.addFamily(new HColumnDescriptor("info"));
table.addFamily(new HColumnDescriptor("activity"));

HBase Operations

Java API

Put (Write):

// Assumes hbase-site.xml is on the classpath; the connection and table
// handles opened here are reused by the Get/Scan/Delete examples below
Connection connection = ConnectionFactory.createConnection();
Table table = connection.getTable(TableName.valueOf("users"));

Put put = new Put(Bytes.toBytes("user123"));
put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
              Bytes.toBytes("John Doe"));
put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
              Bytes.toBytes("john@example.com"));

table.put(put);

Get (Read):

Get get = new Get(Bytes.toBytes("user123"));
get.addFamily(Bytes.toBytes("info"));

Result result = table.get(get);
byte[] name = result.getValue(Bytes.toBytes("info"), 
                              Bytes.toBytes("name"));

Scan (Range Query):

Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("info"));
// HBase 1.x API; 2.x deprecates these in favor of withStartRow/withStopRow
scan.setStartRow(Bytes.toBytes("user100"));
scan.setStopRow(Bytes.toBytes("user200"));   // stop row is exclusive

ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
    // Process each row, e.g. print its row key
    System.out.println(Bytes.toString(result.getRow()));
}
scanner.close();

Delete:

Delete delete = new Delete(Bytes.toBytes("user123"));
delete.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));
table.delete(delete);

// Release the handles opened in the Put example
table.close();
connection.close();

Python API (HappyBase)

Operations:

import happybase

connection = happybase.Connection('localhost')  # requires the HBase Thrift server
table = connection.table('users')

# Put (HappyBase works with bytes for row keys, columns, and values)
table.put(b'user123', {
    b'info:name': b'John Doe',
    b'info:email': b'john@example.com'
})

# Get
row = table.row(b'user123')
print(row[b'info:name'])

# Scan
for key, data in table.scan(row_prefix=b'user'):
    print(key, data)

connection.close()

Regions and Sharding

Region Splitting

Automatic Splitting:

  • Regions split when size exceeds threshold
  • Default: 10GB per region
  • Split at row key midpoint
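
The size threshold comes from hbase.hregion.max.filesize in hbase-site.xml; the value shown below is the current default, and the exact trigger depends on the configured split policy:

<property>
  <name>hbase.hregion.max.filesize</name>
  <value>10737418240</value> <!-- 10 GB: upper bound before a region splits -->
</property>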

Manual Splitting:

hbase shell> split 'users', 'user500'

Region Distribution

Pre-splitting:

// Assumes an Admin handle and the table descriptor built in the
// "Column Families" example above
byte[][] splits = new byte[][]{
    Bytes.toBytes("user100"),
    Bytes.toBytes("user200"),
    Bytes.toBytes("user300")
};

admin.createTable(tableDescriptor, splits);
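
The same pre-splitting can be done from the HBase shell at table-creation time:

hbase shell> create 'users', 'info', 'activity', SPLITS => ['user100', 'user200', 'user300']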

Performance Characteristics

Maximum Read & Write Throughput

Single RegionServer (ballpark figures; actual numbers vary widely with hardware and workload):

  • Write Throughput: 10K-50K writes/sec (depends on row size and number of column families)
  • Read Throughput: 5K-25K reads/sec (depends on data locality and caching)

Cluster (Horizontal Scaling):

  • Write Throughput: 10K-50K writes/sec per RegionServer (near-linear scaling with even key distribution)
  • Read Throughput: 5K-25K reads/sec per RegionServer (near-linear scaling)
  • Example: a 100-RegionServer cluster can handle roughly 1M-5M writes/sec and 500K-2.5M reads/sec in aggregate

Factors Affecting Throughput:

  • Row key design (hotspotting reduces throughput)
  • Region distribution (even distribution = better throughput)
  • MemStore size and flush frequency
  • Block cache hit rate (higher cache hits = faster reads)
  • HDFS replication factor
  • Compaction strategy
  • Network latency
  • Number of RegionServers

Optimized Configuration:

  • Write Throughput: 50K-100K writes/sec per RegionServer (well-distributed row keys and tuned MemStore settings)
  • Read Throughput: 25K-50K reads/sec per RegionServer (high block cache hit rate and good row key design)
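
MemStore flush size is one of the main tuning levers mentioned above. A sketch of the relevant hbase-site.xml property (the value shown is illustrative, not a recommendation):

<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>268435456</value> <!-- 256 MB (default 128 MB): fewer, larger HFiles, at the cost of more heap -->
</property>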

Performance Optimization

Row Key Design

Avoid Hotspotting:

// Bad: Sequential IDs
Row Key: 1, 2, 3, 4, ...

// Good: Salted
Row Key: hash(user_id) + user_id

Time-Series Data:

// Bad: timestamp first (monotonically increasing keys hammer one region)
Row Key: timestamp + user_id

// Better: reversed timestamp (newest rows sort first)
Row Key: (Long.MAX_VALUE - timestamp) + user_id

Note: a reversed timestamp is still monotonic, so on its own it does not prevent hotspotting; combine it with a salt or an entity prefix to distribute writes.

Caching

Block Cache:

Get get = new Get(Bytes.toBytes("user123"));
get.setCacheBlocks(true);  // true is the default; set false for large one-off scans to avoid cache churn

Column Family Cache:

HColumnDescriptor family = new HColumnDescriptor("info");
family.setBlockCacheEnabled(true);

Bloom Filters

Enable Bloom Filter:

HColumnDescriptor family = new HColumnDescriptor("info");
family.setBloomFilterType(BloomType.ROW);

Trade-offs:

  • Reduces disk I/O by skipping HFiles that cannot contain the row
  • Faster point lookups (Gets)
  • Costs a small amount of extra memory per HFile

Compaction

Major Compaction:

hbase shell> major_compact 'users'

Compaction Types:

  • Minor: Merges several smaller HFiles into fewer, larger ones
  • Major: Rewrites all HFiles in a store into one, dropping deleted and expired cells
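
Automatic major compactions are controlled by hbase.hregion.majorcompaction; a common pattern is to disable the periodic schedule and trigger compactions off-peak instead (value shown is illustrative):

<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value> <!-- 0 disables periodic major compaction (default is 7 days); run it manually off-peak -->
</property>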

Schema Design

Time-Series Data

Schema:

Row Key: reverse(timestamp) + metric_id
Column Family: metrics
  Column Qualifier: value
  Column Qualifier: tags

Example:

// Row key for time-series: subtracting from Long.MAX_VALUE makes recent
// timestamps sort first lexicographically (metricId is assumed defined)
long timestamp = System.currentTimeMillis();
String rowKey = (Long.MAX_VALUE - timestamp) + "_" + metricId;

Wide Tables

Many Columns:

Row Key: user_id
Column Family: attributes
  Column Qualifier: attr_1, attr_2, ..., attr_N

Tall Tables

Many Rows:

Row Key: user_id + timestamp
Column Family: events
  Column Qualifier: event_type
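
A small sketch of building the composite key for this tall-table pattern; the separator and zero-padding are illustrative choices, with padding keeping lexicographic order aligned with numeric time order:

import java.nio.charset.StandardCharsets;

public class TallTableKeys {
    // userId first keeps one user's events contiguous; zero-padded epoch
    // millis keep them time-ordered within that user
    static byte[] eventKey(String userId, long epochMillis) {
        return String.format("%s_%013d", userId, epochMillis)
                     .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(new String(
            eventKey("user123", System.currentTimeMillis()),
            StandardCharsets.UTF_8));
    }
}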

Best Practices

1. Row Key Design

  • Avoid sequential keys
  • Use salting for distribution
  • Reverse timestamps for time-series
  • Keep keys short

2. Column Families

  • Limit to 2-3 families
  • Group related columns
  • Consider access patterns
  • Avoid too many families

3. Performance

  • Enable bloom filters
  • Use appropriate caching
  • Monitor compaction
  • Optimize scans

4. Monitoring

  • Monitor region distribution
  • Track read/write performance
  • Monitor HDFS usage
  • Watch for hotspots

What Interviewers Look For

NoSQL Database Understanding

  1. HBase Concepts
    • Understanding of column-family model
    • Row key design
    • Region architecture
    • Red Flags: No HBase understanding, wrong model, poor row keys
  2. Big Data Storage
    • HDFS integration
    • Scalability patterns
    • Performance optimization
    • Red Flags: No HDFS understanding, poor scalability, no optimization
  3. Schema Design
    • Column family design
    • Row key strategies
    • Access patterns
    • Red Flags: Poor schema, wrong row keys, no patterns

Problem-Solving Approach

  1. Row Key Design
    • Avoid hotspotting
    • Distribution strategies
    • Time-series patterns
    • Red Flags: Sequential keys, hotspotting, poor distribution
  2. Performance Optimization
    • Caching strategies
    • Bloom filters
    • Compaction tuning
    • Red Flags: No optimization, poor performance, no tuning

System Design Skills

  1. Big Data Architecture
    • HBase cluster design
    • Region distribution
    • HDFS integration
    • Red Flags: No architecture, poor design, no HDFS
  2. Scalability
    • Horizontal scaling
    • Region management
    • Performance tuning
    • Red Flags: No scaling, poor regions, no tuning

Communication Skills

  1. Clear Explanation
    • Explains HBase concepts
    • Discusses trade-offs
    • Justifies design decisions
    • Red Flags: Unclear explanations, no justification, confusing

Meta-Specific Focus

  1. Big Data Expertise
    • Understanding of big data storage
    • HBase mastery
    • Performance optimization
    • Key: Demonstrate big data expertise
  2. System Design Skills
    • Can design big data systems
    • Understands scalability challenges
    • Makes informed trade-offs
    • Key: Show practical big data design skills

Summary

HBase Key Points:

  • Column-Family Storage: Data organized in column families
  • HDFS Integration: Built on Hadoop Distributed File System
  • Random Access: Fast random read/write
  • Scalability: Handles billions of rows
  • Row Key Design: Critical for performance and distribution

Common Use Cases:

  • Time-series data
  • Sensor data
  • Log storage
  • User activity tracking
  • Real-time analytics
  • Big data storage

Best Practices:

  • Design row keys carefully
  • Limit column families (2-3)
  • Use salting for distribution
  • Enable bloom filters
  • Monitor region distribution
  • Optimize for access patterns

Apache HBase is a powerful NoSQL database for storing and accessing large-scale structured data with random access patterns.