Introduction
Apache Parquet is a columnar storage file format designed for efficient data storage and analytics. Because data is laid out column by column rather than row by row, it achieves high compression ratios and fast query performance for analytical workloads. Understanding Parquet is essential for system design interviews involving data storage and analytics.
This guide covers:
- Parquet Fundamentals: Columnar storage, row groups, and pages
- Schema: Nested data structures and type system
- Compression: Codecs and optimization
- Schema Evolution: Adding and removing columns
- Best Practices: File organization, partitioning, and performance
What is Apache Parquet?
Apache Parquet is an open columnar storage format whose key characteristics are:
- Columnar Storage: Data is stored column by column, so queries read only the columns they need
- Compression: Similar values are stored together, which yields high compression ratios
- Schema Evolution: Columns can be added or removed over time without rewriting existing files
- Nested Data: Complex nested structures (structs, lists, maps) are supported
- Cross-Language: Implementations exist for Java, C++, Python, and many other languages
Key Concepts
File: The top-level container; a Parquet file holds one or more row groups plus footer metadata
Row Group: A horizontal partition of the data (a contiguous set of rows)
Column Chunk: The data for a single column within one row group
Page: The smallest unit of encoding and compression within a column chunk
Schema: The definition of columns, their types, and how they nest
Architecture
High-Level Architecture
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    Data     │     │    Data     │     │    Data     │
│   Source    │     │   Source    │     │   Source    │
│ (Database)  │     │   (Files)   │     │  (Stream)   │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                           │
                           │ Write Data
                           ▼
              ┌─────────────────────────┐
              │     Parquet Writer      │
              │                         │
              │     ┌────────────┐      │
              │     │   Schema   │      │
              │     │ Definition │      │
              │     └─────┬──────┘      │
              │           │             │
              │     ┌─────┴──────┐      │
              │     │  Columnar  │      │
              │     │  Encoding  │      │
              │     └────────────┘      │
              │                         │
              │  ┌───────────────────┐  │
              │  │   Parquet File    │  │
              │  │     (Storage)     │  │
              │  └───────────────────┘  │
              └────────────┬────────────┘
                           │
                           │ Read Data
                           │
             ┌─────────────┴─────────────┐
             │                           │
      ┌──────▼──────┐             ┌──────▼───────┐
      │  Analytics  │             │     Data     │
      │   Engine    │             │  Warehouse   │
      │   (Spark)   │             │  (S3/HDFS)   │
      └─────────────┘             └──────────────┘
Explanation:
- Data Sources: Systems that produce data (e.g., databases, file systems, data streams).
- Parquet Writer: Converts data into Parquet columnar format with schema definition and columnar encoding.
- Schema Definition: Defines the structure of data (columns, types, nested structures).
- Columnar Encoding: Stores data column-by-column for efficient compression and querying.
- Parquet File (Storage): Binary file format stored on disk or object storage (S3, HDFS).
- Analytics Engine: Systems that read Parquet files for analytics (e.g., Spark, Presto, Hive).
- Data Warehouse: Storage systems that store Parquet files (e.g., S3, HDFS, data lakes).
Core Architecture
┌────────────────────────────────────────────────────────┐
│                      Parquet File                      │
│                                                        │
│  ┌──────────────────────────────────────────────────┐  │
│  │                   Row Group 1                    │  │
│  │  ┌──────────┐     ┌──────────┐     ┌──────────┐  │  │
│  │  │ Column 1 │     │ Column 2 │     │ Column 3 │  │  │
│  │  └──────────┘     └──────────┘     └──────────┘  │  │
│  └──────────────────────────────────────────────────┘  │
│                                                        │
│  ┌──────────────────────────────────────────────────┐  │
│  │                   Row Group 2                    │  │
│  │  ┌──────────┐     ┌──────────┐     ┌──────────┐  │  │
│  │  │ Column 1 │     │ Column 2 │     │ Column 3 │  │  │
│  │  └──────────┘     └──────────┘     └──────────┘  │  │
│  └──────────────────────────────────────────────────┘  │
│                                                        │
│  ┌──────────────────────────────────────────────────┐  │
│  │              Footer (File Metadata)              │  │
│  │     (Schema, Row Group Offsets, Statistics)      │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘
Note that in the physical layout the row groups come first and the file metadata lives in the footer; readers parse the footer to locate row groups and use its statistics to skip data that cannot match a query.
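This layout can be inspected without reading the data itself, since a reader only needs the footer to discover row groups and column chunks. A minimal PyArrow sketch (assuming a local file such as the data.parquet written in the examples below):
import pyarrow.parquet as pq

# Opening the file only parses the footer metadata; no data pages are read yet
parquet_file = pq.ParquetFile('data.parquet')
metadata = parquet_file.metadata

print(metadata.num_rows, metadata.num_row_groups)  # total rows and row group count
print(parquet_file.schema_arrow)                   # schema stored in the footer

# Each row group exposes per-column-chunk details such as codec and statistics
row_group = metadata.row_group(0)
for i in range(row_group.num_columns):
    column = row_group.column(i)
    print(column.path_in_schema, column.compression, column.statistics)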
Schema Definition
Simple Schema
Python:
import pyarrow as pa
schema = pa.schema([
    ('id', pa.int64()),
    ('name', pa.string()),
    ('age', pa.int32()),
    ('salary', pa.float64())
])
Java:
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.Type;

MessageType schema = new MessageType(
    "schema",
    new PrimitiveType(Type.Repetition.REQUIRED, PrimitiveType.PrimitiveTypeName.INT64, "id"),
    new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveType.PrimitiveTypeName.BINARY, "name"),
    new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveType.PrimitiveTypeName.INT32, "age"),
    new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveType.PrimitiveTypeName.DOUBLE, "salary")
);
Nested Schema
Python:
schema = pa.schema([
    ('id', pa.int64()),
    ('name', pa.string()),
    ('address', pa.struct([
        ('street', pa.string()),
        ('city', pa.string()),
        ('zip', pa.string())
    ])),
    ('tags', pa.list_(pa.string()))
])
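As an illustration of how such nested data round-trips, here is a short sketch using the nested schema above (the row values are made up):
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    'id': [1],
    'name': ['John'],
    'address': [{'street': '1 Main St', 'city': 'Springfield', 'zip': '01101'}],
    'tags': [['engineer', 'admin']]
}, schema=schema)

pq.write_table(table, 'nested.parquet')

# Reading back preserves the nested structure
result = pq.read_table('nested.parquet')
print(result.column('address')[0])  # struct value with street/city/zip fields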
Writing Parquet Files
Python (PyArrow)
import pyarrow as pa
import pyarrow.parquet as pq
# Create table
data = {
    'id': [1, 2, 3],
    'name': ['John', 'Jane', 'Bob'],
    'age': [30, 25, 35],
    'salary': [50000.0, 60000.0, 70000.0]
}
table = pa.Table.from_pydict(data)
# Write to Parquet
pq.write_table(table, 'data.parquet', compression='snappy')
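For datasets that do not fit comfortably in memory, a pq.ParquetWriter can append record batches incrementally instead of building one large table; each write typically produces its own row group. A sketch (the batch loop and values are hypothetical):
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([('id', pa.int64()), ('name', pa.string())])

with pq.ParquetWriter('data_large.parquet', schema, compression='snappy') as writer:
    for start in range(0, 1_000_000, 100_000):  # ten hypothetical batches
        batch = pa.record_batch({
            'id': list(range(start, start + 100_000)),
            'name': ['user'] * 100_000
        }, schema=schema)
        writer.write_batch(batch)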
Java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

ParquetWriter<Group> writer = ExampleParquetWriter.builder(new Path("data.parquet"))
    .withCompressionCodec(CompressionCodecName.SNAPPY)
    .withType(schema)  // the MessageType defined above
    .build();

SimpleGroupFactory factory = new SimpleGroupFactory(schema);
Group group = factory.newGroup()
    .append("id", 1L)
    .append("name", "John")
    .append("age", 30)
    .append("salary", 50000.0);

writer.write(group);
writer.close();
Reading Parquet Files
Python (PyArrow)
import pyarrow.parquet as pq
# Read entire file
table = pq.read_table('data.parquet')
# Read specific columns
table = pq.read_table('data.parquet', columns=['id', 'name'])
# Read with filters
table = pq.read_table('data.parquet',
                      filters=[('age', '>', 25)])
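Large files can also be read incrementally rather than as one table, which keeps memory bounded. A minimal sketch:
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('data.parquet')

# Read one row group at a time, projecting only the needed columns
for i in range(parquet_file.num_row_groups):
    row_group = parquet_file.read_row_group(i, columns=['id', 'name'])
    # ... process row_group

# Or stream fixed-size record batches
for batch in parquet_file.iter_batches(batch_size=65_536, columns=['id', 'name']):
    pass  # ... process batch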
Java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

ParquetReader<Group> reader = ParquetReader.builder(new GroupReadSupport(), new Path("data.parquet"))
    .build();

Group group;
while ((group = reader.read()) != null) {
    long id = group.getLong("id", 0);
    String name = group.getString("name", 0);
    // Process data
}
reader.close();
Compression
Compression Codecs
Snappy:
pq.write_table(table, 'data.parquet', compression='snappy')
Gzip:
pq.write_table(table, 'data.parquet', compression='gzip')
Zstd:
pq.write_table(table, 'data.parquet', compression='zstd')
LZ4:
pq.write_table(table, 'data.parquet', compression='lz4')
Compression Comparison
- Snappy: Fast compression, moderate ratio
- Gzip: Slower compression, better ratio
- Zstd: Balanced compression and speed
- LZ4: Very fast compression, lower ratio (see the measurement sketch below)
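These trade-offs are easy to measure on your own data. A rough sketch that reuses the table from the writing example and compares resulting file sizes (codec availability depends on how PyArrow was built; timing each write would extend the comparison):
import os
import pyarrow.parquet as pq

for codec in ['none', 'snappy', 'gzip', 'zstd', 'lz4']:
    path = f'data_{codec}.parquet'
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), 'bytes')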
Schema Evolution
Adding Columns
Read with New Schema:
# Original schema
original_schema = pa.schema([
    ('id', pa.int64()),
    ('name', pa.string())
])

# New schema with an additional column
new_schema = pa.schema([
    ('id', pa.int64()),
    ('name', pa.string()),
    ('email', pa.string())  # New column
])
# Read with new schema (email will be null)
table = pq.read_table('data.parquet', schema=new_schema)
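When files written with the old and new schemas coexist, the dataset API can apply one unified schema across all of them, filling missing columns with nulls (a sketch; old.parquet and new.parquet are hypothetical files written with the two schemas above):
import pyarrow.dataset as ds

dataset = ds.dataset(['old.parquet', 'new.parquet'], format='parquet', schema=new_schema)
table = dataset.to_table()  # 'email' is null for rows that came from old.parquet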
Removing Columns
Read Subset:
# Read only specific columns
table = pq.read_table('data.parquet', columns=['id', 'name'])
Partitioning
Directory Partitioning
Hive-Style Partitioning:
data/
  year=2024/
    month=01/
      data.parquet
    month=02/
      data.parquet
  year=2023/
    month=12/
      data.parquet
Read Partitioned Data:
dataset = pq.ParquetDataset('data/',
                            filters=[('year', '=', 2024), ('month', '=', 1)])
table = dataset.read()
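Writing the layout is symmetric: pq.write_to_dataset can produce Hive-style directories from partition columns. A sketch, assuming the table contains year and month columns:
import pyarrow.parquet as pq

pq.write_to_dataset(table, root_path='data/', partition_cols=['year', 'month'])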
Best Practices
1. File Organization
- Use an appropriate row group size (128MB-1GB; see the sketch after this list)
- Partition by query patterns
- Co-locate related columns
- Use consistent schemas
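With PyArrow, row group size is controlled at write time via row_group_size (a row count, not bytes), so it has to be tuned against the average encoded row width to land in the desired byte range. A sketch:
import pyarrow.parquet as pq

# ~1M rows per row group; adjust based on row width so groups end up near 128MB-1GB
pq.write_table(table, 'data.parquet', row_group_size=1_000_000)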
2. Compression
- Use Snappy for balanced performance
- Use Gzip for better compression
- Test compression ratios
- Consider query patterns
3. Schema Design
- Use appropriate data types
- Minimize nested structures when possible
- Plan for schema evolution
- Document schema changes
4. Performance
- Use column projection
- Filter at file level
- Optimize row group size
- Monitor file sizes
What Interviewers Look For
Data Storage Understanding
- Parquet Concepts
  - Understanding of columnar storage
  - Compression strategies
  - Schema evolution
  - Red Flags: No Parquet understanding, wrong storage, no schema evolution
- Storage Optimization
  - Compression selection
  - File organization
  - Partitioning strategies
  - Red Flags: No compression, poor organization, no partitioning
- Performance
  - Column projection
  - Filter pushdown
  - File size optimization
  - Red Flags: No optimization, poor performance, no filtering
Problem-Solving Approach
- Storage Design
  - File organization
  - Compression strategy
  - Partitioning design
  - Red Flags: Poor organization, wrong compression, no partitioning
- Schema Design
  - Data type selection
  - Schema evolution planning
  - Nested structure design
  - Red Flags: Wrong types, no evolution, poor nesting
System Design Skills
- Data Storage Architecture
  - Parquet file organization
  - Compression strategy
  - Partitioning design
  - Red Flags: No architecture, poor compression, no partitioning
- Performance Optimization
  - Column projection
  - Filter optimization
  - File size management
  - Red Flags: No optimization, poor performance, no management
Communication Skills
- Clear Explanation
  - Explains Parquet concepts
  - Discusses trade-offs
  - Justifies design decisions
  - Red Flags: Unclear explanations, no justification, confusing
Meta-Specific Focus
- Data Storage Expertise
  - Understanding of columnar storage
  - Parquet mastery
  - Compression optimization
  - Key: Demonstrate data storage expertise
- System Design Skills
  - Can design storage systems
  - Understands storage challenges
  - Makes informed trade-offs
  - Key: Show practical storage design skills
Summary
Apache Parquet Key Points:
- Columnar Storage: Optimized for analytical queries
- High Compression: Efficient data storage
- Schema Evolution: Supports schema changes
- Nested Data: Complex data structures
- Cross-Language: Works across languages
Common Use Cases:
- Data warehousing
- Analytics storage
- Big data processing
- Data lake storage
- ETL pipelines
- Query optimization
Best Practices:
- Use appropriate compression
- Partition by query patterns
- Optimize row group size
- Use column projection
- Filter at file level
- Plan for schema evolution
- Monitor file sizes
Apache Parquet is a powerful columnar storage format optimized for analytical workloads and efficient data storage.