393. UTF-8 Validation
393. UTF-8 Validation
Problem Statement
Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e., it translates to a sequence of valid UTF-8 encoded characters).
A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:
- For a 1-byte character, the first bit is a
0, followed by its Unicode code. - For an n-byte character, the first
nbits are all one’s, then+1bit is0, followed byn-1bytes with the most significant 2 bits being10.
This is how the UTF-8 encoding would work:
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.
Examples
Example 1:
Input: data = [197,130,1]
Output: true
Explanation: data represents the octet sequence: 11000101 10000010 00000001.
It is a valid utf-8 encoding for a 2-byte character followed by a 1-byte character.
Example 2:
Input: data = [235,140,4]
Output: false
Explanation: data represented the octet sequence: 11101011 10001100 00000100.
The first 3 bits are all one's and the 4th bit is 0 means it is a 3-byte character.
The next byte is a continuation byte which starts with 10 and that's correct.
But the second continuation byte does not start with 10, so it is invalid.
Constraints
1 <= data.length <= 2 * 10^40 <= data[i] <= 255
Clarification Questions
Before diving into the solution, here are 5 important clarifications and assumptions to discuss during an interview:
-
UTF-8 encoding: What is UTF-8 encoding? (Assumption: Variable-length encoding - 1 to 4 bytes per character, first byte indicates length)
-
Byte format: How are bytes represented? (Assumption: Each integer represents one byte - only least significant 8 bits used)
-
Validation rules: What makes encoding valid? (Assumption: Follow UTF-8 rules - proper byte sequences, correct continuation bytes)
-
Return value: What should we return? (Assumption: Boolean - true if valid UTF-8 encoding, false otherwise)
-
Incomplete sequences: What if sequence is incomplete? (Assumption: Invalid - all bytes for multi-byte character must be present)
Interview Deduction Process (20 minutes)
Step 1: Brute-Force Approach (5 minutes)
Process the array sequentially. For each byte, check if it starts with 0 (1-byte), 110 (2-byte), 1110 (3-byte), or 11110 (4-byte) by converting to binary string and checking prefixes. Then validate that the following bytes start with 10. This approach works but string operations are inefficient and error-prone.
Step 2: Semi-Optimized Approach (7 minutes)
Use bit manipulation with hardcoded checks: for each byte, check specific bit patterns using bitwise operations. Check if byte matches 0xxxxxxx, 110xxxxx, 1110xxxx, or 11110xxx by comparing with masks. Then validate continuation bytes by checking if they match 10xxxxxx. This is more efficient but requires careful handling of bit masks and edge cases.
Step 3: Optimized Solution (8 minutes)
Use bit masks and helper functions: define MASK1 = 0x80 (checks bit 7) and MASK2 = 0xC0 (checks bits 7-6). Create getBytes() function that counts leading 1s to determine character length (1-4 bytes). Create isValid() function that checks if a byte is a valid continuation byte (must start with 10). Process the array sequentially: for each character, get its byte count, ensure enough bytes are available, validate continuation bytes, and advance the pointer. This achieves O(n) time with O(1) space, which is optimal. The key insight is using bit masks to efficiently check UTF-8 patterns without string conversions.
Solution Approach
This problem requires validating UTF-8 encoding by checking bit patterns. UTF-8 has specific rules for how bytes are structured, and we need to verify the sequence follows these rules.
Key Insights:
- UTF-8 Byte Patterns:
- 1-byte:
0xxxxxxx(starts with 0) - 2-byte:
110xxxxx 10xxxxxx - 3-byte:
1110xxxx 10xxxxxx 10xxxxxx - 4-byte:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
- 1-byte:
- Bit Masks: Use bit masks to check byte patterns
- Continuation Bytes: All bytes after the first must start with
10 - Length Validation: Ensure we have enough bytes for multi-byte characters
Algorithm:
- Check first byte: Determine number of bytes in character
- Validate continuation: Check that following bytes start with
10 - Length check: Ensure enough bytes available
- Process: Move to next character
Solution
Solution: Bit Manipulation with Masks
class Solution {
private:
static const int MASK1 = 1 << 7;
static const int MASK2 = MASK1 + (1 << 6);
bool isValid(int num) {
return (num & MASK2) == MASK1;
}
int getBytes(int num) {
if((num & MASK1) == 0) return 1;
int n = 0, mask = MASK1;
while ((num & mask) != 0) {
n++;
if(n > 4) return -1;
mask >>= 1;
}
return n >=2 ? n : -1;
}
public:
bool validUtf8(vector<int>& data) {
const int M = data.size();
int idx = 0;
while(idx < M) {
int num = data[idx];
int n = getBytes(num);
if(n < 0 || idx + n > M) return false;
for(int i = 1; i < n; i++) {
if(!isValid(data[idx + i])) return false;
}
idx += n;
}
return true;
}
};
Algorithm Explanation:
- Bit Masks (Lines 3-4):
MASK1 = 1 << 7 = 10000000(0x80): Checks if bit 7 is setMASK2 = MASK1 + (1 << 6) = 11000000(0xC0): Checks bits 7 and 6
- isValid Helper (Lines 6-8):
- Checks if a byte is a valid continuation byte
- Continuation bytes must start with
10(bits 7=1, 6=0) (num & MASK2) == MASK1checks: bit 7=1 AND bit 6=0
- getBytes Helper (Lines 10-20):
- Determines number of bytes in UTF-8 character
- 1-byte check: If
(num & MASK1) == 0, return 1 - Multi-byte: Count leading 1s
- Start with
mask = MASK1(bit 7) - Count consecutive 1s from left
- If count > 4: invalid (return -1)
- If count < 2: invalid (return -1)
- Return count (2, 3, or 4)
- Start with
- Main Validation (Lines 21-33):
- Process data array sequentially
- For each character:
- Get byte count:
n = getBytes(data[idx]) - Check validity:
n < 0or not enough bytes → return false - Validate continuation bytes: Check bytes
idx+1toidx+n-1 - Move to next character:
idx += n
- Get byte count:
Example Walkthrough:
Example 1: data = [197,130,1]
Step 1: Process first byte (197)
197 in binary: 11000101
getBytes(197):
(197 & MASK1) = (197 & 128) = 128 ≠ 0 → not 1-byte
Count leading 1s:
mask = 128 (10000000)
197 & 128 = 128 ≠ 0 → n=1, mask = 64
197 & 64 = 64 ≠ 0 → n=2, mask = 32
197 & 32 = 0 → stop
n = 2 → 2-byte character
Check: idx + 2 = 0 + 2 = 2 <= 3 ✓
Step 2: Validate continuation byte
data[1] = 130
130 in binary: 10000010
isValid(130):
(130 & MASK2) = (130 & 192) = 128
MASK1 = 128
128 == 128 ✓ → valid continuation byte
Step 3: Process next character (1)
1 in binary: 00000001
getBytes(1):
(1 & MASK1) = (1 & 128) = 0 → 1-byte character ✓
Result: true
Example 2: data = [235,140,4]
Step 1: Process first byte (235)
235 in binary: 11101011
getBytes(235):
Count leading 1s:
235 & 128 = 128 ≠ 0 → n=1
235 & 64 = 64 ≠ 0 → n=2
235 & 32 = 32 ≠ 0 → n=3
235 & 16 = 0 → stop
n = 3 → 3-byte character
Check: idx + 3 = 0 + 3 = 3 <= 3 ✓
Step 2: Validate first continuation byte
data[1] = 140
140 in binary: 10001100
isValid(140):
(140 & 192) = 128 == 128 ✓ → valid
Step 3: Validate second continuation byte
data[2] = 4
4 in binary: 00000100
isValid(4):
(4 & 192) = 0 ≠ 128 ✗ → invalid continuation byte
Result: false
Algorithm Breakdown
Key Insight: UTF-8 Bit Patterns
UTF-8 encoding uses specific bit patterns:
1-byte character:
0xxxxxxx
Bit 7 = 0
2-byte character:
110xxxxx 10xxxxxx
Bits 7-5 = 110, then continuation bytes with 10
3-byte character:
1110xxxx 10xxxxxx 10xxxxxx
Bits 7-4 = 1110, then continuation bytes with 10
4-byte character:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Bits 7-3 = 11110, then continuation bytes with 10
Bit Mask Operations
MASK1 (0x80 = 10000000):
- Checks if most significant bit is set
- Used to detect 1-byte vs multi-byte characters
MASK2 (0xC0 = 11000000):
- Checks bits 7 and 6 together
- Used to validate continuation bytes (must be
10xxxxxx)
getBytes Logic:
- Counts leading 1s from left
- Stops when encountering 0
- Validates count is 2-4
Validation Process
- First byte: Determine character length
- Length check: Ensure enough bytes available
- Continuation bytes: All following bytes must start with
10 - Move pointer: Advance by character length
Complexity Analysis
Time Complexity: O(n)
- Single pass: Process each byte exactly once
- Bit operations: O(1) per byte
- Total: O(n) where n = data.length
Space Complexity: O(1)
- Variables: Only use constant space
- No extra arrays: Process in-place
- Total: O(1)
Key Points
- Bit Masks: Use masks to check bit patterns efficiently
- Leading 1s: Count leading 1s to determine byte count
- Continuation Bytes: All continuation bytes must start with
10 - Length Validation: Ensure enough bytes for multi-byte characters
- Single Pass: Process array once, O(n) time
UTF-8 Encoding Rules
Byte Patterns
| Bytes | Pattern | Binary Format |
|---|---|---|
| 1 | ASCII | 0xxxxxxx |
| 2 | 2-byte | 110xxxxx 10xxxxxx |
| 3 | 3-byte | 1110xxxx 10xxxxxx 10xxxxxx |
| 4 | 4-byte | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
Validation Rules
- First byte: Must match one of the patterns above
- Continuation bytes: Must start with
10(bits 7=1, 6=0) - Length: Must have exactly
nbytes forn-byte character - Range: Character length must be 1-4 bytes
Alternative Approaches
Approach 1: Bit Masks (Current Solution)
- Time: O(n)
- Space: O(1)
- Best for: Clear and efficient solution
Approach 2: Right Shift Comparison
- Time: O(n)
- Space: O(1)
- Use right shift: Compare with patterns like
0b110,0b1110 - Similar performance: Different implementation style
Approach 3: State Machine
- Time: O(n)
- Space: O(1)
- State tracking: Track expected continuation bytes
- More complex: But can be more readable
Detailed Example Walkthrough
Example: data = [240,162,138,147] (4-byte character)
Step 1: Process first byte (240)
240 in binary: 11110000
getBytes(240):
Count leading 1s:
240 & 128 = 128 ≠ 0 → n=1
240 & 64 = 64 ≠ 0 → n=2
240 & 32 = 32 ≠ 0 → n=3
240 & 16 = 16 ≠ 0 → n=4
240 & 8 = 0 → stop
n = 4 → 4-byte character
Check: idx + 4 = 4 <= 4 ✓
Step 2: Validate continuation bytes
data[1] = 162: (162 & 192) == 128 ✓
data[2] = 138: (138 & 192) == 128 ✓
data[3] = 147: (147 & 192) == 128 ✓
All continuation bytes valid → true
Example: data = [255] (Invalid)
Process first byte (255)
255 in binary: 11111111
getBytes(255):
Count leading 1s:
255 & 128 = 128 ≠ 0 → n=1
255 & 64 = 64 ≠ 0 → n=2
255 & 32 = 32 ≠ 0 → n=3
255 & 16 = 16 ≠ 0 → n=4
255 & 8 = 8 ≠ 0 → n=5
n > 4 → return -1
Result: false (invalid pattern)
Edge Cases
- Single byte: Array with one 1-byte character
- All 1-bytes: Array of ASCII characters
- Invalid pattern: Byte starting with
10(continuation byte as first) - Too many leading 1s: More than 4 leading 1s
- Insufficient bytes: Multi-byte character without enough bytes
Bit Manipulation Tips
Common Bit Operations
// Check if bit 7 is set
(num & 0x80) != 0
// Check if bits 7-6 are "10"
(num & 0xC0) == 0x80
// Count leading 1s
int count = 0;
int mask = 0x80;
while ((num & mask) != 0 && count < 4) {
count++;
mask >>= 1;
}
Related Problems
- 393. UTF-8 Validation - Current problem
- 191. Number of 1 Bits - Bit manipulation
- 190. Reverse Bits - Bit manipulation
- 136. Single Number - Bit manipulation
Tags
Bit Manipulation, Array, String, Medium