393. UTF-8 Validation
A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:
- For a 1-byte character, the first bit is a 0, followed by its unicode code.
- For an n-bytes character (n > 1), the first n bits are all 1’s, the n+1 bit is 0, followed by n-1 bytes with the most significant 2 bits being 10.
This is how the UTF-8 encoding would work:
- 1-byte character: 0xxxxxxx
- 2-byte character: 110xxxxx 10xxxxxx
- 3-byte character: 1110xxxx 10xxxxxx 10xxxxxx
- 4-byte character: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Given an array of integers representing the data, return whether it is a valid UTF-8 encoding.
Clarifying Questions
- Input and Output:
- Input: A list of integers.
- Output: Boolean value indicating whether the given data represents a valid UTF-8 encoding.
- Value Range:
- Each integer in the input list can range from 0 to 255, representing one byte.
- Edge Cases:
- Input list length is zero.
- Input list contains an improperly terminated multi-byte character.
Strategy
- Initial Checks:
- If data length is 0, return True, as an empty string is trivially valid.
- Byte Classification:
- For each byte, classify it:
- 1-byte character: first bit
0
.
- 2-byte character: starts with
110
.
- 3-byte character: starts with
1110
.
- 4-byte character: starts with
11110
.
- Validate Subsequent Bytes:
- For multi-byte characters, ensure that the subsequent bytes start with
10
.
- Iterative Check:
- Iterate through bytes, deducing the number of required subsequent bytes for each multi-byte character and validate all necessary conditions.
Code
from typing import List
def validUtf8(data: List[int]) -> bool:
def is_valid_byte(byte):
return (byte & 0b11000000) == 0b10000000
i = 0
while i < len(data):
byte = data[i]
if (byte & 0b10000000) == 0: # 1-byte character
i += 1
continue
len_prefix = 0
mask = 0b10000000
while (byte & mask) != 0:
len_prefix += 1
mask >>= 1
if len_prefix == 1 or len_prefix > 4: # Invalid prefix length (<= 4 bytes allowed)
return False
for j in range(1, len_prefix):
if i + j >= len(data) or not is_valid_byte(data[i + j]):
return False
i += len_prefix
return True
Time Complexity
- O(n): We are iterating through the list of bytes once, making the solution linear in terms of data length.
Explanation
- is_valid_byte Function: This helper function checks if a byte starts with
10
, denoting it’s a valid continuation byte.
- Iteration:
- For each byte in the data:
- If the byte starts with
0
, it’s a single byte character.
- If it starts with
1
, count the 1
s at the start to determine the length of the multi-byte character.
- Validate the following bytes.
- Edge Cases: Handle cases where multi-byte characters are not properly terminated or overflow the list.
By following this structured approach, the solution ensures that we only validate valid UTF-8 sequences efficiently.
-
Got blindsided by a question you didn’t expect?
-
Spend too much time studying?
-
Or simply don’t have the time to go over all 3000 questions?