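A valid UTF-8 encoding follows the rules below:
For a 1-byte character, the first bit of the byte is 0.
For an n-byte character (n from 2 to 4), the first n bits of the first byte are 1s and the next bit is 0; each of the remaining n - 1 bytes starts with 10.
This means the number of binary 1's at the beginning of the first byte determines the total number of bytes in the character. The rest of the bytes must start with 10.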
Given an array of integers, where each integer represents one byte, return whether it is a valid UTF-8 encoding.
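To make the leading-bits rule concrete, here is a minimal sketch that counts the 1 bits at the start of a byte and maps that count to a character length. The helper names `leadingOnes` and `sequenceLength` are hypothetical and are not part of the solution that follows; they only illustrate the rule.

```cpp
#include <cstdint>

// Hypothetical helper: count the consecutive 1 bits at the top of a byte.
int leadingOnes(std::uint8_t byte) {
    int count = 0;
    for (std::uint8_t mask = 0x80; mask != 0 && (byte & mask) != 0; mask >>= 1) {
        ++count;
    }
    return count;
}

// Hypothetical helper: length in bytes of the character that this byte
// would start, or 0 if the byte cannot start a character.
int sequenceLength(std::uint8_t byte) {
    switch (leadingOnes(byte)) {
        case 0: return 1;   // 0xxxxxxx: single-byte character
        case 2: return 2;   // 110xxxxx
        case 3: return 3;   // 1110xxxx
        case 4: return 4;   // 11110xxx
        default: return 0;  // 10xxxxxx is a continuation byte; 5+ ones is invalid
    }
}
```

For example, 197 is 11000101 in binary, so it has two leading 1s and starts a 2-byte character, while 130 is 10000010 and can only appear as a continuation byte.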
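To validate the array, check two things for each character. The first byte must start with 0 (indicating a single-byte character) or 110, 1110, or 11110 (indicating multi-byte characters). Every remaining byte of a multi-byte character must start with 10.
Here is the C++ implementation: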
#include <vector>

bool validUtf8(std::vector<int>& data) {
    int n = data.size();
    int i = 0;
    while (i < n) {
        // Determine the character length from the leading bits of the first byte.
        int byteCount = 0;
        if ((data[i] & 0x80) == 0) {            // 0xxxxxxx: 1-byte character
            byteCount = 1;
        } else if ((data[i] & 0xE0) == 0xC0) {  // 110xxxxx: 2-byte character
            byteCount = 2;
        } else if ((data[i] & 0xF0) == 0xE0) {  // 1110xxxx: 3-byte character
            byteCount = 3;
        } else if ((data[i] & 0xF8) == 0xF0) {  // 11110xxx: 4-byte character
            byteCount = 4;
        } else {
            return false;  // not a valid starting byte
        }
        if (i + byteCount > n) {
            return false;  // not enough bytes left for this character
        }
        // Every remaining byte of the character must look like 10xxxxxx.
        for (int j = 1; j < byteCount; ++j) {
            if ((data[i + j] & 0xC0) != 0x80) {
                return false;  // does not start with '10'
            }
        }
        i += byteCount;
    }
    return true;
}
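The solution makes a single pass over the array, so it runs in O(n) time with O(1) extra space. For each character it reads the leading byte to determine the length, then checks that the following bytes are continuation bytes as the UTF-8 encoding rules require.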
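As a variation on the same idea, the check can also be written as a countdown of continuation bytes still owed, which avoids the explicit bounds check. This is only a sketch under stated assumptions: the name `validUtf8Countdown` is hypothetical, and masking each integer to its low 8 bits is an assumption about the input, not something the problem statement above spells out.

```cpp
#include <vector>

// Hypothetical alternative: track how many continuation bytes are still expected.
bool validUtf8Countdown(const std::vector<int>& data) {
    int remaining = 0;  // continuation bytes still owed by the current character
    for (int value : data) {
        int byte = value & 0xFF;  // assumption: only the low 8 bits carry data
        if (remaining > 0) {
            if ((byte & 0xC0) != 0x80) {
                return false;  // expected a 10xxxxxx continuation byte
            }
            --remaining;
        } else if ((byte & 0x80) == 0x00) {
            remaining = 0;  // 0xxxxxxx: single-byte character
        } else if ((byte & 0xE0) == 0xC0) {
            remaining = 1;  // 110xxxxx: one continuation byte follows
        } else if ((byte & 0xF0) == 0xE0) {
            remaining = 2;  // 1110xxxx: two continuation bytes follow
        } else if ((byte & 0xF8) == 0xF0) {
            remaining = 3;  // 11110xxx: three continuation bytes follow
        } else {
            return false;  // not a valid starting byte
        }
    }
    return remaining == 0;  // a truncated final character is invalid
}
```

With this sketch, the input {197, 130, 1} is accepted (a 2-byte character followed by a 1-byte character), while {235, 140, 4} is rejected because the third byte does not start with 10. Like the implementation above, it checks only the bit patterns described in the rules; it does not reject overlong encodings or code points outside the Unicode range.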