algoadvance

Problem Statement

A valid UTF-8 encoding follows the rules below:

For a 1-byte character, the first bit is a 0, followed by its unicode code.
For an n-byte character (n > 1), the first n bits are all one’s, the n+1 bit is 0, followed by n-1 bytes with the most significant 2 bits being 10.

This means the number of binary 1’s at the beginning of the first byte determines the total number of bytes in the character. The rest of the bytes must start with 10.

Given an array of integers, where each integer represents one byte, return whether it is a valid UTF-8 encoding.

Clarifying Questions

Are all input values between 0 and 255?
- Yes, since each element in the array represents a byte.
How long can the input array be?
- The length can be up to (10^5).
Is the input guaranteed to be non-empty?
- Yes, we will assume the input is non-empty based on the problem description.

Strategy

Reading Byte Indicators:
- Read the first byte to determine the number of bytes in the current UTF-8 character.
Check Following Bytes:
- Ensure that subsequent bytes start with ‘10’.
Validation:
- The first byte can start with 0 (indicating a single-byte character) or 110, 1110, 11110 (indicating multi-byte characters).
- For multi-byte characters, ensure subsequent bytes start with 10.
Iterate Through Array:
- Continue this process until the entire array has been validated or an invalid character sequence is discovered.

Code

Here is the C++ implementation:

#include <vector>

bool validUtf8(std::vector<int>& data) {
    int n = data.size();
    int i = 0;

    while (i < n) {
        int byteCount = 0;
        
        if ((data[i] & 0x80) == 0) { // 1-byte character
            byteCount = 1;
        } else if ((data[i] & 0xE0) == 0xC0) { // 2-byte character
            byteCount = 2;
        } else if ((data[i] & 0xF0) == 0xE0) { // 3-byte character
            byteCount = 3;
        } else if ((data[i] & 0xF8) == 0xF0) { // 4-byte character
            byteCount = 4;
        } else {
            return false; // not a valid starting byte
        }
        
        if (i + byteCount > n) {
            return false; // Not enough bytes
        }

        for (int j = 1; j < byteCount; ++j) {
            if ((data[i + j] & 0xC0) != 0x80) {
                return false; // does not start with '10'
            }
        }

        i += byteCount;
    }

    return true;
}

Time Complexity

The algorithm processes each byte exactly once.
Hence the time complexity is (O(n)), where (n) is the length of the data array.

The solution efficiently checks the UTF-8 validity by iterating through the array and validating each character’s format according to the UTF-8 encoding rules.

Cut your prep time in half and DOMINATE your interview with AlgoAdvance AI

Got blindsided by a question you didn’t expect?
Spend too much time studying?
Or simply don’t have the time to go over all 3000 questions?