algoadvance

393. UTF-8 Validation

A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:

For a 1-byte character, the first bit is a 0, followed by its unicode code.
For an n-bytes character (n > 1), the first n bits are all 1’s, the n+1 bit is 0, followed by n-1 bytes with the most significant 2 bits being 10.

This is how the UTF-8 encoding would work:

1-byte character: 0xxxxxxx
2-byte character: 110xxxxx 10xxxxxx
3-byte character: 1110xxxx 10xxxxxx 10xxxxxx
4-byte character: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Given an array of integers representing the data, return whether it is a valid UTF-8 encoding.

Clarifying Questions

Input and Output:
- Input: A list of integers.
- Output: Boolean value indicating whether the given data represents a valid UTF-8 encoding.
Value Range:
- Each integer in the input list can range from 0 to 255, representing one byte.
Edge Cases:
- Input list length is zero.
- Input list contains an improperly terminated multi-byte character.

Strategy

Initial Checks:
- If data length is 0, return True, as an empty string is trivially valid.
Byte Classification:
- For each byte, classify it:
  - 1-byte character: first bit 0.
  - 2-byte character: starts with 110.
  - 3-byte character: starts with 1110.
  - 4-byte character: starts with 11110.
Validate Subsequent Bytes:
- For multi-byte characters, ensure that the subsequent bytes start with 10.
Iterative Check:
- Iterate through bytes, deducing the number of required subsequent bytes for each multi-byte character and validate all necessary conditions.

Code

from typing import List

def validUtf8(data: List[int]) -> bool:
    def is_valid_byte(byte):
        return (byte & 0b11000000) == 0b10000000

    i = 0
    while i < len(data):
        byte = data[i]
        if (byte & 0b10000000) == 0:  # 1-byte character
            i += 1
            continue

        len_prefix = 0
        mask = 0b10000000
        while (byte & mask) != 0:
            len_prefix += 1
            mask >>= 1

        if len_prefix == 1 or len_prefix > 4:  # Invalid prefix length (<= 4 bytes allowed)
            return False

        for j in range(1, len_prefix):
            if i + j >= len(data) or not is_valid_byte(data[i + j]):
                return False

        i += len_prefix
    
    return True

Time Complexity

O(n): We are iterating through the list of bytes once, making the solution linear in terms of data length.

Explanation

is_valid_byte Function: This helper function checks if a byte starts with 10, denoting it’s a valid continuation byte.
Iteration:
- For each byte in the data:
  - If the byte starts with 0, it’s a single byte character.
  - If it starts with 1, count the 1s at the start to determine the length of the multi-byte character.
  - Validate the following bytes.
Edge Cases: Handle cases where multi-byte characters are not properly terminated or overflow the list.

By following this structured approach, the solution ensures that we only validate valid UTF-8 sequences efficiently.

Try our interview co-pilot at AlgoAdvance.com

Got blindsided by a question you didn’t expect?
Spend too much time studying?
Or simply don’t have the time to go over all 3000 questions?