algoadvance

393. UTF-8 Validation

A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:

  1. For a 1-byte character, the first bit is a 0, followed by its unicode code.
  2. For an n-bytes character (n > 1), the first n bits are all 1’s, the n+1 bit is 0, followed by n-1 bytes with the most significant 2 bits being 10.

This is how the UTF-8 encoding would work:

Given an array of integers representing the data, return whether it is a valid UTF-8 encoding.

Clarifying Questions

  1. Input and Output:
    • Input: A list of integers.
    • Output: Boolean value indicating whether the given data represents a valid UTF-8 encoding.
  2. Value Range:
    • Each integer in the input list can range from 0 to 255, representing one byte.
  3. Edge Cases:
    • Input list length is zero.
    • Input list contains an improperly terminated multi-byte character.

Strategy

  1. Initial Checks:
    • If data length is 0, return True, as an empty string is trivially valid.
  2. Byte Classification:
    • For each byte, classify it:
      • 1-byte character: first bit 0.
      • 2-byte character: starts with 110.
      • 3-byte character: starts with 1110.
      • 4-byte character: starts with 11110.
  3. Validate Subsequent Bytes:
    • For multi-byte characters, ensure that the subsequent bytes start with 10.
  4. Iterative Check:
    • Iterate through bytes, deducing the number of required subsequent bytes for each multi-byte character and validate all necessary conditions.

Code

from typing import List

def validUtf8(data: List[int]) -> bool:
    def is_valid_byte(byte):
        return (byte & 0b11000000) == 0b10000000

    i = 0
    while i < len(data):
        byte = data[i]
        if (byte & 0b10000000) == 0:  # 1-byte character
            i += 1
            continue

        len_prefix = 0
        mask = 0b10000000
        while (byte & mask) != 0:
            len_prefix += 1
            mask >>= 1

        if len_prefix == 1 or len_prefix > 4:  # Invalid prefix length (<= 4 bytes allowed)
            return False

        for j in range(1, len_prefix):
            if i + j >= len(data) or not is_valid_byte(data[i + j]):
                return False

        i += len_prefix
    
    return True

Time Complexity

Explanation

  1. is_valid_byte Function: This helper function checks if a byte starts with 10, denoting it’s a valid continuation byte.
  2. Iteration:
    • For each byte in the data:
      • If the byte starts with 0, it’s a single byte character.
      • If it starts with 1, count the 1s at the start to determine the length of the multi-byte character.
      • Validate the following bytes.
  3. Edge Cases: Handle cases where multi-byte characters are not properly terminated or overflow the list.

By following this structured approach, the solution ensures that we only validate valid UTF-8 sequences efficiently.

Try our interview co-pilot at AlgoAdvance.com