algoadvance

Leetcode 393. UTF

Problem Statement

A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:

For a 1-byte character, the first bit is a 0, followed by its unicode code.
For an n-bytes character (n > 1), the first n bits are all one’s, the n+1 bit is 0, followed by n-1 bytes with the most significant 2 bits being 10.

This means that the binary representation of the byte starts with the following pattern for n bytes:

1-byte character: 0xxxxxxx
2-byte character: 110xxxxx 10xxxxxx
3-byte character: 1110xxxx 10xxxxxx 10xxxxxx
4-byte character: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Given an array of integers representing the data, return whether it is a valid UTF-8 encoding.

Clarifying Questions

What data type is the input array?
- The input is an array of integers.
What is the range of the integers?
- Each integer in the array is between 0 and 255 (inclusive), representing each byte in the data.
What should the function return if the data is valid or invalid UTF-8?
- The function should return true if the data is valid UTF-8 encoding and false otherwise.

Strategy

Bit Manipulation:
- To solve this problem, we will utilize bit manipulation to check the leading bits of each byte.
Main Steps:
- Initialize a variable to keep track of how many bytes are expected in the current UTF-8 character.
- Iterate through the array:
  - If we are expecting a new character (i.e., not in the middle of a multi-byte character), determine the number of bytes based on the leading bits.
  - If in the middle of a multi-byte character, ensure the current byte starts with 10.
- Ensure no extra bytes are expected at the end of the array.

Code

Here’s the implementation of the UTF-8 validation in Java:

public class Solution {
    public boolean validUtf8(int[] data) {
        int numberOfBytes = 0;
        
        for (int i = 0; i < data.length; i++) {
            int num = data[i];
            
            // If this is a new character, calculate the number of bytes
            if (numberOfBytes == 0) {
                if ((num >> 5) == 0b110) numberOfBytes = 1;
                else if ((num >> 4) == 0b1110) numberOfBytes = 2;
                else if ((num >> 3) == 0b11110) numberOfBytes = 3;
                else if ((num >> 7) == 0b0) numberOfBytes = 0;
                else return false;
            } else {
                // Check if the byte is a valid continuation (10xxxxxx)
                if ((num >> 6) != 0b10) return false;
                numberOfBytes--;
            }
        }
        
        // All characters should be properly ended
        return numberOfBytes == 0;
    }

    public static void main(String[] args) {
        Solution solution = new Solution();
        
        int[] data1 = {197, 130, 1};
        System.out.println(solution.validUtf8(data1)); // Output: true

        int[] data2 = {235, 140, 4};
        System.out.println(solution.validUtf8(data2)); // Output: false
    }
}

Time Complexity

Time Complexity: O(n)
- We iterate through the data array once, where n is the length of the input array.
Space Complexity: O(1)
- The space required is constant, as we only use a few extra variables.

This solution ensures that the given data adheres to the UTF-8 encoding rules by validating each byte appropriately and keeping track of how many continuation bytes are expected.

Cut your prep time in half and DOMINATE your interview with AlgoAdvance AI

Got blindsided by a question you didn’t expect?
Spend too much time studying?
Or simply don’t have the time to go over all 3000 questions?