
Given a list of directory paths, where each path is a string in the format:

"root/d1/d2/.../dn file1.txt(content1) file2.txt(content2) ... fileN.txt(contentN)"

You need to find all the groups of duplicate files in the file system in terms of their content.

A single directory path will contain at least one file. Each file entry consists of a filename followed by its content wrapped in parentheses. Multiple files in different directories can have the same content.

Return a list of groups of duplicate files. For each group, return the list of file paths that have the same content.
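
For example, given the input ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd) 4.txt(efgh)"], the duplicate groups are [["root/a/1.txt", "root/c/3.txt"], ["root/a/2.txt", "root/c/4.txt"]], since 1.txt and 3.txt share the content "abcd" while 2.txt and 4.txt share "efgh".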

Clarifying Questions

  1. Input format: Is each entry in the list a single string that represents a directory and its files?
  2. Output format: Should the output be a list of lists where each sub-list contains the file paths of files with the same content?
  3. Constraints: Are there any constraints on the number of directories or the length of the paths/content?

Assuming the answers to these are yes, I’ll proceed with the solution.

Strategy

  1. Parsing Input: For each directory string in the input, split on spaces to separate the directory path from its file entries (see the short parsing sketch after this list).
  2. Storing Content Mappings: Use a dictionary to map file content to a list of file paths.
  3. Construct Paths: Construct the full file path using the directory path and the filename.
  4. Populate Results: Iterate over the content mapping dictionary and collect lists of file paths that share the same content.
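
As a quick illustration of step 1, here is how a single entry breaks apart. This is a minimal sketch; the helper name parse_entry is purely for illustration and is not part of the final solution.

def parse_entry(entry):
    # "root/a 1.txt(abcd) 2.txt(efgh)" -> ("root/a", [("1.txt", "abcd"), ("2.txt", "efgh")])
    root, *files = entry.split(" ")
    parsed = []
    for file_info in files:
        name, content = file_info.split("(")
        parsed.append((name, content[:-1]))  # drop the trailing ')'
    return root, parsed

print(parse_entry("root/a 1.txt(abcd) 2.txt(efgh)"))
# ('root/a', [('1.txt', 'abcd'), ('2.txt', 'efgh')])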

Code

from collections import defaultdict

def findDuplicate(paths):
    # Dictionary to map file content to list of file paths
    content_map = defaultdict(list)
    
    for path in paths:
        parts = path.split(" ")
        root_dir = parts[0]
        for file_info in parts[1:]:
            # Split the file_info into filename and content
            file_name, file_content = file_info.split("(")
            file_content = file_content[:-1]  # remove the closing ')'
            # Construct the full file path
            full_path = root_dir + "/" + file_name
            # Map content to the full path
            content_map[file_content].append(full_path)
    
    # Keep only groups that contain more than one file path (actual duplicates)
    result = [file_paths for file_paths in content_map.values() if len(file_paths) > 1]
    
    return result

# Example usage
paths = [
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root/c/d 4.txt(efgh)",
    "root 4.txt(efgh)"
]
print(findDuplicate(paths))
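# Expected output (insertion order is preserved in Python 3.7+):
# [['root/a/1.txt', 'root/c/3.txt'], ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']]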

Time Complexity

Parsing each file entry and building its full path takes time proportional to the entry's length, so the overall time complexity is O(N * L), where N is the total number of files across all directories and L is the average length of a path plus its content. The content map uses O(N * L) extra space.
