
Given a list of directory paths, where each path is a string in the format:

"root/d1/d2/.../dn file1.txt(content1) file2.txt(content2) ... fileN.txt(contentN)"

You need to find all the groups of duplicate files in the file system in terms of their content.

A single directory path will contain at least one file. Each file entry consists of a filename followed by its content wrapped in parentheses. Multiple files in different directories can have the same content.

Return a list of groups of duplicate files. For each group, return the list of file paths that have the same content.
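
For example, given the input ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd) 4.txt(efgh)"], the duplicate groups are [["root/a/1.txt", "root/c/3.txt"], ["root/a/2.txt", "root/c/4.txt"]], since 1.txt and 3.txt share the content "abcd" while 2.txt and 4.txt share "efgh".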

Clarifying Questions

  1. Input format: Is each entry in the list a single string that represents a directory and its files?
  2. Output format: Should the output be a list of lists where each sub-list contains the file paths of files with the same content?
  3. Constraints: Are there any constraints on the number of directories or the length of the paths/content?

Assuming the answers to these are yes, I’ll proceed with the solution.

Strategy

  1. Parsing Input: For each directory string in the input, split on spaces to separate the directory path from its file entries (see the short parsing sketch after this list).
  2. Storing Content Mappings: Use a dictionary to map file content to a list of file paths.
  3. Construct Paths: Construct the full file path using the directory path and the filename.
  4. Populate Results: Iterate over the content mapping dictionary and collect lists of file paths that share the same content.
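
As a quick illustration of step 1, here is how a single entry breaks apart. This is a minimal sketch; the helper name parse_entry is purely for illustration and is not part of the final solution.

def parse_entry(entry):
    # "root/a 1.txt(abcd) 2.txt(efgh)" -> ("root/a", [("1.txt", "abcd"), ("2.txt", "efgh")])
    root, *files = entry.split(" ")
    parsed = []
    for file_info in files:
        name, content = file_info.split("(")
        parsed.append((name, content[:-1]))  # drop the trailing ')'
    return root, parsed

print(parse_entry("root/a 1.txt(abcd) 2.txt(efgh)"))
# ('root/a', [('1.txt', 'abcd'), ('2.txt', 'efgh')])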

Code

from collections import defaultdict

def findDuplicate(paths):
    # Dictionary to map file content to list of file paths
    content_map = defaultdict(list)
    
    for path in paths:
        parts = path.split(" ")
        root_dir = parts[0]
        for file_info in parts[1:]:
            # Split the file_info into filename and content
            file_name, file_content = file_info.split("(")
            file_content = file_content[:-1]  # remove the closing ')'
            # Construct the full file path
            full_path = root_dir + "/" + file_name
            # Map content to the full path
            content_map[file_content].append(full_path)
    
    # Keep only groups that contain more than one file path (actual duplicates)
    result = [file_paths for file_paths in content_map.values() if len(file_paths) > 1]
    
    return result

# Example usage
paths = [
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root/c/d 4.txt(efgh)",
    "root 4.txt(efgh)"
]
print(findDuplicate(paths))
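# Expected output (insertion order is preserved in Python 3.7+):
# [['root/a/1.txt', 'root/c/3.txt'], ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']]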

Time Complexity

Parsing each file entry and building its full path takes time proportional to the entry's length, so the overall time complexity is O(N * L), where N is the total number of files across all directories and L is the average length of a path plus its content. The content map uses O(N * L) extra space.
