LeetCode 609. Find Duplicate File in System

Problem Statement

Given a list of directory info strings, each containing a directory path and all of the files (with their contents) in that directory, find all the duplicate files in the file system in terms of their contents.

A group of duplicate files consists of at least two files that have exactly the same content.

A single directory info string in the input list has the following format:

"root/d1/d2/.../dn f1(content1) f2(content2) ... fk(contentk)"

It means there are k files (f1, f2, … fk) with contents content1, content2, … contentk respectively in directory root/d1/d2/.../dn.
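
For example, a minimal sketch of how one such string decomposes (the class and variable names here are illustrative):

public class ParseSketch {
    public static void main(String[] args) {
        String info = "root/a 1.txt(abcd) 2.txt(efgh)";
        String[] parts = info.split(" ");
        String dirPath = parts[0]; // "root/a"
        for (int i = 1; i < parts.length; i++) {
            int open = parts[i].indexOf('(');
            String fileName = parts[i].substring(0, open);
            String content = parts[i].substring(open + 1, parts[i].length() - 1);
            System.out.println(dirPath + "/" + fileName + " -> " + content);
        }
        // Prints:
        // root/a/1.txt -> abcd
        // root/a/2.txt -> efgh
    }
}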

Note:

The output is a list of groups of duplicate file paths; each group must contain at least two paths, collected into a single list.

Example

Example 1:

Input:
[
  "root/a 1.txt(abcd) 2.txt(efgh)",
  "root/c 3.txt(abcd)",
  "root/c/d 4.txt(efgh)",
  "root 4.txt(efgh)"
]

Output: 
[
  ["root/a/1.txt","root/c/3.txt"],
  ["root/a/2.txt","root/c/d/4.txt","root/4.txt"]
]

Clarifying Questions

  1. Can the contents of the files contain spaces or special characters?
    • No. Per the problem constraints, contents contain no spaces or parentheses, so splitting on spaces and slicing at the enclosing parentheses is safe.
  2. Is content comparison case-sensitive?
    • Yes. Contents are compared as exact strings, so the comparison is case-sensitive.
  3. Should we return the results in a specific order?
    • No specific order is required for the output lists.
  4. Can there be nested directories?
    • Yes, directories can be nested.

Strategy

  1. Parse Input:
    • Iterate over each directory string.
    • Split the string to extract the directory path and all the files with their contents.
  2. Organize Data:
    • Use a HashMap where each key is the content of the file and the value is a list of file paths.
  3. Filter Duplicates:
    • Iterate over the HashMap and collect lists that have more than one file path (a stream-based sketch of this step follows the list).
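
A stream-based sketch of the filtering step (the class name, method name, and map parameter below are illustrative, not part of the solution that follows):

import java.util.*;
import java.util.stream.Collectors;

class DuplicateFilterSketch {
    // Keep only the groups that contain at least two paths, i.e., actual duplicates.
    static List<List<String>> filterDuplicates(Map<String, List<String>> contentToFilePaths) {
        return contentToFilePaths.values().stream()
                .filter(group -> group.size() > 1)
                .collect(Collectors.toList());
    }
}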

Code

import java.util.*;

public class FindDuplicateFileInSystem {
    public List<List<String>> findDuplicate(String[] paths) {
        // HashMap to store file contents as key and list of their paths as value
        HashMap<String, List<String>> contentToFilePaths = new HashMap<>();
        
        for (String path : paths) {
            // Split the directory information.
            String[] parts = path.split(" ");
            String dirPath = parts[0];
            
            // Parse each file in the current directory.
            for (int i = 1; i < parts.length; i++) {
                int openParen = parts[i].indexOf("(");
                String fileName = parts[i].substring(0, openParen);
                String content = parts[i].substring(openParen + 1, parts[i].length() - 1);
                
                String filePath = dirPath + "/" + fileName;
                
                contentToFilePaths
                    .computeIfAbsent(content, k -> new ArrayList<>())
                    .add(filePath);
            }
        }
        
        // Collect all groups of duplicate files.
        List<List<String>> duplicates = new ArrayList<>();
        for (List<String> filePaths : contentToFilePaths.values()) {
            if (filePaths.size() > 1) {
                duplicates.add(filePaths);
            }
        }
        
        return duplicates;
    }

    public static void main(String[] args) {
        FindDuplicateFileInSystem solution = new FindDuplicateFileInSystem();
        String[] input = {
            "root/a 1.txt(abcd) 2.txt(efgh)",
            "root/c 3.txt(abcd)",
            "root/c/d 4.txt(efgh)",
            "root 4.txt(efgh)"
        };
        List<List<String>> result = solution.findDuplicate(input);
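        // Expected output (group order may vary, since HashMap iteration order is unspecified):
        // [[root/a/1.txt, root/c/3.txt], [root/a/2.txt, root/c/d/4.txt, root/4.txt]]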
        System.out.println(result);
    }
}
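
A note on the design: computeIfAbsent creates each content's list lazily on first encounter, which keeps the grouping step to a single expression. Because paths are appended to an ArrayList, paths within each group appear in input order, while the order of the groups themselves follows HashMap iteration and is unspecified.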

Time Complexity

The algorithm runs in O(N) time, where N is the total length of all input strings: each character is processed a constant number of times while splitting, slicing out file names and contents, and hashing. Space complexity is also O(N) for the map from contents to file paths.
