algoadvance

Leetcode 609. Find Duplicate File in System

Problem Statement

Given a list of directory info, including the directory path and all the files with contents in this directory, return all the duplicate files in the file system in terms of their paths.

A group of duplicate files consists of at least two files that have the same content.

A single directory info string in the input list has the following format:

“root/d1/d2/…/dm f1.txt(f1_content) f2.txt(f2_content) … fn.txt(fn_content)”

It means there are n files (f1.txt, f2.txt … fn.txt) with content (f1_content, f2_content … fn_content) respectively in the directory “root/d1/d2/…/dm”.

Note:

No spaces between the directory and the files.
The directory contains only alphabets and numbers.
The files contains only alphabets and numbers.
The files content is surrounded by parentheses.

Example

Example 1:

Input:

["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]

Output:

[ ["root/a/1.txt","root/c/3.txt"], ["root/a/2.txt","root/c/d/4.txt","root/4.txt"] ]

Example 2:

Input:

["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)"]

Output:

[ ["root/a/1.txt","root/c/3.txt"], ["root/a/2.txt","root/c/d/4.txt"] ]

Clarifying Questions

Input format: Can we assume that the input will always be a non-empty list of strings?
Contents uniqueness: Should we consider contents to be unique if they have different cases?
Output format: Should the output include groups with only one file (considered non-duplicates)?

Let’s assume:

Yes, the input is always a non-empty list of strings.
Yes, file contents are case-sensitive.
No, only groups with at least two files should be returned.

Strategy

Parsing: Parse each string to extract the directory path and files with their contents.
Hashing: Use a hash map to map the file contents to their respective paths.
Grouping: Traverse the parsed data, grouping file paths by their contents. Only include groups with more than one file path in the output.

Code

#include <vector>
#include <string>
#include <unordered_map>
#include <sstream>

using namespace std;

vector<vector<string>> findDuplicate(vector<string>& paths) {
    unordered_map<string, vector<string>> contentToFilePaths;

    for (const string& path : paths) {
        istringstream ss(path);
        string dir;
        ss >> dir;
        string fileInfo;
        
        while (ss >> fileInfo) {
            size_t openParen = fileInfo.find('(');
            size_t closeParen = fileInfo.find(')');
            string fileName = fileInfo.substr(0, openParen);
            string fileContent = fileInfo.substr(openParen + 1, closeParen - openParen - 1);
            contentToFilePaths[fileContent].emplace_back(dir + '/' + fileName);
        }
    }
    
    vector<vector<string>> duplicates;
    for (const auto& entry : contentToFilePaths) {
        if (entry.second.size() > 1) {
            duplicates.emplace_back(entry.second);
        }
    }

    return duplicates;
}

Time Complexity

Parsing Time: Each character of the input can be processed linearly, leading to O(n*k) where n is the number of directory entries and k is the average length of directory/files/contents.
Hashing Time: Inserting into the hash map and grouping paths by content also takes O(n*k).
Overall Time Complexity: Combining these operations, the time complexity is O(n*k).

Space Complexity

Hash Map Storage: Storing the paths and contents in the hash map potentially takes O(n*k) additional space.

Thus, both the time and space complexity are O(n*k).

Cut your prep time in half and DOMINATE your interview with AlgoAdvance AI

Got blindsided by a question you didn’t expect?
Spend too much time studying?
Or simply don’t have the time to go over all 3000 questions?