algoadvance

Leetcode 609. Find Duplicate File in System

Problem Statement

Given a list of directory info, including the directory path and all the files with contents in this directory, return all the duplicate files in the file system in terms of their paths.

A group of duplicate files consists of at least two files that have the same content.

A single directory info string in the input list has the following format:

It means there are n files (f1.txt, f2.txt … fn.txt) with content (f1_content, f2_content … fn_content) respectively in the directory “root/d1/d2/…/dm”.

Note:

Example

Example 1:

Input:

["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]

Output:

[ ["root/a/1.txt","root/c/3.txt"], ["root/a/2.txt","root/c/d/4.txt","root/4.txt"] ]

Example 2:

Input:

["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)"]

Output:

[ ["root/a/1.txt","root/c/3.txt"], ["root/a/2.txt","root/c/d/4.txt"] ]

Clarifying Questions

  1. Input format: Can we assume that the input will always be a non-empty list of strings?
  2. Contents uniqueness: Should we consider contents to be unique if they have different cases?
  3. Output format: Should the output include groups with only one file (considered non-duplicates)?

Let’s assume:

  1. Yes, the input is always a non-empty list of strings.
  2. Yes, file contents are case-sensitive.
  3. No, only groups with at least two files should be returned.

Strategy

  1. Parsing: Parse each string to extract the directory path and files with their contents.
  2. Hashing: Use a hash map to map the file contents to their respective paths.
  3. Grouping: Traverse the parsed data, grouping file paths by their contents. Only include groups with more than one file path in the output.

Code

#include <vector>
#include <string>
#include <unordered_map>
#include <sstream>

using namespace std;

vector<vector<string>> findDuplicate(vector<string>& paths) {
    unordered_map<string, vector<string>> contentToFilePaths;

    for (const string& path : paths) {
        istringstream ss(path);
        string dir;
        ss >> dir;
        string fileInfo;
        
        while (ss >> fileInfo) {
            size_t openParen = fileInfo.find('(');
            size_t closeParen = fileInfo.find(')');
            string fileName = fileInfo.substr(0, openParen);
            string fileContent = fileInfo.substr(openParen + 1, closeParen - openParen - 1);
            contentToFilePaths[fileContent].emplace_back(dir + '/' + fileName);
        }
    }
    
    vector<vector<string>> duplicates;
    for (const auto& entry : contentToFilePaths) {
        if (entry.second.size() > 1) {
            duplicates.emplace_back(entry.second);
        }
    }

    return duplicates;
}

Time Complexity

Space Complexity

Thus, both the time and space complexity are O(n*k).

Cut your prep time in half and DOMINATE your interview with AlgoAdvance AI