algoadvance

The problem “Repeated DNA Sequences” asks you to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. A DNA sequence is composed of a series of nucleotides represented by the characters ‘A’, ‘C’, ‘G’, and ‘T’.

You need to:

Identify all 10-letter-long sequences that appear more than once in the given string.
Return these sequences in the form of a list.

Clarifying Questions

Input Length: What is the maximum length of the DNA sequence string?
Case Sensitivity: Are the DNA sequences case-sensitive?
Multiple Occurrences: Should the sequences be listed with all their positions, or just be identified if they appear more than once?

For the sake of this problem, we will assume the input string could be up to 10^5 characters (as typical for coding challenges), is case-sensitive, and we just need to list the sequences without worrying about their positions.

Strategy

Sliding Window: Use a sliding window of size 10 to traverse through the string.
Hashing: Use a hash set to keep track of sequences we’ve seen and a second set to keep track of sequences that we’ve seen more than once.
Result Collection: Store sequences that appear more than once in a list.

By using sets, we can efficiently check for repeating sequences.

Time Complexity

Time Complexity: O(n), where n is the length of the DNA sequence string since we are traversing the string with a fixed window size.
Space Complexity: O(n), primarily for storing unique sequences in sets.

Python Code

def findRepeatedDnaSequences(s: str):
    if len(s) < 10:
        return []

    seen, repeated = set(), set()
    for i in range(len(s) - 9):
        sequence = s[i:i + 10]
        if sequence in seen:
            repeated.add(sequence)
        else:
            seen.add(sequence)
    
    return list(repeated)

# Example usage
dna_sequence = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
print(findRepeatedDnaSequences(dna_sequence))  # Output: ["AAAAACCCCC", "CCCCCAAAAA"]

Explanation

Edge Case Handling: If the string length is less than 10, immediately return an empty list since it’s impossible to have 10-letter sequences.
Sliding Window: Use a range-based loop to construct 10-letter substrings and move one character at a time.
Tracking and Recording:
- Use a set seen to record all the 10-letter sequences we come across.
- Use another set repeated to record sequences that appear more than once.
Finally, convert the repeated set to a list as the output.

This solution efficiently solves the problem by ensuring that all operations with sets (insertion, membership test) are performed in average O(1) time.

Try our interview co-pilot at AlgoAdvance.com

Got blindsided by a question you didn’t expect?
Spend too much time studying?
Or simply don’t have the time to go over all 3000 questions?