187. Repeated DNA Sequences
187. Repeated DNA Sequences
Problem Statement
The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'. Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.
Examples
Example 1:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC", "CCCCCAAAAA"]
Example 2:
Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]
Constraints
1 <= s.length <= 10^5s[i]is either'A','C','G', or'T'.
Clarification Questions
Before diving into the solution, here are 5 important clarifications and assumptions to discuss during an interview:
-
Substring length: Is the length always 10? (Assumption: Yes — we need 10-letter-long sequences only.)
-
Character set: What characters can appear in
s? (Assumption: Only'A','C','G','T'— standard DNA nucleotides.) -
Output format: Do we need a specific order? (Assumption: Any order; return as list/set of strings.)
-
Overlap: Can the same 10-letter sequence appear in multiple overlapping positions? (Assumption: Yes — count each distinct substring that appears at least twice.)
-
Empty or short string: What if
shas fewer than 10 characters? (Assumption: Return empty list — no 10-letter substring exists.)
Interview Deduction Process (20 minutes)
Step 1: Brute-Force Approach (5 minutes)
Slide a window of length 10 over the string; for each substring, check all other positions for a match. O(n²) or O(n × 10) per window — too slow for n up to 10^5. We need to detect duplicates in one pass or with O(1) comparison per window.
Step 2: Semi-Optimized Approach (7 minutes)
Use a hash set: for each starting index i, extract s[i:i+10] and add to a set; if already in the set, add to the result. O(n) time, O(n) space. This works but creates O(n) substrings of length 10. For large n we can avoid creating every substring by using a rolling hash or encoding.
Step 3: Optimized Solution (8 minutes)
Encode each of the 4 letters in 2 bits; a 10-letter window fits in 20 bits (one integer). Maintain a rolling hash: for each new character, shift left by 2, add the new 2 bits, mask to 20 bits. Compare integers in O(1); only build the substring when adding to the result. O(n) time, O(n) space with a smaller constant and no substring slicing in the hot loop.
Solution Approach
This is a fixed-length sliding window problem with substring deduplication. The key insight is that with only 4 characters, we can encode a 10-character window in 20 bits and use a rolling update for O(1) per step.
Key Insights:
- Fixed window: There are
n - 9possible 10-letter substrings; we need to find those that appear more than once. - Small alphabet: 4 symbols → 2 bits per character → 10 characters = 20 bits, fits in one integer.
- Rolling hash: Update the integer in O(1) by shifting, OR-ing the new character’s bits, and masking to 20 bits.
- Duplicate detection: Use a set of seen hashes; when we see a hash again, add the current window substring to the result.
Solution 1: Hash Set (Baseline)
Idea: Slide a window of length 10, store each substring in a set; if seen again, add to the result set.
Time Complexity: O(n)
Space Complexity: O(n)
Each substring copy is O(10), so this is simple but does more string work than necessary.
class Solution:
def findRepeatedDnaSequences(self, s: str) -> list[str]:
if len(s) < 10:
return []
seen = set()
repeated = set()
for i in range(len(s) - 9):
sub = s[i : i + 10]
if sub in seen:
repeated.add(sub)
else:
seen.add(sub)
return list(repeated)
Algorithm Explanation:
- Iterate over every starting index
ifor a 10-character window. - Extract substring
s[i : i + 10]. - If it’s already in
seen, add it torepeated; otherwise add it toseen. - Return
repeatedas a list.
Solution 2: Rolling Hash with Bit Encoding (Optimal)
Idea: With only 4 letters, we can encode each character in 2 bits. So 10 characters use 20 bits and fit in a 32-bit integer. We can use a rolling window: update the integer in O(1) by shifting, adding the new character’s bits, and masking out the oldest.
Encoding:
| Char | Binary |
|---|---|
| A | 00 |
| C | 01 |
| G | 10 |
| T | 11 |
Rolling update: For a window of 10 characters (20 bits), use a mask (1 << 20) - 1. After shifting left by 2 and adding the new character’s 2 bits, apply the mask so only the last 20 bits are kept.
Time Complexity: O(n)
Space Complexity: O(n)
Each step is O(1); no substring creation in the hot path. We only form the substring when adding to the result (when a duplicate is found).
class Solution:
def findRepeatedDnaSequences(self, s: str) -> list[str]:
if len(s) < 10:
return []
mapping = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
seen = set()
repeated = set()
hash_val = 0
mask = (1 << 20) - 1
for i in range(len(s)):
hash_val = ((hash_val << 2) | mapping[s[i]]) & mask
if i >= 9:
if hash_val in seen:
repeated.add(s[i - 9 : i + 1])
else:
seen.add(hash_val)
return list(repeated)
Algorithm Explanation:
- Encode each character as 0, 1, 2, or 3 (2 bits). Initialize
hash_val = 0,mask = (1 << 20) - 1. - For each index
i, update:hash_val = ((hash_val << 2) | mapping[s[i]]) & mask. - Once
i >= 9, the current window iss[i-9 : i+1]. Ifhash_valis already inseen, add that substring torepeated; otherwise addhash_valtoseen. - Return
list(repeated).
Why This Works:
- The mask keeps only the last 20 bits, so the integer uniquely represents the last 10 characters. Duplicate windows yield the same hash. We only build the substring when adding to
repeated.
Mask trick
- 10 characters × 2 bits = 20 bits.
mask = (1 << 20) - 1keeps only the lowest 20 bits.- After
(hash_val << 2) | new_bits, the& maskdrops the oldest character’s bits, so the integer always represents the last 10 characters.
Key insights
- Fixed-length sliding window over the string.
- Rolling hash / bit encoding turns a 10-character window into a single integer for O(1) comparison.
- Small alphabet (4 symbols) makes 2-bit encoding and a 20-bit integer feasible.
- This pattern is useful for string matching, Rabin–Karp, and substring deduplication.
Edge Cases:
- String shorter than 10: Return empty list.
- No repeated sequence: Return empty list.
- All same character (e.g. “AAAAAAAAAAAAA”): One 10-letter substring repeated; return that one string in a list.
Common Mistakes:
- Forgetting
i >= 9: The first valid 10-letter window starts wheni >= 9. - Using one set only: We need to distinguish “seen once” vs “seen twice or more”; use one set for seen and one for repeated (or count frequencies).
- Wrong mask size: For 10 characters × 2 bits = 20 bits; mask must be
(1 << 20) - 1.
Related Problems:
- LC 3: Longest Substring Without Repeating Characters — Sliding window
- LC 76: Minimum Window Substring — Variable window
- LC 1044: Longest Duplicate Substring — Binary search + rolling hash
Follow-ups
- What if the substring length is
kinstead of 10? - What if the alphabet is larger (e.g., full ASCII)?
- How would you solve it with Rabin–Karp (e.g., a prime-base hash)?
- Can you find repeated substrings of any length?