692. Top K Frequent Words
Problem Statement
Given an array of strings words and an integer k, return the k most frequent strings.
Return the answer sorted by the frequency from highest to lowest. Sort the words with the same frequency by their lexicographical order.
Examples
Example 1:
Input: words = ["i","love","leetcode","i","love","coding"], k = 2
Output: ["i","love"]
Explanation: "i" and "love" are the two most frequent words.
Note that "i" comes before "love" due to a lower alphabetical order.
Example 2:
Input: words = ["the","day","is","sunny","the","the","the","sunny","is","is"], k = 2
Output: ["the","is"]
Explanation: "the", "is", "sunny" and "day" are the four most frequent words, with frequencies 4, 3, 2 and 1 respectively.
Constraints
- 1 <= words.length <= 500
- 1 <= words[i].length <= 10
- words[i] consists of lowercase English letters.
- k is in the range [1, the number of unique words[i]].
Clarification Questions
Before diving into the solution, here are 5 important clarifications and assumptions to discuss during an interview:
- Tie-breaking: When words have the same frequency, how should we break ties? (Assumption: Sort lexicographically, i.e., alphabetical order)
- Frequency calculation: How is frequency calculated? (Assumption: Count occurrences of each word in the input array)
- Case sensitivity: Are word comparisons case-sensitive? (Assumption: The constraints allow only lowercase letters, so the question does not arise)
- Output format: Should we return k words or all words with the same frequency? (Assumption: Return exactly k words - the top k most frequent)
- Word uniqueness: Can the same word appear multiple times in the result? (Assumption: No - each word appears once in the result, sorted by frequency then lexicographically)
Interview Deduction Process (20 minutes)
Step 1: Brute-Force Approach (5 minutes)
Initial Thought: “I need to find top k frequent words. Let me count frequencies and sort.”
Naive Solution: Count frequency of each word, sort by frequency (descending) then lexicographically, return top k.
Complexity: O(n log n) time, O(n) space
Issues:
- O(n log n) time when O(n log k) is possible
- Sorts all unique words when only the top k are needed
- Doesn't leverage a heap for partial selection
- Can be optimized
Step 2: Semi-Optimized Approach (7 minutes)
Insight: “I can use min-heap of size k to track top k words, with custom comparator.”
Improved Solution: Count frequencies, then maintain a min-heap of size k with a custom comparator (frequency first, then lexicographic order). Push each unique word onto the heap; whenever the heap size exceeds k, pop the top, which is the least frequent word (or the lexicographically largest among ties).
Complexity: O(n log k) time, O(n) space
Improvements:
- O(n log k) time is better than O(n log n)
- Heap efficiently maintains top k
- Custom comparator handles tie-breaking
- Can optimize further
Step 3: Optimized Solution (8 minutes)
Final Optimization: “Min-heap with custom comparator is optimal. Sort result at end for correct order.”
Best Solution: Min-heap approach is optimal. Count frequencies, use min-heap of size k. After processing, extract and sort heap elements for final result (frequency descending, then lexicographic).
Complexity: O(n log k) time, O(n) space
Key Realizations:
- Heap is perfect for top-k problems
- Custom comparator handles tie-breaking
- O(n log k) time is optimal for heap approach
- Final sort ensures correct order
Solution Approach
This problem is similar to LC 347: Top K Frequent Elements, but with two key differences:
- Strings instead of integers: Need to handle string comparison
- Lexicographic ordering: When frequencies are equal, sort alphabetically
Key Insights:
- Frequency Counting: Use hash map to count word frequencies
- Custom Sorting: Sort by frequency (descending), then by lexicographic order (ascending)
- Top K Selection: Return only the first k elements after sorting
Algorithm:
- Count frequencies: Use unordered_map to count occurrences of each word
- Collect unique words: Extract all unique words from the map
- Sort: Sort by frequency (descending), then lexicographically (ascending)
- Return top k: Take the first k elements
Solution
Solution: Hash Map + Custom Sorting
class Solution {
public:
    vector<string> topKFrequent(vector<string>& words, int k) {
        unordered_map<string, int> cnt;
        for (const string& word : words) cnt[word]++;
        vector<string> rtn;
        for (const auto& [key, value] : cnt) rtn.emplace_back(key);
        sort(rtn.begin(), rtn.end(), [&](const string& a, const string& b) {
            return cnt[a] == cnt[b] ? a < b : cnt[a] > cnt[b];
        });
        rtn.erase(rtn.begin() + k, rtn.end());
        return rtn;
    }
};
Algorithm Explanation:
- Count Frequencies: Create an unordered_map<string, int> and iterate through all words, incrementing each word's count
- Collect Unique Words: Extract all unique words (keys) from the frequency map and store them in the result vector rtn
- Custom Sort: Primary sort by frequency (descending) - cnt[a] > cnt[b]; secondary sort by lexicographic order (ascending) - a < b when frequencies are equal. The lambda captures by reference ([&]) to access the cnt map
- Return Top K: Erase the elements after index k to keep only the top k, then return the result
Why This Works:
- Hash map counting: Efficiently counts frequencies in O(n) time
- Custom comparator: Handles both frequency and lexicographic ordering
- Sorting approach: Simple and straightforward for small input sizes
Example Walkthrough:
For words = ["i","love","leetcode","i","love","coding"], k = 2:
Step 1: Count frequencies
cnt = {
"i": 2,
"love": 2,
"leetcode": 1,
"coding": 1
}
Step 2: Collect unique words
rtn = ["i", "love", "leetcode", "coding"]
Step 3: Sort with custom comparator
Compare pairs:
- "i" vs "love": cnt["i"] == cnt["love"] (both 2) → "i" < "love" → "i" comes first
- "leetcode" vs "coding": cnt["leetcode"] == cnt["coding"] (both 1) → "coding" < "leetcode" → "coding" comes first
- "i"/"love" vs "leetcode"/"coding": cnt["i"] > cnt["leetcode"] → higher frequency comes first
After sorting:
rtn = ["i", "love", "coding", "leetcode"]
Step 4: Take top k = 2
rtn = ["i", "love"]
Result: ["i", "love"]
Complexity Analysis:
- Time Complexity: O(n + m log m) where n is total words, m is unique words
- O(n) for frequency counting
- O(m) for collecting unique words
- O(m log m) for sorting m unique words
- O(m - k) for erasing the tail elements
- Space Complexity: O(m) where m is number of unique words
- O(m) for frequency map
- O(m) for result vector
Optimization Note:
Instead of erase, we could use resize(k), which is more concise:
rtn.resize(k);
return rtn;
Alternative Approaches
Approach 2: Min Heap (Better for Large k)
For cases where k is much smaller than the number of unique words, a min heap approach would be more efficient:
class Solution {
public:
    vector<string> topKFrequent(vector<string>& words, int k) {
        unordered_map<string, int> cnt;
        for (const string& word : words) cnt[word]++;
        // Min-heap of size k: the worst candidate (lowest frequency,
        // lexicographically largest among ties) sits on top
        auto cmp = [&](const string& a, const string& b) {
            return cnt[a] == cnt[b] ? a < b : cnt[a] > cnt[b];
        };
        priority_queue<string, vector<string>, decltype(cmp)> pq(cmp);
        for (const auto& [word, freq] : cnt) {
            pq.push(word);
            if ((int)pq.size() > k) pq.pop();
        }
        vector<string> rtn;
        while (!pq.empty()) {
            rtn.push_back(pq.top());
            pq.pop();
        }
        reverse(rtn.begin(), rtn.end());
        return rtn;
    }
};
Time Complexity: O(n + m log k) where m is unique words
Space Complexity: O(m + k)
Approach 3: Bucket Sort
Similar to LC 347, but requires additional sorting within each bucket:
class Solution {
public:
    vector<string> topKFrequent(vector<string>& words, int k) {
        unordered_map<string, int> cnt;
        for (const string& word : words) cnt[word]++;
        int maxFreq = 0;
        for (const auto& [word, freq] : cnt) maxFreq = max(maxFreq, freq);
        vector<vector<string>> buckets(maxFreq + 1);
        for (const auto& [word, freq] : cnt) buckets[freq].push_back(word);
        vector<string> rtn;
        for (int i = maxFreq; i >= 1 && (int)rtn.size() < k; i--) {
            sort(buckets[i].begin(), buckets[i].end());
            for (const string& word : buckets[i]) {
                rtn.push_back(word);
                if ((int)rtn.size() == k) break;
            }
        }
        return rtn;
    }
};
Time Complexity: O(n + m log m) in worst case
Space Complexity: O(m)
Key Insights
- Custom Comparator: The key is the two-level sorting: frequency first, then lexicographic order
- Hash Map Efficiency: unordered_map provides O(1) average-case lookups for frequency counting
- Sorting Trade-off: Simple sorting works well for small inputs; a heap is better when k is small relative to the number of unique words
- Lexicographic Order: When frequencies are equal, use standard string comparison (<)
Edge Cases
- All words have same frequency: Sort purely by lexicographic order
- k equals number of unique words: Return all words sorted
- Single word repeated: Return that word
- Different frequencies, same lexicographic prefix: Frequency takes priority
Related Problems
- LC 347: Top K Frequent Elements - Similar problem with integers
- LC 215: Kth Largest Element in an Array - Kth largest element
- LC 451: Sort Characters By Frequency - Sort by frequency
- LC 973: K Closest Points to Origin - Top K with custom ordering
This problem demonstrates the importance of custom comparators for multi-criteria sorting, combining frequency-based and lexicographic ordering.