[Medium] 1242. Web Crawler Multithreaded
This is a multithreading problem that requires implementing a concurrent web crawler. The key insight is using proper synchronization mechanisms to avoid race conditions while crawling URLs from the same domain concurrently.
Problem Description
Given a start URL and an HTML parser, implement a multithreaded web crawler that:
- Crawls all URLs from the same domain as the start URL
- Uses multiple threads for concurrent crawling
- Avoids visiting the same URL multiple times
- Returns all discovered URLs
Examples
Example 1:
Input:
urls = [
"http://news.yahoo.com",
"http://news.yahoo.com/news",
"http://news.yahoo.com/news/topics/",
"http://news.google.com"
]
startUrl = "http://news.yahoo.com/news/topics/"
Output: [
"http://news.yahoo.com",
"http://news.yahoo.com/news",
"http://news.yahoo.com/news/topics/"
]
Example 2:
Input:
urls = [
"http://news.yahoo.com",
"http://news.yahoo.com/news",
"http://news.yahoo.com/news/topics/",
"http://news.google.com"
]
startUrl = "http://news.google.com"
Output: ["http://news.google.com"]
Constraints
- 1 <= urls.length <= 1000
- 1 <= urls[i].length <= 300
- startUrl is one of the urls
- URLs may have different hostnames; only URLs that share the hostname of startUrl should be crawled
Approach
The solution involves:
- Domain Extraction: Extract the base domain from the start URL (see the sketch after this list)
- Thread Pool: Use multiple threads for concurrent crawling
- Synchronization: Use locks to prevent race conditions
- URL Filtering: Only crawl URLs from the same domain
- Visited Tracking: Avoid revisiting the same URL
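A minimal sketch of the domain-extraction, URL-filtering, and visited-tracking pieces in Python. The hostname and try_claim helpers are hypothetical names used only for illustration; neither solution below defines them, but both fold the same check into their worker loops.

from threading import Lock

def hostname(url: str) -> str:
    # "http://news.yahoo.com/news" -> "news.yahoo.com"
    return url.split('/')[2]

visited = set()
visited_lock = Lock()

def try_claim(url: str, base: str) -> bool:
    # Returns True exactly once per same-hostname URL, so no URL is crawled twice.
    if hostname(url) != base:
        return False
    with visited_lock:
        if url in visited:
            return False
        visited.add(url)
        return True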
Solution in C++
Time Complexity: O(n + e), where n is the number of reachable same-hostname URLs and e is the total number of links returned by the parser
Space Complexity: O(n) for storing visited URLs and results
/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * class HtmlParser {
 *   public:
 *     vector<string> getUrls(string url);
 * };
 */
class Solution {
public:
    vector<string> crawl(string startUrl, HtmlParser htmlParser) {
        StUrl = getStartUrl(startUrl);
        q.push(startUrl);
        // Each worker repeatedly pops a URL, records it, and enqueues its links.
        auto eUrl = [&]() {
            while (true) {
                mtxq.lock();
                if (q.empty()) {
                    // Queue looks empty: release the lock, wait briefly so other
                    // workers can enqueue new links, then re-check before exiting.
                    mtxq.unlock();
                    this_thread::sleep_for(chrono::milliseconds(20));
                    mtxq.lock();
                    if (q.empty()) { mtxq.unlock(); return; }
                }
                string t = q.front();
                q.pop();
                // Skip URLs that are not under the start URL's hostname.
                if (getStartUrl(t) != StUrl) { mtxq.unlock(); continue; }
                mtxm.lock();
                if (m.count(t)) { mtxm.unlock(); mtxq.unlock(); continue; }
                m[t] = true;
                mtxa.lock();
                rtn.emplace_back(t);
                mtxa.unlock();
                mtxm.unlock();
                mtxq.unlock();
                // Fetch the outgoing links outside of any lock, then enqueue them.
                vector<string> vec = htmlParser.getUrls(t);
                mtxq.lock();
                for (auto& s : vec) q.push(s);
                mtxq.unlock();
            }
        };
        while (n--) pool.emplace_back(thread(eUrl));
        for (auto& t : pool) t.join();
        return rtn;
    }

private:
    // Returns the "http://hostname" prefix (everything before the third '/').
    string getStartUrl(const string& s) {
        int t = 3;
        string rtn = "";
        for (char c : s) {
            if (c == '/') t--;
            if (!t) return rtn;
            rtn += c;
        }
        return rtn;
    }

    vector<string> rtn;                      // collected same-hostname URLs
    unordered_map<string, bool> m;           // visited map
    mutex mtxq, mtxm, mtxa;                  // queue, visited, and result locks
    string StUrl;                            // "http://hostname" of startUrl
    int n = thread::hardware_concurrency() > 0
                ? thread::hardware_concurrency()
                : 4;                         // fall back if concurrency is unknown
    vector<thread> pool;
    queue<string> q;                         // work queue of URLs to crawl
};
Solution in Python
Time Complexity: O(n + e), where n is the number of reachable same-hostname URLs and e is the total number of links returned by the parser
Space Complexity: O(n) for storing visited URLs and results
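The Python solution relies on a parser object supplied by the judge. The starter code describes it roughly like this (commented out, because you should not implement it yourself):

# """
# This is HtmlParser's API interface.
# You should not implement it, or speculate about its implementation
# """
# class HtmlParser(object):
#     def getUrls(self, url):
#         """
#         :type url: str
#         :rtype List[str]
#         """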
from typing import List
from threading import Lock
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        def host(u: str) -> str:
            # "http://news.yahoo.com/news".split('/')[2] -> "news.yahoo.com"
            return u.split('/')[2]

        base = host(startUrl)
        visited = {startUrl}
        lock = Lock()

        def worker(url: str) -> List[str]:
            # Fetch the page's links and return only new, same-hostname URLs.
            next_urls = []
            for u in htmlParser.getUrls(url):
                if host(u) == base:
                    with lock:  # protect the shared visited set
                        if u in visited:
                            continue
                        visited.add(u)
                    next_urls.append(u)
            return next_urls

        with ThreadPoolExecutor(max_workers=32) as ex:
            pending = {ex.submit(worker, startUrl)}
            while pending:
                # Handle whichever crawls finish first and schedule their links.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for fut in done:
                    for nxt in fut.result():
                        pending.add(ex.submit(worker, nxt))
        return list(visited)
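The class can be exercised locally with a small stand-in for the judge's parser. FakeParser and its link table below are invented for illustration; they only loosely mimic Example 1.

from typing import Dict, List

class FakeParser:
    # Hypothetical stand-in for the judge-provided HtmlParser.
    def __init__(self, links: Dict[str, List[str]]):
        self.links = links

    def getUrls(self, url: str) -> List[str]:
        return self.links.get(url, [])

links = {
    "http://news.yahoo.com/news/topics/": ["http://news.yahoo.com/news",
                                           "http://news.google.com"],
    "http://news.yahoo.com/news": ["http://news.yahoo.com"],
}
print(sorted(Solution().crawl("http://news.yahoo.com/news/topics/", FakeParser(links))))
# ['http://news.yahoo.com', 'http://news.yahoo.com/news', 'http://news.yahoo.com/news/topics/']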
Step-by-Step Example
Let's trace through the Python solution with startUrl = "http://news.yahoo.com/news/topics/":
Step 1: Extract base domain
host("http://news.yahoo.com/news/topics/")→ “news.yahoo.com”visited = {"http://news.yahoo.com/news/topics/"}
Step 2: Start worker thread
- Submit initial URL to thread pool
- Worker calls htmlParser.getUrls() on startUrl
Step 3: Process discovered URLs
- For each URL from parser:
- Check if host matches base domain
- If yes, add to visited set (with lock)
- Add to next_urls for further processing
Step 4: Continue until no more URLs
- Submit new URLs to thread pool
- Wait for completion and process results
- Repeat until all URLs are processed
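The pending/done bookkeeping in Step 4 is just concurrent.futures.wait with FIRST_COMPLETED. A toy run, using time.sleep as a stand-in for getUrls, shows the loop shape used in crawl():

from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
import time

with ThreadPoolExecutor(max_workers=4) as ex:
    pending = {ex.submit(time.sleep, d) for d in (0.3, 0.1, 0.2)}
    while pending:
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        # 'done' holds at least one finished future; 'pending' keeps the rest.
        print(len(done), "finished,", len(pending), "still running")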
Key Insights
- Thread Safety: Use locks to protect shared data structures
- Domain Filtering: Only process URLs from the same domain
- Concurrent Processing: Use thread pools for parallel execution
- Deadlock Prevention: Careful lock ordering and timeout handling
- Memory Management: Proper cleanup of thread resources
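One way to read the deadlock-prevention point: when several locks must be held at once (as in the C++ solution's mtxq → mtxm → mtxa nesting), every code path should acquire them in the same order. A small Python illustration, with made-up helper names:

from threading import Lock

queue_lock, visited_lock = Lock(), Lock()

def claim(url, visited, queue):
    # Always take queue_lock before visited_lock. If another code path ever
    # took them in the opposite order, two threads could deadlock waiting on
    # each other.
    with queue_lock:
        with visited_lock:
            if url in visited:
                return False
            visited.add(url)
            queue.append(url)
            return True

visited, q = set(), []
print(claim("http://news.yahoo.com", visited, q))  # True
print(claim("http://news.yahoo.com", visited, q))  # False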
Synchronization Patterns
C++ Approach:
- Multiple Mutexes: Separate locks for queue, map, and results
- Manual Thread Management: Create and join threads manually
- Polling: Sleep and check for work periodically
Python Approach:
- Single Lock: One lock for the visited set
- ThreadPoolExecutor: Automatic thread management
- Future-based: Use futures for asynchronous execution
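An alternative to the future-juggling loop is to crawl the frontier level by level, which keeps all bookkeeping in the main thread and removes the need for a lock entirely. This is a sketch under the same HtmlParser assumption, not the approach used in the solutions above:

from concurrent.futures import ThreadPoolExecutor
from typing import List

def crawl_by_levels(startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
    host = lambda u: u.split('/')[2]
    base = host(startUrl)
    seen = {startUrl}
    frontier = [startUrl]
    with ThreadPoolExecutor(max_workers=8) as ex:
        while frontier:
            # Fetch every URL in the current level in parallel.
            results = ex.map(htmlParser.getUrls, frontier)
            frontier = []
            for links in results:
                for u in links:
                    if host(u) == base and u not in seen:
                        seen.add(u)
                        frontier.append(u)
    return list(seen)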
Common Mistakes
- Race Conditions: Not properly synchronizing access to shared data
- Deadlocks: Incorrect lock ordering or holding multiple locks
- Memory Leaks: Not properly cleaning up thread resources
- Infinite Loops: Not handling empty queue conditions correctly
- Domain Mismatch: Processing URLs from different domains
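The first mistake usually shows up as a check-then-act gap: two threads both see a URL as unvisited and both crawl it. A small demonstration of the buggy pattern and its lock-protected fix (helper names are made up):

from threading import Lock

visited = set()
lock = Lock()

def visit_unsafe(url):
    # BUG: another thread can slip in between the membership test and the add,
    # so the same URL may be crawled twice.
    if url not in visited:
        visited.add(url)
        return True
    return False

def visit_safe(url):
    # The membership test and the insert happen atomically under the lock.
    with lock:
        if url not in visited:
            visited.add(url)
            return True
        return False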