[Medium] 1242. Web Crawler Multithreaded

This is a multithreading problem that requires implementing a concurrent web crawler. The key insight is using proper synchronization mechanisms to avoid race conditions while crawling URLs from the same domain concurrently.

Problem Description

Given a start URL and an HTML parser, implement a multithreaded web crawler that:

  1. Crawls all URLs that share the same hostname as the start URL
  2. Uses multiple threads for concurrent crawling
  3. Avoids visiting the same URL multiple times
  4. Returns all discovered URLs, in any order
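
The crawler interacts with pages only through the HtmlParser interface supplied by the judge. A rough Python sketch of that interface (the C++ declaration is reproduced above the C++ solution below):

from typing import List

class HtmlParser:
    def getUrls(self, url: str) -> List[str]:
        """Return all URLs linked from the page at `url` (implemented by the judge)."""
        ...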

Examples

Example 1:

Input: 
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com"
]
startUrl = "http://news.yahoo.com/news/topics/"
Output: [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/"
]

Example 2:

Input:
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com"
]
startUrl = "http://news.google.com"
Output: ["http://news.google.com"]

Constraints

  • 1 <= urls.length <= 1000
  • 1 <= urls[i].length <= 300
  • startUrl is one of the urls
  • urls may contain different hostnames (see Example 1); only URLs sharing the hostname of startUrl should be returned

Clarification Questions

Before diving into the solution, here are 5 important clarifications and assumptions to discuss during an interview:

  1. Crawling scope: What URLs should we crawl? (Assumption: Only URLs with the same hostname as startUrl; links to other hostnames, such as news.google.com in Example 1, are ignored)

  2. Thread safety: Do we need thread-safe operations? (Assumption: Yes - multithreaded environment, need synchronization for shared state)

  3. URL format: What constitutes a valid URL? (Assumption: Standard URL format, startUrl is guaranteed to be in urls list)

  4. Duplicate handling: How should we handle duplicate URLs? (Assumption: Each URL should be crawled only once - need to track visited URLs)

  5. Return format: What should we return? (Assumption: List of all URLs that can be reached from startUrl with same hostname)

Interview Deduction Process (20 minutes)

Step 1: Brute-Force Approach (5 minutes)

Initial Thought: “I need to crawl web pages. Let me start from startUrl and follow links sequentially.”

Naive Solution: Start from startUrl, fetch HTML, extract URLs, filter by hostname, recursively crawl each URL sequentially.

Complexity: O(n) time where n is number of URLs, O(n) space

Issues:

  • Sequential crawling is slow: each fetch must finish before the next one starts
  • Doesn't leverage multithreading
  • Total latency is the sum of all blocking getUrls calls, which concurrency could hide
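
A minimal single-threaded sketch of this brute-force approach, assuming the HtmlParser interface above (crawl_sequential is an illustrative name):

def crawl_sequential(startUrl, htmlParser):
    # Everything between the second and third '/' is the hostname.
    host = lambda u: u.split('/')[2]
    base = host(startUrl)
    visited = set()

    def dfs(url):
        if url in visited or host(url) != base:
            return
        visited.add(url)
        for nxt in htmlParser.getUrls(url):   # blocking fetch, one page at a time
            dfs(nxt)                          # an explicit stack avoids deep recursion

    dfs(startUrl)
    return list(visited)

Every getUrls call waits for the previous one to finish, which is exactly what the multithreaded variants below fix.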

Step 2: Semi-Optimized Approach (7 minutes)

Insight: “I can use BFS/DFS with a thread pool to crawl URLs concurrently.”

Improved Solution: Use BFS/DFS with a thread pool: maintain a queue of URLs to crawl, let multiple threads fetch pages concurrently, and track visited URLs to avoid duplicates.

Complexity: O(n) time with better concurrency, O(n) space

Improvements:

  • Concurrent crawling is faster
  • Thread pool manages threads efficiently
  • Handles visited URLs correctly
  • Better utilizes resources
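
One way to realize this intermediate idea is a level-synchronized BFS: the pool fetches the whole current frontier in parallel while the main thread alone updates the visited set, so no lock is needed yet. A sketch under those assumptions (crawl_bfs_levels and max_workers=8 are illustrative choices):

from concurrent.futures import ThreadPoolExecutor

def crawl_bfs_levels(startUrl, htmlParser, max_workers=8):
    host = lambda u: u.split('/')[2]
    base = host(startUrl)
    visited = {startUrl}
    frontier = [startUrl]
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        while frontier:
            # Fetch every page of the current BFS level concurrently.
            next_frontier = []
            for found in ex.map(htmlParser.getUrls, frontier):
                for u in found:
                    # Only the main thread touches `visited` here,
                    # so this variant needs no lock.
                    if host(u) == base and u not in visited:
                        visited.add(u)
                        next_frontier.append(u)
            frontier = next_frontier
    return list(visited)

The drawback is the barrier between levels: the next level cannot start until the slowest fetch of the current level finishes, which the future-based solution below avoids.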

Step 3: Optimized Solution (8 minutes)

Final Optimization: “BFS with a thread pool and proper synchronization is optimal.”

Best Solution: BFS with a thread pool: use a queue of URLs to crawl and a visited set for deduplication, let the pool fetch pages concurrently, and synchronize all access to the shared data structures.

Complexity: O(n) time with concurrency, O(n) space

Key Realizations:

  1. BFS is natural for web crawling
  2. Thread pool enables concurrent fetching
  3. Synchronization is crucial for thread safety
  4. Visited tracking prevents infinite loops

Approach

The solution involves:

  1. Domain Extraction: Extract the hostname of the start URL (see the snippet after this list)
  2. Thread Pool: Use multiple threads for concurrent crawling
  3. Synchronization: Use locks to prevent race conditions
  4. URL Filtering: Only crawl URLs from the same domain
  5. Visited Tracking: Avoid revisiting the same URL
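
Step 1 is the only string handling needed: because the URLs here all use the http:// scheme, the hostname is simply the third '/'-separated piece, which is what the host() helper in the Python solution (and getStartUrl in the C++ one) relies on; urllib.parse.urlsplit(url).netloc would be a more general alternative:

# "http://news.yahoo.com/news/topics/".split('/')
#   -> ['http:', '', 'news.yahoo.com', 'news', 'topics', '']
def host(url: str) -> str:
    return url.split('/')[2]   # index 2 is the hostname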

Solution in C++

Time Complexity: O(n) where n is the number of URLs in the same domain
Space Complexity: O(n) for storing visited URLs and results

/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * class HtmlParser {
 *   public:
 *     vector<string> getUrls(string url);
 * };
 */
class Solution {
public:
    vector<string> crawl(string startUrl, HtmlParser htmlParser) {
        StUrl = getStartUrl(startUrl);  // "http://<hostname>" prefix of the start URL
        q.push(startUrl);
        // Worker: pop a URL, record it, fetch its links, push the new ones back.
        auto eUrl = [&]() {
            while (true) {
                mtxq.lock();
                if (q.empty()) {
                    // Queue looks empty, but other threads may still be fetching
                    // and could add more work; sleep briefly and check once more
                    // before giving up.
                    mtxq.unlock();
                    this_thread::sleep_for(chrono::milliseconds(20));
                    mtxq.lock();
                    if (q.empty()) { mtxq.unlock(); return; }
                }
                string t = q.front();
                q.pop();
                // Skip URLs outside the start hostname.
                if (getStartUrl(t) != StUrl) { mtxq.unlock(); continue; }
                // Skip URLs already claimed by another thread.
                mtxm.lock();
                if (m.count(t)) { mtxm.unlock(); mtxq.unlock(); continue; }
                m[t] = true;
                mtxa.lock();
                rtn.emplace_back(t);
                mtxa.unlock();
                mtxm.unlock();
                mtxq.unlock();
                // Fetch outside any lock: getUrls is the slow, blocking call.
                vector<string> vec(htmlParser.getUrls(t));
                mtxq.lock();
                for (auto& s : vec) q.push(s);
                mtxq.unlock();
            }
        };
        if (n < 1) n = 4;  // hardware_concurrency() may report 0
        while (n--) pool.emplace_back(eUrl);
        for (auto& t : pool) t.join();
        return rtn;
    }
private:
    vector<string> rtn;                 // result list (guarded by mtxa)
    unordered_map<string, bool> m;      // visited URLs (guarded by mtxm)
    mutex mtxq, mtxm, mtxa;             // locks for queue, visited map, results
    string StUrl;                       // hostname prefix of startUrl
    int n = thread::hardware_concurrency();
    vector<thread> pool;
    queue<string> q;                    // URLs waiting to be crawled (guarded by mtxq)

    // Return everything before the third '/', i.e. "http://hostname".
    string getStartUrl(string& s) {
        int t = 3;
        string rtn = "";
        for (char& c : s) {
            if (c == '/') t--;
            if (!t) return rtn;
            rtn.push_back(c);
        }
        return rtn;
    }
};

Solution in Python

Time Complexity: O(n) where n is the number of URLs in the same domain
Space Complexity: O(n) for storing visited URLs and results

from typing import List
from threading import Lock
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        def host(u: str) -> str:
            # "http://news.yahoo.com/news" -> "news.yahoo.com"
            return u.split('/')[2]

        base = host(startUrl)
        visited = set([startUrl])
        lock = Lock()            # guards the shared visited set

        def worker(url: str) -> List[str]:
            # Fetch one page and return the newly discovered same-host URLs.
            next_urls = []
            for u in htmlParser.getUrls(url):
                if host(u) == base:
                    with lock:
                        if u in visited:
                            continue
                        visited.add(u)   # claim the URL before scheduling it
                    next_urls.append(u)
            return next_urls

        with ThreadPoolExecutor(max_workers=32) as ex:
            pending = {ex.submit(worker, startUrl)}
            while pending:
                # Handle futures as soon as any completes, scheduling
                # follow-up fetches for the URLs it discovered.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for fut in done:
                    for nxt in fut.result():
                        pending.add(ex.submit(worker, nxt))

        return list(visited)

Step-by-Step Example

Let’s trace through the Python solution with startUrl = “http://news.yahoo.com/news/topics/”:

Step 1: Extract base domain

  • host("http://news.yahoo.com/news/topics/") → “news.yahoo.com”
  • visited = {"http://news.yahoo.com/news/topics/"}

Step 2: Start worker thread

  • Submit initial URL to thread pool
  • Worker calls htmlParser.getUrls() on startUrl

Step 3: Process discovered URLs

  • For each URL from parser:
    • Check if host matches base domain
    • If yes, add to visited set (with lock)
    • Add to next_urls for further processing

Step 4: Continue until no more URLs

  • Submit new URLs to thread pool
  • Wait for completion and process results
  • Repeat until all URLs are processed

Key Insights

  1. Thread Safety: Use locks to protect shared data structures (see the sketch after this list)
  2. Domain Filtering: Only process URLs from the same domain
  3. Concurrent Processing: Use thread pools for parallel execution
  4. Deadlock Prevention: Careful lock ordering and timeout handling
  5. Memory Management: Proper cleanup of thread resources
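
For insight 1, the crucial detail is that checking whether a URL has been seen and marking it as seen must happen as one atomic step under the lock. A minimal sketch, assuming a shared visited set like the one in the Python solution (try_claim is an illustrative helper name, not part of the solutions above):

from threading import Lock

visited = set()
visited_lock = Lock()

def try_claim(url: str) -> bool:
    # Atomic test-and-mark: splitting the membership check and the add across
    # two separate lock acquisitions would let two threads both claim `url`.
    with visited_lock:
        if url in visited:
            return False
        visited.add(url)
        return True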

Synchronization Patterns

C++ Approach:

  • Multiple Mutexes: Separate locks for queue, map, and results
  • Manual Thread Management: Create and join threads manually
  • Polling: Sleep and check for work periodically (a Python analogue is sketched after this list)
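
For comparison, a hedged Python analogue of this manual-thread, polling pattern (crawl_with_workers, num_workers=8, and the 50 ms idle timeout are illustrative choices, not part of the original solutions):

import queue
import threading

def crawl_with_workers(startUrl, htmlParser, num_workers=8):
    host = lambda u: u.split('/')[2]
    base = host(startUrl)
    work = queue.Queue()           # URLs waiting to be fetched
    work.put(startUrl)
    visited = {startUrl}
    lock = threading.Lock()        # guards the shared visited set

    def worker():
        while True:
            try:
                # Poll for work; give up after a short idle period, mirroring
                # the sleep-and-recheck loop in the C++ solution.
                url = work.get(timeout=0.05)
            except queue.Empty:
                return
            for u in htmlParser.getUrls(url):
                if host(u) != base:
                    continue
                with lock:         # claim each URL exactly once
                    if u in visited:
                        continue
                    visited.add(u)
                work.put(u)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(visited)

As in the C++ version, the idle timeout replaces a condition variable; a worker that times out while another thread is still fetching simply leaves the remaining work to the threads that are still alive.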

Python Approach:

  • Single Lock: One lock for the visited set
  • ThreadPoolExecutor: Automatic thread management
  • Future-based: Use futures for asynchronous execution

Common Mistakes

  • Race Conditions: Not properly synchronizing access to shared data
  • Deadlocks: Incorrect lock ordering or holding multiple locks
  • Memory Leaks: Not properly cleaning up thread resources
  • Infinite Loops: Not handling empty queue conditions correctly
  • Domain Mismatch: Processing URLs from different domains