[Medium] 1242. Web Crawler Multithreaded

This is a multithreading problem that requires implementing a concurrent web crawler. The key insight is using proper synchronization mechanisms to avoid race conditions while crawling URLs from the same domain concurrently.

Problem Description

Given a start URL and an HTML parser, implement a multithreaded web crawler that:

Crawls all URLs from the same domain as the start URL
Uses multiple threads for concurrent crawling
Avoids visiting the same URL multiple times
Returns all discovered URLs

Examples

Example 1:

Input: 
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com"
]
startUrl = "http://news.yahoo.com/news/topics/"
Output: [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/"
]

Example 2:

Input:
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com"
]
startUrl = "http://news.google.com"
Output: ["http://news.google.com"]

Constraints

1 <= urls.length <= 1000
1 <= urls[i].length <= 300
startUrl is one of the urls
All URLs have the same hostname

Clarification Questions

Before diving into the solution, here are 5 important clarifications and assumptions to discuss during an interview:

Crawling scope: What URLs should we crawl? (Assumption: Only URLs with the same hostname as startUrl - per constraints, all URLs have same hostname)
Thread safety: Do we need thread-safe operations? (Assumption: Yes - multithreaded environment, need synchronization for shared state)
URL format: What constitutes a valid URL? (Assumption: Standard URL format, startUrl is guaranteed to be in urls list)
Duplicate handling: How should we handle duplicate URLs? (Assumption: Each URL should be crawled only once - need to track visited URLs)
Return format: What should we return? (Assumption: List of all URLs that can be reached from startUrl with same hostname)

Interview Deduction Process (20 minutes)

Step 1: Brute-Force Approach (5 minutes)

Initial Thought: “I need to crawl web pages. Let me start from startUrl and follow links sequentially.”

Naive Solution: Start from startUrl, fetch HTML, extract URLs, filter by hostname, recursively crawl each URL sequentially.

Complexity: O(n) time where n is number of URLs, O(n) space

Issues:

Sequential crawling is slow
Doesn’t leverage multithreading
Not optimal for concurrent operations
Can be optimized

Step 2: Semi-Optimized Approach (7 minutes)

Insight: “I can use BFS/DFS with thread pool to crawl URLs concurrently.”

Improved Solution: Use BFS/DFS with thread pool. Maintain queue of URLs to crawl. Use multiple threads to fetch URLs concurrently. Track visited URLs to avoid duplicates.

Complexity: O(n) time with better concurrency, O(n) space

Improvements:

Concurrent crawling is faster
Thread pool manages threads efficiently
Handles visited URLs correctly
Better utilizes resources

Step 3: Optimized Solution (8 minutes)

Final Optimization: “BFS with thread pool and proper synchronization is optimal.”

Best Solution: BFS with thread pool. Use queue for URLs to crawl, visited set for tracking. Use thread pool to fetch URLs concurrently. Synchronize access to shared data structures.

Complexity: O(n) time with concurrency, O(n) space

Key Realizations:

BFS is natural for web crawling
Thread pool enables concurrent fetching
Synchronization is crucial for thread safety
Visited tracking prevents infinite loops

Approach

The solution involves:

Domain Extraction: Extract the base domain from the start URL
Thread Pool: Use multiple threads for concurrent crawling
Synchronization: Use locks to prevent race conditions
URL Filtering: Only crawl URLs from the same domain
Visited Tracking: Avoid revisiting the same URL

Solution in Java

Time Complexity: O(n) where n is the number of URLs in the same domain
Space Complexity: O(n) for storing visited URLs and results

// import java.util.*;
/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation class HtmlParser {
 *   public:
 *     String[]getUrls(String url);
 * }
 */
class Solution {
    public String[]crawl(String startUrl, HtmlParser htmlParser) {
        StUrl = getStartUrl(startUrl);
        q.push(startUrl);
        var eUrl = [&]() {
            while(true) {
                mtxq.lock();
                if(!q.size()) {
                    mtxq.unlock();
                    this_thread::sleep_for(chrono::milliseconds(20));
                    mtxq.lock();
                    if(!q.size()) {mtxq.unlock(); return;}
                }
                String t=q.getFirst();
                q.pop();
                if(getStartUrl(t)!=StUrl) {mtxq.unlock(); continue;}
                mtxm.lock();
                if(m.count(t)) {mtxm.unlock();mtxq.unlock(); continue;}
                m[t] = true;
                mtxa.lock();
                rtn.add(t);
                mtxa.unlock();
                mtxm.unlock();
                mtxq.unlock();
                String[]vec(htmlParser.getUrls(t));
                mtxq.lock();
                for(auto s:vec) {q.push(s);}
                mtxq.unlock();
            }
            return;
        }
        while(n--) pool.add(thread(eUrl));
        for(auto t:pool) t.join();
        return rtn;
    }
    String[]rtn;
    HashMap<String, boolean> m = new HashMap<String, boolean>();
    lock mtxq, mtxm, mtxa;
    String StUrl;
    int n = thread::hardware_concurrency();
    thread[]pool;
    Queue<String> q = new LinkedList<>();

    String getStartUrl(String s){
        int t = 3;
        String rtn="";
        for (char c: s){
            if(c== '/') t--;
            if(!t) return rtn;
            rtn.add(c);
        }
        return rtn;
    }
}

Solution in Python

Time Complexity: O(n) where n is the number of URLs in the same domain
Space Complexity: O(n) for storing visited URLs and results

from threading import Lock
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

class Solution:
    public static void crawl(this, startUrl: str, htmlParser: 'HtmlParser') . List[str]:
        public static void host(u: str) . str:
            return u.split('/')[2]
        
        base = host(startUrl)
        visited = set([startUrl])
        lock = Lock()

        public static void worker(url: str) . List[str]:
            next_urls = []
            for u in htmlParser.getUrls(url):
                if host(u) == base:
                    with lock:
                        if u in visited:
                            continue
                        visited.add(u)
                    next_urls.append(u)
            return next_urls
        
        with ThreadPoolExecutor(max_workers=32) as ex:
            pending = {ex.submit(worker, startUrl)}
            while pending:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for fut in done:
                    for nxt in fut.result():
                        pending.add(ex.submit(worker, nxt))
        
        return list(visited)

Step-by-Step Example

Let’s trace through the Python solution with startUrl = “http://news.yahoo.com/news/topics/”:

Step 1: Extract base domain

host("http://news.yahoo.com/news/topics/") → “news.yahoo.com”
visited = {"http://news.yahoo.com/news/topics/"}

Step 2: Start worker thread

Submit initial URL to thread pool
Worker calls htmlParser.getUrls() on startUrl

Step 3: Process discovered URLs

For each URL from parser:
- Check if host matches base domain
- If yes, add to visited set (with lock)
- Add to next_urls for further processing

Step 4: Continue until no more URLs

Submit new URLs to thread pool
Wait for completion and process results
Repeat until all URLs are processed

Key Insights

Thread Safety: Use locks to protect shared data structures
Domain Filtering: Only process URLs from the same domain
Concurrent Processing: Use thread pools for parallel execution
Deadlock Prevention: Careful lock ordering and timeout handling
Memory Management: Proper cleanup of thread resources

Synchronization Patterns

Java Approach:

Multiple Mutexes: Separate locks for queue, map, and results
Manual Thread Management: Create and join threads manually
Polling: Sleep and check for work periodically

Python Approach:

Single Lock: One lock for the visited set
ThreadPoolExecutor: Automatic thread management
Future-based: Use futures for asynchronous execution

Common Mistakes

Race Conditions: Not properly synchronizing access to shared data
Deadlocks: Incorrect lock ordering or holding multiple locks
Memory Leaks: Not properly cleaning up thread resources
Infinite Loops: Not handling empty queue conditions correctly
Domain Mismatch: Processing URLs from different domains

Robina Li

[Medium] 1242. Web Crawler Multithreaded

[Medium] 1242. Web Crawler Multithreaded

Problem Description

Examples

Constraints

Clarification Questions

Interview Deduction Process (20 minutes)

Step 1: Brute-Force Approach (5 minutes)

Step 2: Semi-Optimized Approach (7 minutes)

Step 3: Optimized Solution (8 minutes)

Approach

Solution in Java

Solution in Python

Step-by-Step Example

Key Insights

Synchronization Patterns

Java Approach:

Python Approach:

Common Mistakes

Related Posts

Recent Posts