[Medium] 1242. Web Crawler Multithreaded

This is a multithreading problem that requires implementing a concurrent web crawler. The key insight is using proper synchronization mechanisms to avoid race conditions while crawling URLs from the same domain concurrently.

Given a start URL and an HTML parser, implement a multithreaded web crawler that:

Crawls all URLs from the same domain as the start URL
Uses multiple threads for concurrent crawling
Avoids visiting the same URL multiple times
Returns all discovered URLs

Examples

Example 1:

Input: 
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com"
]
startUrl = "http://news.yahoo.com/news/topics/"
Output: [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/"
]

Example 2:

Input:
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com"
]
startUrl = "http://news.google.com"
Output: ["http://news.google.com"]

Constraints

1 <= urls.length <= 1000
1 <= urls[i].length <= 300
startUrl is one of the urls
All URLs have the same hostname

Thinking Process

The solution involves:

Domain Extraction: Extract the base domain from the start URL
Thread Pool: Use multiple threads for concurrent crawling
Synchronization: Use locks to prevent race conditions
URL Filtering: Only crawl URLs from the same domain
Visited Tracking: Avoid revisiting the same URL

Common Approaches

Typical techniques for this pattern:

Approach	Time	Space	Notes
Brute force (this problem)	Often O(n^2) or O(2^n)	O(n)	Baseline; clarifies the optimization target
Sort + scan	O(n log n)	O(1)	Pairs, intervals, greedy ordering
Hash map / set	O(n)	O(n)	Frequency, membership, two-sum style
Single-pass linear	O(n)	O(1)	Two pointers, sliding window, Kadane

Solution

Time Complexity: O(n) where n is the number of URLs in the same domain
Space Complexity: O(n) for storing visited URLs and results

/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * interface HtmlParser {
 *     public List<String> getUrls(String url);
 * }
 */
class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String hostname = getHostname(startUrl);
        Set<String> visited = ConcurrentHashMap.newKeySet();
        Queue<String> queue = new ConcurrentLinkedQueue<>();
        visited.add(startUrl);
        queue.offer(startUrl);

        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (int i = 0; i < 16; i++) {
            pool.submit(() -> {
                while (true) {
                    String url = queue.poll();
                    if (url == null) {
                        break;
                    }
                    for (String next : htmlParser.getUrls(url)) {
                        if (!getHostname(next).equals(hostname)) {
                            continue;
                        }
                        if (visited.add(next)) {
                            queue.offer(next);
                        }
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(visited);
    }

    private String getHostname(String url) {
        int start = url.indexOf("://") + 3;
        int end = url.indexOf('/', start);
        return end == -1 ? url.substring(start) : url.substring(start, end);
    }
}

Solution Explanation

Approach: Brute force (this problem)

Key idea: The solution involves:

How the code works:

Domain Extraction: Extract the base domain from the start URL
Thread Pool: Use multiple threads for concurrent crawling
Synchronization: Use locks to prevent race conditions
URL Filtering: Only crawl URLs from the same domain
Visited Tracking: Avoid revisiting the same URL

Walkthrough — input urls = [, expected output [:

Initialize variables from the problem setup.
Apply the main loop / recursion until the condition is met.
Confirm the result matches the expected output.
Step-by-Step Example

Let’s trace through the Python solution with startUrl = “http://news.yahoo.com/news/topics/”:

Step 1: Extract base domain

host("http://news.yahoo.com/news/topics/") → “news.yahoo.com”
visited = {"http://news.yahoo.com/news/topics/"}

Step 2: Start worker thread

Submit initial URL to thread pool
Worker calls htmlParser.getUrls() on startUrl

Step 3: Process discovered URLs

For each URL from parser:
- Check if host matches base domain
- If yes, add to visited set (with lock)
- Add to next_urls for further processing

Step 4: Continue until no more URLs

Submit new URLs to thread pool
Wait for completion and process results
Repeat until all URLs are processed

Synchronization Patterns

C++ Approach:

Multiple Mutexes: Separate locks for queue, map, and results
Manual Thread Management: Create and join threads manually
Polling: Sleep and check for work periodically

Python Approach:

Single Lock: One lock for the visited set
ThreadPoolExecutor: Automatic thread management
Future-based: Use futures for asynchronous execution

Common Mistakes

Race Conditions: Not properly synchronizing access to shared data
Deadlocks: Incorrect lock ordering or holding multiple locks
Memory Leaks: Not properly cleaning up thread resources
Infinite Loops: Not handling empty queue conditions correctly
Domain Mismatch: Processing URLs from different domains

References

LC 1242: Web Crawler Multithreaded on LeetCode
LeetCode Discuss — LC 1242: Web Crawler Multithreaded
LeetCode Editorial (may require premium)

Key Takeaways

Thread Safety: Use locks to protect shared data structures
Domain Filtering: Only process URLs from the same domain
Concurrent Processing: Use thread pools for parallel execution
Deadlock Prevention: Careful lock ordering and timeout handling
Memory Management: Proper cleanup of thread resources

Robina Li