Ever built a scraper that worked beautifully on a few hundred pages, then completely choked when you tried to scale up? Yeah, me too. The slowness was unbearable. But here's what I learned: there's actually a straightforward way to fix this, and it doesn't require rebuilding everything from scratch.
This article walks through what concurrent threads are, why they matter for web scraping performance, and how to set them up properly. You'll see real numbers from actual tests—not theory, just what happens when you increase from 100 threads to 500 threads on the same scraping job.
Think of it this way: you're at a coffee shop, and there's one barista making drinks. Each customer waits in line. Now imagine five baristas working at once—suddenly, five customers get served simultaneously. That's basically what concurrent threads do for your scraper.
When you send requests to ScraperAPI, concurrent threads let you fire off multiple requests at the same time instead of waiting for each one to finish before starting the next. With 5 concurrent threads, you're making 5 requests in parallel. More threads means more simultaneous requests, which means faster results.
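To make that concrete, here's a minimal sketch of the difference. It simulates network latency with `time.sleep` instead of real requests (the `fetch` function and `example.com` URLs are placeholders, not part of any real API), but the timing pattern is the same one you'll see with a real scraper:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a network request that takes ~0.2 seconds
    time.sleep(0.2)
    return url

urls = [f"https://example.com/page/{i}" for i in range(10)]

# Sequential: each request waits for the previous one (~2 s total)
start = time.time()
sequential = [fetch(u) for u in urls]
sequential_time = time.time() - start

# Concurrent: 5 threads handle 5 requests at once (~0.4 s total)
start = time.time()
with ThreadPoolExecutor(max_workers=5) as executor:
    concurrent = list(executor.map(fetch, urls))
concurrent_time = time.time() - start

print(f"Sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

Ten requests at 0.2 seconds each take about 2 seconds sequentially, but only about 0.4 seconds with 5 threads, because the pool works through them in batches of five.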
Different ScraperAPI plans come with different thread limits. The Business plan gives you up to 100 concurrent threads. The Scaling plan bumps that to 200. And if you need more than that, the Enterprise plan doesn't have a fixed cap—they'll work with you to figure out what makes sense for your specific use case.
Enough explanation. Let's see what actually happens when you increase concurrent threads.
I ran a simple test: scrape over 1,000 URLs, first with 100 concurrent threads, then with 500. Same URLs, same code, just different thread counts. The goal was to measure the actual time difference.
First, I needed a list of URLs to scrape. I crawled the tech section of CNN's website using Scrapy to extract around 1,000 URLs. This step is just setup—in your actual project, these would be whatever pages you need to scrape.
I opened the terminal, installed Scrapy and BeautifulSoup (`pip install scrapy beautifulsoup4`), and created a spider that crawls CNN's tech section and saves all the URLs it finds.
Here's the spider code:
```python
import scrapy
from urllib.parse import urljoin

class CnnSpider(scrapy.Spider):
    name = "cnn"
    allowed_domains = ["edition.cnn.com"]
    start_urls = ["https://edition.cnn.com/business/tech"]
    seen_urls = set()
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
    }

    def parse(self, response):
        links = response.css("a::attr(href)").getall()
        for link in links:
            if link.startswith("/"):
                full_url = urljoin("https://edition.cnn.com", link)
            elif link.startswith("http") and "edition.cnn.com" in link:
                full_url = link
            else:
                continue

            if full_url not in self.seen_urls:
                self.seen_urls.add(full_url)
                yield {"url": full_url}
                yield response.follow(full_url, callback=self.parse)

        if len(self.seen_urls) >= 1000:
            self.crawler.engine.close_spider(self, "URL limit reached")
```
The custom_settings part makes the spider look like a real browser to avoid getting blocked. The parse() function grabs all links on the page, converts them to full URLs, and keeps track of which ones it's already seen. Once it hits 1,000 URLs, it stops.
Run the spider from the project's spiders folder with `scrapy crawl cnn -o cnn_urls.json` (the output filename is up to you), and it saves all the URLs it found into a JSON file.
Now comes the interesting part. I created a Python script that reads those URLs and sends them to ScraperAPI. The script uses concurrent threads to send multiple requests at once.
If you're serious about web scraping at scale, handling rate limits and rotating proxies becomes critical. That's where tools designed specifically for this problem come in handy. 👉 See how ScraperAPI handles concurrent requests and proxy rotation automatically, so you can focus on getting your data instead of managing infrastructure.
Here's the scraping code:
```python
import requests
import json
import csv
import time
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

API_KEY = 'ScraperAPI API_key'
NUM_RETRIES = 3
NUM_THREADS = 100

# Load the URLs collected by the Scrapy spider
with open("path/to/URLs_json_file", "r") as file:
    raw_data = json.load(file)
list_of_urls = [item["url"] for item in raw_data if "url" in item]

def scrape_url(url):
    params = {
        'api_key': API_KEY,
        'url': url
    }

    # Retry up to NUM_RETRIES times; the for/else below runs the else
    # block only if no attempt succeeded (the loop never hit break)
    for _ in range(NUM_RETRIES):
        try:
            response = requests.get('http://api.scraperapi.com/', params=params)
            if response.status_code in [200, 404]:
                break
        except requests.exceptions.ConnectionError:
            continue
    else:
        return {
            'url': url,
            'h1': 'Failed after retries',
            'title': '',
            'meta_description': '',
            'status_code': 'Error'
        }

    if response.status_code == 200:
        # Parse the page and pull out the H1, title, and meta description
        soup = BeautifulSoup(response.text, "html.parser")
        h1 = soup.find("h1")
        title = soup.title.string.strip() if soup.title else "No Title Found"
        meta_tag = soup.find("meta", attrs={"name": "description"})
        meta_description = meta_tag["content"].strip() if meta_tag and meta_tag.has_attr("content") else "No Meta Description"
        return {
            'url': url,
            'h1': h1.get_text(strip=True) if h1 else 'No H1 found',
            'title': title,
            'meta_description': meta_description,
            'status_code': response.status_code
        }
    else:
        return {
            'url': url,
            'h1': 'No H1 - Status {}'.format(response.status_code),
            'title': '',
            'meta_description': '',
            'status_code': response.status_code
        }

# Fire off all requests through a pool of NUM_THREADS worker threads
start_time = time.time()
with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    scraped_data = list(executor.map(scrape_url, list_of_urls))
elapsed_time = time.time() - start_time

print(f"Using {NUM_THREADS} concurrent threads, scraping completed in {elapsed_time:.2f} seconds.")

with open("cnn_h1_results.csv", "w", newline='', encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "h1", "title", "meta_description", "status_code"])
    writer.writeheader()
    writer.writerows(scraped_data)
```
The scrape_url() function sends each URL to ScraperAPI. If something goes wrong, it retries up to 3 times. When it gets a successful response, BeautifulSoup extracts the H1 tag, page title, and meta description.
The ThreadPoolExecutor part is what handles the concurrent requests. You set max_workers to however many threads you want running at once. The script measures how long everything takes, then saves the results to a CSV file.
With 100 concurrent threads, scraping those 1,000+ URLs took 100.68 seconds.
Then I changed one line of code—set NUM_THREADS to 500—and ran it again. Same URLs, same everything else. This time it took 23.56 seconds.
That's more than 4 times faster. Same data, same quality, just way less waiting around.
The difference is dramatic when you're working with thousands or tens of thousands of pages. What used to take hours can finish in minutes. And if you need even more speed, the setup scales—you can push thread counts higher depending on your plan and infrastructure.
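If you want to find the sweet spot for your own plan and workload, a quick sweep over a few thread counts tells you a lot before you commit to a full run. This is a minimal sketch using simulated work (`fake_scrape` and the `example.com` URLs are placeholders); swap in your real scraping function to benchmark actual throughput:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_scrape(url):
    # Stand-in for a real request; replace with your scrape function
    time.sleep(0.05)
    return url

urls = [f"https://example.com/{i}" for i in range(100)]
timings = {}

# Time the same batch of URLs at several thread counts
for num_threads in (10, 25, 50):
    start = time.time()
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        list(executor.map(fake_scrape, urls))
    timings[num_threads] = time.time() - start

for n, t in sorted(timings.items()):
    print(f"{n:>3} threads: {t:.2f}s")
```

Keep in mind the gains flatten out once you hit your plan's concurrency limit or your network's capacity, so more threads isn't always faster in practice.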
When you're scraping at scale, speed isn't just about convenience—it directly impacts what you can accomplish. Being able to process data faster means you can handle more pages, respond to changes quicker, and keep your projects moving without bottlenecks.
Concurrent threads make that possible. They're not complicated to implement, and the performance gains are real and measurable. Whether you're scraping product data, monitoring competitors, or gathering research information, using concurrent threads properly can completely change how fast your scraper runs.
If you're looking to scale up your web scraping without constantly running into rate limits or proxy issues, ScraperAPI's concurrent thread support handles the heavy lifting so you can focus on using the data instead of fighting with infrastructure.