Making web scraping work at scale isn't about fancy code—it's about having a proxy solution that just works. This guide shows you exactly how to hook up ScraperAPI with Scrapy, so you can stop worrying about IP bans and start actually collecting data.
Look, we've all been there. You write a beautiful spider, run it, and within minutes you're blocked. Your IP is toast. You scramble to find new proxies, rotate headers, handle CAPTCHAs—it's exhausting.
ScraperAPI handles all that nonsense for you. Send them a URL, they manage the proxy rotation, header configuration, retries, and even CAPTCHA solving. You just get back the HTML you asked for.
Here's what your typical Scrapy request looks like normally:
```python
yield scrapy.Request(url='http://quotes.toscrape.com/page/1/', callback=self.parse)
```
Simple enough. But this goes straight to the target website, which means you're one suspicious pattern away from getting blocked.
Integrating ScraperAPI with Scrapy is surprisingly straightforward. You've got three options, depending on how you like to work.
The first method is the most explicit. You create a helper function that rewrites your URLs into ScraperAPI requests:
```python
import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scraperapi_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = ['http://quotes.toscrape.com/page/1/']
        for url in urls:
            yield scrapy.Request(url=get_scraperapi_url(url), callback=self.parse)
```
Every request now routes through ScraperAPI's endpoint. Clean and visible.
The second method: if you don't want to write that helper function yourself, ScraperAPI provides an SDK that does it for you.
Install it:
```bash
pip install scraperapi-sdk
```
Then use it in your spider:
```python
import scrapy
from scraperapi_sdk import ScraperAPIClient

API_KEY = 'YOUR_API_KEY'
client = ScraperAPIClient(API_KEY)

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = ['http://quotes.toscrape.com/page/1/']
        for url in urls:
            yield scrapy.Request(url=client.scrapyGet(url), callback=self.parse)
```
Same result, less code on your end.
The third method is for anyone used to working with traditional proxy setups. ScraperAPI works that way too:
```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = ['http://quotes.toscrape.com/page/1/']
        proxy = 'http://scraperapi:YOUR_API_KEY@proxy-server.scraperapi.com:8001'
        for url in urls:
            yield scrapy.Request(url=url, meta={'proxy': proxy}, callback=self.parse)
```
Scrapy skips SSL verification by default, so you don't need to mess with certificates here.
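If every request in your project should go through the proxy, you can also centralize it in a small downloader middleware instead of setting `meta={'proxy': ...}` on each request. This is a sketch, not part of ScraperAPI's documentation: the class name is made up, and you'd enable it yourself via `DOWNLOADER_MIDDLEWARES` in `settings.py`.

```python
# A minimal sketch of a downloader middleware (e.g. in middlewares.py) that
# attaches the ScraperAPI proxy to every outgoing request. The class name is
# an assumption; enable it via DOWNLOADER_MIDDLEWARES in settings.py.
class ScraperAPIProxyMiddleware:
    PROXY = 'http://scraperapi:YOUR_API_KEY@proxy-server.scraperapi.com:8001'

    def process_request(self, request, spider):
        # Only set the proxy if the request hasn't picked one already.
        request.meta.setdefault('proxy', self.PROXY)
```

With this in place, spiders don't need to know about the proxy at all, and a single request can still opt out by setting its own `meta['proxy']`.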
Sometimes the default settings aren't enough. ScraperAPI lets you tweak how it handles requests by adding parameters.
Need JavaScript rendered? Add `render=true` to your request parameters:

```python
def get_scraperapi_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'render': 'true'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
```
Other useful parameters include:

- `country_code` – route requests through a specific country
- `device_type` – simulate a mobile or desktop device
- `session_number` – maintain the same IP across multiple requests
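These parameters can be combined in the helper from earlier. The `**params` version below is a sketch of my own, not ScraperAPI's code; the parameter names themselves come from the list above.

```python
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scraperapi_url(url, **params):
    # Any extra keyword arguments (render, country_code, device_type,
    # session_number, ...) are forwarded to ScraperAPI as query parameters.
    payload = {'api_key': API_KEY, 'url': url, **params}
    return 'http://api.scraperapi.com/?' + urlencode(payload)
```

Then `get_scraperapi_url(url, country_code='us', session_number='42')` pins the request to a US IP and keeps the same IP for every request carrying that session number.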
When you're dealing with sophisticated sites that require specific configurations beyond basic scraping, having these options available makes all the difference. For sites with complex anti-bot systems, 👉 ScraperAPI's advanced features handle the heavy lifting automatically, so you can focus on extracting data rather than fighting blocks.
ScraperAPI automatically selects optimal headers for maximum success rates. Usually that's what you want.
But sometimes you need to send specific headers—maybe for authentication, or because the site requires a particular format. In those cases, add `keep_headers=true`:
```python
def get_scraperapi_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'keep_headers': 'true'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

# Then, inside your spider's start_requests:
headers = {"X-MyHeader": "123"}
yield scrapy.Request(url=get_scraperapi_url(url), headers=headers, callback=self.parse)
```
Your custom headers now pass through to the target site.
Here's where most people leave performance on the table. ScraperAPI can handle serious throughput, but only if you configure Scrapy properly.
ScraperAPI's whole point is letting you scale from hundreds to millions of pages. That requires parallel requests.
In your settings.py file, set concurrent requests to match your plan's thread limit:
```python
CONCURRENT_REQUESTS = 100  # adjust to match your plan's thread limit
```
Also make sure these settings are disabled:
```python
DOWNLOAD_DELAY = 0
RANDOMIZE_DOWNLOAD_DELAY = False
```
Those delay settings throttle your scraper—unnecessary when you're routing through ScraperAPI.
Even with a 97%+ success rate, some requests will fail. ScraperAPI returns a 500 status code for these and doesn't charge you.
Configure Scrapy to automatically retry failed requests:
```python
RETRY_TIMES = 3
```
After three retries, virtually everything succeeds unless the target site itself is down.
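One thing worth checking: Scrapy's retry middleware only retries the status codes listed in `RETRY_HTTP_CODES`. Its default list already includes 500, so ScraperAPI's failure responses are retried out of the box—but if you've overridden that setting, make sure 500 stays in it. A `settings.py` fragment (the list shown mirrors Scrapy's defaults):

```python
RETRY_ENABLED = True
RETRY_TIMES = 3
# Scrapy's default retry codes; 500 must be present so ScraperAPI's
# failed-request responses get retried rather than dropped.
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```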
Scrapy checks robots.txt by default. When your requests route through the api.scraperapi.com endpoint, that means Scrapy fetches ScraperAPI's robots.txt rather than the target site's, which can interfere with your crawl.
Turn it off:
```python
ROBOTSTXT_OBEY = False
```
You're sending requests through ScraperAPI anyway, so this check is redundant.
Here's what your settings.py should look like:
```python
CONCURRENT_REQUESTS = 100
DOWNLOAD_DELAY = 0
RANDOMIZE_DOWNLOAD_DELAY = False
RETRY_TIMES = 3
ROBOTSTXT_OBEY = False
```
Integrating ScraperAPI with Scrapy isn't complicated—pick one of the three methods, adjust your settings, and you're scraping at scale. No more IP rotation headaches, no more CAPTCHA nightmares, no more manually managing proxy pools.
The real benefit shows up when you're scaling. Going from scraping a few hundred pages to millions becomes a configuration change rather than a complete infrastructure overhaul. For teams serious about extracting data without the operational overhead, 👉 ScraperAPI removes the infrastructure complexity entirely—you write spiders, they handle everything else.