If you've ever relied on a single proxy provider for your web scraping projects, you know the pain. One day everything works smoothly, the next day your success rate drops to 30%. Proxies are temperamental creatures, and that's exactly why smart developers don't put all their eggs in one basket.
The solution? Build a proxy waterfalling system that automatically switches between multiple providers based on cost, reliability, and specific needs. In this guide, we'll walk through creating a custom Scrapy middleware that does exactly that.
Think of proxy waterfalling like having backup internet connections. Your main connection might be fast and cheap, but when it goes down, you automatically switch to your mobile hotspot. Same concept here.
Here's what a good proxy waterfalling system accomplishes:
- Cost savings - Try requests without proxies first, only use them when necessary
- Smart fallback - Start with cheaper proxies, escalate to premium ones if needed
- Better reliability - If one provider fails, another takes over automatically
- Targeted routing - Use specific proxies for specific websites (like Google Search)
The beauty is that you're not locked into one provider's uptime or pricing structure. When you're scraping at scale, this flexibility matters.
Before jumping into code, let's map out the game plan. The strategy we're building follows a clear escalation path:
First attempt: No proxy at all. Many websites don't block initial requests, so why pay for bandwidth you don't need?
Second and third retries: Route through a budget-friendly provider like Scrapingdog. This handles most blocked requests without breaking the bank.
Fourth and fifth retries: Escalate to a mid-tier provider like ScraperAPI for tougher targets.
Final retries: Pull out the big guns with premium proxies from Scrapingbee.
There's one exception to this flow - Google searches. Some providers don't charge extra for Google requests, so we'll route those directly to the right provider from the start.
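Before writing any middleware code, here's the whole plan condensed into a single lookup - just a reference sketch, since the actual middleware below works off Scrapy's retry counter rather than a table:

```python
# Retry count -> provider, per the plan above (None means "no proxy").
# Google searches skip this table and go straight to ScraperAPI.
WATERFALL_PLAN = {
    0: None,            # first attempt: free, no proxy
    1: 'scrapingdog',   # budget tier
    2: 'scrapingdog',
    3: 'scraperapi',    # mid tier
    4: 'scraperapi',
    5: 'scrapingbee',   # premium; anything beyond 5 stays here too
}
```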
👉 Get reliable proxies that won't break your scraping budget with Scrapingdog
Time to get our hands dirty with code. We're creating a custom DownloaderMiddleware that sits between Scrapy's engine and the actual HTTP requests.
Every Scrapy middleware follows a similar pattern. We'll need three core methods and two helper functions:
```python
class ProxyWaterfallMiddleware:

    def __init__(self, settings):
        # Initialize proxy configurations
        pass

    @classmethod
    def from_crawler(cls, crawler):
        # Build the middleware from the Scrapy crawler's settings
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # Main logic for adding proxies to requests
        pass

    def api_key_valid(self, api_key):
        # Check whether an API key is configured
        pass

    def add_proxy(self, request, username, password, host):
        # Apply proxy credentials to the request
        pass
```
The structure is straightforward - we initialize settings, process each request, and have helpers to manage the proxy details.
When Scrapy starts up, we need to load all our proxy provider credentials from the settings file. Each provider formats their authentication differently, so we handle them individually:
```python
def __init__(self, settings):
    # Load API keys from settings
    self.scraperapi_key = settings.get('SCRAPERAPI_KEY', None)
    self.scrapingdog_key = settings.get('SCRAPINGDOG_KEY', None)
    self.scrapingbee_key = settings.get('SCRAPINGBEE_KEY', None)

    # ScraperAPI configuration: key goes in the password field
    self.scraperapi_username = 'scraperapi'
    self.scraperapi_password = self.scraperapi_key
    self.scraperapi_host = 'proxy-server.scraperapi.com:8001'

    # Scrapingdog configuration: key goes in the password field
    self.scrapingdog_username = 'scrapingdog'
    self.scrapingdog_password = self.scrapingdog_key
    self.scrapingdog_host = 'proxy.scrapingdog.com:8081'

    # Scrapingbee configuration: key goes in the username field,
    # request options ride in the password field
    self.scrapingbee_username = self.scrapingbee_key
    self.scrapingbee_password = 'render_js=False'
    self.scrapingbee_host = 'proxy.scrapingbee.com:8886'
```
Notice how each provider has its own quirks. ScraperAPI puts the API key in the password field, while Scrapingbee puts it in the username and uses the password field for request options. This is why we can't just use a generic configuration.
The add_proxy function handles the actual work of attaching proxy credentials to each request. This involves base64 encoding the credentials and setting the proper headers:
```python
import base64


def add_proxy(self, request, username, password, host):
    # Encode the credentials for HTTP Basic proxy authentication
    user_pass = f"{username}:{password}"
    basic_auth = base64.b64encode(user_pass.encode()).decode()

    # Point the request at the proxy and attach the auth header
    request.meta['proxy'] = f"http://{host}"
    request.headers['Proxy-Authorization'] = f'Basic {basic_auth}'
```
This is standard HTTP proxy authentication. The credentials get encoded and attached as headers that the proxy server will read and validate.
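To make that concrete, here's what the encoding produces for a ScraperAPI-style credential pair (the key below is just a placeholder):

```python
import base64

# Placeholder credentials in the ScraperAPI format used above
user_pass = "scraperapi:YOUR_API_KEY"
print(base64.b64encode(user_pass.encode()).decode())
# c2NyYXBlcmFwaTpZT1VSX0FQSV9LRVk=

# The request then goes out carrying:
#   request.meta['proxy']                  -> 'http://proxy-server.scraperapi.com:8001'
#   request.headers['Proxy-Authorization'] -> 'Basic c2NyYXBlcmFwaTpZT1VSX0FQSV9LRVk='
```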
Now for the main event - the process_request method where our waterfalling magic happens. This is where we decide which proxy to use based on the retry count and target website:
```python
def process_request(self, request, spider):
    retry_count = request.meta.get('retry_times', 0)

    # Always use ScraperAPI for Google
    if 'google.com' in request.url:
        if self.api_key_valid(self.scraperapi_key):
            self.add_proxy(request, self.scraperapi_username,
                           self.scraperapi_password, self.scraperapi_host)
            return

    # First attempt - no proxy
    if retry_count == 0:
        return

    # Retry 1-2: Use Scrapingdog
    if retry_count in [1, 2]:
        if self.api_key_valid(self.scrapingdog_key):
            self.add_proxy(request, self.scrapingdog_username,
                           self.scrapingdog_password, self.scrapingdog_host)
            return

    # Retry 3-4: Use ScraperAPI
    if retry_count in [3, 4]:
        if self.api_key_valid(self.scraperapi_key):
            self.add_proxy(request, self.scraperapi_username,
                           self.scraperapi_password, self.scraperapi_host)
            return

    # Retry 5+ (or any earlier tier without a key): Use Scrapingbee
    if self.api_key_valid(self.scrapingbee_key):
        self.add_proxy(request, self.scrapingbee_username,
                       self.scrapingbee_password, self.scrapingbee_host)


def api_key_valid(self, api_key):
    return api_key is not None and len(api_key) > 0
```
The logic flows naturally - start cheap, escalate as needed. And if a provider's API key isn't configured, the request doesn't fail; it simply falls through to the next matching block, with Scrapingbee acting as the catch-all.
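If you want to sanity-check the waterfall without launching a full crawl, a short script like this shows which proxy each retry count picks (it assumes the middleware lives in myproject/middlewares.py and uses dummy keys as stand-ins):

```python
from scrapy import Request
from scrapy.settings import Settings

from myproject.middlewares import ProxyWaterfallMiddleware  # adjust to your project

# Dummy keys so every tier counts as "configured"
settings = Settings({
    'SCRAPERAPI_KEY': 'dummy',
    'SCRAPINGDOG_KEY': 'dummy',
    'SCRAPINGBEE_KEY': 'dummy',
})
middleware = ProxyWaterfallMiddleware(settings)

for retries in range(6):
    request = Request('https://example.com', meta={'retry_times': retries})
    middleware.process_request(request, spider=None)
    print(retries, request.meta.get('proxy'))

# 0 None
# 1 http://proxy.scrapingdog.com:8081
# 2 http://proxy.scrapingdog.com:8081
# 3 http://proxy-server.scraperapi.com:8001
# 4 http://proxy-server.scraperapi.com:8001
# 5 http://proxy.scrapingbee.com:8886
```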
👉 Start waterfalling your proxy requests with cost-effective solutions from Scrapingdog
Building the middleware is one thing, but it won't do anything until we enable it in our project settings. Open up your settings.py file and add these lines:
```python
# settings.py
SCRAPERAPI_KEY = 'your_scraperapi_key_here'
SCRAPINGDOG_KEY = 'your_scrapingdog_key_here'
SCRAPINGBEE_KEY = 'your_scrapingbee_key_here'

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyWaterfallMiddleware': 350,
}
```
The number 350 is the middleware's priority. For outgoing requests, lower numbers run first, so at 350 our logic attaches the proxy before Scrapy's built-in HttpProxyMiddleware (priority 750) handles the request, while the stock RetryMiddleware (550) keeps incrementing the retry_times counter we read on each attempt.
Once configured, every request your spiders make will automatically route through this waterfall system. No changes needed to your existing spider code - it all happens transparently in the background.
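For instance, a bare-bones spider like this toy example (pointed at the scraping sandbox quotes.toscrape.com) picks up the waterfall without a single extra line:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # Requests and retries flow through ProxyWaterfallMiddleware automatically
        for text in response.css('div.quote span.text::text').getall():
            yield {'text': text}
```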
This middleware is a solid foundation, but you can easily extend it for your specific needs. Want to add geotargeting for certain requests? Add a check for a custom meta flag. Need JavaScript rendering for dynamic pages? Modify the proxy parameters based on the target URL.
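As one example, here's roughly how a JavaScript-rendering toggle could look - a sketch that assumes a custom (hypothetical) render_js meta flag and reuses the render_js proxy parameter already present in the Scrapingbee configuration above. It would replace the final Scrapingbee block inside process_request:

```python
# Retry 5+: Use Scrapingbee, optionally with JS rendering switched on
if self.api_key_valid(self.scrapingbee_key):
    # 'render_js' is a custom meta flag set by the spider, e.g.:
    #   yield scrapy.Request(url, meta={'render_js': True})
    if request.meta.get('render_js'):
        password = 'render_js=True'
    else:
        password = self.scrapingbee_password
    self.add_proxy(request, self.scrapingbee_username,
                   password, self.scrapingbee_host)
```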
The key is that you now have a flexible system that balances cost against reliability. Your scraper will always try the cheapest option first, only spending more when necessary. And if one provider goes down or starts blocking you, the system automatically switches to an alternative.
That's the power of proxy waterfalling - you're never stuck relying on a single provider's performance or uptime. Your scraper becomes more resilient, more cost-effective, and ultimately more reliable for production use.