If you've ever relied on a single proxy provider for your web scraping projects, you know the pain. One day everything works smoothly, the next day your success rate drops to 30%. Proxies are temperamental creatures, and that's exactly why smart developers don't put all their eggs in one basket.
The solution? Build a proxy waterfalling system that automatically switches between multiple providers based on cost, reliability, and specific needs. In this guide, we'll walk through creating a custom Scrapy middleware that does exactly that.
Think of proxy waterfalling like having backup internet connections. Your main connection might be fast and cheap, but when it goes down, you automatically switch to your mobile hotspot. Same concept here.
Here's what a good proxy waterfalling system accomplishes:
- Cost savings - Try requests without proxies first, only use them when necessary
- Smart fallback - Start with cheaper proxies, escalate to premium ones if needed
- Better reliability - If one provider fails, another takes over automatically
- Targeted routing - Use specific proxies for specific websites (like Google Search)
The beauty is that you're not locked into one provider's uptime or pricing structure. When you're scraping at scale, this flexibility matters.
Before jumping into code, let's map out the game plan. The strategy we're building follows a clear escalation path:
First attempt: No proxy at all. Many websites don't block initial requests, so why pay for bandwidth you don't need?
Second and third retries: Route through a budget-friendly provider like Scrapingdog. This handles most blocked requests without breaking the bank.
Fourth and fifth retries: Escalate to a mid-tier provider like ScraperAPI for tougher targets.
Final retries: Pull out the big guns with premium proxies from Scrapingbee.
There's one exception to this flow - Google searches. Some providers don't charge extra for Google requests, so we'll route those directly to the right provider from the start.
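Before writing any middleware code, here's the whole plan condensed into a single lookup - just a reference sketch, since the actual middleware below works off Scrapy's retry counter rather than a table:

```python
# Retry count -> provider, per the plan above (None means "no proxy").
# Google searches skip this table and go straight to ScraperAPI.
WATERFALL_PLAN = {
    0: None,            # first attempt: free, no proxy
    1: 'scrapingdog',   # budget tier
    2: 'scrapingdog',
    3: 'scraperapi',    # mid tier
    4: 'scraperapi',
    5: 'scrapingbee',   # premium; anything beyond 5 stays here too
}
```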
👉 Get reliable proxies that won't break your scraping budget with Scrapingdog
Time to get our hands dirty with code. We're creating a custom DownloaderMiddleware that sits between Scrapy's engine and the actual HTTP requests.
Every Scrapy middleware follows a similar pattern. We'll need three core methods and two helper functions:
```python
class ProxyWaterfallMiddleware:

    def __init__(self, settings):
        # Initialize proxy configurations
        pass

    @classmethod
    def from_crawler(cls, crawler):
        # Build the middleware from the Scrapy crawler's settings
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # Main logic for adding proxies to requests
        pass

    def api_key_valid(self, api_key):
        # Check whether an API key is configured
        pass

    def add_proxy(self, request, username, password, host):
        # Apply proxy credentials to the request
        pass
```
The structure is straightforward - we initialize settings, process each request, and have helpers to manage the proxy details.
When Scrapy starts up, we need to load all our proxy provider credentials from the settings file. Each provider formats their authentication differently, so we handle them individually:
```python
def __init__(self, settings):
    # Load API keys from settings
    self.scraperapi_key = settings.get('SCRAPERAPI_KEY', None)
    self.scrapingdog_key = settings.get('SCRAPINGDOG_KEY', None)
    self.scrapingbee_key = settings.get('SCRAPINGBEE_KEY', None)

    # ScraperAPI configuration: key goes in the password field
    self.scraperapi_username = 'scraperapi'
    self.scraperapi_password = self.scraperapi_key
    self.scraperapi_host = 'proxy-server.scraperapi.com:8001'

    # Scrapingdog configuration: key goes in the password field
    self.scrapingdog_username = 'scrapingdog'
    self.scrapingdog_password = self.scrapingdog_key
    self.scrapingdog_host = 'proxy.scrapingdog.com:8081'

    # Scrapingbee configuration: key goes in the username field,
    # request options ride in the password field
    self.scrapingbee_username = self.scrapingbee_key
    self.scrapingbee_password = 'render_js=False'
    self.scrapingbee_host = 'proxy.scrapingbee.com:8886'
```
Notice how each provider has its own quirks. ScraperAPI puts the API key in the password field, while Scrapingbee puts it in the username and uses the password field for request options. This is why we can't just use a generic configuration.
The add_proxy function handles the actual work of attaching proxy credentials to each request. This involves base64 encoding the credentials and setting the proper headers:
```python
import base64


def add_proxy(self, request, username, password, host):
    # Encode the credentials for HTTP Basic proxy authentication
    user_pass = f"{username}:{password}"
    basic_auth = base64.b64encode(user_pass.encode()).decode()

    # Point the request at the proxy and attach the auth header
    request.meta['proxy'] = f"http://{host}"
    request.headers['Proxy-Authorization'] = f'Basic {basic_auth}'
```
This is standard HTTP proxy authentication. The credentials get encoded and attached as headers that the proxy server will read and validate.
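To make that concrete, here's what the encoding produces for a ScraperAPI-style credential pair (the key below is just a placeholder):

```python
import base64

# Placeholder credentials in the ScraperAPI format used above
user_pass = "scraperapi:YOUR_API_KEY"
print(base64.b64encode(user_pass.encode()).decode())
# c2NyYXBlcmFwaTpZT1VSX0FQSV9LRVk=

# The request then goes out carrying:
#   request.meta['proxy']                  -> 'http://proxy-server.scraperapi.com:8001'
#   request.headers['Proxy-Authorization'] -> 'Basic c2NyYXBlcmFwaTpZT1VSX0FQSV9LRVk='
```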
Now for the main event - the process_request method where our waterfalling magic happens. This is where we decide which proxy to use based on the retry count and target website:
```python
def process_request(self, request, spider):
    retry_count = request.meta.get('retry_times', 0)

    # Always use ScraperAPI for Google
    if 'google.com' in request.url:
        if self.api_key_valid(self.scraperapi_key):
            self.add_proxy(request, self.scraperapi_username,
                           self.scraperapi_password, self.scraperapi_host)
            return

    # First attempt - no proxy
    if retry_count == 0:
        return

    # Retry 1-2: Use Scrapingdog
    if retry_count in [1, 2]:
        if self.api_key_valid(self.scrapingdog_key):
            self.add_proxy(request, self.scrapingdog_username,
                           self.scrapingdog_password, self.scrapingdog_host)
            return

    # Retry 3-4: Use ScraperAPI
    if retry_count in [3, 4]:
        if self.api_key_valid(self.scraperapi_key):
            self.add_proxy(request, self.scraperapi_username,
                           self.scraperapi_password, self.scraperapi_host)
            return

    # Retry 5+ (or any earlier tier without a key): Use Scrapingbee
    if self.api_key_valid(self.scrapingbee_key):
        self.add_proxy(request, self.scrapingbee_username,
                       self.scrapingbee_password, self.scrapingbee_host)


def api_key_valid(self, api_key):
    return api_key is not None and len(api_key) > 0
```
The logic flows naturally - start cheap, escalate as needed. And if a provider's API key isn't configured, the request doesn't fail; it simply falls through to the next matching block, with Scrapingbee acting as the catch-all.
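If you want to sanity-check the waterfall without launching a full crawl, a short script like this shows which proxy each retry count picks (it assumes the middleware lives in myproject/middlewares.py and uses dummy keys as stand-ins):

```python
from scrapy import Request
from scrapy.settings import Settings

from myproject.middlewares import ProxyWaterfallMiddleware  # adjust to your project

# Dummy keys so every tier counts as "configured"
settings = Settings({
    'SCRAPERAPI_KEY': 'dummy',
    'SCRAPINGDOG_KEY': 'dummy',
    'SCRAPINGBEE_KEY': 'dummy',
})
middleware = ProxyWaterfallMiddleware(settings)

for retries in range(6):
    request = Request('https://example.com', meta={'retry_times': retries})
    middleware.process_request(request, spider=None)
    print(retries, request.meta.get('proxy'))

# 0 None
# 1 http://proxy.scrapingdog.com:8081
# 2 http://proxy.scrapingdog.com:8081
# 3 http://proxy-server.scraperapi.com:8001
# 4 http://proxy-server.scraperapi.com:8001
# 5 http://proxy.scrapingbee.com:8886
```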
👉 Start waterfalling your proxy requests with cost-effective solutions from Scrapingdog
Building the middleware is one thing, but it won't do anything until we enable it in our project settings. Open up your settings.py file and add these lines:
```python
# settings.py
SCRAPERAPI_KEY = 'your_scraperapi_key_here'
SCRAPINGDOG_KEY = 'your_scrapingdog_key_here'
SCRAPINGBEE_KEY = 'your_scrapingbee_key_here'

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyWaterfallMiddleware': 350,
}
```
The number 350 is the middleware's priority. For outgoing requests, lower numbers run first, so at 350 our logic attaches the proxy before Scrapy's built-in HttpProxyMiddleware (priority 750) handles the request, while the stock RetryMiddleware (550) keeps incrementing the retry_times counter we read on each attempt.
Once configured, every request your spiders make will automatically route through this waterfall system. No changes needed to your existing spider code - it all happens transparently in the background.
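For instance, a bare-bones spider like this toy example (pointed at the scraping sandbox quotes.toscrape.com) picks up the waterfall without a single extra line:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # Requests and retries flow through ProxyWaterfallMiddleware automatically
        for text in response.css('div.quote span.text::text').getall():
            yield {'text': text}
```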
This middleware is a solid foundation, but you can easily extend it for your specific needs. Want to add geotargeting for certain requests? Add a check for a custom meta flag. Need JavaScript rendering for dynamic pages? Modify the proxy parameters based on the target URL.
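As one example, here's roughly how a JavaScript-rendering toggle could look - a sketch that assumes a custom (hypothetical) render_js meta flag and reuses the render_js proxy parameter already present in the Scrapingbee configuration above. It would replace the final Scrapingbee block inside process_request:

```python
# Retry 5+: Use Scrapingbee, optionally with JS rendering switched on
if self.api_key_valid(self.scrapingbee_key):
    # 'render_js' is a custom meta flag set by the spider, e.g.:
    #   yield scrapy.Request(url, meta={'render_js': True})
    if request.meta.get('render_js'):
        password = 'render_js=True'
    else:
        password = self.scrapingbee_password
    self.add_proxy(request, self.scrapingbee_username,
                   password, self.scrapingbee_host)
```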
The key is that you now have a flexible system that balances cost against reliability. Your scraper will always try the cheapest option first, only spending more when necessary. And if one provider goes down or starts blocking you, the system automatically switches to an alternative.
That's the power of proxy waterfalling - you're never stuck relying on a single provider's performance or uptime. Your scraper becomes more resilient, more cost-effective, and ultimately more reliable for production use.