We've all been there. Your automated system is humming along nicely, pulling race results from various websites, when suddenly—bam—a 403 Forbidden error stops everything dead. That's exactly what happened when I was maintaining Canterbury Harriers' results system.
The ironic part? Our scraper was being more polite than a regular web browser. We made one clean request per page, while a typical browser would fire off 73 additional requests for images, stylesheets, and tracking scripts. Yet somehow, we were the ones getting blocked.
My first instinct was to try the usual tricks—adding user-agent headers, mimicking browser behavior, spacing out requests. None of it worked. The target site had clearly upgraded its anti-scraping defenses, and my conventional approaches weren't cutting it anymore.
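For context, here's roughly what those conventional tricks look like as Guzzle request options. The header values and delay range are illustrative, not what the target site actually checked for:

```php
<?php
// Sketch of the "usual tricks" that failed: browser-like headers plus a
// randomized delay between requests. Guzzle accepts these as request options.
$options = [
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      . 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
        'Accept'     => 'text/html,application/xhtml+xml',
    ],
    'delay' => random_int(2000, 5000), // Guzzle's per-request delay, in milliseconds
];
// $response = $client->get($url, $options); // still came back 403
```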
This wasn't just annoying; it was a legitimate data access problem. We weren't doing anything sketchy—just automatically importing publicly available race results that our club members had participated in.
While testing different approaches, I experimented with various web screenshot services to see if they could bypass the block. Most failed just like my direct requests did. But one service stood out: it successfully retrieved the webpage content without triggering any 403 errors.
This discovery led me to explore proxy-based crawling solutions more seriously. The key insight was that rotating residential proxies and smart request handling could make legitimate data collection work again, even on sites with aggressive anti-bot measures.
Once I found a working approach, implementation turned out to be straightforward. The core concept is simple: instead of requesting pages directly, you route requests through a service that handles all the proxy rotation and browser fingerprinting for you.
Here's a clean PHP class that demonstrates the pattern:
```php
<?php

namespace TelfordCodes;

use GuzzleHttp\Client;
use Monolog\Logger;

class CrawlService
{
    private const CRAWL_SERVICE_URL_FMT = "https://api.proxycrawl.com/?token=%s&url=%s";

    public function __construct(private Client $client, private Logger $logger)
    {
    }

    public function getResponse(string $base_url): string
    {
        $enc_base_url = urlencode($base_url);
        $service_url = sprintf(
            self::CRAWL_SERVICE_URL_FMT,
            $_ENV['CRAWL_SERVICE_TOKEN'],
            $enc_base_url
        );

        try {
            $response = $this->client->get($service_url);
            $text = $response->getBody()->getContents();
            $this->logger->info("Successful request for $base_url");
        } catch (\Exception $e) {
            $text = "";
            $this->logger->error("Failed request to $base_url " . $e->getMessage());
        }

        return $text;
    }
}
```
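To make the request flow concrete, here's what `getResponse()` actually builds before sending. The token and target URL below are placeholders, but the encoding behavior is exactly what `urlencode()` does:

```php
<?php
// The service URL is just the token and the urlencoded target interpolated
// into the endpoint format string.
$fmt    = "https://api.proxycrawl.com/?token=%s&url=%s";
$token  = "YOUR_TOKEN";                              // placeholder; the class reads this from $_ENV
$target = "https://example.com/results?page=2";      // placeholder target page

$service_url = sprintf($fmt, $token, urlencode($target));
echo $service_url;
// https://api.proxycrawl.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fexample.com%2Fresults%3Fpage%3D2
```

The `urlencode()` call matters: without it, the `?` and `&` in the target URL would be parsed as parameters of the proxy service's own endpoint.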
What makes this implementation solid:
Uses PHP 8 constructor property promotion for cleaner code
Builds on Guzzle, whose client implements PSR-18, and Monolog, which implements PSR-3
Stores API credentials in environment variables (never in your codebase)
Includes comprehensive error logging for debugging
The beauty of this approach is that it isn't married to one HTTP client. I used Guzzle here; type-hint the PSR interfaces instead of the concrete classes and you could swap in Symfony's HTTP client or any other standard implementation.
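As a sketch of what that interface-based variant could look like (this assumes PSR-compatible packages such as guzzlehttp/guzzle or symfony/http-client plus a PSR-17 factory like nyholm/psr7 are installed; it's not the version running in production):

```php
<?php
// Depending on the PSR interfaces rather than concrete classes makes the
// client swappable. Note that PSR-18 clients expose sendRequest(), not
// Guzzle's get() shortcut, so a PSR-17 request factory is needed as well.
use Psr\Http\Client\ClientInterface;
use Psr\Http\Message\RequestFactoryInterface;
use Psr\Log\LoggerInterface;

class CrawlService
{
    public function __construct(
        private ClientInterface $client,           // e.g. GuzzleHttp\Client or Symfony's Psr18Client
        private RequestFactoryInterface $requests, // e.g. Nyholm\Psr7\Factory\Psr17Factory
        private LoggerInterface $logger,
    ) {
    }

    public function getResponse(string $base_url): string
    {
        // ...build $service_url as before, then:
        // $response = $this->client->sendRequest(
        //     $this->requests->createRequest('GET', $service_url)
        // );
        // ...
        return "";
    }
}
```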
Using the service is remarkably simple. Here's how you'd set it up in practice:
```php
<?php

use GuzzleHttp\Client;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Bramus\Monolog\Formatter\ColoredLineFormatter;
use TelfordCodes\CrawlService;

// Load environment variables from .env
(Dotenv\Dotenv::createImmutable(__DIR__))->load();

$client = new Client();
$logger = new Logger("CrawlService");
$handler = new StreamHandler(__DIR__ . '/log/crawler.log', Logger::DEBUG);
$handler->setFormatter(new ColoredLineFormatter());
$logger->pushHandler($handler);

$crawl_service = new CrawlService($client, $logger);

$url = "https://site-you-need-goes-here";
$resp = $crawl_service->getResponse($url);
```
I added the ColoredLineFormatter for Monolog because it makes scanning logs much faster—successful requests show up in one color, errors in another. When you're debugging why certain pages aren't loading, this visual distinction saves real time.
While the proxy service provides its own dashboard for tracking successes and failures, I strongly recommend maintaining your own detailed logs. Here's why: when something goes wrong at 3 AM and your automated import fails, you need context that goes beyond "request failed."
Your logs should capture:
Which specific URLs are causing problems
Whether failures are consistent or intermittent
Timing patterns that might indicate rate limiting
Error messages that point to underlying issues
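A quick sketch of how the catch block in `CrawlService` could record those details. The field names here are my own convention, not anything Monolog requires; PSR-3 loggers simply accept a context array as the second argument, which most formatters append to the log line:

```php
<?php
// Hypothetical richer error log for the catch block: a context array
// capturing the URL, a status hint, and timing alongside the message.
$context = [
    'url'        => $base_url,
    'status'     => $e->getCode(),   // 403, 429, etc. when the client sets it
    'elapsed_ms' => $elapsed_ms,     // how long the request ran before failing
    'error'      => $e->getMessage(),
];
$this->logger->error('Failed request', $context);
```

With fields like these, spotting "every failure is the same URL" or "failures cluster at the top of the hour" becomes a grep rather than a guessing game.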
After switching to this proxy-based approach, the 403 errors disappeared completely. Race results started flowing into our system again automatically, and I stopped getting panicked messages from club members wondering why their results weren't posted.
The broader lesson here isn't really about any specific service—it's about recognizing when your scraping approach needs to evolve. Websites are getting smarter about bot detection, and legitimate automated systems need more sophisticated tools to keep functioning.
For our use case—importing publicly available sports results for a running club—this solution struck the right balance between reliability and simplicity. The code is maintainable, the service handles the complexity of proxy management, and our members get their results posted automatically again.
If you're facing similar challenges with legitimate web scraping projects, consider whether your approach has outgrown simple HTTP requests. Sometimes the right tool makes all the difference between a system that breaks constantly and one that just works.