You're building a review analysis tool but hit a wall when trying to scrape beyond the first page of Amazon reviews. Sound familiar? You're not alone—Amazon's anti-bot measures are designed to stop exactly what you're trying to do, and a simple loop won't cut it.
Let's fix your scraping workflow so you can actually collect the data you need for your semantic analysis project, without the headaches.
Your code looks clean, but there are several reasons it's failing to grab reviews from multiple pages:
Amazon's bot detection is smarter than basic headers. Just adding a User-Agent isn't enough anymore. Amazon analyzes request patterns, timing, IP addresses, and dozens of other signals to identify automated scrapers.
The URL pagination parameter might not be functioning as expected. Amazon dynamically loads content and frequently changes its page structure. Your pageNumber parameter could be ignored or formatted incorrectly.
Rate limiting kicks in fast. When you loop through pages rapidly, Amazon's systems flag your IP address within seconds, returning empty pages or CAPTCHAs instead of actual review data.
The "next button disabled" check is unreliable. DOM elements change, and that specific class name (a-disabled a-last) might not appear consistently across different product pages or regions.
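Before changing anything, it helps to confirm which failure you're actually hitting. A quick heuristic sketch: check the raw HTML for Amazon's bot-check page versus real review content. The marker strings below are assumptions about what Amazon's interstitial pages typically contain, not an official contract:

```python
from bs4 import BeautifulSoup

# Strings that commonly appear on Amazon's bot-check / CAPTCHA pages.
# These are heuristics and may change; treat them as assumptions.
BLOCK_MARKERS = (
    "api-services-support@amazon.com",
    "enter the characters you see below",
    "captcha",
)

def looks_blocked(html: str) -> bool:
    """Return True if the response looks like a CAPTCHA/block page."""
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)

def has_reviews(html: str) -> bool:
    """Return True if at least one review element is present."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("div", {"data-hook": "review"}) is not None
```

If `looks_blocked` fires, the problem is detection, not your pagination logic; if both return False, the page structure probably changed.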
Here's what you need to modify in your existing code:
Add random delays between requests. This mimics human browsing behavior and helps avoid immediate rate limits:
```python
import time
import random

for x in range(1, 10):
    soup = get_soup(f'https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/product-reviews/B098LG3N6R/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber={x}')
    get_reviews(soup)
    # Stop once the "Next" button is disabled (no more pages)
    if soup.find('li', {'class': "a-disabled a-last"}):
        break
    time.sleep(random.uniform(2, 5))  # Wait 2-5 seconds before the next page
```
Improve your headers. Add more realistic browser fingerprints:
```python
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Referer': 'https://www.amazon.com/'
}
```
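It's also worth sending those headers through a `requests.Session` rather than one-off calls, so cookies persist across pages the way they would in a real browser. A minimal sketch of what `get_soup` could look like under that assumption (your actual helper may differ):

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Referer': 'https://www.amazon.com/',
}

def make_session() -> requests.Session:
    """Build a session that carries the headers and cookies across requests."""
    session = requests.Session()
    session.headers.update(HEADERS)
    return session

def get_soup(session: requests.Session, url: str) -> BeautifulSoup:
    """Fetch a page and parse it; raises on HTTP errors like 403/503."""
    response = session.get(url, timeout=15)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```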
Check if reviews actually loaded. Before processing, verify you got real content:
```python
def get_reviews(soup):
    reviews = soup.find_all('div', {'data-hook': 'review'})
    if not reviews:
        print("No reviews found - possible blocking")
        return False
    # ... rest of your extraction code
    return True
```
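For the extraction itself, each review element carries further `data-hook` attributes. The hooks below (`review-title`, `review-body`, `review-star-rating`) reflect Amazon's current markup but should be treated as assumptions that can change without notice:

```python
from bs4 import BeautifulSoup

def extract_reviews(soup):
    """Pull title, body, and rating text from each review element."""
    results = []
    for review in soup.find_all('div', {'data-hook': 'review'}):
        title = review.find(attrs={'data-hook': 'review-title'})
        body = review.find(attrs={'data-hook': 'review-body'})
        rating = review.find(attrs={'data-hook': 'review-star-rating'})
        results.append({
            'title': title.get_text(strip=True) if title else None,
            'body': body.get_text(strip=True) if body else None,
            'rating': rating.get_text(strip=True) if rating else None,
        })
    return results
```

Guarding every field with an `if ... else None` keeps one oddly structured review from crashing the whole run.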
But here's the truth: even with these improvements, you'll still run into blocks eventually. Amazon actively works against scrapers, and they're good at it.
When you're dealing with a target as sophisticated as Amazon, residential proxies and smart request routing become essential. You need systems that automatically rotate IPs, handle CAPTCHAs, and adjust request patterns in real-time.
For projects that depend on reliable data collection at scale, setting up that infrastructure yourself means weeks of debugging and constant maintenance. Many developers working on review analysis, price monitoring, or market research projects save significant development time by using established scraping infrastructure that already handles Amazon's defenses.
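To give a sense of the do-it-yourself version: `requests` accepts a `proxies` mapping per call, so the simplest rotation is cycling through a pool. The proxy URLs below are placeholders, not real endpoints; you'd substitute credentials from whatever provider you use:

```python
from itertools import cycle

# Placeholder endpoints -- substitute real proxies from your provider.
PROXY_POOL = cycle([
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
])

def next_proxies():
    """Return a requests-style proxies mapping using the next pool entry."""
    proxy = next(PROXY_POOL)
    return {"http": proxy, "https": proxy}

# Usage with requests:
#   response = requests.get(url, headers=HEADERS, proxies=next_proxies(), timeout=15)
```

Round-robin rotation like this is the bare minimum; it won't handle CAPTCHAs or retire burned IPs, which is exactly the maintenance burden described above.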
If you're pulling reviews for semantic analysis or building a production tool, consistency matters more than scraping a few pages manually. The difference between a hobby project and something scalable often comes down to whether your scraper works reliably every single time.
If you want to stick with your current approach for learning purposes, here's a more robust pagination strategy:
Look for the actual "Next" button URL instead of manually constructing page numbers:
```python
def get_next_page_url(soup):
    next_button = soup.find('li', {'class': 'a-last'})
    # The disabled state is a class on the <li> itself, not a nested element
    if next_button and 'a-disabled' not in next_button.get('class', []):
        next_link = next_button.find('a')
        if next_link and next_link.get('href'):
            return 'https://www.amazon.com' + next_link.get('href')
    return None
```
```python
url = 'https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/product-reviews/B098LG3N6R/'

while url:
    soup = get_soup(url)
    if not get_reviews(soup):
        break
    url = get_next_page_url(soup)
    time.sleep(random.uniform(3, 6))
```
This approach follows Amazon's actual pagination links rather than guessing URL patterns.
You mentioned you haven't used Selenium. While it can render JavaScript and look more like a real browser, it's actually slower and easier to detect than well-configured HTTP requests with Beautiful Soup.
Selenium is overkill for Amazon reviews since the content loads server-side. You'd add complexity, slower execution times, and more detection surface area without solving the core problem: Amazon knows you're a bot.
Before running your scraper across hundreds of pages, test it properly:
- Print the soup object to verify you're getting actual HTML, not a CAPTCHA page
- Check response status codes: 503 or 403 means you're blocked
- Monitor the length of `reviewList`: if it stops growing suddenly, you hit a block
- Test with different products, since some pages have different HTML structures
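Status-code checks pair naturally with backoff: when you do get a 403 or 503, waiting longer before each retry is the only polite option. A minimal sketch with an injectable fetch function; the defaults assume a requests-style response object, and the retry statuses are assumptions about how Amazon throttles:

```python
import random
import time

import requests

RETRYABLE = {403, 429, 503}  # statuses commonly returned when throttling or blocking

def fetch_with_backoff(url, getter=requests.get, max_retries=4, base_delay=2.0):
    """Fetch url, retrying with exponential backoff plus jitter on block statuses."""
    for attempt in range(max_retries):
        response = getter(url, timeout=15)
        if response.status_code not in RETRYABLE:
            return response
        # Delays grow 2s, 4s, 8s... with jitter so retries don't look clockwork
        time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    return None  # still blocked after every retry
```

Passing `getter` as a parameter also makes the retry logic easy to test without hitting the network.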
Add this debugging to your loop:
```python
for x in range(1, 10):
    soup = get_soup(f'https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/product-reviews/B098LG3N6R/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber={x}')
    review_count = len(soup.find_all('div', {'data-hook': 'review'}))
    print(f"Page {x}: Found {review_count} reviews")
    if len(soup.text) < 10000:  # Suspiciously short response
        print("Possible block detected")
        break
```
Scraping Amazon reviews across multiple pages requires more than just looping through URLs. You need realistic headers, random delays, proper pagination handling, and ideally, rotating proxies to avoid detection.
For a learning project where you're scraping a handful of products, the improvements above will help you get further. For production use cases where you need thousands of reviews reliably, you'll want infrastructure designed specifically for large-scale e-commerce scraping that handles blocks automatically.
Your semantic analysis project deserves clean, consistent data. Whether you build it yourself or use battle-tested infrastructure, make sure your scraping foundation is solid before investing time in the analysis layer. The best NLP models in the world can't fix gaps in your training data caused by inconsistent scraping.