Web scraping Amazon can feel like navigating a minefield. The site's sophisticated anti-bot measures make it tough to extract product data reliably. But here's the good news: with the right approach and tools, you can build a robust Amazon scraper that actually works.
In this guide, we'll create two Python scripts that work together to scrape Amazon product listings efficiently. The first script fetches product URLs from category pages, while the second extracts detailed information from individual listings.
Amazon doesn't exactly roll out the welcome mat for scrapers. Their systems actively detect and block automated requests through various techniques like IP tracking, user agent verification, and behavioral analysis. You might get a few successful requests, then suddenly hit a wall of CAPTCHAs or connection timeouts.
This is where many DIY scraping projects stall out. Setting up proxy rotation, managing headers, and handling dynamic content rendering quickly becomes a full-time job.
We'll build this in two parts. The first script collects product listing URLs and saves them to a text file. The second script processes each URL to extract product details and stores everything in JSON format.
👉 Get reliable Amazon data extraction with automated proxy rotation and rendering
Here's the URL fetching script:
```python
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    headers = {
        'authority': 'www.amazon.com',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'referer': 'https://google.com',
        'accept-language': 'en-US,en;q=0.9',
    }
    links_file = 'links.txt'
    links = []

    # Read the ScraperAPI key from a local file so it stays out of the code
    with open('API_KEY.txt', encoding='utf8') as f:
        API_KEY = f.read().strip()

    URL_TO_SCRAPE = 'https://www.amazon.com/s?i=electronics&rh=n%3A172541%2Cp_n_feature_four_browse-bin%3A12097501011&lo=image'
    payload = {'api_key': API_KEY, 'url': URL_TO_SCRAPE, 'render': 'false'}
    r = requests.get('http://api.scraperapi.com', params=payload, timeout=60)

    if r.status_code == 200:
        soup = BeautifulSoup(r.text.strip(), 'lxml')
        # Product links on search result pages live inside <h2> headings
        links_section = soup.select('h2 > .a-link-normal')
        for link in links_section:
            links.append('https://amazon.com' + link['href'])

    if len(links) > 0:
        with open(links_file, 'a+', encoding='utf8') as f:
            f.write('\n'.join(links) + '\n')
        print('Links stored successfully.')
```
I'm targeting the electronics category here, but you can swap in any category URL you want. The key is the `h2 > .a-link-normal` selector, which grabs only the product links we need and filters out all the noise that Amazon's pages contain.
Once you have your URLs, the parsing script does the heavy lifting. It pulls title, price, availability, ASIN, and product features from each listing:
```python
import json

import requests
from bs4 import BeautifulSoup

# Read the ScraperAPI key from a local file
with open('API_KEY.txt', encoding='utf8') as f:
    API_KEY = f.read().strip()


def parse(url):
    record = {}
    payload = {'api_key': API_KEY, 'url': url, 'render': 'false'}
    r = requests.get('http://api.scraperapi.com', params=payload, timeout=60)

    if r.status_code == 200:
        soup = BeautifulSoup(r.text.strip(), 'lxml')

        title_section = soup.select('#productTitle')
        price_section = soup.select('#priceblock_ourprice')
        availability_section = soup.select('#availability')
        features_section = soup.select('#feature-bullets')
        asin_section = soup.find('link', {'rel': 'canonical'})

        # Default to None so a missing section doesn't raise NameError
        title = price = availability = features = asin = None

        if title_section:
            title = title_section[0].text.strip()
        if price_section:
            price = price_section[0].text.strip()
        if availability_section:
            availability = availability_section[0].text.strip()
        if features_section:
            features = features_section[0].text.strip()
        if asin_section:
            # The ASIN is the last path segment of the canonical URL
            asin = asin_section['href'].split('/')[-1]

        record = {'title': title, 'price': price, 'availability': availability,
                  'asin': asin, 'features': features}

    return record


if __name__ == '__main__':
    with open('links.txt', encoding='utf8') as f:
        urls = [line.strip() for line in f if line.strip()]

    records = [parse(url) for url in urls]

    # Store everything as JSON for downstream processing
    with open('products.json', 'w', encoding='utf8') as f:
        json.dump(records, f, indent=2)
```
The script uses CSS selectors to target specific page elements. Amazon's product pages are reasonably consistent in structure, which makes this approach reliable. The ASIN extraction is particularly useful since it's Amazon's unique product identifier that you can use for tracking and database management.
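The ASIN logic is worth isolating so you can reuse and test it. Here's a minimal sketch of that step as a standalone helper (the product URL below is just an illustrative example):

```python
def extract_asin(canonical_url: str) -> str:
    """Return the trailing path segment of an Amazon canonical URL,
    which is the product's ASIN."""
    return canonical_url.rstrip('/').split('/')[-1]


# Example: ASIN is the segment after /dp/
print(extract_asin('https://www.amazon.com/Some-Product-Name/dp/B08N5WRWNW'))
# B08N5WRWNW
```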
This basic framework opens up possibilities for more sophisticated projects. You could build a price monitoring system that tracks products over time and alerts you to deals. Or create an ASIN database for market research. 👉 Scale your Amazon data collection with enterprise-grade scraping infrastructure
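To make the price-monitoring idea concrete, here's a minimal sketch of the comparison step. The `parse_price` helper and the 10% drop threshold are illustrative choices, not part of the scripts above; `history` is assumed to be a dict mapping ASIN to the last price you recorded:

```python
def parse_price(price_text: str) -> float:
    """Convert a scraped price string like '$1,299.99' to a float."""
    return float(price_text.replace('$', '').replace(',', ''))


def price_dropped(history: dict, asin: str, new_price: float,
                  threshold: float = 0.10) -> bool:
    """Return True when the new price is at least `threshold` (10% by
    default) below the last price recorded for this ASIN."""
    last = history.get(asin)
    if last is None:
        return False  # first sighting, nothing to compare against
    return new_price <= last * (1 - threshold)
```

Run your scraper on a schedule, feed each record's price through `parse_price`, and fire an alert whenever `price_dropped` returns True.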
The beauty of this approach is that you're not wrestling with proxy management, CAPTCHA solving, or browser automation complexity. The heavy lifting happens behind the scenes, letting you focus on what matters: extracting and analyzing the data you need.
Want to add review scraping? Just extend the parsing function with additional selectors. Need to track inventory changes? Run your scripts on a schedule and compare the results. The modular structure makes it easy to adapt for different use cases.
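As a sketch of the inventory-tracking idea, assuming you key each run's records by ASIN, the comparison step could look like this:

```python
def availability_changes(old_run: dict, new_run: dict) -> dict:
    """Compare two scrape runs (dicts keyed by ASIN) and return
    {asin: (old_availability, new_availability)} for every flip."""
    changes = {}
    for asin, new_rec in new_run.items():
        old_rec = old_run.get(asin)
        if old_rec and old_rec['availability'] != new_rec['availability']:
            changes[asin] = (old_rec['availability'], new_rec['availability'])
    return changes
```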
Individual scraping projects often start simple but hit roadblocks when scaling up. IP blocks become frequent, request patterns get flagged, and maintaining infrastructure becomes a headache you didn't sign up for.
The solution is handling anti-bot measures, proxy rotation, and request rendering automatically. This lets you scrape thousands of products without constant babysitting or infrastructure management. Your code stays clean and focused on business logic rather than scraping mechanics.
The data you collect gets stored in JSON format, making it easy to feed into databases, analytics tools, or whatever system you're building. Whether you're tracking competitor pricing, analyzing product trends, or building a comparison shopping tool, having reliable Amazon data extraction is the foundation.
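For instance, a small summary pass over the scraped records might look like the sketch below. The field names match the records built by the parsing script; the sample data is made up:

```python
def summarize(records: list) -> dict:
    """Basic stats over a list of scraped product records."""
    prices = []
    for rec in records:
        if rec.get('price'):  # price can be None when the selector misses
            prices.append(float(rec['price'].replace('$', '').replace(',', '')))
    return {
        'total': len(records),
        'with_price': len(prices),
        'avg_price': round(sum(prices) / len(prices), 2) if prices else None,
    }


sample = [
    {'title': 'Widget', 'price': '$19.99', 'asin': 'B000000001'},
    {'title': 'Gadget', 'price': None, 'asin': 'B000000002'},
]
print(summarize(sample))
# {'total': 2, 'with_price': 1, 'avg_price': 19.99}
```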