Web scraping sounds technical, but think of it as teaching your computer to read websites the way you do—only faster and without the coffee breaks. Whether you're pulling market data, tracking prices, or building datasets, Python gives you the toolkit to automate what would otherwise take hours of manual copying and pasting.
This guide walks through six core techniques that cover everything from simple data extraction to handling those annoying CAPTCHAs. No fluff, just practical methods you can start using today.
Before diving into complex HTML parsing, always check if a website loads data through APIs. It's like finding the service entrance instead of breaking in through the front door—cleaner, faster, and less likely to break when the website redesigns.
Take a site like quotes.toscrape.com/scroll. When you scroll, new quotes appear magically. That "magic" is usually an API call your browser makes in the background.
Finding the API endpoint:
Open Chrome DevTools (Ctrl + Shift + I), switch to the Network tab, and select "Fetch/XHR". Reload or scroll the page. You'll see requests pop up—these are your APIs. Click one, check the Response tab, and if it shows clean JSON data, congratulations, you've found the easy route.
For the quotes site, scrolling reveals https://quotes.toscrape.com/api/quotes?page=1. The page=1 parameter is your ticket to pagination. Change it to page=2, page=3, and so on.
If you need a robust solution for handling complex scraping scenarios without worrying about infrastructure, 👉 check out how Scrapingdog simplifies API-based data extraction with built-in proxy rotation and CAPTCHA handling.
Here's a Python snippet that loops through pages:
```python
import requests

url = "https://quotes.toscrape.com/api/quotes"
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'x-requested-with': 'XMLHttpRequest'
}

for page_number in range(1, 11):
    try:
        response = requests.get(url, params={"page": page_number}, headers=headers)
        response.raise_for_status()  # surface HTTP errors instead of parsing an error page
        print(response.json())
    except requests.RequestException as error:
        print(f"Error on page {page_number}: {error}")
```
Hidden APIs are everywhere. Some sites only trigger API calls when you scroll or click. Keep DevTools open while interacting with the page—you might discover endpoints that aren't obvious from the start.
Not every website uses fancy JavaScript. Static sites deliver all their content in the initial HTML response. Before you reach for Selenium, verify you actually need it.
Quick test: Open DevTools (Ctrl + Shift + I), hit Ctrl + Shift + P, type "Disable JavaScript", and reload. If the content still shows up, you're dealing with a static site. This means requests will work fine—no need for browser automation.
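The same test can be rough-sketched in code: fetch the raw HTML with requests (which never executes JavaScript) and check whether the content you care about is already there. The helper names below are my own, not a standard API:

```python
import requests

def has_content(html_text, marker):
    """True if the marker text already appears in the raw HTML --
    i.e. the page did not need JavaScript to render it."""
    return marker in html_text

def looks_static(url, marker):
    """Fetch the page without executing any JavaScript and test for the marker."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return has_content(response.text, marker)

# Usage: looks_static("https://quotes.toscrape.com/", 'class="quote"')
# If this returns False for content you can see in the browser,
# the site is dynamic and you'll need browser automation.
```

Treat a False result as a hint, not proof: some sites embed data in inline scripts, so inspect the raw HTML before reaching for Selenium.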
For static sites like quotes.toscrape.com, you have two parsing options: XPath with lxml or CSS selectors with BeautifulSoup. Pick whichever selector language feels more natural.
Using lxml with XPath:
```python
import requests
from lxml import html

session = requests.Session()
response = session.get("https://quotes.toscrape.com/")
tree = html.fromstring(response.content)

quotes = tree.xpath("//div[@class='quote']/span[@class='text']/text()")
print(quotes)
```
The XPath //div[@class='quote']/span[@class='text']/text() translates to: "Find any div with class 'quote', then grab the text from the span with class 'text' inside it."
Using BeautifulSoup with CSS selectors:
```python
import requests
from bs4 import BeautifulSoup

session = requests.Session()
response = session.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

quotes = [quote.get_text() for quote in soup.select("div.quote > span.text")]
print(quotes)
```
Both approaches work equally well. Choose based on your comfort level with XPath versus CSS selectors.
When you disable JavaScript and the page goes blank, you're looking at a dynamic site. These load content after the initial page render, which means requests alone won't cut it—it can't execute JavaScript.
Sites like quotes.toscrape.com/js fall into this category. Try the JavaScript disable test, and you'll see empty quote containers. This is where Selenium comes in, automating a real browser that can render JavaScript.
When dealing with JavaScript-heavy sites that resist simple scraping methods, 👉 Scrapingdog's headless browser support handles dynamic content automatically, saving you from managing Selenium infrastructure.
Basic Selenium script:
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
driver.get("https://quotes.toscrape.com/js/")

quotes = driver.find_elements(By.XPATH, "//div[@class='quote']/span[@class='text']")
quote_texts = [quote.text for quote in quotes]
print(quote_texts)

driver.quit()
```
This opens a Chrome browser, loads the page, waits for JavaScript to execute, then extracts the data. For production scraping where you don't need to see the browser, add headless mode:
```python
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")

driver = webdriver.Chrome(
    service=ChromeService(ChromeDriverManager().install()),
    options=options
)
```
Headless mode runs faster and uses less memory—perfect for large-scale scraping jobs.
Many valuable datasets hide behind login forms. The key is finding the authentication endpoint and replicating what your browser does when you log in.
Step-by-step approach:
1. Open the login page with the DevTools Network tab active
2. Set the filter to "All"
3. Submit the login form
4. Look for POST requests to endpoints like /login or /sign_in
5. Check the Payload tab to see what data gets sent
For a site like web-scraping.dev/login, you'll spot the authentication endpoint and see form data containing your username and password. Right-click the request, select "Copy as cURL", and paste it into Postman to generate Python code.
Login script example:
```python
import requests

url = "https://web-scraping.dev/api/login"
payload = 'username=user123&password=password'
headers = {
    'content-type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

# The Session object carries the auth cookies into later requests
session = requests.Session()
response = session.post(url, headers=headers, data=payload)

protected_page = session.get("https://web-scraping.dev/protected-page")
print(protected_page.text)
```
The Session() object is crucial here—it preserves cookies between requests, keeping you logged in. Some sites add CSRF tokens for extra security. Extract these from the login page HTML before submitting credentials.
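That token-extraction step can be sketched with BeautifulSoup. The field name csrf_token below is a placeholder — inspect the actual login form, since frameworks use different names (_token, authenticity_token, csrfmiddlewaretoken):

```python
from bs4 import BeautifulSoup

def extract_csrf_token(login_html, field_name="csrf_token"):
    """Pull the value of a hidden CSRF <input> out of the login page HTML."""
    soup = BeautifulSoup(login_html, "html.parser")
    hidden = soup.find("input", attrs={"name": field_name})
    return hidden["value"] if hidden else None

# Typical flow: GET the login page with the same Session, extract the
# token, then add it to the POST payload alongside the credentials.
sample = '<form><input type="hidden" name="csrf_token" value="abc123"></form>'
print(extract_csrf_token(sample))  # abc123
```

Fetching the login page and submitting the form must happen in the same Session — most sites tie the token to your session cookie, so a token grabbed in one session is rejected in another.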
CAPTCHAs exist to stop bots, which makes automating your way past them... ironic. Manual solving doesn't scale, so services like 2Captcha exist to solve them programmatically.
Identifying Google reCAPTCHA v2:
Look for an iframe with a URL containing google.com/recaptcha/api2. Open it in a new tab and grab the k parameter from the URL—that's your site key.
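The manual step above can also be scripted: locate the reCAPTCHA iframe's src in the page HTML and parse out the k query parameter. This is a rough sketch — it assumes the src attribute appears unencoded in the HTML you fetched, which isn't true for every page:

```python
import re
from urllib.parse import parse_qs, urlparse

def extract_sitekey(page_html):
    """Find a reCAPTCHA v2 iframe and return its 'k' (site key) parameter."""
    match = re.search(
        r'src="(https://www\.google\.com/recaptcha/api2/[^"]+)"', page_html
    )
    if not match:
        return None
    query = parse_qs(urlparse(match.group(1)).query)
    return query.get("k", [None])[0]

sample = (
    '<iframe src="https://www.google.com/recaptcha/api2/anchor'
    '?ar=1&k=6LeEXAMPLEKEY&co=test"></iframe>'
)
print(extract_sitekey(sample))  # 6LeEXAMPLEKEY
```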
Solving reCAPTCHA v2:
```python
import requests
from twocaptcha import TwoCaptcha

api_key = "YOUR_2CAPTCHA_API_KEY"
sitekey = "SITE_RECAPTCHA_KEY"
url = "https://example.com/login"

solver = TwoCaptcha(api_key)
result = solver.recaptcha(sitekey=sitekey, url=url)
recaptcha_response = result.get('code')

payload = {
    "username": "your_username",
    "password": "your_password",
    "g-recaptcha-response": recaptcha_response
}

session = requests.Session()
response = session.post(url, data=payload)
print(response.status_code)
```
The 2Captcha service returns a token that proves you "solved" the CAPTCHA. Include this in your form submission, and the site accepts it as if you clicked the checkbox yourself.
Scrape too aggressively from one IP address, and you'll get blocked. The solution: rotate through multiple IPs using proxies. This spreads your requests across different addresses, making your scraping look like traffic from many different users.
Why rotate IPs:
- Privacy: hard to trace all requests back to you
- Avoid blocks: distribute requests to stay under per-IP limits
- Beat rate limiting: APIs often restrict requests per IP per minute
Rotating proxies with BrightData:
```python
import random

import requests

proxy_hostname = "YOUR_PROXY_HOSTNAME"
proxy_username = "YOUR_PROXY_USERNAME"
proxy_password = "YOUR_PROXY_PASSWORD"

for i in range(1, 11):
    # A random session ID asks the proxy service for a fresh IP each request
    rand_num = random.randint(1, 10000)
    proxy = f"{proxy_username}-session-rand{rand_num}:{proxy_password}@{proxy_hostname}"
    proxy_server = {
        "http": f"http://{proxy}",
        "https": f"http://{proxy}"
    }
    response = requests.get("https://api.ipify.org?format=json", proxies=proxy_server)
    print(f"Request {i} from IP:", response.json())
```

Note the proxy URLs use the http:// scheme even for HTTPS targets—the dictionary keys say which target scheme the proxy handles, while the URL scheme describes the connection to the proxy itself.
The -session-rand{rand_num} parameter tells the proxy service to assign a new IP for each request. Each iteration uses a different IP address, keeping you under the radar.
Start with the simplest approach that works. Check for APIs first—they're faster and more reliable than parsing HTML. If no API exists, test whether the site is static or dynamic. Static sites need only requests and lxml/BeautifulSoup. Dynamic sites require Selenium or similar browser automation.
Only add complexity when necessary. Proxies, CAPTCHA solvers, and headless browsers all slow things down and add points of failure. But when you need them, they're essential tools that turn impossible scraping tasks into routine automation.
The techniques here cover most scraping scenarios you'll encounter. Master these fundamentals, and you'll be extracting data from websites that once seemed impenetrable.