Selenium wasn't originally designed for web scraping. It's actually a testing tool for web applications. But here's the interesting part: it's become one of the go-to solutions when you need to scrape websites that rely heavily on JavaScript or require user interactions before revealing their data.
Think about those frustrating websites where you need to click buttons, fill out forms, or set date ranges before seeing the information you want. That's where Selenium shines. While tools like BeautifulSoup are great for static pages, they simply can't handle dynamic content or simulate user behavior the way Selenium can.
In this guide, we'll build a practical scraper that extracts historical currency exchange rate data from investing.com. We'll tackle real challenges like interacting with date pickers and handling multiple currencies in a single run.
Before writing any code, let's examine what we're working with. The investing.com currency page shows historical USD exchange rates in a table format. The site includes a date range selector that defaults to the last 20 days of data.
Here's what makes this interesting: the data loads dynamically based on the date range you select. You can't just grab the HTML and parse it. You need to interact with the date picker first, then wait for the table to reload with your requested timeframe.
The URL structure is straightforward. For USD to EUR data, you'll see /currencies/usd-eur-historical-data. Want to check other currencies? Just swap out the currency code in the URL.
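The pattern is simple enough to capture in a tiny helper (the function name here is our own, not anything from investing.com):

```python
def build_pair_url(quote_currency, base='usd'):
    """Build the investing.com historical-data URL for a currency pair."""
    return f'https://br.investing.com/currencies/{base}-{quote_currency.lower()}-historical-data'

print(build_pair_url('EUR'))
# https://br.investing.com/currencies/usd-eur-historical-data
```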
Let's start with the necessary imports. We're keeping this lean and focused on what actually matters:
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from time import sleep
import pandas as pd
```
The function we'll build accepts a list of currency codes, a start date, an end date, and an optional parameter for CSV export. We'll process multiple currencies in one go and store each result in a list:
```python
def get_currencies(currencies, start, end, export_csv=False):
    frames = []
```
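One detail worth noting: the `start` and `end` arguments are typed straight into the date picker, so they must match the format the site displays. A small helper keeps that explicit (the DD/MM/YYYY default is an assumption based on the Brazilian subdomain we're using; verify it against the page you're actually scraping):

```python
from datetime import date

def to_picker_format(d, fmt='%d/%m/%Y'):
    """Format a date the way the site's date picker expects it.

    DD/MM/YYYY is assumed for br.investing.com; pass a different fmt
    if the page you target displays dates differently."""
    return d.strftime(fmt)

print(to_picker_format(date(2021, 3, 5)))  # 05/03/2021
```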
For each currency in our list, we'll construct the URL and initialize a Chrome driver. The headless option lets you decide whether to watch the scraping happen in real-time or run it silently in the background:
```python
    for currency in currencies:
        my_url = f'https://br.investing.com/currencies/usd-{currency.lower()}-historical-data'

        option = Options()
        # Uncomment the next line to run Chrome without a visible window.
        # option.add_argument('--headless=new')
        driver = webdriver.Chrome(options=option)
        driver.get(my_url)
        driver.maximize_window()
```
Now comes the interesting part. We need to click the date button, clear the default dates, and input our custom range. This is where Selenium's real power shows up.
We'll use WebDriverWait to ensure elements are clickable before interacting with them. This prevents errors from clicking elements that haven't fully loaded yet:
```python
        date_button = WebDriverWait(driver, 20).until(
            EC.element_to_be_clickable((By.XPATH,
                "/html/body/div[5]/section/div[8]/div[3]/div/div[2]/span")))
        date_button.click()
```
Next, we'll clear the start date field and send our custom date:
```python
        start_bar = WebDriverWait(driver, 20).until(
            EC.element_to_be_clickable((By.XPATH,
                "/html/body/div[7]/div[1]/input[1]")))
        start_bar.clear()
        start_bar.send_keys(start)
```
We repeat this process for the end date, then click the Apply button and wait for the page to reload with our requested data:
```python
        end_bar = WebDriverWait(driver, 20).until(
            EC.element_to_be_clickable((By.XPATH,
                "/html/body/div[7]/div[1]/input[2]")))
        end_bar.clear()
        end_bar.send_keys(end)

        apply_button = WebDriverWait(driver, 20).until(
            EC.element_to_be_clickable((By.XPATH,
                "/html/body/div[7]/div[5]/a")))
        apply_button.click()
        sleep(5)
```
Once the table reloads with our custom date range, we use pandas to grab all tables from the page source. Then we close the driver to free up resources:
```python
        # Newer pandas versions expect a file-like object here; if you see
        # a deprecation warning, wrap the source in io.StringIO first.
        dataframes = pd.read_html(driver.page_source)
        driver.quit()
        print(f'{currency} scraped.')
```
The page contains multiple tables, but we only want the one with historical exchange rates. We'll loop through the dataframes and identify it by its column names:
```python
        for dataframe in dataframes:
            if dataframe.columns.tolist() == ['Date', 'Price', 'Open', 'High', 'Low', 'Change%']:
                df = dataframe
                break
        frames.append(df)
```
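One caveat: if the expected table is ever missing (a layout change, a failed load), the loop above leaves `df` unbound and the append raises a confusing `NameError`. A slightly more defensive sketch of the same lookup, written to fail with a clear message instead:

```python
def find_history_table(tables, expected_cols=('Date', 'Price', 'Open', 'High', 'Low', 'Change%')):
    """Return the first table whose header matches expected_cols.

    Works on pandas DataFrames, or anything else exposing a .columns sequence."""
    for table in tables:
        if list(table.columns) == list(expected_cols):
            return table
    raise ValueError('historical-data table not found on page')
```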
If the user wants CSV exports, we'll handle that too:
```python
        if export_csv:
            df.to_csv(f'{currency}.csv', index=False)
            print(f'{currency}.csv exported.')

    # Hand back one DataFrame per currency.
    return frames
```
Selenium can be temperamental. Network issues, slow page loads, or unexpected page changes can break your scraper. Instead of letting the entire operation fail, we'll wrap everything in a retry loop.
The strategy is simple: keep trying until successful, with a 30-second pause between attempts. This gives the website time to recover if it's experiencing temporary issues:
```python
    for currency in currencies:
        while True:
            try:
                # ... the per-currency scraping code from above goes here ...
                break
            except Exception:
                # Close the browser if it was opened, then back off and retry.
                driver.quit()
                print(f'Failed to scrape {currency}. Trying again in 30 seconds.')
                sleep(30)
```
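The same keep-trying idea can be factored into a reusable helper, which also lets you cap the number of attempts instead of looping forever. This helper is our own sketch, not part of Selenium:

```python
from time import sleep

def retry(task, attempts=5, delay=30):
    """Call task() until it succeeds, waiting delay seconds between tries.

    Re-raises the last error once attempts are exhausted, so a permanently
    broken page still fails loudly instead of spinning forever."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == attempts:
                raise
            print(f'Attempt {attempt} failed ({exc}). Retrying in {delay} seconds.')
            sleep(delay)
```

Wrapping the per-currency scrape in a small function and calling `retry(lambda: scrape_one(currency))` (where `scrape_one` is whatever you name your per-currency routine) keeps the main loop flat.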
This approach ensures you don't lose data from successfully scraped currencies even if one fails temporarily.
This scraper gives you a solid foundation, but there's room for enhancement. You could modify it to combine all currency data into a single DataFrame instead of keeping them separate. Or add an update function that checks your existing data and only fetches new records.
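Combining the frames is mostly a matter of `pd.concat`; the sketch below also tags each row with its currency so the merged data stays traceable (the `Currency` column name is our choice, not something the site provides):

```python
import pandas as pd

def combine_frames(frames, currencies):
    """Stack per-currency DataFrames into one, tagging rows with their currency."""
    tagged = []
    for currency, frame in zip(currencies, frames):
        frame = frame.copy()           # avoid mutating the caller's data
        frame['Currency'] = currency   # remember where each row came from
        tagged.append(frame)
    return pd.concat(tagged, ignore_index=True)
```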
The same interaction techniques work for scraping stocks, commodities, indices, and futures data. The key is understanding how each page structures its interactive elements.
When scaling up to scrape multiple pages or running frequent updates, you'll need to be mindful of server load. Add appropriate delays between requests, and consider rotating your connection points.
Selenium's ability to simulate real user behavior is what sets it apart. It doesn't just fetch HTML; it clicks buttons, fills forms, selects dropdown options, and waits for JavaScript to execute. For websites that gate their data behind interactions, this capability is essential.
The wait conditions we used throughout this code prevent the most common Selenium errors. By explicitly waiting for elements to become clickable, we account for varying page load times and ensure reliable execution across different network conditions.
The retry logic adds another layer of robustness. Real-world scraping rarely works perfectly on the first try. Network hiccups, rate limiting, and temporary server issues are all part of the game. Building in automatic retries means your scraper can handle these issues without manual intervention.
This combination of interaction capabilities, explicit waits, and error handling creates a scraper that's both powerful and reliable. Whether you're collecting financial data, monitoring prices, or aggregating information from interactive websites, these patterns will serve you well.