Tabular data scattered across the web is like finding money on the sidewalk—except this money is organized, structured, and ready to fuel your next data project. Whether you're tracking football stats or analyzing stock trends, Python makes extracting this goldmine surprisingly straightforward.
And hey, stick around till the end—we've got something useful waiting for you.
HTML tables are basically spreadsheets living on web pages. Visually, they're neat rows and columns. Under the hood? A hierarchy of tags doing the heavy lifting.
Here's the skeleton you'll encounter:
<table>: The wrapper that says "hey, table starts here"
<thead>: Wraps the header row
<th>: Individual header cells (the column titles)
<tbody>: Where the actual data lives
<tr>: Each individual row
<td>: Individual cells within rows
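Put together, a minimal table looks like the snippet below, and Beautiful Soup can walk the hierarchy directly. This is a self-contained sketch with made-up sample values:

```python
from bs4 import BeautifulSoup

# A minimal table using every tag from the list above
html = """
<table>
  <thead><tr><th>Name</th><th>Age</th></tr></thead>
  <tbody>
    <tr><td>Airi</td><td>33</td></tr>
    <tr><td>Angelica</td><td>47</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Headers come from <th> cells inside <thead>
headers = [th.text for th in soup.find("thead").find_all("th")]
# Data comes from <td> cells inside each <tr> of <tbody>
rows = [[td.text for td in tr.find_all("td")]
        for tr in soup.find("tbody").find_all("tr")]
print(headers)  # ['Name', 'Age']
print(rows)     # [['Airi', '33'], ['Angelica', '47']]
```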
Now, here's the kicker: not every developer follows these conventions religiously. Some tables are messier than others, which means you'll occasionally need to improvise. But understanding the basics? Non-negotiable.
Let's peek at a real example. We're using this table from datatables.net (https://datatables.net/examples/styling/stripe.html). Pop open the browser inspector and you'll see clean <table> tags with everything nicely tucked inside a <tbody> section—exactly 10 rows matching what's displayed.
The table has 57 total entries. You could click through pagination buttons or fiddle with dropdown menus to see more rows, but that adds complexity. Instead, let's check if all the data already exists in the HTML source. Right-click, "View Page Source," and search for a few cell values from different pages.
Bingo. Everything's already there, just hidden by the front-end display logic. This makes our job infinitely easier.
Since all our target data lives in the HTML file, we can use the Requests library to fetch it and Beautiful Soup to parse it. No browser automation needed—just straightforward HTTP requests.
Note: New to web scraping? We've got a beginner-friendly Python web scraping tutorial you might want to check out first. But honestly, you can follow along either way.
Create a project directory called python-html-table, add a subfolder bs4-table-scraper, and create python_table_scraper.py inside.
From your terminal, install the libraries:
bash
pip3 install requests beautifulsoup4
Import them:
python
import requests
from bs4 import BeautifulSoup
Now send a simple HTTP request:
python
url = 'https://datatables.net/examples/styling/stripe.html'
response = requests.get(url)
print(response.status_code)
A 200 status code means success. Anything else? Your IP might be getting blocked by anti-scraping defenses. You could try adding custom headers to appear more human-like, but that's not always enough.
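One low-effort mitigation is to send a browser-like User-Agent header and fail loudly on anything other than success. A sketch (the header value is illustrative, not magic):

```python
import requests

# Illustrative desktop User-Agent string; any realistic value works the same way
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def fetch_html(url: str) -> str:
    """Fetch a page, raising requests.HTTPError on any 4xx/5xx status."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # turns a block (403, 429, ...) into a visible error
    return response.text
```

This catches blocks early instead of letting a 403 page flow silently into your parser, but as noted, headers alone won't beat serious anti-bot systems.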
Here's where things get smoother. ScraperAPI handles the messy stuff—rotating IPs, managing headers, solving CAPTCHAs—so you don't have to. It uses machine learning to figure out the best way to access your target data.
When you're dealing with sites that have strict anti-bot measures, having a reliable tool like ScraperAPI in your corner means fewer headaches and more consistent results. No more getting blocked mid-scrape or spending hours tweaking headers.
👉 Get started with ScraperAPI and skip the anti-scraping headaches
Sign up for a free account, grab your API key from the dashboard, and plug it into your request:
python
import requests
from bs4 import BeautifulSoup

api_key = 'YOUR_API_KEY'
url = 'https://datatables.net/examples/styling/stripe.html'
payload = {'api_key': api_key, 'url': url}
response = requests.get('https://api.scraperapi.com', params=payload)
print(response.status_code)
Boom. Clean 200 response, zero drama.
Turn that raw HTML into something usable:
python
soup = BeautifulSoup(response.text, 'html.parser')
Now we can navigate the parse tree using tags and attributes. The table we want has a class of stripe:
python
table = soup.find('table', class_='stripe')
print(table)
Pro tip: During testing, matching on the second class (dataTable) returned nothing. In the raw HTML the element only carries the stripe class; the rest are likely added by JavaScript at runtime. You could also match on id='example'.
Every row is a <tr> element containing <td> cells, all wrapped in <tbody>. Let's extract them:
python
for employee_data in table.find_all('tbody'):
    rows = employee_data.find_all('tr')
    print(rows)
Now loop through the individual rows to grab specific data. Each cell's index position tells us which column it belongs to:
python
for row in rows:
    name = row.find_all('td')[0].text
    position = row.find_all('td')[1].text
    office = row.find_all('td')[2].text
    age = row.find_all('td')[3].text
    start_date = row.find_all('td')[4].text
    salary = row.find_all('td')[5].text
    print(name)
There's your employee names, printed clean and tidy. Same logic applies to the rest of the cells—just adjust the index position.
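Hard-coded index positions work, but they break silently if the column order ever changes. A variation, sketched here on an inline sample table, zips the header texts with each row's cells instead:

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the live page
html = """
<table>
  <thead><tr><th>Name</th><th>Position</th><th>Office</th></tr></thead>
  <tbody><tr><td>Airi Satou</td><td>Accountant</td><td>Tokyo</td></tr></tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
headers = [th.text.strip() for th in table.find_all("th")]

records = []
for tr in table.find("tbody").find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    records.append(dict(zip(headers, cells)))  # pair each header with its cell

print(records[0]["Position"])  # Accountant
```

Each row comes out as a dictionary keyed by column name, so a reordered table still maps correctly.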
Printing to console is cute, but let's store this data properly. Python's built-in JSON module makes this trivial—no installation needed.
Create an empty list outside your loop:
python
employee_list = []
Append each row's data as a dictionary:
python
employee_list.append({
    'Name': name,
    'Position': position,
    'Office': office,
    'Age': age,
    'Start date': start_date,
    'Salary': salary
})
Verify it worked:
python
print(employee_list)
print(len(employee_list)) # Should return 57
Now dump it into a JSON file:
python
import json
with open('employee_data.json', 'w') as json_file:
json.dump(employee_list, json_file, indent=2)
The indent=2 parameter keeps everything readable instead of cramming it into one endless line.
Here's everything together:
python
import requests
from bs4 import BeautifulSoup
import json
url = 'https://datatables.net/examples/styling/stripe.html'
employee_list = []

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', class_='stripe')

for employee_data in table.find_all('tbody'):
    rows = employee_data.find_all('tr')
    for row in rows:
        name = row.find_all('td')[0].text
        position = row.find_all('td')[1].text
        office = row.find_all('td')[2].text
        age = row.find_all('td')[3].text
        start_date = row.find_all('td')[4].text
        salary = row.find_all('td')[5].text
        # append the scraped row to the list
        employee_list.append({
            'Name': name,
            'Position': position,
            'Office': office,
            'Age': age,
            'Start date': start_date,
            'Salary': salary
        })

with open('employee_data.json', 'w') as json_file:
    json.dump(employee_list, json_file, indent=2)
Run it with python3 python_table_scraper.py and watch your JSON file populate with clean, structured data.
Sometimes tables get fancy with nested headers, rowspans, or colspans. When that happens, you need to level up your parsing logic.
Check out this example table from datatables.net (https://datatables.net/examples/basic_init/complex_header.html). It has a two-tiered header: broader categories like "Name," "Position," and "Contact" on top, with subcategories underneath.
Import your libraries and set up ScraperAPI:
python
import requests
from bs4 import BeautifulSoup
import pandas as pd
api_key = 'YOUR_API_KEY'
url = 'https://datatables.net/examples/basic_init/complex_header.html'
Build a function that handles the entire process:
python
def scrape_complex_table(url):
    payload = {'api_key': api_key, 'url': url}
    response = requests.get('https://api.scraperapi.com', params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
Find the table using its id:
python
    table = soup.find('table', id='example')
Pull both header levels and combine them:
python
    headers_level1 = [th.text.strip() for th in table.select('thead tr:nth-of-type(1) th')]
    headers_level2 = [th.text.strip() for th in table.select('thead tr:nth-of-type(2) th')]
    combined_headers = []
    for header in headers_level1:
        if header == 'Name':
            combined_headers.append(header)
        elif header == 'Position':
            combined_headers.extend([f"{header} - {col}" for col in ['Title', 'Salary']])
        elif header == 'Contact':
            combined_headers.extend([f"{header} - {col}" for col in ['Office', 'Extn.', 'Email']])
Loop through rows and extract cell data:
python
    rows = []
    for row in table.select('tbody tr'):
        cells = [cell.text.strip() for cell in row.find_all('td')]
        rows.append(cells)
Build a Pandas DataFrame with your extracted data:
python
    df = pd.DataFrame(rows, columns=combined_headers)
    return df
Execute the function and save results:
python
result_df = scrape_complex_table(url)
print(result_df.head())
result_df.to_csv('complex_table_data.csv', index=False)
print("Data saved to 'complex_table_data.csv'")
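The hard-coded 'Name'/'Position'/'Contact' branches above only work for this exact table. A more general sketch reads each top-level header's colspan attribute and pairs it with the right number of subheaders; the sample markup below is hypothetical but mirrors the two-tier structure described:

```python
from bs4 import BeautifulSoup

# Hypothetical two-tier header mirroring the structure described above
SAMPLE = """
<table><thead>
  <tr><th rowspan="2">Name</th><th colspan="2">Position</th><th colspan="3">Contact</th></tr>
  <tr><th>Title</th><th>Salary</th><th>Office</th><th>Extn.</th><th>Email</th></tr>
</thead></table>
"""

def flatten_headers(table):
    top, sub = table.select("thead tr")
    # Subheaders are consumed left to right as top-level colspans demand them
    sub_texts = iter(th.get_text(strip=True) for th in sub.find_all("th"))
    combined = []
    for th in top.find_all("th"):
        span = int(th.get("colspan", 1))
        label = th.get_text(strip=True)
        if span == 1:
            combined.append(label)  # a rowspan cell covers both header rows alone
        else:
            combined.extend(f"{label} - {next(sub_texts)}" for _ in range(span))
    return combined

table = BeautifulSoup(SAMPLE, "html.parser").find("table")
print(flatten_headers(table))
```

The same function then works on any table whose top-tier headers declare their colspan, with no per-site if/elif ladder.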
Large datasets often get split across multiple pages. Traditionally, you'd fire up Selenium and deal with browser automation. But ScraperAPI's Render Instruction Set offers a cleaner approach.
Our example table has ">" and "<" buttons for navigation. To scrape everything, we need to:
Load the initial page
Click the ">" button
Wait for new data to load
Repeat until we've grabbed all pages
Instead of manually controlling a browser, send instructions via API:
python
api_key = 'YOUR_API_KEY'
target_url = 'https://datatables.net/examples/styling/stripe.html'
config = [{
    "type": "loop",
    "for": 5,
    "instructions": [
        {
            "type": "click",
            "selector": {
                "type": "css",
                "value": "button.dt-paging-button.next"
            }
        },
        {
            "type": "wait",
            "value": 3
        }
    ]
}]
The loop instruction repeats the click action five times, with a 3-second wait after each click.
Convert your config to JSON and include it in the headers:
python
import json
config_json = json.dumps(config)
headers = {
    'x-sapi-api_key': api_key,
    'x-sapi-render': 'true',
    'x-sapi-instruction_set': config_json
}
payload = {'url': target_url}
response = requests.get('https://api.scraperapi.com', headers=headers, params=payload)
Parse and extract as usual:
python
soup = BeautifulSoup(response.text, 'html.parser')
employee_list = []
table = soup.find('table', class_='stripe')

for employee_data in table.find_all('tbody'):
    rows = employee_data.find_all('tr')
    for row in rows:
        cells = row.find_all('td')
        employee_list.append({
            'Name': cells[0].text,
            'Position': cells[1].text,
            'Office': cells[2].text,
            'Age': cells[3].text,
            'Start date': cells[4].text,
            'Salary': cells[5].text
        })

with open('employee_data.json', 'w') as json_file:
    json.dump(employee_list, json_file, indent=2)
Compared with driving a real browser through Selenium, this approach gives you:
No Selenium installation or WebDriver management
Simpler code with fewer dependencies
Better anti-bot handling through ScraperAPI
More reliable execution with built-in waits
Easy server deployment without browser dependencies
Real-world tables love throwing curveballs: empty cells, nested elements, merged cells, malformed HTML. Here's how to handle them.
Empty cells can crash your script. Handle them gracefully:
python
def extract_cell_data(cell):
    if not cell:
        return "N/A"
    if cell.text.strip() == "":
        return "N/A"
    return cell.text.strip()
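Here's the helper in action on a row with a missing value, as a self-contained demo:

```python
from bs4 import BeautifulSoup

def extract_cell_data(cell):
    # Missing or empty cells become a placeholder instead of crashing the loop
    if not cell or cell.text.strip() == "":
        return "N/A"
    return cell.text.strip()

# Sample row with an empty second cell
row = BeautifulSoup("<tr><td>Airi Satou</td><td></td></tr>", "html.parser")
values = [extract_cell_data(td) for td in row.find_all("td")]
print(values)  # ['Airi Satou', 'N/A']
```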
Some tables load dynamically via JavaScript. ScraperAPI's Render Instruction Set handles this without needing Selenium.
For wonky HTML structures, use the more forgiving html5lib parser (install it first with pip install html5lib):
python
soup = BeautifulSoup(response.text, 'html5lib')
When a table is well-formed, Pandas' purpose-built read_html method is often quicker and more convenient than hand-rolled parsing.
Here's the quick-and-dirty approach: Pandas can scrape all tables from a page in just a few lines.
Create a new folder, install Pandas:
bash
pip install pandas
Import it:
python
import pandas as pd
Use read_html() to scrape the URL:
python
employee_data = pd.read_html('http://api.scraperapi.com?api_key=YOUR_API_KEY&url=https://datatables.net/examples/styling/stripe.html')
This returns a list of DataFrame objects—one for each table on the page.
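Note that read_html needs an HTML parser backend installed (lxml, or beautifulsoup4 plus html5lib). You can see the return shape without any network call by feeding it markup directly; the sample data here is made up:

```python
from io import StringIO
import pandas as pd

# Inline sample table; read_html treats the <th> row as the header
html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Airi Satou</td><td>33</td></tr>
  <tr><td>Angelica Ramos</td><td>47</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # one DataFrame per <table> found
df = tables[0]
print(len(tables))       # 1
print(list(df.columns))  # ['Name', 'Age']
```

Wrapping the string in StringIO matters: recent pandas versions deprecate passing literal HTML directly.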
Convert to JSON:
python
employee_data[0].to_json('./employee_list.json', orient='index', indent=2)
Full code:
python
import pandas as pd
employee_data = pd.read_html('http://api.scraperapi.com?api_key=YOUR_API_KEY&url=https://datatables.net/examples/styling/stripe.html')
employee_data[0].to_json('./employee_list.json', orient='index', indent=2)
Simple, clean, effective.
You're now equipped to scrape virtually any HTML table on the web. The key is understanding the structure and logic behind the page. Once you've got that down, extraction becomes straightforward.
One caveat: these methods work when data lives in the HTML file. If you encounter dynamically generated tables (loaded via JavaScript after the initial page load), you'll need different tactics. For those scenarios, check out our guide on scraping JavaScript tables with Python.
For complex scraping projects—especially when dealing with aggressive anti-bot systems, dynamic content, or large-scale data extraction—tools like ScraperAPI handle the technical complexity so you can focus on extracting insights instead of debugging connection issues.
👉 Try ScraperAPI for hassle-free web scraping across any site