Tabular data scattered across the web is like finding money on the sidewalk—except this money is organized, structured, and ready to fuel your next data project. Whether you're tracking football stats or analyzing stock trends, Python makes extracting this goldmine surprisingly straightforward.
And hey, stick around till the end—we've got something useful waiting for you.
HTML tables are basically spreadsheets living on web pages. Visually, they're neat rows and columns. Under the hood? A hierarchy of tags doing the heavy lifting.
Here's the skeleton you'll encounter:
<table>: The wrapper that says "hey, table starts here"
<thead>: Wraps the header row
<th>: Individual header cells (the column titles)
<tbody>: Where the actual data lives
<tr>: Each individual row
<td>: Individual cells within rows
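Put together, a minimal table looks like the snippet below, and Beautiful Soup can walk the hierarchy directly. This is a self-contained sketch with made-up sample values:

```python
from bs4 import BeautifulSoup

# A minimal table using every tag from the list above
html = """
<table>
  <thead><tr><th>Name</th><th>Age</th></tr></thead>
  <tbody>
    <tr><td>Airi</td><td>33</td></tr>
    <tr><td>Angelica</td><td>47</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Headers come from <th> cells inside <thead>
headers = [th.text for th in soup.find("thead").find_all("th")]
# Data comes from <td> cells inside each <tr> of <tbody>
rows = [[td.text for td in tr.find_all("td")]
        for tr in soup.find("tbody").find_all("tr")]
print(headers)  # ['Name', 'Age']
print(rows)     # [['Airi', '33'], ['Angelica', '47']]
```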
Now, here's the kicker: not every developer follows these conventions religiously. Some tables are messier than others, which means you'll occasionally need to improvise. But understanding the basics? Non-negotiable.
Let's peek at a real example. We're using this table from datatables.net (https://datatables.net/examples/styling/stripe.html). Pop open the browser inspector and you'll see clean <table> tags with everything nicely tucked inside a <tbody> section—exactly 10 rows matching what's displayed.
The table has 57 total entries. You could click through pagination buttons or fiddle with dropdown menus to see more rows, but that adds complexity. Instead, let's check if all the data already exists in the HTML source. Right-click, "View Page Source," and search for a few cell values from different pages.
Bingo. Everything's already there, just hidden by the front-end display logic. This makes our job infinitely easier.
Since all our target data lives in the HTML file, we can use the Requests library to fetch it and Beautiful Soup to parse it. No browser automation needed—just straightforward HTTP requests.
Note: New to web scraping? We've got a beginner-friendly Python web scraping tutorial you might want to check out first. But honestly, you can follow along either way.
Create a project directory called python-html-table, add a subfolder bs4-table-scraper, and create python_table_scraper.py inside.
From your terminal, install the libraries:
bash
pip3 install requests beautifulsoup4
Import them:
python
import requests
from bs4 import BeautifulSoup
Now send a simple HTTP request:
python
url = 'https://datatables.net/examples/styling/stripe.html'
response = requests.get(url)
print(response.status_code)
A 200 status code means success. Anything else? Your IP might be getting blocked by anti-scraping defenses. You could try adding custom headers to appear more human-like, but that's not always enough.
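One low-effort mitigation is to send a browser-like User-Agent header and fail loudly on anything other than success. A sketch (the header value is illustrative, not magic):

```python
import requests

# Illustrative desktop User-Agent string; any realistic value works the same way
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def fetch_html(url: str) -> str:
    """Fetch a page, raising requests.HTTPError on any 4xx/5xx status."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # turns a block (403, 429, ...) into a visible error
    return response.text
```

This catches blocks early instead of letting a 403 page flow silently into your parser, but as noted, headers alone won't beat serious anti-bot systems.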
Here's where things get smoother. ScraperAPI handles the messy stuff—rotating IPs, managing headers, solving CAPTCHAs—so you don't have to. It uses machine learning to figure out the best way to access your target data.
When you're dealing with sites that have strict anti-bot measures, having a reliable tool like ScraperAPI in your corner means fewer headaches and more consistent results. No more getting blocked mid-scrape or spending hours tweaking headers.
👉 Get started with ScraperAPI and skip the anti-scraping headaches
Sign up for a free account, grab your API key from the dashboard, and plug it into your request:
python
import requests
from bs4 import BeautifulSoup

api_key = 'YOUR_API_KEY'
url = 'https://datatables.net/examples/styling/stripe.html'
payload = {'api_key': api_key, 'url': url}
response = requests.get('https://api.scraperapi.com', params=payload)
print(response.status_code)
Boom. Clean 200 response, zero drama.
Turn that raw HTML into something usable:
python
soup = BeautifulSoup(response.text, 'html.parser')
Now we can navigate the parse tree using tags and attributes. The table we want has a class of stripe:
python
table = soup.find('table', class_='stripe')
print(table)
Pro tip: During testing, matching on the second class (dataTable) returned nothing. In the raw HTML the element only carries the stripe class; the rest are likely added by JavaScript at runtime. You could also match on id='example'.
Every row is a <tr> element containing <td> cells, all wrapped in <tbody>. Let's extract them:
python
for employee_data in table.find_all('tbody'):
    rows = employee_data.find_all('tr')
    print(rows)
Now loop through the individual rows to grab specific data. Each cell's index position tells us which column it belongs to:
python
for row in rows:
    name = row.find_all('td')[0].text
    position = row.find_all('td')[1].text
    office = row.find_all('td')[2].text
    age = row.find_all('td')[3].text
    start_date = row.find_all('td')[4].text
    salary = row.find_all('td')[5].text
    print(name)
There's your employee names, printed clean and tidy. Same logic applies to the rest of the cells—just adjust the index position.
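Hard-coded index positions work, but they break silently if the column order ever changes. A variation, sketched here on an inline sample table, zips the header texts with each row's cells instead:

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the live page
html = """
<table>
  <thead><tr><th>Name</th><th>Position</th><th>Office</th></tr></thead>
  <tbody><tr><td>Airi Satou</td><td>Accountant</td><td>Tokyo</td></tr></tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
headers = [th.text.strip() for th in table.find_all("th")]

records = []
for tr in table.find("tbody").find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    records.append(dict(zip(headers, cells)))  # pair each header with its cell

print(records[0]["Position"])  # Accountant
```

Each row comes out as a dictionary keyed by column name, so a reordered table still maps correctly.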
Printing to console is cute, but let's store this data properly. Python's built-in JSON module makes this trivial—no installation needed.
Create an empty list outside your loop:
python
employee_list = []
Append each row's data as a dictionary:
python
employee_list.append({
    'Name': name,
    'Position': position,
    'Office': office,
    'Age': age,
    'Start date': start_date,
    'Salary': salary
})
Verify it worked:
python
print(employee_list)
print(len(employee_list)) # Should return 57
Now dump it into a JSON file:
python
import json
with open('employee_data.json', 'w') as json_file:
json.dump(employee_list, json_file, indent=2)
The indent=2 parameter keeps everything readable instead of cramming it into one endless line.
Here's everything together:
python
import requests
from bs4 import BeautifulSoup
import json
url = 'https://datatables.net/examples/styling/stripe.html'
employee_list = []

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', class_='stripe')

for employee_data in table.find_all('tbody'):
    rows = employee_data.find_all('tr')
    for row in rows:
        name = row.find_all('td')[0].text
        position = row.find_all('td')[1].text
        office = row.find_all('td')[2].text
        age = row.find_all('td')[3].text
        start_date = row.find_all('td')[4].text
        salary = row.find_all('td')[5].text
        # append the scraped row to the list
        employee_list.append({
            'Name': name,
            'Position': position,
            'Office': office,
            'Age': age,
            'Start date': start_date,
            'Salary': salary
        })

with open('employee_data.json', 'w') as json_file:
    json.dump(employee_list, json_file, indent=2)
Run it with python3 python_table_scraper.py and watch your JSON file populate with clean, structured data.
Sometimes tables get fancy with nested headers, rowspans, or colspans. When that happens, you need to level up your parsing logic.
Check out this example table from datatables.net (https://datatables.net/examples/basic_init/complex_header.html). It has a two-tiered header: broader categories like "Name," "Position," and "Contact" on top, with subcategories underneath.
Import your libraries and set up ScraperAPI:
python
import requests
from bs4 import BeautifulSoup
import pandas as pd
api_key = 'YOUR_API_KEY'
url = 'https://datatables.net/examples/basic_init/complex_header.html'
Build a function that handles the entire process:
python
def scrape_complex_table(url):
    payload = {'api_key': api_key, 'url': url}
    response = requests.get('https://api.scraperapi.com', params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
Find the table using its id:
python
    table = soup.find('table', id='example')
Pull both header levels and combine them:
python
    headers_level1 = [th.text.strip() for th in table.select('thead tr:nth-of-type(1) th')]
    headers_level2 = [th.text.strip() for th in table.select('thead tr:nth-of-type(2) th')]
    combined_headers = []
    for header in headers_level1:
        if header == 'Name':
            combined_headers.append(header)
        elif header == 'Position':
            combined_headers.extend([f"{header} - {col}" for col in ['Title', 'Salary']])
        elif header == 'Contact':
            combined_headers.extend([f"{header} - {col}" for col in ['Office', 'Extn.', 'Email']])
Loop through rows and extract cell data:
python
    rows = []
    for row in table.select('tbody tr'):
        cells = [cell.text.strip() for cell in row.find_all('td')]
        rows.append(cells)
Build a Pandas DataFrame with your extracted data:
python
    df = pd.DataFrame(rows, columns=combined_headers)
    return df
Execute the function and save results:
python
result_df = scrape_complex_table(url)
print(result_df.head())
result_df.to_csv('complex_table_data.csv', index=False)
print("Data saved to 'complex_table_data.csv'")
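The hard-coded 'Name'/'Position'/'Contact' branches above only work for this exact table. A more general sketch reads each top-level header's colspan attribute and pairs it with the right number of subheaders; the sample markup below is hypothetical but mirrors the two-tier structure described:

```python
from bs4 import BeautifulSoup

# Hypothetical two-tier header mirroring the structure described above
SAMPLE = """
<table><thead>
  <tr><th rowspan="2">Name</th><th colspan="2">Position</th><th colspan="3">Contact</th></tr>
  <tr><th>Title</th><th>Salary</th><th>Office</th><th>Extn.</th><th>Email</th></tr>
</thead></table>
"""

def flatten_headers(table):
    top, sub = table.select("thead tr")
    # Subheaders are consumed left to right as top-level colspans demand them
    sub_texts = iter(th.get_text(strip=True) for th in sub.find_all("th"))
    combined = []
    for th in top.find_all("th"):
        span = int(th.get("colspan", 1))
        label = th.get_text(strip=True)
        if span == 1:
            combined.append(label)  # a rowspan cell covers both header rows alone
        else:
            combined.extend(f"{label} - {next(sub_texts)}" for _ in range(span))
    return combined

table = BeautifulSoup(SAMPLE, "html.parser").find("table")
print(flatten_headers(table))
```

The same function then works on any table whose top-tier headers declare their colspan, with no per-site if/elif ladder.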
Large datasets often get split across multiple pages. Traditionally, you'd fire up Selenium and deal with browser automation. But ScraperAPI's Render Instruction Set offers a cleaner approach.
Our example table has ">" and "<" buttons for navigation. To scrape everything, we need to:
Load the initial page
Click the ">" button
Wait for new data to load
Repeat until we've grabbed all pages
Instead of manually controlling a browser, send instructions via API:
python
api_key = 'YOUR_API_KEY'
target_url = 'https://datatables.net/examples/styling/stripe.html'
config = [{
    "type": "loop",
    "for": 5,
    "instructions": [
        {
            "type": "click",
            "selector": {
                "type": "css",
                "value": "button.dt-paging-button.next"
            }
        },
        {
            "type": "wait",
            "value": 3
        }
    ]
}]
The loop instruction repeats the click action five times, with a 3-second wait after each click.
Convert your config to JSON and include it in the headers:
python
import json
config_json = json.dumps(config)
headers = {
    'x-sapi-api_key': api_key,
    'x-sapi-render': 'true',
    'x-sapi-instruction_set': config_json
}
payload = {'url': target_url}
response = requests.get('https://api.scraperapi.com', headers=headers, params=payload)
Parse and extract as usual:
python
soup = BeautifulSoup(response.text, 'html.parser')
employee_list = []
table = soup.find('table', class_='stripe')

for employee_data in table.find_all('tbody'):
    rows = employee_data.find_all('tr')
    for row in rows:
        cells = row.find_all('td')
        employee_list.append({
            'Name': cells[0].text,
            'Position': cells[1].text,
            'Office': cells[2].text,
            'Age': cells[3].text,
            'Start date': cells[4].text,
            'Salary': cells[5].text
        })

with open('employee_data.json', 'w') as json_file:
    json.dump(employee_list, json_file, indent=2)
Compared with driving a real browser through Selenium, this approach gives you:
No Selenium installation or WebDriver management
Simpler code with fewer dependencies
Better anti-bot handling through ScraperAPI
More reliable execution with built-in waits
Easy server deployment without browser dependencies
Real-world tables love throwing curveballs: empty cells, nested elements, merged cells, malformed HTML. Here's how to handle them.
Empty cells can crash your script. Handle them gracefully:
python
def extract_cell_data(cell):
    if not cell:
        return "N/A"
    if cell.text.strip() == "":
        return "N/A"
    return cell.text.strip()
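Here's the helper in action on a row with a missing value, as a self-contained demo:

```python
from bs4 import BeautifulSoup

def extract_cell_data(cell):
    # Missing or empty cells become a placeholder instead of crashing the loop
    if not cell or cell.text.strip() == "":
        return "N/A"
    return cell.text.strip()

# Sample row with an empty second cell
row = BeautifulSoup("<tr><td>Airi Satou</td><td></td></tr>", "html.parser")
values = [extract_cell_data(td) for td in row.find_all("td")]
print(values)  # ['Airi Satou', 'N/A']
```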
Some tables load dynamically via JavaScript. ScraperAPI's Render Instruction Set handles this without needing Selenium.
For wonky HTML structures, use the more forgiving html5lib parser (install it first with pip install html5lib):
python
soup = BeautifulSoup(response.text, 'html5lib')
When a table is well-formed, Pandas' purpose-built read_html method is often quicker and more convenient than hand-rolled parsing.
Here's the quick-and-dirty approach: Pandas can scrape all tables from a page in just a few lines.
Create a new folder, install Pandas:
bash
pip install pandas
Import it:
python
import pandas as pd
Use read_html() to scrape the URL:
python
employee_data = pd.read_html('http://api.scraperapi.com?api_key=YOUR_API_KEY&url=https://datatables.net/examples/styling/stripe.html')
This returns a list of DataFrame objects—one for each table on the page.
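Note that read_html needs an HTML parser backend installed (lxml, or beautifulsoup4 plus html5lib). You can see the return shape without any network call by feeding it markup directly; the sample data here is made up:

```python
from io import StringIO
import pandas as pd

# Inline sample table; read_html treats the <th> row as the header
html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Airi Satou</td><td>33</td></tr>
  <tr><td>Angelica Ramos</td><td>47</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # one DataFrame per <table> found
df = tables[0]
print(len(tables))       # 1
print(list(df.columns))  # ['Name', 'Age']
```

Wrapping the string in StringIO matters: recent pandas versions deprecate passing literal HTML directly.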
Convert to JSON:
python
employee_data[0].to_json('./employee_list.json', orient='index', indent=2)
Full code:
python
import pandas as pd
employee_data = pd.read_html('http://api.scraperapi.com?api_key=YOUR_API_KEY&url=https://datatables.net/examples/styling/stripe.html')
employee_data[0].to_json('./employee_list.json', orient='index', indent=2)
Simple, clean, effective.
You're now equipped to scrape virtually any HTML table on the web. The key is understanding the structure and logic behind the page. Once you've got that down, extraction becomes straightforward.
One caveat: these methods work when data lives in the HTML file. If you encounter dynamically generated tables (loaded via JavaScript after the initial page load), you'll need different tactics. For those scenarios, check out our guide on scraping JavaScript tables with Python.
For complex scraping projects—especially when dealing with aggressive anti-bot systems, dynamic content, or large-scale data extraction—tools like ScraperAPI handle the technical complexity so you can focus on extracting insights instead of debugging connection issues.
👉 Try ScraperAPI for hassle-free web scraping across any site