Data powers decisions, but scattered information means nothing without structure. Whether you're training machine learning models or analyzing market trends, you need organized, accessible datasets. Here's how web scraping lets you build custom datasets tailored to your exact needs, giving you unique insights competitors can't access.
A dataset is structured information organized for analysis. Think of it as a digital filing cabinet where related data points live together. Sometimes it's tabular data with columns and rows. Sometimes it's documents or files. The key is structure.
The beauty of datasets is efficiency. Instead of hunting through scattered information, everything's right there, ready to analyze individually or collectively. It's the difference between a pile of receipts and a spreadsheet tracking every expense.
Public datasets are available to everyone. Platforms like Kaggle share thousands of datasets for research, model training, and analysis. Anyone can download them.
Private datasets? Those belong to specific organizations. Companies guard them closely because they contain competitive advantages.
But here's the thing: paid doesn't equal private. If multiple businesses buy the same dataset from the same provider, they're all working with identical information. The only differentiator becomes analyst skill and execution speed.
The real advantage comes from building your own datasets from alternative sources. When you scrape unique data that competitors don't have access to, you're operating with genuine market intelligence. That's when businesses discover insights no one else can replicate.
Several standard approaches exist for gathering data:
Surveys and Questionnaires – Direct information from respondents filling out forms
Public Data Sources – Government and institutional datasets available for download
Transactional Records – Sales data revealing demand patterns and customer behavior
Data APIs – Programmatic access to structured databases
These work fine when the data exists in accessible formats. But what about product listings scattered across e-commerce sites? Social media conversations buried in forums? Real estate prices updating daily across hundreds of websites?
When data lives on the web without APIs or download options, web scraping becomes the only practical solution for building comprehensive datasets.
Web scraping extracts publicly available data from websites programmatically. It unlocks millions of data points you can transform into structured datasets for applications, business intelligence, and research.
For this tutorial, we're using Python to scrape book data from books.toscrape.com and build a structured dataset containing names, prices, descriptions, stock availability, UPCs, and categories.
Tools we're using:
Python
Requests – HTTP requests
BeautifulSoup – HTML parsing
Pandas – DataFrame creation
ScraperAPI – Avoiding IP bans and bypassing anti-scraping protection
When you're dealing with large-scale data collection, managing IP rotation and anti-bot measures becomes critical. Modern websites deploy sophisticated detection systems that can block scrapers within seconds. This is where having reliable infrastructure matters—tools that handle the technical complexity while you focus on extracting value from the data. 👉 See how automated IP rotation and header management keeps your scrapers running smoothly
Here's what we're building:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Route every request through ScraperAPI to avoid IP bans
scraperapi = 'https://api.scraperapi.com?api_key=YOUR_API_KEY&url='
all_books = []

for x in range(1, 6):
    # Fetch one paginated catalogue page (20 books each)
    response = requests.get(scraperapi + f'https://books.toscrape.com/catalogue/page-{x}.html')
    soup = BeautifulSoup(response.content, 'html.parser')
    onpage_books = soup.find_all('li', class_='col-xs-6')
    for books in onpage_books:
        # Follow each book card's relative link to its detail page
        r = requests.get(scraperapi + 'https://books.toscrape.com/catalogue/' + books.find('a')['href'])
        s = BeautifulSoup(r.content, 'html.parser')
        all_books.append({
            'Name': s.find('h1').text,
            'Description': s.select('p')[3].text,
            'Price': s.find('p', class_='price_color').text,
            'Category': s.find('ul', class_='breadcrumb').select('li')[2].text.strip(),
            'Availability': s.select_one('p.availability').text.strip(),
            'UPC': s.find('table').find_all('tr')[0].find('td').text
        })

df = pd.DataFrame(all_books)
df.to_csv('book_data.csv', index=False)
```
Before running this, create a free ScraperAPI account and add your API key to the scraperapi variable.
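One caution: hardcoding the key means it leaks if you ever share or commit the script. A common alternative is reading it from an environment variable; a minimal sketch, where the `SCRAPERAPI_KEY` variable name is our choice rather than anything ScraperAPI mandates:

```python
import os

# Read the key from the environment rather than hardcoding it;
# SCRAPERAPI_KEY is an example variable name, not an official one.
api_key = os.environ.get('SCRAPERAPI_KEY', 'YOUR_API_KEY')
scraperapi = f'https://api.scraperapi.com?api_key={api_key}&url='
```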
First, understand the site structure. The books site uses pagination showing 20 books per page from a total of 1000 books. The URL pattern is predictable: https://books.toscrape.com/catalogue/page-2.html where the number increases with each page.
Individual book URLs aren't as predictable, but they're accessible from the paginated pages. Our strategy:
Navigate through paginated pages gathering all book URLs
Send requests to each book page extracting detailed information
Format everything into a structured DataFrame
Create a new directory with a book_scraper.py file. Import dependencies at the top:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
```
Instead of sending requests directly from your machine (which gets blocked quickly), route them through ScraperAPI. This handles IP rotation and proper headers automatically, maintaining near 100% success rates.
Set up the ScraperAPI integration:
```python
scraperapi = 'https://api.scraperapi.com?api_key=YOUR_API_KEY&url='
```
Now you can concatenate this with target URLs, keeping your code clean.
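Plain concatenation works here because these book URLs carry no query strings of their own. If a target URL ever contains `?` or `&`, though, percent-encoding it first keeps ScraperAPI from misreading those characters as its own parameters. A small sketch (the `proxied` helper is our own, not part of any library):

```python
from urllib.parse import quote

scraperapi = 'https://api.scraperapi.com?api_key=YOUR_API_KEY&url='

def proxied(url):
    # Percent-encode the target so its own '?' and '&'
    # aren't mistaken for ScraperAPI query parameters.
    return scraperapi + quote(url, safe='')

print(proxied('https://books.toscrape.com/catalogue/page-1.html'))
```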
Verify the setup by requesting the first catalogue page (page-1.html lists the same books as the homepage):
```python
response = requests.get(scraperapi + 'https://books.toscrape.com/catalogue/page-1.html')
print(response.status_code)
```
A 200 status code confirms it works. ScraperAPI is handling the request successfully.
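Even through a proxy, the occasional request can still fail, so it's worth wrapping calls in a small retry helper. A minimal sketch; `fetch_with_retries` and its parameters are our own invention, not a ScraperAPI or Requests feature:

```python
import time

def fetch_with_retries(get, url, max_retries=3, backoff=1.0):
    # Call get(url) until it returns a 200, waiting a little longer
    # after each failed attempt (simple linear backoff).
    for attempt in range(max_retries):
        resp = get(url)
        if resp.status_code == 200:
            return resp
        time.sleep(backoff * (attempt + 1))
    raise RuntimeError(f'giving up on {url} after {max_retries} attempts')
```

In the scraper you would call it as `fetch_with_retries(requests.get, scraperapi + target_url)`.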
Use Python's range() function with f-strings to iterate through pages:
```python
for x in range(1, 6):
    response = requests.get(scraperapi + f'https://books.toscrape.com/catalogue/page-{x}.html')
    print(f'Request {x}. Status: {response.status_code}')
```
This loops through pages 1-5, confirming each request succeeds.
BeautifulSoup transforms raw HTML into navigable parse trees:
```python
soup = BeautifulSoup(response.content, 'html.parser')
```
Now you can target specific elements using HTML tags and CSS selectors. Each book appears in a card element with class col-xs-6:
```python
onpage_books = soup.find_all('li', class_='col-xs-6')
print(len(onpage_books))
```
This returns 20 books per page, exactly what we expect.
Each book card contains an <a> tag with the book URL in its href attribute:
```python
for books in onpage_books:
    print(books.find('a')['href'])
```
This returns relative URLs like a-light-in-the-attic_1000/index.html. Prepend the base URL to make complete requests:
```python
# Still inside the loop over onpage_books
r = requests.get(scraperapi + 'https://books.toscrape.com/catalogue/' + books.find('a')['href'])
s = BeautifulSoup(r.content, 'html.parser')
```
Now parse individual book pages. Start with the title:
```python
print(s.find('h1').text)
```
For the description, multiple <p> tags exist without unique classes. Use indexing to select the correct one:
```python
s.select('p')[3].text
```
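Positional indexing is fragile: it breaks if the markup shifts, and a handful of books on the site have no description at all. A sturdier option is anchoring on the `product_description` div that precedes the paragraph. Here's a sketch against a trimmed stand-in for the page's structure:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for a book page; the real page has more markup,
# but the #product_description anchor is the same.
html = '''
<p class="price_color">£51.77</p>
<div id="product_description"><h2>Product Description</h2></div>
<p>It's hard to imagine a world without A Light in the Attic.</p>
'''
s = BeautifulSoup(html, 'html.parser')

anchor = s.find('div', id='product_description')
# The description is the paragraph right after the anchor div;
# guard against books that have no description section.
description = anchor.find_next_sibling('p').text if anchor else ''
print(description)
```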
Other elements use specific classes or require table navigation:
Price: s.find('p', class_='price_color').text
Category: s.find('ul', class_='breadcrumb').select('li')[2].text.strip()
Availability: s.select_one('p.availability').text.strip()
UPC: s.find('table').find_all('tr')[0].find('td').text
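The UPC lookup only reads the first row, but the same product-information table also lists tax-inclusive prices, availability counts, and review totals. You can fold the whole table into a dictionary instead; sketched here against a trimmed copy of the table's markup:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the product-information table's structure.
html = '''
<table class="table table-striped">
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Price (excl. tax)</th><td>£51.77</td></tr>
  <tr><th>Availability</th><td>In stock (22 available)</td></tr>
</table>
'''
s = BeautifulSoup(html, 'html.parser')

# Each row pairs a <th> label with a <td> value.
info = {row.find('th').text: row.find('td').text
        for row in s.find('table').find_all('tr')}
print(info['UPC'])
```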
Create an empty list at the start:
```python
all_books = []
```
Append structured data for each book:
```python
all_books.append({
    'Name': s.find('h1').text,
    'Description': s.select('p')[3].text,
    'Price': s.find('p', class_='price_color').text,
    'Category': s.find('ul', class_='breadcrumb').select('li')[2].text.strip(),
    'Availability': s.select_one('p.availability').text.strip(),
    'UPC': s.find('table').find_all('tr')[0].find('td').text
})
```
Transform the list into a Pandas DataFrame:
```python
df = pd.DataFrame(all_books)
print(df)
```
Export to CSV for further analysis:
```python
df.to_csv('book_data.csv', index=False)
```
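One post-processing step you'll likely want: Price comes through as text like "£51.77", so any numeric work needs a cleaning pass first. A sketch on sample rows standing in for the scraped data:

```python
import pandas as pd

# Sample rows standing in for the scraped DataFrame.
df = pd.DataFrame([
    {'Name': 'A Light in the Attic', 'Price': '£51.77'},
    {'Name': 'Tipping the Velvet', 'Price': '£53.74'},
])

# Strip the currency symbol and cast to float for arithmetic.
df['Price'] = df['Price'].str.lstrip('£').astype(float)
print(df['Price'].mean())
```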
Your structured dataset is ready. You can now analyze pricing patterns by category, identify out-of-stock items, or mine the descriptions for insights.
With this dataset, you could:
Find which categories command the highest prices
Analyze correlations between ratings and pricing, once you also capture each book's star rating (the page exposes it in a star-rating class you could add to the dictionary)
Run sentiment analysis on descriptions to identify patterns in highly-rated books
Automate inventory alerts for out-of-stock items
Build recommendation engines based on multiple variables
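The first item on that list, for instance, is a one-liner once prices are numeric; sketched on a few sample rows:

```python
import pandas as pd

# Sample rows with prices already converted to floats.
df = pd.DataFrame([
    {'Category': 'Poetry', 'Price': 51.77},
    {'Category': 'Poetry', 'Price': 23.88},
    {'Category': 'Travel', 'Price': 45.17},
])

# Average price per category, highest first.
avg_price = df.groupby('Category')['Price'].mean().sort_values(ascending=False)
print(avg_price)
```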
You've just built a complete web scraping system that transforms scattered web data into structured datasets. Change the range to (1, 51) and you'll extract data from all 1000 books in minutes.
But here's the reality: scraping at scale means sending thousands of requests. Without proper infrastructure handling IP rotation, headers, and anti-bot measures, you'd get blocked after a handful of attempts. The technical complexity of maintaining successful request rates across different sites quickly becomes overwhelming.
That's exactly why having robust scraping infrastructure matters. When you can rely on automatic IP management and intelligent request handling, you focus on extracting insights instead of debugging connection failures. 👉 Get the infrastructure that handles the complexity so you can focus on building valuable datasets
Imagine applying this same approach to Amazon's millions of product pages, or any other data-rich platform. The possibilities are endless when you control your own data collection pipeline.