The purpose of this lab is to learn how to perform web scraping and extract links from a website using Python. We will employ the Breadth-First Search (BFS) technique to traverse a website and collect internal links.
Basic understanding of Python.
Knowledge of web scraping concepts.
Familiarity with data structures such as queues.
Installed Python libraries: requests, BeautifulSoup4.
To install required libraries, run the following command:
pip install requests beautifulsoup4
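Once the libraries are installed, the code snippets in this lab assume the following imports at the top of your script: requests and BeautifulSoup for fetching and parsing pages, deque for the BFS queue, and time for adding delays between requests.

import time
from collections import deque

import requests
from bs4 import BeautifulSoup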
Web scraping using requests and BeautifulSoup.
Validating and extracting internal links.
Implementing BFS traversal to visit website pages systematically.
Writing extracted links to a file.
Handling exceptions and adding delays to avoid overwhelming the server.
Write a function to validate a link before it is scraped. The simplest check is to confirm that the link belongs to the website being scraped, i.e. priceoye.pk.
def is_valid_link(link):
    """Check if the link is valid and should be followed (e.g., avoid external links)."""
    return link and link.startswith("https://priceoye.pk")
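For illustration, the check keeps internal links and rejects anything else; the /mobiles path below is only a hypothetical example URL:

print(is_valid_link("https://priceoye.pk/mobiles"))   # True: internal link, safe to follow
print(is_valid_link("https://www.example.com/page"))  # False: external link, skipped
print(is_valid_link(None))                            # Falsy (None): missing links are skipped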
Let us write a function that fetches a webpage, parses it using BeautifulSoup, and extracts all valid internal links.
We use requests to download the page and collect the links it contains; these links will later be visited in turn during the traversal:
def get_links_from_page(url):
    """Extract all valid links from the given webpage."""
    try:
        response = requests.get(url, timeout=10)  # Timeout so a dead link cannot hang the program
        if response.status_code != 200:
            return set()
The response contains raw HTML markup and other elements we are not interested in as-is, so we parse it with BeautifulSoup's html.parser to work with the document structure and collect links into a set:
        soup = BeautifulSoup(response.text, 'html.parser')
        links = set()
The following loop iterates over every anchor (<a>) tag on the page, converts relative URLs into absolute ones, and keeps only the links that pass our validation check:
        for a_tag in soup.find_all('a', href=True):
            href = a_tag['href']
            if href.startswith('/'):
                href = "https://priceoye.pk" + href
            if is_valid_link(href):
                links.add(href)
Some requests may fail because of broken links, timeouts, or network errors, and an unhandled failure would crash the program. So we return the collected links and catch any exceptions:
        return links
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return set()
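As a quick sanity check (assuming the site is reachable when you run it), you can call the function on the homepage and see how many internal links it returns:

homepage_links = get_links_from_page("https://priceoye.pk")
print(f"Found {len(homepage_links)} internal links on the homepage")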
def bfs_traverse_website(start_url, max_depth=3):
    """Perform a breadth-first search (BFS) on the website starting from start_url."""
    queue = deque([(start_url, 0)])  # Each queue entry is a (url, depth) pair
    visited = set()
    with open("priceoyelinks.txt", "w") as f:
        while queue:
            current_url, depth = queue.popleft()
            if current_url in visited or depth > max_depth:
                continue
            visited.add(current_url)
            print(f"Visiting: {current_url} (Depth: {depth})")
            f.write(f"{current_url}\n")
            links = get_links_from_page(current_url)
            for link in links:
                if link not in visited:
                    queue.append((link, depth + 1))
            time.sleep(1)  # Pause to avoid server overload
The BFS algorithm uses a queue to systematically visit pages and extract links, ensuring that all reachable pages are explored up to the specified depth.
if __name__ == "__main__":
    start_url = "https://priceoye.pk"  # Replace with your starting URL
    bfs_traverse_website(start_url, max_depth=3)
The script starts the BFS traversal from the given URL and extracts links up to a depth of 3.
BFS traversal ensures that all reachable internal links are extracted systematically.
Adding a delay (time.sleep(1)) between requests prevents overwhelming the server; a randomized variant is sketched after this list.
Writing the extracted links to a file (priceoyelinks.txt) helps in further analysis.
Error handling ensures that the script continues running even if some pages fail to load.
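A common refinement, not required for this lab, is to randomize the delay slightly so that requests are not sent at perfectly regular intervals. A minimal sketch using the standard random module:

import random
import time

def polite_sleep(base=1.0, jitter=0.5):
    """Sleep for base seconds plus a small random extra to vary request timing."""
    time.sleep(base + random.uniform(0, jitter))

You would call polite_sleep() in place of time.sleep(1) inside the BFS loop.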
Web scraping for price comparison websites.
Crawling and indexing webpages for search engines.
Extracting data for research purposes.
Identifying new product pages or updates on e-commerce sites.
Some websites disallow scrapers through robots.txt rules or block them with CAPTCHAs.
The script only extracts links but does not fetch additional data like prices or descriptions.
The BFS traversal depth should be chosen wisely to balance between coverage and efficiency.
Modify the script to extract additional data like product names and prices.
Implement multi-threading to speed up the crawling process.
Store extracted data in a structured format (e.g., CSV, JSON, database).
Respect robots.txt and avoid scraping restricted pages (see the sketch after this list).
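As a starting point for the last exercise, Python's standard library provides urllib.robotparser for reading robots.txt rules. The sketch below assumes the site publishes its rules at /robots.txt and checks a single URL against them:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://priceoye.pk/robots.txt")
rp.read()

# Only crawl a URL if robots.txt allows it for a generic user agent
url = "https://priceoye.pk"
if rp.can_fetch("*", url):
    print(f"Allowed to crawl: {url}")
else:
    print(f"robots.txt disallows crawling: {url}")

In the full crawler, this check would go just before each URL is fetched or added to the queue.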
This lab introduced BFS-based web scraping, demonstrating how to systematically traverse a website and extract internal links. The concepts and code presented here can be extended to build more sophisticated web crawlers for various applications.
What are the advantages of BFS over DFS for web crawling?
How can you modify the script to extract product prices along with links?
What are the ethical considerations when performing web scraping?
How would you implement multi-threading to improve the performance of this scraper?
What challenges might arise when crawling a large e-commerce website?