Alpana A. Borse - 2.2.1

Welcome to Foundation of Data Science Laboratory

Program 2:

Step 1: Install Required Libraries

pip install requests beautifulsoup4 pandas

Step 2: Write the Python Code

from bs4 import BeautifulSoup

import pandas as pd

import requests

# Step 1: Identify the URL of the website with tabular data

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

# Step 2: Send a request to fetch the webpage content

response = requests.get(url, timeout=30)

response.raise_for_status()  # Stop early if the request failed (e.g. 403 or 404)

# Step 3: Parse the webpage content using BeautifulSoup

soup = BeautifulSoup(response.content, 'html.parser')

# Step 4: Locate the table you want to scrape (use inspect tool to find the right table)

table = soup.find('table', {'class': 'wikitable'})

# Step 5: Extract table headers

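headers = []
header_row = table.find('tr')  # First row only; <th> cells in later rows are row labels, not column names
for header in header_row.find_all('th'):
    headers.append(header.text.strip())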

# Step 6: Extract rows of the table

rows = []

for row in table.find_all('tr')[1:]:  # Skip the header row
    cells = row.find_all(['td', 'th'])
    cells = [cell.text.strip() for cell in cells]
    rows.append(cells)

# Step 7: Create a DataFrame using the scraped data

df = pd.DataFrame(rows, columns=headers)

# Step 8: Display the first few rows of the DataFrame

print(df.head())  # In a notebook, a bare df.head() as the last line of a cell also works

Output:

   Country (or dependent territory)     Population  ...  % of world population
0                     China[note 2]  1,412,600,000  ...                 17.76%
1                     India[note 3]  1,366,600,000  ...                 17.56%
2             United States[note 4]    331,449,281  ...                  4.22%
3                 Indonesia[note 5]    276,361,783  ...                  3.51%
4                  Pakistan[note 6]    225,199,937  ...                  2.83%
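The scraped cells are plain strings: the country names still carry footnote markers such as [note 2], and the populations contain thousands separators. A minimal sketch of cleaning such values with the standard library follows. The helper names clean_name and parse_population are hypothetical and not part of the original program.

```python
import re


def clean_name(text):
    """Remove bracketed footnote markers such as '[note 2]' from a cell."""
    return re.sub(r"\[[^\]]*\]", "", text).strip()


def parse_population(text):
    """Convert a string such as '1,412,600,000' to an integer."""
    return int(text.replace(",", ""))


print(clean_name("China[note 2]"))        # China
print(parse_population("1,412,600,000"))  # 1412600000
```

Assuming the column names match the output shown above, the helpers could be applied to the DataFrame with, for example, df["Population"].map(parse_population).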

Requests to Fetch Webpage Content:
- The requests library is used to send a GET request to the URL of the Wikipedia page containing the table.
Parsing HTML with BeautifulSoup:
- The response content is parsed using BeautifulSoup with the 'html.parser' option.
Locating and Extracting the Table:
- The table is located by its class attribute ('wikitable').
- Column names are read from the <th> cells, and each <tr> row is converted to a list of stripped cell strings with a list comprehension.
Storing in a DataFrame:
- The data is stored in a Pandas DataFrame with the headers as column names.
Displaying the Data:
- The first few rows of the DataFrame are displayed using df.head().

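This approach works for simple tables, on Wikipedia or on other websites, in which every row has the same number of cells as the header row. Tables with merged cells (rowspan or colspan) or multi-row headers need extra handling, because the row lengths no longer match the list of column names.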
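The same steps can be wrapped in a reusable function and tried offline on a small HTML string. The sketch below is one possible arrangement, not the program above verbatim: the function name table_to_dataframe and the sample table are made up for illustration, and rows whose cell count does not match the header row are simply skipped, which is an assumption about how to treat merged cells.

```python
from bs4 import BeautifulSoup
import pandas as pd


def table_to_dataframe(html, table_class="wikitable"):
    """Return the first table with the given class as a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": table_class})
    if table is None:
        raise ValueError(f"No table with class {table_class!r} found")

    all_rows = table.find_all("tr")
    # Column names come from the first row only.
    headers = [cell.text.strip() for cell in all_rows[0].find_all(["th", "td"])]

    rows = []
    for row in all_rows[1:]:
        cells = [cell.text.strip() for cell in row.find_all(["td", "th"])]
        # Keep only rows that line up with the column names.
        if len(cells) == len(headers):
            rows.append(cells)
    return pd.DataFrame(rows, columns=headers)


# Made-up sample table for illustration; not real data.
sample_html = """
<table class="wikitable">
  <tr><th>Name</th><th>Count</th></tr>
  <tr><td>Alpha</td><td>10</td></tr>
  <tr><td>Beta</td><td>20</td></tr>
  <tr><td colspan="2">Total: 30</td></tr>
</table>
"""

df = table_to_dataframe(sample_html)
print(df)
```

The last sample row spans two columns, so it contains a single cell and is left out of the DataFrame. For a live page, the text returned by requests (response.text) can be passed as the html argument.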