Welcome to Foundation of Data Science Laboratory
Program 2:
Step 1: Install Required Libraries
pip install requests beautifulsoup4 pandas
Step 2: Write the Python Code
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Step 1: Identify the URL of the website with tabular data
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
# Step 2: Send a request to fetch the webpage content
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early if the server returned an error status
# Step 3: Parse the webpage content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
# Step 4: Locate the table you want to scrape (use inspect tool to find the right table)
table = soup.find('table', {'class': 'wikitable'})
# Step 5: Extract table headers (the <th> cells of the first row only)
headers = []
for header in table.find('tr').find_all('th'):
    headers.append(header.text.strip())
# Step 6: Extract rows of the table
rows = []
for row in table.find_all('tr')[1:]:  # Skip the header row
    cells = row.find_all(['td', 'th'])
    cells = [cell.text.strip() for cell in cells]
    rows.append(cells)
# Step 7: Create a DataFrame using the scraped data
df = pd.DataFrame(rows, columns=headers)
# Step 8: Display the first few rows of the DataFrame
print(df.head())  # print() is needed when running as a script rather than in a notebook
Output:
   Country (or dependent territory)     Population  ...  % of world population
0                     China[note 2]  1,412,600,000  ...                 17.76%
1                     India[note 3]  1,366,600,000  ...                 17.56%
2             United States[note 4]    331,449,281  ...                  4.22%
3                 Indonesia[note 5]    276,361,783  ...                  3.51%
4                  Pakistan[note 6]    225,199,937  ...                  2.83%
Requests to Fetch Webpage Content:
The requests library is used to send a GET request to the URL of the Wikipedia page containing the table.
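A GET request can hang or come back with an error page, so it may help to set a timeout and check the status before parsing. The sketch below shows one way to do this. The helper name fetch_html, the 10-second timeout, and the User-Agent string are illustrative assumptions and are not part of the original program.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising on network or HTTP errors."""
    # Hypothetical User-Agent string; some sites reject requests without one.
    headers = {"User-Agent": "fds-lab-scraper/0.1 (educational use)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

The returned string can be passed straight to BeautifulSoup, for example BeautifulSoup(fetch_html(url), 'html.parser').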
Parsing HTML with BeautifulSoup:
The response content is parsed using BeautifulSoup with the 'html.parser' option.
Locating and Extracting the Table:
The table is located by its class attribute ('wikitable').
Header cells (<th>) and rows (<tr>) are extracted with for loops, and the text of each cell is stripped of surrounding whitespace.
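The locate-and-extract step can be tried without any network access by parsing a small HTML string. The following is a minimal sketch. The HTML snippet and its values are made up for illustration, and the header names are taken from the first row only, which assumes the table has a single header row.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (values are illustrative only).
html = """
<table class="wikitable">
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td> Aland </td><td>1,000</td></tr>
  <tr><td>Borduria</td><td>2,500</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable"})

all_rows = table.find_all("tr")
# Header names come from the <th> cells of the first row only.
headers = [th.get_text(strip=True) for th in all_rows[0].find_all("th")]
# Each remaining row becomes a list of stripped cell strings.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
    for tr in all_rows[1:]
]

print(headers)
print(rows)
```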
Storing in a DataFrame:
The data is stored in a Pandas DataFrame with the headers as column names.
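Every scraped value arrives as a string, so columns such as the population usually need cleaning before they can be used in calculations. The sketch below shows one possible cleanup on two rows copied from the sample output. The shortened column names (Country, Population, Share) are assumptions chosen for readability.

```python
import pandas as pd

# Two rows shaped like the scraped output; column names are shortened here.
df = pd.DataFrame(
    [["China[note 2]", "1,412,600,000", "17.76%"],
     ["India[note 3]", "1,366,600,000", "17.56%"]],
    columns=["Country", "Population", "Share"],
)

# Drop bracketed footnote markers such as "[note 2]".
df["Country"] = df["Country"].str.replace(r"\[.*?\]", "", regex=True).str.strip()
# Remove thousands separators, then convert to integers.
df["Population"] = df["Population"].str.replace(",", "", regex=False).astype("int64")
# Remove the percent sign, then convert to floats.
df["Share"] = df["Share"].str.rstrip("%").astype(float)

print(df.dtypes)
```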
Displaying the Data:
The first few rows of the DataFrame are displayed using df.head().
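This approach works for simple tables, on Wikipedia and elsewhere, that have a single header row and no merged cells. Tables that use rowspan or colspan, or that spread their headers across several rows, produce rows of uneven length and need extra handling before they fit into a DataFrame.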
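When the scraped rows do not all have the same number of cells as the header, pd.DataFrame(rows, columns=headers) raises an error. One simple workaround, sketched below, is to pad short rows and trim long ones. The helper normalize_rows is hypothetical, and it does not reconstruct the true layout of merged cells; it only makes the row lengths consistent.

```python
import pandas as pd


def normalize_rows(rows, width, fill=None):
    """Pad or trim each row so that it has exactly `width` cells."""
    fixed = []
    for row in rows:
        if not row:
            continue  # skip rows with no cells at all
        fixed.append((list(row) + [fill] * width)[:width])
    return fixed


# Made-up rows of uneven length, for illustration only.
headers = ["Country", "Population", "Share"]
rows = [["A", "1", "50%"], ["B", "2"], [], ["C", "3", "25%", "extra"]]

df = pd.DataFrame(normalize_rows(rows, len(headers)), columns=headers)
print(df)
```

Padding keeps every row but can shift values into the wrong column when a merged cell sits in the middle of a row, so the result should be checked against the original page.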