WEB SCRAPING
Project Code: WSXX
Mohd Taiyab Khan
Background
The project involved scraping data about colleges from various sources on the internet and storing it in an Excel sheet. The data included college names, campus images, and logo images, which were combined with previously scraped data. The scraping was done in Python using libraries such as Beautiful Soup, Requests, and Pandas: the data was first loaded into a Pandas DataFrame and then exported to CSV for ease of use.
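As a rough illustration of that pipeline, the sketch below fetches one listing page, pulls college names with Beautiful Soup, and writes them to CSV with Pandas. The URL and the CSS selector are placeholders rather than the actual sources or page structure used in the project.

```python
# Minimal sketch of the scrape -> DataFrame -> CSV flow described above.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://example.com/colleges"  # placeholder listing page, not an actual source
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# Hypothetical selector; each real source site needs its own.
names = [tag.get_text(strip=True) for tag in soup.select("h3.college-name")]

df = pd.DataFrame({"college_name": names})
df.to_csv("colleges.csv", index=False)  # later combined with previously scraped data
```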
The project was commissioned by Gurucool, an educational technology company that offers services such as online courses, career counseling, and college admissions assistance. Gurucool was interested in building a comprehensive database of colleges that could be used by its students and clients to make informed decisions about their education and career paths. The scraped data was intended to be used to populate this database.
The project was challenging because the data had to be scraped from multiple sources on the internet, which varied in terms of structure and format. The data also had to be cleaned and processed to ensure that it was accurate and useful. Additionally, the project had to be completed within a tight deadline, as Gurucool was eager to launch its new database as soon as possible.
Resources
Python
Beautiful Soup
Excel
Requests
Pandas
Team Updates
Get the features to extract from the CEO
Figure out if we need to extract the whole data again or just a small part
Write a basic web scraping script
IP Blocked by CollegeDunia (5 Times); see the mitigation sketch after these updates
Error Analysis of Code
Error Solved
Extracted the Data
Cleaned the Data
Stored & Shared with The CEO
Plan of Action
Define the scope and objectives of the project:
Identify the target audience for the data.
Determine what types of colleges to include in the data.
Decide on the level of detail to be included in the data.
Gather data (see the sketch after this step):
Identify websites with relevant data.
Use web scraping tools like Beautiful Soup and Requests to extract data from the websites.
Store the data in a structured format like CSV or Excel.
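The sketch referenced in the Gather data step shows one way to pull a campus image URL and a logo URL from a single college page and store the record in Excel. The example.com URL and CSS selectors are placeholders, and writing .xlsx files assumes the openpyxl package is installed alongside Pandas.

```python
# Sketch of extracting image URLs from a college detail page and saving to Excel.
# Selectors and the URL are placeholders for whatever the real source pages use.
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://example.com/college/sample"  # placeholder detail page
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

title = soup.select_one("h1")
campus_img = soup.select_one("img.campus-photo")  # hypothetical selector
logo_img = soup.select_one("img.college-logo")    # hypothetical selector

record = {
    "college_name": title.get_text(strip=True) if title else None,
    "campus_image_url": urljoin(url, campus_img["src"]) if campus_img and campus_img.get("src") else None,
    "logo_image_url": urljoin(url, logo_img["src"]) if logo_img and logo_img.get("src") else None,
}

# Store the structured record; Excel output requires openpyxl.
pd.DataFrame([record]).to_excel("colleges.xlsx", index=False)
```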
Clean and transform the data (see the sketch after this step):
Remove duplicate entries.
Handle missing or incomplete data.
Normalize the data by standardizing column names and formats.
Merge and combine datasets as necessary.
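A minimal Pandas sketch of this cleaning step follows; the file names and column names are placeholders, and it assumes the two files share only the college_name column.

```python
# Sketch of the cleaning and merging step; file and column names are placeholders.
import pandas as pd

new_data = pd.read_csv("colleges.csv")            # newly scraped fields
old_data = pd.read_csv("previously_scraped.csv")  # earlier scrape to combine with

# Normalize column names so the two datasets line up.
for df in (new_data, old_data):
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Merge the new scrape with the earlier one, keyed on college name
# (assumes college_name is the only shared column).
combined = old_data.merge(new_data, on="college_name", how="outer")

# Remove duplicate entries and handle missing values.
combined = combined.drop_duplicates(subset="college_name")
combined["campus_image_url"] = combined["campus_image_url"].fillna("not available")

combined.to_csv("colleges_clean.csv", index=False)
```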
Analyze and visualize the data (see the sketch after this step):
Use statistical techniques to identify trends and patterns in the data.
Create charts and graphs to visualize the data and highlight insights.
Generate descriptive statistics to summarize the data.
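The analysis step could look like the sketch below, using Pandas for descriptive statistics and Matplotlib for a simple chart; the state column it groups on is a hypothetical example of a field the dataset might contain.

```python
# Sketch of basic analysis and a chart; the 'state' column is hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("colleges_clean.csv")

# Descriptive statistics summarizing the numeric columns.
print(df.describe())

# Example pattern: number of colleges per state, shown as a bar chart.
counts = df["state"].value_counts().head(10)
counts.plot(kind="bar", title="Top 10 states by number of colleges")
plt.xlabel("State")
plt.ylabel("Colleges")
plt.tight_layout()
plt.savefig("colleges_by_state.png")
```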
Share the data (see the sketch after this step):
Publish the data in a user-friendly format.
Make the data available for download in various formats.
Provide documentation and a data dictionary to help users understand the data.
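For the sharing step, the sketch below exports the cleaned dataset in a few common formats and generates a bare-bones data dictionary; the descriptions are placeholders to be filled in by whoever documents the dataset.

```python
# Sketch of publishing the dataset in multiple formats plus a simple data dictionary.
import pandas as pd

df = pd.read_csv("colleges_clean.csv")

# Make the data available for download in several formats.
df.to_csv("colleges_final.csv", index=False)
df.to_excel("colleges_final.xlsx", index=False)  # requires openpyxl
df.to_json("colleges_final.json", orient="records", indent=2)

# Minimal data dictionary: one row per column with its type and a description placeholder.
dictionary = pd.DataFrame({
    "column": df.columns,
    "dtype": [str(t) for t in df.dtypes],
    "description": ["TODO: describe this field"] * len(df.columns),
})
dictionary.to_csv("data_dictionary.csv", index=False)
```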
Future Plan
Expand the data: Gather additional data on colleges such as admission rates, acceptance rates, student demographics, and more. This will provide a more comprehensive view of each institution and allow for more detailed analysis.
Build a recommendation engine: Use the collected data to build a recommendation engine that helps students find the right college for their preferences and needs, considering factors such as location, cost, size, and academic programs. A rough scoring sketch follows this list.
Analyze trends: Analyze the data collected over time to identify trends and patterns in college admissions, student demographics, and other factors. This analysis can provide valuable insights into the higher education landscape and inform policy decisions.
Improve data visualization: Create interactive data visualizations to help users explore the data and gain insights. This can include heat maps, scatter plots, and other visualization techniques.
Collaborate with schools and organizations: Collaborate with schools and organizations to gather more data and validate the accuracy of the data collected. This can also provide opportunities for partnerships and collaborations.
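As a very rough sketch of the recommendation idea above, the snippet below scores each college against a student's stated preferences and returns the best matches. The column names (state, fees, programs), the example preferences, and the equal weighting are all hypothetical and would need tuning against the real dataset.

```python
# Rough sketch of a preference-based recommender; columns, weights, and values are hypothetical.
import pandas as pd

df = pd.read_csv("colleges_clean.csv")

preferences = {"state": "Karnataka", "max_fees": 200000, "program": "Computer Science"}

def score(college: pd.Series) -> int:
    """Give one point per matched preference; higher means a better fit."""
    points = 0
    if college.get("state") == preferences["state"]:
        points += 1
    if college.get("fees", float("inf")) <= preferences["max_fees"]:
        points += 1
    if preferences["program"].lower() in str(college.get("programs", "")).lower():
        points += 1
    return points

df["match_score"] = df.apply(score, axis=1)
recommendations = df.sort_values("match_score", ascending=False).head(10)
print(recommendations[["college_name", "match_score"]])
```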