Using Python
Originally, a friend and I were trying to build a database and some sort of search engine for our campus faculty, but we needed the data first.
For that, I set out to write a fully functioning scraping script using Python's requests and bs4 libraries.
The script sends a request to the target website, scrapes the required data by walking through the relevant HTML elements, and dumps the data into JSON format. That way, we get a clean JSON object for future use.
The starting part is always the most crucial. Once the base code is established, it can be developed further and made more efficient to work with.
First, we need to look at the source of the data: the website, of course. You can visit it at https://www.ruet.ac.bd/
Now, we need faculty data, which is stored on the individual department pages, such as https://www.cse.ruet.ac.bd/ (Department of CSE).
Now, if we inspect the HTML elements for each faculty member shown, we can see where our desired data is stored.
As we can see, the data for each faculty member sits inside a <tr> tag. Navigating through that tag, we can see that the column-wise data actually lives in individual <td> tags. We want the name (in English), the href linked with the name, the designation, the department, and the phone and office contacts.
Time to write some code.
I used the famous BeautifulSoup, a Python library for pulling data out of HTML and XML files. Read the full documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
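As a quick taste of what the library does, here is a toy example (a throwaway HTML snippet, not the faculty page itself):

```python
from bs4 import BeautifulSoup

# A throwaway bit of HTML, just to show the idea
soup = BeautifulSoup("<td><a href='/profile'>Dr. Example</a></td>", "lxml")
print(soup.a.text)       # Dr. Example
print(soup.a["href"])    # /profile
```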
First, we request a response from the target website URL using the requests module. If the response's status_code is 200 (requests displays such a response object as <Response [200]>), everything is okay, and we can get the page's HTML text using dot notation and store it. That way we get the HTML tags in the source object.
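A minimal sketch of that step, using the CSE department URL from earlier:

```python
import requests

URL = "https://www.cse.ruet.ac.bd/"

response = requests.get(URL)
if response.status_code == 200:    # 200 means the request succeeded
    source = response.text         # raw HTML of the page
else:
    raise SystemExit(f"Request failed with status {response.status_code}")
```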
Next, we parse the source object with the lxml parser; one can use the built-in 'html.parser' as well. Now begin the test runs: we calibrate the next lines until we reach our target elements. We already know our data is stored in <td> tags within <tr> tags, so we search for the <tr> tags first using the find_all() method. Looping through all the <tr> tags shows that our required data starts at the second <tr> and runs to the last one, which is exactly what the exploratory loop checks. Then we dive into the trs variable: for every <tr>, we need all of its <td> tags.
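Continuing the sketch, the parsing and the exploratory loop might look like this:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(source, "lxml")

trs = soup.find_all("tr")      # every table row on the page
print(len(trs))                # how many rows did we get?
for tr in trs:
    print(tr.text[:60])        # peek at each row to spot where faculty data begins
```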
Next, we navigate through a single <tr> tag (trs[2] in this case, meaning index 2, the third element of trs). All of our target data is in text form except the href, which is actually an attribute of the <a> tag nested in the corresponding <td>. So we need two operations: getting the value of an attribute from a tag, and getting the text out of a tag.
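In code, picking apart one row could look like this; the column indices are my assumption from inspecting the page, so they may need adjusting:

```python
row = trs[2]                   # index 2, the third element of trs
tds = row.find_all("td")

# Column order below is an assumption; adjust to match the real page
name = tds[1].text
href = tds[1].a["href"]        # the link sits on the <a> tag inside the cell
designation = tds[2].text
department = tds[3].text
contact = tds[4].text
```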
Now we look at the outputs and clean them up accordingly. The raw extraction prints some ugly output: a field is often an empty string, or padded with lots of whitespace. To address the empty strings we use an if/else check on each output, and we use the strip() method to strip away the unnecessary whitespace. Voila! We have extracted data from the target URL; printing those variables shows the data in the console.
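One way to fold both fixes into a small helper (a sketch; the original script may inline these checks instead):

```python
def clean(text):
    """Strip stray whitespace; fall back to None for empty fields."""
    text = text.strip()
    return text if text else None

print(clean(name), clean(designation), clean(department), clean(contact))
```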
This code can only go so far yet. For instance, it is hard-coded for one particular department; to get another department's data, we would have to rewrite the link. Instead, we can use a for loop to run through all 18 departments, as sketched below.
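Judging by the CSE link, each department lives on its own subdomain, so the URLs can be generated from a list of department codes. The codes below are illustrative guesses, not the verified list of all 18:

```python
# Illustrative subdomain codes; the real list covers all 18 departments
DEPARTMENTS = ["cse", "eee", "me", "ce", "ete"]

for dept in DEPARTMENTS:
    url = f"https://www.{dept}.ruet.ac.bd/"
    print(url)   # each URL would get the same treatment as the CSE page above
```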
Again, this code will be ugly to work with in the future. We can rewrite it by splitting it into multiple sections, for instance by creating a class and methods to work with. So, let's get into it.
We used the BeautifulSoup class to create a soup object. Keeping that in mind, let's divide the code into multiple sections:
Creating a BeautifulSoup object from the target URL
Getting the web data chunk
Extracting and processing the target data from that chunk
Storing the data locally in JSON format
Running it for all departments
We can create a class of our own that inherits from BeautifulSoup so that we can tweak it as much as we like. We can also convert the code that gets the web data into a method, and that method can call static helper methods that process the chunked data. Then another portion of the code loops through all the departments: for each one it instantiates our class with the corresponding URL, calls the web-data-getting method, and stores the returned data in JSON format. That should look cleaner, right? For that we need to do these:
Create a loop for all departments
Write code to generate the corresponding URL links
Create our own class, inheriting from BeautifulSoup
Create a get_web_data() method
Move our web-data-getting code into that method
Create static methods for all the data processing
Call those static methods from within get_web_data()
Lastly, write code to store the data locally in JSON format
We can follow these steps and get to a cleaner, prettier version of our original code!
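Here is a condensed sketch of what that structure might look like. The class name and helper names are my own placeholders, and the column indices remain assumptions, so treat this as an outline rather than the project's actual code:

```python
import json

import requests
from bs4 import BeautifulSoup


class FacultyScraper(BeautifulSoup):
    """Our own soup subclass (the class name is a placeholder)."""

    def __init__(self, url):
        source = requests.get(url).text      # fetch the department page
        super().__init__(source, "lxml")     # parse it into a soup
        self.url = url

    def get_web_data(self):
        """Return one dict per faculty member found in the table rows."""
        members = []
        for tr in self.find_all("tr")[1:]:   # skip the header row
            tds = tr.find_all("td")
            if len(tds) < 5:                 # not a faculty row
                continue
            members.append({
                # column indices are assumptions; adjust to the real page
                "name": self._clean(tds[1].text),
                "href": tds[1].a["href"] if tds[1].a else None,
                "designation": self._clean(tds[2].text),
                "department": self._clean(tds[3].text),
                "contact": self._clean(tds[4].text),
            })
        return members

    @staticmethod
    def _clean(text):
        text = text.strip()
        return text if text else None


# Illustrative department codes; the real list has 18 entries
DEPARTMENTS = ["cse", "eee", "me"]

for dept in DEPARTMENTS:
    scraper = FacultyScraper(f"https://www.{dept}.ruet.ac.bd/")
    with open(f"{dept}.json", "w") as f:     # one JSON file per department
        json.dump(scraper.get_web_data(), f, indent=2)
```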
The awaited output?
Department JSON file
All department JSON files
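For reference, each department file would hold entries shaped roughly like this (illustrative placeholder values, not real faculty data):

```json
[
  {
    "name": "Dr. Example Name",
    "href": "https://www.cse.ruet.ac.bd/example-profile",
    "designation": "Professor",
    "department": "Computer Science & Engineering",
    "contact": "+880-XXXXXXXXXX"
  }
]
```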
Contact ahnaftanjid19@gmail.com for more information on the project.