ARIJIT GUPTA - Web Scraping (Flipkart)

Web Scraping on Flipkart Mobile Data

A method to get data from a website and prepare the scraped data for analysis

Data collection is the first task in any data analysis. Nowadays, data broadly comes in two forms. In the first, the data is already in an intact format that is ready to load into any programming language. In the second, the data is spread across different websites and has to be retrieved in a systematic way. Many firms need this second kind of data for their analysis, and it can be collected from the targeted website.

Data can be retrieved from a website in three ways: text pattern matching, an API interface, and DOM parsing.

The DOM parsing approach is based on how web browsers work: a program retrieves the content of the web page, and the data is then extracted from that content by understanding the structure and the CSS selectors of the page.
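As a rough illustration of DOM parsing with CSS selectors, the sketch below parses a small, made-up HTML fragment with rvest and extracts text by selector. The class names "product", "name" and "price" are hypothetical and do not come from the Flipkart page; html_elements() assumes rvest 1.0 or later (html_nodes() is the older equivalent).

```r
library(rvest)

# A tiny, made-up HTML fragment standing in for a product listing page.
# The class names are hypothetical, not Flipkart's real ones.
page <- read_html('
  <div class="product"><span class="name">Phone A (Black, 64 GB)</span><span class="price">9,999</span></div>
  <div class="product"><span class="name">Phone B (Blue, 128 GB)</span><span class="price">14,499</span></div>
')

# Select nodes by CSS selector, then pull out their text content
model_text <- page %>% html_elements(".product .name") %>% html_text(trim = TRUE)
price_text <- page %>% html_elements(".product .price") %>% html_text(trim = TRUE)

print(model_text)
print(price_text)
```

On a real page, the selectors would be the ones reported by the browser extension or the browser's inspector.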

Since I do not have strong technical knowledge of HTML or CSS, and my primary intention is to focus on the data, I used a Chrome extension to find the CSS selectors instead of studying the markup of the website. The extension can be found here.

I selected the Flipkart website for my web scraping. To be specific, I scraped the data on mobile phones from a fixed range of 30 pages of the Flipkart website; needless to say, the number of pages can be extended as needed. I was able to retrieve 11 variables with 720 observations each. The analysis was done using the rvest package for R.

The following diagram represents a webpage of Flipkart and the targeted part of the webpage from where data can be retrieved.

The red boxes are the targeted areas of this webpage. I used this information as much as possible to scrape the data. I found 11 variables:

Brand Name
Model Name
Price
ROM Capacity
RAM Capacity
Display Size
Battery Size
Color Name
Star
Ratings
Reviews

Now let me come to the methods I used.

#library

The xml2, rvest, stringr, and dplyr libraries are used for the complete scraping.

#URL creation

I created a vector named "flipkarturl" of length 30, in which I store the addresses of all the web pages.
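A minimal sketch of how such a vector could be built is shown below. The base URL and its "page" query parameter are assumptions for illustration; the actual address should be copied from the browser when paging through the search results.

```r
# Hypothetical base URL: the query string and the "page" parameter
# are assumptions, not taken from the original code.
base_url <- "https://www.flipkart.com/search?q=mobiles&page="

# One address per result page, pages 1 to 30
flipkarturl <- paste0(base_url, 1:30)

print(length(flipkarturl))
print(head(flipkarturl, 2))
```

Extending the scrape to more pages then only means changing the upper bound of the sequence.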

#data collection

I collected the data from these web pages by defining a function. I then combined the data for all pages and all variables into a dataframe named "dataraw". I also extracted the data from the dataframe as character vectors for cleaning.
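The sketch below shows one possible shape for such a function; it is not the exact code used here. The function name scrape_page, the selectors ".title" and ".price", and the two inline pages are all hypothetical. In a real run, each page would come from read_html() applied to an element of the URL vector.

```r
library(rvest)
library(dplyr)

# Hypothetical helper: extracts two variables from one parsed page.
# The selectors ".title" and ".price" are assumptions for illustration.
scrape_page <- function(page) {
  data.frame(
    title = page %>% html_elements(".title") %>% html_text(trim = TRUE),
    price = page %>% html_elements(".price") %>% html_text(trim = TRUE),
    stringsAsFactors = FALSE
  )
}

# Two inline pages stand in for read_html(<page URL>) on the live site
pages <- list(
  read_html('<div><p class="title">Phone A</p><p class="price">9,999</p></div>'),
  read_html('<div><p class="title">Phone B</p><p class="price">14,499</p></div>')
)

# Apply the function to every page and stack the results row-wise
dataraw <- bind_rows(lapply(pages, scrape_page))

# Pull one column out as a character vector for later cleaning
price_raw <- as.character(dataraw$price)
print(dataraw)
```

Note that data.frame() fails if the selectors return different numbers of elements on a page (for example, a product without a price), so a real scraper needs to handle missing fields.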

#data cleaning

There is no fixed method for data cleaning; it depends on the situation. Here, the cleaning included steps such as extracting the variable names and removing the currency sign from the price vector. Through these steps the data was cleaned and brought into a consistent pattern.
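As an example of this kind of cleaning, the sketch below uses stringr on made-up strings. The product titles, their "(colour, storage)" layout, and the single-word brand names are assumptions; the real listing text may need different patterns.

```r
library(stringr)

# Made-up raw strings in the style of a product listing
title_raw <- c("Acme X1 (Black, 64 GB)", "Zenith Y2 (Blue, 128 GB)")
price_raw <- c("\u20b99,999", "\u20b914,499")  # "\u20b9" is the rupee sign

# Drop everything except digits (currency sign, thousands separators),
# then convert to numeric. This assumes prices have no decimal part.
price <- as.numeric(str_remove_all(price_raw, "[^0-9]"))

# First word of the title as brand (assumes single-word brand names)
brand <- word(title_raw, 1)

# Capture colour and storage from the text inside the parentheses
details <- str_match(title_raw, "\\(([^,]+),\\s*(\\d+)\\s*GB\\)")
color  <- details[, 2]
rom_gb <- as.numeric(details[, 3])

print(data.frame(brand, color, rom_gb, price))
```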

#naming data columns

Here I named the data columns after converting the character vectors to the desired types, such as factor and numeric vectors.

#data frame creation

I compiled everything obtained so far into a dataframe named "Flipkart_Mobile_Data".
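The type conversion and the dataframe creation can be sketched together as follows. Only three of the 11 variables are shown, and the values and column names are made up for illustration.

```r
# Made-up cleaned character vectors (three of the variables only)
brand <- c("Acme", "Zenith", "Acme")
price <- c("9999", "14499", "7499")
ram   <- c("4", "6", "3")

# Convert each vector to the desired type and name the columns.
# The column names here are illustrative.
Flipkart_Mobile_Data <- data.frame(
  Brand_Name   = factor(brand),
  Price        = as.numeric(price),
  RAM_Capacity = as.numeric(ram)
)

str(Flipkart_Mobile_Data)
```

Storing categorical variables such as the brand as factors makes later grouping and summarizing more convenient.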

#data pre-processing

After checking and summarizing the data, some pre-processing was required, such as replacing null values and converting units.
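The sketch below illustrates these two kinds of pre-processing with dplyr on made-up data. Which columns had missing values, what they were replaced with, and which units were converted are not stated here, so replacing a missing review count with 0 and converting TB to GB are assumptions chosen only for illustration.

```r
library(dplyr)

# Made-up data with one missing value and mixed storage units
df <- data.frame(
  Model   = c("X1", "Y2", "Z3"),
  ROM     = c("64 GB", "1 TB", "128 GB"),
  Reviews = c(120, NA, 45),
  stringsAsFactors = FALSE
)

df <- df %>%
  mutate(
    # Assumption: a missing review count is treated as zero reviews
    Reviews   = ifelse(is.na(Reviews), 0, Reviews),
    # Numeric part of the storage string, e.g. "64 GB" -> 64
    ROM_value = as.numeric(sub(" .*", "", ROM)),
    # Express everything in GB (1 TB taken as 1024 GB)
    ROM_GB    = ifelse(grepl("TB", ROM), ROM_value * 1024, ROM_value)
  ) %>%
  select(-ROM_value)

summary(df)
```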

#export the data

Finally, the clean, ready-to-use data was exported to a CSV file, which can be used for further analysis.
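A minimal sketch of the export step, using a small made-up dataframe and a temporary file path (the output file name is an assumption):

```r
# Small made-up stand-in for the final dataframe
Flipkart_Mobile_Data <- data.frame(
  Brand_Name = c("Acme", "Zenith"),
  Price      = c(9999, 14499)
)

# Write to a temporary location; the file name is illustrative
out_file <- file.path(tempdir(), "Flipkart_Mobile_Data.csv")
write.csv(Flipkart_Mobile_Data, out_file, row.names = FALSE)

# Read the file back to confirm the export worked
check <- read.csv(out_file)
print(check)
```

Setting row.names = FALSE keeps R's row numbers out of the file, so the CSV opens cleanly in Excel or other tools.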

Here is the final data in an Excel file:

Web_Scraping_on_Flipkart_Mobile_Data_updated

This is how I did my web scraping on Flipkart mobile data. For the complete code, please check GitHub.
