David Gill

Predicting Sales For Multiple Walmart Stores

The goal of the project is to predict sales for each store, department, and date triplet in the test.csv file. Part of the challenge is building a model that accounts for markdowns around the given holidays, where complete historical data is absent.

Delivery 1 - Introduction: Overview on Data Sources and Methodology

Delivery 2 - Exploratory Data Analysis and Similar Research Projects

Delivery 3: EDA and Implementation of ML Algorithms

Delivery 1 - Introduction: Overview on Data Sources and Methodology

Delivery 1: Overview on Data Sources and Methodology .pptx

Historical sales data for 45 Walmart stores located in different regions, each containing a number of departments, was downloaded from Kaggle.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest being the Super Bowl, Labor Day, Thanksgiving, and Christmas. Holiday weeks are weighted five times higher in the evaluation than non-holiday weeks. This information has also been provided.
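The five-fold holiday weighting can be illustrated with a weighted mean absolute error. This is a minimal sketch that assumes the evaluation is an absolute-error metric with a weight of 5 for holiday weeks and 1 otherwise. The function name `weighted_mae` and the sample numbers are hypothetical, chosen only for illustration.

```python
def weighted_mae(actual, predicted, is_holiday, holiday_weight=5.0):
    """Mean absolute error where holiday weeks count holiday_weight times."""
    if not actual:
        raise ValueError("inputs must not be empty")
    if not (len(actual) == len(predicted) == len(is_holiday)):
        raise ValueError("all inputs must have the same length")
    # Assumed weighting: holiday_weight for holiday weeks, 1 for all others.
    weights = [holiday_weight if flag else 1.0 for flag in is_holiday]
    weighted_errors = [
        w * abs(a - p) for w, a, p in zip(weights, actual, predicted)
    ]
    return sum(weighted_errors) / sum(weights)


# Hypothetical weekly sales for three weeks; the second is a holiday week.
actual = [100.0, 200.0, 150.0]
predicted = [110.0, 180.0, 150.0]
is_holiday = [False, True, False]

score = weighted_mae(actual, predicted, is_holiday)
print(score)
```

Under this weighting, an error of 20 in the holiday week contributes as much as an error of 100 in an ordinary week, so a model that misses holiday peaks is penalized heavily.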

GitHub Repository

METHODOLOGY:

To meet this goal, I plan to use:

1) CSV files for storage on the local machine

2) R/Python to perform data cleansing and exploratory data analysis

3) R/Python to build a model for training and testing purposes

4) R/Python to predict the sales

5) PowerBI/Tableau to present the data in meaningful visuals

CONSEQUENCES/USES:

Walmart is one of the oldest shopping marts in the US and now has international branches. It provides employment and services to a huge population. Being able to predict sales for its stores would be highly beneficial and would have the following consequences:

1) Scrupulous management of revenue.

2) Budget forecasting would be more accurate.

3) Stocktaking and inventory control would become more efficient, since it would be known which commodities sell in which stores.

4) Growth of the stores could be planned in an effective manner.

5) Intelligent business decisions could be made ensuring profit for the stores.

Delivery 2 - Exploratory Data Analysis and Similar Research Projects

Delivery 2: EDA and Similar Research Projects .pptx

Exploratory Data Analysis Results

METHODOLOGY:

1) CSV files were saved on my local machine

2) Python was used to perform data cleansing and exploratory data analysis

EDA RESULTS:

1) Stores file contains 45 Rows, which means there are 45 stores

2) It contains 3 Columns/Attributes - Store, Type, Size

3) There are 3 distinct Store Types - A, B, C, with A being the largest and C the smallest

4) The size ranges of the store types do not overlap

5) There is no missing data in this file

6) Features file contains 8190 rows and 12 attributes, showing the features/factors affecting sales

7) Train data contains 421570 records and test data contains 115064 records

8) Sales on holidays were found to be slightly higher than on non-holidays

9) The department with the highest sales lies between 60 and 80
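The holiday comparison in item 8 can be sketched as follows. This is only an illustration of the idea, not the analysis actually run: the column names `Weekly_Sales` and `IsHoliday` are assumptions about the layout of train.csv, and the four sample rows are hypothetical values standing in for the real file.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; column names are assumed, values are made up.
SAMPLE_CSV = """Store,Dept,Date,Weekly_Sales,IsHoliday
1,1,2010-02-05,24000.00,FALSE
1,1,2010-02-12,46000.00,TRUE
1,2,2010-02-05,50000.00,FALSE
1,2,2010-02-12,44000.00,TRUE
"""


def mean_sales_by_holiday(csv_file):
    """Return {is_holiday: mean weekly sales} for the rows in csv_file."""
    sales = defaultdict(list)
    for row in csv.DictReader(csv_file):
        is_holiday = row["IsHoliday"].strip().upper() == "TRUE"
        sales[is_holiday].append(float(row["Weekly_Sales"]))
    return {flag: mean(values) for flag, values in sales.items()}


result = mean_sales_by_holiday(io.StringIO(SAMPLE_CSV))
print(result)
```

To run the same comparison on the real data, the `io.StringIO` object would be replaced by an open file handle, for example `open("train.csv", newline="")`.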

SIMILAR RESEARCHES/ PROJECTS:

1) Electronic Devices Sales Prediction Using Social Media Sentiment Analysis

http://cs229.stanford.edu/proj2012/ZarghamNassirpourNasiri-ElectronicDevicesSalesPredictionUsingSocialMediaSentimentAnalysis.pdf

- In this project they predicted the sales of electronic devices based on the sentiment of comments made about the products on Twitter before their release. The data covered social media content and product sales. Comment sentiment was predicted with a semi-supervised machine learning framework based on recursive autoencoders (RAE) for sentence-level prediction of sentiment label distributions. Using 70/30 cross-validation on this data, they settled on a classification accuracy of 83%. They then used linear regression with four features to predict product sales based on the sentiment analysis.

2) Predicting sales in a food store department using machine learning

http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1108597&dswid=5794

- Their study aimed to compare three machine learning methods for sales prediction in the food industry: Multilayer Perceptron (MLP), Support Vector Machine (SVM) and Radial Basis Function Network (RBFN). Model performance was measured with Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE). The SVM produced lower error measures than the other two methods and was concluded to be the best. The data consisted of sales data provided by a Swedish food company, pertaining to one department in one store from 2012 to 2016. Each validation set consisted of 180 daily sales in a department.

3) Drugs store sales forecast using Machine Learning

https://s3.amazonaws.com/academia.edu.documents/59368319/191_report20190523-80443-dzybc.pdf?response-content-disposition=inline%3B%20filename%3DDrugs_store_sales_forecast_using_Machine.pdf&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWOWYYGZ2Y53UL3A%2F20200302%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200302T021231Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=07f88ef3f10d237291a4ae68c4935291b9f6de8be7cbb1dae1a90452aad42c24

- This project used training data of daily sales for 1115 Rossmann stores dating back to 2013, with 1,017,209 entries in total, including promotion and competitor information. Since they had no access to the real sales amounts for testing during the Kaggle competition, they used 70% of the contest training data as the training set for their model and the remaining 30% as a test set for cross validation. They established an autoregression (AR) model, tested it using order numbers, and calculated the test errors. Random forest (RF) and support vector regression (SVR) were used to help identify the features/factors with the most influence on sales. They made good predictions with these models.
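A 70/30 split of the kind described above can be sketched as a single shuffled holdout split. This is a generic illustration, not the code used in that project: the function name `holdout_split`, the seed, and the sample rows are hypothetical.

```python
import random


def holdout_split(rows, train_fraction=0.7, seed=42):
    """Shuffle rows and split them into (train, test) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten hypothetical records, represented here by their row numbers.
rows = list(range(10))
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))
```

Every record lands in exactly one of the two sets, so the 30% held back gives an error estimate on data the model has not seen during training.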

OBSERVATIONS:

1) The data used in these projects pertains to sales, including the features/factors affecting it. I am working on a similar dataset. The goal of their projects is like mine, predicting sales; only the domain is a little different

2) They have used a variety of models for testing the dependency of features and the accuracy of the model. I am planning to use Random Forest and/or Extra Trees for sales prediction

3) In my EDA I have analyzed the types and sizes of stores, the frequency of sales for each department, and the effect of holidays on sales

4) Each of their projects used a method to verify the accuracy of its models

5) I will proceed to merge the data and train on it to understand the features affecting sales. I must predict sales for 45 stores based on the departments.

Delivery 3: EDA and Implementation of ML Algorithms

Delivery 3 - EDA and ML Algorithm Implementation .pptx

Outcome of ML Models used

Comparison of Validation Metrics and Accuracy

VALIDATION METRICS USED

RMSE: Root Mean Square Error
MAE: Mean Absolute Error
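As a minimal sketch of how these two metrics are computed, the functions below follow the standard definitions: MAE averages the absolute errors, and RMSE takes the square root of the average squared error. The sample values are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared error."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical actual and predicted sales values.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 7.0]

mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)
print(mae_value, rmse_value)
```

Because RMSE squares each error before averaging, a few large misses raise it more than they raise MAE, which is why the two are often reported together.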

MODELS USED

K-Nearest Neighbor:

Robust
Implementation is simple
Useful in regression and classification problems
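To illustrate why the implementation is simple, here is a minimal sketch of K-Nearest Neighbor used for regression: the prediction is the average target of the k closest training points. It is not the model fitted in this project. The function name `knn_predict`, the single-feature layout, and the sample values are assumptions made for illustration.

```python
import math


def knn_predict(train_features, train_targets, query, k=3):
    """Predict a value as the mean target of the k nearest training points."""
    if not 1 <= k <= len(train_features):
        raise ValueError("k must be between 1 and the number of training rows")
    # Pair each Euclidean distance with its target and sort nearest first.
    distances = sorted(
        (math.dist(features, query), target)
        for features, target in zip(train_features, train_targets)
    )
    nearest = distances[:k]
    return sum(target for _, target in nearest) / k


# Hypothetical training data: one feature per row and a sales-like target.
train_features = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_targets = [10.0, 20.0, 30.0, 100.0]

prediction = knn_predict(train_features, train_targets, query=(2.5,), k=2)
print(prediction)
```

In practice the features would be scaled first, since a feature with a large numeric range would otherwise dominate the distance calculation.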

Random Forest:

Uses ensemble method
Handles missing values
Higher accuracy

Extra Trees:

Equally robust as Random Forest
Quicker than Random Forest
