The pandemic of COVID-19 has changed the world significantly, especially in people' shopping preference. People in quarantine or lockdown face a problem to obtain fruits at home. I am interested in helping suppliers to build an e-commence shop and deliver fruits to customers in timely manner.
Fruit4U is a factionary fresh produce company used to sell products in local markets. The pandemic hampered its sales of products dramatically. Oliver was recently promoted as a digital marketing manager in charge of embarking an e-commerce shop. Since fruits are the most profitable products in his company and he believe the future growth of company depends on online sales of fruits.
Oliver is leading a team to digitalize the product information from their suppliers. They have defined a data structure in Excel and are looking for a solution to combine different data format from suppliers to fit in the structure. One of their major suppliers uses JSON file format. They want to transform and integrate product information in this file into their pre-defined data structure in Excel.
To achieve the goal, data preparation and normalization will be performed on suppliers’ data to match the pre-defined data structure, then the normalized data will be integrated into target data without changing any structure or value of the target file.
In this case study, nearly 7000 data entries are processed via R and Excel. The whole process is composed of three phases: data preparation, normalization, and integration. The script is documented here.
Firstly, I examine the files of source data and target data. The supplier's data is compiled in Newline Delimited JSON format (ndjson) and our target data is Excel format. Therefore, transforming data format is necessary. Here I read the supplier's data as a data frame into R.
The suppliers data has repetitive entries for every item because it uses different entity id to mark different attributes, for example, the yield and condition of the same product are listed in two entries. To consolidate data, I remove entity id since the different number of entity id of the same product prevents it from merging and it is also a unnecessary information. With that, now every product has an unique entry and the long data is also transformed into wide data, which is easier to read and process. Therefore, I can examine and compare the values between supplier's data and target data to see if normalization is needed before data integration.
After examination, it shows that the definition of conditions is different between the supplier and Fruit4U. In Fruit4U, there are only three tiers of fruit condition: Fresh, Good, and Fine. However, the suppliers data has six tiers. According to the standard of Fruit4U, all six tiers can be categorized into their three tiers. Here, a Excel file is generated to categorized suppliers fruit condition into three tiers. Using a Excel file is much more agile as it can be easily edited to match the definition of condition from different suppliers in the future. This Excel file is then read into R to define conditions. Then all condition in the supplier's file is replaced according to the definition from Fruit4U.
Afterwards, the number of columns between source data and target data is examined and found there are four attributes missing in the source data. In order to merge two tables vertically, the same number of columns is necessary. Therefore, those missing columns are added into the source data with null value (ideally, we should ask the supplier to provide these values but that's not in the scope of this case study). Now, the source data is normalized and ready for integration.
With above process, the source data is ready to be integrated into the target data. Oliver's team only want to keep the same attributes from the supplier's data while get rid of the rest. Here, a vector is generated to record the shared attributes between two tables. Then it is used to instruct a function to integrate the two table into one data frame. Now, the supplier's data is totally integrated into the pre-defined data structure from Fruit4U and exported as a Excel file. They can use it to build an E-commerce shop to sell fruits from their supplier.
All input data are successfully integrated.
The values from the source file are normalized to match the criteria of target data.
The structure and values of target data remain intact.
A template of scrips is built to process incoming suppliers' data reproducibly. In the future, Fruit4U can use this template to integrate data from different suppliers into the pre-define data structure.