Why? Our motivation is simple: we want to better understand the data, for example its central tendency, variation and spread.
Descriptive data summarization is also needed for good-quality data preprocessing, which is the next step of the project.
Data dispersion characteristics
By running a summary of the dataset in R (with RStudio), we get general information about our data, in particular its dispersion characteristics such as the median, maximum, minimum and quantiles.
This leads to the following results:
Numerical dimensions
- Our dataset has 69,993 items.
- These items are distributed across 3,333 rows and 21 columns (3,333 × 21 = 69,993).
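
As an illustration, here is a minimal sketch of how figures like the ones above can be obtained in R. It assumes the data is held in a data frame; the data frame `df` below is a hypothetical stand-in filled with random values (only its shape matches the numbers above), and the variable names are ours, not taken from the project.

```r
# Hypothetical stand-in for the project's dataset: same shape (3333 x 21),
# filled with random numbers purely for illustration.
set.seed(1)
df <- as.data.frame(matrix(rnorm(3333 * 21), nrow = 3333, ncol = 21))

# Numerical dimensions: rows, columns, and total number of cells ("items").
n_rows  <- nrow(df)
n_cols  <- ncol(df)
n_items <- n_rows * n_cols
cat("rows:", n_rows, " columns:", n_cols, " items:", n_items, "\n")

# summary() reports the minimum, quartiles, median, mean and maximum
# of each column. Only the first three columns are shown to keep it short.
print(summary(df[, 1:3]))

# The same dispersion measures for a single column, computed explicitly.
x <- df[[1]]
print(quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)))
cat("IQR:", IQR(x), " sd:", sd(x), "\n")
```

With a real dataset, `df` would typically come from a loader such as `read.csv()` instead of being generated, and the rest of the calls would stay the same.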