The following steps are a guideline for building a box-plot graphic and relating it to the numerical summary statistics obtained in the previous section:
Load the notebook with the commands developed in step 2.3 (click on the link):
https://colab.research.google.com/drive/18AiwXUioJrA3vBclMSlhOEFCO5s_IANb?usp=sharing
Create a variable data that stores all values from the column 'January 2022' as real numbers, using the command astype(float). Then print the summary statistics with the command describe(), rounded to two decimals with the command round(2). Finally, create a box-plot of the data with the command plot.box(). All commands are summarized as:
import matplotlib.pyplot as plt
data = df1['January 2022'].astype(float)
# Print the summary statistics
print(data.describe().round(2))
# Box-plot of the data, including extreme values
data.plot.box()
plt.grid()
plt.show()
These commands produce the following figure:
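To relate the elements of a box-plot to the output of describe(), the sketch below computes by hand the numbers that the plot draws. It uses a small hypothetical series named sample (not the 'January 2022' column from the notebook) and assumes the default whisker rule of 1.5 times the interquartile range.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration.
sample = pd.Series([10, 12, 13, 15, 16, 18, 20, 22, 95], dtype=float)

summary = sample.describe()
q1, median, q3 = summary["25%"], summary["50%"], summary["75%"]
iqr = q3 - q1  # height of the box

# Whiskers end at the most extreme observations inside the 1.5 * IQR fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = sample[(sample >= lower_fence) & (sample <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Observations outside the fences are the circles drawn on the box-plot.
fliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(f"box: {q1} to {q3}, median line: {median}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"circles (outliers): {fliers.tolist()}")
print(f"describe() min/max: {summary['min']} / {summary['max']}")
```

In this sketch the box and the median line match the 25%, 50% and 75% rows of describe(), while the max row corresponds to a circle rather than to the end of the upper whisker.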
The box-plot shows several circles, which represent outliers: here extremely high values, although outliers can also be extremely low values. To read more easily what percentage of the data lies below a certain value, the outliers can be hidden from the box-plot graph by passing the parameter showfliers=False to the command, as in data.plot.box(showfliers=False):
import matplotlib.pyplot as plt
# Print the summary statistics
print(data.describe().round(2))
# Box-plot of the data, with extreme values hidden
data.plot.box(showfliers=False)
plt.grid()
plt.show()
These commands produce the following figure:
Now the figure disagrees with the summary statistics printed above, because describe() was still computed on all the data, including the outliers. To solve this, exclude the outliers before applying the command describe(). A value is kept (that is, it is not considered an outlier) if it satisfies (data >= Q1 - 1.5 * IQR) & (data <= Q3 + 1.5 * IQR), where Q1 = data.quantile(0.25), Q3 = data.quantile(0.75), and IQR = Q3 - Q1 is the interquartile range. Any value outside this interval is considered an outlier. These definitions are used in the following commands:
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1  # IQR is the interquartile range
# True for the values that are not outliers
mask = (data >= Q1 - 1.5 * IQR) & (data <= Q3 + 1.5 * IQR)
print(data.loc[mask].describe().round(2))
These commands produce the following statistics:
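As a minimal sketch, the outlier rule can also be wrapped in a reusable function. The function name drop_iqr_outliers, its parameter k, and the series sample are hypothetical names introduced here for illustration; they do not come from the notebook.

```python
import pandas as pd

def drop_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return only the values inside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    keep = (values >= q1 - k * iqr) & (values <= q3 + k * iqr)
    return values.loc[keep]

# Hypothetical sample values, used only for illustration.
sample = pd.Series([10, 12, 13, 15, 16, 18, 20, 22, 95], dtype=float)
trimmed = drop_iqr_outliers(sample)

# Summary statistics with and without the outlier.
print(sample.describe().round(2))
print(trimmed.describe().round(2))
```

In this sketch the min and max of the trimmed data are the values where the whiskers of the box-plot without outliers end. The quartiles are recomputed on the trimmed data, so they can shift slightly: here the 25% value moves from 13 to 12.75.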
The Python code with all the steps is collected in this Google Colab notebook (click on the link):
https://colab.research.google.com/drive/1pNgoGenlGiQ2nWmJfUPwg9YHunpWF2g1?usp=sharing