The following steps serve as a guideline for building a box plot and relating it to numerical summary statistics for a selection of data, applying the knowledge from the previous section:
Load the notebook with the commands developed in step 2.4 (click on the link):
https://colab.research.google.com/drive/18AiwXUioJrA3vBclMSlhOEFCO5s_IANb?usp=sharing
Let's reuse the variable df1_descendent_order to create a new variable top_ten_data that stores only the top ten values of the column 'January 2022' (the first ten rows, at positions 0 to 9, of the column at position 2: df1_descendent_order.iloc[0:10, 2]) as real numbers using the command astype(float). Then print the summary statistics with the command describe(), rounded to two decimals with the command round(2). Finally, create a box plot of the data using the command plot.box(). All commands are summarized as:
import matplotlib.pyplot as plt
# Select the first ten rows (positions 0 to 9) of the column at position 2 as floats
top_ten_data = df1_descendent_order.iloc[0:10, 2].astype(float)
# Print the summary statistics
print(top_ten_data.describe().round(2))
# Related box-plot graphic with extreme values
top_ten_data.plot.box()
plt.grid()
plt.show()
This command produces the following figure:
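To relate each element of a box plot to a number, the sketch below computes the quartiles, the whisker ends, and the outliers by hand. It is only an illustration: the values in sample_values are hypothetical (they are not the notebook's data), and box_plot_numbers is a helper name introduced here. It assumes the default matplotlib convention, in which the whiskers reach the most extreme data points within 1.5 times the interquartile range of the box.

```python
import pandas as pd

# Hypothetical values for illustration only (NOT the notebook's data)
sample_values = pd.Series([120.0, 95.0, 40.0, 38.0, 35.0, 33.0, 31.0, 30.0, 29.0, 28.0])

def box_plot_numbers(series):
    """Return the numbers a default box plot draws for a pandas Series."""
    q1 = float(series.quantile(0.25))
    median = float(series.quantile(0.50))
    q3 = float(series.quantile(0.75))
    iqr = q3 - q1  # interquartile range: the height of the box
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Values inside the fences; the whiskers end at their minimum and maximum
    inside = series[(series >= lower_fence) & (series <= upper_fence)]
    # Values outside the fences are the points drawn as circles
    outliers = series[(series < lower_fence) | (series > upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": float(inside.min()),
        "whisker_high": float(inside.max()),
        "outliers": sorted(outliers.tolist()),
    }

numbers = box_plot_numbers(sample_values)
for name, value in numbers.items():
    print(name, value)
```

With these hypothetical values, the two largest numbers fall above the upper fence, so a box plot of sample_values would draw them as two separate circles, while describe() would still report the largest one as the maximum.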
The previous figure shows two outliers, represented by circles. To make it easier to read what percentage of the data lies below a certain value, the outliers can be hidden from the box plot. This can be done with the parameter showfliers=False in the command top_ten_data.plot.box(showfliers=False):
import matplotlib.pyplot as plt
# Print the summary statistics
print(top_ten_data.describe().round(2))
#Related box-plot graphic without extreme values
top_ten_data.plot.box(showfliers=False)
plt.grid()
plt.show()
This command produces the following figure:
Now the figure disagrees with the summary statistics printed above it, because showfliers=False only hides the outliers in the plot while describe() still includes them. To solve this, it is necessary to exclude the outliers before applying the command describe(). A value is kept (that is, it is not considered an outlier) if it satisfies (top_ten_data >= Q1 - 1.5 * IQR) & (top_ten_data <= Q3 + 1.5 * IQR), where Q1 = top_ten_data.quantile(0.25), Q3 = top_ten_data.quantile(0.75), and IQR = Q3 - Q1 is the interquartile range. Any value outside this interval is an outlier. All this knowledge is employed in the following commands:
Q1 = top_ten_data.quantile(0.25)
Q3 = top_ten_data.quantile(0.75)
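IQR = Q3 - Q1  # IQR is the interquartile range
# True for the values that are not outliers (avoids shadowing the built-in filter)
no_outliers = (top_ten_data >= Q1 - 1.5 * IQR) & (top_ten_data <= Q3 + 1.5 * IQR)
print(top_ten_data.loc[no_outliers].describe().round(2))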
This command produces the following statistics:
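The exact numbers depend on the data loaded in the notebook. As a self-contained illustration of the same rule, the following sketch wraps the filter in a small function and compares the summary statistics with and without outliers. The values in sample_values are hypothetical, and the function name describe_without_outliers and its parameter k are assumptions introduced here, not part of the notebook.

```python
import pandas as pd

# Hypothetical values for illustration only (NOT the notebook's data)
sample_values = pd.Series([120.0, 95.0, 40.0, 38.0, 35.0, 33.0, 31.0, 30.0, 29.0, 28.0])

def describe_without_outliers(series, k=1.5):
    """Summary statistics of the values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1  # interquartile range
    keep = (series >= q1 - k * iqr) & (series <= q3 + k * iqr)
    return series.loc[keep].describe().round(2)

with_outliers = sample_values.describe().round(2)
without_outliers = describe_without_outliers(sample_values)

# Side-by-side comparison of both summaries
comparison = pd.DataFrame({"all values": with_outliers,
                           "outliers removed": without_outliers})
print(comparison)
```

In this hypothetical sample, removing the outliers lowers the count, the mean, and the maximum. The quartiles are also recomputed on the smaller sample, so they may differ slightly from the box drawn from the full data.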
The Python code with all the steps is available in this Google Colab notebook (click on the link):
https://colab.research.google.com/drive/1_FyAxfvDsI-AMcMwhStic7BcaGl7-7hF?usp=sharing