2.5. Box-plot graph for a selection

Building a box plot and comparing it with statistics from selected UK import data

The following steps serve as a guideline for building a box plot and relating it to the numerical summary statistics of a selection of the data, adapting what was learned in the previous section:

Load the notebook with the commands developed in step 2.4 (click on the link):

https://colab.research.google.com/drive/18AiwXUioJrA3vBclMSlhOEFCO5s_IANb?usp=sharing

Let's reuse the variable df1_descendent_order to create a new variable, top_ten_data, that stores only the top ten values of the column 'January 2022'. The first ten rows of that column are selected with df1_descendent_order.iloc[list(range(0,10)),2] (range(0,10) yields the row positions 0 to 9, whereas range(0,11) would select eleven rows) and converted to real numbers with astype(float). Then print the summary statistics with describe(), rounded to two decimals with round(2). Finally, create a box plot of the data with plot.box(). All commands are summarized as:

import matplotlib.pyplot as plt

# Select the first ten rows (positions 0 to 9) of the 'January 2022' column as floats

top_ten_data = df1_descendent_order.iloc[list(range(0,10)),2].astype(float)

# Print the summary statistics

print(top_ten_data.describe().round(2))

# Related box plot, including extreme values

top_ten_data.plot.box()

plt.grid()

plt.show()

These commands produce the following figure:

The previous figure shows two outliers, drawn as circles. To make it easier to read what percentage of the data lies below a given value, the outliers can be left out of the box plot by passing the parameter showfliers=False, as in top_ten_data.plot.box(showfliers=False):

import matplotlib.pyplot as plt

# Print the summary statistics

print(top_ten_data.describe().round(2))

# Related box plot, without extreme values

top_ten_data.plot.box(showfliers=False)

plt.grid()

plt.show()

These commands produce the following figure:
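To check which numbers a box plot actually draws, they can be read directly and placed next to the output of describe(). The sketch below is only an illustration under stated assumptions: the values in sample are made up (they are not the UK import figures), and it uses matplotlib's helper cbook.boxplot_stats, which returns the quartiles, whisker ends and fliers that matplotlib computes for a box plot.

```python
import pandas as pd
from matplotlib import cbook

# Hypothetical values, NOT the UK import data: eight moderate values plus two extremes.
sample = pd.Series([10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 20.0, 22.0, 90.0, 120.0])

# describe() summarizes every value, outliers included.
summary = sample.describe()

# boxplot_stats returns the numbers used to draw the box, the whiskers and the fliers.
stats = cbook.boxplot_stats(sample.to_numpy(), whis=1.5)[0]

print("max from describe():", summary["max"])
print("upper whisker drawn:", stats["whishi"])
print("points drawn as fliers:", stats["fliers"].tolist())
```

With these made-up values the upper whisker stops at 22, while describe() still reports a maximum of 120, because the two largest values are drawn (or hidden) as fliers instead of being covered by the whisker.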

Now the figure disagrees with the summary statistics printed above it, because describe() still includes the outliers. To solve this, the outliers must be excluded before applying the command describe(). A value is kept (that is, it is not considered an outlier) if (top_ten_data >= Q1 - 1.5 * IQR) & (top_ten_data <= Q3 + 1.5 * IQR), where Q1 = top_ten_data.quantile(0.25), Q3 = top_ten_data.quantile(0.75), and IQR = Q3 - Q1 is the interquartile range; any value outside this interval is an outlier. All this knowledge is employed in the following commands:

Q1 = top_ten_data.quantile(0.25)

Q3 = top_ten_data.quantile(0.75)

IQR = Q3 - Q1  # IQR is the interquartile range

# Boolean mask: True for values inside the fences (not outliers)

within_fences = (top_ten_data >= Q1 - 1.5 * IQR) & (top_ten_data <= Q3 + 1.5 * IQR)

top_ten_data.loc[within_fences].describe().round(2)

These commands produce the following statistics:
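The same filtering can be tried without the UK dataset. The following is a minimal, self-contained sketch, assuming made-up numbers in sample and a hypothetical helper named drop_iqr_outliers (neither comes from the original notebook); it wraps the Q1, Q3 and IQR steps in one function and then summarizes the values that remain.

```python
import pandas as pd


def drop_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Keep only the values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1  # interquartile range
    inside = (values >= q1 - k * iqr) & (values <= q3 + k * iqr)
    return values.loc[inside]


# Hypothetical values, NOT the UK import data.
sample = pd.Series([10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 20.0, 22.0, 90.0, 120.0])

kept = drop_iqr_outliers(sample)

# Summary statistics of the values that are not outliers.
print(kept.describe().round(2))
```

After filtering, the minimum and maximum match the whisker ends of the box plot drawn with showfliers=False. The quartiles, however, are recomputed on the remaining values, so they can differ slightly from the box edges of a plot drawn from the unfiltered data.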

The Python code with all the steps is summarized in this Google Colab (click on the link):

https://colab.research.google.com/drive/1_FyAxfvDsI-AMcMwhStic7BcaGl7-7hF?usp=sharing
