Descriptive Statistics
✍️Statistics is essential in any analysis of data, whether the dataset is small or stored in large databases. It lets us understand the overall picture of the data without having to examine every single piece of information in the available dataset. This post explains descriptive statistics and shows how to use it as a tool to investigate our data.
✍️In descriptive statistics, we describe our data with the help of representative summaries such as charts, graphs, and tables, and present it in a meaningful way so that it can be easily understood. Most of the time it is performed on small data sets, and the patterns it reveals can inform expectations about future trends. Some measures that are used to describe a data set are measures of central tendency and measures of variability or dispersion.
🌝Types of Descriptive Statistics:
☘️Measures of Central Tendency
☘️Measures of Variability
☘️Measures of Frequency Distribution
🌝Measures of Central Tendency:
✍️A measure of central tendency represents the whole set of data by a single value. It gives us the location of the central point of the data. There are three main measures of central tendency:
☘️Mean
☘️Mode
☘️Median
🌝Mean:
✍️It is the sum of the observations divided by the total number of observations. It is also called the average.
🌝Median:
✍️It is the middle value of the data set once the values are sorted. It splits the data into two halves. If the number of elements in the data set is odd, then the center element is the median, and if it is even, then the median is the average of the two central elements.
🌝Mode:
✍️It is the value that has the highest frequency in the given data set. The data set may have no mode if the frequency of all data points is the same. Also, we can have more than one mode if two or more data points share the same highest frequency.
✍️Types of Mode: a data set with one mode is called unimodal, with two modes bimodal, and with more than two modes multimodal.
🌝Calculation of Mean, Median and Mode using Python3:
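✍️A minimal sketch of this calculation, using Python's built-in statistics module. The data list is hypothetical sample data chosen only for illustration, and statistics.multimode assumes Python 3.8 or newer.

```python
import statistics

# Hypothetical sample data, chosen only for illustration
data = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(data)      # sum of observations / number of observations
median_value = statistics.median(data)  # average of the two central values (even count)
modes = statistics.multimode(data)      # list of every value with the highest frequency

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode(s):", modes)
```

✍️multimode is used here instead of mode because it returns every value that shares the highest frequency, which matches the case of more than one mode described above.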
🌝Measure of Variability:
✍️Measures of variability are also termed measures of dispersion, as they help us gain insights about the dispersion or spread of the observations at hand. Some of the measures used to quantify the dispersion in the observations are as follows:
☘️Range
☘️Variance
☘️Standard deviation
🌝Range:
✍️The range describes the difference between the largest and smallest data point in our data set. The bigger the range, the more the spread of data and vice versa.
Range = Largest data value – smallest data value
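✍️As a small illustration, the range formula can be sketched in Python as follows. The data list is hypothetical.

```python
# Hypothetical sample data, chosen only for illustration
data = [12, 7, 3, 18, 9]

# Range = largest data value - smallest data value
data_range = max(data) - min(data)

print("Range:", data_range)  # 18 - 3 = 15
```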
🌝Variance
✍️It is defined as the average squared deviation from the mean. It is calculated by finding the difference between every data point and the mean, squaring each difference, adding all of them, and then dividing by the number of data points present in our data set.
✍️Variance Calculation using Python3:
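✍️A minimal sketch that follows the steps described above and then cross-checks the result with statistics.pvariance from the standard library. The data list is hypothetical, and the calculation divides by the number of data points (population variance).

```python
import statistics

# Hypothetical sample data, chosen only for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: the mean of the data
mean_value = sum(data) / len(data)

# Step 2: squared difference between every data point and the mean
squared_deviations = [(x - mean_value) ** 2 for x in data]

# Step 3: average of the squared deviations (divide by the number of points)
variance = sum(squared_deviations) / len(data)

print("Variance (manual):", variance)
print("Variance (statistics.pvariance):", statistics.pvariance(data))
```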
🌝Standard Deviation
✍️It is defined as the square root of the variance. It is calculated by finding the mean, subtracting the mean from each number, squaring each result, adding all the squared values, dividing by the number of terms, and finally taking the square root.
✍️Formula: Standard Deviation = Positive square root of the Variance of the dataset.
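✍️The relationship between variance and standard deviation can be sketched as follows. The data list is hypothetical, and the variance here divides by the number of terms, as in the description above.

```python
import math

# Hypothetical sample data, chosen only for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = sum(data) / len(data)

# Variance: average of the squared deviations from the mean
variance = sum((x - mean_value) ** 2 for x in data) / len(data)

# Standard deviation: positive square root of the variance
standard_deviation = math.sqrt(variance)

print("Variance:", variance)
print("Standard deviation:", standard_deviation)
```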
🌝Measures of Frequency Distribution:
✍️Measures of frequency distribution help us gain valuable insights into the distribution and the characteristics of the dataset. Measures like,
☘️Count
☘️Frequency
☘️Relative Frequency
☘️Cumulative Frequency
are used to analyze the dataset on the basis of measures of frequency distribution.
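✍️One possible way to compute these four measures with the Python standard library is sketched below. The data list is hypothetical, and the variable names are only illustrative.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sample data, chosen only for illustration
data = [1, 2, 2, 3, 3, 3, 4]

count = len(data)          # total number of observations
frequency = Counter(data)  # how many times each value occurs
values = sorted(frequency) # distinct values in ascending order

# Relative frequency: share of all observations taken by each value
relative_frequency = {v: frequency[v] / count for v in values}

# Cumulative frequency: running total of the frequencies in value order
cumulative_frequency = dict(zip(values, accumulate(frequency[v] for v in values)))

print("Count:", count)
print("Frequency:", dict(frequency))
print("Relative frequency:", relative_frequency)
print("Cumulative frequency:", cumulative_frequency)
```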
🌝Univariate v/s Bivariate:
✍️While performing data analysis, if we consider only one variable to gain insights about it, then it is known as univariate data analysis. But if we are trying to gain insights into one variable with respect to some other variable, for example by calculating the correlation or covariance between two variables, then this is known as bivariate data analysis.
✍️Bivariate analysis helps us establish a relationship between two variables, which helps us make better decisions: based on that relationship, we may be able to change one variable to cause a positive or negative effect on the other. We can also use multivariate data analysis to analyze and derive relationships between more than two variables, but these methodologies are often considered to fall under the domain of machine learning.
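✍️As an illustration of bivariate analysis, the sketch below computes the covariance and the Pearson correlation between two hypothetical variables x and y by hand. It uses the population form (dividing by n); other conventions divide by n - 1.

```python
import math

# Two hypothetical variables, chosen only for illustration (y is exactly 2 * x)
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Covariance: average product of the paired deviations from the means
covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

# Standard deviation of each variable (population form)
std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)

# Pearson correlation: covariance scaled by both standard deviations
correlation = covariance / (std_x * std_y)

print("Covariance:", covariance)
print("Correlation:", correlation)  # approximately 1.0 for a perfect linear relationship
```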
🌝Descriptive Statistics v/s Inferential Statistics
✍️Generally, statistics is divided into two types, chosen according to what the analyst needs. If the goal is only to extract meaningful insights from the data at hand, the methods used belong to Descriptive Statistics. If we would like to use the observations to predict future data, for example in a time-series dataset, then our objective also involves the process of inference, and the methods used belong to Inferential Statistics.
✍️We can summarize this as when we would like to make some predictions/inferences based on some dataset at hand then those statistical methods are known as inferential statistics.
🌝Measures of Central Tendency:
✍️Usually, frequency distributions and graphical representations are used to depict a set of raw data and draw meaningful conclusions from it. However, these methods sometimes fail to convey a proper and clear picture of the data. Therefore, some measures, known as Measures of Central Tendency or Averages, are used as a single measurement to determine the main characteristics of the given series. Hence, a Measure of Central Tendency is a single value used to represent a complete set of data. Another name for Measure of Central Tendency is Measure of Location. An average is a typical value of the given data set, around which most of the observations cluster. The three principal measures that are used in Statistical Analysis are Arithmetic Mean, Median, and Mode.
✍️The different measures of central tendency can be classified into three categories; viz., Mathematical Averages (Arithmetic Mean(AM), Geometric Mean(GM), and Harmonic Mean(HM)), Positional Averages (Median(M or Me) and Mode(Z)), and Commercial Averages (Moving Average, Progressive Average, and Composite Average).
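✍️The three mathematical averages named above can be sketched with Python's statistics module. The data list is hypothetical, and statistics.geometric_mean assumes Python 3.8 or newer.

```python
import statistics

# Hypothetical positive data, chosen only for illustration
data = [2, 4, 8]

arithmetic_mean = statistics.mean(data)           # sum of values / number of values
geometric_mean = statistics.geometric_mean(data)  # n-th root of the product of n values
harmonic_mean = statistics.harmonic_mean(data)    # n / sum of the reciprocals

print("Arithmetic Mean (AM):", arithmetic_mean)
print("Geometric Mean (GM):", geometric_mean)
print("Harmonic Mean (HM):", harmonic_mean)
```

✍️For positive data the three always satisfy HM ≤ GM ≤ AM, which is a quick sanity check on the results.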
💥An average is an attempt to find one single figure to describe all the figures.
– Clark and Schkade
💥Average is a value which is typical or representative of a set of data.
– Spiegel
💥An average is a single value within the range of the data that is used to represent all the values in the series. Since an average is somewhere within the range of data, it is sometimes called a Measure of Central Value.
– Croxton and Cowden
🌚Objective and Functions of Averages:
🌝Presentation of Huge Data in Summarised Form
✍️It is not easy for an individual to grasp a large number of figures. Averages or Measures of Central Tendency summarise the given data set into a single numerical figure which is easy to understand and remember. For example, it is easy to remember the average marks of students in different sections of a class, than to remember the marks of each student in every section of a class.
🌝Making Comparison Easier
✍️As different averages reduce the mass of statistical data into a single figure, they help in making comparative studies either at a single point of time or over a period of time. For example, the average sales figures of an organisation for the current month can be easily compared with the previous month’s sales figures or with other firms’ sales figures with the help of different measures of central tendency.
🌝Help in Decision-Making
✍️The values provided by Averages act as a guideline for decision-makers. Most of the decisions taken by these decision-makers for their research or planning are based on the average value of the variables of the given series or data set. For example, if the average monthly cost of production of an organisation is rising, then the production manager will have to take steps to control the cost of production.
🌝Know about Universe from a Sample
✍️Averages computed from sample data can also help an investigator obtain an idea of the complete universe. It means that to understand the average of the population, one can use the average of a sample to obtain an easy-to-understand and clear picture.
🌝Trace Precise Relationship
✍️When an individual wants to establish a relationship between different groups of data in quantitative terms, Averages become important. For example, it is imprecise to state that the average salary of the employees of ABC Ltd. is more than the average salary of the employees of XYZ Ltd. only on the basis of observation. However, if this statement is made when their respective salaries are expressed in terms of averages, then it is said to be precise.
🌝Base for Computing Other Measures
✍️By taking Averages as a base, one can compute other different measures like skewness, kurtosis, dispersion, etc., for other phases of statistical analysis.
🌚Some essentials of a good average are as follows:
☘️Rigidly Defined
☘️Easy to Understand and Calculate
☘️It should be least affected by Fluctuations of Sampling
☘️Not Affected much by Extreme Values
☘️Based on all the Observations
☘️Capable of further Algebraic Treatment
🌝Measures of Dispersion | Types, Formula and Examples:
✍️Measures of Dispersion are used to represent the scattering of data. They are numbers that describe how widely the values of a data set are spread.
Let’s learn about the measures of dispersion in statistics, their types, formulas, and examples in detail.
🌚Dispersion in Statistics:
✍️Dispersion in statistics is a way to describe how spread out or scattered the data is around an average value. It helps to understand if the data points are close together or far apart.
✍️Dispersion shows the variability or consistency in a set of data. There are different measures of dispersion like range, variance, and standard deviation.
🌚Measure of Dispersion in Statistics:
✍️Measures of Dispersion measure the scattering of the data. They tell us how the values are distributed in the data set. In statistics, we define a measure of dispersion as a quantity that describes how much the values of the data vary.
These measures of dispersion capture the variation between different values of the data.
🌚Types of Measures of Dispersion:
✍️Measures of dispersion can be classified into the following two types :
☘️Absolute Measure of Dispersion
☘️Relative Measure of Dispersion
🌚These measures of dispersion can be further divided into various categories. Absolute measures are expressed in the same unit as the data, whereas relative measures are unit-free ratios.
Let’s learn about them in detail.
🌝Absolute Measure of Dispersion:
✍️The measures of dispersion that are measured and expressed in the units of data themselves are called Absolute Measure of Dispersion. For example – Meters, Dollars, Kg, etc.
🌝Some absolute measures of dispersion are:
☘️Range: It is defined as the difference between the largest and the smallest value in the distribution.
☘️Mean Deviation: It is the arithmetic mean of the absolute differences between the values and their mean.
☘️Standard Deviation: It is the square root of the arithmetic average of the square of the deviations measured from the mean.
☘️Variance: It is defined as the average of the squared deviations from the mean of the given data set.
☘️Quartile Deviation: It is defined as half of the difference between the third quartile and the first quartile in a given data set.
☘️Interquartile Range: The difference between the upper (Q3) and lower (Q1) quartile is called the Interquartile Range. Its formula is given as Q3 – Q1.
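✍️Of the measures listed above, Mean Deviation is the one where the absolute value matters most, because the signed deviations from the mean always sum to zero. A minimal sketch with a hypothetical data list:

```python
# Hypothetical sample data, chosen only for illustration
data = [2, 4, 6, 8, 10]

mean_value = sum(data) / len(data)

# Signed deviations from the mean cancel each other out
signed_sum = sum(x - mean_value for x in data)

# Mean deviation: arithmetic mean of the absolute deviations from the mean
mean_deviation = sum(abs(x - mean_value) for x in data) / len(data)

print("Sum of signed deviations:", signed_sum)
print("Mean deviation:", mean_deviation)
```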
🌚Quartile Deviation and Coefficient of Quartile Deviation: Meaning, Formula, Calculation, and Examples:
✍️The extent to which the values of a distribution differ from the average of that distribution is known as Dispersion. The measures of dispersion can be either absolute or relative. The Measures of Absolute Dispersion consist of Range, Quartile Deviation, Mean Deviation, Standard Deviation, and Lorenz Curve.
🌚What is Quartile Deviation?
✍️Quartile Deviation or Semi-Interquartile Range is half of the difference between the Upper Quartile (Q3) and the Lower Quartile (Q1). In simple terms, QD is half of the inter-quartile range. Hence, the formula for determining Quartile Deviation is as follows:
Quartile Deviation (QD) = (Q3 – Q1) / 2
🌚What is Coefficient of Quartile Deviation?
✍️As Quartile Deviation is an absolute measure of dispersion, one cannot use it for comparing the variability of two or more distributions when they are expressed in different units. Therefore, in order to compare the variability of two or more series with different units it is essential to determine the relative measure of Quartile Deviation, which is also known as the Coefficient of Quartile Deviation. It is studied to make the comparison between the degree of variation in different series.
The formula for determining Coefficient of Quartile Deviation is as follows:
Coefficient of Quartile Deviation = (Q3 – Q1) / (Q3 + Q1)
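✍️A minimal sketch of the interquartile range, quartile deviation, and coefficient of quartile deviation in Python. The data list is hypothetical and is not the data of the problems that follow. statistics.quantiles (Python 3.8 or newer) is used with its default "exclusive" method, which is an assumption about the quartile convention; other conventions can give slightly different quartiles.

```python
import statistics

# Hypothetical sample data, chosen only for illustration
data = [10, 20, 30, 40, 50, 60, 70]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(data, n=4)

interquartile_range = q3 - q1                    # Q3 - Q1
quartile_deviation = (q3 - q1) / 2               # half of the interquartile range
coefficient_of_qd = (q3 - q1) / (q3 + q1)        # unit-free, relative measure

print("Q1:", q1, "Q3:", q3)
print("Interquartile range:", interquartile_range)
print("Quartile deviation:", quartile_deviation)
print("Coefficient of quartile deviation:", coefficient_of_qd)
```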
Problem-1: From the following table giving marks of students, calculate the interquartile range, quartile deviation, and coefficient of quartile deviation.
Problem-2: Calculate interquartile range, quartile deviation, and coefficient of quartile deviation from the following figures:
Standard Deviation
🌺What is Standard Deviation?
🌺Standard Deviation Formula
🌺How to Calculate Standard Deviation?
🌺What is Variance
🌺Variance Formula
🌺How to Calculate Variance?
🌺Standard Deviation of Ungrouped Data
🌺Standard Deviation of Discrete Grouped Data
🌺Standard Deviation of Continuous Grouped Data
🌺Standard Deviation of Probability Distribution
🌺Standard Deviation of Random Variables
🌺Standard Deviation Formula Example
🌺What is Standard Deviation?
✍️Standard Deviation measures the degree of dispersion of the data points around their mean value. It tells us how much the values in the data vary from the mean.
✍️Standard Deviation of a given data set is also defined as the square root of the variance of the data set. The standard deviation of n values (say x1, x2, x3, …, xn) is calculated by taking the sum of the squares of the difference of each value from the mean, dividing by the number of values, and taking the square root, i.e.
Population Standard Deviation: σ = √( Σ (xi – μ)² / N )
Sample Standard Deviation: s = √( Σ (xi – x̄)² / (n – 1) )
✍️Standard Deviation is used to tell us about the scatter of the data. A lower degree of deviation tells us that the observations xi are close to the mean value and the dispersion is low, whereas a higher degree of deviation tells us that the observations xi are far from the mean value and the dispersion is high.
Note: The two formulas look the same and differ only slightly in their denominators: the denominator is n – 1 for a sample and N for a population. Originally the sample standard deviation formula also had n in its denominator, but a sample measured around its own mean tends to underestimate the variability of the population. So n was replaced with n – 1; this is called Bessel’s correction, and it makes the sample variance an unbiased estimate of the population variance.
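✍️The difference between the two denominators can be seen in a short sketch. The data list is hypothetical. statistics.pstdev divides by N (population) and statistics.stdev divides by n – 1 (sample, with Bessel’s correction).

```python
import math
import statistics

# Hypothetical sample data, chosen only for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean_value = sum(data) / n
sum_of_squares = sum((x - mean_value) ** 2 for x in data)

# Population formula: divide by N
population_sd = math.sqrt(sum_of_squares / n)

# Sample formula: divide by n - 1 (Bessel's correction)
sample_sd = math.sqrt(sum_of_squares / (n - 1))

print("Population SD:", population_sd, "| statistics.pstdev:", statistics.pstdev(data))
print("Sample SD:", sample_sd, "| statistics.stdev:", statistics.stdev(data))
```

✍️The sample value is always the larger of the two, and the gap shrinks as n grows.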
Relative Measure of Dispersion
✍️We use relative measures of dispersion to compare the spread of two or more quantities that have different units. Because they are ratios, they carry no unit, which gives a better idea about the relative scattering of the data.
✍️Here are some of the relative measures of dispersion:
☘️Coefficient of Range: It is defined as the ratio of the difference between the highest and lowest value in a data set to the sum of the highest and lowest value.
☘️Coefficient of Variation: It is defined as the ratio of the standard deviation to the mean of the data set. We use percentages to express the coefficient of variation.
☘️Coefficient of Mean Deviation: It is defined as the ratio of the mean deviation to the value of the central point of the data set.
☘️Coefficient of Quartile Deviation: It is defined as the ratio of the difference between the third quartile and the first quartile to the sum of the third and first quartiles.
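✍️Three of the relative measures above can be sketched as follows. The data list is hypothetical, the standard deviation uses the population form (dividing by the number of values), and the mean is taken as the central point for the coefficient of mean deviation.

```python
import math

# Hypothetical sample data, chosen only for illustration
data = [10, 20, 30, 40, 50, 60, 70]

n = len(data)
mean_value = sum(data) / n
highest, lowest = max(data), min(data)

# Coefficient of Range: (highest - lowest) / (highest + lowest)
coefficient_of_range = (highest - lowest) / (highest + lowest)

# Coefficient of Variation: standard deviation / mean, expressed as a percentage
standard_deviation = math.sqrt(sum((x - mean_value) ** 2 for x in data) / n)
coefficient_of_variation = standard_deviation / mean_value * 100

# Coefficient of Mean Deviation: mean deviation / central value (here, the mean)
mean_deviation = sum(abs(x - mean_value) for x in data) / n
coefficient_of_mean_deviation = mean_deviation / mean_value

print("Coefficient of range:", coefficient_of_range)
print("Coefficient of variation (%):", coefficient_of_variation)
print("Coefficient of mean deviation:", coefficient_of_mean_deviation)
```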
Calculate the average, variance and standard deviation in Python using NumPy
NumPy is a general-purpose array-processing package for Python. It provides a high-performance multidimensional array object and tools for working with these arrays. It is the fundamental package for scientific computing with Python. NumPy provides very easy methods to calculate the average, variance, and standard deviation.
Average
✍️One can calculate the average by using the numpy.average() function in Python.
🌞Syntax:
☘️ numpy.average(a, axis=None, weights=None, returned=False)
🌞Parameters:
☘️ a: Array containing data to be averaged
☘️ axis: Axis or axes along which to average a
☘️ weights: An array of weights associated with the values in a
☘️ returned: Default is False. If True, the tuple (average, sum_of_weights) is returned, otherwise only the average is returned
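✍️A short sketch of how these parameters can be used. The arrays are hypothetical, and the example assumes NumPy is installed and imported as np.

```python
import numpy as np

# Hypothetical scores and weights, chosen only for illustration
scores = np.array([70, 80, 90])
weights = np.array([1, 1, 2])

# Plain average: every value counts equally
simple_average = np.average(scores)

# Weighted average; returned=True also gives back the sum of the weights
weighted_average, weight_sum = np.average(scores, weights=weights, returned=True)

# axis=0 averages down each column of a 2-D array
matrix = np.array([[1, 2], [3, 4]])
column_average = np.average(matrix, axis=0)

print("Simple average:", simple_average)
print("Weighted average:", weighted_average, "Sum of weights:", weight_sum)
print("Column averages:", column_average)
```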