Below are some course notes on a very important branch of mathematics. Statistics is the field of mathematics that relates facts and numbers through a set of methods that allow us to collect and analyze data, so that we can interpret it in a generalized way.
One of the problems to be solved by statistical inference is to test Hypotheses. A statistical hypothesis assumes a given population parameter, such as mean, standard deviation, correlation coefficient, etc.
Hypothesis testing is a procedure for deciding on the veracity or falsehood of a given hypothesis.
For a statistical hypothesis to be validated or rejected with certainty, it would be necessary to examine the entire population — which in practice is unfeasible.
Alternatively, a random sample of the population of interest is extracted. Because the decision is made based on the sample selection, errors may occur:
reject a hypothesis when it is true
don’t reject a hypothesis when it’s false
A Statistical Hypothesis Test is a decision procedure that allows us to decide between H° (the null hypothesis) and Hª (the alternative hypothesis), based on the information contained in the sample.
The Null Hypothesis states that a population parameter (such as mean, standard deviation, and so on) is equal to a hypothetical value. The Null Hypothesis H° is often an initial claim based on previous analyses or expertise.
H° is what we take as the result of research through a sample; that is, we analyze the sample, tabulate the answers, and arrive at a value — a hypothesis — since, in practice, we do not consult the entire population.
The alternative hypothesis states that a population parameter is smaller, higher, or different from the hypothetical value in the null hypothesis. The alternative hypothesis is one that we believe can be true or hope to prove to be true.
For example, suppose a satisfaction survey concluded that 95% of customers are satisfied with the service. We want to challenge this research hypothesis — the status quo.
Therefore, we propose another hypothesis (the alternative), contending that the level of satisfaction is higher, lower, or different from 95%. We then perform a hypothesis test to decide whether or not to reject the null hypothesis.
Because we are analyzing sample data and not population data, errors can occur:
Type I error: the probability of rejecting the null hypothesis when it is actually true — a false positive.
Type II error: the probability of failing to reject the null hypothesis when it is actually false — a false negative.
The definition of hypotheses is one of the most critical points in a hypothesis test. In practice, we have a business problem, and we have to interpret it and start from that to define the null Hypothesis and the Alternative Hypothesis.
A wrong definition compromises all the work that follows. Defining the hypotheses is a business problem: we interpret a scenario and translate it into H° and Hª.
A researcher has exam results for a sample of students who have taken a training course for a national exam. The researcher wants to know if the trained students scored above the national average of 82.
In this case, an alternative hypothesis can be used because the researcher is specifically raising the hypothesis that the scores for trained students are higher than the national average.
As stated above, the national average for the exam is 82. The researcher wants to verify whether students who complete the training have a population mean above this average. Based on this, we can define the hypotheses:
H°: μ = 82 — status quo
Hª: μ > 82 — alternative hypothesis
Based on this, we can now choose which hypothesis test we’re going to work on:
Unilateral Hypothesis Test
Bilateral Hypothesis Test
In this case, we are working with a right-tailed (one-sided) hypothesis test. This is the definition of hypotheses: interpreting the business problem, understanding what is being requested, and defining what H° is and what Hª is.
If we reverse that definition, we will probably apply the Hypothesis Test, but our conclusions will be completely different.
Step 1 is to define the hypotheses (null and alternative). We should keep in mind that the only reason we’re testing the null hypothesis is that we think it’s wrong. We state what we believe is wrong about the Null Hypothesis in an alternative hypothesis.
Step 2 is to define the criteria for the decision. To define the criteria for a decision, we declare the significance level for the test. It could be 0.5%, 1%, or 5%. Based on the significance level, we decide whether or not to reject the null hypothesis. It depends on business requirements; that is, the definition of the significance level depends on the business area we are working with — a hypothesis test for the health area, for example, should have a minimal margin of error.
Step 3 is to calculate the test statistic and its probability. The higher this probability, the less evidence we have against the null hypothesis, and the less reason to reject it.
Step 4 is to decide. Here, we compare the p-value with the predefined significance level, and if it is less than the significance level, we reject the null hypothesis. By deciding to reject the null hypothesis, we can make mistakes, because we are looking at a sample and not the entire population.
Therefore, we first formulate the null and alternative hypotheses from our understanding of the business problem. We then collect a sample of size n and calculate the sample mean, considering that the mean is the parameter we are studying.
We plot the sample mean on the x-axis of the sampling distribution and choose a significance level alpha based on the severity of a Type I error.
Next, we calculate the test statistic, the critical values, and the critical region, and then we make the decision. If the sample mean falls in the acceptance region of the chart, we do not reject the null hypothesis; if it falls in one of the tails, we reject the null hypothesis.
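To make the four steps concrete, here is a minimal Python sketch for the exam example (H°: μ = 82, Hª: μ > 82). The scores and the use of scipy's one-sample t-test are illustrative assumptions, not data or code from the original study.

```python
# Minimal sketch of the four steps for the exam example (H0: mu = 82, Ha: mu > 82).
# The scores below are made-up illustrative data, not real exam results.
import numpy as np
from scipy import stats

scores = np.array([88, 79, 91, 84, 86, 82, 90, 85, 77, 89])  # hypothetical sample
alpha = 0.05                                                  # step 2: significance level

# Step 3: one-sample t-test; halve the two-sided p-value for the right-tailed test
t_stat, p_two_sided = stats.ttest_1samp(scores, popmean=82)
p_value = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

# Step 4: decision
if p_value < alpha:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: reject H0 in favor of Ha (mu > 82)")
else:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: do not reject H0")
```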
We have two kinds of hypothesis tests. The One-Sided or One-tailed Test is used when the alternative hypothesis is expressed as: < or >
Example: we have two defined hypotheses. H° states that the mean of some study is equal to 1.8, and the alternative hypothesis Hª states that the mean is less than 1.8.
In this case, we have a lower-tail test, or left one-sided test. The same reasoning applies to the upper-tail or right one-sided test; the only difference is that we change the position of the analysis within the graph.
If the mean is within the white region of the chart, we do not reject the null hypothesis; otherwise, we reject it.
A school has a group of students (population) considered obese. The probability distribution of the weight of students in this school between 12 and 17 years is normal, with an average of 80 kilos and a standard deviation of 10 kilos.
The school principal proposes a medically monitored treatment campaign to combat obesity. The treatment will consist of diets, physical exercises, and a change of eating habits. The doctor states that the result of the treatment will be presented in 4 months, and that the students will have their weights decreased in this period.
H°: μ = 80 — status quo
Hª: μ < 80 — an alternative hypothesis
Where: μ = average of students’ weights after four months.
So, what we want is to challenge the status quo. The principal says that the moment he starts a weight reduction campaign, the average weight of the students will decrease. In this case, we use a mean less than 80, i.e., Hª: μ < 80, a left one-sided test.
A bilateral hypothesis test is used whenever the alternative hypothesis is expressed as "different from"; that is, we are not concerned with whether the parameter is greater or lower than a given value; we want to know whether it is different from that value.
We have H°, setting the average to 1.8, and we have the alternative hypothesis Hª with the average different from 1.8. If the average is different from 1.8, it can be greater or lesser than the value. Because of this, we need two rejection areas on the chart.
The curve above represents the sampling distribution of the average broadband utilization. It is assumed that the population average is 1.8GB, according to the null hypothesis H°: μ = 1.8 — status quo.
A cookie factory packs boxes weighing 500 grams. Weight is monitored periodically.
The quality department has established that we should maintain the weight at 500 grams. What is the condition for the quality department to stop the production of the biscuits?
H°: μ = 500 — status quo
Hª: μ ≠ 500 — an alternative hypothesis
The null hypothesis indicates that each box weighs 500 grams. However, we want to give quality control a criterion for stopping production when the weight of the package changes. It doesn't matter if the box weighs 499 g or 501 g — if it is different from 500 g, we stop production. In this case, we apply a bilateral test, and the result may fall in either of the two tails.
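A minimal sketch of this bilateral test in Python, assuming illustrative box weights and scipy's one-sample t-test (the real monitoring data and procedure are not given in the text):

```python
# Hedged sketch of the two-sided (bilateral) test for the 500 g boxes.
# Weights are invented for illustration only.
import numpy as np
from scipy import stats

weights = np.array([498.2, 501.1, 499.5, 502.3, 497.8, 500.4, 503.0, 498.9])
alpha = 0.05

t_stat, p_value = stats.ttest_1samp(weights, popmean=500)  # two-sided by default

if p_value < alpha:
    print("Reject H0: average weight differs from 500 g -> stop production")
else:
    print("Do not reject H0: no evidence the average weight differs from 500 g")
```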
The purpose of the hypothesis test is to verify the validity of a statement about a parameter of the population based on sampling.
As we are taking a sample as a basis, we are exposed to the risk of wrong conclusions about the population due to sampling errors.
The null hypothesis may be wrongly rejected, even when it is true, if we have collected a sample that is not representative of the population or is very small.
To test the null hypothesis H°, we define a decision rule that establishes a rejection zone for the hypothesis; to do so, we must determine a significance level, α, the most common values being 0.10, 0.05, and 0.01.
Each significance level has an associated confidence level. Depending on the value of α that we define for the hypothesis test, we increase or decrease the confidence with which we reject or do not reject the null hypothesis.
Suppose the value of the population parameter defended by the null hypothesis H° falls in the rejection zone. In that case, it is unlikely to be the actual value of the population, and the null hypothesis H° will be rejected in favor of the alternative hypothesis Hª.
Occasionally, the null hypothesis is true even though it was rejected based on data from a sample. In that case, we would be making a mistake in our decision. This error is called a Type I error, and its probability depends on the significance level α chosen.
According to the business problem, we can use one value of α or another, and so we increase the degree of confidence or not — the hypothesis test is a business tool that helps the decision maker.
When the value defended by the null hypothesis H° falls outside the rejection zone, we consider that there is no evidence to reject H° in favor of the alternative hypothesis. But here, we may also be making a mistake if the alternative hypothesis, although discarded by the data we have at hand, is, in fact, true — this error is called a Type II error.
After one year, the effectiveness of a particular vaccine is 25% (i.e., the immune effect extends for more than one year in only 25% of the people who take it). A new, more expensive vaccine is developed, and one wishes to know whether it is, in fact, better.
H°: p = 0.25 — status quo
Hª: p > 0.25 — alternative hypothesis
We want to challenge the null hypothesis by verifying whether the proportion p is greater than 25%.
Type I error: approving the new vaccine when, in reality, it is no more effective than the current vaccine.
Type II error: reject the new vaccine when it is, in fact, better than the current vaccine.
To balance the two errors, we adjust the significance level alpha. We increase or decrease the alpha value to decrease or increase the confidence with which we reject the null hypothesis — the choice is up to the data scientist.
The probability of making a Type I error is called α, the significance level.
We then say that the significance level α of a test is the maximum probability of a Type I error that we are willing to risk.
The alpha value is typically predetermined, and common choices are α= 0.05 and α = 0.01. The probability of making a Type II error is called β.
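As a sanity check on what α means in practice, here is a small simulation sketch (with arbitrary, made-up parameters): when the null hypothesis is true, a test at α = 0.05 should wrongly reject it in roughly 5% of repeated experiments.

```python
# Simulation sketch: when H0 is true, we should wrongly reject it in roughly
# alpha of the repeated experiments (the Type I error rate). Numbers are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, trials = 0.05, 30, 10_000
false_positives = 0

for _ in range(trials):
    sample = rng.normal(loc=80, scale=10, size=n)   # H0 (mu = 80) is true here
    _, p_value = stats.ttest_1samp(sample, popmean=80)
    if p_value < alpha:
        false_positives += 1

print(f"Observed Type I error rate: {false_positives / trials:.3f} (expected ~{alpha})")
```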
A confidence interval is a range of values that is likely to contain the actual value of the population parameter. Note that a probability is associated with the definition of a confidence interval. This probability is what we call the:
Confidence Level
Degree of Trust
Confidence Coefficient
These probabilities come from common choices of the degree of confidence that one wishes to achieve, the most common being 90%, 95%, and 99%.
A confidence interval acts as an indicator of the accuracy of your measurement. It indicates how stable your estimate is, and it can be calculated to determine how close we would stay to our original estimate if we repeated the experiment. Therefore, the confidence interval is associated with a degree of confidence that measures our certainty that the interval contains the population parameter.
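A minimal sketch of computing a confidence interval for a mean in Python, using scipy's t distribution; the sample values are invented for illustration:

```python
# Sketch of a 95% confidence interval for a mean, based on the t distribution.
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])
confidence = 0.95

mean = sample.mean()
sem = stats.sem(sample)                       # standard error of the mean
interval = stats.t.interval(confidence, df=len(sample) - 1, loc=mean, scale=sem)

print(f"{confidence:.0%} confidence interval for the mean: {interval}")
```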
The p-value helps us interpret the results of a hypothesis test. Part of constructing a hypothesis test is setting the significance level α, and the same procedure can reject the null hypothesis for one value of α while failing to reject it for a lower value.
So, let’s assume we have a business problem where:
define hypotheses H° and Hª;
collect the data sample;
calculate the statistics and set the value of α.
We verified that the null hypothesis H° fell into the rejection area! If we keep everything the same and change the value of α, it may be that with this new value, the H° hypothesis does not fall into the rejection area. After all, what’s right?
Therefore, we need something more to help interpret whether or not we should reject H°, that is, to increase our Degree of Trust …
Another way to proceed is to present the probability of significance (p-value) or descriptive level. This probability is the value on which we base our decision, so statisticians give that probability a particular name, p-value, or “plausibility value.”
It indicates the probability of obtaining a test statistic at least as extreme as the one observed, under the assumption that the null hypothesis H° is true.
p-value is another indicator that helps make the right decision on rejecting or not H°.
The p-value is a probability between 0 and 1, where 0 indicates impossible and 1 indicates absolute certainty. Therefore, a p-value of 0.001 indicates a one-in-a-thousand chance: under the null hypothesis, the occurrence of the observed result would be very improbable.
The p-value represents the chance or probability that the effect (or difference) observed between treatments/categories is due to chance and not to the factors being studied. The p-value helps us add a little more rigor to our decision.
Let’s assume that one researcher tested the efficiency of two treatments and observed that the mean treatment “A” was higher than the average of treatment “B.” After performing the appropriate statistical analyses, the researcher found a p-value = 0.3.
This means that, if there were truly no difference between the treatments, there would be a 30% probability of observing a difference between the means at least as large as this one purely by chance. After all, the p-value is a probability of significance. If the researcher states that the difference between the means is caused by the treatments, the data give him weak support for that claim.
Looking at it from the repetition point of view: if there were no real treatment effect and the researcher performed the same experiment 100 times, about 30 of those experiments would show a difference at least as large by chance alone, since the p-value is equal to 0.3.
In classical statistics, the p-value (also called descriptive level or probability of significance) is the probability of obtaining a test statistic equal to or more extreme than that observed in a sample from the perspective of the null hypothesis.
For example, in hypothesis tests, we can reject the null hypothesis if the p-value is less than 5%. Thus, another interpretation of the p-value is that it is the lowest significance level at which we would reject the null hypothesis. In general terms, a small p-value means that obtaining a test statistic as extreme as the one observed is improbable under H°, thus leading to the rejection of the null hypothesis.
The p-value is the lowest probability of significance with which the null hypothesis would be rejected.
Graphically, the p-value is the probability in the tail region beyond the observed statistic; that is, it is the smallest significance level at which we would reject the null hypothesis.
In general terms, a small p-value means that obtaining a test statistic as extreme as the one observed would be improbable under the null hypothesis, thus leading to its rejection. In short, we found evidence in the data that allows us to reject the null hypothesis.
We conducted the Hypotheses Test to challenge the Status quo, challenging what we have today by rejecting the Null Hypothesis H°.
A low p-value says that the data we observed would be very unlikely if our null hypothesis were true; that is, the null hypothesis has a low “plausibility.”
We started with a model, and now that same model tells us that the data we have are unlikely to have happened. That's surprising. In this case, the model and the data are in conflict with each other, so we have to make a choice: either the null hypothesis is correct and we have just seen something remarkable, or the null hypothesis — that is, the model — is wrong.
Given this choice, if we trust the data more than our assumptions, then when faced with a low p-value we should reject the null hypothesis and consider the alternative hypothesis, because we find no evidence to support the current null hypothesis.
When the p-value is high, on the other hand, we have not seen anything unlikely or surprising. The data are consistent with the null hypothesis model, and we have no reason to reject it. Events that have a high probability of happening happen all the time.
Does that prove the null hypothesis is true? No! We know that many other similar hypotheses could also explain the data we have seen. The most we can say is that it does not seem to be false. Formally, we say that we "fail to reject" the null hypothesis. That may seem like a pretty weak conclusion, but it is all we can say when the p-value is not low enough. It simply means that the data are consistent with the model we started with.
Low p-value: we reject H° in favor of Hª.
High p-value: we do not reject H°, and the test is inconclusive.
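A tiny sketch of this decision rule in Python (the threshold of 0.05 and the example p-values are illustrative choices, not fixed by the text):

```python
# Tiny decision-rule sketch: compare the p-value with the pre-chosen alpha.
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the hypothesis-test conclusion for a given p-value and alpha."""
    if p_value < alpha:
        return "Reject H0 in favor of Ha (low p-value)"
    return "Do not reject H0 (high p-value, test inconclusive)"

print(decide(0.003))   # low p-value  -> reject H0
print(decide(0.42))    # high p-value -> do not reject H0
```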
The p-value is not the probability that the null hypothesis is true but rather a probability of significance to verify whether or not we have evidence within the data to reject the current hypothesis.
The p-value is not the probability that the null hypothesis has been deceptively rejected.
The magnitude of the p-value does not indicate the size or importance of an observed effect. For example, in clinical research where two treatments are compared, a relatively small p-value is not an indicator that there is a significant difference between the treatments’ effects.
The p-value and significance level are not synonymous. The p-value is obtained from a sample, while the significance level is usually set before data collection.
And there we have it. I hope you have found this helpful. Thank you for reading. 🐼
Data Science is a multidisciplinary area that involves statistics, mathematics, programming, computer science, and knowledge in business areas, having statistics as one of the fundamental pillars in Data Science.
We will briefly discuss some fundamental concepts in statistical analysis: the definition of statistics, data types, descriptive statistics, univariate and bivariate analysis tools, central tendency measures, dispersion measures, distribution measures, and the correlation coefficient.
Statistics is the science that allows us to learn from the data. As we live in the age of Big Data, this large set of data is generated at high volume, wide variety, and high speed; it is easy to understand that statistics have become a crucial analysis tool today.
Therefore, we need techniques, tools, and processes to analyze the amount of data. Statistics provide us with many of these tools to extract a lot of information relevant to understanding the current situation and decision-making.
1.Collect data: statistics allows us to collect data; that is, it provides us with tools and sampling techniques — we will hardly ever collect all the data on a single phenomenon. A widespread example is electoral polling, where organizations survey samples of the population based on statistical techniques and procedures.
2.Organize data: in addition to collecting, we can organize the data with statistical tools. We can tabulate, calculate frequencies, place data in an organized way, and perform analysis processes or even predictive modeling in sequence.
3.Present data: with statistics, we can also present the data through statistical charts, where visualizations summarize or simplify what the data contain.
4.Describe data: we can describe the data! What is the average of a given attribute, its median, or the highest value? Do the data follow a normal distribution or not? This description helps us understand how data is organized to facilitate decision-making work.
5.Interpret data: finally, we can accomplish perhaps the most critical work of all, interpreting the data. From this interpretation, through statistical tools, we can make inferences about populations through small samples.
In short, statistics offer us a series of tools that allow us to collect, organize, present, describe and interpret data.
Therefore, we need to define what type of data we are working on to know the most appropriate statistical analysis technique to employ. We have two main classifications — quantitative and qualitative:
1.Nominal Qualitatives: profession, sex, religion — there is no defined hierarchy between the data. The nominal qualitative data represent descriptions for the data and do not allow ranking.
2.Ordinal qualitatives: in some situations, we have an apparent ordering or hierarchy between the categories (ranking), for example, schooling, social class, and positioning in a queue.
3.Discrete Quantitative: these are values that can be counted. We can count the number of children, the number of cars parked, the number of hits on a website, or thumbs up on publications — finite, integer values.
4.Continuous Quantitative: data that can assume any value within a range of values, e.g., weight, height, salary, etc. These are observations that are measured rather than counted, usually taking decimal values.
This type of division is necessary because we will choose the best statistical technique depending on the data type. We have a set of methods for qualitative data and a set of procedures for quantitative data.
We already know that statistics help us collect, organize, present, describe and interpret the data. We also understand that the data have to be qualitative or quantitative. Now we will see the types of studies:
1.Experimental study: in an experimental study, each individual is randomly assigned to a treatment group, and then the specific data and characteristics are observed and collected. Random assignment helps protect against unknown biases that could interfere with the outcome of the analysis.
2.Observational study: in an observational study, the specific data and characteristics are collected and observed, but there is no attempt to intervene in the phenomenon being studied. That is, we are watching the phenomenon, collecting and analyzing the data. Observational studies do not offer the same level of protection against confounding factors as experimental ones.
As its name suggests, descriptive statistics is a set of statistical methods used to describe the main characteristics of the data; these methods are graphic or numerical. We use descriptive statistics to begin our analysis process to understand our data.
There are several methods available to assist in describing the data, each method being designed to provide distinct insight into the available information or an already common hypothesis.
1.Graphical methods: the primary purpose of graphical methods is to organize and present data in a managerial and agile way — data visualization plays a crucial role in the entire data science process.
2.Data Summarization: descriptive statistics propose to sum up and show the data so that we can quickly get an overview of the information being analyzed and better understand a set through its main characteristics.
3.Main descriptive measures
Representative values: mean and median
Dispersion and variation: variance and standard deviation
Nature (shape) of distribution: bell, uniform, or asymmetric
Therefore, we collected the data and applied descriptive statistics to obtain a representative value, evaluate the dispersion, and assess the distribution of these data.
Based on data information and how it is organized, we’ll decide which tools to use to treat, clean, transform, normalize, and standardize data for predictive modeling. The decisions that will come in the sequence depend on Mean, variance, standard deviation, distribution, etc.
1.Frequency table to describe data: one of the simplest ways to describe data is through frequency tables, which reflect the observations made on the data and are often represented by charts. We observe a particular phenomenon, collect the data, and then tabulate it — we create a frequency table. Each line or value corresponds to a class (category), and the frequency is the count of each class type in the set.
2.Frequency Distribution: a frequency distribution is one of the main tools of descriptive statistics, showing the number of observations that fall within each interval — a way to put more information into a frequency table. To create a frequency distribution, we list the data, define the range, determine the number of classes, determine the class width, and build the frequency distribution (a frequency table with more information) to understand the data better.
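A short pandas sketch of both ideas, using a fabricated "category" column for the frequency table and a fabricated "score" column binned into classes for the frequency distribution:

```python
# Sketch of a frequency table and a binned frequency distribution with pandas.
# The 'category' and 'score' data are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "B", "A", "C", "B", "A", "C", "A"],
    "score": [55, 62, 71, 48, 90, 67, 83, 75],
})

freq_table = df["category"].value_counts()               # simple frequency table
freq_dist = pd.cut(df["score"], bins=[40, 60, 80, 100]).value_counts().sort_index()

print(freq_table)
print(freq_dist)   # counts per class interval (frequency distribution)
```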
We will apply descriptive statistics primarily during the initial phase of the analysis project. Still, we can use the techniques and tools offered by descriptive statistics at almost any stage of the process — tools to summarize data, visualize data, visualize relationships, summarize data frequency, etc.
Frequency Table: Shows the occurrence of elements within the dataset;
Contingency Table: used when we have two variables and want to visualize the relationship between them.
Charts: Understand how data is organized, distributed, and how it relates.
1. Frequency Table: base for almost all other tools.
2. Bar Chart: is one of the most used charts in data analysis. We can represent frequency tables through bars on a bar chart. That is, each bar represents precisely the proportion of the frequency in the frequency table.
3. Pareto Chart: this can be constructed with bars representing each of the classes in the frequency table. The height of each bar is directly associated with, and proportional to, the frequency of each class.
The red line that passes through the entire chart is constructed so that the left side of the line has the leading cause of the problem, and on the right side, the reasons less relevant to the problem.
We can see that the leading cause of the incorrect medication problem is the dosage error, while the smallest of the issues is self-medication.
4. Pie Chart: very friendly, but little recommended — every chart has its value as long as it is well constructed, yet the pie chart easily leads to misinterpretation in both its construction and its reading.
5. Line Chart: we use the line chart to show a variable's evolution along the x-axis (usually time), with the variable's accumulated or measured value on the y-axis.
The main caution with the line chart, as with the bar chart, concerns its scale — we can easily manipulate the impression this type of chart gives by changing its scale.
6. Stem and Leaf: widely used in statistics, not so much in Data Science. This chart divides the data into two parts, where the stem holds the leading (most significant) digits, placed to the left of the vertical bar.
The leaves are the trailing digits, placed to the right of the vertical bar. By listing all the leaves to the right of each stem, we can easily see how the data are distributed.
7.Histogram: in the histogram, the bars touch each other, giving an appearance very close to the bar chart — but the information held in a histogram is about just one feature.
The main goal in a histogram is to show frequency distribution and, thus, analyze whether or not the data follow a normal distribution — check how the data is distributed, which is already of great help.
In general, before starting the predictive modeling process, we use the histogram during preprocessing. We take the set, create the histogram, and analyze how the data is distributed — depending on the algorithms we are going to use later, we may have to change the distribution of the data, applying normalization so that the data approaches a standard normal distribution before feeding the algorithm.
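A minimal matplotlib sketch of this preprocessing check, using randomly generated data in place of a real dataset:

```python
# Sketch of a histogram to inspect the shape of a distribution before modeling.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=80, scale=10, size=1_000)   # roughly normal, illustrative

plt.hist(data, bins=30, edgecolor="black")
plt.title("Histogram: checking the frequency distribution")
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()
```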
Below are two tools widely used for bivariate analysis, when we want to represent two variables and understand how they relate.
1.Contingency Table: This is a table that shows the numerical relationship between two variables.
Note that we have the Male and Female labels, which are the values of the gender variable, and the counts for each type of animal (for example, dog or cat) — we are relating two different pieces of information (Sex × Animal), and the table also shows the totals for each.
It is a table widely used in classification problems in Machine Learning to interpret the results of machine learning models.
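A short sketch of building such a table with pandas.crosstab on fabricated data (the column names and values are assumptions for the example):

```python
# Sketch of a contingency table with pandas.crosstab, relating two variables.
import pandas as pd

df = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "animal": ["Dog", "Cat", "Dog", "Dog", "Cat", "Cat"],
})

table = pd.crosstab(df["sex"], df["animal"], margins=True)  # margins=True adds totals
print(table)
```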
2.Scatter Plot: it is one of the main tools for this type of analysis, allowing us to study the relationship between two variables.
In the chart above, we illustrate a correlation between per capita income and the degree of happiness. Unsurprisingly, we can see that as income increases, the degree of joy also increases — unbelievable.
However, the objective of the scatter plot is not to study causality; that is, we cannot affirm based on the graph that a higher per capita income causes a higher degree of happiness — happiness can be the consequence of factors other than income.
1.Measure of central tendency — the centrality
In addition to the tools we saw earlier, descriptive statistics also offers us several measures that we can use to interpret the data: central tendency measures, dispersion measures, and shape measures.
These are the leading measures of central tendency used in descriptive statistics: mode, median, and the arithmetic average.
1.Mean: The primary measure of the central trend of the data is a number around which an entire dataset is distributed. It is a unique number that can estimate the value of the complete dataset. Averages are the simplest ways to identify trends in a dataset.
However, averages can bring pitfalls that lead to distorted conclusions — we cannot rely solely on the Mean; it is only a starting point of analysis. The disadvantage of using the Mean is when we have extreme values in the set (outliers), compromising the Mean’s consistency.
2.Median: The median is the value that divides the data into two equal parts; that is, the number of terms on the right side is equal to the number of terms on the left side when the data is organized in ascending or descending order.
The advantage of using the median is that it is a measure that is not affected by extreme values.
3.Mode: The mode is the term that appears most often in the dataset, that is, the term with the highest frequency. However, a dataset may have no mode, because all values appear the same number of times.
If two values tie for the highest frequency, the dataset is bimodal; if three values tie, the dataset is trimodal; and for n modes, the dataset is multimodal.
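A minimal sketch of the three measures with Python's standard statistics module, on an invented list of values:

```python
# Minimal sketch of the three central-tendency measures on an illustrative list.
import statistics

data = [2, 3, 3, 5, 7, 8, 8, 8, 12]

print("mean:  ", statistics.mean(data))     # sensitive to outliers
print("median:", statistics.median(data))   # robust to extreme values
print("mode:  ", statistics.mode(data))     # most frequent value (8)
```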
The central trend measures that we saw above help us understand the centrality of the data. However, we also need to know how far away the data is from the center of the distribution; if we have the Mean of the data, we also need to understand how the data points around the Mean are dispersed — the variability of the data set.
Therefore, we will always work with the central trend measures in conjunction with the dispersion measures — most of the time, the average is not enough to get an idea of how the data is organized.
Standard Deviation: this is the measure of the average distance between each element and the set's mean; that is, how the data are spread around the average. A low standard deviation indicates that data points tend to be concentrated close to the dataset's mean, while a high standard deviation indicates that data points are spread more widely.
Variance: the variance is the square of the standard deviation. In some situations we use the variance; in others, the standard deviation. The big difference is that the variance is expressed in squared units, while the standard deviation is in the same unit as the data we are studying.
Amplitude: it is one of the most straightforward techniques of descriptive statistics. The amplitude is the difference between the smallest and highest value of the dataset.
Percentile: this is a way to represent the position of a value in the dataset. To calculate percentiles, the values in the dataset must be sorted in ascending order. There are 99 percentiles within a dataset; that is, at any time, we can look up the data at a given percentile (position) in the group and, from that, make inferences, stating that a given proportion of the data lies below or above a specific position.
Quartile: These values divide the data into quarters, as long as the data is sorted in ascending order. There are four sections, each with 25% of the elements of the set. The interquartile interval is the difference between Q3-Q1, a measure that shows the concentration of 50% of the data.
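A short NumPy sketch of these dispersion measures on illustrative data:

```python
# Sketch of the dispersion measures on illustrative data with NumPy.
import numpy as np

data = np.array([4, 7, 9, 10, 12, 15, 18, 21, 25, 30])

std = data.std(ddof=1)                     # sample standard deviation
var = data.var(ddof=1)                     # sample variance (std squared)
amplitude = data.max() - data.min()        # range (amplitude)
q1, q3 = np.percentile(data, [25, 75])     # quartiles via percentiles
iqr = q3 - q1                              # interquartile range (middle 50%)

print(std, var, amplitude, q1, q3, iqr)
```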
The measurements of skewness and kurtosis characterize the shape of the distribution of elements around the Mean. It is a fundamental concept because we should apply some techniques to adjust the data before predictive modeling according to the data’s distribution.
1.Perfect symmetry: in a perfect normal distribution, the tails on each side of the curve are exact mirror images — mean, median, and mode have the same value.
2. Positive Asymmetry: when a distribution is tilted to the right, the tail on the right side of the curve is larger than the tail on the left side, and the mean is higher than the mode. This situation is called positive asymmetry. Depending on our goal, if we identify a positively asymmetric data distribution, we might need to apply a statistical technique to bring the data to a symmetric distribution before feeding machine learning algorithms.
3. Negative Asymmetry: When a distribution is tilted to the left, the tail on the left side of the curve is greater than the tail on the right side, and the average is smaller than the mode. This situation is called negative asymmetry.
To calculate the coefficient of asymmetry, we can use the mode-based coefficient of skewness, (mean − mode) / standard deviation, or the median-based option, 3 × (mean − median) / standard deviation:
The signal gives the direction of asymmetry.
Zero means no asymmetry.
A negative value means that the distribution is negatively asymmetric.
A positive value means that the distribution is positively asymmetric.
The coefficient compares the sample distribution with a normal distribution. The higher its absolute value, the more the distribution differs from a normal distribution.
The kurtosis coefficient is one of the most used coefficients to measure the degree of flattening of a distribution curve; the percentile coefficient of kurtosis (k) is calculated from the interquartile range and the percentiles of orders 10 and 90.
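A brief sketch of measuring skewness and (excess) kurtosis with scipy.stats on generated data; the exponential sample is only an example of a positively skewed distribution:

```python
# Sketch of skewness and kurtosis with scipy.stats on generated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
right_skewed = rng.exponential(scale=2.0, size=5_000)   # positively skewed sample

print("skewness:", stats.skew(right_skewed))        # > 0 -> positive asymmetry
print("kurtosis:", stats.kurtosis(right_skewed))    # excess kurtosis (0 for a normal)
```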
So far, we have seen several measures that help describe the data: central tendency measures to verify the centrality of the data, and dispersion measures to determine its variability.
In some situations, we will want to go beyond these measures and check the relationship between two variables — for this, we calculate the correlation coefficient.
The correlation coefficient is often used during the exploratory analysis phase of the data to understand beyond the definition of a single variable and its relation to the other variables in the set.
Correlation allows us to determine how strongly pairs of variables are related; that is, it lets us analyze two variables and then extract the strength of their relationship. The main result of a correlation is the correlation coefficient (r), ranging from -1.0 to +1.0. The closer r is to +1.0 (or -1.0), the more strongly the two variables are related.
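A small sketch combining a scatter plot with the Pearson correlation coefficient; the income/happiness values are synthetic, generated only to illustrate a positive relationship:

```python
# Sketch of a scatter plot plus the Pearson correlation coefficient (r).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
income = rng.uniform(1_000, 10_000, size=200)
happiness = 0.0005 * income + rng.normal(0, 1, size=200)   # positive relationship + noise

r = np.corrcoef(income, happiness)[0, 1]
print(f"Pearson r = {r:.2f}")   # close to +1 -> strong positive correlation

plt.scatter(income, happiness, alpha=0.5)
plt.title(f"Scatter plot (r = {r:.2f})")
plt.xlabel("per capita income")
plt.ylabel("happiness score")
plt.show()
```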
It is up to the data scientist to choose the appropriate tool for each step of the data analysis process. We will continue in Part II with the concepts of Probability and Inferential Statistics, the other two pillars of statistical analysis.
And there we have it. I hope you have found this useful. Thank you for reading. 🐼
Statistics can seem really dull at times, and that is no surprise, since the application of statistics has existed for thousands of years.
When it comes to data science interviews, however, there are several concepts that interviewers test. Here we have compiled 10 of the most frequently asked statistical concepts.
When performing a hypothesis test, we start from a given, pre-set value of alpha (the significance level) to build the decision rule, commonly set at 5%. An alternative is to leave the choice of the alpha probability to whoever will use the test's conclusions, so that it does not need to be fixed a priori.
The idea is to calculate, assuming the null hypothesis is true, the probability of obtaining estimates more unfavorable or extreme (in light of the alternative hypothesis) than the one provided by the sample. This probability is the descriptive level, denoted by the p-value.
Small p-values provide evidence that the current null hypothesis is false because, the sample being our tool for inference about the population, it estimates a very small probability of the observed result happening if H0 were true.
What counts as "small" is up to the user, who decides which threshold to compare the obtained p-value against. Therefore, the smaller the test's p-value, the more evidence we have to reject the null hypothesis.
However, a very low p-value is not proof that the null hypothesis is false, only that it is probably false.
To use the p-value in the decision of a hypothesis test, we simply compare the p-value with the significance level α:
If the p-value ≤ α, we reject H0, because the observed result would be a rare event under H0.
Although this is a method for deciding which hypothesis we should accept, we will not work with it here. Using the p-value is very common when we are dealing with software that provides this value for us.
Interpreting the p-value
We apply a hypothesis test when we want to test a given value for a population parameter theta — that is, a mean, proportion, or standard deviation — based on a sample.
Therefore, we take a sample from the population and make an assumption, to check whether, through a statistical test, it is possible to conclude that the population has the parameter value tested by the sample.
We assume a hypothetical value for a population parameter and, based on information from the sample, perform a statistical test to accept or reject that hypothetical value.
Since the decision to accept or reject the hypothesis is made according to elements of a sample, it is clear that the decision is subject to errors. Based on the results obtained from a sample, it is not possible to make decisions that are definitively correct.
Confidence intervals and hypothesis tests share a very close relationship. A confidence interval can be shown as "10 +/- 0.5" or [9.5, 10.5], for example.
Hypothesis testing is the basis of any research question and often comes down to trying to prove that something did not happen by chance.
The Z-test can be used for a proportion; that is, it is a statistical test for a population proportion p. The Z-test can be used when a binomial distribution satisfies np (successes) ≥ 5 and nq (failures) ≥ 5. The test statistic is z = (p̂ − p) / √(p·q / n), where p̂ is the sample proportion and q = 1 − p.
The Z-test is a hypothesis test based on the normal distribution that uses a Z statistic. A z-test is applied when we know the population variance or when, lacking the population variance, we have a large sample size.
The T-test, in turn, is a hypothesis test based on the t distribution that uses a T statistic. We use the T-test in the opposite case, when we do not know the population variance and have a small sample size.
You can use the image below as a reference to guide which test you should use:
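A hedged sketch contrasting the two cases in Python: a z statistic assuming a known population standard deviation, and scipy's one-sample t-test when it is unknown. All numbers are invented for illustration:

```python
# z-statistic when sigma is assumed known; t-test when it is not. Illustrative data.
import numpy as np
from scipy import stats

sample = np.array([1.9, 2.1, 1.7, 2.0, 2.2, 1.8, 2.05, 1.95])
mu0 = 1.8

# z-test: assume the population standard deviation (sigma) is known
sigma = 0.2
z = (sample.mean() - mu0) / (sigma / np.sqrt(len(sample)))
p_z = 2 * (1 - stats.norm.cdf(abs(z)))          # two-sided p-value

# t-test: sigma unknown, small sample -> use the sample standard deviation
t_stat, p_t = stats.ttest_1samp(sample, popmean=mu0)

print(f"z = {z:.2f} (p = {p_z:.3f}),  t = {t_stat:.2f} (p = {p_t:.3f})")
```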
Linear regression is one of the most fundamental algorithms used to model relationships between a dependent variable and one or more independent variables. In simpler terms, it involves finding the "line of best fit" that represents two or more variables.
The line of best fit is found by minimizing the squared distances between the points and the line — this is known as minimizing the sum of squared residuals. A residual is simply the predicted value minus the actual value.
Comparing the green line of best fit to the red line, we can see that the vertical lines (the residuals) are much larger for the green line than for the red line. This makes sense because the green line is so far from the points that it is not a good representation of the data.
Linearity: the relationship between X and the mean of Y is linear.
Homoscedasticity: the variance of the residuals is the same for any value of X.
Independence: the observations are independent of each other.
Normality: for any fixed value of X, Y is normally distributed.
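A minimal least-squares sketch with scikit-learn on synthetic data, illustrating the fitted line and the sum of squared residuals (the data-generating line y = 2.5x + 1 is an arbitrary choice for the example):

```python
# Least-squares fit on synthetic data with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.5 * X[:, 0] + 1.0 + rng.normal(0, 1.5, size=100)   # true line plus noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)   # actual minus predicted (sign does not affect squares)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("sum of squared residuals:", np.sum(residuals ** 2))
```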
Logistic regression is similar to linear regression, but it is used to model the probability of a discrete number of outcomes, typically two. For example, we can predict whether a person is alive or dead given their age.
At a glance, logistic regression sounds much more complicated than linear regression, but there is really only one extra step in the process.
First, we calculate a score using an equation similar to the equation for the line of best fit in linear regression.
The extra step is feeding the calculated score into the sigmoid function, σ(x) = 1 / (1 + e^(−x)), to return a probability. That probability can then be converted into a binary output, 1 or 0.
To find the weights of the initial equation used to calculate the score, methods such as gradient descent or maximum likelihood are used.
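A short scikit-learn sketch of the two-step idea (linear score, then sigmoid, then threshold); the toy age/outcome data are invented purely for illustration:

```python
# Logistic regression sketch: linear score -> sigmoid -> binary output.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(score: np.ndarray) -> np.ndarray:
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-score))

# Toy "age -> alive (1) / dead (0)" dataset, fabricated for the example
ages = np.array([[20], [35], [45], [55], [65], [75], [85], [90]])
alive = np.array([1, 1, 1, 1, 0, 0, 0, 0])

model = LogisticRegression().fit(ages, alive)
score = model.intercept_ + model.coef_[0] * 60        # linear score for age 60
print("P(alive | age=60) ≈", sigmoid(score)[0])
print("predicted class:", model.predict([[60]])[0])   # probability thresholded at 0.5
```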
There are 5 main ways to sample data: simple random, systematic, convenience, cluster, and stratified sampling:
A simple random sample requires the use of randomly generated numbers to choose a sample. More specifically, it initially requires a sampling frame, a list or database of all members of a population.
Then we randomly generate a number for each element, using Excel for example, and select the n samples needed.
Systematic sampling can be even easier to apply: we take one element from the sample, skip a predefined amount (n), and then take the next element. For example, we might pick every fourth name from the list.
Convenience sampling takes a sample from a group that is easy to contact, for example by asking people outside a shopping mall; that is, we sample the first people we find. This technique is often considered bad practice, since the resulting data can be biased.
Cluster sampling starts by dividing a population into groups, or clusters. What distinguishes it from stratified sampling is that each cluster should be representative of the population as a whole. We then randomly select entire clusters to sample.
Stratified random sampling starts by dividing a population into groups with similar attributes. Then a random sample is taken from each group. This method is used to ensure that different segments of a population are equally represented.
For example, consider a survey conducted in a school to determine overall satisfaction. It may make sense to use stratified random sampling to equally represent the opinions of students from each department.
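A brief pandas sketch of simple random, systematic, and stratified sampling on a fabricated student table (the column names are assumptions for the example):

```python
# Simple random, systematic, and stratified sampling with pandas.
import pandas as pd

df = pd.DataFrame({
    "student_id": range(1, 101),
    "department": ["Math", "Biology", "History", "Physics"] * 25,
})

simple_random = df.sample(n=10, random_state=0)          # simple random sample
systematic = df.iloc[::10]                               # every 10th element
stratified = df.groupby("department").sample(frac=0.1, random_state=0)  # per stratum

print(len(simple_random), len(systematic), len(stratified))
```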
The central limit theorem states that the distribution of sample means approaches a normal distribution. To give an example, we would take a sample from a dataset and calculate the mean of that sample. Repeating this many times, we would plot all the means and their frequencies on a chart and see a bell curve, also known as a normal distribution.
The mean of this distribution will closely resemble that of the original data. We can improve the accuracy of the mean and reduce the standard deviation by collecting larger samples and more samples overall.
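A quick simulation sketch of the theorem in Python: even though the population below is deliberately skewed (exponential), the means of repeated samples pile up into a roughly normal shape. The population and sample sizes are arbitrary:

```python
# Central limit theorem sketch: sample means of a skewed population look normal.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
population = rng.exponential(scale=2.0, size=100_000)   # clearly non-normal

sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

plt.hist(sample_means, bins=40, edgecolor="black")
plt.title("Distribution of sample means (approximately normal)")
plt.show()

print("population mean:", population.mean(), "mean of sample means:", np.mean(sample_means))
```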
Combinations and permutations are two slightly different ways of selecting objects from a set to form a subset. Permutations take the order of the subset into account, while combinations do not.
Combinations and permutations are extremely important if you work in network security, pattern analysis, operations research, and much more.
Definition: a permutation of n elements is any arrangement of those n elements in a definite order. There are n factorial (n!) ways to arrange n elements. Order matters!
The number of permutations of n things taken r at a time is defined as the number of r-tuples that can be taken from n different elements and is equal to P(n, r) = n! / (n − r)!.
For example, how many permutations does a licence plate with 6 digits have?
Definition: the number of ways to choose r out of n objects where order does not matter. The number of combinations of n things taken r at a time is defined as the number of subsets with r elements of a set with n elements and is equal to C(n, r) = n! / (r! (n − r)!).
For example, in how many ways can we draw 6 cards from a deck of 52 cards?
Note that these are very simple questions, and things can get much more complicated than this, but you should have a good idea of how it works from the examples above!
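A worked sketch of the two formulas with Python's math module, assuming the plate question means 6 distinct digits chosen from 0–9 (the original does not say whether digits may repeat):

```python
# Worked sketch of the two counting formulas with Python's math module.
import math

# Permutations: 6-digit plate, digits 0-9 without repetition -> P(10, 6)
print(math.perm(10, 6))    # 151200

# Combinations: 6 cards drawn from a 52-card deck -> C(52, 6)
print(math.comb(52, 6))    # 20358520
```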
Bayes' theorem is a conditional probability statement; essentially, it looks at the probability of one event (B) happening given that another event (A) has already happened.
One of the most popular machine learning algorithms, Naïve Bayes, is built on these two concepts. Moreover, if we venture into the realm of machine learning, we will probably be using Bayesian methods.
Bayes' theorem equation: P(A|B) = P(B|A) · P(A) / P(B)
Conditional probability equation: P(A|B) = P(A ∩ B) / P(B)
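A tiny numeric sketch of the theorem with an invented screening-test example (all probabilities below are made up for illustration):

```python
# Bayes' theorem sketch: P(disease | positive) = P(positive | disease) * P(disease) / P(positive).
p_disease = 0.01              # prior P(A)
p_pos_given_disease = 0.95    # P(B|A), test sensitivity
p_pos_given_healthy = 0.05    # false-positive rate

p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))      # total probability P(B)

p_disease_given_pos = p_pos_given_disease * p_disease / p_positive
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")   # ~0.161
```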
A probability distribution is an easy way to find the probabilities of the different possible outcomes of an experiment. There are many different types of distribution you should learn, but a few I would recommend are the normal, uniform, and Poisson distributions.
The normal distribution, also known as the Gaussian distribution, is a bell-shaped curve that is quite prominent in many settings, including people's heights and IQ scores.
The mean of the normal distribution is equal to μ and the variance is equal to σ².
The Poisson distribution is a discrete distribution that gives the probability of a number of independent events occurring in a fixed time. An example of when you would use it is if you want to determine the probability of X patients arriving at a hospital in a given hour.
The mean and the variance are both equal to λ.
A uniform distribution is used when all outcomes are equally likely, such as the roll of a die.
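A compact scipy.stats sketch of the three distributions; the parameters (IQ-like scores, 4 arrivals per hour, a fair die) are illustrative choices:

```python
# Normal, Poisson, and uniform (discrete) distributions with scipy.stats.
from scipy import stats

normal = stats.norm(loc=100, scale=15)      # e.g. IQ-like scores: mean 100, sd 15
poisson = stats.poisson(mu=4)               # e.g. an average of 4 arrivals per hour
uniform = stats.randint(low=1, high=7)      # fair die: outcomes 1..6, equally likely

print(normal.pdf(100))        # density at the mean
print(poisson.pmf(6))         # P(exactly 6 arrivals in the hour)
print(uniform.pmf(3))         # 1/6 for each face
```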
We have already briefly addressed descriptive statistics, which describes how the data are organized, and probability theory, which measures the variability of chance phenomena according to their occurrence. From now on, we will study the third major area of statistics, Inferential Statistics.
In Inferential Statistics we have a population — a complete set of data, we apply a sampling technique to collect a sample of this population, analyze this sample and make inferences about the population.
Inferential statistics aims to extrapolate the results (obtained with descriptive statistics) to the population; that is, we must apply a sampling technique that extracts a representative sample, otherwise we will reach inferences that do not represent reality.
From this sample, we calculate the mean, median, mode, and several other statistics and then, applying a series of inferential statistics techniques, we make inferences about the population.
The concept of sampling is extremely important in Data Science, in practice we work with samples all the time. During the preprocessing phase, what we do is prepare samples and then train the Machine Learning model — using even the same sampling techniques that some companies use in election surveys, but now applied in Data Science.
To start our discussion on inferential statistics, we have to make very clear the difference between population and sample.
Population: it is the set of all elements or results under investigation.
Sample: Is any subset of the population. In the vast majority of cases, we will only work with the samples.
Let’s assume a scenario: We were invited to conduct a survey to measure the durability of the lamps produced by a particular factory. What approach would we use?
Test all the lamps produced by that factory
Obtain a representative sample of the lamp population and infer the durability of all lamps produced.
It’s not hard to come to a conclusion… of course, it is not feasible to make this analysis, daily, in all the lamps manufactured by the company. Therefore, what can be done is to collect a representative sample and, from this sample, make inferences about the population of lamps produced.
When we work with sampling, we have an expected error rate — in fact, we can calculate this error rate and set confidence intervals. After all, as representative as the sample may be, it is not the population itself.
Inferential statistics provide us with tools so that through the analysis of this sample we can make inferences and estimate characteristics about the entire population. The concept of sampling is quite simple: to know if the pie is delicious, just eat a slice (an inference on the whole pie).
While a census (a very expensive undertaking) involves an examination of all elements of a given group, sampling involves a study of only a representative part of the elements.
The sampling theory studies the relationships between a population and the samples extracted from that population. Sampling is very useful for evaluating unknown population quantities:
Voting intention poll for elections
Audience calculation of television programs
Or to determine whether the differences observed between two samples are due to chance or if they are significant.
We will now see which techniques and procedures are used to collect representative samples of a population. We have basically two main sampling methods: random and non-random methods.
Non-random methods: intentional sampling, snowball, quota, and convenience sampling. These types of sampling are little used; ideally, we use random methods that do not interfere with the study results — especially in Machine Learning.
Random methods: simple, systematic, stratified, cluster, and multi-stage random sampling.
Probabilistic or Random Sampling: in this type of sampling the samples are randomly obtained, that is, any unit of the population has the same probability to be chosen. In this method, all we want is to choose the sample components purely randomly. We have three variants of random sampling.
Simple random sampling: it is the most commonly used method when it is necessary to separate training and test data.
Simple random sampling without replacement: in this type of sampling, the elements of the population are numbered from 1 to n. We draw one of the n observations of the population with equal probability, and the drawn observation does not return to the "pot" for the next draw.
Simple random sampling with replacement: we have the opposite: when the first element of the population is drawn, it goes to the sample, and when we proceed to the second draw, the element already drawn returns to the "pot" — bootstrapping helps us choose with or without replacement.
Systematic Sampling: used when population elements are ordered and are removed periodically; an example would be a production line where an item is removed at a specific interval for quality control.
Stratified Sampling: in this type of sampling, a heterogeneous population is stratified, divided into homogeneous subpopulations — from each of these strata a sample is taken. That is, we make some prior divisions in the data population and then make the selection to compose the sample.
Cluster Sampling: this is a concept almost the inverse of stratified sampling; in stratified sampling we divide the population into several groups and take units from each group, while here we divide the population into groups and keep entire groups.
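A hedged sketch of how these ideas show up in Machine Learning preprocessing, using scikit-learn's train_test_split for a simple random split and a stratified split; the feature and label arrays are synthetic:

```python
# Simple random vs. stratified train/test splits with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(-1, 1)
y = np.array([0] * 160 + [1] * 40)          # imbalanced classes (80% / 20%)

# Simple random sampling of the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Stratified sampling keeps the 80/20 class proportion in both splits
X_trs, X_tes, y_trs, y_tes = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print("test class balance (random):    ", y_te.mean())
print("test class balance (stratified):", y_tes.mean())
```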
We will now address one of the main tools of Inferential Statistics, hypothesis tests. One of the main problems to be solved by statistical inference is to test hypotheses. A statistical hypothesis is an assumption of veracity or falsehood about a given population parameter, such as mean, standard deviation, correlation coefficient, etc.
For a statistical hypothesis to be validated or rejected, it would be necessary to examine the entire population — which in practice is unfeasible. Alternatively, we extract a random sample of the population of interest and make the decision based on this sample — so some errors may occur:
reject a hypothesis when it is true
don’t reject a hypothesis when it’s false
That is, what we want is to use a hypothesis test to validate a population parameter through a random sample — from the hypothesis test we make an inference about the population parameter.
Therefore, a statistical hypothesis test is a decision procedure that allows us to decide between H° (null hypothesis) and Hª (alternative hypothesis), based on the information contained in the sample.
The null hypothesis states that a population parameter (such as mean, standard deviation, and so on) is equal to a hypothetical value. The null hypothesis is often an initial claim based on previous analyses.
The alternative hypothesis states that a population parameter is smaller, higher, or different from the hypothetical value in the null hypothesis. The alternative hypothesis is one that we believe to be true or hope to prove true.
Because we are analyzing sample data and not population data, errors can occur:
Type I error: this is the probability of rejecting the null hypothesis when it is actually true.
Type II error: this is the probability of failing to reject the null hypothesis when it is actually false.
One of the secrets behind the hypothesis test is the correct definition of what H° is and what Hª is — the Data Scientist is responsible for defining the null hypothesis and the alternative hypothesis. We have to understand the definition of the business problem and, from this problem, identify what each of the two hypotheses is. An incorrect definition can compromise the entire process.
Example: a researcher has exam results for a sample of students who took a preparatory course for a national exam. The researcher wants to know if the trained students scored above the national average of 78.
The researcher wants to do an analysis based on a business need: check the test results of the sample of students who have taken the specific course and compare them to verify whether or not the students' average is in line with the national average.
An alternative hypothesis can be used because the researcher is specifically raising the hypothesis that the scores of trained students are higher than the national average.
H°: population mean is equal to 78 — a statement we have today
Hª: population average is greater than 78 — what we want to prove
We want to prove as an alternative hypothesis that students who have training have an average higher than 78.
Here we have, basically, the sequence of steps used in a hypothesis test, commonly applied in Digital Marketing and A/B Testing (a minimal code sketch follows the list):
Formulate null and alternative hypotheses — the problem of interpretation;
Collect a sample size n and calculate the sample mean;
Plot the sample mean on the x-axis of the sample distribution;
Set a significance level α based on the severity of a Type I error;
Calculate statistics, critical values, and critical region;
If the sample mean is in the white area of the chart we do NOT reject the null hypothesis;
If the sample mean falls in one of the tails (the rejection region), we REJECT the null hypothesis.
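To make these steps concrete, here is a minimal sketch in Python (standard library only) applying them to the exam example above, with H°: μ = 78 versus Hª: μ > 78. The sample size, the sample mean, and the assumption of a known population standard deviation are hypothetical, chosen just for illustration.

```python
from statistics import NormalDist

# Hypothetical inputs -- not taken from the original example
mu_0 = 78         # value stated by the null hypothesis H0
sigma = 10        # assumed (known) population standard deviation
n = 36            # sample size
sample_mean = 81  # observed sample mean for the trained students
alpha = 0.05      # significance level (probability of a Type I error)

# Test statistic: how many standard errors the sample mean lies above mu_0
z = (sample_mean - mu_0) / (sigma / n ** 0.5)

# One-tailed (right) critical value: reject H0 if z falls in the upper tail
z_critical = NormalDist().inv_cdf(1 - alpha)

# p-value: probability of a sample mean this large (or larger) under H0
p_value = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, critical value = {z_critical:.2f}, p-value = {p_value:.4f}")
if z > z_critical:
    print("Sample mean falls in the rejection tail -> reject H0")
else:
    print("Sample mean falls in the non-rejection region -> do not reject H0")
```

With these illustrative numbers, z ≈ 1.8 exceeds the critical value of about 1.645, so the sample mean falls in the right tail and we would reject H°.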
The Unilateral or One-tailed test is used when the alternative hypothesis is expressed with < or >; that is, we state the null hypothesis and then define the alternative hypothesis with a greater-than or a less-than sign.
We have the null hypothesis that the mean is 1.8 — in the box on the left we have Hª with a mean value > 1.8, and in the box on the right Hª with a mean value < 1.8.
Translating this into a chart, it is basically a normal distribution on which we locate the sample mean. If the mean falls in the white area, we do not reject H°; if it falls in the yellow region, we reject H°.
Note that we will be on one side or the other of the tail of the normal distribution, because of this the test is called one-sided or one-tailed test.
If the mean is within the white region of the chart, we do not reject the null hypothesis we already have, otherwise we reject it.
Example: A school has a group of students (population) considered obese. The probability distribution of the weight of students in this school between 12 and 17 years old is normal, with a mean of 80 kg and a standard deviation of 10 kg. The school principal proposes a treatment campaign to combat obesity. The doctor states that the results of the treatment will be presented after a certain number of months and that the students will have reduced their weight in this period.
H° μ = 80 — status quo, i.e. the current reality
Hª μ < 80 — proof of different reality
Just as we can apply a one-sided hypothesis test, that is, when the alternative hypothesis is higher or lower than the stated mean, we also have the option of performing a bilateral hypothesis test. The bilateral test is used whenever the alternative hypothesis is expressed with ≠.
Now we don’t want to know if it’s bigger or smaller, we want to know if the alternative hypothesis is simply different.
The curve above represents the sampling distribution of the average broadband utilization. It is assumed that the population mean is 1.8 GB, according to the null hypothesis H° μ = 1.8. Because there are two yellow rejection regions in the graph, this is called a bilateral or two-tailed hypothesis test — the alternative hypothesis is expressed with ≠.
Example 2: A cookie factory packs boxes weighing 500 grams. Weight is monitored periodically. The quality department has established that the weight should be maintained at 500 grams. What is the condition for the quality department to stop the production of the biscuits?
H° μ = 500g — status quo, i.e. the current reality
Hª μ ≠ 500g — proof of the different reality
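Following the same recipe, here is a hedged sketch of the two-tailed case for the cookie example, H°: μ = 500 g versus Hª: μ ≠ 500 g. The sample summary (n, sample mean) and the process standard deviation are invented purely for illustration.

```python
from statistics import NormalDist

# Hypothetical quality-control sample -- the numbers are illustrative only
mu_0 = 500        # grams, stated by H0
sigma = 8         # assumed known standard deviation of the filling process
n = 40
sample_mean = 497.2
alpha = 0.05

z = (sample_mean - mu_0) / (sigma / n ** 0.5)

# Two-tailed test: split alpha between the two rejection tails
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

# Two-sided p-value
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, critical values = ±{z_critical:.2f}, p-value = {p_value:.4f}")
if abs(z) > z_critical:
    print("Reject H0: the mean weight drifted away from 500 g -> stop production")
else:
    print("Do not reject H0: keep producing")
```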
The purpose of the Hypothesis Test is to verify the validity of a statement about a population parameter, based on sampling. As we are taking samples as a basis, we are exposed to the risk of wrong conclusions about the population, due to sampling errors.
To test H°, it is necessary to define a decision rule in order to establish a rejection zone for the hypothesis, that is, to define a significance level α — the most commonly used values being 0.10, 0.05, and 0.01.
That is, basically what we are defining is our margin of error. We define this through the level of α.
If the value of the population parameter, defended by the null hypothesis, falls in the rejection zone, this value is very unlikely to be the true value of the population and the null hypothesis will be rejected to the detriment of the alternative hypothesis.
It may happen that, although rejected based on data from a sample, the null hypothesis is actually true. In that case, we would be making a mistake in our decision. This error is called a Type I error, and the probability of it occurring is the α we choose.
When the value defended by the null hypothesis falls outside the rejection zone, that is, falls in the white region of the distribution, then we consider that there is no evidence to reject the null hypothesis in favor of the alternative hypothesis. However, we can also make a mistake if the alternative hypothesis, although discarded, is indeed true. This is the Type II error.
The Type I error is tied to the significance level: by increasing or decreasing α, we increase or decrease the probability of committing a Type I error within the hypothesis testing process.
Example 3: The effectiveness of a certain vaccine after one year is 25% (i.e., the immune effect extends for more than 1 year in only 25% of the people taking it). A new, more expensive vaccine is developed, and one wishes to know whether it is, in fact, better.
H° p = 0.25 — status quo, i.e. the current reality
Hª p > 0.25 — proof of a different reality
Type I error: approve the vaccine when, in reality, it has no effect greater than that of the vaccine in use.
Type II error: reject the new vaccine when it is, in fact, better than the vaccine in use
The probability of making a Type I error is set by the significance level α. We say that the significance level α of a test is the maximum probability with which we are willing to run the risk of a Type I error — typically α = 5%. The probability of making a Type II error is called β, and it also depends on the true value of the population parameter.
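As a rough illustration of how α and β interact, the sketch below uses the vaccine example (H°: p = 0.25 versus Hª: p > 0.25) with the normal approximation to the sample proportion. The sample size n and the "true" alternative effectiveness of 0.35 are assumptions made only to show the calculation.

```python
from statistics import NormalDist

# Vaccine example: H0: p = 0.25 versus Ha: p > 0.25
# n and the "true" alternative p below are hypothetical, for illustration only
p0 = 0.25
n = 200
alpha = 0.05

# Normal approximation of the sample proportion under H0
se0 = (p0 * (1 - p0) / n) ** 0.5
# Rejection threshold for the sample proportion (right tail)
threshold = p0 + NormalDist().inv_cdf(1 - alpha) * se0

# Type I error probability is alpha by construction
print(f"Reject H0 when the sample proportion exceeds {threshold:.3f} (alpha = {alpha})")

# Type II error (beta): failing to reject H0 when the new vaccine is truly better,
# say its real effectiveness is p = 0.35
p_true = 0.35
se_true = (p_true * (1 - p_true) / n) ** 0.5
beta = NormalDist(mu=p_true, sigma=se_true).cdf(threshold)
print(f"beta = {beta:.3f}, power = {1 - beta:.3f}")
```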
Previously, we saw one of the main areas of statistics, the renowned Descriptive Statistics — concepts for describing and understanding data, a fundamental knowledge during the exploratory analysis. From now on, we will address the main concepts of Probability Theory in a simple but detailed way.
Probability is a number that ranges from 0 to 1 and that measures the chance of a given result.
The closer the probability is to 0, the lower the chances of the result occurring; the closer it is to 1, the greater the chances of occurrence. It is important to point out that we will never have a probability that is negative or above 1 — worth keeping in mind, considering that several machine learning algorithms are based on probability theory.
In fact, there is a category of machine learning algorithms — probabilistic algorithms, one of which is the famous Naive Bayes that we will see later.
Probabilities can be expressed in a variety of ways, including decimals, fractions, and percentages. For example, the chance of occurrence of a given event can be expressed as 10%, 10 in 100, 0.10, or 1/10 — different ways to represent probability.
Probability theory consists of using human intuition to study the phenomena of our daily lives. For this, we will use the basic principle of human learning, which is the idea of experiment.
We apply probability theory to measure or find a number that helps us predict something; we want to measure uncertainty.
An experiment is any activity performed that can present different results. An experiment is said to be random when we cannot affirm the result obtained before performing the experiments. On the other hand, an experiment is equiprobable if all possible outcomes have the same chance of occurring.
We can classify the experiments into two types:
Random (casual): the most common type of experiment, whose result we do not know a priori; that is, we are not sure of the outcome. Therefore, we will measure the uncertainty of the final result.
Non-random (deterministic): deterministic experiments are fully characterized a priori; that is, they are phenomena in which the result is known even before they occur, so there is no uncertainty to measure — we already know the outcome.
The concept of random experiments can cause some confusion, although relatively simple to learn. A random experiment is a phenomenon that has unpredictable results when repeated numerous times in similar processes.
For example, we know that a coin has only two faces — but before we flip the coin (experiment), do we know which face will land upwards? NO, we are not sure what the outcome will be. Still, we can apply probability theory to measure the uncertainty and get a notion of the probability of each result of each experiment (flipping the coin).
Experiments are actions, activities, acts, executions, operations… Throwing a die or tossing a coin are examples of random experiments. In the case of the die, we can have 6 different results {1, 2, 3, 4, 5, 6} and, in the coin toss, 2 {heads, tails}.
An event is one or more results of an experiment, an activity that presents a possible result (event) for an experiment (action). Event is any subset of sample space (S).
Possible results form a subset of the sample space (S). Therefore, consider a 6-sided die — the sample space is {1, 2, 3, 4, 5, 6}, the set of all possible results of the experiment or random phenomenon.
For example, suppose we work in a quality area and must measure the probability that a product is defective; in that case, our sample space could be defined as S = {defective, not defective}.
An important concept associated with events and the sample space is the complementary event. Of course, we know that an event may or may not occur. However, with p being the probability that it occurs (success) and q the probability that it does not occur (failure), for the same event there is always the relationship: p + q = 1.
We say that two events are independent when one of the events does not affect the probability of the realization of the other and vice versa.
Thus, with p1 being the probability of the 1st event and p2 the probability of the 2nd event, the probability that both events occur simultaneously is given by: p = p1 × p2.
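A tiny sketch of the two relationships just stated, using an assumed defect rate and the classic fair coin and fair die:

```python
# Complement rule: p + q = 1
p_defective = 0.02              # assumed probability of a defective product
q_not_defective = 1 - p_defective
print(q_not_defective)          # 0.98

# Independent events: P(both) = p1 * p2
p_heads = 1 / 2                 # fair coin
p_six = 1 / 6                   # fair die
print(p_heads * p_six)          # probability of heads AND a six = 1/12 ≈ 0.0833
```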
Classic Probability is used when we know the number of possible outcomes of the event of interest, and we can calculate the probability of the event. That is, we already have data and possible results at our disposal.
Empirical Probability: Involves conducting an experiment to collect the data and observing how often an event occurs. We do not know the possible results, and we will conduct experiments to observe the frequency of events.
Subjective Probability: When data or experiments are not available to calculate the probability. We don’t have anything, which means it’s subjective.
There are also 6 basic rules that we cannot violate when applying probability theory:
If P(A) = 1, then we can guarantee that event A will occur
If P(A) = 0, then we can ensure that event A WILL NOT occur
The probability of any event WILL ALWAYS be between 0 and 1.0
The probability of any event will NEVER be negative or greater than 1.0
The sum of all probabilities in a sample space will be = 1.0
The Complement P(A’) of the event (A) is defined as all results in a sample space, NOT part of event A.
Let’s say there’s a department store that wants to reward its customers. Based on this, the store will survey its customers as they enter the store, asking which of these 3 prizes they would most like to win: guitar, camera, or bicycle. The results were:
Looking at the rows, we see the total number of respondents for men and women with their respective interests.
If the winner is chosen at random among these clients, the probability of selecting a woman is just the corresponding relative frequency (since we have the same probability of selecting any of the 478 customers). Thus, there are 251 women in the data out of a total of 478, resulting in a probability of: P(woman) = 251/478 ≈ 0.525.
It is called marginal probability because it depends only on the totals found in the table margins. The same method works for more complicated events.
For example, what is the probability of choosing a woman whose preferred prize is the camera? As 91 women named the camera as their preference, the probability is: P(woman and camera) = 91/478 ≈ 0.19.
Probabilities like these are called joint probabilities because they give the probability of two events occurring together — female and choosing camera as the desired prize.
Here we’ll look at the concept behind one of the main probabilistic algorithms — Naive Bayes, used for classification tasks; understanding these probabilistic concepts will help you understand one of the most popular algorithms based on conditional probability.
Let’s use as a base the same frequency table we saw earlier:
Our sample space is 478 clients (sample size), and we can read relative frequencies as probabilities — when drawing a person at random from that set, we can only draw one of these 478 people.
If we have the information that the customer chosen is a woman, would that change the likelihood that the customer’s prize is a bicycle?
Initially, our whole sample space is 478 clients — when we calculate the probabilities above, we calculate them over that total. However, given the condition that the randomly chosen person is a woman, does that change the likelihood that the prize is a bicycle? YES: now we no longer look at all 478; we restrict the base to the 251 women — at that moment, we no longer care about the male clients.
We write the probability of a selected customer wanting a bike, given that we selected a woman, as: P(bicycle | woman) = 30/251 ≈ 0.12.
We no longer base ourselves on 478 because we start from the precondition that the person drawn is a woman — because of that, we calculate a conditional probability.
Generally speaking, this is the essence of the operation of the Naive Bayes algorithm — a fundamental concept.
Just as we conditioned on the client being female, we can also do this for male clients. Therefore, of the 227 men, 60 said their preferred prize was a bicycle: P(bicycle | man) = 60/227 ≈ 0.264.
In general, when we want the probability of an event from a conditional distribution, we write P(B | A) and pronounce it “the probability of B given A.”
A probability that takes a given condition into account like this is called a conditional probability. Here we worked with counts, but we could also work with percentages. For example, 30 women wanted a bicycle as a prize, and there were 251 female customers, so we found a probability of 30/251.
To find the probability of event B given event A, we restrict the results to those in A. Formally, we write: P(B | A) = P(A and B) / P(A).
The formula for conditional probability requires a constraint. The formula works only when the event that is given has a probability greater than 0.
The formula does not work if P(A) is 0 because it means that we had already “given” the fact that A is true even if A's probability was 0, which would be a contradiction.
The equation in the conditional probability definition contains the probability of A and B.
Rearranging the equation gives the general multiplication rule for compound events, which does not require the events to be independent: P(A and B) = P(A) × P(B | A).
The probability that two events, A and B, both occur is the probability that event A occurs multiplied by the probability that event B occurs given that A has occurred — that is, the probability that both occur simultaneously. Thus, we are placing conditions on the occurrences of events; we are adding layers of complexity to measure the uncertainty of more complex phenomena.
P(A or B) = P(A) + P(B) — P(A and B): here, we look for the probability of at least one event occurring, that is, calculating the probability of event A occurring + calculating the probability of event B occurring and subtracting the probability of event A and B occurring together, and thus we can calculate the probability of event A or event B.
P(A or B) = P(A) + P(B): when the events are mutually exclusive, the probability of A or B is simply the probability of A plus the probability of B. Therefore, it is first necessary to identify whether the events are mutually exclusive and then apply the correct formula.
P(A and B) = P(A) * P(B | A) = P(A | B) * P(B): here, we calculate the probability of two events occurring simultaneously, that is, the probability of A multiplied by the probability of B given A (B evaluated knowing that event A has already occurred). The reverse also holds — the probability of A given B (A evaluated knowing that event B has already occurred), multiplied by the probability of B.
P(A and B) = P(A) * P(B): if the events are independent, the probability of A and B occurring together is simply the probability of A multiplied by the probability of B.
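The sketch below applies these rules to the store survey, using only the counts quoted in the text (478 customers in total, 251 women and 227 men, 91 women who chose the camera, and 30 women and 60 men who chose the bicycle); the remaining cells of the table are not needed here.

```python
total = 478
women, men = 251, 227
women_camera = 91
women_bike, men_bike = 30, 60

# Marginal probability: depends only on a table margin
p_woman = women / total                        # ≈ 0.525

# Joint probability: two events together
p_woman_and_camera = women_camera / total      # ≈ 0.190

# Conditional probability: restrict the sample space to the women only
p_bike_given_woman = women_bike / women        # ≈ 0.120
p_bike_given_man = men_bike / men              # ≈ 0.264

# General multiplication rule: P(A and B) = P(A) * P(B | A)
p_woman_and_bike = p_woman * p_bike_given_woman    # same as 30 / 478

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_bike = (women_bike + men_bike) / total
p_woman_or_bike = p_woman + p_bike - women_bike / total

print(round(p_woman, 3), round(p_woman_and_camera, 3),
      round(p_bike_given_woman, 3), round(p_bike_given_man, 3),
      round(p_woman_and_bike, 3), round(p_woman_or_bike, 3))
```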
Both concepts seem to have similar ideas of separation and distinction, but in fact, mutually exclusive events cannot be independent.
Consider two mutually exclusive events {we received an A grade in the math course} and {we received a B in the math course}. They are mutually exclusive because they have no results in common.
Let’s say we know the grade was an A in math. Now, what’s the probability of having a B? We can’t get both grades, so the probability is 0. Knowing that the first event (receiving an A) has already occurred, we have changed the probability of the second event occurring (to zero). So these events are not independent; they are mutually exclusive — the occurrence of one excludes the other.
One of the most used tools in descriptive statistics, which is also very useful in Probability Theory, is the contingency table — showing the relationship between two variables and collecting a series of insights and probabilities.
Contingency tables organize the information corresponding to the data using 2 criteria: in this case, we have the animal variable represented in the columns, the sex variable represented in the rows, and the margins with the respective totals.
Contingency tables allow us to represent the data, whether qualitative or quantitative — it is a handy tool for data analysis, allowing us to collect important statistics about our data distribution.
Example: A real estate survey in rural towns ranked homes in two price categories {low < $150k and high > $150k}. The survey also looked at whether the houses had at least two bathrooms or not {true, false}.
About 56% of the houses had at least two bathrooms, 62% had a low price, and 22% had both. That’s enough information to fill in the table.
Now, finding another probability is simple. What is the probability that a high-priced house will have at least two bathrooms? Again, it’s a conditional probability:
P(at least two bathrooms | high price) = P(at least two bathrooms AND high price) / P(high price) = 0.34 / 0.38 ≈ 89.5%
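Since the three percentages above are enough to fill in the 2×2 table, the calculation can be reproduced in a few lines:

```python
# Fill in the 2x2 contingency table from the three facts given in the text
p_two_baths = 0.56
p_low = 0.62
p_low_and_two_baths = 0.22

p_high = 1 - p_low                                          # 0.38
p_high_and_two_baths = p_two_baths - p_low_and_two_baths    # 0.34

# Conditional probability: P(at least two bathrooms | high price)
p_baths_given_high = p_high_and_two_baths / p_high
print(round(p_baths_given_high, 3))                         # ≈ 0.895
```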
A probability distribution describes the chance that a variable (discrete or continuous) can assume over a space of values.
As we saw up there, the set of all possible results of a random experiment is called sample space. The elements of this set can be numeric or not — the result of a probability experiment is usually a count or a measure. When this occurs, the result is called a random variable, which can be of the discrete or continuous type.
1.Discrete random variable: assumes values in an enumerable set, not being able to assume decimal or non-integer values — number of children, employees, number of cars produced, etc.
2.Continuous random variable: can assume various values in the range of real numbers — household income, company billing, weight, length, height, etc.
Let’s describe the three main distributions of discrete random variables: binomial, Poisson, and hypergeometric. For continuous random variables, the most used are: normal, uniform, exponential, and t-Student.
In Machine Learning, we need to identify whether or not a particular variable follows a normal distribution — many machine learning algorithms expect to receive data (a given numeric variable) that is in a normal distribution to train the model.
The binomial distribution is used to describe scenarios in which the results of a random variable can be grouped into two categories. These categories must be mutually exclusive, so there is no doubt in the classification of the result of the variable.
This distribution is widely used in quality control when sampling an infinite or huge population.
The result can only assume two values: {pregnant, not pregnant}, for example — in general, the two categories of a binomial distribution are commonly classified as Success and Failure. Therefore, the probability of success we call p and the probability of failure we call q.
The Average of a Binomial Distribution represents the long-term average of expected successes, based on the number of observations.
The Variance of a Binomial Distribution represents the variation in the number of successes (p) over a number (n) of observations.
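A minimal sketch of these quantities, using the standard formulas mean = n·p and variance = n·p·q; the values of n and p are illustrative (say, 20 inspected items with an assumed 10% defect rate):

```python
from math import comb

n, p = 20, 0.1        # e.g. 20 items inspected, assumed 10% defect rate
q = 1 - p

# P(X = k): probability of exactly k "successes" (here, defective items)
def binomial_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

mean = n * p          # long-run average number of successes
variance = n * p * q  # spread of the number of successes

print(binomial_pmf(2, n, p))  # probability of exactly 2 defective items
print(mean, variance)         # 2.0, 1.8
```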
The Poisson Distribution is used to describe scenarios where an event is likely to occur over a continuous interval; this continuous interval is usually given by time or area. For example, the number of customers served per hour.
This distribution is characterized by the single parameter called λ (lambda), representing the average rate of occurrence per measure.
One of the key points of the Binomial Distribution and Poisson Distribution is that the events are independent. Each sample of each experiment is a new set of data. Thus, the probability of success or number of occurrences remains constant.
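A short sketch of the Poisson probability mass function for the customers-per-hour example; the rate λ = 4 customers per hour is an assumed value:

```python
from math import exp, factorial

lam = 4  # assumed average of 4 customers served per hour

# P(X = k): probability of exactly k occurrences in the interval
def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

print(poisson_pmf(0, lam))                         # chance of an idle hour
print(sum(poisson_pmf(k, lam) for k in range(3)))  # chance of fewer than 3 customers
```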
The Hypergeometric Distribution is a discrete probability distribution that describes the number of successes in a sequence of n extractions from a finite population, that is, without replacement. When sampling is done without replacement, the probability of success changes during the sampling process; this violates the requirements of a Binomial Probability Distribution — and the Hypergeometric Distribution is required.
Regardless of the probability distribution we are working with, we will never have a fully accurate probability value. Therefore, it is also necessary to calculate the mean, the standard deviation, and eventually the variance to show how the probabilities of occurrence are spread within a distribution.
The Normal Distribution, also known as Gaussian, is the most used and most important probability distribution because it allows modeling many natural phenomena, studies of human behavior, industrial processes, etc.
It is a continuous distribution: it considers continuous random variables, that is, values from the set of real numbers. It is used in numerous analyses, including pre-processing data before machine learning training and the calculation of various inferential statistics.
It is useful when data tends to be close to the center of the distribution (close to the average) and when extreme values (outliers) are rare.
The Uniform Distribution is used to describe data when all values have the same chance of occurring, that is, the same probability.
Another way to say “Uniform Distribution” would be “a finite number of results with equal chances of happening.” This distribution is used when we assume equal intervals of the random variable that has the same probability.
The Exponential Distribution is used to describe data when lower values tend to dominate the distribution and very high values do not occur frequently — it is closely related to the discrete Poisson Distribution, where the random variable is defined as the number of occurrences in a given period, with an average of λ (lambda) occurrences.
In the Exponential Distribution, on the other hand, the random variable is defined as the time between two occurrences, with the average time between occurrences being 1/λ (lambda) — in practice, we first have to identify whether the data set follows a Poisson Distribution pattern. This type of distribution is widely used in quality control as a model for the distribution of times to failure of electronic components.
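Continuing with the same assumed rate as before, the sketch below shows the exponential side of the relationship: the mean waiting time 1/λ and the probability that the gap between two occurrences exceeds a given time.

```python
from math import exp

lam = 4              # same assumed rate: 4 occurrences per hour (Poisson side)
mean_wait = 1 / lam  # average time between occurrences = 0.25 hour

# P(T > t): probability of waiting more than t hours for the next occurrence
def exponential_survival(t, lam):
    return exp(-lam * t)

print(mean_wait)
print(exponential_survival(0.5, lam))  # probability the gap exceeds half an hour
```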
The t-Student Distribution is one of the main probability distributions, with numerous applications in statistical inference.
We can see in the chart that the distribution is symmetrical around the mean and in the bell format — much like the standard continuous normal distribution. However, the t-Student distribution has longer and wider tails, thus generating more extreme values than a normal distribution.
The degrees-of-freedom parameter (k) characterizes the shape of the t-Student distribution. The higher the value of k, the more the t-Student distribution approaches a standardized normal distribution.
A Statistical Distribution is a function that defines a curve, and the area under that curve determines the probability of the event correlated by it.
Normal models are defined by two parameters: a mean and a standard deviation. By convention, we indicate the parameters with Greek letters: we represent the mean of such a model with the Greek letter μ (the sample counterpart being m) and the standard deviation with σ (the sample counterpart being s).
The Continuous Normal Distribution is the most important among statistical distributions. Also known as Gaussian distribution, it is a symmetrical curve around its midpoint, thus presenting the bell shape. This normal curve represents the behavior of various processes in companies and many natural phenomena, such as height, weight, blood pressure of a group, and the time students perform a test.
There is a different Normal Model for each combination of m and s. However, we commonly need only the model with mean 0 and standard deviation 1. We call this the Standard Normal Distribution — and this summarizes much of what is done in data preprocessing before training machine learning algorithms.
We should not use a Normal Model for all datasets. If the histogram is not bell-shaped, the z-scores are not well modeled by the Normal Model. Standardization will not help, because standardization does not change the shape of the distribution. Therefore, always check the histogram of the data before using the Normal Model.
Standardizing the data does not mean that we transform the distribution into a normal one; we are only re-scaling data that already followed a Normal Model — a Normal Model can have any mean and any standard deviation, but, ideally, we work with the data in the normal distribution with mean 0 and standard deviation 1.
The Normal Distribution is so important that it is associated with one of the fundamental theorems of all statistics: the Central Limit Theorem, which states that as the sample size increases, the sampling distribution of the sample mean tends to approach a Normal Distribution more and more closely.
The Central Limit Theorem is fundamental for statistics since several common statistical procedures require that the data be approximately normal. The Central Limit Theorem allows applying these useful procedures to strongly non-normal populations.
This theorem makes it possible to measure how much the sample mean will vary, without taking another sample mean for comparison. The theorem basically says that the sample mean has an approximately Normal Distribution, regardless of the shape of the distribution of the original data.
In the Central Limit Theorem, we do not refer to the distribution of the data; we talk about the Distribution of Sample Means that, according to the Central Limit Theorem, as the sample size increases, the distribution of the sample means tends to a Normal Distribution and from this, we can make a series of inferences.
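A quick simulation illustrates the theorem: drawing many samples from a strongly non-normal population (an exponential with mean 1) and looking at the distribution of their means. As n grows, the sample means concentrate around the population mean and their spread shrinks roughly like 1/√n.

```python
import random
from statistics import mean, stdev

random.seed(42)

# Strongly non-normal population: exponential with mean 1
def draw_sample_mean(n):
    return mean(random.expovariate(1.0) for _ in range(n))

for n in (2, 30, 200):
    sample_means = [draw_sample_mean(n) for _ in range(5000)]
    # As n grows, the distribution of sample means concentrates around 1
    # and its spread shrinks roughly like 1 / sqrt(n)
    print(n, round(mean(sample_means), 3), round(stdev(sample_means), 3))
```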
Many procedures assume a Normal Distribution, which is a symmetric distribution — asymmetry indicates a deviation from this shape. For example, in a normal, symmetric data distribution, we expect 68%, 95%, and 99.7% of the values to fall within, respectively, 1, 2, and 3 standard deviations above and below the mean.
Therefore, we can find the values we need within these ranges; once we know the standard deviation, we can make a series of inferences with the normal distribution. In a symmetric curve of the data, virtually all data will lie within 3 standard deviations of the data center (the mean) — this makes it much easier to detect outliers, values beyond 3 standard deviations from the mean.
The area below the normal curve represents a 100% probability associated with the occurrence of a variable; that is, the probability of a random variable taking a value between any two points is equal to the area between these two points.
However, as there is a multitude of normal distributions, one for each mean and standard deviation, we transform the unit we are studying into a unit Z, which indicates the number of standard deviations from the mean, and from the Z score we put the variable into a standard normal distribution of mean 0 and standard deviation 1.
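A small sketch of the Z transformation and the empirical rule, reusing the school example (mean 80 kg, standard deviation 10 kg); the individual weight of 95 kg is hypothetical:

```python
from statistics import NormalDist

mu, sigma = 80, 10   # school example: mean 80 kg, standard deviation 10 kg
x = 95               # a hypothetical student's weight

# Z score: how many standard deviations x lies from the mean
z = (x - mu) / sigma
print(z)             # 1.5

# With the standard normal (mean 0, sd 1) we can read off probabilities
std = NormalDist()
print(round(std.cdf(z), 4))  # P(weight <= 95 kg) ≈ 0.9332

# Empirical rule check: ~68%, ~95%, ~99.7% within 1, 2, 3 standard deviations
for k in (1, 2, 3):
    print(k, round(std.cdf(k) - std.cdf(-k), 4))
```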