Random variables can be either discrete or continuous. Discrete variables and their distributions were explained in the previous CLO. Recall that a discrete variable cannot assume all values between any two given values of the variable, whereas a continuous variable can. With this, a normal distribution can be used to describe a variety of continuous variables.
Inferences about a population can be made through estimation, the process of estimating the value of a parameter from information obtained from a sample. Under this CLO, topics such as the normal distribution, confidence intervals and sample size, hypothesis testing, and correlation between two variables will be discussed.
A normal distribution can be used to describe a variety of variables, such as heights, weights, and temperatures. A normal distribution is bell-shaped, unimodal, symmetric, and continuous; its mean, median, and mode are equal. Since each variable has its own distribution with mean μ and standard deviation σ, mathematicians use the standard normal distribution, which has a mean of 0 and a standard deviation of 1. Other approximately normally distributed variables can be transformed to the standard normal distribution with the formula z = (X − μ)/σ.
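As a quick illustration of this standardization, the short Python sketch below uses made-up values (a hypothetical height variable with μ = 68 and σ = 3) to convert an X value into a z score and find the area under the standard normal curve below it.

```python
from statistics import NormalDist

# Hypothetical example: heights with mean 68 in. and standard deviation 3 in.
mu, sigma = 68, 3
x = 74

z = (x - mu) / sigma              # standard score: z = (X - mu) / sigma
p_below = NormalDist().cdf(z)     # area under the standard normal curve below z

print(f"z = {z:.2f}, P(X < {x}) = {p_below:.4f}")   # z = 2.00, area about 0.9772
```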
The graph of the associated probability density function is bell-shaped with its peak at the mean, which is why it is called the Gaussian function or bell curve. The curve was developed in 1733 by Abraham de Moivre, who used it as an approximation to the binomial distribution.
When the data values are evenly distributed about the mean, a distribution is said to be a symmetric distribution.
When the majority of the data values fall to the right of the mean, the distribution is said to be a negatively or left-skewed distribution. The mean is to the left of the median, and the mean and the median are to the left of the mode.
When the majority of the data values fall to the left of the mean, a distribution is said to be a positively or right-skewed distribution. The mean falls to the right of the median, and both the mean and the median fall to the right of the mode.
The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. The z value is actually the number of standard deviations that a particular X value is away from the mean. All normally distributed variables can be transformed into the standard normal variable by using the formula for the standard score shown in the left figure.
One way to determine normality is to draw a histogram and check its shape.
Skewness can be tested by applying Pearson's index of skewness (PI), computed as PI = 3(X̄ − median)/s. If PI ≥ +1 or PI ≤ −1, the data are significantly skewed. The formula is also shown in the right figure.
Take note that the data should also be checked for outliers.
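A minimal sketch of this normality check, using made-up sample data, is shown below; it applies Pearson's index PI = 3(X̄ − median)/s and flags the data as significantly skewed when PI falls at or beyond ±1.

```python
from statistics import mean, median, stdev

# Hypothetical sample data with a long right tail
data = [2, 3, 3, 4, 4, 5, 5, 6, 20, 20]

pi = 3 * (mean(data) - median(data)) / stdev(data)   # Pearson's index of skewness
print(f"PI = {pi:.2f}")                              # about 1.18 for this data
if pi >= 1 or pi <= -1:
    print("Data are significantly skewed.")
else:
    print("No significant skewness detected.")
```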
As the sample size n increases without limit, the shape of the distribution of the sample means taken with replacement from a population with mean μ and standard deviation σ will approach a normal distribution. As previously shown, this distribution will have a mean μ and a standard deviation σ/√n.
It’s important to remember two things when you use the central limit theorem:
When the original variable is normally distributed, the distribution of the sample means will be normally distributed, for any sample size n.
When the distribution of the original variable might not be normal, a sample size of 30 or more is needed to use a normal distribution to approximate the distribution of the sample means. The larger the sample, the better the approximation will be.
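The behavior described by the central limit theorem can be illustrated with a small simulation; the sketch below uses a deliberately non-normal, made-up population (exponential with mean 10) and an arbitrary sample size of 36 to show that the sample means cluster around μ with a spread close to σ/√n.

```python
import random
from statistics import mean, stdev

random.seed(1)

# Hypothetical, clearly non-normal population: exponential with mean (and stdev) 10
population_mean = 10
n = 36          # sample size
trials = 5000   # number of repeated samples

sample_means = [
    mean(random.expovariate(1 / population_mean) for _ in range(n))
    for _ in range(trials)
]

# The mean of the sample means should be near mu; their spread near sigma / sqrt(n)
print(f"mean of sample means  = {mean(sample_means):.2f}  (population mean = 10)")
print(f"stdev of sample means = {stdev(sample_means):.2f}  (sigma/sqrt(n) = {10 / n ** 0.5:.2f})")
```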
A normal distribution is often used to solve problems that involve the binomial distribution since when n is large (say, 100), the calculations are too difficult to do by hand using the binomial distribution.
Also, recall that a binomial distribution is determined by n (the number of trials) and p (the probability of a success). When p is approximately 0.5, and as n increases, the shape of the binomial distribution becomes similar to that of a normal distribution. The larger n is and the closer p is to 0.5, the more similar the shape of the binomial distribution is to that of a normal distribution.
A correction for continuity is a correction employed when a continuous distribution is used to approximate a discrete distribution; it is applied by adding or subtracting 0.5 from the value of X before the normal probability is computed.
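As a brief sketch with hypothetical numbers, estimating P(X ≤ 60) for a binomial with n = 100 and p = 0.5 uses 60.5 after the continuity correction, together with a normal curve whose mean is np and whose standard deviation is √(npq).

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical binomial experiment: n = 100 trials, p = 0.5
n, p = 100, 0.5
mu = n * p                     # mean of the binomial
sigma = sqrt(n * p * (1 - p))  # standard deviation of the binomial

# P(X <= 60) with the continuity correction: use 60.5 on the normal curve
x_corrected = 60 + 0.5
prob = NormalDist(mu, sigma).cdf(x_corrected)
print(f"P(X <= 60) is approximately {prob:.4f}")   # about 0.9821
```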
Estimation is one aspect of inferential statistics; it is the process of estimating the value of a parameter from information drawn from a sample. Estimation is used to determine the approximate value of a population parameter on the basis of a sample statistic. The sample statistic is referred to as the estimator of the population parameter, and the computed value of the sample statistic is called the estimate.
An estimate may be a point estimate or an interval estimate. A point estimate is the value of a sample statistic that is used to estimate a population parameter; a margin of error is usually reported with it to indicate how far the estimate may be from the parameter. An interval estimate of a parameter is an interval or range of values used to estimate the parameter. This estimate may or may not contain the value of the parameter being estimated.
Each interval is constructed with regard to a given confidence level and is called a confidence interval. The confidence level associated with a confidence interval states how confident we can be that the interval contains the true population parameter.
The formula for the confidence interval of the mean for a specific α is shown in the left figure.
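A minimal sketch of this confidence interval, with hypothetical sample values, uses the standard formula X̄ ± zα/2(σ/√n):

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical sample: mean 23.2, known population sigma 5, n = 50, 95% confidence
x_bar, sigma, n, conf = 23.2, 5, 50, 0.95

z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_(alpha/2), about 1.96 for 95%
e = z * sigma / sqrt(n)                        # maximum error of estimate
print(f"{x_bar - e:.2f} < mu < {x_bar + e:.2f}")   # about 21.81 < mu < 24.59
```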
Sample size determination is related to estimation. It requires the maximum error of estimate, the population standard deviation, and the degree of confidence.
The formula for sample size, derived from the maximum error of estimate formula, is shown in the right figure, where E is the maximum error of estimate. If necessary, round the answer up to obtain a whole number; that is, if there is any fraction or decimal portion in the answer, use the next whole number for the sample size n.
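A short sketch of this sample-size calculation with hypothetical values uses n = (zα/2 · σ / E)² and rounds the result up:

```python
from math import ceil
from statistics import NormalDist

# Hypothetical: sigma = 3, desired maximum error E = 0.5, 99% confidence
sigma, E, conf = 3, 0.5, 0.99

z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_(alpha/2), about 2.576 for 99%
n = ceil((z * sigma / E) ** 2)                 # always round up to the next whole number
print(f"required sample size n = {n}")         # about 239
```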
In many cases, the population standard deviation is unknown and the sample size is less than 30 (n < 30). In this case, as long as the variable is normally or approximately normally distributed, the sample standard deviation can be used in place of the population standard deviation, and the t distribution is the appropriate distribution for the confidence interval.
The formula for determining a confidence interval about the mean by using the t distribution is shown in the left figure.
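A minimal sketch of the t interval, with hypothetical sample values and assuming SciPy is available for the t quantile, uses X̄ ± tα/2(s/√n) with n − 1 degrees of freedom:

```python
from math import sqrt
from scipy.stats import t   # assumes SciPy is available for the t quantile

# Hypothetical sample: mean 0.32, sample stdev 0.08, n = 10, 95% confidence
x_bar, s, n, conf = 0.32, 0.08, 10, 0.95

t_crit = t.ppf(1 - (1 - conf) / 2, df=n - 1)   # t_(alpha/2) with n - 1 degrees of freedom
e = t_crit * s / sqrt(n)                       # maximum error of estimate
print(f"{x_bar - e:.3f} < mu < {x_bar + e:.3f}")
```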
Statistical distributions use the concept of degrees of freedom, and the formulas for finding the degrees of freedom vary for different statistical tests. The degrees of freedom are the number of values that are free to vary after a sample statistic has been computed.
Proportions can be obtained from populations or from samples. The symbols and equations shown in the right figure are used to determine the confidence intervals and sample sizes for proportions.
The sample size formula can be found by solving for n in the maximum error of estimate formula E = zα/2√(p̂q̂/n).
There are two situations to consider. First, if some approximation of p̂ is known (e.g., from a previous study), that value can be used in the formula. Second, if no approximation of p̂ is known, you should use p̂ = 0.5. This value will give a sample size sufficiently large to guarantee an accurate prediction, given the confidence interval and the error of estimate. The reason is that when p̂ and q̂ are each 0.5, the product p̂q̂ is at its maximum.
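A small sketch of this calculation with hypothetical numbers uses n = p̂q̂(zα/2/E)² and falls back to p̂ = 0.5 when no prior estimate exists:

```python
from math import ceil
from statistics import NormalDist

# Hypothetical: 95% confidence, maximum error E = 0.03, no prior estimate of p-hat
conf, E = 0.95, 0.03
p_hat = 0.5          # use 0.5 when no approximation of p-hat is known
q_hat = 1 - p_hat

z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # about 1.96 for 95%
n = ceil(p_hat * q_hat * (z / E) ** 2)         # round up to a whole number
print(f"required sample size n = {n}")         # about 1068
```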
Hypothesis testing is a statistical method used to make statistical decisions from experimental data. These decisions involve evaluating assumptions made about a population parameter.
A statistical hypothesis is a conjecture about a population. There are two types of statistical hypotheses: the null and the alternative hypotheses. The null hypothesis states that there is no difference, and the alternative hypothesis specifies a difference.
The three methods used to test hypotheses are (a) the traditional method, (b) the P-value method, and (c) the confidence interval method.
A statistical test uses the data obtained from a sample to make a decision about whether the null hypothesis should be rejected. The numerical value obtained from a statistical test is called the test value.
In the hypothesis-testing situation, there are four possible outcomes. In reality, the null hypothesis may or may not be true, and a decision is made to reject or not reject it on the basis of the data obtained from a sample. The four possible outcomes are shown in the right figure.
Note that a type I error occurs if you reject the null hypothesis when it is true. On the other hand, a type II error occurs if you do not reject the null hypothesis when it is false.
The level of significance is the maximum probability of committing a type I error. This probability is symbolized by α (Greek letter alpha). That is, P(type I error) = α.
The critical value separates the critical region from the noncritical region. The symbol for critical value is C.V. The critical or rejection region is the range of values of the test value that indicates that there is a significant difference and that the null hypothesis should be rejected. The noncritical or nonrejection region is the range of values of the test value that indicates that the difference was probably due to chance and that the null hypothesis should not be rejected.
A one-tailed test indicates that the null hypothesis should be rejected when the test value is in the critical region on one side of the mean. A one-tailed test is either a right-tailed test or a left-tailed test, depending on the direction of the inequality of the alternative hypothesis.
In a two-tailed test, the null hypothesis should be rejected when the test value is in either of the two critical regions.
Shown in the left figure are the critical and noncritical regions for one-tailed and two-tailed tests.
The z test is a statistical test for the mean of a population. It can be used when n≥30, or when the population is normally distributed and σ is known.
Listed below are the steps for a one-sample z test; a minimal worked sketch follows the steps.
Step 1. State the hypotheses and identify the claim.
Step 2. Find the critical value(s).
Step 3. Compute the test value.
Step 4. Make the decision to reject or not reject the null hypothesis.
Step 5. Summarize the results.
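As noted above, here is a minimal worked sketch of these steps in Python; all numbers are hypothetical, testing H0: μ = 100 against H1: μ > 100 at α = 0.05.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical right-tailed test: H0: mu = 100 vs. H1: mu > 100 (Step 1)
mu0, x_bar, sigma, n, alpha = 100, 103.5, 12, 40, 0.05

cv = NormalDist().inv_cdf(1 - alpha)           # Step 2: critical value, about 1.645
z = (x_bar - mu0) / (sigma / sqrt(n))          # Step 3: test value
decision = "reject H0" if z > cv else "do not reject H0"   # Step 4: decision
print(f"z = {z:.2f}, C.V. = {cv:.3f} -> {decision}")        # Step 5: summarize
```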
The t test is a statistical test for the mean of a population and is used when the population is normally or approximately normally distributed and σ is unknown.
The formula for the t test is shown in the left figure.
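A brief sketch of the t test with made-up numbers, using t = (X̄ − μ)/(s/√n) with n − 1 degrees of freedom and assuming SciPy is available for the critical value:

```python
from math import sqrt
from scipy.stats import t   # assumes SciPy is available

# Hypothetical two-tailed test: H0: mu = 16.3 vs. H1: mu != 16.3, alpha = 0.05
mu0, x_bar, s, n, alpha = 16.3, 17.1, 1.8, 15, 0.05

t_val = (x_bar - mu0) / (s / sqrt(n))     # test value
cv = t.ppf(1 - alpha / 2, df=n - 1)       # two-tailed critical value
decision = "reject H0" if abs(t_val) > cv else "do not reject H0"
print(f"t = {t_val:.2f}, C.V. = ±{cv:.3f} -> {decision}")
```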
A hypothesis test involving a population proportion can be considered as a binomial experiment when there are only two outcomes and the probability of a success does not change from trial to trial.
The formula for the z test for proportions is shown in the right figure.
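A minimal sketch with hypothetical counts, using z = (p̂ − p)/√(pq/n):

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical two-tailed test: claimed proportion p = 0.40, sample of n = 200 with 94 successes
p0, n, successes, alpha = 0.40, 200, 94, 0.05

p_hat = successes / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # test value
cv = NormalDist().inv_cdf(1 - alpha / 2)     # two-tailed critical value, about 1.96
decision = "reject H0" if abs(z) > cv else "do not reject H0"
print(f"z = {z:.2f}, C.V. = ±{cv:.2f} -> {decision}")
```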
Correlation is a statistical method used to determine whether a relationship between variables exists. Regression is a statistical method used to describe the nature of the relationship between variables, that is, positive or negative, linear or nonlinear.
In a simple relationship, there are two variables—an independent variable, also called an explanatory variable or a predictor variable, and a dependent variable, also called a response variable. A simple relationship analysis is called simple regression, and there is one independent variable that is used to predict the dependent variable.
In a multiple relationship, called multiple regression, two or more independent variables are used to predict one dependent variable.
Simple relationships can also be positive or negative. A positive relationship exists when both variables increase or decrease at the same time. For instance, a person’s height and weight are related; and the relationship is positive, since the taller a person is, generally, the more the person weighs. In a negative relationship, as one variable increases, the other variable decreases, and vice versa.
The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two variables. The symbol for the sample correlation coefficient is r. The symbol for the population correlation coefficient is ρ (Greek letter rho).
Shown in the right figure is the formula used to obtain the correlation coefficient.
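A small sketch computing r for made-up paired data, using the computational formula r = [nΣxy − (Σx)(Σy)] / √([nΣx² − (Σx)²][nΣy² − (Σy)²]):

```python
from math import sqrt

# Hypothetical paired data (x = hours studied, y = exam score)
x = [1, 2, 3, 4, 5, 6]
y = [52, 55, 61, 64, 70, 74]

n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx, syy = sum(xi ** 2 for xi in x), sum(yi ** 2 for yi in y)

r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(f"r = {r:.3f}")   # close to +1: strong positive linear relationship
```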
If the value of the correlation coefficient is significant, the next step is to determine the equation of the regression line, which is the data’s line of best fit.
Shown in the left figure is the formula used to compute the equation of the regression line.
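A follow-on sketch, reusing the same made-up data, computes the regression line y' = a + bx with slope b = [nΣxy − (Σx)(Σy)]/[nΣx² − (Σx)²] and intercept a = ȳ − b·x̄:

```python
# Hypothetical data reused from the correlation sketch above
x = [1, 2, 3, 4, 5, 6]
y = [52, 55, 61, 64, 70, 74]

n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi ** 2 for xi in x)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope
a = sy / n - b * sx / n                         # intercept: a = y-bar - b * x-bar
print(f"y' = {a:.2f} + {b:.2f}x")               # line of best fit
```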
CLO2 tackled the normal distribution, confidence intervals and sample size, hypothesis testing, and correlation and regression between two or more variables. After thoroughly reading the modules and answering the activities given by our instructor, I gained many takeaways from this module. I learned how to determine intervals for specific tests using the normal distribution, since it can be used to describe many variables whose deviations from the mean are small. I also realized how important estimation is in making estimates of parameters. In addition, I learned that claims about a population under study can be evaluated through hypothesis testing. This helped me learn how to state the particular hypotheses to be investigated when conducting a research study, how to use a significance level, how to select an appropriate number of samples from a population, and how to perform the calculations for the statistical test. Lastly, I grasped how to determine and describe the relationship between variables under study through correlation and regression. Understanding the concepts and formulas used in correlation and regression helped me determine whether two or more variables are related and how strong the relationship between them is.