Most of our modules are about tests for statistical significance for various types of research questions and designs. However, obtaining a statistically-significant result is just the outcome of a hypothesis test, and should not be the major goal of your statistical analysis or even your research study.
Still, we rely heavily on the outcome of hypothesis testing in the conclusions of our analysis. So, it is important to understand the mechanisms underlying the hypothesis-testing procedure.
Conceptually, what are the factors that influence whether your study turns out to be statistically significant or not?
From a practical perspective, you may want to know how large a sample you need for your research question, and how to estimate the minimum sample size for your study.
Hence, we need power analysis.
In hypothesis testing,
alpha (α) – the probability of committing a Type I error in hypothesis testing (i.e., rejecting the null hypothesis when it's true).
beta (β) – the probability of committing a Type II error in hypothesis testing (i.e., failing to reject the null hypothesis when it's false).
If H1 is true, we would want to be able to reject H0. Statistical power refers to how likely this is to happen, i.e., how likely an actual effect is to be detected by the hypothesis test.
statistical power (or simply "power" for short) – the probability of rejecting the null hypothesis when it is false; in other words, power is the probability of not making a Type II error. Therefore,
Power = 1 – β
Conceptually, power is the probability for the statistical test to do correctly produce a "positive" result, i.e., to say "yes, there is an effect" when the effect is present.
Power (the purple area in the figure below) is the likelihood that the statistical test is correctly rejecting the null hypothesis (the green area). It is complementary to the probability of comitting a Type II error or β (the yellow area).
Because it is a probability, statistical power ranges from 0 to 1. If power is closer to 1 (high statistical power), the probability of making a Type II error gets closer to 0. It means that it is very likely that the hypothesis test detects an effect when it is present, i.e., rejects a false null hypothesis when the alternative hypothesis is true. In contrast, low statistical power implies the hypothesis test has a low probability of detecting the effect even if it is present.
Researchers set the statistical power depending on their study. It is commonly set at .80 (80%). But some researchers set a higher power of .90 to have a better chance of detecting a true effect (a 90% chance of the test producing a significant result if the effect is present).
Power analysis involves four elements: (1) Statistical Power, (2) Sample Size (N), (3) Effect Size, and (4) Significance Level (alpha, usually set at .05).
Statistical power is a function of sample size, alpha, and effect size. There are multiple reasons to carry out a power analysis (each is sketched in code after the list below):
Sample Size Determination: calculate the minimum N when the alpha, effect size, and power are decided. We usually do this before conducting the study (a priori power analysis).
Power Estimation: calculate the power when the alpha, sample size, and effect size are provided. We usually do this after the study has been conducted (post hoc power analysis).
Effect Size Calculation: calculate the minimum detectable effect size when the alpha, sample size, and power are provided (sensitivity power analysis).
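As a minimal sketch of these three uses, the statsmodels Python library (assuming it is installed; the effect size, alpha, power, and group size below are hypothetical values) can solve for whichever element is left unspecified in an independent-samples t-test design:

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # (1) Sample size determination: solve for n per group, given alpha, effect size, and power
    n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)

    # (2) Power estimation: solve for power, given alpha, n per group, and effect size
    achieved_power = analysis.solve_power(effect_size=0.5, nobs1=50, alpha=0.05)

    # (3) Effect size calculation: solve for the minimum detectable effect size,
    #     given alpha, n per group, and power
    min_effect = analysis.solve_power(nobs1=50, alpha=0.05, power=0.80)

    print(n_per_group, achieved_power, min_effect)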
Statistical significance criterion - alpha (α)
When α is set at 0.05, the null hypothesis is rejected only when the probability of obtaining the observed result (or a more extreme one), assuming the null hypothesis is true, is less than 0.05. In other words, the evidence for the alternative hypothesis (the observed effect) has to pass a fairly strict criterion before it is accepted, which limits the power.
If you want to increase the power of the test, you may adopt a less conservative α (setting α to a greater value, e.g., from 0.05 to 0.10). However, if you increase your power by using a more liberal alpha level, you are also increasing the chance of a Type I error at the same time. Setting a more conservative criterion (e.g., from α = 0.05 to 0.01) will reduce the power but, at the same time, reduce the Type I error rate and increase the robustness of your test results.
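To make this trade-off concrete, here is a minimal sketch (assuming statsmodels is installed, and using a hypothetical design with 64 participants per group and a true effect size of 0.5) showing how power changes with the significance criterion:

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    for alpha in (0.01, 0.05, 0.10):
        power = analysis.power(effect_size=0.5, nobs1=64, alpha=alpha, alternative='two-sided')
        print(f"alpha = {alpha:.2f}: power = {power:.2f}")
    # A stricter alpha (0.01) gives lower power; a more liberal alpha (0.10) gives higher power.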
Sample size
The amount of sampling error in the test results depends on the sample size. Increasing the sample size can reduce sampling error and increase the probability of detecting a true effect.
Increasing the sample size also increases the statistical power of a test.
The effect size of interest
Effect size refers to the strength of the effect you are interested in. It can be small or large.
A greater effect size will increase the power; in other words, a stronger effect is more likely to be correctly detected.
Speaking of effect sizes, they can be grouped into two families: d family (strength of difference) and r family (strength of association)
The d family effect size is about the strength of a difference,
ranging from 0 to infinity in absolute value (|0.2| = small, |0.5| = medium, |0.8| = large, by Cohen's 1988 interpretation),
manifested as the standardized mean difference between two groups,
calculated as the mean difference between the two groups divided by the standard deviation (SD).
It is commonly known as Cohen's d, that is, the standardized mean difference of an effect.
If the sample size is small (N < 50), a bias-corrected effect size, Hedges' g (obtained by multiplying Cohen's d by a correction factor), is preferable.
Depending on the test you conducted, the mean difference and SD will be computed slightly differently, but the rationale is:
mean difference / SD
Different tests will have their own mean difference and standardizer (denominator). Using an independent t-test with an experimental group and a control group as an example, the formula for Cohen's d is:
Cohen's d = [(mean of experimental group) - (mean of control group)] / pooled SD
where the pooled SD is:
pooled SD = the square root of [(SD² of experimental group + SD² of control group) / 2]
Some may instead use the sample-size-weighted pooled SD.
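As a minimal sketch of this calculation (the group means, SDs, and sample sizes below are made-up values, and the Hedges' g correction factor is a commonly used approximation):

    import math

    def cohens_d(mean_exp, mean_ctrl, sd_exp, sd_ctrl):
        # Pooled SD as defined above: the square root of the average of the two variances
        pooled_sd = math.sqrt((sd_exp ** 2 + sd_ctrl ** 2) / 2)
        return (mean_exp - mean_ctrl) / pooled_sd

    def hedges_g(d, n_exp, n_ctrl):
        # Small-sample bias correction (an approximate correction factor)
        correction = 1 - 3 / (4 * (n_exp + n_ctrl) - 9)
        return d * correction

    d = cohens_d(mean_exp=75, mean_ctrl=70, sd_exp=10, sd_ctrl=10)  # d = 0.5
    g = hedges_g(d, n_exp=20, n_ctrl=20)                            # slightly smaller than d
    print(d, g)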
The r family is about the strength of association. The most commonly known effect size in this family is the Pearson correlation coefficient, r.
r can range from -1 (perfectly negative relationship) to 0 (no relationship) to 1 (perfectly positive relationship), yet -1 or 1 rarely occur in the real world.
r can also be used to describe the proportion of variance explained. For example, a correlation (r) of 0.5 indicates that 25% (r²) of the variance is explained by the predictor. In multiple regression analysis, R² is the effect size indicating the proportion of variance in the dependent variable explained by a group of predictors.
Other effect sizes in the r family include eta-squared (η²) and omega-squared (ω²), but we won't be discussing them in depth.
To determine your desired effect size or interpret the obtained results, you may refer to the following table:
Notice that Cohen's d and r (and other effect sizes in certain contexts) are interchangeable: you can calculate Cohen's d from r, and vice versa.
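A minimal sketch of the conversion, using the common formulas that assume two groups of equal size:

    import math

    def d_to_r(d):
        # Convert Cohen's d to r (assumes two groups of equal size)
        return d / math.sqrt(d ** 2 + 4)

    def r_to_d(r):
        # Convert r to Cohen's d
        return 2 * r / math.sqrt(1 - r ** 2)

    print(d_to_r(0.5))  # roughly 0.24
    print(r_to_d(0.3))  # roughly 0.63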
In jamovi, you can perform a power analysis with the module jpower. jpower (Power Analysis for Common Research Designs) is a module that computes power for three types of t-tests (one-sample, dependent-samples, and independent-samples t-tests). To use jpower, you will have to install it in jamovi.
To install jpower, please follow the steps:
Install the latest version of jamovi and run it.
Go to the Modules menu in the upper right corner, and open the jamovi library.
Go to the Available tab and look for jpower.
Install jpower.
You can now click the jpower icon and select an analysis for your power analysis.
Note: the minimally-interesting effect size (delta, δ) means the minimum standardized mean difference you would like to detect.
Recently, there has been a claim that girls are more likely to get better grades than boys in school. You would like to test whether this is actually true and are planning to conduct a study on the gender difference in academic performance. But you are not so sure about the sample size. So you conducted a power analysis to determine the sample size for your study using an independent t-test (female group and male group).
You decided to go with a medium effect size of 0.5, and a statistical power of 80% (0.8), which is within the acceptable range.
The A Priori Power Analysis
The minimum sample size N necessary to detect an effect size of 0.5 with a power of 80% for the study is 64 participants in each group (a total of 128 participants), assuming a two-tailed criterion α at 0.05.
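The same calculation can be reproduced outside jamovi; for instance, a minimal sketch assuming the statsmodels Python library is installed:

    from math import ceil
    from statsmodels.stats.power import TTestIndPower

    # A priori power analysis for an independent-samples t-test
    n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                              power=0.80, alternative='two-sided')
    print(ceil(n_per_group))  # 64 participants per group, 128 in total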
Power by Effect Size
The table illustrates how the power to detect an effect changes with true effects of increasing size, and what those changes mean.
It shows how likely we are to correctly conclude that the alternative hypothesis (presence of an effect) is true when the effect size is large enough to be considered significant supporting evidence.
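A similar table can be sketched in Python (again assuming statsmodels is installed, and keeping the 64 participants per group from the a priori analysis):

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    for effect_size in (0.2, 0.35, 0.5, 0.65, 0.8):
        power = analysis.power(effect_size=effect_size, nobs1=64,
                               alpha=0.05, alternative='two-sided')
        print(f"true effect size {effect_size}: power = {power:.2f}")
    # Larger true effects are detected with higher probability at a fixed sample size.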
Power Contour
The power contour plot shows how the sensitivity of the test changes with the hypothetical effect size and the sample sizes in the design. Our hypothetical minimally interesting effect size is 0.5, resulting in a sample size estimation of 64 participants in group 1.
When we increase our hypothetical effect size while keeping the desired power, the estimated sample size decreases accordingly. The other way round, as we increase the sample sizes, smaller effect sizes become reliably detectable.
Power Curve by N
The power curve shows how the sensitivity of the test and design increases for larger effect sizes.
To achieve the desired power of 0.8 for our design and test to detect an effect with a minimum effect size of 0.5, we would need a sample size of at least 64 in each group.
Similarly, power increases with sample size. When we increase our desired power, the minimum sample size we need in each group will increase accordingly.
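The shape of this curve can be sketched numerically as well (assuming statsmodels is installed, with the minimally-interesting effect size fixed at 0.5):

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    for n in (20, 40, 64, 80, 100):
        power = analysis.power(effect_size=0.5, nobs1=n,
                               alpha=0.05, alternative='two-sided')
        print(f"n per group {n}: power = {power:.2f}")
    # Power climbs toward 1 as the sample size per group grows.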
Power Demonstration
The power demonstration figure shows two sampling distributions assuming a sample size of 64 in each group:
sampling distribution 1: the distribution of the estimated effect size when Cohen's d = δ = 0 (purple), that is, the null hypothesis (no effect)
sampling distribution 2: the distribution when δ = 0.5 (green), that is, the alternative hypothesis (presence of an effect)
with vertical dashed lines representing the two-tailed statistical significance criterion, α = 0.05
When the observed effect size falls far enough from 0 to be more extreme than the α = 0.05 criterion, we correctly reject the null hypothesis. The proportion of the green distribution lying beyond the dashed lines is the design's power for detecting effects of |δ| ≥ 0.5, which is 0.8.
Otherwise, we fail to reject the null hypothesis.
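The same idea can be demonstrated by simulation; the sketch below (assuming numpy and scipy are installed, and using made-up normally distributed grades with SD = 1) draws many samples of 64 per group and counts how often the t-test rejects the null hypothesis:

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    n_per_group, n_sims = 64, 10_000

    def rejection_rate(true_d):
        # Proportion of simulated studies in which the two-tailed t-test is significant
        rejections = 0
        for _ in range(n_sims):
            group1 = rng.normal(loc=true_d, scale=1.0, size=n_per_group)  # e.g., girls
            group2 = rng.normal(loc=0.0, scale=1.0, size=n_per_group)     # e.g., boys
            if ttest_ind(group1, group2).pvalue < 0.05:
                rejections += 1
        return rejections / n_sims

    print(rejection_rate(true_d=0.0))  # close to 0.05: the Type I error rate under the null
    print(rejection_rate(true_d=0.5))  # close to 0.80: the power when the true effect is 0.5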
You may want to conduct research with other experimental designs, for example, factorial ANOVA, correlation, etc. However, power analysis within jamovi is currently available for t-tests only. For studies with various designs, you may use another tool for power analysis: G*Power.
G*Power is an alternative tool to compute statistical power analyses for many different t-tests, F tests, χ² tests, z tests, and some exact tests. If you are also interested in knowing more about other power analyses and G*Power, you can check out the G*Power website:
https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower.html
Now, if you think you're ready for the exercise, you can check your email for the link.
Remember to submit your answers before the deadline in order to earn the credits!