Descriptive Statistics – Summarises and organises data using measures such as:
Measures of central tendency (mean, median, mode)
Measures of dispersion (range, variance, standard deviation)
Data visualization (graphs, charts, histograms)
Example: Reporting the average exam scores of students in a class.
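As a concrete illustration of the measures above, here is a minimal sketch using Python's standard library; the exam scores are invented for the example.

```python
# Minimal descriptive statistics with Python's standard library.
# The exam scores below are invented illustrative values.
import statistics

scores = [62, 71, 71, 80, 85, 90, 58, 77, 71, 66]

print("Mean:", statistics.mean(scores))          # central tendency
print("Median:", statistics.median(scores))      # central tendency
print("Mode:", statistics.mode(scores))          # most frequent value
print("Range:", max(scores) - min(scores))       # dispersion
print("Variance:", statistics.variance(scores))  # sample variance
print("Std dev:", statistics.stdev(scores))      # sample standard deviation
```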
Inferential Statistics – Uses sample data to make predictions or generalizations about a larger population. It includes:
Hypothesis testing (t-tests, chi-square tests)
Confidence intervals
Regression analysis
Probability distributions
Example: Using a sample of patients to infer the effectiveness of a new drug for the entire population.
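To make the inferential idea concrete, here is a hedged sketch that computes a 95% confidence interval for a population mean from a small sample, assuming SciPy is available; the sample values are invented.

```python
# Estimating a population mean from a sample with a 95% confidence interval.
# Sample values are invented for illustration.
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9])

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"Sample mean: {mean:.2f}, 95% CI: ({low:.2f}, {high:.2f})")
```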
The flow diagram below will guide you to the correct statistical test. There are a few key terms you'll need to understand first; each is described in more depth below.
Categorical vs numerical data
Paired vs unpaired data
Parametric vs non-parametric data sets
Please refer to the explanations below for more guidance.
In the world of research, data comes in two main flavors:
Numerical
Categorical
Numerical data, as the name suggests, deals with numbers. This could be anything from exam scores and reaction times to income levels and website visits. You can perform mathematical operations on numerical data, such as calculating averages or finding correlations. There are two main types:
Discrete: number of patients, visits to a hospital
Imagine a bucket full of marbles. Each marble is a separate entity, and you can only have whole numbers of them. This type of data, called discrete data, represents distinct values and often comes from counting things. It can be finite (like the number of apples in a basket) or infinite (like the number of stars in the sky). Here are some common examples of discrete data:
Counts: The number of people attending an event, the number of cars on a highway, the number of correct answers on a multiple-choice test.
Ranks: Positions in a race (1st, 2nd, 3rd). Note that ordered labels such as shirt sizes (S, M, L, XL) or letter grades on a report card are usually better treated as ordinal categorical data, described below.
Continuous: Height, weight, age
Now, imagine pouring water into a measuring cup. The amount of water can take on any value on a continuous scale, from a single drop to overflowing. This type of data, called continuous data, represents values that can fall anywhere along a spectrum. It often comes from measurements and can be expressed in fractions or decimals. Examples of continuous data include:
Measurements: Height, weight, temperature, distance traveled, time elapsed.
Ratios: Speed (distance divided by time), proportions of ingredients in a recipe, unemployment rate (percentage of people without jobs).
Categorical data, on the other hand, is all about classification. It tells you which category something belongs to, such as hair colour (blonde, brunette, etc.) or preferred brand of shoes (Nike, Adidas, etc.). Categorical data doesn't involve numbers themselves, but rather labels or groups. Understanding these differences is crucial for choosing the right tools to analyse your data and draw meaningful conclusions. There are two main types (a short code sketch follows this list):
Nominal: Eye or hair colour, Male or female (no particular order), Yes / No
Ordinal: Mild-moderate-severe, Pass-merit-distinction.
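If it helps to see the four data types side by side, here is a small sketch of how they might map onto pandas dtypes; the column names and values are hypothetical.

```python
# Mapping the four data types onto pandas dtypes (hypothetical data).
import pandas as pd

df = pd.DataFrame({
    "n_visits": [1, 3, 2],                       # numerical, discrete
    "weight_kg": [70.2, 85.5, 64.8],             # numerical, continuous
    "eye_colour": ["blue", "brown", "green"],    # categorical, nominal
    "severity": ["mild", "severe", "moderate"],  # categorical, ordinal
})

# Nominal: unordered categories
df["eye_colour"] = df["eye_colour"].astype("category")

# Ordinal: categories with a meaningful order
severity_scale = pd.CategoricalDtype(["mild", "moderate", "severe"], ordered=True)
df["severity"] = df["severity"].astype(severity_scale)

print(df.dtypes)
print(df["severity"] > "mild")  # order comparisons are valid for ordinal data
```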
Here's why we care about differences and correlations when analysing paired data:
Differences:
Imagine you want to see if a new fertiliser helps plants grow taller. You measure the height of 10 plants before and after using the fertiliser. This creates paired data (each plant's height is a "pair").
We don't care much about the absolute heights before (might depend on the plant variety). What matters is the difference in height after using the fertiliser. Did each plant grow taller compared to its own starting height?
By focusing on the differences within each pair (plant), we remove the influence of factors like initial plant size and isolate the effect of the fertiliser.
Correlations:
While differences tell us if there's a change, correlations can hint at the direction and strength of the relationship between the two measurements in your pairs.
Going back to the plant example, a positive correlation between the difference in height and the amount of fertiliser used might suggest more fertiliser leads to a bigger difference in growth (taller plants).
However, correlation doesn't necessarily mean causation (maybe healthier plants received more fertilizer). It just shows a tendency for the differences to change together.
Remember:
Paired data is all about comparing within pairs (like twins).
Differences tell you if something changed within each pair.
Correlations give you a clue about the direction and strength of that change (positive, negative, strong, weak).
By considering both differences and correlations, you can get a clearer picture of what's happening within your paired data and draw more meaningful conclusions from your research.
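To make the plant example concrete, here is a minimal sketch that looks at both the within-pair differences and their correlation with fertiliser dose, assuming SciPy is available; all values are invented.

```python
# Paired differences and their correlation with dose (invented values).
import numpy as np
from scipy import stats

before = np.array([12.1, 10.5, 14.2, 11.8, 13.0, 9.9, 12.7, 10.1, 11.3, 13.5])
after = np.array([14.0, 11.9, 16.1, 13.5, 14.2, 11.5, 14.9, 11.0, 13.2, 15.8])
fertiliser = np.array([3.0, 2.5, 4.0, 3.5, 2.0, 3.0, 4.5, 1.5, 3.5, 4.0])

# Differences: did each plant grow relative to its own starting height?
growth = after - before
print("Mean growth:", growth.mean())

# Paired t-test on the within-pair differences
t_stat, p_value = stats.ttest_rel(after, before)
print(f"Paired t-test: t={t_stat:.2f}, p={p_value:.4f}")

# Correlation: does more fertiliser go with more growth?
r, p_corr = stats.pearsonr(fertiliser, growth)
print(f"Dose vs growth: r={r:.2f}, p={p_corr:.4f}")
```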
In simple terms, if your statistical test compares the mean between two separate groups, the data is unpaired; if you compare the mean within the same group (before and after), the data is paired. For a further explanation, see below.
Paired data comes from the same group or subjects measured under two different scenarios. Think of it as comparing the same individuals before and after something (treatment, intervention, etc.). Because it's the same group, any underlying characteristics of the subjects (age, genetics, etc.) are assumed to be relatively consistent between the two scenarios. This allows the paired t-test to focus solely on the differences caused by the change in scenario, making it less reliant on the assumption of equal variance between groups.
Designs that paired tests are used for (a code sketch follows this list):
Pre-test/post-test design: e.g. measure stress levels (using a validated scale) before and after a 6-week mindfulness programme for the same group of students.
Matched-pair design: e.g. match participants in the app group with those in the standard care group based on similar age, gender, and baseline HbA1c levels, then measure HbA1c after 12 weeks.
Repeated measures on the same subject: measure fatigue levels at the start, midpoint, and end of the shift for each nurse.
Crossover trials: participants receive caffeine in one session and a placebo in another (order randomised), and reaction times are measured after each condition.
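As a sketch of the pre-test/post-test design above, assuming SciPy is available (the stress scores are invented):

```python
# Pre-test/post-test comparison on the same students (invented scores).
import numpy as np
from scipy import stats

stress_before = np.array([28, 31, 25, 34, 29, 27, 33, 30])
stress_after = np.array([24, 29, 24, 30, 25, 26, 28, 27])

# Parametric option: paired t-test
print(stats.ttest_rel(stress_before, stress_after))

# Non-parametric option (Wilcoxon signed-rank) if the differences
# are clearly not normally distributed
print(stats.wilcoxon(stress_before, stress_after))
```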
Unpaired data comes from two completely independent or unrelated groups. Imagine comparing the average height of students from two different schools with no connection between them. Since the data originates from separate groups, there's no guarantee that underlying characteristics are similar between the groups. The standard unpaired (Student's) t-test therefore assumes the variances of the two independent groups are equal to ensure a fair comparison of means; Welch's t-test relaxes this assumption.
Designs that unpaired tests are used for (a code sketch follows this list):
Comparing Two Independent Groups: Collect exam scores from a sample of male and female students.
Control vs. Intervention Groups: Randomly assign participants to a drug group or a placebo group and measure their blood pressure after 12 weeks.
Cross-Sectional Studies: Collect physical activity data (e.g., step counts) from a random sample of urban and rural residents.
Different Cohorts or Time Periods: Compare admission rates from two independent samples, one before and one after the policy implementation.
Unmatched Case-Control Studies: Collect smoking rates for cases (lung cancer patients) and controls (non-cancer patients).
Two Independent Treatments: Randomly assign patients to two groups (e.g., Program A or Program B) and measure knee function scores after the intervention.
Group Differences in Surveys: Administer a satisfaction survey to two independent groups of patients, one from private hospitals and one from public hospitals.
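Here is a hedged sketch of comparing two independent groups, assuming SciPy is available; the scores are invented.

```python
# Comparing two independent groups (invented exam scores).
import numpy as np
from scipy import stats

group_a = np.array([72, 65, 80, 58, 77, 69, 74, 61])
group_b = np.array([68, 75, 83, 70, 79, 72, 66, 81])

# Student's t-test: assumes equal variances in the two groups
print(stats.ttest_ind(group_a, group_b, equal_var=True))

# Welch's t-test: relaxes the equal-variance assumption
print(stats.ttest_ind(group_a, group_b, equal_var=False))

# Non-parametric alternative: Mann-Whitney U test
print(stats.mannwhitneyu(group_a, group_b))
```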
Here's how you can approach testing whether your data is parametric or non-parametric:
Normality Tests:
The most common way to assess if your data is parametric is to perform a normality test. These tests evaluate how closely your data resembles a normal distribution (bell-shaped curve). There are several options, each with its own advantages and limitations:
Shapiro-Wilk Test: This is a widely used general-purpose test for normality, and is particularly powerful for small to moderate sample sizes.
Kolmogorov-Smirnov Test: This test compares the empirical cumulative distribution function (CDF) of your data to the CDF of a normal distribution. It is most sensitive to deviations near the centre of the distribution and less so in the tails.
Anderson-Darling Test: This test is similar to the Kolmogorov-Smirnov test but gives more weight to the tails of the distribution, so it tends to be more sensitive to non-normality there.
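The following sketch runs all three tests on one simulated sample, assuming SciPy is available; note that the Anderson-Darling function reports critical values rather than a p-value.

```python
# Running the three normality tests on a simulated sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=100)

stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk: W={stat:.3f}, p={p:.3f}")

# Note: estimating the mean and SD from the same data makes the standard
# KS p-value conservative (the Lilliefors correction addresses this).
stat, p = stats.kstest(data, "norm", args=(data.mean(), data.std(ddof=1)))
print(f"Kolmogorov-Smirnov: D={stat:.3f}, p={p:.3f}")

result = stats.anderson(data, dist="norm")
print(f"Anderson-Darling: A2={result.statistic:.3f}, "
      f"5% critical value={result.critical_values[2]:.3f}")
```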
Interpreting Normality Test Results:
These tests typically provide a p-value. If the p-value is greater than a certain threshold (often 0.05), we fail to reject the null hypothesis that the data is normally distributed. This suggests the data might be suitable for parametric tests. However, keep in mind:
Normality tests are not perfect. They are sensitive to sample size: small samples may fail to detect genuine non-normality, while very large samples can flag trivial deviations as statistically significant.
Even if the p-value suggests normality, it's always a good idea to visually inspect your data for normality using techniques like histograms and Q-Q plots.
Visual Inspection:
Histograms: Create a histogram of your data. A symmetrical, bell-shaped histogram suggests normality. Skewness (leaning to one side) or multiple peaks indicate non-normality.
Q-Q Plots (Quantile-Quantile Plots): These plots compare the quantiles of your data to the quantiles of a normal distribution. If the points fall roughly along a straight line, the data might be normal. Deviations from the line suggest non-normality.
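A minimal sketch of both visual checks, assuming matplotlib and SciPy are available; the data are simulated.

```python
# Histogram and Q-Q plot for a visual normality check (simulated data).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: look for a roughly symmetrical, bell-shaped outline
ax1.hist(data, bins=20, edgecolor="black")
ax1.set_title("Histogram")

# Q-Q plot: points should fall close to the straight reference line
stats.probplot(data, dist="norm", plot=ax2)
ax2.set_title("Q-Q plot")

plt.tight_layout()
plt.show()
```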
Conclusion:
By combining normality tests with visual inspection, you can get a good sense of whether your data is likely parametric or non-parametric. If the data appears significantly non-normal, consider using non-parametric tests even if a normality test doesn't definitively reject normality (due to limitations of the test itself).