While many different types of distributions exist (e.g., normal, binomial, Poisson), when working with SEM it is generally sufficient to distinguish normal from non-normal distributions. PLS-SEM is a nonparametric statistical method. Unlike maximum likelihood (ML)–based CB-SEM, it does not require the data to be normally distributed.
Even though PLS-SEM's statistical properties provide very robust model estimations with data that have normal as well as extremely non-normal (i.e., skewed and/or kurtotic) distributional properties (Reinartz et al., 2009; Ringle, Götz, Wetzels, & Wilson, 2009), it is nevertheless worthwhile to consider the distribution when working with PLS-SEM.
Refers to the shape of the data distribution for an individual variable.
If the deviation from the normal distribution is large, the resulting statistical tests are invalid, because the F- and t-statistics assume normality (Hair et al., 2010).
Non-normality can have serious effects in small samples (n < 50), but its impact effectively diminishes when the sample size exceeds 200.
Lack of normality in variable distributions can distort the results of multivariate analysis. This problem is much less severe with PLS-SEM, but researchers should still examine PLS-SEM results carefully when distributions deviate substantially from normal.
Histogram: Compares the observed data values with a curve approximating the normal distribution.
Normal Probability Plot: Compares the cumulative distribution of the actual data values with the cumulative distribution of a normal distribution.
Skewness and Kurtosis Statistics.
Shapiro-Wilk test (sample size < 2,000).
Kolmogorov-Smirnov test (sample size > 2,000).
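The two tests above can be run with SciPy. A minimal sketch, using simulated data in place of a real survey variable (the data and seed are assumptions for illustration):

```python
# Illustrative univariate normality checks with SciPy (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=5, scale=1.5, size=300)  # assumed survey-style variable

# Shapiro-Wilk: suitable for samples below ~2,000 observations.
sw_stat, sw_p = stats.shapiro(x)

# Kolmogorov-Smirnov against a normal with the sample's own mean/SD
# (for larger samples; strictly, estimated parameters call for a
# Lilliefors correction, which this sketch omits).
ks_stat, ks_p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))

print(f"Shapiro-Wilk:       W = {sw_stat:.3f}, p = {sw_p:.3f}")
print(f"Kolmogorov-Smirnov: D = {ks_stat:.3f}, p = {ks_p:.3f}")
```

A small p-value in either test would indicate a significant departure from normality.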
The degree of symmetry in the variable distribution.
If the distribution of responses for a variable stretches toward the right or left tail of the distribution, then the distribution is characterized as skewed.
Threshold:
-2 ≤ skewness ≤ 2 (Curran et al., 1996; West et al., 1995).
Absolute skewness and/or kurtosis values of greater than 1 are indicative of nonnormal data (Hair et al., 2017).
Perfectly symmetrical distribution
The degree of peakedness/flatness in the variable distribution. It measures whether the distribution is too peaked (a very narrow distribution with most of the responses concentrated in the center).
Threshold:
-7 ≤ Kurtosis ≤ 7 (Curran et al., 1996; West et al., 1995).
Absolute skewness and/or kurtosis values of greater than 1 are indicative of nonnormal data (Hair et al., 2017).
Mesokurtic distribution: normal degree of peakedness (the normal distribution).
Leptokurtic distribution: high degree of peakedness.
Platykurtic distribution: low degree of peakedness.
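The skewness and kurtosis statistics discussed above can be computed with SciPy and checked against the stricter |value| > 1 rule (Hair et al., 2017). A sketch using simulated right-skewed data (the data are an assumption for illustration):

```python
# Sketch: skewness and excess kurtosis for one variable (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500)  # assumed right-skewed variable

skew = stats.skew(x)                   # 0 for a perfectly symmetric distribution
kurt = stats.kurtosis(x, fisher=True)  # excess kurtosis; 0 = mesokurtic

print(f"skewness = {skew:.2f}, excess kurtosis = {kurt:.2f}")
# Flag non-normality with the |value| > 1 rule of thumb (Hair et al., 2017).
print("nonnormal" if abs(skew) > 1 or abs(kurt) > 1 else "approximately normal")
```

For this exponential example, both statistics come out well above 1, so the variable would be flagged as non-normal under either threshold cited above.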
The multivariate normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions.
The expected Mardia’s skewness is 0 for a multivariate normal distribution and higher values indicate a more severe departure from normality.
According to Bentler (2005) and Byrne (2010), the critical ratio value of multivariate kurtosis should be less than 5.0 to indicate a multivariate normal distribution.
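Mardia's coefficients and the kurtosis critical ratio can be computed directly with NumPy. A rough sketch under the standard definitions, again with simulated data (the data and seed are assumptions):

```python
# Rough sketch of Mardia's multivariate skewness/kurtosis (simulated data).
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 3))  # n = 400 cases, p = 3 variables (assumed)

n, p = X.shape
D = X - X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False, bias=True))  # ML covariance
G = D @ S_inv @ D.T            # matrix of generalized (Mahalanobis) products

b1 = (G ** 3).sum() / n**2     # Mardia's skewness: 0 under normality
b2 = (np.diag(G) ** 2).mean()  # Mardia's kurtosis: p(p + 2) under normality

# Critical ratio for multivariate kurtosis; |c.r.| < 5.0 suggests a
# multivariate normal distribution under the Bentler (2005) /
# Byrne (2010) rule of thumb cited above.
cr = (b2 - p * (p + 2)) / np.sqrt(8 * p * (p + 2) / n)
print(f"Mardia skewness = {b1:.3f}, kurtosis = {b2:.3f}, c.r. = {cr:.2f}")
```

With multivariate normal input, the skewness stays near 0 and the critical ratio well inside ±5; markedly larger values would indicate a departure from multivariate normality.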
Check and remove outlier cases.
Remove non-normal item from the model.
Bootstrapping (i.e., resampling with replacement from the existing data set).
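The bootstrapping remedy in the last bullet can be sketched in a few lines: draw repeated resamples with replacement and build an empirical distribution of the statistic, so no normality assumption is needed. The data, seed, and number of resamples below are illustrative assumptions:

```python
# Minimal sketch of bootstrapping: resampling cases with replacement.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=120)  # assumed sample

boot_means = []
for _ in range(2000):
    # Draw a resample of the same size, with replacement, and record its mean.
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means.append(resample.mean())

# Percentile confidence interval from the empirical bootstrap distribution.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {data.mean():.2f}, 95% bootstrap CI = [{lo:.2f}, {hi:.2f}]")
```

PLS-SEM software applies the same resampling idea to the full estimation (e.g., to obtain standard errors for path coefficients) rather than to a single mean as in this toy example.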
An observation that is distinctly different from the others.
Outliers can result from...
Data collection or entry errors (e.g., manually coding "77" instead of "7" on a 1-to-9 Likert scale).
Exceptionally high or low values can also be part of reality (e.g., an exceptionally high income).
Combinations of variable values that are particularly rare (e.g., spending 80% of annual income on holiday trips).
The first step in dealing with outliers is to identify them. Standard statistical software packages offer a multitude of univariate, bivariate, and multivariate graphs and statistics that help identify outliers.
Examines the distribution of observations for each variable and selects as outliers those cases falling at the outer ranges (high or low) of the distribution.
Relates an individual independent variable to an individual dependent variable (e.g., in a scatterplot).
Evaluates the position of each observation compared with the center of all observations on a set of variables.
To test for multivariate outliers, Hair et al. (2010) and Byrne (2010) suggested identifying extreme scores on two or more constructs using the Mahalanobis distance (Mahalanobis D²). It evaluates the position of a particular case relative to the centroid of the remaining cases, where the centroid is defined as the point created by the means of all the variables (Tabachnick & Fidell, 2007).
As a rule of thumb, the maximum Mahalanobis distance should not exceed the critical chi-square value, with the number of predictors as the degrees of freedom. Otherwise, the data may contain multivariate outliers (Hair, Tatham, Anderson, & Black, 1998).
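The Mahalanobis D² rule of thumb above can be sketched with NumPy and SciPy. The data are simulated, with one extreme case planted on purpose (data, seed, and alpha level are illustrative assumptions):

```python
# Sketch: flagging multivariate outliers via Mahalanobis D² (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))  # 200 cases, 4 variables (assumed)
X[0] = [6, -6, 6, -6]          # plant one extreme case for illustration

mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)  # Mahalanobis D² per case

# Rule of thumb: compare D² with the critical chi-square value, using the
# number of variables as degrees of freedom (here p = 4, alpha = .001).
cutoff = stats.chi2.ppf(0.999, df=X.shape[1])
outliers = np.where(d2 > cutoff)[0]
print(f"critical chi-square = {cutoff:.2f}, flagged cases: {outliers}")
```

The planted case ends up far beyond the critical value, while ordinary cases fall well below it; flagged cases would then be examined under the guidelines that follow.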
In the second step, the researcher must decide what to do. The following guidelines apply:
If there is an explanation for exceptionally high or low values, outliers are typically retained, because they represent an element of the population. However, their impact on the analysis results should be carefully evaluated. That is, one should run the analyses with and without the outliers to ensure that a very few (extreme) observations do not influence the results substantially.
If the outliers are a result of data collection or entry errors, they are always deleted or corrected (e.g., the value of 77 on a 9-point scale).
If there is no clear explanation for the exceptional values, outliers should be retained. See Sarstedt and Mooi (2014) for more details about outliers.
Curran, P. J., West, S. G., & Finch, J. F. (1996). The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis. Psychological Methods, 1(1), 16.
Hair, J. F., Hult, G. T. M., Ringle, C. M., & Sarstedt, M. (2017). A primer on partial least squares structural equation modeling (PLS-SEM) (2nd ed.). Thousand Oaks, CA: Sage.
Sarstedt, M., & Mooi, E. (2014). The market research process. In A concise guide to market research (pp. 11–23). Berlin, Heidelberg: Springer.
West, S. G., Finch, J. F., & Curran, P. J. (1995). Structural equation models with nonnormal variables: Problems and remedies. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications (pp. 56–75). Thousand Oaks, CA: Sage.