Statistical Methods and Data Analysis

Anonymous help from a fellow scientist / practitioner...




Baron, R.M. & Kenny, D.A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic and statistical considerations. Journal of Personality and Social Psychology, 51, 1173-1182.

Overview: distinguishes the properties of mediator vs. moderator variables, and offers analytic procedures for testing mediators/moderators (separately and when both are of interest).


·         Moderator function – partitions a focal IV into subgroups that establish its domains of maximal effectiveness in regard to a given DV

·         Mediator function – represents the generative mechanism through which the focal IV is able to influence the DV of interest

·         These “third variables” are often confused!  Can’t be used interchangeably…


The Nature of Moderators

·         In general, a moderator is a qualitative (e.g., sex, race, class) or quantitative (e.g., level of reward) variable that affects the direction and/or strength of the relation b/w an IV (or predictor) and a DV (or criterion).

·         Specific to correlational analysis framework, a moderator is a 3rd variable that affects the zero-order correlation b/w 2 other variables…

o         Example: the positive relation b/w life-event change and severity of illness was considerably stronger for uncontrollable events (death of spouse) than for controllable events (divorce), i.e., a change in strength…OR, a change in direction would occur if controllable life changes reduced the likelihood of illness, turning the relation b/w life-event change and illness from positive to negative

·         In ANOVA terms, a basic moderator effect can be represented as an interaction between a focal IV and a factor that specifies the appropriate conditions for its operation

o         Example: dissonance-forced compliance; investigator’s ability to establish the effects of sufficient justification required specification of such moderators as commitment, personal responsibility, and free choice


·         The moderator model has three causal paths that feed into the outcome variable (task performance in the article’s noise example):

o         Impact of noise intensity as predictor (Path a)

o         Impact of controllability as a moderator (Path b)

o         Interaction of the two (Path c).

·         Moderator hypothesis is supported if the interaction is significant

·         In addition, it is desirable that the moderator variable be uncorrelated with both the predictor and the criterion, to provide a clearly interpretable interaction term

·         Moderators and predictors are at the same level as causal variables: moderator variables always function as IVs, whereas mediating events shift roles from effects to causes depending on the focus of the analysis (i.e., X → M OR M → Y in X → M → Y)

Testing Moderation

·         Within this framework, moderation implies that the causal relation b/w 2 variables changes as a function of the moderator variable

o         Stat analysis must measure and test the differential effect of the IV on the DV as a function of the moderator

o         Depends on level of measurement of the IV and the moderator variable

·         Case 1 – both moderator and IVs are categorical (dichotomous)

o         IV’s effect on DV varies as function of another dichotomy…

o         Analysis is 2 x 2 ANOVA and moderation is indicated by interaction

·         Case 2 – moderator is categorical, IV is continuous

o         Example: gender might moderate the effect of intentions on behaviors

o         Measure by correlating intentions with behaviors separately for each gender and then testing the difference, BUT this has 2 DEFICIENCIES (b/c correlations are influenced by changes in variance):

·         Presumes that the IV has equal variance at each level of the moderator (i.e., variance of intentions is the same for males and females); if the variances differ, then for the level of the moderator with less variance, the IV-DV correlation tends to be smaller than for the level with more variance (the source of this difference is restriction in range)

·         If the amount of measurement error in the DV varies as a function of the moderator, then the correlations b/w the IV and DV will differ spuriously

o         B/c regression coefficients are not affected by problems above, it is PREFERABLE to measure the effect of the IV on the DV by unstandardized (not betas) regression coefficients; Tests of the difference b/w regression coefficients – Cohen & Cohen…This test should be performed first, before the 2 slopes are individually tested…

·         Case 3 – moderator is continuous, IV is categorical

o         Ex: IV=rational vs. fear-arousing attitude-change message; moderator = IQ test score (fear-arousing message more effective for low-IQ; rational message more effective for high-IQ)

o         To measure moderators, need to know a priori how the effect of the IV varies as a function of the moderator.

o         3 ways the moderator can alter the effect of the IV on the DV: linear, quadratic, step (SEE Figure at right)

o         Linear hypothesis is tested by adding the product of the moderator and the dichotomous IV to the regression equation;   Y is regressed on X, Z, and XZ – moderator effects are indicated by significant effect of XZ while X and Z are controlled.

o         Quadratic moderation effect can be tested by dichotomizing the moderator at the point at which the function is presumed to accelerate…complicated! (**you might read the article if you are interested in knowing more about this).

·         Case 4 – both variables are continuous

o         If believe the moderator alters the IV-DV relation in a step function, dichotomize the moderator at the point where the step takes place then follow Case 2.

o         If one presumes the effect of X on Y varies linearly or quadratically with respect to Z, then follow the product variable approach in Case 3 (for quadratic, moderator squared has to be used)…

o         Note: measurement error in either moderator or IV complicates things; will result in low power in the interactive tests…
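A minimal sketch of the product-term approach (Y regressed on X, Z, and XZ) for a dichotomous moderator, using simulated data. Everything here is an assumption for illustration: the `ols` helper, the variable names, and the simulated effect sizes are hypothetical and not from the article. The sketch stops at point estimates; the significance test of the XZ coefficient that the notes call for is not shown.

```python
import random

def ols(columns, y):
    """Return [intercept, b1, b2, ...] from least squares of y on the columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations: (X'X | X'y).
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

rng = random.Random(0)
n = 200
x = [rng.gauss(0, 1) for _ in range(n)]   # continuous IV (e.g., intentions)
z = [i % 2 for i in range(n)]             # dichotomous moderator coded 0/1
# Simulated DV: the true slope of x is 2.0 when z = 0 and 3.5 when z = 1.
y = [1 + 2 * xi + 0.5 * zi + 1.5 * xi * zi + rng.gauss(0, 1)
     for xi, zi in zip(x, z)]

# Moderated regression: Y on X, Z, and the product XZ.
xz = [xi * zi for xi, zi in zip(x, z)]
b0, b_x, b_z, b_xz = ols([x, z, xz], y)

def slope_within(group):
    """Unstandardized slope of y on x inside one level of the moderator."""
    xs = [xi for xi, zi in zip(x, z) if zi == group]
    ys = [yi for yi, zi in zip(y, z) if zi == group]
    return ols([xs], ys)[1]

slope_z0, slope_z1 = slope_within(0), slope_within(1)
print(f"XZ coefficient: {b_xz:.3f}")
print(f"Difference in within-group slopes: {slope_z1 - slope_z0:.3f}")
```

With a 0/1 moderator, the XZ coefficient is the same quantity as the difference b/w the two unstandardized within-group slopes, which ties the Case 2 slope comparison to the Case 3 product-term test.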


The Nature of Mediator Variables

·         In general, a given variable may be said to function as a mediator to the extent that it accounts for the relation b/w the predictor and the criterion.

·         Whereas moderators specify when certain events will hold, mediators speak to how/why such effects occur.

·         Path diagram as model for depicting a causal chain:

·         Assumes 3 variable system; 2 causal paths feeding into the outcome variable: the direct impact of the IV (Path c) and the impact of the mediator (Path b). There’s also path from IV to the mediator (Path a).

·         A variable functions as mediator when the following are met:

o         Variations in levels of the IV significantly account for variations in the presumed mediator (Path a)

o         Variations in the mediator significantly account for variations in the DV (Path b)

o         Controlling for Paths a and b, a previously significant relation b/w the IV and DV is no longer significant (strongest demonstration is when Path c is zero b/c that is evidence for a single dominant mediator; BUT if the residual Path c is not zero, this indicates the operation of multiple mediating factors, which may be more realistic given that we treat phenomena as having multiple causes)


Testing Mediation:

·         A series of regression models should be estimated (Judd & Kenny, 1981b); need to estimate the 3 following regression equations:

o         Regress the mediator on the IV

o         Regress the DV on the IV

o         Regress the DV on both the IV and on the mediator

·         To establish mediation, the following conditions must hold:

o         The IV must affect the mediator in the 1st equation

o         IV must affect the DV in the 2nd equation

o         Mediator must affect the DV in the 3rd equation

o         If the conditions all hold (in predicted direction), then the effect of the IV on the DV must be less in the 3rd equation than in the 2nd. Perfect mediation holds if the IV has no effect when the mediator is controlled.

·         B/c the IV is assumed to cause the mediator, these 2 variables should be correlated; presence of correlation results in multicollinearity when estimating effects of IV and mediator on DV (results in reduced power in the 3rd equation); SO, critical to examine significance of coefficients AND absolute size

·         Assumptions required: there is no measurement error in the mediator AND that the DV does not cause the mediator

·         B/c mediator is often internal, psychological variable, it’s likely to be measured with error; the presence of error in mediator tends to produce an underestimate of the effect of the mediator and an overestimate of the effect of the IV on the DV when all coefficients are positive (not good b/c mediators may be overlooked)

·         Generally, the effect of measurement error is to attenuate the size of measures of association (resulting estimate being closer to zero than if no measurement error)

·         Can use multiple operationalizations of each construct; use structural modeling techniques…
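The three regression equations above can be sketched on simulated data as follows. The `ols` helper, the variable names, and the simulated path values are hypothetical choices for illustration, not part of the article, and no significance tests or measurement-error corrections are included.

```python
import random

def ols(columns, y):
    """Return [intercept, b1, b2, ...] from least squares of y on the columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations: (X'X | X'y).
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

rng = random.Random(1)
n = 300
x = [rng.gauss(0, 1) for _ in range(n)]                     # IV
m = [0.8 * xi + rng.gauss(0, 1) for xi in x]                # mediator caused by IV
y = [0.3 * xi + 0.5 * mi + rng.gauss(0, 1)                  # DV: partial mediation
     for xi, mi in zip(x, m)]

path_a = ols([x], m)[1]                  # Equation 1: mediator on IV
total_c = ols([x], y)[1]                 # Equation 2: DV on IV
_, direct_c, path_b = ols([x, m], y)     # Equation 3: DV on IV and mediator

print(f"Path a (IV -> M): {path_a:.3f}")
print(f"Path b (M -> DV, IV controlled): {path_b:.3f}")
print(f"IV effect in equation 2: {total_c:.3f}")
print(f"IV effect in equation 3: {direct_c:.3f}")
```

In ordinary least squares on one sample, the IV effect in the 2nd equation equals the IV effect in the 3rd equation plus the product of Paths a and b, so when a and b have the same sign the IV effect necessarily shrinks once the mediator is controlled.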


Overview of Conceptual Distinctions Between Moderators and Mediators

·         For mediation, must establish strong relations b/w: the predictor and the mediating variable, and b/w mediating variable and criterion.

o         At individual unit of analysis – mediators represent properties of the person that transform the predictor variable in some way

o         Group level – role conflict, norms, groupthink, cohesiveness


Strategic Considerations

·         Moderators are typically introduced when there is an unexpectedly weak or inconsistent relation b/w a predictor and a criterion (e.g., the relation holds in one subpopulation but not in another; e.g., self-monitoring improves the ability of personality traits to predict behavioral criteria)

·         Mediation – best done in the case of a strong relation b/w predictor and criterion

·         Moderator to mediator and Mediator to moderator – at times, moderator effects may suggest a mediator be tested at more advanced stage of research; conversely, mediators may be used to drive interventions to serve applied goals (e.g., social density → perceived control → decrements in task performance…an environmental intervention to prevent density from having adverse effects may be suggested to increase the perceived controllability…)


Operational Implications

·         Moderator interpretation of the relation b/w the stressor and control typically entails an experimental manipulation of control (to establish independence b/w the stressor and control, with control as a feature of the environment separate from the stressor). If control is experimentally manipulated (for moderator function), it doesn’t need to be measured except as a manipulation check.

·         If theory assigns a mediator role to the control construct, it’s only secondarily concerned with independent manipulation of control…most essential feature of the hypothesis is that control is the mechanism through which the stressor affects the outcome variable. For this theory, an independent assessment of control is essential for conceptual reasons, as opposed to methodological reason (as in moderator case)….main concern is on demonstrating construct validity which requires multiple independent and converging measurements and so we need to increase quality and quantity of data.


A Framework for Combining Mediation and Moderation

·         P. 1179 – figure and steps for combining mediation and moderation…


Implications and Applications of the Moderator-Mediator Distinction

·         Moderator example: Perceived control (ability to escape from high density situation) moderates density-crowding relation.

·         Mediator example: attitudes → behavioral intentions → behavior (attitude theory of reasoned action; Fishbein & Ajzen, 1980)

·         Combined effect:  prediction of social behaviors from global dispositional variables; using path analytic framework, could take differences in self-monitoring and simultaneously establish both its role as a moderator and the nature of the mediation process through which it has an impact on a given class of behavior…??

o         Placing both moderators and mediators within same causal system helps make salient the more dynamic role played by mediators as opposed to moderators (classification-ish)…

o         Self-monitoring as moderator – partitions people into subgroups (emphasis is on who (hi or lo) does what)

o         Linking self-monitoring x trait relation to a specific mediating mechanism implies that variations in self-monitoring elicit different patterns of coping or info-processing that cause people to become more or less consistent with their attitudes in their behaviors (prior condition allows us to discover different states that cause individuals to act differently)…


Here’s a “nutshell”:

·         Mediation implies a causal sequence among three variables X to M to Y (IV causes the mediator and the mediator causes the DV). 

o         Example, an intervention may change social norms and this change in social norms prevented smoking.

·         Moderation (or an interaction) means that the effect of X on Y depends on the level of a third variable. No causal sequence is implied by interaction. 

o         Example, an intervention may be successful for males but not for females--an interaction effect.


Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.

§    [Abstract]  This is an account of what I have learned (so far) about the application of statistics to psychology and the other sociobiomedical sciences. It includes the principles “less is more” (fewer variables, more highly targeted issues, sharp rounding off), “simple is better” (graphic representation, unit weighting for linear composites), and “some things you learn aren't so.” I have learned to avoid the many misconceptions that surround Fisherian null hypothesis testing. I have also learned the importance of power analysis and the determination of just how big (rather than how statistically significant) are the effects that we study. Finally, I have learned that there is no royal road to statistical induction, that the informed judgment of the investigator is the crucial element in the interpretation of data, and that things take time.

§    Less is more: fewer variables is better

§    Simple is better: make graphs, use unit weights instead of beta weights (because beta weights maximize prediction only for a given sample) – but don’t go too far with simplifying (dichotomizing continuous variables)

§    The Fisherian Legacy/Null Hypothesis testing: .05 is not a magic number.  When a Fisherian null hypothesis is rejected with an associated probability of, for example, .026, it is not the case that the probability that the null hypothesis is true is .026 (or less than .05, or any other value we can specify). Given our framework of probability as long-run relative frequency—as much as we might wish it to be otherwise—this result does not tell us about the truth of the null hypothesis, given the data. (For this we have to go to Bayesian or likelihood statistics, in which probability is not relative frequency but degree of belief.) What it tells us is the probability of the data, given the truth of the null hypothesis—which is not the same thing, as much as it may sound like it.  If the p value with which we reject the Fisherian null hypothesis does not tell us the probability that the null hypothesis is true, it certainly cannot tell us anything about the probability that the research or alternate hypothesis is true. In fact, there is no alternate hypothesis in Fisher's scheme: Indeed, he violently opposed its inclusion by Neyman and Pearson.  Among the less obvious benefits of power analysis was that it made it possible to “prove” null hypotheses. Of course, as I've already noted, everyone knows that one can't actually prove null hypotheses. But when an investigator means to prove a null hypothesis, the point is not to demonstrate that the population effect size is, say, zero to a million or more decimal places, but rather to show that it is of no more than negligible or trivial size (Cohen, 1988, pp. 16–17). Then, from a power analysis at, say, α = .05, with power set at, say, .95, so that β = .05, also, the sample size necessary to detect this negligible effect with .95 probability can be determined. 
Now if the research is carried out using that sample size, and the result is not significant, as there had been a .95 chance of detecting this negligible effect, and the effect was not detected, the conclusion is justified that no nontrivial effect exists, at the β = .05 level. This does, in fact, probabilistically prove the intended null hypothesis of no more than a trivially small effect. The reasoning is impeccable, but when you go to apply it, you discover that it takes enormous sample sizes to do so. For example, if we adopt the above parameters for a significance test of a correlation coefficient and r = .10 is taken as a negligible effect size, it requires a sample of almost 1,300 cases. More modest but still reasonable demands for power of course require smaller sample sizes, but not sufficiently smaller to matter for most investigators—even .80 power to detect a population correlation of .10 requires almost 800 cases. So it generally takes an impractically large sample size to prove the null hypothesis as I've redefined it; however, the procedure makes clear what it takes to say or imply from the failure to reject the null hypothesis that there is no nontrivial effect.

§    How to use Statistics: Plan your research.  Focus on effect sizes, not p.  Effect-size measures include mean differences (raw or standardized), correlations and squared correlation of all kinds, odds ratios, Kappas—whatever conveys the magnitude of the phenomenon of interest appropriate to the research context. If, for example, you are comparing groups on a variable measured in units that are well understood by your readers (IQ points, or dollars, or number of children, or months of survival), mean differences are excellent measures of effect size.  Use your informed judgment as a scientist.  The prevailing yes–no decision at the magic .05 level from a single research is a far cry from the use of informed judgment. Science simply doesn't work that way.
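Cohen’s sample-size figures for “proving” a null hypothesis (almost 1,300 cases at power .95, almost 800 at power .80, for r = .10) can be roughly reproduced with a short calculation. This is a sketch under an assumption: it uses the Fisher z approximation for a two-tailed test of a correlation, which may not be the exact method behind Cohen’s tables, and the function name `n_for_correlation` is hypothetical.

```python
import math
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate n needed to detect a population correlation r.

    Two-tailed test at level alpha, based on the Fisher z transform:
    n = ((z_alpha/2 + z_power) / atanh(r))^2 + 3
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)

n_power_95 = n_for_correlation(0.10, alpha=0.05, power=0.95)
n_power_80 = n_for_correlation(0.10, alpha=0.05, power=0.80)
print("n for power .95:", n_power_95)
print("n for power .80:", n_power_80)
```

Raising the “negligible” effect size or lowering the power shrinks the required n, but as the notes say, not by enough to matter for most investigators.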


Ford, J.K., MacCallum, R.C., & Tait, M. (1986). The application of exploratory factor analysis in applied psychology: A critical review and analysis.  Personnel Psychology, 39, 291-314.

Overview:  Looked at 152 studies (that used FA) and analyzed researcher choices: factor model, retention criteria, rotation, interpretation of factors, etc.  Results show poor choices, bad reporting practices, SO the authors make suggestions for improving the use of FA and reporting results of such.


Factor analysis: contributes to advancements in psych research; used extensively as a data analytic technique for examining patterns of interrelationship, data reduction, classification and description of data, data transformation, hypothesis testing, and mapping construct space (Rummel, 1970).


·         No systematic assessment of how FA is applied had been conducted, SO purpose of study was to review and critically evaluate current FA practices in applied psych research (focuses on exploratory, not confirmatory, FA)

·         Concentrates on four issues (decisions made at each will substantially impact the interpretation of results):

o         Choice of factor model

o         Decision about # of factors to retain

o         Method of rotation

o         Interpretation of the factor solution


Factor model

·         Common FA or Components analysis – both allow examination of how variance for a given variable is distributed relative to other variables in the data set; distinction concerns nature of the variance of the variables

·         Common factor model

o         Assumes variance of each measured variable can be decomposed into common and unique portions (unique variance includes random error variance and systematic variance specific to given measured variable)

o         Appropriate when measured variables are assumed to be a linear function of a set of unmeasured or latent variables (if used components for this it would lead to an inappropriate solution)

·         Components model

o         No differentiation b/w common, unique, and error variance; Rather, set of observed variables is transformed into new set of linear composites of observed variables (composites are intended to account for covariation among variables + total observed variance of each variable)

o         Critics say capitalizing on unreliable variance and results in convenient groupings of variables and not theoretical constructs or latent variables

o         Supporters say doesn’t impose potentially questionable assumption that a hypothetical causal model underlies data

o         Appropriate when interested in maximizing the ability to explain the variance of the observed variables


Number of Factors

·         Factor extraction should stop when additional factors account for trivial variance, but criterion for retention of factors is uncertain (and various rules of thumb lead to different solutions)

·         Components analysis – Kaiser criterion (retain factors with Eigenvalues >=1); Alternative criteria: scree test or parallel analysis

·         Good strategy = use # of decision rules and examine a # of solutions before final conclusion; examine highest to lowest # of factors until the most interpretable solution is found
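A small sketch of the Kaiser criterion applied to a correlation matrix. The 4-variable matrix is made up for illustration (two pairs of variables that correlate .6 within a pair and .1 across pairs), and the eigenvalues come from a plain Jacobi rotation routine written here to avoid outside libraries; `symmetric_eigenvalues` is a hypothetical helper name, not a function from any FA package.

```python
import math

def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-18):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the (p, q) off-diagonal element.
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):      # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):      # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

# Hypothetical correlation matrix: variables 1-2 and 3-4 form two clusters.
R = [
    [1.0, 0.6, 0.1, 0.1],
    [0.6, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.6],
    [0.1, 0.1, 0.6, 1.0],
]

eigenvalues = symmetric_eigenvalues(R)
retained = sum(1 for ev in eigenvalues if ev >= 1.0)   # Kaiser criterion
for i, ev in enumerate(eigenvalues, start=1):
    print(f"Component {i}: eigenvalue {ev:.3f}, {100 * ev / len(R):.1f}% of variance")
print("Retained under Kaiser criterion:", retained)
```

Listing the eigenvalues in descending order is also the raw material for a scree test, so the same output can be checked against more than one decision rule.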



Rotation

·         Used to improve the psychological meaningfulness, reliability, and reproducibility of factors (Weiss, 1976)

·         Simple structure (Thurstone, 1947) served as major criterion for rotation (rotates factors around the origin until each factor is maximally collinear with a distinct cluster of vectors).

·         Orthogonal rotation – produces factors that are statistically uncorrelated

o         Simplicity, conceptual clarity, amenability to subsequent analysis

·         Oblique rotation – allows factors to be correlated

o         Complex; Generates patterns and structure matrix; factor intercorrelations


Interpretation – process by which results of a FA are given meaning or labels

·         Ultimate goal of FA = ID of underlying constructs that summarize a set of variables

·         Common rule: only use variables with factor loadings > .40, but “only this” is limiting

·         To reduce subjectivity, etc., consider alternative constructs (e.g., item wording); could use independent panel’s consensus judgments…


Other issues with FA:

·         Sample size – stability of factor loadings (and subsequent interpretation) is direct function of sample size; Nunnally (1978) argues for 10:1 ratio.

·         Computer program package – don’t just use default options – can be misleading…choose options manually…

·         Factor scores -

o         Reporting of FA results – Rummel (1970) says published studies should contain: clear info about decisions in conducting analysis (correlation matrix, results of FA (eigenvalues, communality estimates, loadings, percentage of variance accounted for))


STUDY METHOD – 152 studies from JAP, PP, OBHP with exploratory FA…coding…



General Discussion:

·         Majority used components model; but, “since most researchers were interested in relationships among unmeasured latent variables, there appears an over-reliance on the components model”

·         Large # of studies didn’t present criterion used for determining factor retention or used the Eigenvalues greater than one…but this is arbitrary (1.01 = sig., but .99 isn’t?? Problematic b/c inflexible adherence can lead to under/over estimation of # of factors to retain); also alternative solutions should be considered

·         Most used orthogonal rotation to “force” independence among factors w/o conceptual justification; others used orthogonal even though the interdependence of factors was recognized…since orthogonal is subset of oblique, makes sense to rotate factors obliquely and then determine the tenability of the orthogonality…

·         To interpret factor solution, most set minimum value above which the loading was considered significant…but is arbitrary…also, they tended to force variables to load on only 1 factor (which ignores that it is consistent with the common factor model and simple structure for a var to have more than 1 high loading and results in a reduction in amt of info used for factor defining).

·         Few studies provided justification for their decisions, or mentioned considering alternatives; few presented communalities, or factor loadings…

·         Overall, authors say FA in I-O is poorly applied; disturbing b/c implies that users of FA are employing methodologies they have little understanding of → meaningless solutions and erroneous conclusions

·         Widespread availability of computer packages cited as major problem (reliance on defaults even w/o understanding; inaccurate reporting as function of blind acceptance of printouts)



·         Support Glass & Taylor (1966) review that also found poor techniques in use of FA

·         Recommendations:


Orr, J. M., Sackett, P. R., & DuBois, C. L. Z. (1991). Outlier detection and treatment in I/O psychology: A survey of researcher beliefs and an empirical illustration. Personnel Psychology, 44, 473-486.

Overview:  Basically saying that outliers can have a great influence on the conclusions we draw from correlational data; that researchers don’t agree on any one best way for handling outliers (but most look at scatterplots to detect), when to delete data points, if/how to report when data is removed; and that “while outlier removal can influence effect size measures in individual studies, outlying data points weren’t found to be a substantial source of validity variance in a large test validity data set”…


Outliers = extreme data points (those separated from the majority of the data)

·         Outlier detection is essential aspect of data analysis; need to consider the source of points in determining how to handle

·         Basic sources of outliers:

o         Data points from Ss who aren’t part of the population of interest

o         May be legitimate data points (w/ info valuable to the study of interest)

o         Result of extreme values on the error component of classical test model

o         Error in observation or recording during data collection process

o         Error in preparing data for analysis (although checking for out-of-range data should be “routine” for a competent researcher)

·         Even though outliers can significantly impact statistical results, very little agreement regarding whether they should be excluded from data analysis…

·         Study was motivated by developments re: meta-analysis:

o         Apparent variability in findings across studies may be due to statistical artifacts (e.g., sampling error, error of measurement) so, authors felt that differences in the treatment of outliers could be additional source of variance…

Two purposes:

·         Survey recent pub authors to see how they detect/treat outliers

·         Re-analyze data from large db of multiple studies to determine effects of different outlier removal procedures on the findings


Overview of outlier detection methods:

·         Outlying data points can be extreme on DV, IV, or both; influence the regression line by forcing it away from majority of points, or by conveying a curvilinear quality to data.

·         For bivariate data: graphic means of outlier detection (e.g., scatterplots, or plots of residuals against respective predicted values or IVs); OR use SPSS or SAS; OR Cook’s D Statistic (index sensitive to outliers on both X and Y)
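For the bivariate case, leverage values, Studentized residuals, and Cook’s D can be computed directly from the textbook simple-regression formulas. The sketch below is an illustration only: the data are made up (ten points on a near-perfect line plus one deliberately aberrant case), the function names are hypothetical, the residuals are internally Studentized, and the article’s actual cutoffs for removal are not reproduced.

```python
import math

def regression_diagnostics(xs, ys):
    """Leverage, internally Studentized residuals, and Cook's D for y on x."""
    n, p = len(xs), 2                       # p = intercept + slope
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e * e for e in resid) / (n - p)
    leverage = [1 / n + (x - mx) ** 2 / sxx for x in xs]
    student = [e / math.sqrt(mse * (1 - h)) for e, h in zip(resid, leverage)]
    cooks = [r * r * h / (p * (1 - h)) for r, h in zip(student, leverage)]
    return leverage, student, cooks

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: the last case is extreme on X and far from the trend on Y.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 14]
ys = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 8.1, 8.9, 10.2, 3.0]

leverage, student, cooks = regression_diagnostics(xs, ys)
worst = max(range(len(xs)), key=lambda i: cooks[i])
r_with = pearson_r(xs, ys)
r_without = pearson_r(xs[:worst] + xs[worst + 1:], ys[:worst] + ys[worst + 1:])

print(f"Most influential case: index {worst}, Cook's D = {cooks[worst]:.2f}")
print(f"r with all cases: {r_with:.3f}; r without that case: {r_without:.3f}")
```

Reporting the correlation both with and without the flagged case, as done in the last line, matches the authors’ suggestion to document how outlier treatment changes the result.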


Study 1

·         How do currently published researchers believe outliers should be treated: should they be included in analyses or dropped?

·         Which outlier detection techniques do these researchers report using in their work?

·         Mail survey…JAP or P-Psych (1984-1987); 157 studies…100 surveys returned



·         Results suggest researchers are sensitive to the potential impact of outliers, many make use of at least some technique for detection (70 of 81 respondents ranked scatterplots as the #1 method), and majority are willing to remove under some circumstances; ALSO, 29% report endorsing retention of all data points, and 18% reported no outlier detection… so there is clear variability in the treatment of outliers


Study 2 – Do the differences in treatment of outliers contribute to inconsistency in study findings across settings?

·         2 objectives: document the extent to which validity coefficients were affected by removal of small # of outliers and to explore whether outliers contributed to variance in study findings

·         Effect size measures were validity coefficients from selection test validation studies (GATB test results and job performance measures for 36K+ EEs; USES)

·         Whether outliers reflect legitimate data points is often unclear (in a single study);

·         If r =.15 ns with outliers and .30 sig. without outliers – which best represents the population parameter? Can’t be determined, BUT if there are 50 other studies examining same variables and the mean r is .30 (and residual SD = .02), can conclude findings are consistent with other research…so then, it’s either that test validity is diff in this org, or the outliers are distorting true picture…authors argue that cumulative evidence of 50 studies suggests outliers to be artifactual source of variance and would endorse use of test…

·         If you want to know more specifics about the method – read the article (yawn…); Basically, assessed predictor-criterion relationships with Cook’s D, Studentized residuals, and Leverage values, and All Diagnostics (used all 3 measures)


·         Very few data points reached threshold for removal with Cook’s D; removal of data points with Studentized residuals results in increase in mean validity and increase in validity variance; removal based on Leverage values results in opposite of S. residuals; removal of points identified as influential by the All Diagnostics left mean validity essentially unchanged but increased validity variance…Although values varied, RESULTS indicate that outlier removal can have noticeable effect on the size of the validity coefficient

·          Results don’t support that outliers contribute to the illusion of variance in validity coefficients

·         Note: “bounded” criterion (supervisors rated on 7-pt scale)


“Net results of study are…”

·         Disagreement among researchers as to appropriateness of deleting data points

·         Report greater use of visual exam of data than of numeric diagnostic techniques for detecting

·         “while outlier removal can influence effect size measures in individual studies, outlying data points weren’t found to be a substantial source of validity variance in a large test validity data set”…



·         Can’t assume that all researchers are dealing with outliers equally effectively (no clear guidance on the range of options avail)

·         Varying treatments can alter conclusions, so comparisons of studies without knowing their outlier approach can be misleading

·         Researchers need to CAREFULLY consider potential impact of outliers, should employ a strategy of multiple methods, and should clearly document the effects of their treatment of outliers on study outcomes (the authors seem to advocate reporting stats results for both with and w/o outlier removal, if applicable)


Wilcox, R. R. (2001). Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy.  New York: Springer-Verlag.

Overview:  Again, Wilcox believes that serious problems arise when we rely on standard stat methods b/c, even under very small departures from normality, important discoveries are lost by assuming that observations follow a normal curve (e.g., can’t detect differences or relationships, or the magnitude is grossly underestimated)...

·         Reliance on central limit theorem; central = fundamental; need substantially large sample to be able to make inferences based on means with the normal curve…

Ch.2 Review:

·         Probability density function – probabilities are represented by area under the curve; the total area equals 1

·         Measures of location or measures of central tendency –a single value chosen to represent the typical individual under study

·         How well does the sample mean estimate the population mean?  Distributions are assumed to be symmetrical, but often are not; consider outliers and the low finite sample breakdown point of the sample mean (= 1/n; as n ↑, the finite sample breakdown point goes to 0).

·         Median – pop M = # such that .5 probability of an observation being less than it; Sample Median; the median achieves the highest possible breakdown point as opposed to the sample mean (meaning relatively insensitive to outliers)

·         Weighted mean – like sample mean, finite breakdown point low…i.e., a single unusual value can make the weighted mean arbitrarily large or small

·         Variance – s² = sample variance; σ² = population variance; σ = population standard deviation, estimated with s…low breakdown point of variance is esp. devastating (even when distributions are symmetric)

·         Measuring error:

o         Using Least squares principle, (minimizing sum of squared errors), the optimal guess leads to the mean.

o         Using absolute values of the errors to get overall accuracy measure would be better; leads to median

·         Least squares regression shown to be a weighted mean, so it too is highly influenced by a single outlier.

·         Sample variance has a finite sample breakdown point of 1/n.
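
A small illustration of the Ch.2 points, using made-up scores: one wild value drags the mean as far as you like while the median stays put, and a brute-force search over guesses shows that squared error is minimized at the mean and absolute error at the median. The grid search is only a demonstration device, not how anyone would compute these in practice.

```python
from statistics import mean, median

# Breakdown point: replace one score with an ever larger value.
scores = [2, 3, 4, 5, 6]
for outlier in (6, 60, 600, 6000):
    contaminated = scores[:-1] + [outlier]
    print(outlier, mean(contaminated), median(contaminated))

def sum_squared_error(guess, data):
    return sum((x - guess) ** 2 for x in data)

def sum_absolute_error(guess, data):
    return sum(abs(x - guess) for x in data)

# Measuring error: which single guess is "closest" to all the data?
data = [2, 3, 4, 5, 600]
candidates = [c / 10 for c in range(0, 7000)]  # guesses 0.0, 0.1, ..., 699.9
best_ls = min(candidates, key=lambda c: sum_squared_error(c, data))
best_abs = min(candidates, key=lambda c: sum_absolute_error(c, data))
print("least squares guess:", best_ls, "| sample mean:", mean(data))
print("least absolute guess:", best_abs, "| sample median:", median(data))
```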


Ch.3 Normal Curve and Outlier Detection:

·         Masking – problem when sample mean and the standard deviation are inflated by the outliers, which masks their presence; Can make “traditional” equation for detecting outliers unsatisfactory

·         Using outlier detection rule that avoids the problem of masking (b/c based on measures of location and scale that have a high finite sample breakdown point) is better; boxplot also?

·         Central limit theorem says w/ sufficiently large sample size can assume sample mean is normal; sometimes n = 20 suffices; sometimes, however, n = 100 or 160 may be required

·         A version of central limit theorem applies to the sample median; situations arise where distribution of the median approaches normal curve more quickly, as the sample size increases, than does the distribution of the mean.
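
A sketch of masking, assuming the “traditional” rule is “flag anything more than 2 SDs from the mean” and the high-breakdown alternative is the MAD-median rule. The notes don’t spell out either rule, so the cutoffs (2 and 2.24) and the 0.6745 rescaling constant are assumptions taken from common practice, and the data are made up.

```python
from statistics import mean, median, stdev

def classical_outliers(data, cutoff=2.0):
    """Flag values more than `cutoff` sample SDs from the sample mean."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) / s > cutoff]

def mad_median_outliers(data, cutoff=2.24):
    """Flag values far from the median, scaled by the rescaled MAD.

    Assumes the MAD is not zero (i.e., fewer than half the values are tied).
    """
    med = median(data)
    mad = median([abs(x - med) for x in data])
    madn = mad / 0.6745  # rescaled so it estimates the SD under normality
    return [x for x in data if abs(x - med) / madn > cutoff]

values = [2, 2, 3, 3, 3, 4, 4, 4, 100000, 100000]
print("mean/SD rule flags:   ", classical_outliers(values))
print("MAD-median rule flags:", mad_median_outliers(values))
```

The two huge values inflate the mean and SD so much that the mean/SD rule flags nothing, while the rule built on the median and MAD flags both.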


Ch.4 Accuracy and Inference:

·         Mean squared error of the sample mean = average of expected squared difference between the infinitely many sample means and the population mean (we want it to be as small as possible)

·         Least squares estimate of the slope of a regression line assumes homoscedasticity; when assumption is violated, alternate estimators can be substantially more accurate…

·         Homoscedasticity – situation where variation of the Y values doesn’t depend on X

·         Heteroscedasticity – variation of Y changes with X (e.g., more variation on Y values when X=12 than when X=20)

·         Variance of sample mean = squared standard error of sample mean…expected squared distance b/w sample mean and pop mean



Ch.5 Hypothesis Testing and Small Sample Sizes

·         Type I error – saying null hypothesis is false when it’s true; α

·         Type II error – failing to reject when the null hypothesis is false; β

·         Power = 1 – β (probability of rejecting when the null is false)

·         As sample size, n, ↑, power ↑, so probability of Type II error ↓

·         As α ↓ (and probability of Type I error ↓), probability of Type II ↑ (i.e., the smaller α is, the less likely we are to reject when in fact we should (b/c null is false))

·         As the standard deviation, σ, ↑…power ↓

·         Practical problem with T is that its expected value can differ from 0 (possible b/c sample mean and sample variance can be dependent under non-normality), so it can be a biased test (power isn’t minimized when null is true, i.e., power ↓ as we move away from null)
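
The three “as X changes, power changes” statements above can be checked with a simple case: a one-sided z test with known standard deviation under normality. That setting is an assumption chosen to keep the formula short (it is not Wilcox’s example), and the effect size and sample sizes are arbitrary.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z test of H0: mu = mu0 vs. H1: mu > mu0.

    effect is the true difference (mu - mu0); sigma is assumed known.
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)      # reject when the z statistic exceeds this
    shift = effect * n ** 0.5 / sigma    # mean of the z statistic under H1
    return 1 - z.cdf(critical - shift)

print("baseline (n=20, sigma=1, alpha=.05):", round(z_test_power(0.5, 1, 20), 3))
print("larger n (n=40):                    ", round(z_test_power(0.5, 1, 40), 3))
print("smaller alpha (alpha=.01):          ", round(z_test_power(0.5, 1, 20, alpha=0.01), 3))
print("larger sigma (sigma=2):             ", round(z_test_power(0.5, 2, 20), 3))
```

When the effect is 0 the function returns alpha, i.e., the Type I error rate.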


Ch.6 The Bootstrap

·         Percentile t bootstrap beats our reliance on the central limit theorem when computing CIs for the population mean. Practical problems with Student’s T are reduced but not eliminated…

·         Percentile bootstrap not recommended when working with sample mean, but has practical value when making inferences about the slope of a regression line; also corrects problem of heteroscedasticity affecting T test of hypothesis (modified percentile bootstrap)…
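
A rough sketch of an equal-tailed percentile t bootstrap CI for the mean. The number of resamples, the seed, the way the quantiles are picked, and the data are all illustrative choices, not taken from the book.

```python
import random
from statistics import mean, stdev

def percentile_t_ci(data, n_boot=2000, alpha=0.05, seed=0):
    """Equal-tailed percentile-t (bootstrap-t) confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(data)
    m = mean(data)
    se = stdev(data) / n ** 0.5
    t_stars = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in range(n)]  # resample with replacement
        s = stdev(sample)
        if s == 0:
            continue  # degenerate resample (all values equal); skip it
        # T computed as usual, but centered on the original sample mean.
        t_stars.append((mean(sample) - m) / (s / n ** 0.5))
    t_stars.sort()
    t_lo = t_stars[int(alpha / 2 * len(t_stars))]
    t_hi = t_stars[int((1 - alpha / 2) * len(t_stars)) - 1]
    # The bootstrap quantiles of T replace the Student's t table values.
    return m - t_hi * se, m - t_lo * se

# Hypothetical skewed sample.
waits = [1, 2, 2, 3, 3, 4, 5, 6, 8, 12, 20, 45]
lower, upper = percentile_t_ci(waits)
print(f"mean = {mean(waits):.2f}, 95% percentile-t CI = ({lower:.2f}, {upper:.2f})")
```

With skewed data the interval is usually not symmetric around the sample mean, which is the point of estimating the distribution of T rather than assuming it.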


Ch.7 A Fundamental Problem

·         Small departures from normality can inflate σ tremendously. In practical terms, small departures from normality can mean low power, and relatively long CIs when using means

·         Approximating a probability curve with a normal curve can be highly inaccurate even when the probability curve is symmetric

·         Even when σ is known, outlier detection rules based on X̄ and σ can suffer from masking


Ch.8 Robust Measures of Location

·         2 robust estimators of location:

o         Trimmed means – basically averaging the ratings after the highest and lowest values are removed… issue = need to figure out how much to trim

o         M-estimators

·         Standard errors of the 20% trimmed mean and the M – estimator are only slightly larger than standard error of the mean (when sampling from normal distribution); when under small departure from normality, the std error of the mean can be substantially higher

·         20% trimmed mean has breakdown point of .2; M-estimator based on Huber’s has breakdown point of .5 – so, when sampling from skewed distribution, their values can be substantially closer to the bulk of the observations than the mean
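
A minimal sketch of a trimmed mean, assuming the usual convention of dropping the same number of observations from each tail (the proportion times n, rounded down). The ratings are made up.

```python
from statistics import mean, median

def trimmed_mean(data, proportion=0.2):
    """Average what is left after cutting `proportion` of the data from each tail."""
    ordered = sorted(data)
    g = int(proportion * len(ordered))  # observations cut from each tail
    return mean(ordered[g:len(ordered) - g])

ratings = [2, 3, 4, 5, 6, 7, 8, 9, 10, 50]
print("mean:            ", mean(ratings))
print("20% trimmed mean:", trimmed_mean(ratings))
print("median:          ", median(ratings))

# Making the largest rating even more extreme moves the mean, not the trimmed mean.
extreme = ratings[:-1] + [5000]
print("mean with 5000:            ", mean(extreme))
print("20% trimmed mean with 5000:", trimmed_mean(extreme))
```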


Ch.9 Inferences about Robust Measures of Location

·         Estimate std error of trimmed means with Winsorized variance

·         Given estimate of std error of trimmed mean, CIs can be computed to test hypothesis

·         Use trimmed means to compare two independent groups; under normality, little power is lost, but for very small departures from normality, 20% trimmed mean results in substantially more power

·         CIs can be computed and hypothesis tested using M-estimator and percentile bootstrap
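
A sketch of the Winsorized-variance estimate of the trimmed mean’s standard error, using the commonly given form: Winsorized SD divided by (1 – 2 × trimming proportion) × √n. The notes don’t state the formula, so treat it as an assumption; the ratings are made up.

```python
from statistics import stdev, variance

def winsorize(data, proportion=0.2):
    """Pull the g smallest and g largest values in to the nearest retained value."""
    ordered = sorted(data)
    n = len(ordered)
    g = int(proportion * n)
    low, high = ordered[g], ordered[n - g - 1]
    return [min(max(x, low), high) for x in ordered]

def trimmed_mean_se(data, proportion=0.2):
    """Estimated standard error of the trimmed mean from the Winsorized variance."""
    n = len(data)
    s_w = variance(winsorize(data, proportion)) ** 0.5
    return s_w / ((1 - 2 * proportion) * n ** 0.5)

ratings = [2, 3, 4, 5, 6, 7, 8, 9, 10, 50]
mean_se = stdev(ratings) / len(ratings) ** 0.5
print("Winsorized ratings:      ", winsorize(ratings))
print("SE of the mean:          ", round(mean_se, 3))
print("SE of 20% trimmed mean:  ", round(trimmed_mean_se(ratings), 3))
```

A CI would then be the trimmed mean plus or minus a critical value times this standard error; the critical value (from a t table or a bootstrap) is left out here.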


Ch.10 Measures of Association

·         Pearson’s correlation = 0 when 2 measures are independent; b/w -1 and 1; r is not resistant to outliers.

o         Useful for summarizing and detecting associations, but can be relatively unrevealing

·         Smoothers provide useful addition to tools to detect curvature…may show an association b/w X and Y over some range of  values but outside the range, association disappears

·         One way to reduce effects of outliers on r is to downweight extreme values for X and Y – Winsorized correlation, Spearman’s Rho, and Kendall’s Tau…
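
A small comparison of Pearson’s r and Spearman’s rho when a single point is wildly off. The helper functions are hand-rolled for illustration, the rank function assumes no ties, and the data are made up.

```python
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Ranks 1..n; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 5, 6, 7, 8, -40]  # last point is the outlier

print("r without the outlier:", round(pearson_r(x[:-1], y[:-1]), 3))
print("r with the outlier:   ", round(pearson_r(x, y), 3))
print("rho with the outlier: ", round(spearman_rho(x, y), 3))
```

One point is enough to flip the sign of r; rho is pulled down as well, but stays positive because the outlier counts only as “the lowest rank,” not as –40.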


Ch.11 Robust Regression & Ch.12 Alternate Strategies

·         Many alternative methods, but which to use will depend on particular situation (of course!), but these do offer substantial advantages over traditional techniques


Wilcox, R.R. (1998).  How many discoveries have been lost by ignoring modern statistical methods? American Psychologist, 53, 300-314.

Overview:  A lot of our commonly applied statistical techniques can be misleading and can have relatively low power under even small departures from normality…If more modern stat methods were employed, many ns findings would be sig.…Modern techniques are more robust and effective when data is non-normal, but also do well when normal assumptions are met.  Advantages: more accurate confidence intervals, effective outlier detection, etc…


·         Standard methods are not robust when differences exist b/w groups or there are associations b/w random variables (esp. arbitrarily small departures from normality = low power; heteroscedasticity (unequal variance among groups) in normal dist lowers power of ANOVA and regression methods; CIs and measures of effect size can be “extremely misleading” w/ non-normal data)

·         Concern about the ever-widening gap b/w modern stat methods & techniques psychologists use!

·         Typical stats (sample mean, Pearson product-moment correlation, least squares estimate of regression parameters) can be drastically affected by a single, unusual value – Modern methods designed to deal with this…(and make huge difference in applied work)

·         Skewness, heteroscedasticity and outliers can substantially decrease chances of:

o         Detecting true differences b/w groups

o         Detecting true associations among random variables

o         Obtaining accurate CIs

o         Also, problems with common measures of effect size


Problems with Student’s t test

·         1-α = CI for the population mean; If goal is to test null hypothesis that M = 10, then reject null if CI doesn’t contain 10…

o         Skewness and outliers affect power (probability of rejecting null when it is false)…power is related to variance of the sample mean…as population variance ↑, power ↓

·         Small departures from normality can substantially lower power

o         Normal vs. mixed normal distributions – appear similar, but…even though Kolmogorov test for normality can have low power (and is unlikely to detect departure from normality), for the standard normal distribution σ² = 1.0, but for mixed normal, σ² = 10.9 (pop variance is not robust, so even small changes in tails of distribution can drastically alter value of the pop variance)

·         Under slight departures from normality, potential discoveries will be lost!

·         Trimmed means and M estimators use a type of trimming (extreme values are removed when a  measure of location is being estimated)…can result in higher power

o         10% trimming means 10% of largest and 10% of smallest observations are removed…

o         M estimators empirically determine whether observation is an outlier; if so, adjustments for it are made…

o         Standard error of the sample mean may be < standard error of these estimators, but the sample mean rarely, if ever, offers a substantial advantage (and it’s fairly common for trimmed means and M estimators to have substantially smaller standard errors)

o         How it works:  in samples of observations, outliers inflate estimated standard error of the sample mean (i.e., outliers inflate sample variance → long CIs & relatively poor power)… incorporating a mechanism into the estimator ↓ effects of outliers, and low std errors can be achieved
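
Where a figure like σ² = 10.9 can come from: the sketch below assumes the mixed normal is a 90/10 blend of a standard normal and a normal with SD = 10. The notes don’t give the mixing proportions or the second SD, so both are assumptions; they are chosen because they reproduce the 10.9 value.

```python
def mixture_variance(components):
    """Variance of a mixture; components are (weight, mean, sd) triples summing to weight 1."""
    mu = sum(w * m for w, m, _ in components)
    second_moment = sum(w * (sd ** 2 + m ** 2) for w, m, sd in components)
    return second_moment - mu ** 2

standard_normal = [(1.0, 0.0, 1.0)]
mixed_normal = [(0.9, 0.0, 1.0), (0.1, 0.0, 10.0)]  # assumed 10% heavy-tailed part

print("standard normal variance:", mixture_variance(standard_normal))
print("mixed normal variance:   ", round(mixture_variance(mixed_normal), 2))
```

Only 10% of the distribution changes, mostly in the tails, yet the population variance grows roughly elevenfold.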

But outliers are important, interesting, and informative:

·         Modern methods are effective b/c when one attempts to ID and study unusual observations, measures of location and scale that aren’t themselves affected by outliers are useful (contrast = sample mean and SD can be inflated by outliers)

Why not discard outliers and apply standard methods to the remaining data?

·         This fails b/c it results in using the wrong standard error (if extreme values are discarded, remaining observations are no longer independent, so conventional methods for deriving standard errors no longer apply)….lots of confusing “non-details” but he says there are statistical packages that calculate what you need for the modern methods!

What if distributions are skewed or there are no outliers?

·         Common misconception is that robust methods are appropriate only when distributions are symmetric, so by implication, std methods should be used if distributions are skewed. Another misconception is that if there are no outliers, modern robust methods offer no practical advantages…

·         Modern methods give better results when dist are skewed & std methods fail miserably (standard methods can have peculiar power properties (power ↓ as one moves away from null), CIs have probability coverage substantially different from the nominal level, and the sample mean can poorly reflect the typical participant under study)…


How much trimming should be used?

·         Depends…20% good for general use, but may not be optimal in terms of minimizing std error

·         Analogue of Student’s t test – Winsorized variance * constant (dependent on amt of trimming); again, use a computer…


What is a robust parameter?

·         Seems to mean that a particular hypothesis-testing procedure controls the probability of Type I error; among modern methods, it has more general meaning that applies to both parameters and estimators…hypothesis testing – small changes in a distribution shouldn’t result in large changes in power or probability coverage

·         Don’t discard means, but…using means to the exclusion of modern methods is unsatisfactory and an uninformative way to proceed…



·         With correlation and regression, the problems with means may become worse.  Correlations aren’t resistant to even small changes in values, and modern methods are available to overcome this.



·         Robust analogues of the ordinary least squares reg estimator are avail…offer substantial improvements in terms of power and resistance to outliers while sacrificing very little when  the error term is normal and homoscedastic; says OLS is one of the poorest choices b/c std error is 100x more than some modern methods…

·         Also, benefit of using robust modern technique that isn’t sensitive to outliers is that it can give an even better picture of the reg line that best fits the bulk of the points so one can determine which points are unusual.

·         Look at the article for illustrations of how this works…and how the methods are able to ID “true” relationships, etc…



·         Basically, no single perfect method; which estimator is best depends on the particular situation; take advantage of technology to vastly improve on std ANOVA and regression techniques (great opportunity to improve psych research)

·         Worst choice = apply OLS regression or just report correlations


* Other Notes

VARIOUS STAT METHODS/DATA ANALYSIS *from texts, summaries, etc…

For general referencing:

Aguinis et al. Measurement in Work and Org Psychology. IWO handbook.

Guion – Validity and Reliability (2002) book chapter from Rogelberg I-O Methods Handbook.

Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis. Newbury Park, CA: Sage.

Klein & Kozlowski (Eds.). (2000). Multilevel Theory, Research, and Methods in Organizations.

Pedhazur, E.J. & Schmelkin, L.P. (1991). Measurement, Design, and Analysis: An Integrated Approach (student edition)…

Tabachnick, B.G. & Fidell, L.S. (2001). Using Multivariate Statistics.


“Concepts to master in simple terms

Standard deviation – measure of variability – represents the average amount of variability in a set of scores; average deviation/distance from the mean; how far from the mean the scores actually fall; the tighter that a group of scores clusters around the mean, the easier it is to make accurate predictions about the value of additional scores. So, samples with lower standard deviations provide more reliable and predictable data than samples with higher standard deviations.


Correlation – A numerical index of the relationship b/w two variables; reflects the amount of variability shared b/w 2 variables and what they have in common (how the value of one variable changes when the value of another variable changes); measure of the strength and direction of a relationship between two variables. The coefficient of correlation ranges between -1.00 (perfect negative correlation) to 0.00 (no correlation) to +1.00 (perfect positive correlation); Need to be considered carefully b/c they don't indicate causality…


Group differences – when Ss are assigned to groups (treatments), the major research Q is usually the extent to which reliable mean differences on DVs are associated with group membership. Once differences are found, assess the degree of relationship b/w IVs and DVs

·         ANOVA – test for difference b/w 2 or more means; Simple ANOVA has 1 IV; One-way ANOVA looks for differences b/w the means of 2 or more groups; compares amount of variability between groups (due to grouping factor) to the amount of variability within groups (due to chance) – as average diff b/w groups gets larger, F value increases, and it’s more likely due to X than to chance…

·         ANCOVA – allows you to basically equalize initial difference between groups; assesses group differences on a single DV after the effects of 1 or more covariates are removed (ex: remove age, degree of disability before examining effects of therapy on reading level for each group (e.g., treatment, control)); want strong relationship b/w DV and covariates for greater ANCOVA power…

·         MANOVA – differences among centroids for set of DVs when 2 or more levels of IV (groups). (MANCOVA)

·         Also, independent t-test (only 2 groups; assumes equal variability…)
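
A bare-bones one-way ANOVA F ratio, to make the “between vs. within” comparison concrete. The groups and scores are made up, and no p-value is computed (that would need the F distribution).

```python
def one_way_anova_f(groups):
    """F = mean square between groups / mean square within groups."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    # Between: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within: how far each score is from its own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

control = [1, 2, 3]
training_a = [2, 3, 4]
training_b = [6, 7, 8]
print("F:", one_way_anova_f([control, training_a, training_b]))

# Same within-group spread, but one group mean moved further away: F goes up.
training_b_far = [16, 17, 18]
print("F (means further apart):", one_way_anova_f([control, training_a, training_b_far]))
```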


Moderator (or interaction) - Two independent variables interact if the effect of one of the variables differs depending on the level of the other variable, OR an interaction effect refers to the role of a variable in an estimated model, and its effect on the dependent variable - A variable that has an interaction effect will have a different effect on the dependent variable, depending on the level of some third variable.

·         For example, increasing organizational effectiveness might yield increasing job satisfaction, but the increase in satisfaction might be progressively greater for men than for women. In this case, organizational effectiveness is interacting with gender to produce different rates of satisfaction for gender categories.

·         [see previous section for info on how to test].


Internal consistency reliability – determines the degree to which various items of a measure correlate with each other (Aguinis et al, IWO handbook), OR the extent to which all items are measuring the same construct; Fundamentally based on the notion of items as replications, ignoring differences in difficulty, so that similar responses should be given to both in any pair of items. Less than perfect correlation is evidence that items don’t tap precisely the same facet or same level of the underlying construct; internal consistency coefficients are useful; method of estimation should be chosen based on the sorts of variance to be treated as error…(Guion, 2002)

·         Inter-item correlation – correlate all items with each other; items should correlate positively (not too low, so shows consistency; not too high (or shows redundancy))

·         Corrected Item-total correlation – select an item and calculate scale score excluding that item, then correlate item w/ corrected scale score; higher the correlation, the better

·         Split-half – compares scores on two halves of a test taken at the same time

·         Cronbach’s α – compares scores of examinees on all possible split halves; most common reliability measure; provides lowest estimate of reliability, influenced by # of items;
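
A short sketch of Cronbach’s α using the standard variance-based formula, α = k/(k – 1) × (1 – sum of item variances / variance of total scores). The item scores are made up (3 items, 4 respondents).

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per item, respondents in the same order in each list."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # each respondent's scale score
    item_var = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var / variance(totals))

item_1 = [2, 4, 4, 5]
item_2 = [3, 4, 5, 5]
item_3 = [1, 3, 4, 4]
print("alpha:", round(cronbach_alpha([item_1, item_2, item_3]), 3))

# If every item is an exact copy of the others, alpha reaches its maximum of 1.
print("alpha, identical items:", cronbach_alpha([item_1, item_1, item_1]))
```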


Inter-rater reliability – an indication of the extent to which 2 or more raters have agreement with one another (consistency b/w raters) about some judgment (b/c raters’ biases and inconsistencies may influence ratings); determines consistencies among raters and whether rater characteristics are determining the rating instead of the attribute being measured (Aguinis et al.);

·         Degree of consistency across raters when rating objects or individuals

o         Interrater consensus – absolute agreement b/w raters

o         Interrater consistency – similarity in ratings based on correlation or rank-order

·         Measured by interrater agreement (% of rater agreement), interclass correlation, or intraclass correlation (these refer to proportional consistency of variance among raters)

o         Interclass correlation – 2 raters rating multiple objects or individual (can use Pearson’s r)

o         Intraclass correlation – reliability of mean ratings; group of raters and single and/or multiple targets

§         ICC(1) – multiple raters, multiple targets on single dimension

§         ICC(2) – randomly sampled judges, and each judge rates each target
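
Consensus vs. consistency in a toy example: two hypothetical raters who rank everyone identically, but one is always a point harsher. Percent agreement and Pearson’s r are used here as the simplest stand-ins for the two ideas; ICC computations are not shown.

```python
def percent_agreement(ratings_1, ratings_2):
    """Share of targets on which the two raters give exactly the same rating."""
    matches = sum(a == b for a, b in zip(ratings_1, ratings_2))
    return matches / len(ratings_1)

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

lenient_rater = [2, 3, 4, 5, 6]
harsh_rater = [1, 2, 3, 4, 5]  # same rank order, always one point lower

print("consensus (percent agreement):", percent_agreement(lenient_rater, harsh_rater))
print("consistency (Pearson r):      ", round(pearson_r(lenient_rater, harsh_rater), 3))
```

Perfect consistency with zero consensus: which one matters depends on whether the absolute rating level or only the rank order is used for decisions.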


Sampling error – a measure of how well a sample approximates the characteristics of a population; basically the difference between the values of the sample statistic (a measure that describes a sample value but estimates a population value) and the population parameter (the value of that measure in the pop). 

·         The higher the sampling error, the less precise the sample and the more difficult it will be to make the case that what you find in the sample indeed reflects what you expect to find in the pop…

·         Sampling error is an estimate of the margin by which the "true" score on a given item could differ from the reported score for one or more reasons (i.e., differences in one or more important characteristics between the sample and the population). For example, if 60% of Ss reply "very often" to a particular item and the sampling error is ± 5%, there is a 95% chance that the population value is between 55% and 65%.

·         The larger the sample, the smaller the degree of sampling error.
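
The ± 5% example can be tied to a sample size with the usual normal-approximation margin of error for a proportion. This assumes a simple random sample and a 95% level (z = 1.96); the sample sizes below are hypothetical, picked so that the first gives roughly ± 5%.

```python
def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion p with sample size n."""
    return z * (p * (1 - p) / n) ** 0.5

p = 0.60  # 60% of Ss reply "very often"
for n in (369, 1476):
    moe = margin_of_error(p, n)
    print(f"n = {n}: {p:.0%} ± {moe:.1%}  ->  ({p - moe:.1%}, {p + moe:.1%})")
```

Quadrupling the sample size cuts the margin of error in half, which is the “larger sample, smaller sampling error” point in numbers.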


Measurement error - Measurement is never perfect, and we can always expect measurement errors in our data. Goal is to keep these errors to a minimum. For this reason, we need to be aware of the various sources and causes of measurement error.

·         Random error is a nonsystematic measurement error that is beyond our control, though its effects average out over a set of measurements.

o         Example: a scale may be properly calibrated but give inconsistent weights (sometimes too high, sometimes too low). Over repeated uses, however, the effects of these random errors average out to zero.

o         The errors are random rather than biased: They neither understate nor overstate the actual measurement.

·         In contrast, measurement bias, or systematic error, favors a particular result. A measurement process is biased if it systematically overstates or understates the true value of the measurement.

o         Example: If a scale is not properly calibrated, it might consistently understate weight. In this case, the measuring device -- the scale -- produces the bias. Human observation can also produce bias.

o         Biased measurements invariably produce unreliable results.

In any statistical investigation, we can always attribute some of the variation in data to measurement error, part of which can result from the measurement instrument itself. But human mistakes, especially recording errors (e.g., misreading a dial, incorrectly writing a number, not observing an important event, misjudging a particular behavior), can also often contribute to the variability of the measurement and thus to the results of a study.


Statistical significance (vs. practical significance) – Stat sig. of a result reflects the probability that a relationship (e.g., b/w variables) or difference (e.g., b/w means) at least as large as the one observed in a sample would occur by pure chance if, in the pop from which the sample was drawn, no relationship/difference really exists.

·         Statistical significance is the degree of risk you are willing to take that you will reject a null hypothesis when it is actually true (AKA Type I error), or the reliability of having some effect present…

·         SO, significance tests are performed to see if the null hypothesis can be rejected. If the null hypothesis is rejected, then the effect found in a sample is said to be statistically significant. If the null hypothesis is not rejected, then the effect is not significant. Researcher chooses a significance level before conducting the statistical analysis. The significance level chosen determines the probability of a Type I error.

·         A statistically significant effect is not necessarily practically significant.


Practical significance– are the results meaningful or practical? More of a substantive issue, not statistical one (maybe think: utility?)

·         Large sample sizes can produce a statistically significant result even though there is limited or no practical importance associated with the finding (a sound conceptual base to a study will lend meaning to the significance of an outcome)

·         Answers: Is the size of the effect meaningful in the real world? (Can measure with eta-squared (estimate the pop. % of variance in the DV accounted for by the relationship w/ IV), or omega-squared…)
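
A minimal eta-squared calculation for a one-way design (SS between / SS total), using made-up scores. It describes the proportion of variance in the sample; omega-squared, which adjusts for bias, is not shown.

```python
def eta_squared(groups):
    """Proportion of total variability in the DV associated with group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total

control = [1, 2, 3]
training_a = [2, 3, 4]
training_b = [6, 7, 8]
print("eta-squared:", eta_squared([control, training_a, training_b]))
```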


Chi-Square – allows you to determine if what you observe in a distribution of frequencies would be what you would expect to occur by chance; if no difference b/w what is expected by chance from what is observed, then chi-square would = zero.
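
The chi-square statistic itself is just the sum of (observed – expected)² / expected across categories. A tiny sketch with made-up frequencies (60 people, 3 categories, equal frequencies expected by chance); no p-value is computed.

```python
def chi_square(observed, expected):
    """Sum of squared (observed - expected) differences, each scaled by expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

expected = [20, 20, 20]
print("observed matches chance:", chi_square([20, 20, 20], expected))
print("observed departs:       ", chi_square([10, 20, 30], expected))
```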


MANOVA – used to see main and interaction effects of categorical variables on more than 1 DV;

Where ANOVA tests the differences in means of the interval dependent for various categories of the independent(s), MANOVA tests the differences in the centroid (vector) of means of the multiple interval dependents, for various categories of the independent(s).


Multiple potential purposes for MANOVA:

·         To compare groups formed by categorical independent variables on group differences in a set of interval dependent variables.

·         To use lack of difference for a set of dependent variables as a criterion for reducing a set of independent variables to a smaller, more easily modeled number of variables.

·         To identify the independent variables which differentiate a set of dependent variables the most.


Multiple analysis of covariance (MANCOVA) is similar to MANOVA, but interval independents may be added as "covariates." These covariates serve as control variables for the independent factors.


Significance tests

F-test. The omnibus or overall F test is the first of the two-step MANOVA process of analysis; tests the null hypothesis that there is no difference in the means of the dependent variables for the different groups formed by categories of the IVs.

·         The multivariate formula for F is based not only on the sum of squares between and within groups, as in ANOVA, but also on the sum of cross-products -- that is, it takes covariance into account as well as group means.

Tests of group differences are the second step in MANOVA. If the overall F-test shows the centroid (vector) of means of the dependent variables is not the same for all the groups formed by the categories of the independent variables, tests of group differences are used to explore the nature of the group differences.

Significance tests for multiple dependents (e.g., Hotelling, Wilks, or Pillai tests) all follow the F distribution, and so an F value and corresponding significance level are printed out for each of these.



Assumptions:

·         Observations are independent of one another. MANOVA is not robust when the selection of one observation depends on selection of one or more earlier ones.

·         IVs are categorical.

·         DVs are continuous and interval level.

·         Residuals are randomly distributed.

·         Homoscedasticity (homogeneity of variances and covariances): within each group formed by the categorical independents, the variance of each interval dependent should be similar, as tested by Levene's test, below. Also, for each of the k groups formed by the independent variables, the covariance between any two dependent variables must be the same.


Canonical correlation – several continuous DVs and several continuous IVs, and the goal is to assess the relationship b/w the 2 sets of variables…(T&F)


The purpose of canonical correlation analysis is to explain or summarize the relationship between two sets of variables by finding a linear combination of each set of variables that yields the highest possible correlation between the composite variable for set A and the composite variable for set B. One or more additional linear combinations are then formed for each variable set in an attempt to further explain the residual variance that is not explained by the initial correlation. There may be multiple IVs and DVs, and canonical correlation finds a linear relationship. For example, if you have several predictors of job performance and several predictors of job success you might use canonical correlation to measure the relationship.


Discriminant function analysis - a method of distinguishing between classes of subjects or objects. The values of various attributes of a subject or object are measured and a rule (function) is applied that assigns a classification to that object.

·         Example: a rule is desired to distinguish between competent and incompetent employees. We could measure: personality, g, bio-data, job knowledge…Discriminant analysis seeks to establish a rule formula that accurately divides employees into competent and incompetent categories based on the above variables. Typically, the rule will be established using a portion of the data (validation sample) and tested on another portion of the data (hold out or cross validation sample).


Discriminant function analysis is used to determine which variables discriminate between two or more naturally occurring groups.

·         For example, an educational researcher may want to investigate which variables discriminate between high school graduates who decide (1) to go to college, (2) to attend a trade or professional school, or (3) to seek no further training or education. For that purpose the researcher could collect data on numerous variables prior to students' graduation. After graduation, most students will naturally fall into one of the three categories. Discriminant analysis could then be used to determine which variable(s) are the best predictors of students' subsequent educational choice.


The basic idea underlying discriminant function analysis is to determine whether groups differ with regard to the mean of a variable, and then to use that variable to predict group membership


Summary. In general discriminant analysis is a very useful tool (1) for detecting the variables that allow the researcher to discriminate between different (naturally occurring) groups, and (2) for classifying cases into different groups with a better than chance accuracy.



·         Computationally very similar to MANOVA, and all assumptions for MANOVA apply.

·         Normal distribution.

·         Homogeneity of variances/covariances. It is assumed that the variance/covariance matrices of variables are homogeneous across groups.

·         Correlations between means and variances. The major "real" threat to the validity of significance tests occurs when the means for variables across groups are correlated with the variances (or standard deviations). Intuitively, if there is large variability in a group with particularly high means on some variables, then those high means are not reliable. However, the overall significance tests are based on pooled variances, that is, the average variance across all groups. Thus, the significance tests of the relatively larger means (with the large variances) would be based on the relatively smaller pooled variances, resulting erroneously in statistical significance. In practice, this pattern may occur if one group in the study contains a few extreme outliers, who have a large impact on the means, and also increase the variability. To guard against this problem, inspect the descriptive statistics, that is, the means and standard deviations or variances for such a correlation.


Stepwise discriminant analysis

Probably the most common application of discriminant function analysis is to include many measures in the study, in order to determine the ones that discriminate between groups.

·         For example, an educational researcher interested in predicting high school graduates' choices for further education would probably include as many measures of personality, achievement motivation, academic performance, etc., as possible in order to learn which one(s) offer the best prediction.

Model. Put another way, we want to build a "model" of how we can best predict to which group a case belongs. In the following discussion we will use the term "in the model" in order to refer to variables that are included in the prediction of group membership, and we will refer to variables as being "not in the model" if they are not included.

·         Forward stepwise analysis. In stepwise discriminant function analysis, a model of discrimination is built step-by-step. Specifically, at each step all variables are reviewed and evaluated to determine which one will contribute most to the discrimination between groups. That variable will then be included in the model, and the process starts again.

·         Backward stepwise analysis. One can also step backwards; in that case all variables are included in the model and then, at each step, the variable that contributes least to the prediction of group membership is eliminated. Thus, as the result of a successful discriminant function analysis, one would only keep the "important" variables in the model, that is, those variables that contribute the most to the discrimination between groups.

·         F to enter, F to remove. The stepwise procedure is "guided" by the respective F to enter and F to remove values. The F value for a variable indicates its statistical significance in the discrimination between groups, that is, it is a measure of the extent to which a variable makes a unique contribution to the prediction of group membership.
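As a rough illustration of the first forward step: when no variables are in the model yet, the F to enter for each candidate is simply its one-way ANOVA F across the groups. The Python sketch below computes that F for two hypothetical variables and picks the larger one. Later steps use partial F values that adjust for variables already in the model, which this sketch does not compute. The variable names and data are made up.

```python
from statistics import mean

def one_way_f(groups):
    """One-way ANOVA F ratio for one variable measured in several groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical data: each candidate variable holds scores for three groups.
candidates = {
    "gpa": [[3.6, 3.8, 3.5, 3.9], [3.0, 3.1, 2.9, 3.2], [2.4, 2.6, 2.5, 2.3]],
    "shoe_size": [[9, 10, 8, 11], [10, 9, 11, 8], [9, 11, 10, 8]],
}

# The variable with the largest F contributes most to discrimination
# and would be entered first.
f_values = {name: one_way_f(groups) for name, groups in candidates.items()}
first_entered = max(f_values, key=f_values.get)
print(f_values, first_entered)
```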


Structural Equation Modeling – combines factor analysis, canonical correlation, and multiple regression – evaluates whether the model provides a reasonable fit to the data and the contribution of each of the IVs to the DVs…can also compare among alternative models (T&F)


Structural equation modeling (SEM) or “covariance structure analysis” is a multivariate statistical technique which combines confirmatory factor analysis and path analysis for the purpose of analyzing hypothesized relationships among latent (unobserved or theoretical) variables measured by manifest (observed) indicators.

An SEM model is typically composed of two parts:

·         A measurement model – describes how latent variables are measured by their manifest indicators (e.g., test scores, survey observations)

·         A structural model – describes the relationship between the latent variables and shows the amount of unaccounted for variance.

SEM is confirmatory in nature since it seeks to confirm the relationships we believe to exist between measured and unmeasured variables. This is accomplished by comparing the estimated covariance matrix (implied by the hypothetical model we have in mind) to the actual covariance matrix derived from the empirical data.


Model specification - - Parameter estimation - -

A model “fits” when the implied covariance matrix (theory driven) closely reproduces the covariance matrix of the observed data. Various goodness-of-fit indicators tell you how well your theory of interrelationships among variables (latent and measured) fits the observed interrelationships among variables. All of them assess the degree of correspondence between the theoretical (implied) covariance matrix and the observed covariance matrix; examples include the chi-square statistic, root mean square error indices, and the goodness of fit index.
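To make the "degree of correspondence" idea concrete, here is a small Python sketch that computes a root mean square residual between an observed matrix and two model-implied matrices. This is only one simple index, chosen for illustration. The matrices are hypothetical, and in practice SEM software estimates the implied matrix from the model's parameters.

```python
from math import sqrt

def rms_residual(observed, implied):
    """Root mean square of the residuals over the unique (lower-triangle)
    elements of two symmetric covariance matrices."""
    p = len(observed)
    residuals = [
        observed[i][j] - implied[i][j]
        for i in range(p)
        for j in range(i + 1)
    ]
    return sqrt(sum(r ** 2 for r in residuals) / len(residuals))

# Hypothetical 3 x 3 matrices (standardized variables, so these are correlations).
observed = [[1.00, 0.48, 0.40], [0.48, 1.00, 0.35], [0.40, 0.35, 1.00]]
good_model = [[1.00, 0.50, 0.40], [0.50, 1.00, 0.32], [0.40, 0.32, 1.00]]
poor_model = [[1.00, 0.10, 0.10], [0.10, 1.00, 0.10], [0.10, 0.10, 1.00]]

print(round(rms_residual(observed, good_model), 3))
print(round(rms_residual(observed, poor_model), 3))
```

The smaller the residual, the more closely the implied matrix reproduces the observed one.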



·         A reasonable sample size – a good rule of thumb is 15 cases per predictor in a standard ordinary least squares multiple regression analysis. Since SEM is closely related to multiple regression in some respects, 15 cases per measured variable in SEM is not unreasonable. Some researchers think that for this class of model with two to four factors, the investigator should plan on collecting at least 100 cases, with 200 being better (if possible). In general, the more the better.

·         Continuously and normally distributed endogenous variables

·         SEM programs assume that dependent and mediating variables (so-called endogenous or downstream variables in SEM parlance) are continuously distributed, with normally distributed residuals. In fact, residuals from a SEM analysis are not only expected to be univariate normally distributed, their joint distribution is expected to be joint multivariate normal (JMVN) as well. However, this assumption is never completely met in practice.

·         Model identification (identified equations)


Confirmatory Factor Analysis

Goal of FA or PCA is to reduce a large # of variables to a smaller # of factors, to concisely describe (and maybe understand) the relationships among observed variables, or to test theory about underlying processes (T&F).


Principal components and factor analysis:

·         Factor analysis is a statistical approach that can be used to analyze interrelationships among a large number of variables and to explain these variables in terms of their common underlying dimensions (factors).

·         The difference between PCA and FA is that, for the purposes of matrix computations, PCA assumes that all variance is common, with all unique factors set equal to zero, while FA assumes that there is some unique variance. The level of unique variance is dictated by the FA model which is chosen. Accordingly, PCA is a model of a closed system, while FA is a model of an open system.

·         Factor rotation attempts to put the factors in a simpler position with respect to the original variables, which aids in the interpretation of factors. Rotation places the factors into positions such that only the variables which are distinctly related to a factor will be associated with it.

o         Varimax, quartimax, and equimax are all orthogonal rotations, while oblique rotations are non-orthogonal (they allow the resultant factors to be correlated, and are used when the researcher believes that the factors are related to one another).

o         Varimax rotation maximizes the variance of the squared loadings within each factor, and is also the most commonly used.

·         Factor analysis could be used to verify your conceptualization of a construct of interest.


2 main types of FA:

Principal component analysis -- this method provides a unique solution, so that the original data can be reconstructed from the results. It looks at the total variance among the variables, so the solution generated will include as many factors as there are variables, although it is unlikely that they will all meet the criteria for retention. There is only one method for completing a principal components analysis; this is not true of any of the other multidimensional methods described here.


Common factor analysis -- this is what people generally mean when they say "factor analysis." This family of techniques uses an estimate of common variance among the original variables to generate the factor solution. Because of this, the number of factors will always be less than the number of original variables. So, choosing the number of factors to keep for further analysis is more problematic using common factor analysis than in principal components.


Steps in conducting a factor analysis

·         Data collection and generation of the correlation matrix

·         Extraction of initial factor solution

·         Rotation and interpretation

·         Construction of scales or factor scores to use in further analyses


·         Determine the number of components/factors to be retained for further analysis – a “rule of thumb” for determining the number of factors is the "eigenvalue greater than 1" criterion, OR view the scree test (factors on the x-axis and eigenvalues on the y-axis)

o         Eigenvalues are produced by a process called principal components analysis (PCA) and represent the variance accounted for by each underlying factor (in terms of the amount of 1 variable…)

o         Choose # of factors based on where the line straightens…

·         Rotating factor structure might make more interpretable

·         Naming the factors

o         Factor names should be brief: one or two words that communicate the nature of the underlying construct

o         Look for patterns of similarity between items that load on a factor. If you are seeking to validate a theoretical structure, you may want to use the factor names that already exist in the literature. Otherwise, use names that will communicate your conceptual structure to others. In addition, you can try looking at what items do not load on a factor, to determine what that factor isn't. Also, try reversing loadings to get a better interpretation.

·         Using the factor scores

o         It is possible to do several things with factor analysis results, but the most common are to use factor scores, or to make summated scales based on the factor structure (depends..)
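The "eigenvalue greater than 1" rule and the scree inspection from the retention step above can be sketched in Python. The eigenvalues are hypothetical; real ones would come from the correlation matrix of your data. For a correlation matrix the eigenvalues sum to the number of variables, so an eigenvalue of 1 corresponds to the variance of a single standardized variable.

```python
# Hypothetical eigenvalues from a correlation matrix of 6 variables,
# sorted from largest to smallest. They sum to the number of variables.
eigenvalues = [2.9, 1.4, 0.7, 0.5, 0.3, 0.2]
n_variables = len(eigenvalues)

# "Eigenvalue greater than 1" rule: keep factors that account for more
# variance than a single standardized variable (whose variance is 1).
retained = [ev for ev in eigenvalues if ev > 1.0]

# Proportion of total variance accounted for by each factor.
proportions = [ev / n_variables for ev in eigenvalues]

# Drops between successive eigenvalues; on a scree plot the line
# "straightens" where these drops become small and similar in size.
drops = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]

print(len(retained), [round(p, 2) for p in proportions])
print([round(d, 2) for d in drops])
```

In this made-up example both rules point to two factors: two eigenvalues exceed 1, and the drops level off after the second factor.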


Suppressor variables – A variable that conceals or reduces (suppresses) a relationship between other variables. It may be an independent variable unrelated to the dependent variable but correlated with one or more of the other independent variables. Controlling for the suppressor variable (i.e., removing its influence statistically) raises the correlation between the remaining independent variable(s) and the dependent variable.

·         In multiple regression, a suppressor has zero (or close to zero) correlation with the criterion but is correlated with one or more of the predictor variables; it therefore suppresses irrelevant variance in those independent variables.

·         Example, you are trying to predict the times of runners in a 40 meter dash. Your predictors are height and weight of the runner. Now, assume that height is not correlated with time, but weight is. Also assume that weight and height are correlated. If height is a suppressor variable, then it will suppress, or control for, irrelevant variance (i.e., variance that is shared with the predictor and not the criterion), thus increasing the partial correlation. This can be viewed as ridding the analysis of noise.

·         Another example, suppose we were testing candidates for the job of forest ranger. We are sure that a good forest ranger must know a lot of botany; we also think that verbal ability has no effect on rangers' job performance. We give a written botany test to help us pick good rangers. Because the test is written, candidates with stronger verbal ability may score higher regardless of their botany knowledge. Verbal ability is the suppressor variable. It gets in the way of studying what we are interested in: knowledge of botany and rangers' job competencies.
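The runner example can be worked numerically with the standard formula for the squared multiple correlation with two predictors, R² = (r_y1² + r_y2² − 2·r_y1·r_y2·r_12) / (1 − r_12²). The correlations in this Python sketch are hypothetical values chosen to match the description: weight predicts time, height does not, and height and weight are correlated.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """Squared multiple correlation of a criterion with two predictors,
    computed from the three pairwise correlations."""
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

# Hypothetical correlations for the runner example.
r_time_weight = 0.40    # weight predicts time
r_time_height = 0.00    # height is unrelated to time (the suppressor)
r_weight_height = 0.50  # but height and weight are correlated

weight_only = r_time_weight**2
with_suppressor = r_squared_two_predictors(
    r_time_weight, r_time_height, r_weight_height
)
print(round(weight_only, 3), round(with_suppressor, 3))  # 0.16 0.213
```

Adding height raises the variance explained from .16 to about .21 even though height alone predicts nothing, because it removes the part of weight that is irrelevant to running time.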


Meta-analysis (VG) – takes the results of two or more studies of the same research question and combines them into a single analysis. The purpose of meta-analysis is to gain greater accuracy and statistical power by taking advantage of the large sample size resulting from the accumulation of results over multiple studies. Meta-analysis typically uses the summary statistics from the individual studies, without requiring access to the full data set. Key components of meta-analysis include ensuring the availability of a common metric (statistic) across all studies, and the use of appropriate algorithms for combining or averaging those metrics across studies and assessing statistical significance.

·         Allow a more objective appraisal of the evidence than traditional narrative reviews, provide a more precise estimate of a treatment effect, and may explain heterogeneity between the results of individual studies.

·         Poorly conducted meta-analyses may be biased owing to exclusion of relevant studies or inclusion of inadequate studies. Misleading analyses can generally be avoided if a few basic principles are observed.


Summary points

·         Meta-analysis should be as carefully planned as any other research project, with a detailed written protocol being prepared in advance

·         The a priori definition of eligibility criteria for studies to be included and a comprehensive search for such studies are central to high quality meta-analysis

·         The graphical display of results from individual studies on a common scale is an important intermediate step, which allows a visual examination of the degree of heterogeneity between studies

·         Different statistical methods exist for combining the data, but there is no single "correct" method

·         Meta-analysis involves identifying the source of the variance in effect sizes in order to make statements about what causes effect sizes to be high or low. This requires looking for moderator effects in the magnitude of the effect sizes. One's theory as to what might moderate the effect sizes determines which moderators to search for.


Observational study of evidence:  Meta-analysis should be viewed as an observational study of the evidence. The steps involved are similar to any other research undertaking: formulation of the problem to be addressed, collection and analysis of the data, and reporting of the results.


·         Researchers should write in advance a detailed research protocol that clearly states the objectives, the hypotheses to be tested, the subgroups of interest, and the proposed methods and criteria for identifying and selecting relevant studies and extracting and analyzing information.

·         Eligibility criteria have to be defined for the data to be included.

·         The strategy for identifying the relevant studies should be clearly delineated. In particular, it has to be decided whether the search will be extended to include unpublished studies, as their results may systematically differ from published studies (restricting the search to published evidence may produce distorted results owing to such publication bias).

·         A standardized record form is needed for data collection - independent observers code the data, determine consensus…

·         Standardize reported data

·         Statistical methods for calculating the overall effect (which weight study effect sizes by sample size) can be broadly classified into two models; the difference lies in the way the variability of the results between the studies is treated. The "fixed effects" model considers, often unreasonably, that this variability is exclusively due to random variation; therefore, if all the studies were infinitely large, they would give identical results. The "random effects" model assumes a different underlying effect for each study and takes this into consideration as an additional source of variation, which leads to somewhat wider confidence intervals than the fixed effects model. Effects are assumed to be randomly distributed, and the central point of this distribution is the focus of the combined effect estimate.
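The contrast between the two models can be sketched in Python. This sketch assumes inverse-variance weights (larger studies have smaller sampling variances and so receive more weight) and uses the DerSimonian-Laird method-of-moments estimator for the between-study variance. That estimator is one common choice, not one named in these notes. The effect sizes and variances are hypothetical.

```python
from math import sqrt

def pool(effects, variances, tau2=0.0):
    """Inverse-variance weighted mean and its standard error.
    tau2 is the between-study variance (0 gives the fixed effects model)."""
    weights = [1 / (v + tau2) for v in variances]
    estimate = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return estimate, sqrt(1 / sum(weights))

def dersimonian_laird_tau2(effects, variances):
    """Method-of-moments estimate of between-study variance."""
    w = [1 / v for v in variances]
    fixed, _ = pool(effects, variances)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    return max(0.0, (q - (len(effects) - 1)) / c)

# Hypothetical standardized mean differences and their sampling variances.
effects = [0.10, 0.30, 0.35, 0.65, 0.45]
variances = [0.010, 0.020, 0.015, 0.008, 0.030]

fixed_est, fixed_se = pool(effects, variances)
tau2 = dersimonian_laird_tau2(effects, variances)
random_est, random_se = pool(effects, variances, tau2)

print(round(fixed_est, 3), round(fixed_se, 3))
print(round(random_est, 3), round(random_se, 3), round(tau2, 3))
```

Because the random effects model adds the between-study variance to every study's sampling variance, its standard error (and hence its confidence interval) is at least as wide as the fixed effects one.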


Multi-level Modeling - which includes hierarchical linear modeling (HLM), random coefficients modeling (RC), and covariance components models - is a form of hierarchical regression analysis developed since the 1980s and designed to handle hierarchical and clustered data.

·         Such data involve group effects on individuals which may be assessed invalidly by traditional statistical techniques. That is, when grouping is present (ex., students nested in schools), observations within a group are often more similar than would be predicted on a pooled-data basis, and hence the assumption of independence of observations is violated.

·         Multilevel modeling uses variables at superlevels (ex., school-level budgets) to adjust the regression of base level dependent variables on base level independent variables (ex., predicting student-level performance from student-level socioeconomic status scores).
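One common way to quantify how similar observations within a group are is the intraclass correlation, ICC(1), computed from a one-way ANOVA. The notes above do not name this index, so treat it as an illustrative choice. The Python sketch assumes equal-sized groups and uses hypothetical test scores for students nested in three schools.

```python
from statistics import mean

def icc1(groups):
    """ICC(1) from a one-way ANOVA with equal-sized groups."""
    k = len(groups[0])                       # observations per group
    n_groups = len(groups)
    grand_mean = mean([x for g in groups for x in g])
    ms_between = (
        k * sum((mean(g) - grand_mean) ** 2 for g in groups) / (n_groups - 1)
    )
    ms_within = (
        sum((x - mean(g)) ** 2 for g in groups for x in g)
        / (n_groups * (k - 1))
    )
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical scores: students within a school resemble each other.
schools = [[70, 72, 74, 71], [80, 82, 81, 83], [60, 62, 61, 63]]
icc = icc1(schools)
print(round(icc, 2))
```

An ICC near zero would suggest that grouping can be ignored; a value well above zero, as in this made-up data, signals that the independence assumption is violated and a multilevel approach is warranted.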


Because hierarchical data are extremely common in settings studied by social scientists, multi-level modeling is increasingly popular. Use of multilevel modeling has become widespread in recent years due to the advent of software for generalized linear models.


Multilevel modeling is related to structural equation modeling in that it fits regression equations to the data, then tests alternative models using a likelihood ratio test.


Key concepts and terms

·         Hierarchical data involve measurement at multiple levels such as individual and group as, for example, a study of certain variables studied in terms of individual students' opinions, their classes, and their schools. In fact, much early work on multilevel modeling focused on educational settings.

·         In general, hierarchical data are obtained by measurement of units grouped at different levels, such as a study of children nested within families; employees nested within agencies; soldiers nested within platoons, divisions, and armies; or subjects nested within studies.

·         Aggregation/disaggregation vs. multilevel analysis. The traditional approach to multilevel problems was to aggregate data to a superlevel (ex., student performance scores are averaged to the school level and schools are used as the unit of analysis) or to disaggregate data to the base level (ex., each student is assigned various school-level variables such as funding level per student, and all students in a given school have the same value on these contextual variables, and students are used as the unit of analysis). OLS regression or another traditional technique is then performed on the unit of analysis chosen.

o         There were three problems with this traditional approach: (1) under aggregation, fewer units of analysis at the superlevel replace many units at the base level, resulting in loss of statistical power; (2) under disaggregation, information from fewer units at the superlevel is wrongly treated as if it were independent data for the many units at the base level, and this error in treating sample size leads to over-optimistic estimates of significance; and (3) under either aggregation or disaggregation, there is the danger of the ecological fallacy: there is no necessary correspondence between individual-level and group-level variable relationships (ex., race and literacy correlate little at the individual level but correlate well at the state level, since southern states have many African Americans and many illiterates of all races).

·         Multilevel analysis, in contrast, involves multilevel theory which specifies expected direct effects of variables on each other within any one level, and which specifies cross-level interaction effects between variables located at different levels. That is, the researcher must postulate mediating mechanisms which cause variables at one level to influence variables at another level (ex., school-level funding may positively affect individual-level student performance by way of recruiting superior teachers, made possible by superior financial incentives).

·         Multilevel modeling tests multilevel theories statistically, simultaneously modeling variables at different levels without necessary recourse to aggregation or disaggregation. It should be noted, though, that in practice some variables may represent aggregated scores.


The multilevel model is one with a single dependent variable located at the base level (ex., in education, performance scores at the student level). There may be additional independent variables at the base level also, as in OLS regression. In addition there will be at least one superlevel with at least one explanatory variable each (ex., in education, budget at the school level). In the OLS model, base level data would be analyzed for all groups (ex., all schools) pooled together. However, in a multilevel model one is, in effect, performing the regression for the base-level dependent on the base-level independents separately for each group. This results in different regression coefficients and different intercepts (ex., different for each school).
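The difference between pooled and group-specific regression lines can be illustrated with a short Python sketch. It fits a separate least-squares line for each of two hypothetical schools and one pooled line for all students. This only illustrates the idea of varying intercepts and slopes; actual multilevel software estimates the group-level coefficients jointly rather than running literally separate regressions.

```python
from statistics import mean

def ols_line(x, y):
    """Intercept and slope of a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical (SES, performance) data for students in two schools.
schools = {
    "school_a": ([1, 2, 3, 4], [52, 54, 56, 58]),   # performance = 50 + 2 * SES
    "school_b": ([1, 2, 3, 4], [65, 70, 75, 80]),   # performance = 60 + 5 * SES
}

per_school = {name: ols_line(x, y) for name, (x, y) in schools.items()}

# Pooled OLS ignores the grouping and fits a single line to all students.
pooled_x = [v for x, _ in schools.values() for v in x]
pooled_y = [v for _, y in schools.values() for v in y]
pooled = ols_line(pooled_x, pooled_y)

print(per_school, pooled)
```

The pooled line falls between the two school lines and describes neither school well, which is the motivation for letting intercepts and slopes vary by group.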


Model fit is assessed in multilevel modeling using deviance, a statistic which follows the chi-square distribution. Lower deviance corresponds to better fit. As many models may "fit" the data, it is usually more meaningful if one obtains deviance for a full model and for a nested model excluding some effects.

·         A chi-square difference test can then be performed to see if the fit of the full model is significantly different from the fit of the nested model. If it is not, then the more parsimonious nested model is usually preferred.
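The chi-square difference test can be sketched in Python as follows. The deviances and parameter counts are hypothetical, and the lookup table holds the standard .05 critical values of the chi-square distribution for 1 to 4 degrees of freedom.

```python
# Critical values of the chi-square distribution at alpha = .05.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def deviance_test(dev_nested, k_nested, dev_full, k_full):
    """Chi-square difference test for two nested models.
    k_* is the number of estimated parameters in each model."""
    diff = dev_nested - dev_full       # the nested model never fits better
    df = k_full - k_nested
    return diff, df, diff > CHI2_CRIT_05[df]

# Hypothetical deviances: the full model adds 2 parameters.
diff, df, full_is_better = deviance_test(
    dev_nested=1250.4, k_nested=4, dev_full=1238.1, k_full=6
)
print(round(diff, 1), df, full_is_better)  # 12.3 2 True
```

Here the drop in deviance (12.3) exceeds the critical value for 2 degrees of freedom (5.991), so the extra effects in the full model improve fit significantly. Had it not, the nested model would be preferred.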




Utility -


Two-tailed test:  A two-tailed test is a hypothesis test in which the null hypothesis is rejected if the observed sample statistic is more extreme than the critical value in either direction (higher than the positive critical value or lower than the negative critical value). A two-tailed test has two critical regions.

·         2-tailed vs. 1-tailed tests: The purpose of a hypothesis test is to avoid being fooled by chance occurrences into thinking that the effect you are investigating (for example, a difference between treatment and control) is real. If you are investigating, say, the difference between an existing process and a (hopefully improved) new process, observed results that don't show an improvement would not interest you, so you do not need to protect yourself against being fooled by "negative" effects, no matter how extreme. A 1-tailed test would be appropriate. If, on the other hand, you are interested in discerning a difference between samples A and B (regardless of which direction the difference goes), a 2-tailed test would be appropriate.
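A short Python sketch of how the same test statistic yields different p-values under the two approaches. It assumes a z statistic (standard normal under the null hypothesis); the value 1.80 is hypothetical.

```python
from statistics import NormalDist

def p_values(z):
    """One-tailed (upper) and two-tailed p-values for a z statistic."""
    upper_tail = 1 - NormalDist().cdf(z)
    two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return upper_tail, two_tailed

one_tailed, two_tailed = p_values(1.80)
print(round(one_tailed, 3), round(two_tailed, 3))  # 0.036 0.072
```

At the .05 level this result is significant with a 1-tailed test but not with a 2-tailed test, which is why the choice between them should be made before looking at the data.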


Type I error: In a test of significance, type I error is the error of rejecting the null hypothesis when it is true -- of saying an effect or event is statistically significant when it is not. The projected probability of committing type I error is called the level of significance. For example, for a test comparing two samples, a 5% level of significance (p <= .05) means that when the null hypothesis is true (i.e., the two samples are part of the same population), your test will conclude "there's a significant difference between the samples" 5% of the time.
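The "5% of the time" claim can be checked by simulation. This Python sketch repeatedly draws two samples from the same normal population and applies a two-tailed z-test. The sample size, number of trials, and seed are arbitrary choices, and the population standard deviation is assumed known to keep the test simple.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)
ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)   # about 1.96
N, TRIALS = 30, 4000

rejections = 0
for _ in range(TRIALS):
    # Both samples come from the same population, so the null is true.
    a = [random.gauss(0, 1) for _ in range(N)]
    b = [random.gauss(0, 1) for _ in range(N)]
    z = (sum(a) / N - sum(b) / N) / sqrt(2 / N)   # population SD known to be 1
    if abs(z) > Z_CRIT:
        rejections += 1                           # a Type I error

type_i_rate = rejections / TRIALS
print(type_i_rate)
```

The proportion of false rejections should land close to .05, with some variation from run to run if the seed is changed.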


Type II error: In a test of significance, type II error is the error of accepting the null hypothesis when it is false -- of failing to declare a real difference as statistically significant. Obviously, the bigger your samples, the more likely your test is to detect any difference that exists. The probability of detecting a real difference of specified size (i.e., of not committing a type II error) is called the power of the test.
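The link between sample size and power can be sketched in Python for the simplest case, a two-tailed one-sample z-test with known standard deviation. The effect size of 0.3 standard deviations and the sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_two_tailed_z(effect_size, n, alpha=0.05):
    """Power of a two-tailed one-sample z-test, where effect_size is the
    true mean difference in standard deviation units."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n)
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)

for n in (10, 30, 100):
    power = power_two_tailed_z(0.3, n)
    print(n, round(power, 2), round(1 - power, 2))   # n, power, Type II rate
```

Power rises and the Type II error rate falls as the sample size grows. With no true effect (effect_size = 0), the function returns alpha, the Type I error rate.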

Variance: Variance is a measure of dispersion. It is the average squared distance between the mean and each item in the population or in the sample.

·         An advantage of variance (as compared to the related measure of dispersion - the standard deviation) is that the variance of a sum of independent random variables is equal to the sum of their variances.

·         Note: when using the sample variance to estimate the population variance, the divisor (n-1) is typically used instead of (n) to calculate the average. The latter results in a biased estimate; the former is unbiased.


Standard deviation: The standard deviation is a measure of dispersion. It is the positive square root of the variance.

·         An advantage of the standard deviation (as compared to the variance) is that it expresses dispersion in the same units as the original values in the sample or population. For example, the standard deviation of a series of measurements of temperature is measured in degrees; the variance of the same set of values is measured in "degrees squared".

·         Note: when using the sample standard deviation to estimate the population standard deviation, the divisor (n-1) is typically used instead of (n) to calculate the average. Using (n-1) reduces the bias of the estimate.
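The points about divisors and about adding variances can be checked with Python's statistics module, where pvariance and pstdev divide by n while variance and stdev divide by n-1. The data are hypothetical. Pairing every value of X with every value of Y makes the two variables independent by construction.

```python
from itertools import product
from math import isclose
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical measurements, mean = 5

var_n, sd_n = pvariance(data), pstdev(data)     # divisor n
var_n1, sd_n1 = variance(data), stdev(data)     # divisor n - 1
print(var_n, sd_n, round(var_n1, 3), round(sd_n1, 3))

# Variances of independent variables add; standard deviations do not.
x_values, y_values = [1, 2, 3], [10, 20]
sums = [x + y for x, y in product(x_values, y_values)]
variances_add = isclose(pvariance(sums), pvariance(x_values) + pvariance(y_values))
sds_add = isclose(pstdev(sums), pstdev(x_values) + pstdev(y_values))
print(variances_add, sds_add)
```

The n-1 versions are slightly larger than the n versions, and the difference shrinks as the sample grows.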


Nominal: allows for only qualitative classification (i.e., categorical). Ex: race, gender, color, city


Ordinal: allow us to rank-order items that are measured in terms of which has less/more, but not “how much more”…Ex: upper-middle class is higher than middle-class, but we don’t know exactly how much (like 18%)...


Interval scale:  a measurement scale in which a certain distance along the scale means the same thing no matter where on the scale you are, but where "0" on the scale does not represent the absence of the thing being measured. Fahrenheit and Celsius temperature scales are examples.


Ratio scale: a measurement scale in which a certain distance along the scale means the same thing no matter where on the scale you are, and where "0" on the scale represents the absence of the thing being measured. Thus a "4" on such a scale implies twice as much of the thing being measured as a "2."
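The interval/ratio distinction can be illustrated with temperature conversions in Python. Celsius and Fahrenheit are interval scales with arbitrary zero points; Kelvin is a ratio scale because its zero represents the absence of thermal energy. The two temperatures used are arbitrary.

```python
def c_to_f(celsius):
    return celsius * 9 / 5 + 32

def c_to_k(celsius):
    return celsius + 273.15

low_c, mid_c, high_c = 10.0, 20.0, 30.0

# Interval property: equal distances stay equal after a change of units.
equal_steps_c = (mid_c - low_c) == (high_c - mid_c)
equal_steps_f = (c_to_f(mid_c) - c_to_f(low_c)) == (c_to_f(high_c) - c_to_f(mid_c))

# Ratios on an interval scale depend on the arbitrary zero point.
ratio_c = mid_c / low_c                    # 2.0
ratio_f = c_to_f(mid_c) / c_to_f(low_c)    # 68 / 50 = 1.36

# Kelvin has a true zero, so this ratio is physically meaningful.
ratio_k = c_to_k(mid_c) / c_to_k(low_c)    # about 1.035

print(equal_steps_c, equal_steps_f, ratio_c, ratio_f, round(ratio_k, 3))
```

20 degrees Celsius is "twice" 10 degrees only as an artifact of where the Celsius zero happens to sit; the same two temperatures give a different ratio in Fahrenheit, so "twice as much" statements require a ratio scale.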