NSCC 328 - STATISTICAL SOFTWARE
Lesson 2. Working with SPSS
REVIEW ON PARAMETRIC AND NONPARAMETRIC TESTS
Nonparametric tests don’t require that your data follow the normal distribution. They’re also known as distribution-free tests and can provide benefits in certain situations. Typically, people who perform statistical hypothesis tests are more comfortable with parametric tests than with nonparametric tests.
You’ve probably heard it’s best to use nonparametric tests if your data are not normally distributed—or something along these lines. That seems like an easy way to choose, but there’s more to the decision than that.
In this lesson, we compare the advantages and disadvantages to help you decide between using the following types of statistical hypothesis tests:
Parametric analyses to assess group means
Nonparametric analyses to assess group medians
Related Pairs of Parametric and Nonparametric Tests
Nonparametric tests are a shadow world of parametric tests. In the table below, see the linked pairs of statistical hypothesis tests.
ADVANTAGES OF PARAMETRIC TESTS
Advantage 1:
Parametric tests can provide trustworthy results with distributions that are skewed and non-normal.
Many people aren’t aware of this fact, but parametric analyses can produce reliable results even when your continuous data are nonnormally distributed. You just have to be sure that your sample size meets the requirements for each analysis in the table below. Simulation studies have identified these requirements.
Advantage 2:
Parametric tests can provide trustworthy results when the groups have different amounts of variability
It’s true that nonparametric tests don’t require data that are normally distributed. However, nonparametric tests have the disadvantage of an additional requirement that can be very hard to satisfy. The groups in a nonparametric analysis typically must all have the same variability (dispersion). Nonparametric analyses might not provide accurate results when variability differs between groups.
Conversely, parametric analyses, like the 2-sample t-test or one-way ANOVA, allow you to analyze groups with unequal variances. In most statistical software, it’s as easy as checking the correct box! You don’t have to worry about groups having different amounts of variability when you use a parametric analysis.
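To make the "unequal variances" option concrete, here is a minimal sketch of the calculation that such a checkbox usually corresponds to: Welch's version of the 2-sample t-test. The function name `welch_t` and the two small samples are hypothetical and only serve as illustration. The sketch stops at the test statistic and its degrees of freedom and does not compute a p-value.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Sample variances (n - 1 in the denominator), estimated separately per group.
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    n_a, n_b = len(a), len(b)
    # Squared standard error contributed by each group.
    se2_a, se2_b = var_a / n_a, var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical samples: similar size, very different spread.
group_a = [10, 12, 11, 13, 12, 11]
group_b = [5, 20, 9, 25, 2, 17]

t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

Because the variances are never pooled, the group with the larger spread dominates the standard error, and the degrees of freedom shrink toward the smaller of the two values n - 1 instead of staying at n_a + n_b - 2.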
Advantage 3:
Parametric tests have greater statistical power
In most cases, parametric tests have more power. If an effect actually exists, a parametric analysis is more likely to detect it.
ADVANTAGES OF NONPARAMETRIC TESTS
Advantage 1:
Nonparametric tests assess the median which can be better for some study areas
Now we come to the preferred reason for using a nonparametric test, and one that many don’t discuss frequently enough!
For some datasets, nonparametric analyses provide an advantage because they assess the median rather than the mean. The mean is not always the better measure of central tendency for a sample. Even though you can perform a valid parametric analysis on skewed data, that doesn’t necessarily equate to being the better method.
Consider, for example, the distribution of salaries.
Salaries tend to be a right-skewed distribution. The majority of wages cluster around the median, which is the point where half are above and half are below. However, there is a long tail that stretches into the higher salary ranges. This long tail pulls the mean far away from the central median value. The two distributions are typical for salary distributions.
In these distributions, if several very high-income individuals join the sample, the mean increases by a significant amount despite the fact that incomes for most people don’t change. They still cluster around the median.
In this situation, parametric and nonparametric tests can give you different results, and they both can be correct! For the two distributions, if you draw a large random sample from each population, the difference between the means is statistically significant. Despite this, the difference between the medians is not statistically significant. Here’s how this works.
For skewed distributions, changes in the tail affect the mean substantially. Parametric tests can detect this mean change. Conversely, the median is relatively unaffected, and a nonparametric analysis can legitimately indicate that the median has not changed significantly.
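The effect of a long right tail on the mean can be checked with a few lines of code. The sketch below uses made-up monthly salary figures (they are assumptions, not data from this lesson) and compares the mean and the median before and after two very high earners join the sample.

```python
import statistics

# Hypothetical monthly salaries that cluster around the median.
salaries = [2100, 2300, 2400, 2500, 2600, 2700, 2900, 3100, 3400]
# Two hypothetical very high earners joining the sample.
high_earners = [25000, 40000]

before_mean = statistics.mean(salaries)
before_median = statistics.median(salaries)

after = salaries + high_earners
after_mean = statistics.mean(after)
after_median = statistics.median(after)

print(f"mean:   {before_mean:.0f} -> {after_mean:.0f}")
print(f"median: {before_median:.0f} -> {after_median:.0f}")
```

With these figures the mean roughly triples while the median moves by a single position in the ordered list, which is the pattern described above: a test on means would register the change in the tail, while a test on medians would not.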
You need to decide whether the mean or median is best for your study and which type of difference is more important to detect.
Advantage 2:
Nonparametric tests are valid when your sample size is small and your data are potentially non-normal
Use a nonparametric test when your sample size isn’t large enough to satisfy the requirements in the table above and you’re not sure that your data follow the normal distribution. With small sample sizes, be aware that normality tests can have insufficient power to produce useful results.
This situation is difficult. Nonparametric analyses tend to have lower power at the outset, and a small sample size only exacerbates that problem.
Advantage 3:
Nonparametric tests can analyze ordinal data, ranked data, and outliers
Parametric tests can analyze only continuous data, and the findings can be overly affected by outliers. Conversely, nonparametric tests can also analyze ordinal and ranked data, and they are not tripped up by outliers.
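One way to see why rank-based tests resist outliers is to compute a rank-based statistic by hand. The sketch below calculates the Mann-Whitney U statistic by counting pairs. The helper `mann_whitney_u` and the three small samples are hypothetical, and no p-value is computed. It then replaces one value with an extreme outlier.

```python
import statistics


def mann_whitney_u(a, b):
    """U statistic for sample a: number of (x, y) pairs with x > y; ties count 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Hypothetical samples; the third list swaps one value for an extreme outlier.
control = [12, 15, 14, 10, 13]
treated = [16, 18, 17, 19, 21]
treated_outlier = [16, 18, 17, 19, 2100]

u_plain = mann_whitney_u(treated, control)
u_outlier = mann_whitney_u(treated_outlier, control)

diff_plain = statistics.mean(treated) - statistics.mean(control)
diff_outlier = statistics.mean(treated_outlier) - statistics.mean(control)

print(f"U: {u_plain} vs {u_outlier}")
print(f"mean difference: {diff_plain:.1f} vs {diff_outlier:.1f}")
```

U depends only on the ordering of the values, so making the largest observation even larger leaves it unchanged, whereas the difference in means grows with the size of the outlier.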
Sometimes you can legitimately remove outliers from your dataset if they represent unusual conditions. However, sometimes outliers are a genuine part of the distribution for a study area, and you should not remove them.
You should verify the assumptions for nonparametric analyses because the various tests can analyze different types of data and have differing abilities to handle outliers.
Advantages and Disadvantages of Parametric and Nonparametric Tests
Many people believe that choosing between parametric and nonparametric tests depends on whether your data follow the normal distribution. If you have a small dataset, the distribution can be a deciding factor. However, in many cases, this issue is not critical because of the following:
Parametric analyses can analyze non-normal distributions for many datasets.
Nonparametric analyses have other firm assumptions that can be harder to meet.
The answer is often contingent upon whether the mean or median is a better measure of central tendency for the distribution of your data.
If the mean is a better measure and you have a sufficiently large sample size, a parametric test usually is the better, more powerful choice.
If the median is a better measure, consider a nonparametric test regardless of your sample size.
EXAMPLES OF WIDELY USED PARAMETRIC TESTS
Examples of widely used parametric tests include the paired and unpaired t-test, Pearson’s product-moment correlation, Analysis of Variance (ANOVA), and multiple regression. These tests have their counterpart non-parametric tests, which are applied when there is uncertainty or skewness in the distribution of populations under study.
In this digital age, statistical software applications are readily available for analyzing our data. Hence, the critical skill to learn in this module is discerning when a particular parametric test is appropriate. To simplify the choice, the diagram in Figure 1 shows the situations in which a specific statistical test is used when dealing with ratio or interval data.
LESSON I. OVERVIEW ON SPSS
SPSS means “Statistical Package for the Social Sciences” and was first launched in 1968. Since SPSS was acquired by IBM in 2009, it's officially known as IBM SPSS Statistics but most users still just refer to it as “SPSS”.
SPSS is software for editing and analyzing all sorts of data. These data may come from basically any source: scientific research, a customer database, Google Analytics or even the server log files of a website. SPSS can open all file formats that are commonly used for structured data such as
spreadsheets from MS Excel or OpenOffice;
plain text files (.txt or .csv);
relational (SQL) databases;
Stata and SAS.
Once a data file is open, SPSS shows it in a spreadsheet-like sheet called Data View, which always displays our data values. For instance, our first record seems to contain a male respondent from 1979 and so on.
An SPSS data file always has a second sheet called variable view. It shows the metadata associated with the data. Metadata is information about the meaning of variables and data values. This is generally known as the “codebook” but in SPSS it's called the dictionary.
For non-SPSS users, the look and feel of SPSS’ Data Editor window probably come closest to an Excel workbook containing two different but strongly related sheets.
DATA ANALYSIS
SPSS can open all sorts of data and display them -and their metadata- in two sheets in its Data Editor window. So how to analyze your data in SPSS? Well, one option is using SPSS’ elaborate menu options.
For instance, if our data contain a variable holding respondents’ incomes over 2010, we can compute the average income by navigating to Descriptive Statistics as shown below.
Doing so opens a dialog box in which we select one or many variables and one or several statistics we'd like to inspect.
SPSS OUTPUT WINDOW
After clicking Ok, a new window opens up: SPSS’ output viewer window. It holds a nice table with all statistics on all variables we chose. The screenshot below shows what it looks like.
As we see, the Output Viewer window has a different layout and structure than the Data Editor window we saw earlier. Creating output in SPSS does not change our data in any way; unlike Excel, SPSS uses different windows for data and research outcomes based on those data.
TABLES AND CHARTS
All basic tables and charts can be created easily and fast in SPSS. Typical examples are demonstrated under Data Analysis. A real weakness of SPSS is that its charts tend to be ugly and often have a clumsy layout. A great way to overcome this problem is developing and applying SPSS chart templates. Doing so, however, requires a fair amount of effort and expertise.
INFERENTIAL STATISTICS
SPSS contains all basic statistical tests and multivariate analyses such as:
t-tests;
chi-square tests;
ANOVA;
correlations and other association measures;
regression;
nonparametric tests;
factor analysis;
cluster analysis.
Some analyses are available only after purchasing additional SPSS options on top of the main program.
SAVING DATA AND OUTPUT
SPSS data can be saved as a variety of file formats, including
MS Excel;
plain text (.txt or .csv);
Stata;
SAS.
The options for output are even more elaborate: charts are often copy-pasted as images in .png format. For tables, rich text format is often used because it retains the tables’ layout, fonts and borders.
Besides copy-pasting individual output items, all output items can be exported in one go to .pdf, HTML, MS Word and many other file formats. A terrific strategy for writing a report is creating an SPSS output file with nicely styled tables and charts. Then export the entire document to Word and insert explanatory text and titles between the output items.
LESSON 2: SHORT COURSE ON SPSS
SPSS DATA EDITOR WINDOW
In SPSS, we usually work from 3 windows. These are:
the Data Editor window;
the Syntax Editor window;
the Output Viewer window.
SPSS’ main window is the data editor. This is the only window that's always open when we run SPSS. Although it's called “data editor”, we use it only for inspecting our data. We strongly recommend you never edit data in the data editor. The right way to edit data -and way faster too- is by using syntax.
SPSS DATA VIEW AND VARIABLE VIEW
An SPSS data file always has two tabs in the left bottom corner:
Data View is where we inspect our actual data and
Variable View is where we see additional information about our data.
You can switch between Data View and Variable View by
clicking the tabs in the left bottom corner;
using the Ctrl + t shortcut key;
double-clicking a variable name in Data View;
double-clicking an outline number in Variable View.
Let's first take a close look at the main parts of the Data View tab. We'll then proceed with variable view.
SPSS DATA VIEW
[1] The data editor has tabs for switching between Data View and Variable View. For now, make sure you're in Data View.
[2] Columns of cells are called variables. Each variable has a unique name (“gender”) which is shown in the column header.
[3] Rows of cells are called cases. Oftentimes, each respondent in a study is represented as a single case.
[4] In SPSS, values refer to cell contents.
[5] The status bar may give useful information on the data, for instance whether a WEIGHT, FILTER, SPLIT FILE or Unicode mode is in effect.
SPSS VARIABLE VIEW
[1] In the left bottom corner we find tabs for switching between Variable View and Data View. For now, select Variable View.
[2] In Variable View, variables are shown as rows of cells.
[3] The first column shows the variable name for each variable.
[4] The fifth column may or may not contain a variable label. This describes the exact meaning of each variable.
[5] The sixth column shows value labels: descriptions of the meaning of one, many or all values that a variable may contain.
In short, Variable View does not show the data itself but, rather, information about the data. This is sometimes called “metadata” or “the codebook”. In SPSS, however, it's called the dictionary.
This is important to know because you may find commands like DISPLAY DICTIONARY or APPLY DICTIONARY in manuals. If you're familiar with syntax, running DISPLAY DICTIONARY creates the output shown below: dictionary information as seen in variable view.
For some variables, it's immediately clear what their values mean: a value of € 2500,- in a variable “gross monthly income” represents a gross monthly income of € 2500,-.
This is not always the case, however: answer categories for categorical variables are often represented by numbers, usually 1 through x. What these values represent is then stored in their value labels. Clicking the open value labels icon for education_type displays all value labels for this variable. These value labels tell us that a value of 1 on education_type indicates somebody who studied “Law”. In a similar vein, “Economy” is represented by a value of 2, and so on.
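Conceptually, value labels are a lookup table from stored codes to their meanings. The Python sketch below is only an analogy and is not SPSS syntax. It uses the two labels named in the text for education_type, and the list of stored codes, including the unlabeled code 3, is made up for illustration.

```python
# Value labels for education_type; only codes 1 and 2 are named in the text.
value_labels = {1: "Law", 2: "Economy"}

# Hypothetical stored data values for five respondents.
education_type = [1, 2, 2, 1, 3]

# Show the label where one exists and fall back to the raw value otherwise.
labeled = [value_labels.get(code, code) for code in education_type]
print(labeled)
```

The stored data remain numeric throughout. The labels only change how the values are displayed, which is why they belong to the dictionary rather than to the data values.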
Dictionary Information in Data View
Thus far, we explained that SPSS’ Data Editor always has 2 tabs:
Data View in which we inspect our actual data values and
Variable View in which we find information about our data -dictionary information.
Little known to many SPSS users is that we can see some dictionary information in Data View too. Let's start off with value labels. Initially, we just see data values in Data View as shown below.
Now, if we click the value labels icon we'll see value labels instead of data values in data view.
So, this allows you to look up what your data mean without having to switch between Data View and Variable View. Perhaps even more useful: place your mouse pointer on a variable name in Data View without clicking it. Now a yellow box with a lot of dictionary information pops up for a few seconds.
Starting from SPSS version 22, icons next to variable names tell us something about our variable types, formats and measurement levels -if correctly set, that is.
So basically, “data” consist of 2 components:
data values which we see in Data View and
dictionary information about our data in Variable View.
We can save the contents of the Data Editor as an SPSS data file or .sav file. If we do so, the resulting file always contains everything in both Data View and Variable View.
Let's reemphasize that you should never, under any circumstances, edit anything manually in either Data View or Variable View. This is perhaps the single worst SPSS practice. And yes, I know. Many SPSS users do this anyway. But most will sooner or later wish they hadn't.
The only sound way to edit your data or dictionary information is by syntax. So, let's move on to our next tutorial: SPSS Syntax Introduction.
SPSS Syntax Introduction
SPSS syntax is a language containing instructions for analyzing and editing data and other SPSS commands.
SPSS users working directly from the menu may not actually see the syntax they're running. However, this is a terrible practice.
How to paste SPSS syntax?
Now let's suppose I'd like to gain some insight into the percentages of male and female respondents. I could first navigate to Analyze > Descriptive Statistics > Frequencies as shown below.
I'll now move gender into the variable box and perhaps request a bar chart as well.
Now clicking Ok may seem the obvious thing to do. A much better idea, however, is to click the Paste button. Upon doing so, a new SPSS window opens which is known as the Syntax Editor. It's recognized by the orange icon in its left top corner.
The Syntax Editor contains a FREQUENCIES command which holds the instructions we just gave SPSS in the Frequencies dialog. However, we don't see the frequency distribution and bar chart we asked for. This is because we still need to run the command we just created.
How to run SPSS syntax?
The simplest way to run syntax is to select the command(s) you'd like to run and click the “run selection” icon in your toolbar.
A faster way to run syntax is to use several shortcut keys, especially
F2 for selecting the command in which your mouse pointer is located;
Ctrl + a for selecting all syntax;
Ctrl + r for running all selected commands.
So, let's now run our pasted syntax. On doing so, a new window will open, containing our frequency table and bar chart.
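For readers who want to see what such a frequency table contains, the following Python sketch builds the counts and percentages for a gender variable. It is an analogy and not SPSS syntax, and the eight responses are hypothetical.

```python
from collections import Counter

# Hypothetical responses for the variable gender.
gender = ["female", "male", "female", "female", "male", "female", "male", "female"]

counts = Counter(gender)
total = sum(counts.values())

# Map each category to its (frequency, percent) pair, as in a frequency table.
table = {value: (n, 100 * n / total) for value, n in counts.items()}

for value, (n, pct) in sorted(table.items()):
    print(f"{value:<8}{n:>5}{pct:>8.1f}")
```

Like pasted syntax, a short script of this kind records exactly which steps produced the table, so the result can be reproduced or corrected later.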
Why even use SPSS syntax?
The single best SPSS practice is doing everything from syntax. Some reasons for this are:
you'll always know exactly which steps you took in which order so you can prove that your results are correct;
if you made some mistake (don't we all sometimes?) you can correct it and rerun everything you did in just seconds;
you'll work way faster than from the menu and you never have to do things twice;
some of the best SPSS tricks and time savers are available as syntax only.