Data preparation refers to the cleaning and organizing of data before conducting the analysis. Real-world data sets often contain missing data, invalid data, and various other problems. It is therefore essential to eliminate these problems before analyzing the data. Remember the phrase: “Garbage in, Garbage out”. If you start with problematic data, your analysis will very likely give you problematic results.
Data preparation is often much more time consuming than data analysis itself. Reserve more time for this part.
Because of different design philosophies and research practices, the tasks to do in data preparation differ slightly in SPSS vs. Jamovi. Below I list the major tasks in both pieces of software. Please also see my other notes and an FAQ below.
Prepare all the raw data in paper form or other electronic format.
Code or recode all the data into numeric using Excel (or defer to later if you want to do it in SPSS).
Input or import all the data into SPSS. Save the data sets to SPSS data files. (See Basics for instructions on importing data with non-English characters.)
Create a coding scheme to list the level of measurement and valid values for each of the variables.
Specify all the required variable information in the variable view of the SPSS data editor, especially the levels of measurement (See "Configuring the Variables" below).
Check for issues in the data and fix all of them. Suggested tasks:
Create case numbers for paper-based questionnaires.
Rename the variables either by the question numbers (e.g., Q1, Q2, Q3, etc.) or a short word describing the question (e.g., age, gender). No space or special character is allowed in the name.
If you haven't done it in Excel before data import, then recode the data as numeric and add value labels (for the nominal and ordinal variables) if necessary (See "Coding and Recoding Data" below).
Set the correct levels of measurements.
Recode multiple response items, if any. (Or, if you have collected your data using Google Form, then you can skip this step and instead obtain the frequencies of the options given by the statistics in Google Form. That will save you some time.)
Check all the missing values for (for nominal and ordinal variables only). If required, code these missing values (e.g., 88) as well and add the code to the “missing” field in the variables view (although this is not a recommended practice - See "Coding and Recoding Data" below).
If necessary, create new variables to summarize some of the existing variables (See "Computing the Averages of Questionnaire items" below).
Look for unexpected missing or invalid values.
Check for any other possible issues in the data.
SAVE your data file.
Prepare all the raw data in paper form or other electronic format.
In Jamovi (and R) I don't recommend recoding the data into numeric anymore, for the reasons stated in "Coding and Recoding Data" below. If you insist, recode can be done by Transform but it is not very convenient because value labels cannot be assigned to the transformed variables. Therefore, it is suggested that these should be done in Excel first before importing into Jamovi.
Input or import all the data into Jamovi. Save the data sets to Jamovi data files (*.omv).
Create a coding scheme to list the level of measurement and valid values for each of the variables. This is a good thing to do even if you don't recode the data into numerics anymore.
Specify all the required variable information in the Jamovi data editor, especially the levels of measurement. The variable settings can be invoked by double-clicking on the column heading of the variable (See "Configuring the variables" below).
Check for issues in the data and fix all of them. Suggested tasks:
Create case numbers for paper-based questionnaires.
Rename the variables either by the question numbers (e.g., Q1, Q2, Q3, etc.) or a short word describing the question (e.g., age, gender). Space is allowed in the name, but I do not recommend it.
In the variable settings, reorder the levels or add value labels (for the nominal and ordinal variables) in the Levels box if necessary. (See "Coding and Recoding Data" below).
Set the correct levels of measurements.
Recode multiple response items, if any. (Or, if you have collected your data using Google Form, then you can skip this step and instead obtain the frequencies of the options given by the statistics in Google Form. That will save you some time.)
Check all the missing values for (for nominal and ordinal variables only). If required, code these missing values (e.g., 88) as well and add the code to the “missing” field in the variables view (although this is not a recommended practice - See "Coding and Recoding Data" below).
If necessary, create new variables to summarize some of the existing variables (See "Computing the Averages of Questionnaire items" below).
Look for unexpected missing or invalid values.
Check for any other possible issues in the data.
SAVE your data file.
Set the variable properties using the Variable View:
Variable View->Set the variable properties.
Rename the variables to meaningful ones.
Reorder the variables if necessary.
Set the Width and Decimals accordingly.
Variable Names cannot contain space. Use Labels if necessary to give a more detailed description to the variable.
Use Values to assign labels to particular variable values if necessary.
Define custom Missing values if needed. This is important if you have used some code (e.g., 88) to represent missing values in the data.
Set the appropriate measurement level under Measure.
Set the variable properties by double-clicking on the column headers in the Data tab:
Rename the variables to meaningful ones.
Add description if necessary to give a more detailed description to the variable.
In Levels, set the descriptions for the particular variable values if necessary.
Set the data type.
Define custom Missing values if needed. This is important if you have used some code (e.g., 88) to represent missing values in the data.
Set the appropriate measurement level under Measure type.
Alternatively, you can also set up the variables in the Variables tab.
Once upon a time it was considered a good practice to represent all the nominal and ordinal data in numbers in the data set. This practice was desirable when data was manually input into the software from paper-based questionnaires. Moreover, it is necessary in older-generation statistical packages such as SPSS because otherwise some of the procedures will not work.
In contrast, with online questionnaires becoming more common nowadays, manual data input is rare. The more modern statistical packages (e.g., R, Jamovi, etc.) can deal with the non-numerical data quite well and thus recoding is no longer necessary. I therefore suggest that this old practice should be abandoned unless you are doing manual data input, or you are using older-generation software such as SPSS, or it is required by your supervisor or assessment criteria.
Following the practice above, it was also common to code missing values with a number, e.g., 88. Again I consider this practice obsolete because even the older-generation statistical packages such as SPSS could handle missing data in the data set very well. Therefore, if you enter data manually, simply leave the cell empty in case of missing data. If you use online questionnaire, the cell will already be empty in the data set.
In contrast, if you intend to follow the old practice, then you will need to manually recode all the missing values in the data set to a number, e..g, 88. Then, you need to specify in SPSS, Jamovi, etc. that the number 88 is a missing value, or otherwise the software will treat it as yet another valid value, creating problems in the various analyses. Since this old practice is time-consuming, risky, and unnecessary, I suggest that it should be abandoned unless it is required by your supervisor or assessment criteria.
Nevertheless, you may follow the procedures to code or recode your data:
If you already have your data in digital form and you need to convert all of them to numeric, then you can do a recode in SPSS. An alternative is to recode the variables using search and replace in Excel before importing your data.
I consider the Excel approach easier, faster and more convenient. However, if you need to do it in SPSS:
Transform->Recode into Same Variables->Select the variables to recode (Note: Can only select the same type of variables with the same set of options in one batch)->Click Old and New Values->Input the old value and new value->Add->Repeat for all the options->Continue
If your variables use the same set of options (e.g., All of them use the same Likert scale from Strongly Disagree to Strongly Agree), you can recode them together. When you move on to other variables that use a different set of options (e.g., Male and Female), remember to click the Reset button in the recode dialogue to clear the old and new values from the previous recode.
Note the following during recode:
You can repeat the codes in different questions. E.g., You may use 1=Male and 2=Female in Q5, and then 1=Chinese, 2=British, 3=American in Q6. It doesn’t matter that the 1 and 2 are reused in Q6.
If you do anything wrong during recode, you can copy that column of data from Excel to SPSS and start over.
When specifying the old values, you are suggested to copy the option name from Excel instead of typing it to ensure exact match. This is because, for example, “Male” is not the same as “Male “ (with the space after it) in SPSS. If you type “Male” in the old value while in fact it is “Male “ in the data, SPSS will not recognize it, and the recode will fail.
Alternatively, choose Recode into Different Variables or Automatic Recode instead and repeat the above. But note that if it is ordinal data, then Automatic Recode works well only if all your options are already in alphabetical order.
Add value labels. You may speed up this task using Data->Copy data properties.
Check against the original data to verify the changes.
Convert the variable types to numeric. This step will also convert those empty cells into missing values recognized by SPSS. VERY IMPORTANT: Never set the type of non-numeric data directly to Numeric. Otherwise ALL YOUR DATA under that variable will be turned to “missing”. Recode first before changing types.
For reasons stated above, coding the data to numeric is not recommended in Jamovi. You may instead simply specify the order of the options in the case of ordinal variables:
Double-click on the column heading of the variable.
Next to the Levels box, use the up and down arrow buttons to put the options in their desirable order.
You can also assign value labels to the options by clicking and typing on the values in the Levels box.
Anyway, if you already have your data in digital form, and yet for some reason you really need to code the answers into numbers, then you can use Transform in Jamovi. An alternative is to recode the variables using search and replace in Excel before importing your data.
Since Jamovi is not yet very good at recode, I would recommend the Excel approach.
Anyway, here are the procedures if you really need to do the recode in Jamovi:
Highlight the variable to be transformed, and then click Data->Transform. A new column will be added to the right of this variable.
Give a name to the new variable and add descriptions if needed.
Set the original variable as the source variable.
In “using transform”, choose an existing transform or create a new one.
For the latter, Set a name for the transform, and add recode conditions. For example, to transform / recode all 1 into 2, use the formula if $source == 1 use 2. To transform / recode all ‘Male’ into ‘M’, use the formula if $source==’Male’ use ‘M’. You may also set a default value in the “else use” statement.
Set the Measure type of the new variable if necessary.
Note the following during recode:
You must not delete the original variable. Otherwise the transform will also become blank.
You can repeat the codes in different questions. E.g., You may use 1=Male and 2=Female in Q5, and then 1=Chinese, 2=British, 3=American in Q6. It doesn’t matter that the 1 and 2 are reused in Q6.
When specifying the old values, you are suggested to copy the option name from Excel instead of typing it to ensure exact match. This is because, for example, “Male” is not the same as “Male “ (with the space after it) in Jamovi. If you type “Male” in the old value while in fact it is “Male " (with the space after it) in the data, Jamovi will not recognize it, and the recode will fail.
If you have items A1, A2, A3, A4 belonging to construct A, items B1, B2, B3 belonging to construct B, and items C1, C2, C3 belonging to construct C in the questionnaire, then you can calculate the averages of these items respectively under each construct.
Transform->Compute Variable:
In the Target Variable box, type the name of the new variable, e.g., A here.
In the Numeric Expression box, type MEAN(A1,A2,A3, A4). If you don’t want to type the variable names one by one, you can type MEAN(A1 to A4).
Click OK. A new variable calculated as the average of A1, A2, A3 will be added to the last column. You can drag this variable to the convenient location. In the subsequent analysis, you can analyze A instead of A1, A2, A3, A4..
Repeat the above for the other constructs B and C.
Click the column header of the last variable you want to include in the calculation (e.g., A4), and then click Data->Compute:
Name the new variable and add a description if needed.
Type the formula for the compute, e.g., =MEAN(A1,A2,A3,A4). You can click the fx button for some available functions and the names of the variables.
Repeat the above for the other constructs B and C.
Even if A1, A2, A3, A4, ... are ordinal (e.g., Likert scale), it is acceptable practice to calculate their average in this way and the resulting variable A will be regarded as scale, so that you can use it in linear regression for scale variables.
If you have reversed items in the questionnaire, you need to reverse them back in the average. Consider the following statements A1 and A2, in which you ask the respondents to rate a score of 1 to 5, with 5 indicating strongly agree and 1 indicating strongly disagree:
A1: "I think it is safe to visit country A during the pandemic." (Higher score -> Lower perceived health risk)
A2: "I worry that I would get ill if I visit country A during the pandemic." (Higher score -> Higher perceived health risk)
Here A2 is reversed with respect to A1 in a sense that a higher score in A2 indicates higher perceived health risks, which is opposite to the case in A1. To calculate the average score of A1, A2, A3, A4 such that a higher average score indicates a lower perceived health risk, you need to reverse A2 like the following in the formula:
MEAN(A1, (5-A2)+1, A3, A4)
(Or you can simplify the formula as: MEAN(A1, 6-A2, A3, A4).)
When A2=1, (5-A2)+1=5. When A2=2, (5-A2)+1=4, and so forth. Therefore, a higher score of (5-A2)+1 will indicate a lower perceived health risk.
Multiple response items as recorded in Google Form and some other online survey tools look like "A,B,C", "A,C", "B,C", etc., where the appearance of an option (A, B, or C here) indicates that the option has been selected by the respondent. For example, "A,C" means that the respondent has selected both A and C. This kind of data cannot be analyzed directly and must be recoded first.
You should create one variable for each option and use 1 or 0 (or leaving it blank) to denote whether the option is chosen or not. For example, if there are three options (A, B, C) under Q5, you can create the variables Q5.A, Q5.B, Q5.C. If someone answers A and C, then you can put Q5.A=1, Q5.B=0, and Q5.C=1. (For simplicity, you can also ignore the 0 and leave it as missing.)
In practice, you may use a spreadsheet program such as Excel and Google Sheets to help you with the recoding. You can manually create new columns for each variable, and then use the filter in the spreadsheet to select all those answers containing an options. Then type 1 in the first row under the corresponding new variable for that option, and copy this value all the way to the end of the data set. Repeat this for the other options.
This approach is necessary if you use Jamovi because the software does not provide any tool to deal with this situation (unless you write some codes in R).
Alternatively, if you are using SPSS and you feel comfortable with the SPSS programming syntax, you can use the char.index command in SPSS syntax mode to speed up the process. To do this, start a new syntax window from the File menu of SPSS and write the code. For example, the following command will check if the characters A, B, and C exist in the variable Q5 respectively. If so, it returns a 1 in the corresponding variables Q5.A, Q5.B, or Q5.C. Otherwise, it returns a 0.
compute Q5.A = char.index(Q5,'A')>0.
compute Q5.B = char.index(Q5,'B')>0.
compute Q5.C = char.index(Q5,'C')>0.
execute.
To run the code, highlight all the lines and click the run button in the SPSS syntax window. The new variables will appear on the far right-hand-side of the data set. You can then move them to the appropriate positions for your convenience.
After the recoding, if you use SPSS, you can group the variables and analyze the frequency distribution. See: https://www.dummies.com/education/math/statistics/creating-and-using-a-multiple-response-set-in-spss/ for details. If you use Jamovi, you may simply run frequency tables of all these new variables and manually merge their frequencies together into one table.
Finally, if you are not going to analyze the multiple-response data, or you are only going to generate simple frequency tables of the options for all the cases as a whole (i.e., without splitting into groups), then you may skip the above, and instead obtain the frequencies of the options given by the statistics in Google Form or other online questionnaire platform you may be using. That will save you some time.
Ranked items can be coded in a similar way as multiple response items except that you put down the ranks (e.g. 1, 2, 3, 4…) instead of whether it is selected or not (1, 0) as the data. Alternatively, you can use one variable for each rank and put down the option in that rank. For example, if the data under Q4 is A, C, B, D then you can create four variables Q4.1, Q4.2, Q4.3, Q4.4 and put A, C, B, D in them respectively.
Open-ended items can be entered as is. Just remember to use string as variable type, and if it is SPSS then make sure that the width of the string is long enough to hold your data.
In SPSS, try using Data -> Validation to quickly check the distribution and range of the data. You can also define rules for SPSS to do the checking. You may also check whether there are strange-looking data, e.g., the same option is chosen in all the items. This probably implies that the respondent was not giving their true responses.
In Jamovi, you can inspect the descriptive statistics of the variables.
Usually not. Both SPSS and Jamovi are good at handling missing data and for most cases it will ignore the missing data in the analysis. However, in some occasions it may be better off discarding all the incomplete cases (cases with missing data). For example, if most of the answers in the case are missing, then the other answers may not be reliable neither.
This is usually caused by illegal characters (often Chinese characters in non-UTF-8 encoding scheme), or other problems in the Excel file. In that case, try CSV instead. If that does not resolve the problem, try to copy the paste the data from Excel. The shortcoming is that it does not copy the variable names and you have to input the variable names one by one.
Jamovi does not have this issue.
This is usually due to non-UTF-8 encoding in Chinese characters. You can open the data file in Excel and then save it as another file while changing the character encoding setting (under web options in the save dialogue box) to UTF-8 when you save.
Alternatively, you can tell SPSS to read those non-UTF-8 encoded characters. To do so, close all your data sets and restart SPSS. Then, with an empty data set, choose Edit -> Options -> Language and then change the Locale setting in the Locale's writing system section.
Jamovi does not have this issue.
Likert scale is ordinal level although some people regard it as scale level.
You can use recode. For example, if your original data have three categories: 1, 2, 3, and you want to combine 2 and 3 into a new group, then you can use Transform -> Recode into Same Variables to recode 1 to 1 (1->1) and then 2 and 3 to 2 (2->2, 3->2). The new variable will then contain the regrouped categories.
I would suggest keeping them in the data set so that at least you know how many of the cases are screened out. To skip these cases in your analysis, you can use Data -> Select cases in SPSS or the filter in Jamovi to select only the cases you want to analyze.
Ordinal data have ordered categories, which is essential if you want SPSS to follow a particular order when displaying these data in tables and charts. If you set ordinal data as nominal, SPSS may get the ordering wrong.
In Jamovi you can specify the ordering of the categories in the Levels box in the variable settings.
SPSS is a proprietary software and you need to pay to use it. However, you can apply for a free trial at https://www.ibm.com/hk-en/analytics/spss-trials.
Alternatively, you can use Jamovi instead. (See below.)
If you do not want to use SPSS, you may install Jamovi (https://www.jamovi.org/) as an alternative. This software is similar to SPSS but open source. It is not as powerful as SPSS but can be used in case of contingency. Please install version 1.2.0 or above as earlier versions do not support import from Excel.
Some students are using a 32-bit Windows, which is not supported by Jamovi. In that case, you may also try the 32-bit version of JASP (https://jasp-stats.org) developed by the same team of people as Jamovi. However, JASP's user interface is not as intuitive as Jamovi, and thus is more difficult to learn.