Descriptive Analysis in Google Sheets/Excel
Ed Direction Data Fellows: Asynchronous Module
June 30th, 2022
Ed Direction Data Fellows: Asynchronous Module
June 30th, 2022
Welcome to the Descriptive Analysis in Google Sheets/Excel Optional Asynchronous Module.
Helpful Instructions and Reminders:
You will work through this module by scrolling through this learning space. To expand documents and slide decks that are included, you can click on the gray arrow at the top right corner of each item.
Feel free to focus on the pieces of this module that are most relevant to your topics of interest.
Please complete the Exit Ticket and Module Completion Form at the end of the module. We will use your submission to track completion.
Please contact datafellows@eddirection.org if help is needed.
In this module you will use a dummy dataset on this Google Sheet in lieu of a note catcher. Save a copy to use throughout the module. Refer to the dataset each time you see the icon on the left.
Session Outcome: This module will provide instructions on how to calculate descriptive statistics using Google Sheets. Developing this skill will enable Data Fellows to perform basic analysis on district datasets, thereby providing their district leadership and RSSP teams with important information as they shape strategies for continuous improvement.
Success Criteria: Upon completion of this module, participants will be able to:
Understand the basic structure of formulas used in Google Sheets (and Excel).
Calculate the central tendency of the variables within a dataset.
Calculate the spread of variables within a dataset.
What is Descriptive Analysis?
Performing descriptive analysis is the first step you should take as Data Fellows when analyzing a dataset. Will Hillier at Career Foundry explains that:
“Of all data analytics techniques, descriptive analytics is perhaps the most straightforward. It involves parsing (or breaking down) data and summarizing its main features and characteristics. In this way, descriptive analytics presents what has happened in the past without exploring why or how.”
Breaking down the variables in your dataset to discover their main features and patterns provides you with meaningful context that is essential for later stages of the analysis process.
As a Data Fellow, you will be tasked with finding descriptive statistics for a wide variety of student subgroups. The goal of this module is to provide you with detailed explanations of how to construct formulas for descriptive statistics in Google Sheets so that you can calculate these statistics efficiently and with as little stress as possible!
*Important note: The formula structures used in Google Sheets to calculate descriptive statistics are identical to the formula structures used in Excel. If a linked resource in this module references Excel, it provides the same guidance that a resource for Google Sheets would.
Central Tendency
The Central Tendency refers to the center of your data. The center of your data has three different definitions: Mean, Median, and Mode.
Mean: The mean is the most important statistic in data analysis because it conveys what the expected value for an observation is. If nothing different or unique was happening to an observation (think student test score, attendance rate, etc.), you would expect its value to be at or close to the average. If an observation’s value was significantly different from the mean, then that is an indication that something else is going on. Chances are that “something else” is worth exploring.
Median: The median is useful in data analysis because it indicates the extent to which the mean is being influenced by outliers. If the median and mean are roughly equivalent, then the outliers in the dataset are exerting a low level of influence on the mean. If the median and mean are noticeably different, then the outliers in the dataset are exerting a high level of influence on the mean. If there are noticeable differences between the median and mean, the right course of action is to identify outliers and determine if they should remain in or be dropped from the dataset.
Mode: Calculating the mode is important in data analysis because it describes which value in a variable is the most commonly occurring. This is especially useful when working with categorical variables (variables expressed in non-numerical values). Because categorical variables do not have averages (it is impossible to find the average of the values "Cat", "Dog", and "Hamster"), finding the mode enables us to identify which value most frequently occurs.
Let's practice calculating the mean STAAR3 Raw Score from the Student Master Data in Google Sheets! There are two primary methods you can use.
Method 1: Finding the Mean via Point and Click
To calculate the average using a point and click method, simply select Column J by clicking the letter "J" at the top of the column and manually change the metric output at the bottom right of the screen from "sum" to "average" (see images below).
While this method is convenient, it is also inefficient. Using a point and click method can take a lot of time time when calculating averages for lots of values and does not preserve the steps you took while calculating your descriptive statistics.
Method 2: Finding the Mean Using Formulas
Alternatively, you can use formulas to calculate the mean. After going through the initial learning curve, using formulas to calculate descriptive statistics is much faster and preserves the steps you took while calculating your descriptive statistics. You can always revisit the formulas you wrote within your spreadsheet. You can not revisit the path your mouse took when clicking around the screen. For these reasons, the rest of this module will focus on using formulas.
To provide clarity on how formulas work, we have included descriptions of the common elements found in almost every spreadsheet-based formula.
= The equal sign, when placed at the beginning of a formula, indicates that Google Sheets/Excel should treat everything after the = as a formula instead of a list of separate values.
COMMAND The command in a formula indicates which calculation you want Google Sheets/Excel to perform. Commands are usually capitalized (though they do not need to be) and are typically words like AVERAGE, MEDIAN, and MODE. The terms "command" and "function" are used interchangeably. When you begin typing a command into Google Sheets, it will usually auto-populate a list of commands from which you can choose. We recommend selecting the intended command from this list because occasionally Google Sheets will not recognize a command if it has been manually typed.
() The parentheses usually follow a command. They indicate which cells will be included in the calculations: everything inside of the parentheses following a command will be included in the command's calculations, while everything outside of the parentheses will not.
, The comma is typically used to indicate the different sections of the formula.
Cell Range The cell range indicates the list of cells that will be included in the calculations. Cell ranges normally look like the following: K4:K75. If you were to attach that cell range to a command, the command would perform calculations on the values found in cells K4 through K75.
While other components are sometimes used in spreadsheet-based formulas, they appear less frequently than the core components described above.
The AVERAGE command
Here is the formula you would use to find the average STAAR 3 Raw Score: =AVERAGE(J2:J274)
Let's break down exactly what this formula is telling Google Sheets to do:
The "=" tells Google Sheets we are about to enter in a formula
The “AVERAGE” command tells Google Sheets you want to calculate the mean for the values you are about to enter.
The parentheses tell Google Sheets to calculate the mean for everything inside of the parentheses.
“J2” is the first cell you are selecting and “J274” is the last cell you are selecting
The “:” tells Google Sheets that you want to calculate the mean for all the cells from J2 through J274. If you were to use a comma between J2 and J274, you would be telling Google Sheets that you only wanted to calculate the mean for only cells J2 and J274. When you use a colon, you tell Google Sheets that you want to use the entire range between the two cells you have chosen.
Try it!
Select a blank cell in column T of the Student Master Data sheet, write the following equation: =AVERAGE(J2:J274) and hit enter on your computer.
If you wrote the formula correctly, the output in your cell should read 19.577!
If you want to learn more about the AVERAGE command, check out these resources!
The AVERAGEIF command
The AVERAGEIF command enables you to calculate the mean for a set of values that meet a specific criteria. For example, if you wanted to find the mean STAAR 3 Raw Score for students who had Betsy Birch as their tutor, you would use the following formula: =AVERAGEIF(E2:E273, E12, J2:J273)
Let’s break down what this formula is telling Google Sheets to do:
The “=” tells Google Sheets we are about to write an equation
The AVERAGEIF command tells Google Sheets you want to calculate the average for a set of values according to a criteria you are about to enter.
The parentheses tell Google Sheets to apply the AVERAGEIF function to everything inside the parentheses.
E2:E273, E12, and J2:J273 are the cell ranges.
The commas indicate where each cell range ends.
You can read the cell references inside of the parentheses as follows: “If the values in the first set of references (E2:E272) are equal to the value in the second reference cell (E12), then calculate the average for the values in the third set of references (J2:J272).
In this example, we are telling Google Sheets to do the following: “If the name of the tutor is Betsy Birch, then calculate the average STAAR 3 Raw Score score for the corresponding rows.”
When we enter in the formula above, Google Sheets examines column E for cell values that have an exact match for the whatever is in cell E12 (in this case, it is the name Betsy Birch -- spelled and capitalized exactly that way). Once it has determined which cells have Betsy Birch and which cells do not have Betsy Birch, it will then calculate the average for the cells in column J that correspond with the cells that have Betsy Birch in column E.
Because this is how Google Sheets does the math, it’s important that the length our cell ranges match. If we try to write a formula that looks like this =AVERAGEIF(E2:E273, “Betsy Birch”, J2:J117), we will most likely get an error telling us that our column arrays do not match. The way to fix that problem would be to change J2:J117 to J2:J273 so that it exactly matches the range length of E2:E273.
*Note: When writing a formula to calculate the average for a group or subgroup, instead of listing a cell for our reference value, we can write the value itself. For example, we could change the formula above to read =AVERAGEIF(E2:E273, “Betsy Birch”, J2:J273) . We use quotation marks around Betsy Birch because it is a non-numerical value.
Try it!
Select a blank cell in column T, write the following equation: =AVERAGEIF(E2:E273, E12, J2:J273) and hit enter on your computer.
Then select a different blank cell in column T and write =AVERAGEIF(E2:E273, "Betsy Birch", J2:J273) and hit enter on your computer.
Compare the outputs of the two formulas.
If you wrote the formulas correctly, the output in both cells should read 16.722!
If you want to learn more about the AVERAGEIF command, check out these resources!
The AVERAGEIFS command
The AVERAGIFS command expands the capabilities of the AVERAGEIF command by it letting you calculate the mean for a group of cells if they meet multiple criteria.
For example, if we wanted to find the mean STAAR3 Raw Score for a student who had Betsy Birch as a tutor and who was classified as "Approaches" within the STAAR3 Performance Band, we would use the following formula: =AVERAGEIFS(J2:J273, E2:E273, E14, I2:I273, I13)
Let’s break down what this formula is telling Google Sheets to do. Pay attention to the key differences in the formula's structure when compared to the formula structure when we used the AVERAGEIF command.
The “=” tells Google Sheets we are about to write an equation
The AVERAGEIFS command tells Google Sheets you want to calculate the average for a set of values according to the multiple criteria we are about to enter.
The parentheses tell Google Sheets to apply the AVERAGEIFS function to everything inside the parentheses.
J2:J273 is the range of cells we want Google Sheets to average, which in this case is the STAAR 3 Raw Scores.
E2:E273 is the first criteria range. In this case, the first criteria range is the names of the tutors in the dataset.
E14 is the first criteria value. This is the value within the first criteria range that we want to select for. In this case, we want to select the value "Betsy Birch".
I2:I273 is the second criteria range. In this case, the second criteria range is the STAAR Performance Band.
I13 is the second criteria value. This is the value within the second criteria range that we want to select for. In this case, we want to select for the value "Approaches".
When we use this formula, we are effectively saying, “Hey Google Sheets. I want to find the mean STAAR 3 RAW Scores for students who had Betsy Birch as their tutor and were designated as 'Approaches' within the STAAR3 Performance Band.”
If you follow the same structure as outlined above (the range of cells you want average, criteria range one, criteria value one, criteria range two, criteria value 2), you can add as many criteria ranges and values as you would like with the AVERAGEIFS function--you are not limited to just two!
Try It!
Select a blank cell in column T, write the formula: =AVERAGEIFS(J2:J273, E2:E273 ,E14, I2:I273, I13) and hit enter on your computer.
If you wrote the formula correctly, the cell should read 19.22!
If you want to learn more about the AVERAGEIFS command, check out these resources!
The MEDIAN command
If we want to find the median for a range of cells, we can use a formula structure that is almost identical to the formula structure we used to find the mean for a range of cells. The structure follows the simple = COMMAND cell range pattern.
For example, if we wanted to find the median STAAR3 RAW Score for all students in the dataset, we would use the following: =MEDIAN(J2:J273)
Try it!
Select a blank cell in column T, write the formula: =MEDIAN(J2:J273) and hit enter on your computer.
If you wrote the formula correctly, the cell output should be 22!
If you want to learn more about the MEDIAN command, check out these resources:
Using Array Formulas to Find Medians with One Criteria
If we would like to find the median value for a specific subgroup in our dataset, we would have to use a different formula structure than the one used in the previous section.
For some strange reason Google Sheets/Excel does not have a MEDIANIF or MEDIANIFS command. Instead, we would need to use a different formula structure known as an array. Array formulas have greater flexibility and can perform more powerful calculations than basic formulas. The array formulas we will be using in this module are simple to use and relatively easy to understand.
For example, if we wanted to find the median STAAR3 Raw Scores for students who had Betsy Birch as their tutor, we would use the following formula: =MEDIAN(if(E2:E273 = E12, J2:J273))
Let’s break down exactly what this formula means:
The first = sign tells Google Sheets that we are about to use a formula.
The MEDIAN command tells Google Sheets that we want to calculate the median.
The first set of parentheses (the first parenthesis on the left and the last parenthesis on the right) tells Google Sheets that we want to calculate the median for everything contained in that first set of parenthesis.
The if function tells Google Sheets that we are about to enter a conditional statement -- that we want the MEDIAN command to only apply to a specific group within the dataset.
The second set of parentheses (the inner set) tells Google Sheets that we want the conditional if function to apply to everything within that second set of parentheses.
E2:E273 is the criteria range we are telling Google Sheets you want to select from.
The second = sign tells Google Sheets that we want to calculate the median for values in the range J2:J273 that also have the value E12 within the E2:E273 range.
In other words, we are saying “Hey Google Sheets. Calculate the median STAAR3 Raw Score for the values in J2:J273, but only if the values in the range E2:E273 are equal to the value in cell E12.”
Or, in simpler terms, “Hey Google Sheets. Find the median STAAR3 Raw Score for students who have Betsy Birch as their tutor.”
When using an array formula, you will need to hit ctr + shift + enter (if you are using a PC) or command + shift + return (if you are using a Mac) on your keyboard instead of just hitting enter (like with the previous formulas).
Try it!
Select a blank cell in column T, write the formula: =MEDIAN(if(E2:E273 = E12, J2:J273)) and hit control + shift + enter on your computer if you are using a PC or command + shift + return if you are using a Mac.
If you wrote the formula correctly (and used ctr + shift + enter/ command + shift + return), then the cell output should be 16.5!
Using Array Formulas to Find Medians with Multiple Criteria
If we wanted to find the median of a subgroup based off of multiple criteria, then we would use an array formula similar to the one we used in the previous section with a couple of small changes.
As a demonstration, lets say we wanted to find the median STAAR 3 Raw Score of students who had Betsy Birch as their tutor and who were classified as Approaches within the STAAR3 Band Performance. To do so, we would use the following formula: =MEDIAN(if((E2:E273 = E12)*(I2:I273=I13), J2:J273))
Let’s break down exactly what this formula means.
The first = sign tells Google Sheets that we are about to use a formula.
The MEDIAN command tells Google Sheets that we want to calculate the median value.
The first set of parentheses (the first parenthesis on the left and the last parenthesis on the right) tells Google Sheets that we want to calculate the median value for everything within the parentheses.
The if command tells Google Sheets that we are about to enter a conditional statement -- that we want the median command to only apply to a specific set of groups within the dataset.
The second set of parentheses (the second parenthesis on the left and the second to last parenthesis on the right) tells Google Sheets that we want the if command to apply to everything within the second set of parenthesis.
The third set of parentheses (the parenthesis surrounding E2:E273 = E12) contain the first criteria range and criteria.
E2:E273 is the first criteria range
The second = sign tells Google Sheets that we only want to calculate the median in our dataset if the values in our first criteria range are equal to the criteria we have selected. In this case, the criteria we have selected is the value of whatever is in cell E12.
The * sign tells Google Sheets that you want to add a second criteria.
The fourth set of parentheses contains the second criteria range and criteria.
I2:I273 is the second criteria range.
The third = sign tells Google Sheets that we only want to calculate the median if the values in your second criteria range are equal to the criteria you have selected. In this case, the criteria you have selected is the value of whatever is in cell I13.
The comma tells Google Sheets that we are beginning a new section of the formula. In this specific scenario, we are indicating that we are about to input the range of values from which we want the median calculated.
J2:J273 is the range of values from which we want the median calculated.
In other words, we are saying “Hey Google Sheets. Calculate the median value of numbers in the range J2:J273 only if the values of E2 through E273 are equal to E12 and if the values of I2 through I273 are equal to I13.
Or, in simpler terms, “Hey Google Sheets. Find the median STAAR3 Raw Score test score for students who had Betsy Birch as their tutor and were classified as Approaches within the STAAR3 Performance band.”
Try it!
Select a blank cell in column T, write the formula: =MEDIAN(if((E2:E273 = E12)*(I2:I273=I13), J2:J273)) and hit control + shift + enter on your computer if you are using a PC or command + shift + return if you are using a Mac.
If you wrote the formula correctly, the cell output should be 18!
If you want to learn more about how to use array formulas to calculate medians with criteria, check out these resources:
The MODE command + Using Arrays to Calculate Conditional Modes
Calculating the mode in Google Sheets/Excel uses the exact same formula structures that were introduced in the section on medians.
For example, if you wanted to calculate the mode STAAR3 Raw score, you would use the MODE command, as demonstrated in the following equation: =MODE(J2:J273).
If you wanted to calculate the mode STAAR3 Raw Score for students who had Betsy Birch as a tutor, you would use: =MODE(if(E2:E273 = E12,J2:J273))
If you wanted to calculate the mode STAAR3 Raw Score for students who had Betsy Birch as a tutor and were classified as Approaches within the STAAR3 Performance band, you would use: =MODE(if((E2:E273 = E12)*(I2:I273=I13),J2:J273))
You’ll notice that these are the exact same equations we used to find the median test scores, with the sole exception that we used the MODE command instead of the MEDIAN command.
Try it!
Select a blank cell in column T, write the formula: =MODE(if((E2:E273 = E12)*(I2:I273=I13),J2:J273)) and hit control + shift + enter on your computer if you are using a PC or command + shift + return if you are using a Mac.
If you wrote the formula correctly, the cell output should be 17!
Spread
Spread refers to how much your data varies (or changes). You can discover the variability of your data by finding the following: Range, Frequency, and Standard Deviation.
Range: Identifying the range of a variable's values is important in data analysis because it gives you the highest and lowest values in your dataset. The meaning of how far away the minimum and maximum values are from the center of your dat is determined by the standard deviation.
Frequency: Compiling the frequencies of values within a variable follows the same logic as calculating the mode. Calculating frequencies helps us understand which values have the highest and lowest likelihoods of appearing in our dataset, which can be especially useful when working with categorical variables.
Standard Deviation: Finding the standard deviation is important in data analysis because it conveys the average distance between a given data point and the mean. In other words, the standard deviation gives you context for how different a value is from the average value in your dataset. If a value is one standard deviation away from the mean, then that value is not very different. If a value is four standard deviations away from the mean, then that is abnormally different. The ability to determine how different a specific value is from the average is a foundational concept in data analysis.
Finding the Range Using the MIN and MAX Commands
Finding the minimum and maximum values of a variable follows the same logic as finding the median and the mode. Swapping out the MEDIAN/MODE commands for the MIN or MAX commands will enable you to find the minimums and maximums for variables, subgroups for those variables, and subgroups based off of multiple criteria.
For example, if you wanted to calculate the minimum STAAR 3 Raw Score, you would use the following equation: =MIN(J2:J273)
If you wanted to calculate the maximum STAAR 3 Raw Score for students who had Betsy Birch as their tutor and were classified as Approaches within the STAAR3 Performance band, you would use the following equation: =MAX(if((E2:E273 = E12)*(I2:I273=I13), J2:J273))
Try it!
Select a blank cell in column T, write the formula: =MAX(if((E2:E273 = E12)*(I2:I273=I13), J2:J273)) and hit control + shift + enter on your computer if you are using a PC or command + shift + return if you are using a Mac.
If you wrote the formula correctly, the cell output should be 23!
Calculating the Frequency Using the COUNT, COUNTIF, and COUNTIFS Commands
Finding the frequency of variables' values is very simple because, lucky for us, Google Sheets and Excel allow us to use the basic formula structure we learned in the section on calculating means.
The COUNT Command
Continuing with our example of STAAR3 Raw Scores, if you wanted to calculate the total number of STAAR3 Raw Scores in your dataset, you would use the formula =COUNT(J2:J273) where:
COUNT is a command that instructs Google Sheets to count the number of rows in the designated cell range and (J2:J273) is the cell range.
Try it!
Select a blank cell in column T, write the formula =COUNT(J2:J274), and hit enter.
If you wrote the formula correctly, the cell output should be 272!
The COUNTIF Command
If you wanted to find the total number of STAAR3 Raw Scores for students who had Betsy Birch as their tutor, you would use the COUNTIF command within the formula =COUNTIF(E2:E273, E12) where:
The = sign tells Google Sheets we are about to enter a formula
The COUNTIF command tells Google Sheets that we want to count the frequency of a variable based off of one set of criteria.
The parentheses tell Google Sheets that we want to apply the COUNTIF command to everything inside of the parentheses.
E2:E273 is the criteria range
The comma tells Google Sheets that we are about to enter in the criteria value
E12 is the criteria value.
In other words, we are saying, “Hey Google Sheets. Count how many students were assigned Betsy Birch as a tutor”.
Try it!
Select a blank cell in column T, write the formula =COUNTIF(E2:E273, E12), and hit enter.
If you wrote the formula correctly, the cell output should be 18!
The COUNTIFS Command
If you wanted to find the total number of STAAR3 Raw Scores for students who had Betsy Birch as their tutor and were classified as Approaching within the STAAR3 Performance band, you would use the COUNTIFS command within the formula =COUNTIFS(E2:E273, E12, I2:I273, I13) where
This formula follows the structure used in the countif section, except it adds an additional criteria range and criteria value. This enables the formula to only count the number of students who had Betsy Birch as a tutor and who were classified as Approaching.
If you wanted to add additional criteria to the formula, all you would have to do is add the criteria ranges and values according to the same structure used for the first and second criteria ranges and values.
For example, if you wanted to count the number of students who had Betsy Birch as a tutor, were classified as Approaches, and belonged to the cohort ESA.4.2, you would use the following formula
=COUNTIFS(E2:E273, E12, I2:I273, I13, D2:D273, D14)
Try it!
Select a blank cell in column T, write the formula =COUNTIFS(E2:E273, E12, I2:I273, I13) , and hit enter.
Then select a different blank cell in column T, write the formula =COUNTIFS(E2:E273, E12, I2:I273, I13, D2:D273, D14) , and hit enter.
If you wrote the formulas correctly, the cell outputs should be 9 and 5 respectively!
Calculating the Standard Deviation using the STDEV Command and Array Formulas
Using the same formula structure and logic introduced in the mean and median sections, you can simply replace the MEDIAN command with the STDEV command, and you will be able to calculate the standard deviation for variables, subgroups of variables, and subgroups of variables using multiple criteria.
For example, if you wanted to find the standard deviation s the STAAR3 Raw Scores, you would use the following equation: =STDEV(J2:J274)
If you wanted to find the standard deviation of the STAAR3 Raw Scores for students who had Betsy Birch as a tutor, you would use the following formula: =STDEV(if(E2:E273 = E12, J2:J273))
And if you wanted to find the standard deviation of the STAAR3 Raw Scores for students who had Betsy Birch as a tutor and were classified as Approaches, you would use the following formula: =STDEV(if((E2:E273 = E12)*(I2:I273=I13), J2:J273))
Try it!
Select a blank cell in column T, write the formula =STDEV(J2:J274) , and hit enter.
Then select a different blank cell in column T, write the formula =STDEV(if(E2:E273 = E12, J2:J273)) , and hit control + shift + enter on your computer if you are using a PC or command + shift + return if you are using a Mac.
Then select a third blank cell in column T, write the formula =STDEV(if((E2:E273 = E12)*(I2:I273=I13), J2:J273)) , and hit control + shift + enter on your computer if you are using a PC or command + shift + return if you are using a Mac.
If you wrote the formulas correctly, the cell outputs should be 8.897, 3.462, and 2.68 respectively!
Congratulations! You made it through detailed explanations of twenty-one different Google Sheets formulas -- good on you!
You now know how to calculate the most important descriptive statistics in data analysis. These statistics provide the context needed to explore your data in greater detail. It is only after you understand the central tendency and spread that you can begin to identify the meaningful differences among values within and between the variables in your dataset.
If you would like to further develop your Google Sheets and Excel skills, you can explore these additional resources:
Excel also has a plethora of well-designed tutorials that are built into the software. Simply select the Help feature at the top of the screen, type the category or concept with which you would like help, and select the tutorial as demonstrated in the image on the right.
Congratulations on completing the Descriptive Analysis module. Please complete the Exit Ticket and Module Completion forms by clicking on the links below. We will use the information you submit to track your completion.