TEK 6.12 C/D - Summarizing Data

TEK 6.12 C Learning Goal: I can summarize numeric data with numerical summaries, including the mean and median (measures of center) and the range and interquartile range (IQR) (measures of spread), and use these summaries to describe the center, spread, and shape of the data distribution.

TEK 6.12 D Learning Goal: I can summarize categorical data with numerical and graphical summaries, including the mode, the percent of values in each category (relative frequency table), and the percent bar graph, and use these summaries to describe the data distribution.

Vocabulary

Data distribution: A listing of the values or responses associated with a particular variable in a data set. Basically, a list of every piece of data in the data set.

Measures of spread: Numbers used to describe the distribution, or spread, of the data. They describe how the values of a data set vary with a single number.

Measures of center: Numbers that describe the center of a set of data.

Mean: The arithmetic average of a distribution.

Median: The value appearing at the center of a sorted version of a list of values, or the mean of the two central values, if the list contains an even number of values.

Range: The difference between the greatest and least data values.

Quartile: One of the three values that divide a sorted data set into four equal parts.

Second Quartile: The median of the whole data set.

First quartile: The median of data values less than the median of the whole data set.

Third quartile: The median of data values more than the median of the whole data set.

Interquartile range (IQR): The difference between the first and third quartiles of the data set.

Mode: The number or numbers in a data set that occur most often.

Outlier: A data value that is either much greater or much less than the median.

Categorical data: Data that can be divided into categories based on the attributes of the data.

Relative frequency: The ratio of the number of times a category is represented to the total number of pieces of data.

Relative frequency table: A table that shows the relative frequency of each category.

Percent bar graph: A bar graph that shows the relative frequency of each category in a single bar.

Measures of Spread/Center

Measures of spread are used to describe how similar or varied the set of observed values is for a particular variable. The measures of spread include range and interquartile range (IQR). Measures of center describe the center of a set of data. The measures of center include mean, median, and mode. To look at all of these concepts, we are going to use the same situation:

All of the students in a class go home one night and count all of the toy cars they have in their home. Some don't like to play with toy cars and don't have any, and one student actually has a toy car collection and has a lot more cars than everyone else. All of the students take a survey in class and write down how many toy cars they have at home. Here are the results:

2 students have 0 toy cars.

2 students have 1 toy car.

3 students have 2 toy cars.

1 student has 3 toy cars.

1 student has 4 toy cars.

1 student has 5 toy cars.

2 students have 6 toy cars.

2 students have 7 toy cars.

1 student has 8 toy cars.

1 student has 26 toy cars.

Here are the responses in a list form, where each student's response on how many toy cars they have is listed:

0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 8, 26
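If you want to check this list with a computer, here is a small Python sketch that builds it from the survey results. The names survey_counts and toy_cars are made up for this example.

```python
# Survey results: number of toy cars -> how many students gave that answer.
survey_counts = {0: 2, 1: 2, 2: 3, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 1, 26: 1}

# Repeat each answer once per student, going from the smallest answer to the largest.
toy_cars = []
for cars, students in sorted(survey_counts.items()):
    toy_cars.extend([cars] * students)

print(toy_cars)       # the responses as a sorted list
print(len(toy_cars))  # how many students answered
```

The list comes out already sorted, which will be helpful when we look for the median.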

We can also put the data into a table with every student's name:

Mean

Mean is the average of a distribution. A very common use of mean is calculating grades. If you have 3 test grades for a class, and one is 85%, one is 90%, and one is 95%, you would use the mean to find your grade for the whole class.

To find the mean, add up all of the numbers in your distribution and then divide by the total amount of numbers in your distribution.

For the grade example, you would add 85, 90, and 95. 85 + 90 + 95 is 270.

There are 3 numbers in the distribution, so we would then divide 270 by 3 to get 90. 90 is our mean, or the average number of points received on the tests, so your grade for the class would be a 90%.

Remember that our toy car example has these numbers:

0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 8, 26

There are 16 total numbers here. We need to add up all of the numbers in this distribution, and then divide by 16 (the amount of numbers there are).

0 + 0 + 1 + 1 + 2 + 2 + 2 + 3 + 4 + 5 + 6 + 6 + 7 + 7 + 8 + 26 = 80.

80 ÷ 16 = 5.

5 is the mean, or the average number of toy cars a kid in this class has.
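The same two steps (add everything up, then divide by how many numbers there are) can be sketched in Python. This is only an illustration, and the variable names are our own.

```python
toy_cars = [0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 8, 26]

total = sum(toy_cars)   # step 1: add up every value
count = len(toy_cars)   # how many values are in the distribution
mean = total / count    # step 2: divide the total by the count

print(total, count, mean)  # 80 16 5.0
```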

Median

The median is the middle of a distribution. There is an equal number of values that are greater than and less than the median. If you have an odd amount of numbers in your distribution, the median is the number in the middle of the data set sorted as a list in increasing order. Let's use the example where you score 85%, 90%, and 95% on your 3 tests. Here is the data as a list:

85, 90, 95

There are 3 numbers, so the second number is the one in the middle. 90% is the median. There is one number above the median (95), and one number below the median (85), so there is an equal number of values that are greater than or less than the median.

If you have an even amount of numbers in your distribution, the process will be a little bit different. You have to take the middle 2 numbers and find the mean of those 2 numbers. The result is your median.

Here's our list again:

0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 8, 26

Remember that we have 16 values in our distribution. The middle 2 numbers are 3 and 4, because there are 7 values to the right of 3 and 4 and 7 values to the left of 3 and 4. Now we need to find the mean of 3 and 4. Remember, you add all of the numbers and divide by the amount of numbers. We have two numbers - 3 and 4, so we will add 3 and 4 and then divide that by 2.

3 + 4 = 7

7 / 2 = 3.5

3.5 is the median. There are 8 numbers to the right of 3.5 with a higher value, and 8 numbers to the left of 3.5 with a lower value.

Remember that to find the median you have to first sort the data into increasing order.
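Here is one way the median steps could look in Python. It is a sketch, not the only way to do it. The function name find_median is made up for this example, and it assumes the list has at least one number in it.

```python
def find_median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)   # always sort into increasing order first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd amount of numbers: the median is the one in the middle.
        return ordered[middle]
    # Even amount of numbers: take the mean of the middle 2 numbers.
    return (ordered[middle - 1] + ordered[middle]) / 2

test_grades = [85, 90, 95]
toy_cars = [0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 8, 26]

print(find_median(test_grades))  # 90
print(find_median(toy_cars))     # 3.5
```

Because the function sorts the data itself, it still works if the list is given out of order.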

Mode

The mode is the number or numbers that occur most often. Remember that we have this list:

0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 8, 26

3, 4, 5, 8, and 26 only appear once in the list.

0, 1, 6, and 7 appear twice.

2 is the only number that appears three times, so 2 is the mode.

If some other number also appeared 3 times, then that number would also be a mode, and the data would have two modes.
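Counting how often each number appears is a job computers are good at. The sketch below uses Counter from Python's standard library. The function name find_modes is made up for this example, and it assumes the list is not empty.

```python
from collections import Counter

def find_modes(values):
    """Return a sorted list of the value or values that occur most often."""
    counts = Counter(values)        # value -> how many times it appears
    highest = max(counts.values())  # the largest number of appearances
    return sorted(v for v, c in counts.items() if c == highest)

toy_cars = [0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 8, 26]

print(find_modes(toy_cars))         # [2]
print(find_modes([1, 1, 2, 2, 3]))  # [1, 2]
```

The result is a list because a data set can have more than one mode.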

Range

Range is the simplest measure of spread. All you need to do is subtract the least data value from the greatest data value. Our toy car example has this data:

0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 8, 26

The greatest number is 26, and the least is 0. Don't worry that there are multiple zeros. As long as there is no number greater than 26, and no number less than 0, these are the greatest and least numbers in the distribution.

Now subtract 0 from 26.

26 - 0 = 26.

26 is the range of the toy car data.
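In Python, the greatest and least values can be found with the built-in max and min functions, so the range takes one subtraction. The name data_range is our own choice for this sketch.

```python
toy_cars = [0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 8, 26]

greatest = max(toy_cars)
least = min(toy_cars)
data_range = greatest - least  # greatest value minus least value

print(greatest, least, data_range)  # 26 0 26
```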

Quartiles

The quartiles are the values that split the data into four equal parts. There are 3 quartiles that you need to concern yourself with:

Second Quartile - Another word for median of the entire data set
First Quartile - The median of all the data less than the second quartile
Third Quartile - The median of all the data greater than the second quartile

You use the second quartile to find the first and third quartiles, and then you use the first and third quartiles to find the interquartile range (IQR). The IQR is the difference between the third and first quartiles of the data set.

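In our toy car example, the median was 3.5. The first quartile is the median of all of the data with a value less than 3.5, and the third quartile is the median of all of the data with a value greater than 3.5.

The data less than 3.5 is 0, 0, 1, 1, 2, 2, 2, 3. There are 8 values, so the first quartile is the mean of the middle 2 numbers, 1 and 2. (1 + 2) / 2 = 1.5.

The data greater than 3.5 is 4, 5, 6, 6, 7, 7, 8, 26. There are 8 values, so the third quartile is the mean of the middle 2 numbers, 6 and 7. (6 + 7) / 2 = 6.5.

The IQR is the difference between the third quartile, 6.5, and the first quartile, 1.5. 6.5 - 1.5 = 5, so the IQR is 5.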

But what if you have an odd amount of data? Pretend that the student with the car collection decided not to participate in the survey. Now the data looks like this:

0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 8

There are only 15 numbers in the distribution. Since we have an odd number of values, the middle number, 3, is the median.

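Remember that the first and third quartiles are the medians of the data below and above the median, so the median itself is not included when you break the data up into 2 pieces. The data below the median is 0, 0, 1, 1, 2, 2, 2, so the first quartile is 1. The data above the median is 4, 5, 6, 6, 7, 7, 8, so the third quartile is 6.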

The IQR is the third quartile minus the first quartile, or 6 - 1, which is 5.
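Both cases (an even or an odd amount of data) can be handled by one short Python sketch. It follows the method from this lesson: split the sorted data into a lower half and an upper half, leave the middle number out when the amount of data is odd, and take the median of each half. The function name find_quartiles is made up for this example, and it assumes there are at least 2 values. Some calculators and software use slightly different quartile rules, so their answers may not always match.

```python
from statistics import median

def find_quartiles(values):
    """Return (first quartile, median, third quartile) using the halves method."""
    ordered = sorted(values)
    half = len(ordered) // 2  # size of each half; the middle value is skipped if odd
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return median(lower_half), median(ordered), median(upper_half)

toy_cars = [0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 8, 26]
without_collection = toy_cars[:-1]  # the student with 26 cars sits out

q1, q2, q3 = find_quartiles(toy_cars)
print(q1, q2, q3, q3 - q1)  # 1.5 3.5 6.5 5.0

q1_odd, q2_odd, q3_odd = find_quartiles(without_collection)
print(q1_odd, q2_odd, q3_odd, q3_odd - q1_odd)  # 1 3 6 5
```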

Outliers

An outlier is a data value that is much greater or much less than the median. If a data value is more than 1.5 times the value of the interquartile range beyond the quartiles, it is an outlier.

Let's use our original toy car example again. Here's a reminder of what the data looks like:

0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 8, 26

For this data, we determined that the interquartile range was 5, and the first and third quartiles were 1.5 and 6.5.

Multiply the interquartile range by 1.5:

5 × 1.5 = 7.5

Now subtract 7.5 from the first quartile, and add 7.5 to the third quartile:

1.5 - 7.5 = -6

6.5 + 7.5 = 14

If any number in the distribution is less than -6 or greater than 14, its distance from the nearest quartile is more than 1.5 times the interquartile range, so it is an outlier.

26 is an outlier for the toy car survey because 26 is greater than 14.
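The outlier check can also be written as a short Python sketch. The function name find_outliers and the names low_cutoff and high_cutoff are made up for this example. The quartiles 1.5 and 6.5 are the ones found for the toy car data.

```python
def find_outliers(values, first_quartile, third_quartile):
    """Return (low cutoff, high cutoff, list of outliers) using the 1.5 x IQR rule."""
    iqr = third_quartile - first_quartile
    distance = 1.5 * iqr                      # 1.5 times the interquartile range
    low_cutoff = first_quartile - distance    # anything below this is an outlier
    high_cutoff = third_quartile + distance   # anything above this is an outlier
    outliers = [v for v in values if v < low_cutoff or v > high_cutoff]
    return low_cutoff, high_cutoff, outliers

toy_cars = [0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 8, 26]

low_cutoff, high_cutoff, outliers = find_outliers(toy_cars, 1.5, 6.5)
print(low_cutoff, high_cutoff, outliers)  # -6.0 14.0 [26]
```

No student can have fewer than 0 toy cars, so for this data only the high cutoff can ever catch an outlier.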

Categorical Data

Earlier we said that measures of spread are used to describe how similar or varied the set of observed values is for a particular variable. In the example we used, the number of toy cars was the variable. However, in some cases data comes in a categorical form with multiple variables.

The teacher tells the students to go home and look at their toy cars again, this time recording the color of the toy cars. Now there are two different variables - the color of the toy car, and the number of toy cars with a certain color. All together, the students have 80 toy cars. Here is the new data:

Red: 32 toy cars

Blue: 16 toy cars

Pink: 16 toy cars

Yellow: 8 toy cars

Green: 4 toy cars

Black: 2 toy cars

Multicolored: 2 toy cars

We can't use most measures of spread and center because the data is split into separate categories (different colors). We can analyze the data in other ways, though.

Mode

Mode is the only measure of center or spread that we can use to properly describe this data. Before, we said that mode is the number or numbers that appear most often in the data. For categorical data, it's the category or categories that appear most often in the data. Red appears 32 times, which is much more than any other category, so the mode is red.
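For categorical data, finding the mode means finding the category with the biggest count. Here is a small Python sketch using the color counts from this lesson. The names car_colors and mode_colors are made up for this example.

```python
# Color of toy car -> how many toy cars have that color.
car_colors = {"Red": 32, "Blue": 16, "Pink": 16, "Yellow": 8,
              "Green": 4, "Black": 2, "Multicolored": 2}

biggest_count = max(car_colors.values())
# Keep every color that reaches the biggest count, in case of a tie.
mode_colors = [color for color, count in car_colors.items() if count == biggest_count]

print(mode_colors, biggest_count)  # ['Red'] 32
```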

Relative Frequency Table

The relative frequency of a certain category is the ratio of the number of times a category is represented to the total number of pieces of data. A relative frequency table shows the relative frequency of each category. The ratio can be in decimal form or in percent form.

To find the ratio for each category, divide the number of times each category is represented by the total number of pieces of data. There are 80 toy cars in total, so we will divide each category by 80.

Red: 32/80 = 0.4

Blue: 16/80 = 0.2

Pink: 16/80 = 0.2

Yellow: 8/80 = 0.1

Green: 4/80 = 0.05

Black: 2/80 = 0.025

Multicolored: 2/80 = 0.025

Here is what the relative frequency table could look like:

You can also write these decimals in percent form. To turn a decimal into a percent, move the decimal point two places to the right. For example, 0.4 becomes 40% and 0.025 becomes 2.5%. If you need help on this, go to the lesson on percents.

In this new table, every color is shown as a percentage of the total cars.
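Here is a sketch of how both versions of the relative frequency table could be worked out in Python. The names car_colors, relative_frequency, and percent_of_total are made up for this example. Multiplying by 100 does the same job as moving the decimal point two places to the right.

```python
# Color of toy car -> how many toy cars have that color.
car_colors = {"Red": 32, "Blue": 16, "Pink": 16, "Yellow": 8,
              "Green": 4, "Black": 2, "Multicolored": 2}

total_cars = sum(car_colors.values())  # 80 toy cars in total

relative_frequency = {}  # decimal form
percent_of_total = {}    # percent form
for color, count in car_colors.items():
    relative_frequency[color] = count / total_cars
    percent_of_total[color] = count * 100 / total_cars

# Print one row per color: name, decimal form, percent form.
for color in car_colors:
    print(color, relative_frequency[color], str(percent_of_total[color]) + "%")
```

A good way to check your work is to add up the percent column. It should come to 100%.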

Percent Bar Graph

A percent bar graph also shows the relative frequency of each category, but it is all in one bar, and it is always in percent form.

As you can see, every color has its own segment that is some percentage of the bar. To find the percentage of a color, find the difference between the two percents the segment sits between on the bar.

If you wanted to find the relative frequency of blue cars, you would subtract 40% from 60% to find the difference between 40% and 60%, since the blue section is bordered by those two percents. The result is 20%, which we know is correct from our relative frequency table.

The data is placed on the bar graph in order of greatest relative frequency to least, but it could be in any order you want, like alphabetical by color name. As long as the sections fill up the right percentage of the bar graph, you are good to go.

Notice how only every 20% is labeled. Multiples of 20% are usually a good benchmark to measure by. It's okay that some segments don't line up with two or even one of these 20% labels - sometimes you will need to estimate the exact percentage.
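To see where each segment starts and ends on the bar, you can keep a running total of the percents. The Python sketch below does this for the toy car colors, ordered from greatest relative frequency to least. The names color_percents and segments are made up for this example.

```python
# Percent of all toy cars for each color, greatest to least.
color_percents = [("Red", 40.0), ("Blue", 20.0), ("Pink", 20.0), ("Yellow", 10.0),
                  ("Green", 5.0), ("Black", 2.5), ("Multicolored", 2.5)]

segments = {}  # color -> (where its segment starts, where it ends) on the bar
start = 0.0
for color, percent in color_percents:
    end = start + percent  # each segment begins where the last one ended
    segments[color] = (start, end)
    start = end

for color, (seg_start, seg_end) in segments.items():
    print(color, seg_start, seg_end, seg_end - seg_start)
```

Subtracting the start of a segment from its end gives back that color's percent, which is the same subtraction you do when reading the graph.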

For practicing vocabulary, go to:

https://quizlet.com/_6uctu6

Use this quiz to make sure you fully understand what you have just learned. Click "view score" to see explanations for any wrong answers. Click "submit another response" to try again.
