A summary table is a new table that, instead of containing all of the original data, contains statistics computed from that data. In preparation for the Summary Tables Tutorial, the following sections introduce important terminology and the data statistics that can be used in summary tables.
Summary Table - A table that summarizes information about some larger dataset. It is typically built by performing computations like sums, averages, and counts on higher-level groupings of the data. The intent is to condense lots of data into a form that is more useful, and easier to "see".
Pivot Table - The tool used by most spreadsheet programs to create a summary table.
Aggregation - A grouping of similar data within a dataset for the purposes of performing a calculation on items within each group. For example, if some dataset contained information about how many hours of television people watched and included their age, you could "aggregate the data by age" and compute the average hours watched for each age group. You could also "aggregate by hours of TV watched" and compute the average age for each number of hours.
Statistics can help us analyze data. For instance, consider a survey given to 1000 high school students asking them to rate a movie. By taking the average (also called the mean) rating given by students of each age from 14 through 18, we get five numbers:
average rating of 14 year olds: 3.24
average rating of 15 year olds: 3.89
average rating of 16 year olds: 4.56
average rating of 17 year olds: 4.65
average rating of 18 year olds: 4.90
These five numbers are much simpler to examine than all 1000 ratings when checking whether there is a trend of younger students liking the movie less than older students.
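Computing an average per group like this is exactly the kind of aggregation a pivot table performs. A minimal sketch in Python, using a few made-up (age, rating) pairs rather than the actual survey data:

```python
# Made-up sample of (age, rating) pairs - not the real survey data.
ratings = [(14, 3.0), (14, 3.5), (15, 4.0), (15, 3.8), (16, 4.5)]

# Aggregate by age: collect each age's ratings into its own group.
groups = {}
for age, rating in ratings:
    groups.setdefault(age, []).append(rating)

# Compute the mean rating for each age group.
averages = {age: sum(vals) / len(vals) for age, vals in groups.items()}
print(averages)
```

This produces one average per age group, which is the kind of summary shown in the list above.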
This chapter defines several data statistics that are appropriate for this course. There are many more data statistics that data scientists use, but those require advanced mathematics and statistics skills that are beyond this course. We will work with the basic statistics described in this chapter - which are still very powerful for analyzing data!
The maximum and minimum values of a field in a set of data can be important to establish the range that the data falls in. They can also help identify outliers - data values that may be invalid for your analysis.
It can be illustrative to sort data in a field before examining it. Sorting data can make it easier to spot trends, find duplicates, find the median (described below), and find outliers that are at the beginning or end of the sorted data.
One of the most widely used statistics is the mean, which is also less-formally called the average. The formula for the mean is to sum the values and then divide by number of values. If the values are
[23,26,49,49,57,64,66,78,82,92]
then the calculation of the mean is:
(23+26+49+49+57+64+66+78+82+92)/10 = 58.6
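The same calculation can be written in a couple of lines of Python:

```python
values = [23, 26, 49, 49, 57, 64, 66, 78, 82, 92]

# Mean: sum the values, then divide by how many there are.
mean = sum(values) / len(values)
print(mean)  # 58.6
```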
The problem with the mean is that it does not tell anything about how the values are distributed. Values that are very large or very small change the mean a lot. In statistics, these extreme values might be errors of measurement, but sometimes the population really does contain them. For example, suppose a room contains 10 people who make $10/day and 1 person who makes $1,000,000/day. The mean of the data is about $90,918/day. Even though it is the average amount, the mean in this case is not close to the amount any single person makes, and is probably useless.
The median is the middle item of the data. To find the median we sort the data from the smallest number to the largest number and then choose the number in the middle. If there is an even number of data values, there is no single number in the middle, so we take the two middle values and calculate their mean. In our example there are 10 items of data; the two middle ones are 57 and 64, so the median is (57+64)/2 = 60.5. For another example, like the income example presented for the mean, consider a room with 11 people who have incomes of $10, $20, $20, $40, $50, $60, $90, $90, $90, $100, and $1,000,000. The median is $60 because $60 is the middle (sixth) value. If the extreme value of $1,000,000 is ignored, the mean of the remaining ten incomes is $57. In this case, the median is close to the value obtained when the extreme value is thrown out. The median solves the problem of extreme values as described in the definition of mean above.
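The sort-then-pick-the-middle procedure can be sketched in Python; the standard library's `statistics.median` does the same thing in one call:

```python
import statistics

values = [23, 26, 49, 49, 57, 64, 66, 78, 82, 92]

# Manual median: sort, then average the two middle values (even count).
s = sorted(values)
mid = len(s) // 2
median = (s[mid - 1] + s[mid]) / 2
print(median)                     # 60.5

# The standard library gives the same result.
print(statistics.median(values))  # 60.5
```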
A simple count of how many times a value appears in a field can be useful in analyzing the data. For instance, a count of how many new babies are given the first name "Jennifer" in a year can help with analysis of the popularity of baby names.
A histogram shows a count of how many times values occur. Histograms use bins, which are ranges. For instance a histogram of ages might use the bins 0-9, 10-19, 20-29, 30-39, etc. The histogram of the age field represents how many age values from the data set are in each bin.
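Binning ages into ten-year ranges and counting each bin can be sketched with Python's `collections.Counter`; the ages below are made-up sample data:

```python
from collections import Counter

ages = [3, 12, 15, 21, 24, 24, 37, 38, 41, 59]  # made-up sample ages

# Assign each age to a ten-year bin: 0-9, 10-19, 20-29, ...
# (age // 10) * 10 gives the starting value of the bin.
counts = Counter((age // 10) * 10 for age in ages)

for start in sorted(counts):
    print(f"{start}-{start + 9}: {counts[start]}")
```

Each printed line is one bar of the histogram: the bin's range and how many values fall in it.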
The mode is the most frequent item of data. For example the most common letter in English is the letter "e". We would say that "e" is the mode of the distribution of the letters.
For example, if in a room there are 10 people with incomes of $10, $20, $20, $50, $60, $90, $90, $90, $100, and $1,000,000, the mode is $90 because $90 occurs three times and all other values occur fewer than three times.
There can be more than one mode. For example, if in a room there are 11 people with incomes of $10, $20, $20, $20, $50, $60, $90, $90, $90, $100, and $1,000,000, the modes are $20 and $90. This distribution is bi-modal - it has two modes. Bi-modality is very common and often indicates that the data is the combination of two different groups. For instance, the distribution of adult heights in the U.S. is bi-modal. This is because males and females have separate average heights of 1.763 m (5 ft 9 + 1⁄2 in) for men and 1.622 m (5 ft 4 in) for women. Both peaks are apparent when the two groups are combined.
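Python's standard library can find all modes of a dataset with `statistics.multimode`, which returns every value tied for the highest count. Using the bi-modal income example above:

```python
import statistics

incomes = [10, 20, 20, 20, 50, 60, 90, 90, 90, 100, 1_000_000]

# multimode returns every value that occurs most often -
# here both 20 and 90 appear three times.
print(statistics.multimode(incomes))  # [20, 90]
```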
Another thing we can say about a set of data is how spread out it is. A common way to describe the spread of a set of data is the standard deviation. If the standard deviation of a set of data is small, then most of the data is very close to the average. If the standard deviation is large, though, then a lot of the data is very different from the average.
If the data follows the common pattern called the normal distribution, then it is very useful to know the standard deviation. If the data follows this pattern (we would say the data is normally distributed), about 68 of every 100 pieces of data will be off the average by less than the standard deviation. Not only that, but about 95 of every 100 measurements will be off the average by less than two times the standard deviation, and about 997 in 1000 will be closer to the average than three standard deviations.
Calculating the standard deviation by hand involves a somewhat complicated formula. Fortunately, spreadsheet tools and the Python programming language have built-in functions that calculate the standard deviation when given the data values.
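For example, Python's `statistics` module provides two built-in functions, shown here on the same ten values used in the mean example. Which one to use depends on whether the values are the entire population or just a sample of it:

```python
import statistics

values = [23, 26, 49, 49, 57, 64, 66, 78, 82, 92]

# pstdev treats the values as the whole population;
# stdev treats them as a sample and divides by n - 1 instead of n.
print(round(statistics.pstdev(values), 2))  # 21.54
print(round(statistics.stdev(values), 2))   # 22.71
```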