Stats text 03

03 Visualizing data

3.1 Graphs and Charts

The table below gives the FSM statistics bureau projections for the state populations in 2020. This data is updated every ten years. Work on the 2020 data is ongoing at the time of this writing.

State      Population 2020 (est)
Chuuk      49,509
Kosrae     6,732
Pohnpei    36,832
Yap        11,577

Circle or pie charts

In a circle chart the whole circle is 100%. Circle charts are used when the data adds up to a logical whole; for example, the state populations add up to the national population.

A pie chart of the state populations:
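The size of each slice in such a chart is the state's share of the national total. A minimal Python sketch of that arithmetic, using the populations from the table above (the variable names are illustrative and not part of the course materials):

```python
# State population projections for 2020, copied from the table above.
populations = {
    "Chuuk": 49509,
    "Kosrae": 6732,
    "Pohnpei": 36832,
    "Yap": 11577,
}

# The whole circle (100%) corresponds to the national total.
national_total = sum(populations.values())

# Each slice is the state's percentage share of that total.
shares = {
    state: 100 * count / national_total
    for state, count in populations.items()
}

for state, share in shares.items():
    print(f"{state}: {share:.1f}%")
```

The shares add up to 100%, which is what makes a circle chart a sensible choice for this data.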

Column charts

Column charts are also called bar graphs in some texts. The following is a column chart for the population of Kosrae. Note that the values after 2010 are projections based on the 2010 census.

Pareto chart

If a column chart is sorted so that the columns are in descending order, then it is called a Pareto chart. Descending order means the largest value is on the left and the values decrease as one moves to the right. Pareto charts are a useful way to convey rank order as well as numerical data. The following is the average score by high school on the essay section of a college placement test.
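The only step that separates a Pareto chart from an ordinary column chart is the sort. As a sketch of that step, the following Python snippet puts the state populations from section 3.1 into Pareto order. The state data is used here only because the school scores are not listed in the text, and the variable names are illustrative.

```python
# State population projections for 2020, from the table in section 3.1.
populations = {
    "Chuuk": 49509,
    "Kosrae": 6732,
    "Pohnpei": 36832,
    "Yap": 11577,
}

# Sort by value, largest first, so the tallest column lands on the left.
pareto_order = sorted(
    populations.items(),
    key=lambda item: item[1],
    reverse=True,
)

for state, count in pareto_order:
    print(f"{state:8s} {count:>7,}")
```

Plotting the columns in this order gives Chuuk, Pohnpei, Yap, Kosrae from left to right.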

Line graph

A line graph is a chart which plots data as a line. The horizontal axis is usually set up with equal intervals. Line graphs are not often used in this course and should not be confused with xy scatter graphs which are introduced in chapter four. Line charts, however, sometimes reveal patterns in data that are not as clearly seen in a column chart. The population column chart for Kosrae obscures the use of a growth model after 2010, a model that is not well supported by facts on the ground in Kosrae.

Note that the decennial censuses measured a decline in the population from 2000 to 2010. The forces that drove this decline have continued to act on the population in Kosrae, and inter-census surveys suggest that the population has continued to decline. That the census bureau chose to use a near-linear growth model post-2010 can be seen more clearly in this line chart. The dotted line is a continuation of the population loss rate seen from 2000 to 2010 and better agrees with current population estimates. The population loss is being driven by outmigration, primarily in search of employment and secondarily for educational opportunities.

XY Scatter graph

When you have two sets of continuous data (value versus value, no categories), use an xy graph. These will be covered in more detail in chapter four.

3.2 Histograms and Frequency Distributions

A distribution counts the number of elements of data in either a category or within a range of values. Plotting the count of the elements in each category or range as a column chart generates a chart called a histogram. The histogram shows the distribution of the data. The height of each column shows the frequency of an event. This distribution often provides insight into the data that the data itself does not reveal.

The ranges into which values are gathered are called bins, classes, or intervals. This text tends to use classes or bins to describe the ranges into which the data values are grouped.

Histograms at the nominal level of measurement

At the nominal level of measurement one can determine the frequency of elements in a category, such as students by location affiliation in a statistics course. The sum of the frequencies is the sample size. The relative frequency is the share of the whole held by a category. This is calculated by dividing the frequency by the sample size.
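The frequency and relative frequency calculations can be sketched in Python as follows. The list of affiliations is a small hypothetical sample invented for illustration. It is not data from the course.

```python
from collections import Counter

# Hypothetical sample: location affiliation of eight students.
affiliations = [
    "Pohnpei", "Chuuk", "Pohnpei", "Kosrae",
    "Yap", "Pohnpei", "Chuuk", "Pohnpei",
]

# Frequency: the number of elements in each category.
frequencies = Counter(affiliations)

# The sum of the frequencies is the sample size.
sample_size = sum(frequencies.values())

# Relative frequency: frequency divided by the sample size.
relative_frequencies = {
    category: count / sample_size
    for category, count in frequencies.items()
}

for category, count in frequencies.items():
    print(category, count, relative_frequencies[category])
```

The relative frequencies always sum to one, just as the frequencies always sum to the sample size.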

A frequency histogram is a column chart of the frequencies of the categories. In a traditional histogram there are no gaps between the columns. Removing the gaps between columns in Google Sheets is not directly possible at this time.

A chart of the relative frequencies is a relative frequency histogram. The relative heights of the columns are the same; only the scale on the y-axis changes.

3.3 Histogram charts and Frequency tables at the ratio level of measurement

Ratio level data is usually a continuous variable. The number of possible values cannot be counted. At the ratio level data is divided into intervals of equal width from the minimum value to the maximum value. The intervals are called classes by statisticians. In some spreadsheets these intervals are called bins. The intervals are called buckets in Google Sheets™.

Histogram chart

Google Sheets™ can automatically generate a histogram chart from raw data. The specific dialog boxes tend to change in terms of layout and new edit capabilities appear over time.

Pre-select the data range and from the Insert menu choose Chart. Depending on the data set, Google Sheets™ may automatically select a histogram as the default chart type. This usually occurs when the sample size is large enough to warrant use of a histogram. The following histogram is the distribution of the number of orange M&Ms in bags of peanut M&Ms candy. The data has a sample size of 46. Note that to make a histogram the data must be in a single vertical column on the spreadsheet.

Google Sheets™ will choose a number of buckets based on the square root of the sample size n. This rule works moderately well for the sample sizes usually encountered in an introductory statistics course. Note that the Google Sheets™ Android app cannot generate a histogram with more than ten buckets, which limits the sample size to roughly one hundred. 
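As a rough sketch of the square root rule, the Python function below estimates the number of buckets for a given sample size. The text does not say how Google Sheets™ rounds the square root, so rounding up is an assumption made here, and the function name is hypothetical.

```python
import math


def bucket_count(n):
    """Estimate the number of buckets as the square root of n.

    Rounding up is an assumption; the spreadsheet may round differently.
    """
    return math.ceil(math.sqrt(n))


# The orange M&M data set has a sample size of 46.
print(bucket_count(46))   # 7, since the square root of 46 is about 6.78

# Ten buckets corresponds to a sample size of about one hundred.
print(bucket_count(100))  # 10
```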

If Google Sheets™ does not produce a histogram by default, the histogram chart type can be selected if the data is numeric data.

The histogram chart type does have customization options. By changing the number of buckets and adjusting the horizontal axis Min and Max values, histograms with specific class widths can be generated. For the purposes of this course, however, the default histogram will provide insight into the shape of the distribution of the data. 

Frequency tables

There are occasions on which one would want to build a histogram using pre-specified classes. Doing this requires use of the FREQUENCY function. In this course as currently configured this material is optional.

Each bucket has a smallest value called the class lower limit. Each bucket has a largest value called a class upper limit. The number of data values in each bucket is called the frequency. Spreadsheets have a FREQUENCY function that uses the class upper limits to automatically count the frequencies for each bucket.
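To illustrate the counting that FREQUENCY performs, here is a minimal Python sketch. It assumes that a value equal to a class upper limit is counted in that class, and that a final extra count holds any values above the last upper limit, which is the usual spreadsheet behavior. The function name and the data values are hypothetical.

```python
from bisect import bisect_left


def frequency(data, upper_limits):
    """Count the data values falling in each class.

    A value belongs to the first class whose upper limit is greater
    than or equal to the value. The last count holds values that are
    above every upper limit.
    """
    limits = sorted(upper_limits)
    counts = [0] * (len(limits) + 1)
    for value in data:
        # Index of the first upper limit that is >= value.
        counts[bisect_left(limits, value)] += 1
    return counts


# Hypothetical data and class upper limits, for illustration only.
data = [2, 3, 4, 4, 5, 6, 7, 8, 9, 12]
upper_limits = [4, 6, 8, 10, 12]

print(frequency(data, upper_limits))  # [4, 2, 2, 1, 1, 0]
```

The counts add up to the sample size of ten, which is a useful check on any frequency table.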

To calculate the class upper limits the minimum and maximum value in a data set must be determined. Spreadsheets include functions to calculate the minimum value MIN and maximum value MAX in a data set.

=MIN(data)

=MAX(data)

The minimum and maximum are used to calculate the range. The width of each bucket is equal to the range divided by the number of desired buckets.
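The arithmetic can be sketched in Python as follows. The data values are hypothetical, chosen so the numbers come out evenly, and are not the orange M&M data. Taking each class upper limit to be the minimum plus a whole number of widths is one common convention and is assumed here.

```python
# Hypothetical data, for illustration only.
data = [2, 3, 4, 4, 5, 6, 7, 8, 9, 12]
number_of_buckets = 5

minimum = min(data)   # spreadsheet: =MIN(data)
maximum = max(data)   # spreadsheet: =MAX(data)

# Range is the maximum minus the minimum.
data_range = maximum - minimum

# Width is the range divided by the number of desired buckets.
width = data_range / number_of_buckets

# Each class upper limit is one width above the previous one,
# so the last upper limit equals the maximum.
upper_limits = [
    minimum + width * k
    for k in range(1, number_of_buckets + 1)
]

print(data_range, width)  # 10 2.0
print(upper_limits)       # [4.0, 6.0, 8.0, 10.0, 12.0]
```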


For the orange M&M data, determine the minimum and maximum. Calculate the range. For a five class (bucket) frequency table, divide the range by five to obtain the width. Use the table above to enter the class upper limits.