Q2 Knowledge Base -- Univariate Frequency Distribution

Describing One Nominal Variable Using Numbers, Tables, and Graphs (Univariate)

Learning Goals

1. To understand the nature of nominal variables: each level is only a category to which a person or subject belongs.

2. To learn to describe (quantify or enumerate) a nominal variable by providing the count and computing proportions and percentages for each level of a nominal variable.

3. To be able to identify the independent and dependent variables when only nominal variables are present.

4. To be able to construct a univariate frequency and percentage distribution for a nominal variable.

5. To learn to make predictions based on observations from collected data using one nominal variable.

One Nominal Variable (Univariate)

Metric variables convey how much of a trait or characteristic a person or subject possesses. A nominal variable, by contrast, typically has two or more levels of a "fixed" characteristic or trait such as gender (male or female) or party affiliation (Republican, Democrat, or Independent). The levels of the variable are mutually exclusive, meaning a person or subject can belong to only one category of the variable.

A nominal variable is described by "enumerating" or "quantifying" it. The enumeration of a nominal variable occurs in two ways: (1) by counting how many subjects are in a level or category, and (2) by computing what percentage of the sample is in a given category. The number of subjects in a level of the nominal variable is sometimes described as the "count" of that level, which gives rise to what some researchers refer to as "count data." Unlike with metric variables, the issue of interest is not explaining “why subjects deviate from the mean,” because there is no mean. Instead, the issue is “how many subjects are in this category (what is the count) and why.”

Example: To illustrate count data for a nominal variable with two levels, we examine a study intended to discover how common it is to find an “A” student among all undergraduate students at XYZ State University. We define an “A” student as one who has a cumulative GPA of 3.50 or better. Therefore, we have created a variable with two categories (or levels): students who score "A" (those with a GPA of 3.50 or higher), and students who score "B" and below (those with a GPA of 3.49 or lower).

We take a random sample of students and divide them into two groups, based on their so-called “treatment”: Level 1 is termed “A” student (GPA of 3.50 or higher) and Level 2 is termed “Below A” student (GPA of 3.49 or lower). In this example, a student is scored as either an "A" student or a “Below A” student. In this way, the dependent variable is nominal with two levels, since a person is either in one level or the other, but not both.

In actuality, we have taken a metric variable, GPA, which typically ranges from around .01 to 4.0, and transformed it into a nominal variable with two levels. Rather than thinking of grades as ranging from "A" to "F", the researcher has collapsed the categories into two: "A" student (GPA of 3.50 or greater) or "below A" student (GPA of 3.49 or below). Therefore, students are in one category or the other. This study uses nominal data: the difference between the number of “A” students and the number of students scoring "below A" is examined. Naturally, it is no longer possible or practical to compute the descriptive statistics we typically associate with a metric variable, such as a “mean” or “standard deviation.”
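The collapsing of GPA into two levels can be sketched in a few lines of Python. This is only an illustration: the function name `classify`, the constant `A_CUTOFF`, and the six GPA values are hypothetical and are not data from the study. Only the 3.50 cutoff comes from the text.

```python
from collections import Counter

# Cutoff stated in the text: a GPA of 3.50 or higher counts as an "A" student.
A_CUTOFF = 3.50


def classify(gpa: float) -> str:
    """Collapse a metric GPA into one of two mutually exclusive nominal levels."""
    return "A" if gpa >= A_CUTOFF else "below A"


# Hypothetical GPAs, invented purely for illustration.
sample_gpas = [3.92, 2.75, 3.50, 3.49, 1.80, 3.10]

# Each student lands in exactly one level; the count per level is the "count data".
levels = [classify(gpa) for gpa in sample_gpas]
counts = Counter(levels)

print(counts["A"], counts["below A"])
```

Once the GPAs are classified, only the level labels and their counts remain, which is why a mean or standard deviation of the new variable no longer makes sense.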

Because this study has only one variable, the procedure for operationalizing it is simplified. When there is only one (or two) nominal variable(s) present, the data are not “parametric,” meaning that “parameters” cannot be estimated (the mean, standard deviation, minimum, maximum, range). This study uses “non-parametric” data. Therefore, the dependent variable becomes the “count” in a given category, and the null hypothesis compares the observed count to the expected count.

Describing One Nominal Variable Using Numbers (Univariate)

Using numbers to describe nominal variables is what we are doing when the variables are quantified or enumerated. This can happen in three ways:

(1) Provide the count. The count is the “number of frequencies” or “number of occurrences” in a particular category of a nominal variable. Example: Of the 500 students last year, 161 students scored an "A" and 339 scored a "B" or below.

(2) Provide the proportion. It is the decimal value showing the count in one level in relation to the total. But this is not as easy to comprehend as the percentage.

(3) Provide the percentages, because they are the easiest to understand. Calculating percentages is easy. A percentage is the number of frequencies in each level of the variable relative to the total frequencies, expressed in units of 100. Example: Of the 500 students taken from a sample at XYZ State University last year, 32% scored an "A", and 68% scored "below A".

To calculate percentages when the count of each level of the nominal variable is known, divide the count in each level by the total number in the sample (which yields the proportion), then multiply by 100 and attach the percent (%) sign (which yields the percentage, typically rounded off to the nearest whole percent):

Category     Count in Category   Proportion          Percentage
"A"          161                 161 / 500 = .322    (.322 * 100) = 32%
"below A"    339                 339 / 500 = .678    (.678 * 100) = 68%

Note: Although proportions are carried out to 3 decimal places, percentages for nominal variables are typically rounded off to the whole percentage point. Therefore, 67.8% is most often expressed as 68%.
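The arithmetic above can be checked with a short Python sketch. The helper name `describe_level` is hypothetical; the counts (161 and 339 out of 500) are the ones given in the example. Note that Python's built-in `round` rounds exact halves to the nearest even number, which does not affect these values.

```python
def describe_level(count: int, total: int) -> tuple[float, int]:
    """Return (proportion to 3 decimals, percentage to the nearest whole percent)."""
    proportion = round(count / total, 3)
    percentage = round(100 * count / total)
    return proportion, percentage


# Counts taken from the example in the text.
counts = {"A": 161, "below A": 339}
total = sum(counts.values())  # 500

for level, count in counts.items():
    proportion, percentage = describe_level(count, total)
    print(f"{level}: count={count}, proportion={proportion:.3f}, percentage={percentage}%")
```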

Describing One Nominal Variable Using a Table (Univariate)

Think about this situation. Someone is talking to you about the number of “A” students at XYZ University; they tell you the total number of students and the number of “A” students (of the 500 students who took statistics last year, 161 students scored an “A”). That information is not as easy to “digest” as when it is expressed as a percentage.

To make it more understandable, we expressed the numbers as percentages, which “standardized” the numbers into easily understood units of 100. However, to further make sense out of the numbers, it is possible to use a table, which “organizes” the numbers. This table is called a frequency distribution. Since it has one variable, it is known as a “univariate frequency distribution” (univariate means "one variable").

           (Column 1)   (Column 2)    (Column 3)   (Column 4)
           Labels       Frequencies   Proportion   Percent
(Row 1)    “A”          161           .322         32%
(Row 2)    Below “A”    339           .678         68%
(Row 3)    Total        500

In this example the variable is "score", divided into either "A" or "below A". The formula for calculating the percentages is to divide the number in each level or category by the total number in the sample (which produces the proportion), then to multiply the proportion by 100 (which produces the percentage).

1. We use numbers to describe variables, even nominal variables, because they have a number (or frequency) of participants in each category.

2. Frequencies can be clumsy, so we transform the frequencies into easily understood “standardized” units: percentages.

3. The bottom line is this: Instead of just “talking” or “writing” about the numbers in a sentence, we further organize them by placing both the frequencies and the percentages in a table called a univariate frequency distribution.
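As a minimal sketch, the univariate frequency distribution described in this section could be assembled in Python as follows. The function names `frequency_distribution` and `format_table` are hypothetical, and the column widths are an arbitrary layout choice; the counts are the ones from the example.

```python
def frequency_distribution(counts: dict[str, int]):
    """Build rows of (label, frequency, proportion, whole percent) plus the total."""
    total = sum(counts.values())
    rows = [
        (label, n, n / total, round(100 * n / total))
        for label, n in counts.items()
    ]
    return rows, total


def format_table(counts: dict[str, int]) -> str:
    """Render the distribution as plain text with one row per level and a total row."""
    rows, total = frequency_distribution(counts)
    lines = [f"{'Labels':<12}{'Frequencies':>12}{'Proportion':>12}{'Percent':>9}"]
    for label, n, proportion, percent in rows:
        lines.append(f"{label:<12}{n:>12}{proportion:>12.3f}{percent:>8}%")
    lines.append(f"{'Total':<12}{total:>12}")
    return "\n".join(lines)


# Counts taken from the example in the text.
print(format_table({"A": 161, "Below A": 339}))
```

Keeping the computation (`frequency_distribution`) separate from the layout (`format_table`) means the same rows could later feed a pie chart or bar graph without recomputing anything.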


1. We took raw data on a single nominal variable and described it by using frequencies. To put those frequencies into a more familiar format, we turned them into percentages (all numbers are more easily understood when “standardized,” because no matter what the variable is, we can all understand units of 100). Finally, by organizing the numbers in a table, we created a “Frequency Distribution.”

2. Then we used graphs to give a fuller visual meaning: A “pie chart” and a “bar graph.” This gives us a “photograph” of the data. With graphs, it doesn’t matter how seemingly unordered the frequencies are in their raw form; the brain can easily relate to a picture showing how the proportions in each category are distributed.

3. The bottom line is this: Numbers are confusing enough without having to mentally compare two categories that have seemingly haphazard values. That is why percentages, and then ultimately graphs, make raw data easy to understand.

In a univariate distribution, the "count" of one level of the variable becomes the dependent variable. It will usually be the level we are most interested in (or the level of “principal concern” to the researcher). Often, however, it will be the “modal” category (meaning the category with the highest count). In this case it is the number of "below A" students, which is both the category of principal concern and the modal category.
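Finding the modal category amounts to picking the level with the highest count. A small Python sketch, using the counts from the example (the variable names are illustrative only):

```python
# Counts taken from the example in the text.
counts = {"A": 161, "below A": 339}

# The modal category is the level whose count is largest.
modal_level = max(counts, key=counts.get)

print(modal_level, counts[modal_level])
```

If two levels tied for the highest count, `max` would return whichever appears first in the dictionary, so a tie would need separate handling.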

The dependent variable is the “count,” but what is the count dependent on? It is dependent on how many are in a given category. Since there is only one variable and it has two levels, it is possible for the sake of clarity, and for the sake of definition, to visualize the dependent variable as the “count,” and the independent variable as category (score: “A” or “below A”).

Note that even though there is only one variable, there are in essence still both a dependent and an independent variable, because the count of "below A" students is dependent on how many students score in the "A" category. Therefore, statisticians refer to the “count” in one category as the dependent variable, and the “levels” (categories) as the independent variable. Think of the count in the level of principal concern as the dependent variable. In other words, what are we interested in? The number of "below A" students. The researcher wants to find out why they are "below A" students.

How the dependent variable is determined when there is only one nominal variable: The dependent variable is thought of as the "effect" of the treatment. The "cause" is the treatment, or independent variable. In this example, each participant is "treated" with a score, either a score of "A," or a score of “below A.” Therefore, what score a participant received "causes" or determines the count in the other category.