
Bivariate Frequency Distribution

Describing Two Nominal Variables Using Numbers, Tables, and Graphs (Bivariate)

Describing Two Nominal Variables Using Numbers (Bivariate)

We found that we could “describe” a nominal variable, but could do little more with the information than just provide a “description.” However, one of the main purposes for statistics in Criminal Justice is to be able to predict. This is best exemplified by the prediction of a person’s “characteristic” or “trait.” That “characteristic” or “trait” is, of course, his or her score on the dependent variable. We use the information contained in the data to make that prediction. The goal is to try to improve that prediction by adding more information: the subject’s score on another variable.

We know from the previous example that students were classified in one of two academic standings: “A” students and Below “A” students. Can we predict the academic standing of a student randomly drawn from the population? This chapter shows how to make use of the information contained in collected data to improve our ability to predict.

We know that this example involves nominal data: the difference between males and females is described with respect to the number of “A” students or "below A" students. Therefore, there are now two nominal variables (gender = male/female, and score = "A"/"below A"). For simplicity, gender is considered the independent variable, and the count or number of occurrences of "below A" students is the dependent variable. However, when the numbers of subjects are given to you in narrative form, they can become confusing very quickly. Consider the example below:

Example: 500 students randomly sampled from XYZ University were asked to provide their cumulative GPAs. Two hundred were male. Of the 161 students who scored an "A" cumulative GPA, 6 were male. Of the 339 who scored "below A", 194 were male.

How do we make sense of this information? At first glance this narrative is difficult to digest: there are too many values for too many categories. We could just create two separate univariate frequency distributions, one for each variable, but that would not make the most of the additional information. We need to show how these two variables work together, so we combine the categories and report the frequencies of all four combinations (two variables, each with two categories, means four frequencies). Therefore, we need a tool to help us make sense of it. That is the job of a table: the Frequency Distribution.

Describing Two Nominal Variables Using Tables and Graphs (Bivariate)

If the information in the narrative above is presented in an organized table, it is much easier to understand.

                     Males    Females    Total
Count “A”                6        155    (161)
Count “Below A”        194        145    (339)
Total                (200)      (300)    (500)
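The narrative states only the male counts and the totals, so the female cells have to be obtained by subtraction. The following is a minimal sketch of that arithmetic in Python; the variable names are illustrative and are not part of the original example.

```python
# Figures stated directly in the narrative example.
total_students = 500
total_males = 200
total_a = 161
total_below_a = 339
males_a = 6
males_below_a = 194

# The female counts are not stated; they follow by subtraction.
total_females = total_students - total_males      # 500 - 200
females_a = total_a - males_a                     # 161 - 6
females_below_a = total_below_a - males_below_a   # 339 - 194

# Rows are score categories, columns are gender (hypothetical layout).
table = {
    "A": {"Male": males_a, "Female": females_a},
    "Below A": {"Male": males_below_a, "Female": females_below_a},
}

# Consistency checks: each column must add up to its stated total.
assert males_a + males_below_a == total_males
assert females_a + females_below_a == total_females

print(f"{'':<10}{'Males':>7}{'Females':>9}{'Total':>7}")
for score, row in table.items():
    print(f"{score:<10}{row['Male']:>7}{row['Female']:>9}{sum(row.values()):>7}")
print(f"{'Total':<10}{total_males:>7}{total_females:>9}{total_students:>7}")
```

Running the sketch reproduces the four cell counts and the row and column totals of the table.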

This is easier to understand because it is obvious that there are so few “A” students among men compared to women. To sort it out further, percentages are always reported in addition to the frequencies, because they clarify the description. This changes the “frequency distribution” into a “frequency / percentage” distribution.

                          Males    Females    Total
Count “A”                     6        155    (161)
Percentage “A”               3%        52%
Count “Below A”             194        145    (339)
Percentage “Below A”        97%        48%
Total                     (200)      (300)    (500)

Example of a requested cell percentage: Given that a randomly selected student is male, what is the probability that he is in the “A” category?

The answer is the cell percentage for males who got an “A”: 3%. (This is not the chance of drawing a male “A” student from all 500 students, which would be 6 / 500 = 1.2%.)

Read the percentages across the row: 3% of the 200 males received an "A", compared with 52% of the 300 females.

The procedure for calculating the percentages is to divide the number in each "cell" by the total number in the column. This produces the proportion, which is then multiplied by 100, which produces the percentage. This is how to calculate the percentages:

                        Count in     Total in
                        “Cell”       Category     Proportion     Percentage
Males Scoring "A":          6    /     200     =    .030         (.030 * 100) = 3%
Females Scoring "A":      155    /     300     =    .517         (.517 * 100) = 52%
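The same procedure (cell count divided by column total, then multiplied by 100) can be sketched in Python for all four cells. The dictionary layout and the helper name `column_percentage` are assumptions made for illustration.

```python
# Cell counts from the bivariate table: rows are scores, columns are gender.
counts = {
    "A": {"Male": 6, "Female": 155},
    "Below A": {"Male": 194, "Female": 145},
}

# Column totals: how many males and how many females are in the sample.
column_totals = {}
for row in counts.values():
    for gender, n in row.items():
        column_totals[gender] = column_totals.get(gender, 0) + n


def column_percentage(score, gender):
    """Divide the cell count by its column total, then multiply by 100."""
    proportion = counts[score][gender] / column_totals[gender]
    return proportion * 100


percentages = {
    score: {gender: column_percentage(score, gender) for gender in row}
    for score, row in counts.items()
}

for score, row in percentages.items():
    for gender, pct in row.items():
        print(f"{gender}s scoring {score}: {pct:.0f}%")
```

Because each percentage is taken within a gender column, the two percentages in a column add up to 100%, while the percentages across a row do not.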

By comparing the percentages across the rows (Male = 3%, Female = 52%), the information is easier to understand. The percentage table reveals that in proportion to the number of males in the sample, females actually scored better overall, as was predicted.

Analysis: Female "A" students outnumbered male "A" students in the sample; the percentage of all males who were "A" students was 3%, while 52% of all females were "A" students.

Making Predictions Based on Observations from Collected Data

The most powerful use of statistics for a Criminal Justice researcher is the power to predict. The more information we have, the better our prediction. That is why, when we collect data, we try to get as much information as possible, which means we utilize as many “variables” as possible. We are using a simple example to explain prediction: “A” students and Below “A” students. Is it possible to make a prediction about what a randomly selected student will score based only on the information available from one variable? Let’s look at the levels of information:

1st level: We know that all students are in one of two categories -- “A” and Below “A.” That means if we were to guess based on that information alone, there is a 50% chance of randomly drawing an “A” student, and a 50% chance of drawing a Below “A” student (two categories, one score, so 1/2 = .5 = 50%).

2nd level: The data contain the frequencies of each level. Knowing that “A” students = 161 and Below “A” students = 339, we can compute the probability of getting an “A” student: 161 / 500 = .322 = 32%, with a 68% chance of getting a Below “A” student. (We learn a lot more about how to compute probabilities in future chapters.)

In other words, if only the dependent variable is examined, how much information is available about whether a student gets an "A" or not? Only the number of "A" students (161) versus the number of "below A" students (339). In this case it seems that a student is more likely to score "below A" than "A", because "below A" is the “modal category” (the category with the highest frequency).

Therefore, if you were asked to draw a person from the sample and predict what score that person has, there are two choices: "A" and "below A". Since we have some information from the data, we know that a prediction of "below A” would be more often correct than incorrect. If it were predicted that all 500 students would score "below A", that choice would be erroneous 161 times (out of 500); predicting "A" for all 500 would be erroneous 339 times. So, the best choice for prediction would be the modal category (the one that minimizes the chance of error in prediction): Below "A".

How many errors would be made if it were assumed that all 500 students scored in the modal category of "below A"?

----------------------------------------------------------------------

161, meaning that there were 161 people who scored an "A".

What percentage of errors would that be?

----------------------------------------------------------------------

Since there are 161 errors out of a possible 500 errors, the percentage would be 161 / 500 = 32%.
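The modal-category prediction and its error count can be sketched as follows. This is only an illustration of the reasoning above; the variable names are hypothetical.

```python
# Univariate frequency distribution of the dependent variable ("Score").
score_counts = {"A": 161, "Below A": 339}

n = sum(score_counts.values())  # total students in the sample

# The modal category is the one with the highest frequency.
modal_category = max(score_counts, key=score_counts.get)

# Predicting the modal category for everyone misses every student
# who is not in that category.
errors = n - score_counts[modal_category]
error_rate = errors / n

print(f"Best single prediction: {modal_category}")
print(f"Errors: {errors} out of {n} ({error_rate * 100:.0f}%)")
```

Swapping the prediction to "A" would instead miss all 339 "below A" students, which is why the modal category is the better guess.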

3rd level: Since we know that Criminal Justice researchers are striving for an even more accurate prediction of what a randomly drawn student will score, we look for more information in the data: another variable. We theorize that, since male college students typically party non-stop, and females study judiciously, females probably score higher in cumulative GPA. The sample is re-examined to discover if knowledge about an additional variable, gender, improves the ability to predict how a given student will score.

Therefore, it is theorized: if the gender of the individual is known, the prediction would be more informed, knowing that females are more likely to be "A" students than males (owing to their superior study habits). For example, if a person is drawn from the sample, and the gender is unknown to the researcher, her or his best prediction would be "Below A" for the grade. If it were known that the randomly selected person were a female, it would be tempting to change that prediction, since it is theorized that the likelihood of a female scoring "below A" is less than a male scoring "below A". The prediction changes with the addition of new information.

1st Level (Prior to collecting the data): We only know that we have two levels of a nominal variable. What is the only basis upon which to make a prediction of which category any given subject will fall?

------------------------------------------------------------------------------

A subject can only be in one category, and there are two possible categories; therefore the best prediction is a guess, leaving a 50% error rate.

2nd Level (having knowledge about only one variable -- the dependent variable): In a univariate frequency distribution with two levels of a nominal variable, what is the only basis upon which to make a prediction of which category any given subject will fall?

-------------------------------------------------------------------------------

The modal category (the category with the greatest count or number of frequencies) because that category would minimize the possibility of error in prediction.

3rd Level (adding information from the independent variable): In a bivariate distribution with a dependent variable and an independent variable, what are the grounds upon which to make a prediction of which category any given subject will fall on the dependent variable?

-------------------------------------------------------------------------------

On the modal category of the dependent variable within the subject's category of the independent variable (for example, the modal score among females if the subject is female). If the two variables are associated, the possibility of error will be reduced with the addition of a new variable.

What is the dependent variable in the example above?

----------------------------------------------------------------------

The count (number of frequencies or occurrences) in the "score" variable; in this case, the number of "below A" students, because it is the modal category.

What is the independent variable in the example above?

----------------------------------------------------------------------

Gender (with two categories -- Male and Female)

How is the dependent variable determined when data are collected on two nominal variables (remember, two variables = a bivariate configuration)? The dependent variable is, as was described earlier, the "effect" of the treatment. The "cause" is the treatment. In this example, each participant is "treated" with one sex or the other. Therefore, we proceed as if gender "causes" or determines the score. The treatment that determines whether or not a student is an "A" student is gender.

1. An example of count data for a nominal variable was given. The variable is "Score". Of principal concern is the number of "below A" students at XYZ State University. This variable is regarded as a nominal variable because it only has two levels, and a person can only be in one category ("A" or "below A"), not both.

2. The count in the variable "Score" represents a Univariate distribution. The count in the "below A" category is considered to be the dependent variable as it is the level of the variable that is of principal concern to the research community and is the modal category.

3. The category “A” or "below A" is (in effect) the independent variable.

4. Using only the modal category as a predictor of the dependent variable produces erroneous predictions 32% of the time.

5. Another independent variable is added in the interest of improving the accuracy of prediction. Gender is theorized to be associated with student's score, wherein females are believed to score in the "A" category more frequently than males.

6. The bottom line is this: If two nominal variables are present, the dependent variable will be the count in the category of interest in one of the two variables, and the independent variable(s) are actually both nominal variables.

Operationalizing the Study and Computing Lambda

Before getting into the actual process of operationalization, let’s think about the “statistical model” for a minute. Constructing a model showing the association between nominal dependent and independent variables is very easy. Since the mean is the focal point of the statistical model, and the computation of the mean is impossible in the presence of only nominal variables, the statistical model discussed earlier must be modified by removing the population mean (µ). In the model below, the symbol µ is reused to stand for the level of the independent variable.

Non-parametric model: Ŷ = µ + e

Ŷ = The dependent variable, or “outcome.” It is the score that a member of the population has on the nominal variable. It is what is “affected” by the treatment.

µ = The independent variable, or “treatment.” It is the level of the independent variable.

e = error -- incorrect predictions or misses in both variables.

Let’s run through the procedure for operationalization:

1. How is the population defined and to what subjects in the sample are the questionnaires being distributed? Population = University students. Sample = a random sample of 500 students at XYZ State University. What are the two questions being asked? (1) What was your score in GPA (“A” or “Below A”), and (2) Are you a male or a female?

2. What is being measured as the dependent variable? The count of students in a particular category

3. What is the treatment, or independent variable(s)? Score -- “A” or “Below A” and gender -- male or female

4. What is the type of data for each variable?

Dependent variable = count (in the “A” or “Below A” category) with two nominal independent variables: what score (“A” or “Below A”), and what gender (male or female). That makes this a “count-nominal-nominal” data-type combination.

5. What type of study is it and what procedure is appropriate based on the study design? Differences study -- The Two-way Chi-square test because there are two nominal variables and the difference in the “count” of the category of principal concern is being compared among the two nominal variables.

6. Construct the model: Ŷ = µ + e

Ŷ = Predicted outcome for a given student based on the modal category: For females it is “A,” for males it is “Below A.”

µ = The treatment that student received: Male or Female

e = Anything that affects the outcome “A” or “below A” that is not associated with gender. Could be IQ, study time, parents’ education, etc.

7. What is the theory? Females are more likely to be “A” students than males

8. What is the null hypothesis? There is not a statistically significant difference in “A” students among males and females between the observed and the expected count.

9. (We cannot perform this step yet) Test the Null and Report the results: (We will either Reject or Retain the null hypothesis, and report the “test statistic” value, degrees of freedom, and the probability.)

10. (We cannot perform this step yet) Provide the Analysis: (We will report the findings of the study. State if the data supports the theory, and the effect size, which is the “Magnitude of Effect” (MOE) if any.)

It is clear that gender has some association with whether or not a student gets an "A" since a larger percentage of females were "A" students, but how strong is the association?

Remember, the purpose of adding the independent variable was to improve prediction and thereby reduce the errors. Lambda shows by how much we reduced the “error” (it conveys the proportional reduction of error). How is the error being reduced? As previously mentioned, if only the dependent variable is examined, the only information available about whether a student gets an "A" or not is the number of "A" students (161) versus the number of "below A" students (339). A random individual plucked from the population would be assumed to be a "below A” student, since “below A” is the modal category. However, knowing that females score higher than males, it is now possible to make a more informed prediction based on gender, which is the purpose of adding an additional variable to the prediction process.

By examining the bivariate table above, it is known that females are more likely to be "A" students than males. By how much did that information reduce the errors? Without the independent variable gender there were 161 errors (32%). Using gender as a predictor there are only 151 errors. How do we know that? Where did the number 151 come from?

Correct predictions can be thought of as "hits". Errors can be thought of as "misses." The univariate error count is the total number of possible “hits” minus the number of participants who were correctly identified (actual hits). This produces the number of misses (errors). In the example, it was the total number of students (total possible hits = 500) minus the number of students scoring "below A" (actual hits = 339): 500 - 339 = 161. There were 161 misses (errors) using only "Score" as a predictor. This is referred to as "Error 1", symbolized as E₁.

E₁: 500 - 339 = 161

Let’s look at the females. Which category did most females fall in, “A” or Below “A”? Using the modal category reveals that more females score an “A” than “below A.” Based on that information, our best prediction of “score” among only females is “A.” Therefore, females are more likely to score “A” than Below “A.” If all females (n = 300) are theorized to score "A", all those who didn't score "A" are termed “misses.” Possible “hits” (300) minus actual “hits” (155) equals “misses” (errors): 145.

300 - 155 = 145

All males (n = 200) are theorized to score "Below A", because that is the modal category among males. Therefore, the best prediction is for 200 “Below A” “hits.” All those who didn't score "below A" are “misses.” Possible “hits” (200) minus actual “hits” (194) equals “misses” (errors): 6.

200 - 194 = 6

Male misses (6) plus female misses (145) are the total number of errors (151). This is referred to as "Error 2", symbolized as E₂.

E₂: 6 + 145 = 151
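The E₂ count can be sketched in Python by predicting the modal score within each gender and adding up the misses. The dictionary layout and names are assumptions chosen for illustration.

```python
# Counts of each score within each gender, from the bivariate table.
counts_by_gender = {
    "Male": {"A": 6, "Below A": 194},
    "Female": {"A": 155, "Below A": 145},
}

misses = {}
for gender, scores in counts_by_gender.items():
    # Best prediction for this gender is its own modal category.
    modal = max(scores, key=scores.get)
    possible_hits = sum(scores.values())
    actual_hits = scores[modal]
    misses[gender] = possible_hits - actual_hits
    print(f"{gender}: predict {modal!r}, misses = {misses[gender]}")

# E2 is the total number of misses when gender is used as a predictor.
e2 = sum(misses.values())
print(f"E2 = {e2}")
```

The sketch predicts "Below A" for males and "A" for females, matching the modal categories described above.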

If the number for E₂ (the number of errors using gender and score as predictors, which is 151) is less than the number for E₁ (the number of errors using only score as a predictor, which is 161), then there is a reduction of error. This reduction in error, expressed as a proportion, is called Lambda. Lambda is symbolized as λ. The formula for computing Lambda is:

λ = (E₁ - E₂) / E₁

λ = (161 - 151) / 161 = .062

λ = (10) / 161 = .062

λ = .062

Multiplying Lambda by 100 yields the percentage.

λ = .062 * 100

λ = 6%
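Putting the pieces together, Lambda can be computed directly from the bivariate table. The function below is a minimal sketch under the assumption that the table is given as counts of the dependent variable within each category of the independent variable; the function name is hypothetical.

```python
def proportional_reduction_in_error(table):
    """Return (E1, E2, lambda) for a table of the form
    {independent_category: {dependent_category: count}}."""
    # Marginal totals of the dependent variable (ignoring the predictor).
    marginals = {}
    for scores in table.values():
        for category, count in scores.items():
            marginals[category] = marginals.get(category, 0) + count

    n = sum(marginals.values())

    # E1: misses when everyone is predicted to be in the overall modal category.
    e1 = n - max(marginals.values())

    # E2: misses when each group is predicted to be in its own modal category.
    e2 = sum(sum(scores.values()) - max(scores.values())
             for scores in table.values())

    return e1, e2, (e1 - e2) / e1


table = {
    "Male": {"A": 6, "Below A": 194},
    "Female": {"A": 155, "Below A": 145},
}

e1, e2, lam = proportional_reduction_in_error(table)
print(f"E1 = {e1}, E2 = {e2}, lambda = {lam:.3f} ({lam * 100:.0f}%)")
```

The sketch does not guard against E₁ = 0 (every subject in one category), in which case the formula is undefined.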

Social Science researchers use a single number (a statistic) to describe the association between two variables. That statistic is often referred to as "Magnitude of Effect" (MOE). The MOE used when both variables are nominal is Lambda.

Lambda = (E₁ - E₂) / E₁

Using gender as a predictor reduced the number of errors by only 6.2%. What is to be concluded from this? Owing to the weak Lambda, female study habits are yielding only moderately impressive results.