Counting for Categorical Data

Understanding frequencies

You can obtain the frequency of each value of a categorical variable in the dataset, for both the predictor variables and the outcome, by using the following code.

We will see the number of players for each position.
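As a minimal sketch of this kind of count: the lesson's player dataset is not reproduced here, so the small DataFrame below is a made-up stand-in, and only the Position column name is taken from the text.

```python
import pandas as pd

# Hypothetical toy data standing in for the lesson's player dataset.
players = pd.DataFrame({
    "Position": ["Catcher", "Relief Pitcher", "Starting Pitcher",
                 "Relief Pitcher", "Catcher", "Relief Pitcher"],
})

# value_counts() returns how many rows fall into each category,
# sorted from the most frequent category to the least frequent.
position_counts = players["Position"].value_counts()
print(position_counts)
```

Passing normalize=True to value_counts() would return proportions instead of raw counts.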

For a numerical variable, we can use binning to transform it into a categorical one. Let's divide Height into 5 bins and see how many players fall into each range.
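One way to sketch the binning step is with pandas.cut. The heights below are invented for illustration (the real dataset is not shown here), and precision=1 is just an example setting.

```python
import pandas as pd

# Hypothetical heights; the lesson's actual data is not reproduced here.
players = pd.DataFrame({"Height": [68, 70, 71, 72, 73, 74, 75, 76, 78, 80]})

# bins=5 splits the observed range of Height into 5 equal-width intervals.
# precision controls how many decimal places appear in the interval labels.
height_bins = pd.cut(players["Height"], bins=5, precision=1)

# Count the players in each interval, listed from the lowest range to the highest.
bin_counts = height_bins.value_counts().sort_index()
print(bin_counts)
```

Because the bins are equal in width rather than equal in size, some ranges may hold many more players than others.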

> The precision argument sets the number of decimal places used in the bin labels.

Creating contingency tables

By matching different categorical frequency distributions, you can display the relationship between qualitative variables. The pandas.crosstab function can match variables or groups of variables, helping to locate possible data structures or relationships.

Here, we will look at the relationship between a player's Position and Height.
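A contingency table of this kind could be built roughly as follows. The data is a hypothetical toy sample, the choice of 3 height bins is arbitrary, and display.max_columns is only an assumption about which option the set_option() call refers to.

```python
import pandas as pd

# Assumed display option; uncomment to show every column of a wide table.
# pd.set_option("display.max_columns", None)

# Hypothetical toy data standing in for the lesson's player dataset.
players = pd.DataFrame({
    "Position": ["Catcher", "Catcher", "Shortstop", "Shortstop",
                 "Relief Pitcher", "Relief Pitcher",
                 "Starting Pitcher", "Starting Pitcher"],
    "Height": [69, 71, 70, 73, 76, 78, 74, 77],
})

# Turn Height into a categorical variable first, then cross it with Position.
height_bins = pd.cut(players["Height"], bins=3, precision=1)

# Rows are positions, columns are height ranges, cells are player counts.
table = pd.crosstab(players["Position"], height_bins)
print(table)
```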

> To be able to see the whole data in the output, uncomment the set_option() line.

However, the raw counts in a contingency table are not very helpful without knowing the proportion that each cell represents. We can add the extra argument normalize to see those proportions.

normalize=True: the whole table sums to 100%
normalize='columns': each column sums to 100%
normalize='index': each row sums to 100%
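The three settings can be compared side by side on a small example. The DataFrame and its HeightGroup column are hypothetical, chosen only to make the sums easy to check.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns.
df = pd.DataFrame({
    "Position": ["Catcher", "Catcher", "Pitcher", "Pitcher", "Pitcher"],
    "HeightGroup": ["short", "tall", "tall", "tall", "short"],
})

# Every cell is a share of the whole table.
overall = pd.crosstab(df["Position"], df["HeightGroup"], normalize=True)

# Every cell is a share of its column.
by_column = pd.crosstab(df["Position"], df["HeightGroup"], normalize="columns")

# Every cell is a share of its row.
by_row = pd.crosstab(df["Position"], df["HeightGroup"], normalize="index")

print(overall.to_numpy().sum())  # total over the whole table
print(by_column.sum(axis=0))     # one total per column
print(by_row.sum(axis=1))        # one total per row
```

Note that crosstab returns fractions between 0 and 1, so "100%" corresponds to a sum of 1.0 until the result is multiplied by 100.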

Say we want to compare which positions require taller players. We set normalize='columns', so that each height range sums to 100%. The *100 at the end converts the proportions into percentages, which are easier to read.
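A sketch of that call, again on invented toy data rather than the lesson's dataset, with 3 height bins chosen arbitrarily:

```python
import pandas as pd

# Hypothetical toy data standing in for the lesson's player dataset.
players = pd.DataFrame({
    "Position": ["Catcher", "Catcher", "Shortstop", "Shortstop",
                 "Relief Pitcher", "Relief Pitcher",
                 "Starting Pitcher", "Starting Pitcher"],
    "Height": [69, 71, 70, 73, 76, 78, 74, 77],
})

height_bins = pd.cut(players["Height"], bins=3, precision=1)

# normalize="columns" makes each height range sum to 1.0;
# multiplying by 100 expresses the shares as percentages.
pct = pd.crosstab(players["Position"], height_bins, normalize="columns") * 100
print(pct.round(1))
```

Reading down the last column then shows how the tallest height range is split across positions.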

> The data shows that most of the tall players (in the last column) are Relief Pitchers or Starting Pitchers. We may conclude from this data that taller players tend to be pitchers.


Exercise 3.2

Checking house prices
Using homes.csv, try to find out the following:

1. Find the number of houses for each number of bathrooms. Save that dataframe in a variable called num_bathroom.
2. Divide the area (acres) of the houses into 7 bins using cut. Save that dataframe in a variable called num_house_in_area. Then, show the number of houses in each bin.
3. Using the "Baths" column and num_house_in_area, create a contingency table to see the relationship between the number of bathrooms and the area of the house.
4. Repeat step 3, but show percentages instead of counts.
