This exercise is based on the FiveThirtyEight article "Science Isn't Broken: It's just a hell of a lot harder than we give it credit for."
Before beginning, download the dataset and import it into RStudio.
The point of the dataset is to answer the question:
To do this, we will need an inference test. One option is a t-test: separate the cards earned by darker-skinned and lighter-skinned players and compare the two groups. Another is a chi-squared test of dark-light vs. manyCards-notManyCards. Either way, the dataset isn't set up well for it right now.
So, we are going to need to create some new columns. This is called feature creation. We need to do it here because the dataset has no useful categorical variables, so we will use ranges of the numerical variables to create categories for t-tests and chi-squared tests. Be careful with this: it's easy to fiddle with the categories until you get something that is significant! In general, though, feature creation is a useful tool when analyzing a dataset.
Here is how to make a new column in a dataset. We are going to make a new column that adds up the total number of cards that a player gets. The first step is to initialize the column by putting NAs in a brand new column name (totalCards doesn't exist yet in our dataset!).
soccerCards$totalCards <- NA
Then you can do math with other columns to create a new column.
soccerCards$totalCards <- soccerCards$yellowCards + soccerCards$yellowReds + soccerCards$redCards
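If you want to try the column math before touching the real data, here is a minimal sketch on a tiny made-up data frame. The name toyCards and all of its values are invented for illustration; only the column names come from the dataset. It also suggests that the NA step is optional (assigning to a new column name creates it) and that rowSums() is one alternative way to get the same totals.

```r
# Toy stand-in for the real dataset; the values are invented for illustration.
toyCards <- data.frame(
  yellowCards = c(0, 2, 1, 3),
  yellowReds  = c(0, 0, 1, 0),
  redCards    = c(0, 1, 0, 0)
)

# Assigning to a new column name creates the column directly.
toyCards$totalCards <- toyCards$yellowCards +
  toyCards$yellowReds + toyCards$redCards

# rowSums() adds the same three columns row by row.
totalCheck <- rowSums(toyCards[, c("yellowCards", "yellowReds", "redCards")])

print(toyCards$totalCards)
print(all(totalCheck == toyCards$totalCards))
```

If some card counts were missing, rowSums(..., na.rm = TRUE) would skip them, while the + version would return NA for that player.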
Two more examples. This one just averages the two skin color ratings:
soccerCards$ratingAvg <- NA
soccerCards$ratingAvg <- (soccerCards$rater1 + soccerCards$rater2)/2
And this one uses logical conditions to turn numbers into categories, which is helpful for a t-test:
soccerCards$skin <- NA
soccerCards$skin[soccerCards$ratingAvg == 0 ] <- "veryWhite"
soccerCards$skin[soccerCards$ratingAvg > 0 ] <- "darkerSkin"
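The same idea can be written with the dividing line stored in a variable, which makes it easy to move later. This is only a sketch: the ratings are invented, and the cutoff of 0.25 and the labels "lighter" and "darker" are hypothetical choices, not part of the dataset.

```r
# Invented averaged ratings, with one missing value to show how NA is handled.
ratingAvg <- c(0, 0.125, 0.25, 0.5, 1, NA)

# Hypothetical cutoff: ratings at or below it count as "lighter".
cutoff <- 0.25

# ifelse() checks the condition element by element and leaves NA as NA.
skin <- ifelse(ratingAvg <= cutoff, "lighter", "darker")

# table() counts how many players land in each category (NA is dropped by default).
print(table(skin))
```

Changing cutoff and rerunning is all it takes to redraw the categories.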
Then you can use the new column in an inference test:
white <- soccerCards$totalCards[soccerCards$skin=="veryWhite"]
darker <- soccerCards$totalCards[soccerCards$skin=="darkerSkin"]
t.test(white,darker)
The only problem with this? You can set the barrier arbitrarily, wherever you want! That might let us tilt the test in whatever direction we want. THUS:
Your task is to create some new columns in the dataset and experiment until you can prove both of these things with the same dataset:
Officials are biased against lighter-skinned players
Officials are biased against darker-skinned players
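One way to keep track of your experiments is to run the same t-test for several barriers at once and compare the results. The sketch below uses simulated data in which card totals are unrelated to the rating, so any differences it shows come from chance and from where the barrier is placed. The variable names, the list of cutoffs, and the simulated values are all assumptions for illustration; swap in the real columns for the exercise.

```r
set.seed(42)
# Simulated stand-in: 200 players, ratings in steps of 0.125,
# card totals drawn independently of the rating.
ratingAvg  <- sample(seq(0, 1, by = 0.125), 200, replace = TRUE)
totalCards <- rpois(200, lambda = 3)

# Candidate barriers: players at or below the cutoff form the "lighter" group.
cutoffs <- c(0, 0.125, 0.25, 0.5, 0.75)

# For each cutoff, record the difference in group means and the t-test p-value.
results <- t(sapply(cutoffs, function(cutoff) {
  lighter <- totalCards[ratingAvg <= cutoff]
  darker  <- totalCards[ratingAvg > cutoff]
  c(cutoff   = cutoff,
    meanDiff = mean(darker) - mean(lighter),
    pValue   = t.test(lighter, darker)$p.value)
}))

print(round(results, 3))
```

A positive meanDiff means the darker group averaged more cards at that cutoff, and a negative one means the lighter group did. Reporting only the cutoff with the smallest p-value is exactly the kind of fiddling to be careful about.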
T-test examples are above, and an example of a chi-squared test is below. It uses a column we haven't made yet, which could be an example for you: splitting the players into "big offenders" and "not big offenders".
chisq.test(table(soccerCards$skin, soccerCards$bigOffender))
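For reference, here is a rough sketch of how a column like bigOffender might be built and tested. Everything in it is an assumption for illustration: the data are simulated, the data frame is called toy so it won't overwrite anything, and the threshold of 5 total cards is a hypothetical choice that you should pick (and question) for yourself.

```r
set.seed(7)
# Simulated stand-in data; the real exercise would use the soccerCards data frame.
toy <- data.frame(
  skin       = sample(c("veryWhite", "darkerSkin"), 300, replace = TRUE),
  totalCards = rpois(300, lambda = 3)
)

# Hypothetical threshold: 5 or more total cards counts as a big offender.
toy$bigOffender <- ifelse(toy$totalCards >= 5, "bigOffender", "notBigOffender")

# Cross-tabulate the two categorical columns, then test for independence.
counts <- table(toy$skin, toy$bigOffender)
print(counts)

result <- chisq.test(counts)
print(result$p.value)
```

Printing the table before testing is a quick check that every cell has a reasonable number of players in it.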