Chi Squared Testing.
----
Sometimes we have data that doesn't work the way we would like it to ... it's counts in a table:
               Male   Female
Jumped           40       70
Did Not Jump     50       90
... so what questions do we ask with this data?
we can't just compare the NUMBER of people in each cell ... it is clear we sampled more females (160) than males (90) for this experiment.
we are going to need some other things:
Residual Tables
Expected Results
making tables in r: http://www.cyclismo.org/tutorial/R/tables.html <---anything and everything you need to make and use tables in r.
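# counts go in row by row: Jumped (Male, Female), then Did Not Jump (Male, Female)
scared = matrix(c(40,70,50,90),ncol=2,byrow=TRUE)
colnames(scared) = c("Male","Female")
rownames(scared) = c("Jumped","Did Not Jump")
scared.t = as.table(scared)   # turn the matrix into a table object
scared.t                      # print it to check the labels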
FACT 1: margin tables give us the totals.
margin.table(scared)     # total of the whole table
margin.table(scared,1)   # row totals
margin.table(scared,2)   # column totals
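to see what each of those margin.table calls returns, here is a small self-contained sketch that rebuilds the same table. The names grand.total, row.totals and col.totals are only illustrative, they are not part of the original notes.

```r
# Rebuild the example table so this snippet runs on its own.
scared = matrix(c(40,70,50,90), ncol=2, byrow=TRUE)
colnames(scared) = c("Male","Female")
rownames(scared) = c("Jumped","Did Not Jump")
scared.t = as.table(scared)

grand.total = margin.table(scared.t)      # sum of every cell
row.totals  = margin.table(scared.t, 1)   # one total per row
col.totals  = margin.table(scared.t, 2)   # one total per column

grand.total   # 250
row.totals    # Jumped 110, Did Not Jump 140
col.totals    # Male 90, Female 160
```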
FACT 2.
the expected value of a cell is as follows:
 row.total * column.total
 ------------------------ = expected value for that cell.
       table.total
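one way to apply that formula to every cell at once is outer(), which multiplies each row total by each column total. This is a sketch on the jump data, assuming the same matrix as above; the name expected is just a convenient label.

```r
# Rebuild the example table so this snippet runs on its own.
scared = matrix(c(40,70,50,90), ncol=2, byrow=TRUE)
colnames(scared) = c("Male","Female")
rownames(scared) = c("Jumped","Did Not Jump")
scared.t = as.table(scared)

# row.total * column.total / table.total, for all four cells
expected = outer(margin.table(scared.t, 1),
                 margin.table(scared.t, 2)) / margin.table(scared.t)
expected
# Jumped:       110*90/250 = 39.6   110*160/250 = 70.4
# Did Not Jump: 140*90/250 = 50.4   140*160/250 = 89.6
```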
FACT 3.
we compare the expected value of a cell
to the actual value of a cell.
if the differences are big enough,
the difference is significant.
...we use chi squared test to check for that.
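FACT 4. chi-squared of a single cell is...
 (actual - expected)^2
 --------------------- = chi-squared value for that cell.
       expected
the chi-squared statistic for the whole table is the sum of these values over every cell.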
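here is a sketch of that cell-by-cell calculation on the jump data. It assumes the expected counts are built with outer() as one possible approach; cell.chisq is an illustrative name.

```r
# Rebuild the example table so this snippet runs on its own.
scared = matrix(c(40,70,50,90), ncol=2, byrow=TRUE)
colnames(scared) = c("Male","Female")
rownames(scared) = c("Jumped","Did Not Jump")
scared.t = as.table(scared)

# expected counts: row total * column total / table total
expected = outer(margin.table(scared.t, 1),
                 margin.table(scared.t, 2)) / margin.table(scared.t)

# (actual - expected)^2 / expected, one value per cell
cell.chisq = (scared.t - expected)^2 / expected
cell.chisq
sum(cell.chisq)   # about 0.011: every count is only 0.4 away from its expected value
```

a total this close to zero says the observed table looks almost exactly like what we would expect if jumping had nothing to do with sex.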
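FACT 5.
R runs the whole test for us: summary(scared.t) <-- obv. change "scared.t" for new data.
it has to be the table made with as.table(), not the plain matrix ... summary() of a plain matrix just summarizes each column.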
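a quick sketch of what summary() hands back for a table object, plus a cross-check with chisq.test(). Note that chisq.test() applies a continuity correction to 2x2 tables by default, so correct=FALSE is used here to match summary(). The names s and ct are illustrative.

```r
# Rebuild the example table so this snippet runs on its own.
scared = matrix(c(40,70,50,90), ncol=2, byrow=TRUE)
colnames(scared) = c("Male","Female")
rownames(scared) = c("Jumped","Did Not Jump")
scared.t = as.table(scared)

s = summary(scared.t)   # chi-squared test of independence
s$statistic             # the chi-squared value
s$parameter             # degrees of freedom: (2-1)*(2-1) = 1
s$p.value               # large p-value: no evidence of a difference here

# same test through chisq.test(), without the 2x2 continuity correction
ct = chisq.test(scared.t, correct=FALSE)
ct$statistic
```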
Let's try a few together:
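We are curious about Facebook use at Penn State. Here are some numbers collected from two groups of students: those at University Park and those at the Commonwealth campuses. Note that the rows are separate groups: a person who counts for once a day is not included in the once a week data. In addition, we were only interested in people who used Facebook, not non-users.
                                University Park   Commonwealth
Several Times a Month or Less                55             76
At Least Once a Week                        215            157
At Least Once a Day                         640            394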
1) What percent of University Park users say they use facebook at least once a week?
2) What percent of people who use facebook once a day are on the Commonwealth campus?
3) Find the expected values for each of these cells.
4) Find the chi-squared values for each of these cells. Which one is the largest? What does that mean? Which one is the smallest? What does that mean?
5) Find the sum of the chi-squared values.
6) Is there a difference between facebook users on one campus as opposed to the other? Explain how you know.
---
ANSWERS:
facebook=matrix(c(55,76,215,157,640,394),ncol=2,byrow=TRUE)
colnames(facebook)=c("University Park","Commonwealth")
rownames(facebook)=c("Several Times a Month or Less","At Least Once a Week","At Least Once a Day")
facebook.t=as.table(facebook)
1) What percent of University Park users say they use facebook at least once a week?
215/910
[1] 0.2362637 <-- 23.6 %
2) What percent of people who use facebook once a day are on the Commonwealth campus?
394/1034
0.381 <-- 38.1 %
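the two percentages can also be read off with prop.table() instead of dividing by hand. This is a sketch assuming the same facebook table as in the answers; col.props and row.props are illustrative names.

```r
# Rebuild the Facebook table so this snippet runs on its own.
facebook = matrix(c(55,76,215,157,640,394), ncol=2, byrow=TRUE)
colnames(facebook) = c("University Park","Commonwealth")
rownames(facebook) = c("Several Times a Month or Less",
                       "At Least Once a Week",
                       "At Least Once a Day")
facebook.t = as.table(facebook)

col.props = prop.table(facebook.t, 2)   # each column sums to 1
row.props = prop.table(facebook.t, 1)   # each row sums to 1

# question 1: share of University Park users in the once-a-week row (215/910)
col.props["At Least Once a Week", "University Park"]
# question 2: share of once-a-day users who are at Commonwealth (394/1034)
row.props["At Least Once a Day", "Commonwealth"]
```

margin 2 answers "what percent of this campus", margin 1 answers "what percent of this usage group".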
3) Find the expected values for each of these cells.
expected = as.array(margin.table(facebook,1)) %*% t(as.array(margin.table(facebook,2))) / margin.table(facebook)
expected
                              University Park Commonwealth
Several Times a Month or Less        77.56018     53.43982
At Least Once a Week                220.24723    151.75277
At Least Once a Day                 612.19258    421.80742
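as a cross-check, chisq.test() stores the expected counts it computed, so they can be compared against the hand calculation. A sketch, with ct as an illustrative name:

```r
# Rebuild the Facebook table so this snippet runs on its own.
facebook = matrix(c(55,76,215,157,640,394), ncol=2, byrow=TRUE)
colnames(facebook) = c("University Park","Commonwealth")
rownames(facebook) = c("Several Times a Month or Less",
                       "At Least Once a Week",
                       "At Least Once a Day")
facebook.t = as.table(facebook)

ct = chisq.test(facebook.t)
ct$expected   # row total * column total / table total, for every cell

# spot check one cell by hand: 131 * 910 / 1537
131 * 910 / 1537
```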
4) Find the chi-squared values for each of these cells. Which one is the largest? What does that mean? Which one is the smallest? What does that mean?
(facebook.t - expected)^2/expected
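                              University Park Commonwealth
Several Times a Month or Less       6.5621535    9.5240186
At Least Once a Week                0.1250117    0.1814364
At Least Once a Day                 1.2630869    1.8331883
the largest is Commonwealth / "Several Times a Month or Less" (9.52): that cell is the farthest from what we would expect if campus made no difference.
the smallest is University Park / "At Least Once a Week" (0.125): the actual count there is almost exactly the expected count.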
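the cell chi-squared values are always positive, so they do not say whether a cell came in above or below its expected count. The residuals do: chisq.test() returns (actual - expected)/sqrt(expected) for each cell, and squaring them gives back the cell chi-squared values. A sketch, with ct and res as illustrative names:

```r
# Rebuild the Facebook table so this snippet runs on its own.
facebook = matrix(c(55,76,215,157,640,394), ncol=2, byrow=TRUE)
colnames(facebook) = c("University Park","Commonwealth")
rownames(facebook) = c("Several Times a Month or Less",
                       "At Least Once a Week",
                       "At Least Once a Day")
facebook.t = as.table(facebook)

ct  = chisq.test(facebook.t)
res = ct$residuals   # Pearson residuals: (actual - expected) / sqrt(expected)
res                  # negative = fewer people than expected, positive = more

res^2                # same numbers as (actual - expected)^2 / expected
```

in the "Several Times a Month or Less" row the University Park residual is negative and the Commonwealth residual is positive, so light users are under-represented at University Park and over-represented at Commonwealth.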
5) Find the sum of the chi-squared values.
sum((facebook.t - expected)^2/expected)
19.4889
6) Is there a difference between facebook users on one campus as opposed to the other? Explain how you know.
summary(facebook.t)
Number of cases in table: 1537
Number of factors: 2
Test for independence of all factors:
Chisq = 19.489, df = 2, p-value = 5.862e-05
our p-value is very small (5.862e-05, far below 0.05) ... I can reject the null hypothesis that the usage of Facebook is the same on both campuses. The chi-squared values for the cells suggest that the discrepancy occurs mostly in the number of people who use it "Several Times a Month or Less": those two cells contribute about 16.1 of the 19.5 total.
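the p-value in that output can be reproduced from the chi-squared sum and the degrees of freedom with pchisq(). A sketch using the 19.4889 found in question 5; chi.sq, df and p are illustrative names.

```r
chi.sq = 19.4889            # sum of the cell chi-squared values
df = (3 - 1) * (2 - 1)      # (rows - 1) * (columns - 1) = 2

# area in the upper tail of the chi-squared distribution beyond chi.sq
p = pchisq(chi.sq, df = df, lower.tail = FALSE)
p                           # about 5.86e-05, matching summary(facebook.t)
```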
... that should do it.