Questions one through three are due to me by Friday.
Questions four and five are due by Monday.
As usual, be sure to include your lines of code. For example, when making a table, I would like all of the steps, from:
grads <- matrix(c(3738, 4704, 1494, 2827), ncol=2, byrow=TRUE)  # counts entered row by row
colnames(grads) <- c("Accepted", "Denied")
rownames(grads) <- c("Male", "Female")
grad.table <- as.table(grads)
grad.table
through to the resulting table.
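For reference, that first example prints:
       Accepted Denied
Male       3738   4704
Female     1494   2827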
Another way to make the same table:
grads <- matrix(nrow=2, ncol=2)  # an empty 2x2 matrix to fill in
data.entry(grads)  # a spreadsheet editor opens; input your numbers and column names here
rownames(grads) <- c("Male", "Female")
grad.table <- as.table(grads)
grad.table
If you have a lot of data, sometimes the second way is easier than the first.
And remember, if you need help figuring out how to do something in R, try: http://www.cyclismo.org/tutorial/R/tables.html
As a brief aside, you can make an expected-values table like so (this example uses the smoke data):
# expected count = (row total) * (column total) / grand total
expected <- as.array(margin.table(smoke, 1)) %*% t(as.array(margin.table(smoke, 2))) / margin.table(smoke)
expected
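If you prefer, R's built-in chi-squared test computes these same expected counts for you (again assuming smoke is already stored as a two-way table):
results <- chisq.test(smoke)
results$expected  # expected counts under independence; should match the table above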
Question One:
We are curious about Facebook use on the Penn State campus. Here are some numbers collected about two different places to live on campus. Note that these are different groups: a person who counts for once a day is not included in the once-a-week data. In addition, we were only interested in people who use Facebook, not non-users:
1) Create a joint distribution table for this information
2) What percent of University Park users say they use Facebook at least once a week?
3) What percent of people who use Facebook once a day are on the Commonwealth campus?
4) Find the expected values for each of these cells.
5) Can we say that the campus has an influence on how often people use Facebook? Set up a null and alternative hypothesis and a significance level, and then run the appropriate test.
Note: when comparing here, the null and alternative hypotheses are about whether there is an association between one column and the others; they are statements about those expected values. In this case, Ho is that campus and frequency of Facebook use are independent, and Ha is that they are not independent. As such, your Ho and Ha will end up structured a little differently than those before. The significance level works the same way, however. (See the sketch at the end of this question.)
6) Create a barplot with all of the information in a useful manner. Make it colorful.
barplot(facebook, legend=T, main="Facebook...", beside=T)
That might be a start. Or just try plot(facebook).
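A minimal sketch of part 5, assuming the joint table from part 1 is stored under the placeholder name facebook:
prop.table(facebook)  # the joint distribution from part 1
chisq.test(facebook)  # compare the reported p-value against your significance level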
Question Two:
There is a question as to whether smoking and level of school completed are related. We took an SRS of males from France and looked at the data below:
1) Create a two-way table of your data in R.
2) Using the marginal distribution tables, outline the useful information (percentage-wise) about these numbers.
3) Get a chi-squared value for each of the different cells. (See the sketch after this question.)
Which cell contributes the most to the chi-squared total?
Which cell contributes the least?
4) Test whether or not there appears to be an association between smoking habits and school completed.
5) Create some type of plot showing this information accurately and in an interesting way. Consider the difference between these three, for example:
barplot(prop.table(smoke, 2))
barplot(smoke, beside=T, legend=T)
barplot(prop.table(smoke, 2), beside=T, legend=T, ylim=c(0, .8))
6) Why did you choose the interpretation of the data that you did? What goal did you have in mind?
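A minimal sketch for parts 2 and 3, assuming smoke is your two-way table and expected is the expected-values table computed in the preamble above:
margin.table(smoke, 1)              # row totals
prop.table(margin.table(smoke, 1))  # row proportions (part 2)
contrib <- (smoke - expected)^2 / expected  # per-cell chi-squared contributions (part 3)
contrib
sum(contrib)  # the chi-squared total
which(contrib == max(contrib), arr.ind=TRUE)  # largest contributor
which(contrib == min(contrib), arr.ind=TRUE)  # smallest contributor
For a built-in check, chisq.test(smoke)$residuals^2 gives the same per-cell contributions.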
Question Three:
A study of people who refused to answer survey questions is shown below:
1) Give a brief analysis of the population of people here using the marginal tables and marginal proportion tables.
2) Are there any possible problems with the way we've broken up this data by age?
3) At alpha=0.01, can we say that age affects the likelihood of responding to a survey? (See the sketch after this question.)
4) Make a graph that focuses on showing the differences in response rates at the different ages.
5) Explain why you think your graph was the best choice to show the differences in response rates. What other options did you consider? Why did you reject them?
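A minimal sketch for parts 3 and 4, assuming the table is stored under the placeholder name survey, with response status in rows and age groups in columns:
chisq.test(survey)  # reject Ho if the p-value is below alpha=0.01
barplot(prop.table(survey, 2), beside=T, legend=T, main="Response rates by age group")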
Question Four:
Congratulations. With the exception of a final data set, you are at the end of my designed data sets. I would really appreciate some feedback on the data sets so that I can rework them and make them the best (and most useful) that they can be.
a) Looking back at the data sets, which one was the easiest? Why do you think that was?
b) Looking back at the data sets, which one was the hardest? Why do you think that was?
c) I'm going to assume the bookstore one was the most tedious. If not, please let me know and explain.
d) Do you feel comfortable using R at this point? What pieces of R do you feel are the most useful?
e) What could be done to make the labs better/more useful?
Question Five:
Calendar of events that you need to help fill in. Below are the final class sessions that we will have together. Thursdays are going to be deemed independent work time, which means that I will be in my office/in the computer lab, and you will be working on your data and projects while I'm there so you can ask questions.
Final presentations will be May 21 and 24.
What I need from you is a plan for what you are going to have completed for your project for each of the following Thursdays. This way I make sure that we are busy working on those days and that we don't end up all crunched at the end.
1. For each of the Thursdays, write a brief summary of what you plan to have done by that time. This can happen throughout the week, using Thursday as a kind of check-in point.
2. For each of the Thursdays, write what you expect to be doing during that two hour span (working with me, working on r code, writing up your data, etc.).
3. If planning a poster, make a sketch of what it will look like. Make a plan for each of the graphics you plan to use as well.
4. Let me know if you prefer to present on Monday, May 21 or Thursday, May 24. First come, first served on this. Six or seven presentations each day, followed by the remaining time for questions on your final data set.
Thursday, May 10: ON YOUR OWN
Thursday, May 17: ON YOUR OWN
Monday, May 21: FINAL PRESENTATIONS
Thursday, May 24: FINAL PRESENTATIONS
I will have the final data set ready for you on May 17. It will be due May 31.