Project overview
The objective is to give you applied research experience using real data and statistical methods.
This is a group project (each group should have at most 3 people; you may work alone if you prefer, but that means more work for you).
Pick a dataset or research topic of interest to you, and you will:
describe the data (should have at least 4 variables, but shouldn't be too complicated)
state your research questions (at least 2 questions/hypotheses)
conduct analysis (using methods that are covered in our course, but you can go beyond if you like)
interpret the analysis results and summarize your findings
Main deliverables include:
project proposal (20% grade); describe the background and the data, state your research questions, and outline an analysis plan
presentation (30% grade); a 10-minute presentation that briefly summarizes the work you've done (motivation, data, analysis and findings)
project report (50% grade); a short write-up (3-8 pages, not including references and appendices) with a proper introduction, data exploration, methods & analysis, and conclusions
Where can I find data?
R offers quite a few built-in datasets. Here are several example datasets included in the package MASS:
birthwt - Risk Factors Associated with Low Infant Birth Weight
Boston - Housing Values in Suburbs of Boston
crabs - Morphological Measurements on Leptograpsus Crabs
nlschools - Eighth-Grade Pupils in the Netherlands
Pima.tr - Diabetes in Pima Indian Women
To load these datasets, first load the package MASS. For example, if you want to load the "birthwt" dataset, you can run the following commands in R:
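library(MASS)     # attach the MASS package, which ships the datasets listed above
data("birthwt")   # load the data frame `birthwt` into your workspace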
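Once a dataset is loaded, it usually helps to take a quick look at it before settling on research questions. The following is only a sketch of one way to do that, using `birthwt` as the example; the column name `bwt` (birth weight in grams) comes from the MASS documentation for this dataset, and other datasets will have different variables.

```r
library(MASS)
data("birthwt")

dim(birthwt)           # number of observations (rows) and variables (columns)
str(birthwt)           # name and type of every variable
head(birthwt)          # first six rows
summary(birthwt$bwt)   # minimum, quartiles, mean, and maximum of birth weight

# For a description of each variable, open the help page with: ?birthwt
```

A check like this also tells you quickly whether a dataset meets the "at least 4 variables" requirement.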
~~~~~~
Apart from these built-in R datasets, the following two websites are also great data sources (though it may take some time and effort to choose suitable data and pre-process it):
Kaggle (see the datasets section of their website) - this is a data science competition website that offers many interesting open-source datasets; make sure you choose something you can handle (some datasets are REALLY challenging).
The General Social Survey (GSS; see their website) - the GSS is a long-standing, high-impact social science/political science survey that has helped scholars and policy makers understand how Americans feel about political and social issues; in fact, a lot of textbook examples in Statistics come from this survey.
~~~~~~
You can collect your own data if you want to! Duke students get free access to Qualtrics, an online survey tool (see here for information).
Part I: project proposal (20% grade)
Due by 11:59pm, June 4.
Should include the following sections
Introduction/Motivation (4% grade):
describe the background of your research topic (e.g., if it's about effectiveness of COVID-19 vaccines, you should explain why vaccines are important for disease control and why we care about vaccine effectiveness)
discuss the source of your data (or how you collected it)
Data description and exploration (8% grade + 2% bonus points for nice exploratory analysis)
describe the variables in the data (if there are more than 15 variables, you may describe only the ones that are relevant to your research topic)
data exploration (use summary statistics and/or plots to conduct some exploratory analysis)
address potential issues with the data (e.g., is there sampling bias? is the sample representative of the population you want to make inference about?)
Research questions and analysis plan (8% grade)
state at least 2 research questions you want to investigate using the data
for each research question, talk about what methods you plan to use for analysis (you don't need to be very specific at this stage, something like "I plan to make inference about single proportions", or "I want to use two-sample t-tests" is enough)
this part is simply a plan; as plans may change, you are not required to stick to it afterwards, but it's nice to sketch out something that is somewhat concrete
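To make the exploration and analysis-plan items above more concrete, here is a rough sketch of what they might look like in R, using the `birthwt` data from MASS. The two research questions it implies (how common is low birth weight, and does mean birth weight differ by the mother's smoking status) are only illustrations, and `smoke_f` is a helper column name of our own choosing; your own data and questions will differ.

```r
library(MASS)
data("birthwt")

# Recode the 0/1 smoking indicator as a labelled factor (labels are our choice)
birthwt$smoke_f <- factor(birthwt$smoke, levels = c(0, 1),
                          labels = c("non-smoker", "smoker"))

# Exploration: group means, a two-way table, and a side-by-side boxplot
aggregate(bwt ~ smoke_f, data = birthwt, FUN = mean)
table(birthwt$smoke_f, birthwt$low)
boxplot(bwt ~ smoke_f, data = birthwt,
        xlab = "Mother's smoking status", ylab = "Birth weight (g)")

# Planned method 1: inference about a single proportion
# (95% confidence interval for the proportion of low-birth-weight infants)
n_low <- sum(birthwt$low == 1)
prop_result <- prop.test(n_low, nrow(birthwt), conf.level = 0.95)
prop_result$conf.int

# Planned method 2: two-sample t-test
# H0: mean birth weight is the same for smokers and non-smokers
# HA: the two means differ
t_result <- t.test(bwt ~ smoke_f, data = birthwt)
t_result$statistic   # test statistic
t_result$p.value     # p-value
```

Note that `t.test()` uses the Welch (unequal variance) version by default. In the proposal itself, naming the method is enough; the actual output belongs in the presentation and report.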
Part II: presentation (30% grade)
In-class presentations on June 17 and 18.
You should prepare a 10-minute presentation (beware of the time limit!) that includes
Background and research questions
discuss the background and motivation of your project (what is the general topic, why it is interesting, etc.)
state your research questions
Data description
describe your data (source, variables, potential issues)
include some exploratory analysis results if they are directly relevant to your research questions
Methods and data analysis
talk about what statistical methods you used to answer your research questions
for each research question, show the results of your analysis
Findings and conclusions
summarize the key findings in your analysis
provide main conclusions (you should give us some "major takeaways" at the end of your presentation so that we know what to take away from it)
Grading rubric:
(3%) within the 10-minute time limit
(3%) clearly mention the motivation
(3%) state research questions
(6%) describe the data (should at least include data source & variables)
(8%) explain statistical methods and show results
(4%) summarize findings and provide major takeaways
(3%) structure and organization
Part III: project report (50% grade)
Should include the following sections
Introduction/Motivation (10% grade):
describe the background of your research topic
state your research questions (something like "in this analysis/research project, we investigate ...")
mention what your dataset is (in a general sense, including source or collection process of your data, and what variables/information your data contain)
give an overview of what statistical methods you are using in the analysis
Data exploration (10% grade)
describe the collection process or source of the data, and explain the variables (esp. the variables relevant to your analysis)
do some data exploration (use summary statistics and/or plots to conduct exploratory analysis), which should be relevant to your research questions
address potential data issues if there are any
Analysis (20% grade + 4% bonus for outstanding method choices/interpretation)
provide details of your analysis on the research questions
e.g., if you carry out hypothesis testing, you should write down the null and alternative hypotheses, the test you are using, the test statistic, and the p-value (or some other criterion that you are using to make a decision)
e.g., if you are fitting a (multiple) linear regression, you should first say that you are doing so, then write down the linear regression equation, and also check model conditions (or do model selection if you are using multiple regression and there are many potential predictors)
interpret the analysis results in the context of the data application (e.g., something like "it seems that smoking is strongly associated with longevity, and on average, someone who smokes lives 3 years less than someone who doesn't")
Conclusion (5% grade + 3% bonus for discussion on extension or future work)
summarize your findings in words and in context of the data application
(optional) discuss what else could have been done, or what further research (based on your analysis) can be conducted
(NOTE: 5% grade will be allocated to writing and formatting.)
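As an illustration of the regression items in the Analysis section above, here is a minimal sketch in R using the `birthwt` data from MASS. The choice of predictors (`lwt`, `smoke`, `age`) is only an assumption made for the example, and backward elimination by AIC via `step()` is just one of several reasonable model selection approaches.

```r
library(MASS)
data("birthwt")

# Model equation: bwt = b0 + b1 * lwt + b2 * smoke + b3 * age + error
fit <- lm(bwt ~ lwt + smoke + age, data = birthwt)
summary(fit)   # coefficient estimates, standard errors, t-tests, R-squared

# Condition checks: residuals vs. fitted values (linearity, constant variance)
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Condition checks: normal Q-Q plot of the residuals (approximate normality)
qqnorm(resid(fit))
qqline(resid(fit))

# Model selection: backward elimination by AIC, starting from the full model
reduced <- step(fit, direction = "backward", trace = 0)
formula(reduced)   # the predictors that were kept
```

In the report, the code itself matters less than the write-up around it: state the fitted equation, comment on what the diagnostic plots show, and interpret the coefficients in context.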