Course plan

Week 1: Build a language, define objectives

  • Brief overview of course language, goals, and plan.

  • What brings us all to this research?

Week 2: Find problem partners & define a problem with data

  • Objective 1: Find problem partners. Is there a topic you are interested in and researchers studying this topic? Reach out to them to see if you can collaborate.

  • Objective 2: Data. Are there data for the topic you are interested in? What do those data look like?

  • Objective 3: Define a problem. What is the question you would like to answer? Can that question be answered with the data that you are working with?

Week 3: Build code infrastructure to visualize raw data; define analytic problem

Objective 1: Create a GitHub account and a Google Colaboratory (Colab) account.

Objective 2: Read your data into a Colab IPython notebook. Visualize those data with histograms, scatter plots, and bar plots. Quantify missing data.

Objective 3: Begin an Overleaf document about your project. Write two paragraphs describing your data. Write two paragraphs describing the analytic problem you will address in your research.
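
As one possible starting point for Objective 2, the sketch below counts the fraction of missing values in each column. It uses only the Python standard library so that it runs anywhere; the toy rows, the column names, and the set of strings treated as missing are all assumptions for illustration, not part of any real data set. In a Colab notebook you would more likely load your file with pandas and plot with matplotlib.

```python
import csv
import io

# Hypothetical toy data standing in for your own CSV file.
RAW = """age,race,state
34,Black,CA
,White,TX
51,NA,NY
29,Black,
"""

# Strings treated as missing (an assumption; adjust to how your data encode gaps).
MISSING = {"", "NA", "NaN", "null"}

def missing_fraction(csv_text):
    """Return {column: fraction of rows whose value is missing}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys())
    return {
        col: sum(row[col].strip() in MISSING for row in rows) / len(rows)
        for col in columns
    }

for col, frac in missing_fraction(RAW).items():
    print(f"{col}: {frac:.0%} missing")
```

Check how your own data encode missing values before trusting these counts; sentinel values such as -1 or 999 will not be caught by a string match like this one.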

Week 4: Normalize, clean, and impute missing data. Build a background model.

Objective 1: Quantify missing data in your data set, and decide how you will impute or otherwise handle those missing values.

Objective 2: Identify sample outliers in your data set. Are they possible but hard to account for? Look for out-of-range values (e.g., ages that are < 0 or > 100). Decide how to handle those outliers. If you remove them, think through the possible impact on your analysis.

Objective 3: Consider the impact of normalizing (either z-scoring or projecting to the quantiles of a standard normal) your quantitative data. Do some of your downstream techniques require Gaussian data (for example)? Do you want to remove differences by race in the means?

Objective 4: Intersect your data with external data. For example, what is the density of the Black population in neighborhoods across the US? These external data sources may clarify how unusual your data are (e.g., are Black people getting killed by police in mostly Black neighborhoods?), or add additional layers of analysis (e.g., do police shootings only occur in mostly Black neighborhoods?).
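
The outlier check in Objective 2 and the two normalizations named in Objective 3 can be sketched as follows. This is a minimal illustration using only the Python standard library; the ages, the 0 to 100 range, and the rank / (n + 1) convention for mapping ranks to quantiles are assumptions chosen for the example, and ties are broken by position, which you would want to state in your Methods.

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical ages; None marks a missing value, -3 and 140 are out of range.
ages = [34, 29, None, 51, -3, 42, 140, 38]

def in_range(x, low=0, high=100):
    """True if x is present and inside the plausible range."""
    return x is not None and low <= x <= high

# Keep the flagged values so you can inspect them before deciding what to do.
flagged = [x for x in ages if x is not None and not in_range(x)]
clean = [x for x in ages if in_range(x)]

def z_score(values):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def quantile_normalize(values):
    """Map each value to the standard-normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # stable: ties keep input order
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = NormalDist().inv_cdf(rank / (n + 1))
    return out

print("flagged:", flagged)
print("z-scores:", [round(z, 2) for z in z_score(clean)])
print("normal quantiles:", [round(q, 2) for q in quantile_normalize(clean)])
```

Z-scoring preserves the shape of the distribution, while projecting to normal quantiles keeps only the rank order, so the two choices can lead to different downstream results.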

Week 5: Apply ML and statistical methods to the data.

Objective 1: Cluster your samples using simple clustering methods (K-means, Gaussian mixture models) and "paint" a visualization of the clusters with held-out metadata (e.g., age or race). What do you see? What does each cluster represent?

Objective 2: Fit linear regression models between sets of predictors and quantitative variables. Plot these regression fits and look at the p-values of the coefficients. What do you find? What variables appear associated (conditionally), and what variables appear not associated?

Objective 3: Perform dimension reduction on the data (PCA, factor analysis, sparse factor analysis, t-SNE, UMAP). Plot the results, again "painting" the sample points with held-out metadata as in Objective 1. What do you see? What patterns emerge in the factors?

Objective 4: Write these results up in your LaTeX document. Focus on visualizing and plotting results and on interpreting the parameters, more than on statistical testing. These results are exploratory and qualitative, so be careful not to include conclusions or statements about the domain; instead, focus on patterns that can be observed in the data visualizations these methods produce. Also write these approaches into your Methods section as you go, in a declarative way.
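
To make the "cluster, then paint" idea in Objective 1 concrete, here is a minimal sketch of K-means (Lloyd's algorithm) written in plain Python, followed by a cross-tabulation of cluster labels against a held-out label. The points, the group labels "A" and "B", and the function name `kmeans` are hypothetical; in practice you would likely use an existing library implementation and a scatter plot colored by the metadata.

```python
import random
from collections import Counter

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers at k of the data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical 2-D samples and a metadata column that is NOT used for clustering.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
held_out = ["A", "A", "A", "B", "B", "B"]

labels, centers = kmeans(points, k=2)

# "Paint" the clusters: count how each held-out group falls into each cluster.
crosstab = Counter(zip(labels, held_out))
for (cluster, group), count in sorted(crosstab.items()):
    print(f"cluster {cluster}, group {group}: {count}")
```

Cluster numbers are arbitrary, so interpret the table by which groups share a cluster rather than by the label values themselves.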

Week 6: Encode metrics to evaluate the results; visualize results.

Objective 1: Use linear regression and logistic regression models to quantify associations between variables. Plot the informative results.

Objective 2: Use statistical tests (e.g., Student's t-test to determine whether the means of two groups differ; Fisher's exact test to determine whether there is enrichment in a 2x2 table of two binary features) to ask direct questions about your data. Write down the question you are asking, and make the null hypothesis clear. Create plots of the results.

Objective 3: Write up these results in your LaTeX document. Results should be written up with this type of pattern: "Next, we tested whether <the average age of a Black police shooting victim was the same as the average age of a White police shooting victim>. To do this, we <removed all non-White and non-Black victims and performed a t-test on non-missing victim ages, correcting the p-value using Benjamini-Hochberg>. We found that <the average age of Black victims (32.6) was substantially lower than that of White victims (39.9; p < 2.2 x 10^-16; Figure 3A)>. This difference becomes more extreme when we restrict this test to <male victims only, where the average ages of Black male victims (32.5) and White male victims (40.0) have a larger gap (p < 2.2 x 10^-15; Figure 3B)>. This suggests that <police may selectively focus on younger Black men as potential criminals more than their White or female counterparts.>" Also write these into your Methods section as you go in a declarative way.
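
Two of the procedures named above, Fisher's exact test and the Benjamini-Hochberg correction, are simple enough to sketch in plain Python. This is an illustration under stated assumptions: the 2x2 counts and the list of p-values are made up, the function names are hypothetical, and the two-sided p-value is defined here as the total probability of all tables no more likely than the observed one. In a notebook you would typically call `scipy.stats.fisher_exact` and `scipy.stats.ttest_ind` instead of writing these by hand.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables at least as unlikely as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical counts for two binary features, and hypothetical raw p-values.
p_fisher = fisher_exact_two_sided(8, 2, 1, 5)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])

print(f"Fisher p = {p_fisher:.4f}")
print("BH-adjusted:", [round(p, 3) for p in adjusted])
```

Whichever implementation you use, report in your Methods which test was run, whether it was one- or two-sided, and how many tests entered the multiple testing correction.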

Week 7: Related work: Consider other methods.

Objective 1: This week, we take a step back to look at our work in the context of related research. Do some legwork yourself first by reading up on existing analyses and papers on this topic and these data, then reach out to your project partner. Search Google Scholar and other academic resources to identify other work on these topics. What questions did they ask? What were their conclusions? Do those conclusions broadly agree with yours, or disagree? What assumptions did their analysis include that differ from the assumptions in yours? Write all of this up in your LaTeX document in a Related Research section.

Objective 2: If other work was done on the same dataset that you are using, try to recapitulate their results with your data cleaning and normalization workflow. If results are the same, what are the three concrete additions of your research to those results? If results are different, what part of the assumptions and workflow led to these differences? Write the second-to-last paragraph of your Introduction to read as follows: "Our analysis of these data builds upon existing work in the following three ways. First, XXX. Second, XXX. Third, XXX. Taken together, our work adds an important XXX dimension to previous work in this area."

Objective 3: Try to apply simple methods to your data and examine the differences using pre-specified metrics carefully on hold-out data. Create figures and tables to highlight these differences. Articulate using data, p-values, figures, and words what you have found, and return to your paragraph in Objective 2 to state crisply (without including results) what your analysis adds to current research.
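
A hold-out comparison of the kind described in Objective 3 might look like the sketch below. It fits a least-squares line on a training split and compares its mean squared error on held-out points against a baseline that always predicts the training mean. The data, the even/odd split, and the choice of mean squared error as the pre-specified metric are all assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def mse(predictions, targets):
    """Mean squared error, the metric fixed in advance for this comparison."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical data: y is roughly 2x + 1 with small deviations.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.8, 15.1, 17.2, 18.9]

# Fix the split before looking at results: even indices train, odd indices hold out.
train = [i for i in range(len(xs)) if i % 2 == 0]
hold = [i for i in range(len(xs)) if i % 2 == 1]

intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
baseline = sum(ys[i] for i in train) / len(train)  # always predicts the training mean

hold_y = [ys[i] for i in hold]
mse_line = mse([intercept + slope * xs[i] for i in hold], hold_y)
mse_base = mse([baseline] * len(hold), hold_y)

print(f"hold-out MSE, fitted line:   {mse_line:.3f}")
print(f"hold-out MSE, mean baseline: {mse_base:.3f}")
```

The key design choice is that both the split and the metric are chosen before any model is fit, so the comparison cannot be tuned after the fact.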

Week 8: Iterate the process: What can be done better?

  • Objective 1: Continue to write up your work. Read over what you have written, and read it over with a colleague or collaborator. Begin to articulate the main impacts of your work. What does this data analysis change with respect to general knowledge? What are possible policy implications?

  • Objective 2: If the results do not fully support those implications and impacts, where are the holes in the results? What additional tests, analyses, and visualizations can be performed to fully support the questions you hoped to be able to answer? Articulate those additional analyses, and then implement them to see if they support your hypothesis (or not).

  • Objective 3: As you do this, continue to pull the strings of your existing approaches to ensure they hold up. What if you impute your data differently, or z-score your data rather than quantile normalize them? What if you used different methods or statistical tests to perform hypothesis testing or multiple testing correction? If you get different results, it may be that these results are too brittle to include. Consider other ways of pulling strings.
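
One way to "pull a string" as described in Objective 3 is to rerun the same summary under several imputation choices and see whether the answer moves. The sketch below does this for a difference in group means; the ages, the group names, and the three strategies compared are hypothetical and chosen so that the sensitivity is easy to see.

```python
from statistics import mean, median

# Hypothetical ages by group; None marks a missing value.
group_a = [31, 28, None, 35, 90, None, None, None]
group_b = [40, 38, 41, None, 39, 42]

def impute(values, how):
    """Handle missing values by dropping them or filling with the mean or median."""
    observed = [v for v in values if v is not None]
    if how == "drop":
        return observed
    if how == "mean":
        fill = mean(observed)
    elif how == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {how}")
    return [fill if v is None else v for v in values]

def mean_difference(how):
    """Mean of group A minus mean of group B under one imputation strategy."""
    return mean(impute(group_a, how)) - mean(impute(group_b, how))

results = {how: mean_difference(how) for how in ("drop", "mean", "median")}
for how, diff in results.items():
    print(f"{how:>6}: difference in means = {diff:+.2f}")
```

In this toy example the sign of the difference changes under median imputation because one extreme value dominates the mean of group A; a result that flips like this is the kind of brittleness the objective warns about.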

Week 9: Articulate the assumptions of and caveats to your approach

  • Objective 1: The ideas involved in this week's task are related to but distinct from the Week 8 task. This week, your goal is to continue to write up your work and iterate the analyses; however, this week, you take the approach of critic and skeptic. Do not believe the results. Your first goal is to articulate all of the assumptions and caveats of your approach. Your most skeptical reader should feel as though their mind was being read when they get to this part of the manuscript (I'd put it in the Discussion).

  • Objective 2: What are the assumptions made for each of the methods, analyses, tests, and visualizations you used? Did you assume that the data were missing at random? Did you use PCA that assumed that the data were Gaussian? Did you articulate how ties were broken in Benjamini-Hochberg or quantile regression? Is the one-tailed test appropriate here rather than the two-tailed test? Is the null hypothesis reasonable? Is there an exclusion in the data that might lead to collider bias? It is fine to have some assumptions and caveats to your work, but they must be articulated.

  • Objective 3: Ask yourself whether the results hold up despite the caveats. If they do not, it is worthwhile to fix the unacceptable assumptions or caveats of the analyses and improve the results. The list of assumptions and caveats is almost a to-do list of what you would fix in your analyses given infinite time and resources -- treat it as such as you continue to iterate your analyses to improve the robustness of the results. On the positive side, also write a paragraph of "next steps" in the Discussion, articulating what more you could do to improve the analyses that is not within the scope of the manuscript. If something you articulate is within scope, then try it.

Week 10: Presenting your study to a broader audience

  • Objective 1: Generalize your work into a coherent story with concise implications. Look across your data, analyses and results. What thread runs through the results? Articulate this thread, iteratively, making sure it is an accurate representation of your results and is a realistic conclusion given the data you used. What are the next steps for this work (out of scope for this paper)? Write those down in the Discussion.

  • Objective 2: Structure the presentation of each piece of your research story. Lead the audience toward the same conclusion you came to, as a logical argument in which each piece is supported by statistical data, figures, and comparative results. In a slide deck, each slide should address one of those pieces of the argument, and each slide should include statistics and p-values, tables, and visualizations, with as few words as possible. In a Results section, all statements about the results should be accompanied by a p-value, a figure, or a table reference. Make the slide titles and section headers declarative to underline the results, e.g., "Evidence of increased levels of violent policing in majority Black neighborhoods."

  • Objective 3: Examine the implications of your research story. Does the conclusion possibly do more harm than good to a vulnerable population? If so, it should be modified and adapted. Are there specific policy implications that may be asserted from the analysis? For example, "Our results suggest that the majority of police violence against Black victims occurs in predominantly Black neighborhoods, suggesting that targeting policy changes to these neighborhoods rather than ubiquitously may be the most effective in reducing the number of Black victims of police shooting."

Week 11: Writing up your work in a manuscript

  • Objective 1: Given our write-as-you-go approach, now is the time to fill in all the gaps and details. Make sure the bibliography is complete, that your acknowledgments and conflicts are correct, that all figures and tables are referenced in the proper order, and that the language is clear. Anywhere you noted that there is a gap or a piece missing, add that now.

  • Objective 2: Read, re-read, and have others read and comment on your manuscript. Is the flow logical? Is the work entirely reproducible? Is the code (and the data!) publicly available? Are the conclusions well-supported by the evidence? Are the figures clear and do they tell a compelling story on their own?

  • Objective 3: Be aggressive about cutting text, as painful as that is. The objective of a manuscript is *never* to fill space, but instead to make a complete and concise argument, say everything once, and be finished so your reader is still engaged. Consider the use of an Appendix or Supplemental Information if the main story can be told after removing supporting but non-essential parts.

Week 12: Present your work and ask questions of others

  • Objective 1: Make your work public. When you are ready, make your GitHub repository (clean and well-annotated) public. If you are submitting your work to a journal or conference, also consider posting it on arXiv.

  • Objective 2: Tweet or email collaborators or colleagues who might be interested in the work, and genuinely ask for their feedback. Take criticism seriously, and adjust the work to address criticism so that the next reader will not have the same criticism. If some part of the analysis is incorrect -- this has happened to all of us at some point! -- apologize and re-work that section of the analysis until it is right.

  • Objective 3: Reviewer and editor comments are important to incorporate into the paper. The quality varies substantially, but in the end they are the gatekeepers between you and your peer-reviewed paper. Every comment should be addressed, and each fix should make its way back to the paper and analyses in some (possibly small, possibly large) way.