Moneyball

Learning objectives (and summaries)

Compare and rank individuals within one or more quantitative distributions.

- Develop a method to approach large spreadsheets of real data and use it to intelligently make decisions
- Use multiple regression in MS Excel to identify correlations between key player attributes
- Use ranking and z-scores of individuals to compare them within a distribution
- Develop ways to rank individuals using a rank and z-scores from multiple variables

Assessment

Formula, written justification, and graphs for simulation (8pts)
- Each team needs to submit one Excel spreadsheet with multiple regression data
- In the red column of the Google spreadsheet, create a formula to rank all players. Make sure the spreadsheet link is shared to "anyone with link". (1pt)
  - As a team, describe all parts of your formula and explain why your team thought this would help you find the best players. (2pts)
  - Once you have a good idea of who you want to draft and what stats are most important, choose at least 3 good players to compare. Use histograms with the same bins to compare the center, shape, and spread of the distribution of a statistic you care about. See instructions in the Ultimate section below. (3pts)
  - In the write-up, add a small section on what information your histograms helped you learn about a few of your top individuals. (2pts)
  - Each team should submit a link to their spreadsheet, their written description, and their Excel histograms via email, with subject "statshw", cc-ing all teammates.
Typed reflection on the Moneyball process, drafting the best possible team, and future applications (6pts)
- - INDIVIDUALLY reflect on your experience performing "Moneyball" on the game of Ultimate. Discuss how you used your intuition and understanding of the game and how you also used pure statistical analysis to best identify talent. Talk about how you approached spreadsheets with over 50,000 cells in a way that left you informed for the draft. You can draw on the days we played Ultimate, watched Moneyball, and worked as teams on the formula/graphs. For full credit, I expect at least two solid paragraphs (think somewhere around 300 words, and no fluffy intros or conclusions to take up space). There is plenty to write about, and if you're stuck, ask me for help in class.
  - Each person should submit their reflection via email with subject "statshw".
- Article discussion: http://fivethirtyeight.com/features/billion-dollar-billy-beane/

Ultimate Frisbee simulation

Your team's goal is to draft the best Ultimate Frisbee team possible from a set of statistics on fictional players. This will be a competitive draft -- on draft day, once a team selects a player, they will be off the market from all teams, so you will want to use the past performance data to your advantage and find the most valuable players that others may not notice. Standardized scores, summary statistics, spreadsheet formulas, and a touch of intuition will aide you with this task. Work in teams of 2-3.

Getting started:

Make your team a copy of this Google Spreadsheet. Name it with your (and your teammates' names)
Share your doc with the world (so anyone with the link can view)
Share your doc with your team by adding their email addresses
Create a formula that uses z-scores or percentiles to standardize values (see first two videos below)
Create 3 histograms to further compare a few individuals around a specific stat (see last two videos below)
See the details in the "assessment" section above for specific details on what you need to turn in

Applying percentiles and z-scores to a spreadsheet:

Combining multiple z-scores into a weighted "super-rank":

Installing the data analysis pack for Excel (if needed):

Using single-stat Excel files with individual game data to produce histograms that measure player consistency:

Practice

A standardized test’s results are normally distributed. The mean score is 520 and the standard deviation is 60.

1. Find the z-score for someone who scored 610.

2. Find the z-score for someone who scored 500.

3. Bob has a z-score of 3. What is his test score? [be careful...you might have to do a little algebra / rearranging here...]

4. Jimmy scored 1.5 standard deviations below the mean. What was his test score?

Mr. Warneke’s M&M dispenser’s output is normally distributed with a μ = 6.1 oz. and σ = 0.3 oz.

5. Find the z-score for a 5.8oz. output.

6. Mary’s handful had a z-score of -2. How many oz. of M&Ms did she get?

7. What is a z-score? Define it in a short sentence.

Use the following list of how many hits different softball players got in a game: 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5

8. Thanks to her restraint with the M&Ms, Mary had 4 hits. What percentile is she in? Which quartile is this?

9. Bob was one of the players with only 1 hit. What is his percentile rank on the team?

10. In the last question about Bob, the quartile is not very useful information. Why?

Apples to oranges: the average squatting weight of a football team is 320lbs, with a standard deviation of 60lbs. The average 40yd dash time of the team is 5.2 seconds with a standard deviation of 0.3 seconds.

11. Imagine you are on a search for someone who has both strong legs and speed (these two characteristics hold equal weight in your mind). Chris squats 280 lbs and has a 40yd dash of 5.05 seconds. Jimmy squats 380 lbs and has a 40yd dash time of 5.35 seconds. What is a good way to compare these two people?

Practice solutions

1. (610 - 520) / 60 = 90 / 60 = 1.5
2. (500 - 520) / 60 = -20 / 60 = -0.33
3. 3 * 60 + 520 = 700 points
4. 520 - 1.5*60 = 430 points
5. (5.8 - 6.1) / 0.3 = -0.3 / 0.3 = -1
6. 6.1 - 2 * 0.3 = 5.5 oz.
7. The number of standard deviations a value is from the mean
8. 12th/13 = .92 = 92nd percentile; 4th quartile
9. 5 players at or below 1 hit / 13 = .38 = 38th percentile
10. With so little data, and the fact that players with 1 hit span both 1st and 2nd quartiles, it isn't all that useful to call it a 2nd quartile value. Quartiles are much more useful in very large data sets with lots of variety in values, not small data sets with lots of repeats in the values.
11. One good method, if both attributes matter the same amount, would be to find the z-scores of each attribute for each person and compare those. That way, you could see how much better/worse than average each person is in each area.
- Chris squat: (280-320)/60 = -0.67, below average squat
- Chris 40yd: (5.05-5.2)/0.3 = -0.5, this is better than average because we want running times to be low numbers
- Jimmy squat: (380-320)/60 = 1, above average squat
- Jimmy 40yd: (5.35-5.2)/0.3 = 0.5, this is worse than average because we want running times to be low numbers
- Jimmy is 1.67 standard deviations better than Chris with his squat, and Chris is only 1 standard deviation better than Jimmy with his 40yd dash, so I would think Jimmy is the better overall choice.
- What other ways could you analyze this problem?

Notes

Old Instruction

Printable guided notes: version 1, version 2