Predicting Over-Under Outcomes for NBA Games

Created a machine learning model that predicts the outcome of over/under bets and identifies the best games to bet on

Project Question

Will the combined score of the away and home teams be above or below the anticipated total?

Approach

Data Set

Statistics from all 2020 to 2021 NBA Games

Compute the total score for each game
Calculate 4-game moving average of team stats
Calculate differential for each matchup

Captures a team's recent performance as opposed to its average ability
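The three preparation steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the dictionary keys (`home`, `away`, `home_pts`, `away_fg_pct`, and so on) are hypothetical, the differential is taken as home minus away, and each average uses only a team's previous four games so the game being predicted never feeds its own features. The source does not say how these details were handled.

```python
from collections import defaultdict, deque

WINDOW = 4  # number of prior games averaged (the 4-game moving average)


def add_rolling_differentials(games, stats=("pts", "fg_pct"), window=WINDOW):
    """Return one feature row per game, built from each team's prior games.

    `games` is a chronologically ordered list of dicts with hypothetical keys:
    "home", "away", and "home_<stat>" / "away_<stat>" for every stat.
    "pts" must be present because it is used for the game total.
    """
    history = defaultdict(lambda: defaultdict(lambda: deque(maxlen=window)))
    rows = []
    for g in games:
        home_hist, away_hist = history[g["home"]], history[g["away"]]
        # Build features only once both teams have a full window of prior games.
        ready = all(
            len(home_hist[s]) == window and len(away_hist[s]) == window
            for s in stats
        )
        if ready:
            row = {"total": g["home_pts"] + g["away_pts"]}
            for s in stats:
                home_avg = sum(home_hist[s]) / window
                away_avg = sum(away_hist[s]) / window
                row[f"{s}_diff"] = home_avg - away_avg  # assumed: home minus away
            rows.append(row)
        # Update histories afterwards so the current game never leaks in.
        for s in stats:
            home_hist[s].append(g[f"home_{s}"])
            away_hist[s].append(g[f"away_{s}"])
    return rows
```

Games played before both teams have four prior results produce no row in this sketch; a real pipeline might instead fall back to a shorter window.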

Over-Under Bets for all 2020 to 2021 NBA Games

Determine if the total score was greater than (over) or less than (under) the anticipated total
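The labeling rule is a simple comparison, sketched below. The function name is hypothetical, and returning `None` for a total that lands exactly on the line (a push) is an assumption, since the source does not say how ties were treated.

```python
def label_outcome(total_score, anticipated_total):
    """Return "over", "under", or None when the score lands exactly on the line."""
    if total_score > anticipated_total:
        return "over"
    if total_score < anticipated_total:
        return "under"
    return None  # a push; tie handling is an assumption, not from the source
```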

The data is very complex

Little separation between over and under outcomes caused the initial models to struggle

Clustering

Purpose: raw data had few trends and initial models performed poorly

Effect: More separation between outcomes as similar observations are grouped together

Process:

Used the k-means algorithm to group the data into 2 clusters
Repeated k-means within each cluster to form 3 subclusters

Method:

Normalize the data to prevent variables with a larger scale from driving clustering
Randomly select 2 or 3 centroids
Assign observations to the centroid they are closest to
Recalculate centroids as cluster means
Repeat the assignment and recalculation steps until the within-cluster scatter stops decreasing
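The process and method above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the function names are hypothetical, the starting centroids are random observations, and the subclustering step reuses the globally normalized features, which the source does not confirm.

```python
import math
import random


def standardize(rows):
    """Scale every column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]  # constant columns keep a divisor of 1
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, seed=0, max_iter=100):
    """Return (labels, centroids) for k clusters; needs at least k rows."""
    rng = random.Random(seed)
    centroids = rng.sample(rows, k)  # random observations as starting centroids
    prev_scatter = math.inf
    labels = []
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j]))
                  for r in rows]
        scatter = sum(sq_dist(r, centroids[lab]) for r, lab in zip(rows, labels))
        if scatter >= prev_scatter:  # stop once scatter no longer decreases
            break
        prev_scatter = scatter
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


def two_stage_clusters(rows, seed=0):
    """Split into 2 clusters, then split each cluster into 3 subclusters."""
    scaled = standardize(rows)
    top, _ = kmeans(scaled, 2, seed)
    result = [None] * len(rows)
    for c in range(2):
        idx = [i for i, lab in enumerate(top) if lab == c]
        sub, _ = kmeans([scaled[i] for i in idx], 3, seed)
        for i, s in zip(idx, sub):
            result[i] = (c + 1, s + 1)  # 1-based (cluster, subcluster) labels
    return result
```

Because the starting centroids are random, a different seed can produce different clusters; running several seeds and keeping the lowest-scatter solution is a common safeguard.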

Model Fitting

Each subcluster is fit and tuned with the following 3 algorithms

Note: Neural networks were considered but not used due to an insufficient amount of data

Penalized Logistic Regression

Linear decision boundaries
Reduces the effect of uninformative predictors

Support Vector Machines

Non-linear decision boundaries
Good performance on non-separated data
1. High-dimensional feature expansion
2. Radial kernel used

Random Forest – Tree method

Aggregates several decision trees
Captures important trends with several different models
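One way to fit and tune the three algorithms on a single subcluster is sketched below with scikit-learn. Everything specific here is an assumption: the source names neither its software nor its tuning grids, the data is synthetic, and the logistic model uses scikit-learn's default L2 penalty because the source does not say which penalty was applied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for one subcluster's features and over/under labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Each entry pairs an estimator with a small, illustrative tuning grid.
candidates = {
    "penalized_logistic": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},  # smaller C = stronger penalty
    ),
    "svm_radial": (
        make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        {"svc__C": [0.1, 1.0, 10.0], "svc__gamma": ["scale", 0.01, 0.1]},
    ),
    "random_forest": (
        RandomForestClassifier(n_estimators=100, random_state=0),
        {"max_features": ["sqrt", 0.5], "min_samples_leaf": [1, 5]},
    ),
}

fitted = {}
for name, (estimator, grid) in candidates.items():
    # 5-fold cross-validation picks the best setting from each grid.
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    fitted[name] = search.best_estimator_
```

In the workflow described here, this loop would run once per subcluster, with each subcluster's held-out games kept aside for the prediction step.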

Predictions

Predicted outcomes on test data for each subcluster
1. Used data the models were not trained on
2. Combined results from all subclusters for each model type

Results

Average Model Performance

Outcome: No overwhelming improvement in any model
1. Logistic model sensitivity is high, but specificity suffers
2. Slight improvements to accuracy
3. Random Forest Kappa improved
4. AUC is poor for all models
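For reference, the metrics named above can be computed from pooled predictions as in the following sketch. Treating "over" as the positive class is an assumption, the function names are hypothetical, and the formulas are the standard definitions rather than anything confirmed by the source. Both classes must appear in `actual` for the ratios to be defined.

```python
def confusion_counts(actual, predicted, positive="over"):
    """Return (tp, tn, fp, fn) with `positive` as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, tn, fp, fn


def classification_metrics(actual, predicted, positive="over"):
    """Accuracy, sensitivity, specificity, and Cohen's kappa."""
    tp, tn, fp, fn = confusion_counts(actual, predicted, positive)
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the marginal class frequencies.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),  # share of actual overs predicted as over
        "specificity": tn / (tn + fp),  # share of actual unders predicted as under
        "kappa": (accuracy - expected) / (1 - expected),
    }


def auc(actual, scores, positive="over"):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Kappa corrects accuracy for chance agreement, so a model that predicts "over" for nearly every game can show high sensitivity while its kappa stays near zero.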

Average Cluster Performance

Interpretation

Cluster 2 performance is better
More separation between over/under observations
Data in Cluster 2 is more similar

Interpretation

Largest difference in 4-game moving averages

Points Scored
Field Goal %
Field Goals Made
Win Rate

Best Subcluster Performance

Cluster 2 and Subclusters 2 and 3 perform the best

The data is very separated
Within cluster observations are very similar

Conclusion

Clustering

Clustering and subclustering greatly improved some models' performance
Identify games with statistics similar to the observations in cluster 2, subclusters 2 and 3
1. Only bet on these games for best results
2. No need to bet on every game
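A selective betting rule along these lines could be sketched as follows. The function names and the centroid layout are hypothetical; the sketch assumes the cluster and subcluster centroids from training are available and that an upcoming game's features are scaled the same way as the training data.

```python
FAVORABLE = {(2, 2), (2, 3)}  # (cluster, subcluster) pairs singled out above


def nearest(point, centroids):
    """Return the 1-based index of the closest centroid."""
    dists = [sum((x - c) ** 2 for x, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists)) + 1


def should_bet(features, cluster_centroids, subcluster_centroids):
    """Assign a game to a cluster, then a subcluster, and check the pair.

    `subcluster_centroids[c]` lists the subcluster centroids for cluster c.
    """
    cluster = nearest(features, cluster_centroids)
    subcluster = nearest(features, subcluster_centroids[cluster])
    return (cluster, subcluster) in FAVORABLE
```

Games that fall outside the favorable subclusters are simply skipped rather than bet with a weaker model.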

Models

1) Logistic

Kappa too low
Best sensitivity
Bad specificity

2) SVM

Most accurate on unclustered data and most subclusters
Kappa decreased with subclusters
Best combined sensitivity and specificity

3) Random Forest

Only effective on a few subclusters
Worse AUC
Good subcluster Kappa
