Create a machine learning model that predicts the outcome of over/under bets and identifies the best games to bet on
Project Question
Predict whether the combined score of the away and home teams will be above or below the anticipated total
Statistics from all 2020 to 2021 NBA Games
Compute the total score for each game
Calculate 4-game moving average of team stats
Calculate differential for each matchup
Captures a team's recent performance as opposed to its average ability
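The feature steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function name, the dictionary keys (home, away, home_pts, away_pts), the home-minus-away direction of the differential, and the choice to average only games played before the current one are all assumptions.

```python
from collections import defaultdict, deque


def rolling_features(games, window=4):
    """Add total score, per-team moving averages, and their differential.

    `games` must be sorted by date. Each game is a dict with the
    hypothetical keys: home, away, home_pts, away_pts.
    """
    # Keep only the last `window` scores for each team.
    history = defaultdict(lambda: deque(maxlen=window))
    rows = []
    for g in games:
        home_hist = history[g["home"]]
        away_hist = history[g["away"]]
        row = {
            "total": g["home_pts"] + g["away_pts"],
            "home_ma": None,
            "away_ma": None,
            "diff": None,
        }
        # Emit averages only once both teams have a full window of prior games.
        if len(home_hist) == window and len(away_hist) == window:
            row["home_ma"] = sum(home_hist) / window
            row["away_ma"] = sum(away_hist) / window
            # Assumed direction: home average minus away average.
            row["diff"] = row["home_ma"] - row["away_ma"]
        rows.append(row)
        # Update history after building the row (assumption: the current
        # game's score should not feed its own predictors).
        home_hist.append(g["home_pts"])
        away_hist.append(g["away_pts"])
    return rows
```

The same pattern would apply to any other team statistic (field goal %, win rate, and so on) by swapping the points fields for that statistic.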
Over-Under Bets for all 2020 to 2021 NBA Games
Determine if the total score was greater than (over) or less than (under) the anticipated total
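As a small worked example, the labeling rule could look like the sketch below. The function name is hypothetical, and the "push" case (total exactly equal to the line) is an assumption, since the slides only describe over and under.

```python
def over_under_label(home_pts, away_pts, line):
    """Label a finished game relative to the anticipated total (`line`)."""
    total = home_pts + away_pts
    if total > line:
        return "over"
    if total < line:
        return "under"
    # Assumption: an exact tie with the line is treated as a push.
    return "push"
```

For instance, a 110-105 game against a line of 212.5 has a total of 215 and would be labeled "over".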
The data is very complex
Little separation caused models to struggle initially
Purpose: raw data had few trends and initial models performed poorly
Effect: More separation between outcomes as similar observations are grouped together
Process:
Used kmeans algorithm to group the data into 2 clusters
Repeated kmeans to group each cluster into 3 subclusters
Method:
Normalize the data to prevent variables with a larger scale from driving clustering
Randomly select 2 or 3 centroids
Assign observations to the centroid they are closest to
Recalculate centroids as cluster means
Repeat the assignment and recalculation steps until the cluster scatter stops decreasing
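The process and method above can be sketched with the standard library alone. This is an illustrative implementation under stated assumptions, not the project's code: the function names are hypothetical, scaling uses the population standard deviation, initial centroids are sampled from the observations, and the subclustering step reuses the same standardized features rather than rescaling within each cluster.

```python
import math
import random


def standardize(rows):
    """Scale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0  # guard constant columns
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, seed=0, max_iter=100):
    """Return (labels, centroids) for k clusters."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]  # random initial centroids
    prev_scatter = math.inf
    labels = []
    for _ in range(max_iter):
        # Assign each observation to its closest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows]
        scatter = sum(sq_dist(r, centroids[l]) for r, l in zip(rows, labels))
        if scatter >= prev_scatter:  # scatter stopped decreasing
            break
        prev_scatter = scatter
        # Recalculate each centroid as the mean of its members.
        for j in range(k):
            members = [r for r, l in zip(rows, labels) if l == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


def two_level_clusters(rows, seed=0):
    """Split into 2 clusters, then each cluster into 3 subclusters.

    Assumes every top-level cluster ends up with at least 3 observations.
    """
    X = standardize(rows)
    top, _ = kmeans(X, 2, seed)
    result = [None] * len(X)
    for c in range(2):
        idx = [i for i, l in enumerate(top) if l == c]
        sub, _ = kmeans([X[i] for i in idx], 3, seed)
        for i, s in zip(idx, sub):
            result[i] = (c, s)  # zero-based (cluster, subcluster) pair
    return result
```

Because the result depends on the random starting centroids, a fixed seed is used here for reproducibility; in practice several restarts are commonly run and the lowest-scatter solution kept.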
All subclusters are fit and tuned with the following 3 algorithms
Note: Neural networks were considered but not used due to an insufficient amount of data
Penalized Logistic
Linear decision boundaries
Reduce effect from useless predictors
Support Vector Machines
Non-linear decision boundaries
Good performance on non-separated data
High dimensional feature expansion
Radial kerneling performed
Random Forest – Tree method
Aggregates several decision trees
Capture important trends with several different models
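To illustrate how a penalty can "reduce effect from useless predictors", here is a minimal sketch of the first algorithm, penalized logistic regression, fit by proximal gradient descent. The slides do not say which penalty was used, so the L1 (lasso) penalty here is an assumption, as are the function names, the learning rate, and the iteration count.

```python
import math


def sigmoid(z):
    # Numerically stable for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Shrink v toward zero by t; values within t of zero become exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logistic(X, y, lam=0.1, lr=0.1, n_iter=2000):
    """Fit weights w and intercept b; `lam` is the L1 penalty strength."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        # Prediction error (p - y) for every observation.
        errors = [
            sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - yi
            for row, yi in zip(X, y)
        ]
        grad_b = sum(errors) / n
        grad_w = [sum(e * row[j] for e, row in zip(errors, X)) / n for j in range(d)]
        b -= lr * grad_b  # the intercept is not penalized
        # Gradient step followed by the L1 shrinkage step.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


def predict(X, w, b):
    """Return 1 where the fitted probability is at least 0.5, else 0."""
    return [
        1 if sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) >= 0.5 else 0
        for row in X
    ]
```

The shrinkage step is what sets the weights of weak predictors to exactly zero, while strong predictors keep a nonzero (though reduced) weight. The penalty strength `lam` is the kind of parameter that would be tuned per subcluster.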
Predicted outcomes on test data for each subcluster
Used data the models were not trained on
Combine results from all subclusters for each model type
Outcome: No overwhelming improvement in any model
Logistic model sensitivity is high, but specificity suffers
Slight Improvements to Accuracy
Random Forest Kappa improved
AUC is poor for all models
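The metrics named above can be computed from pooled test predictions as in the following sketch. The function names are hypothetical, treating "over" as the positive class is an assumption, and the code assumes both classes appear in the test data.

```python
def pool(per_subcluster):
    """Concatenate per-subcluster lists into one list for a model type."""
    return [item for chunk in per_subcluster for item in chunk]


def classification_metrics(y_true, y_pred, positive="over"):
    """Accuracy, sensitivity, specificity, and Cohen's kappa."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),  # share of true overs predicted over
        "specificity": tn / (tn + fp),  # share of true unders predicted under
        "kappa": (accuracy - expected) / (1 - expected),
    }


def auc(y_true, scores, positive="over"):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Kappa is useful alongside accuracy here because over and under outcomes are close to evenly split, so a model can reach roughly 50% accuracy by chance; kappa near 0 and AUC near 0.5 both indicate performance no better than guessing.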
Interpretation
Cluster 2 performance is better
More separation between over/under observations
Data in Cluster 2 is more similar
Interpretation
Largest difference in 4-game moving averages
Points Scored,
Field Goal %,
Field Goals Made
Win Rate
Cluster 2 and Subclusters 2 and 3 perform the best
The data is very separated
Within cluster observations are very similar
Clustering
Clustering and subclustering greatly improved some models' performance
Identify games with statistics similar to the observations in cluster 2 and subclusters 2 and 3
Only bet on these games for best results
No need to bet on every game
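One way to apply this rule to an upcoming game is to assign it to the nearest cluster and subcluster centroid and bet only when it falls in the favored groups. The sketch below is a hypothetical illustration: the function names and arguments are assumptions, features are assumed to be scaled the same way as the training data, and the slides' cluster 2 and subclusters 2 and 3 are mapped to zero-based indices 1 and (1, 2).

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def should_bet(features, top_centroids, sub_centroids,
               target_cluster=1, target_subclusters=(1, 2)):
    """Return True only for games that land in the favored cluster/subclusters.

    `sub_centroids` maps a top-level cluster index to its list of
    subcluster centroids.
    """
    cluster = nearest(features, top_centroids)
    if cluster != target_cluster:
        return False
    return nearest(features, sub_centroids[cluster]) in target_subclusters
```

Games that fail the check are simply skipped, which matches the point that there is no need to bet on every game.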
Models
1) Logistic
Kappa too low
Best sensitivity
Bad specificity
2) SVM
Most accurate on unclustered data and most subclusters
Kappa decreased with subclusters
Best sensitivity and specificity
3) Random Forest
Only effective on a few subclusters
Worse AUC
Good subcluster Kappa