Project Summary
This project aims to build a reliable machine-learning model for stock trading by deriving various stock-market indicators for data enrichment and applying several labeling techniques. Because stock data are highly volatile, our research focuses on feature importance and the explainability of the model's predictions: using the SHAP method, we analyze how each feature contributes to individual predictions (local analysis) and, by aggregating SHAP values, identify and visualize important trends and relationships across the whole dataset (global analysis).
Oral Presentation Video
Project Files
Below is the list of project files: the project plan, milestone presentations, CDay submissions, and the Jupyter notebook (Python code).
Explainability of the Model
Machine-learning models are often black boxes, which makes them difficult to interpret. To understand which features most affect the model's output, we worked with explainable machine-learning techniques that unravel some of these aspects. One of these techniques is the SHAP method, which explains how each feature affects the model and supports both local and global analysis of the dataset and problem at hand. SHAP is one of the most widely used Python packages for understanding and debugging models. It can tell us how each feature contributed to an individual prediction, and by aggregating SHAP values we can also understand trends across multiple predictions and identify and visualize important relationships in our model.
Shapley Values
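To make the idea concrete, the sketch below computes exact Shapley values for a tiny model by enumerating every feature coalition, which is the definition the shap package implements far more efficiently for tree models. The linear model, the zero-valued background row, and all numbers here are assumptions chosen so the result is easy to verify by hand:

```python
import itertools
import math

import numpy as np

def shapley_values(f, x, background):
    """Exact Shapley values for the prediction f(x).

    The value v(S) of a coalition S is estimated by fixing the features in S
    to x's values and averaging f over the background rows for the rest.
    """
    n = x.shape[0]

    def v(S):
        Xb = background.copy()
        Xb[:, list(S)] = x[list(S)]
        return f(Xb).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in itertools.combinations(others, r):
                # Shapley weight: |S|! (n - |S| - 1)! / n!
                w = math.factorial(len(S)) * math.factorial(n - len(S) - 1) / math.factorial(n)
                phi[i] += w * (v(S + (i,)) - v(S))
    return phi

# Toy linear model so each feature's fair contribution is obvious:
f = lambda X: X @ np.array([2.0, -1.0, 0.5])
background = np.zeros((1, 3))      # baseline, so E[f(x)] = 0 here
x = np.array([1.0, 1.0, 2.0])
phi = shapley_values(f, x, background)
# For a linear model with a zero background, phi_i = coefficient_i * x_i.
```

The efficiency property of Shapley values guarantees that the base value plus the sum of the per-feature values reproduces the prediction exactly, which is what the waterfall plot below visualizes.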
Waterfall Plot
There are 19 SHAP values for each observation (row): one SHAP value for each feature in our model. We used the waterfall function to visualize the SHAP values of the first observation. E[f(x)] = 0.393 is the average predicted label across all observations, and f(x) = 1 is the predicted label for this particular row; the SHAP values account for the difference between the two. For example, the Close value increased the predicted label by 0.1. Every observation in the dataset has its own unique waterfall plot. In each case, the SHAP values tell us how the features moved the prediction away from the mean prediction. Large positive or negative values indicate that a feature had a significant impact on the model's prediction.
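The additivity the waterfall plot visualizes can be checked numerically. In this sketch, only E[f(x)] = 0.393, f(x) = 1, and the Close contribution of +0.1 come from the plot described above; the remaining 18 per-feature values are hypothetical placeholders that simply account for the rest of the gap:

```python
import numpy as np

# Base value and prediction from the waterfall plot:
base_value = 0.393
f_x = 1.0
shap_close = 0.1                      # the Close contribution from the text
# Hypothetical: spread the remaining gap evenly over the other 18 features.
remaining = f_x - base_value - shap_close
shap_values = np.concatenate([[shap_close], np.full(18, remaining / 18)])
# Additivity: base value + sum of all 19 SHAP values = the prediction f(x).
reconstructed = base_value + shap_values.sum()
```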
Bar Plot
The bar plot tells us which features are most important. For each feature, we calculated the mean SHAP value across all observations. Specifically, we take the mean of the absolute values, so that positive and negative values do not offset each other. There is one bar for each feature; for example, we can see that SMA200 has the largest mean SHAP value. Features that made large positive or negative contributions have a large mean SHAP value. In other words, these are the features that had a significant impact on the model's predictions.
Fig 1: Waterfall Plot
Fig 2: Bar Plot
Beeswarm Plot
The beeswarm plot is the most informative: it visualizes all the SHAP values at once. On the y-axis, the values are grouped by feature, and within each group the color of a point is determined by the feature value. The beeswarm plot can be used to highlight important relationships, and we can also start to understand their nature. For SMA300, notice how the SHAP values increase as the feature value increases: larger SMA300 values lead to a higher predicted label. OBV shows the opposite relationship: larger values for this feature are associated with smaller SHAP values.
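The direction of a relationship read off the beeswarm plot can also be quantified, for example by the sign of the correlation between a feature's values and its SHAP values. The data below are synthetic stand-ins that mimic the SMA300 (positive) and OBV (negative) patterns described above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Synthetic feature values and SHAP values imitating the beeswarm patterns:
sma300 = rng.normal(size=n)
obv = rng.normal(size=n)
shap_sma300 = 0.8 * sma300 + rng.normal(scale=0.1, size=n)  # higher value -> higher SHAP
shap_obv = -0.8 * obv + rng.normal(scale=0.1, size=n)       # higher value -> lower SHAP

def direction(feature, shap_vals):
    """Sign of the Pearson correlation between feature values and SHAP values."""
    return np.sign(np.corrcoef(feature, shap_vals)[0, 1])
```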
Fig 3: Beeswarm Plot
Results:
After implementing Shapley values and multiple feature-selection strategies, and understanding each feature's contribution to the model's output, we compared the performance of the Random Forest model on test data across these strategies, keeping only the features with a high impact on the model's predictions. We observed an increase in accuracy, precision, and recall.
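The comparison above can be sketched as follows on synthetic data. Here, impurity-based importances stand in for the mean-|SHAP| ranking, and the dataset shape and all parameters are assumptions for the example; the project itself ranks features by aggregated SHAP values on the enriched stock data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the enriched stock dataset: 19 features, few informative.
X, y = make_classification(n_samples=600, n_features=19, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: Random Forest trained on all 19 features.
full = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
acc_full = accuracy_score(y_te, full.predict(X_te))

# Keep only the top-5 features by importance and retrain (in the project,
# this ranking comes from the mean absolute SHAP values instead).
top = np.argsort(full.feature_importances_)[::-1][:5]
reduced = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr[:, top], y_tr)
acc_reduced = accuracy_score(y_te, reduced.predict(X_te[:, top]))
```

Precision and recall can be compared the same way via sklearn.metrics.precision_score and recall_score.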
Fig 4: Model Performance Metrics