Our first model sought to identify the product with the best cost-to-performance ratio. To do this, we implemented a Multiple Linear Regression (MLR) on selected feature variables to predict our response variable. To demonstrate, we conducted the MLR on our CPU data, using Price as the response and Core Count and Performance Core Clock as the features.
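A minimal sketch of this setup might look like the following. The data here is synthetic stand-in data (the real CPU data frame is not reproduced), and the column names simply mirror the description above:

```python
# Minimal sketch of the MLR described above, on synthetic stand-in data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Core Count": rng.integers(2, 17, 200),
    "Performance Core Clock": rng.uniform(2.0, 5.0, 200),
})
df["Price"] = (20 * df["Core Count"]
               + 50 * df["Performance Core Clock"]
               + rng.normal(0, 40, 200))

X = df[["Core Count", "Performance Core Clock"]]
y = df["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlr = LinearRegression().fit(X_train, y_train)
r2 = mlr.score(X_test, y_test)

# Adjusted R-squared penalizes R^2 for the number of predictors p.
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```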
From the output on the left, we can see that the initial model does not fit very well: it has an adjusted R-squared value of .258, which is quite low, as well as an accuracy score of .265.
With such a low adjusted R-squared value, we decided to check whether outliers played a part.
Our theory was correct: the data frame did contain outliers in terms of price, as depicted above. We removed these from our data frame and re-ran the MLR.
The second MLR produced much better results, with an adjusted R-squared value of .510 and an accuracy score of .515.
We concluded that these results were sufficient, as this model would be used to find the lowest residual in each price category and thereby determine the product that offers the most performance for the lowest cost.
To determine the price point to set for each price category, we created a boxplot and an output for the distribution of the Price column in our data frame.
These values were used to decide our price point categories:
Low < 151.17
Medium < 229.48
High < 342.49
Ultra is any price, selecting the highest-performing CPU regardless of cost
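Assuming the three cutoffs above correspond to quartiles of the Price distribution, the bucketing can be sketched as follows. The prices here are hypothetical, and note that the Ultra pick itself was chosen on performance rather than price, so this only illustrates the three price cutoffs:

```python
import pandas as pd

# Hypothetical prices; the real thresholds (151.17 / 229.48 / 342.49)
# came from the distribution of our actual Price column.
prices = pd.Series([45, 89, 120, 150, 199, 229, 260, 310, 342, 410, 550])

# Assumption: the thresholds are the quartiles of the distribution.
q1, q2, q3 = prices.quantile([0.25, 0.50, 0.75])
labels = pd.cut(
    prices,
    bins=[-float("inf"), q1, q2, q3, float("inf")],
    labels=["Low", "Medium", "High", "Ultra"],
)
```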
After selecting our four CPUs, we decided to plot our MLR to get a visual representation of its performance.
From the graph to the right, we can see that the model does a good job of predicting the price of a CPU given the two feature variables we selected. There appear to be more outliers at the high end of the price range, so removing those would likely produce a better-fitting model.
However, the purpose of this model was to select the CPUs with the lowest residuals, and the model we created accomplishes that goal.
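The lowest-residual selection described above can be sketched as follows, with a hypothetical data frame of actual prices, model predictions, and price categories:

```python
import pandas as pd

# Hypothetical frame: actual prices, model predictions, price category.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Category": ["Low", "Low", "Medium", "Medium", "High", "High"],
    "Price": [100, 110, 200, 210, 300, 320],
    "Predicted": [120, 112, 230, 205, 340, 318],
})

# Residual = actual - predicted; the most negative residual is the CPU
# priced furthest below what the model says its specs are worth.
df["Residual"] = df["Price"] - df["Predicted"]
best = df.loc[df.groupby("Category")["Residual"].idxmin()]
```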
To take this further, we could complete this model for each component required to build a PC. This could be used to find the best-performing unit for each component, and then compile those to build the PC that fits your budget, knowing that you are also maximizing performance.
If we spent more time removing outlier data to achieve a better fit, we could use this model to predict future prices of CPUs given targeted Core Counts and Performance Core Clock speeds. This could be useful in conjunction with our third model, which predicts the year a CPU with given performance measures will be released.
For our second model, we wanted to zoom in on GPUs specifically and see whether a particular chipset manufacturer was generally more distinguishable in any way than other brands. With PC components, there can often be a fair amount of brand loyalty that is not always justified by data. We wanted to see whether we could identify any distinct patterns or groups for GPUs in our dataset that could indicate whether different chipset manufacturers were distinguishing themselves in some way from their competition.
To answer this question, we chose to use clustering, as it could identify any distinct groups of products that fit trends in the price and performance data we had collected. If the resulting groups from the clustering analysis aligned with one particular manufacturer more than another, we could then investigate further to determine exactly what sets that manufacturer's products apart from the others. The beauty of clustering is that we don't know ahead of time what trends we will find; only after interpreting the results can we see whether there is any merit to the brand loyalty so many customers exhibit.
Below is a screenshot of how the data looked after processing the HTML files from PCPartPicker but before doing any cleaning in preparation for the clustering analysis. Notice that the chipset manufacturer is not provided for each GPU; instead, only a name is given representing the chipset model.
Since a core objective of this analysis relates to the manufacturers of the chipsets, some additional processing was needed: we used regular expressions to derive the chipset manufacturer from each chipset model. We also dropped data that was not relevant to the clustering analysis. The name of the GPU card was removed, since we were only interested in the chipset manufacturer, and all of the rating information was removed as well: little rating data was available, most products were either unrated or rated very highly, and with so few reviews the ratings did not seem like a reliable signal. Then, since clustering requires all-numeric data types, we applied further regular-expression manipulation to every value that had units embedded in it (making sure to perform any necessary conversions). This left us with the following dataset:
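A rough sketch of this processing, with hypothetical rows and regex patterns (the exact patterns we used are not reproduced here):

```python
import re
import pandas as pd

# Hypothetical raw rows; the real data came from PCPartPicker HTML.
df = pd.DataFrame({
    "Chipset": ["GeForce RTX 3060", "Radeon RX 6600", "Arc A750"],
    "Memory": ["12 GB", "8 GB", "8 GB"],
    "Core Clock": ["1320 MHz", "1626 MHz", "2050 MHz"],
})

# Derive the manufacturer from the chipset model name.
def manufacturer(chipset):
    if re.search(r"GeForce|RTX|GTX", chipset):
        return "NVIDIA"
    if re.search(r"Radeon|RX", chipset):
        return "AMD"
    if re.search(r"Arc", chipset):
        return "Intel"
    return "Other"

df["Manufacturer"] = df["Chipset"].map(manufacturer)

# Strip embedded units so every remaining column is numeric.
for col in ["Memory", "Core Clock"]:
    df[col] = df[col].str.extract(r"([\d.]+)", expand=False).astype(float)
```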
You'll notice that even though our entire dataset is of a numeric type, there are many missing values for a few of the attributes. There were a number of methods we could have used to handle this missing data (e.g., replace with the min/max/average, or delete), but we chose to look at the relationships between individual attributes to see if there were any notable trends we could use to help populate the missing data. Below is a pair plot that plots every attribute against every other. We used this plot to understand which attributes were most correlated with each other, and we then created linear regression models between some of the attributes, as needed, to predict the missing values based on their trends with other similar attributes. Also shown below is an example of one of these regression models, between the Boost Clock and Core Clock speeds. Through careful use of a number of these models, we were able to fill in all missing data in the dataset.
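The regression-based imputation can be illustrated with a toy example; the values and the exact linear relationship here are fabricated for demonstration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical example: Boost Clock is missing for some cards but is
# strongly correlated with Core Clock.
df = pd.DataFrame({
    "Core Clock": [1300.0, 1400.0, 1500.0, 1600.0, 1700.0, 1800.0],
    "Boost Clock": [1500.0, 1600.0, np.nan, 1800.0, np.nan, 2000.0],
})

# Fit a regression on the rows where Boost Clock is known...
known = df.dropna(subset=["Boost Clock"])
model = LinearRegression().fit(known[["Core Clock"]], known["Boost Clock"])

# ...and use it to fill in the rows where it is missing.
missing = df["Boost Clock"].isna()
df.loc[missing, "Boost Clock"] = model.predict(df.loc[missing, ["Core Clock"]])
```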
Now that all the missing values had been filled in, we proceeded to identify any outlier values that could adversely impact our clustering analysis. Below is a series of boxplots for each attribute. To identify outliers, we chose the standard method of flagging anything more than 1.5 times the interquartile range above the third quartile or below the first quartile. This is shown on the plot by the dots outside the whiskers of each boxplot. You'll notice that most of the outliers are in the price attribute. This was a trend we noticed throughout the project: it seems very common for older or rarer graphics cards to be priced significantly above their MSRP. For this reason, we chose to simply remove all data points corresponding to these outliers before moving on, because the presence of these over-inflated prices could skew our data.
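The 1.5 × IQR rule described above, sketched on a hypothetical price series with one inflated listing:

```python
import pandas as pd

# Hypothetical prices with one over-inflated listing.
prices = pd.Series([250, 300, 320, 350, 380, 400, 420, 2500])

# Standard rule: drop anything more than 1.5 * IQR beyond Q1 or Q3.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
mask = (prices >= q1 - 1.5 * iqr) & (prices <= q3 + 1.5 * iqr)
cleaned = prices[mask]
```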
The final preprocessing step, now that we had a clean dataset without any outliers, was to perform principal component analysis (PCA) to determine whether any of the information contained in our current list of attributes was redundant and whether there was an opportunity to reduce the dimensionality of our clustering analysis. Below is a summary plot of our PCA results, where you can compare the explained variance of each principal component and see how significantly each one contributes to the useful information contained in our original dataset. We decided that a cumulative explained variance threshold of 95% would be a sufficient tradeoff between maintaining as much of the original information as possible and reducing the dimensionality of the problem to improve the interpretability of our clustering analysis. This meant that we could drop 3 of the principal components, keeping only the first 2 to meet our cumulative explained variance threshold.
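A sketch of selecting components against a 95% cumulative explained variance threshold, using synthetic data with deliberately redundant columns:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic 5-attribute data built from 2 latent factors, so several
# columns carry redundant information (as in our real dataset).
base = rng.normal(size=(100, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] * 2 + rng.normal(0, 0.05, 100),
    base[:, 1],
    base[:, 1] * 3 + rng.normal(0, 0.05, 100),
    base[:, 0] + base[:, 1],
])

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Keep the smallest number of components reaching 95% cumulative variance.
n_keep = int(np.searchsorted(cumvar, 0.95) + 1)
X_reduced = PCA(n_components=n_keep).fit_transform(X_scaled)
```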
Now that we had finalized our dataset and reduced it to only 2 dimensions, we could proceed with the clustering analysis. One of the key variables when performing clustering is how many clusters to form. There is a standard method for identifying the optimal number of clusters called the "elbow method". This method runs the clustering algorithm multiple times and computes a metric called the within-cluster sum of squares (WCSS), which captures the tradeoff between fit (a smaller WCSS means points in each cluster are closer together) and model complexity (more clusters). The ideal point is where the slope of the elbow curve starts to flatten out, meaning you have reached the point of diminishing returns for increasing the number of clusters. For our dataset, the ideal number of clusters was 3.
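The elbow method can be sketched with scikit-learn's KMeans on synthetic data standing in for our two principal components:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three obvious groups, standing in for the
# two principal components from the PCA step.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=0)

wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares

# WCSS always shrinks as k grows; the "elbow" is where the drop flattens.
drops = [wcss[i] - wcss[i + 1] for i in range(len(wcss) - 1)]
```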
(Optimal number of clusters = 3)
Finally, now that we had identified the optimal settings for our clustering algorithm, we could visualize the results. Shown below are the same results with two different coloring strategies to aid visualization. The plot on the left colors points by the clusters formed by the algorithm, with the chipset manufacturer of each point identified by its style (circle, square, cross). The plot on the right shows the same data with the roles reversed: color indicates the chipset manufacturer and point style indicates the cluster.
(Color = Cluster, Style = Manufacturer)
(Color = Manufacturer, Style = Cluster)
Notice that even though the clustering algorithm defined what appear to be distinct groupings in our dataset, the chipset manufacturers are distributed very homogeneously throughout each of these clusters. The primary attributes used in this analysis (even though PCA mixes them together by design) are price and performance, so these results indicate that there does not appear to be any significant distinguishing factor between chipset manufacturers when considering price and performance data.
This implies that, for our dataset at least, there are no data-based grounds for strong brand loyalty, as each chipset brand is shown to be roughly interchangeable in our results. This makes sense, as these companies are very close competitors, and it would be an unexpected result to see any of them stand out substantially from the others when it comes to price and performance.
Having figured out the optimal CPUs to buy using Model 1, given a price threshold on current CPUs in the market, we were interested in how future CPUs might perform.
To do this, we created a Generalized Additive Model (GAM) that, given the feature variables Brand, Boost Clock, and Performance Core Clock, predicts the response variable Release Year.
To the left, you can see our first attempt at creating the model. Our pseudo R-squared value was .6943, indicating a moderate fit. Our accuracy score, however, was only 28%. Comparing the predicted values against our test data set, we can see that the model had some trouble getting the exact year right but was always within two years.
We reviewed our data and concluded that, because of how CPU tiers are released each year, our model was likely performing poorly due to newer low-end CPUs being released that do not perform as well as older high-end CPUs. This would weaken the performance-to-release-year correlation. Prior to this realization, we were only capturing release year and performance metrics; now we could also give the model information on how each CPU was classified when it was released.
To accomplish this, we created separate data frames for each Release Year, and within each Release Year we calculated 5 quantiles of the Boost Clock variable to correspond with the lowest, low, middle, high, and highest tiers of CPUs released in that year. Once this variable was added to each year's data frame, we combined them back together and re-ran the model.
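The per-year tier assignment can be sketched with pandas' qcut applied within each Release Year group; the column names and values here are hypothetical:

```python
import pandas as pd

# Hypothetical CPUs; tiers are assigned within each Release Year so a
# newer low-end chip isn't compared against an older flagship.
df = pd.DataFrame({
    "Release Year": [2015] * 5 + [2020] * 5,
    "Boost Clock": [3.0, 3.2, 3.5, 3.8, 4.0, 3.6, 4.0, 4.4, 4.8, 5.2],
})

tiers = ["Lowest", "Low", "Middle", "High", "Highest"]
# qcut within each year splits that year's CPUs into 5 quantile bins.
codes = df.groupby("Release Year")["Boost Clock"].transform(
    lambda s: pd.qcut(s, q=5, labels=False)
)
df["Tier"] = codes.map(dict(enumerate(tiers)))
```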
This new model produced the following results:
The graph to the right is a visual representation of the spread of the data the model had to sift through. We can see general clusters within the data frame, which indicates that the model will likely succeed at predicting the Release Year.
The pseudo R-squared value is .9962, indicating a very good fit, but our accuracy score is only 48.7%. Looking into this further, we can see that the model predicts the test data set with 48.7% exact accuracy; however, it is never off by more than one year. This is an acceptable margin, and if we adjust our definition of an accurate result to plus or minus one year, our model achieves 100% accuracy.
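The relaxed accuracy metric is simple to compute; a sketch with hypothetical predictions:

```python
import numpy as np

# Hypothetical actual vs. predicted release years.
actual = np.array([2018, 2019, 2020, 2021, 2022, 2023])
predicted = np.array([2018, 2020, 2020, 2020, 2022, 2024])

exact_acc = np.mean(predicted == actual)               # strict accuracy
within_one = np.mean(np.abs(predicted - actual) <= 1)  # +/- 1 year counts
```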
Therefore, we can use this model to predict the year a CPU will be released, given inputs for the Brand (AMD or Intel), Boost Clock Speed, Performance Core Clock Speed, and Tier of CPU.
Say you have the question, "When will an Intel CPU come out that has a Boost Clock Speed of 8, a Performance Core Clock Speed of 6, and will be a medium-tier CPU on release?"
How about for AMD? As we showed in Model 2, the manufacturer should not play a role in performance and you should not be able to distinguish one from the other. How does that hold up for our predictive model? Will AMD or Intel achieve these results sooner than the other?
As you can see, we got the same result: both should achieve these metrics in the year 2034!