This project combined data from multiple sources, primarily PCPartPicker and Steam, in order to recommend ideal components for a person to buy when building a PC. To do so, we scraped, cleaned, and analyzed our data and then built models to answer questions about building a PC.
During this process, we learned that:
Web scraping is no longer as trivial as it once was. Many websites now block automated requests, so more complex workarounds are often required.
Data is not always going to be formatted in an obvious way that suits your goals, as it is in class.
Pre-processing of your data will take the largest chunk of your time.
It is a cyclical process: even after you have finished pre-processing, you may have to return to it, or even to data generation, once you start modeling.
Oftentimes you will need to create your own features from other variables you have collected.
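As a rough illustration of the last point, the sketch below derives two new features from variables that were already collected. The field names (price, cores, boost_ghz) and the derived features themselves are hypothetical and are not the features used in this project.

```python
# Hypothetical records, as they might look after scraping and cleaning.
parts = [
    {"name": "CPU A", "price": 200.0, "cores": 6, "boost_ghz": 4.0},
    {"name": "CPU B", "price": 320.0, "cores": 8, "boost_ghz": 4.5},
]


def add_derived_features(part):
    """Return a copy of the record with two engineered features added."""
    enriched = dict(part)
    # Rough throughput proxy (an assumption): core count times boost clock.
    enriched["core_ghz"] = part["cores"] * part["boost_ghz"]
    # Cost efficiency: dollars paid per unit of that proxy.
    enriched["price_per_core_ghz"] = part["price"] / enriched["core_ghz"]
    return enriched


enriched_parts = [add_derived_features(p) for p in parts]
```

Neither derived column exists in the raw data, but both can be compared across parts in a way the original columns cannot.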
Challenges we faced:
During the collection phase, PCPartPicker completely blocked our access to the website, and we had to use a VPN to regain it. Once we had access again, we manually pulled the HTML from each page of data we wanted to collect and saved it as a txt file that we then loaded into Python to parse.
Certain Steam games were age restricted, so scraping them required multiple steps rather than a single general requests pull.
Steam also had inconsistent naming conventions for components, which required the use of regular expressions for matching.
The original vision for this project was to build a PC builder rather than a predictive model. The project requirements then made us shift our focus to a predictive model, even though our data generation and cleaning had not been oriented toward this type of model.
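To make the naming-convention challenge concrete, here is a minimal sketch of how regular expressions can map differently written GPU names to a single key. The sample strings, the pattern, and the normalize_gpu helper are all hypothetical and are not the expressions used in this project.

```python
import re

# Hypothetical ways the same GPU might be written in requirement text.
RAW_NAMES = [
    "NVIDIA GeForce GTX 1060 6GB",
    "Nvidia® GeForce™ GTX1060",
    "geforce gtx-1060",
    "AMD Radeon RX 580",
    "Radeon(TM) RX580 8GB",
]

# Family prefix (GTX, RTX, RX) followed by a 3-4 digit model number,
# with an optional space or hyphen between them.
GPU_PATTERN = re.compile(r"\b(gtx|rtx|rx)[\s-]?(\d{3,4})\b", re.IGNORECASE)


def normalize_gpu(name):
    """Return a canonical key such as 'GTX 1060', or None if nothing matches."""
    match = GPU_PATTERN.search(name)
    if match is None:
        return None
    family, model = match.groups()
    return f"{family.upper()} {model}"


normalized = [normalize_gpu(name) for name in RAW_NAMES]
```

A pattern this narrow only covers a few product families. Names that do not match return None, so they can be collected and reviewed by hand rather than silently mismatched.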
Some key findings from our analysis were:
GPU chipset brand loyalty is not justified by the data when looking at price and performance statistics. Both Nvidia and AMD show a relatively homogeneous distribution of price and performance, and we did not find any distinct groups or patterns that revealed brand-specific trends.
We were able to define a methodology for determining the optimal PC component for various price tiers from products currently on the market. We demonstrated this methodology by choosing the best CPU for the low, mid, high, and ultra price tiers (price tiers determined statistically based on available products). This methodology proved to be an accurate and useful analysis tool for choosing the best components when building a PC, and it can easily be extended to components other than CPUs in order to cover a full gaming PC.
Lastly, we were able to build a forecasting model to gain insight into how soon we might expect a manufacturer to produce a CPU with specific performance attributes. This model was more challenging to build and somewhat less accurate than the others, but it still provided results that seemed reasonable based on intuition and on available data from previous years.
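The price-tier idea described above can be sketched as follows. This is only an illustration under stated assumptions: the CPU records are made up, the tiers are taken to be price quartiles, and "best" is taken to mean the highest benchmark score per dollar. The project's actual tier boundaries and ranking criterion may differ.

```python
from statistics import quantiles

# Hypothetical CPU records: (name, price in USD, benchmark score).
CPUS = [
    ("CPU A", 100, 900),
    ("CPU B", 120, 1200),
    ("CPU C", 200, 1800),
    ("CPU D", 230, 1900),
    ("CPU E", 320, 2600),
    ("CPU F", 350, 2500),
    ("CPU G", 500, 3200),
    ("CPU H", 600, 3300),
]

TIERS = ["low", "mid", "high", "ultra"]


def tier_of(price, cuts):
    """Map a price to a tier using the three quartile cut points."""
    for tier, cut in zip(TIERS, cuts):
        if price <= cut:
            return tier
    return TIERS[-1]


def best_per_tier(cpus):
    """Pick the CPU with the highest score per dollar in each price tier."""
    prices = [price for _, price, _ in cpus]
    cuts = quantiles(prices, n=4)  # three cut points give four tiers
    best = {}
    for name, price, score in cpus:
        tier = tier_of(price, cuts)
        value = score / price  # assumed ranking criterion
        if tier not in best or value > best[tier][1]:
            best[tier] = (name, value)
    return {tier: name for tier, (name, _) in best.items()}


recommendations = best_per_tier(CPUS)
```

Because the cut points are computed from whatever products are in the list, the same function could be applied to GPUs or other components by swapping in their records.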
Overall, we believe that we have assembled a useful set of tools that enable consumers to make informed, data-driven decisions when choosing parts for a custom gaming PC. These tools provide answers to key questions such as the significance of component brands and when to expect new products to be released. Having built custom gaming PCs ourselves, we understand how confusing the process can be and how much time it takes to search for the optimal component within a certain budget. We believe this toolset can alleviate much of this stress and confusion and can help get people into the game faster and with a more optimal build for their money.