To begin our data collection process, we determined that our best resources were PCPartPicker and Steam.
From PCPartPicker, we selected each component needed to build a PC: CPU, GPU, Memory, Storage, Motherboard, and Power Supply. For each of these components, we had to pull the HTML files from their respective web pages manually, because PCPartPicker strictly blocked both the Python requests package and every attempt we made to automate a Google Chrome or Microsoft Edge browser with Selenium. As a last resort, we collected the HTML manually by navigating to each page and saving the raw HTML code to files. We then used Python's BeautifulSoup package to parse these HTML files, pull out the requisite information, and build data frames corresponding to each component. This was an iterative process, as each component's listings span multiple web pages; the Memory component had the most, with 124 web pages to collect data from.
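The parsing step looked roughly like the following sketch. The table structure, class names, and rows here are simplified stand-ins for PCPartPicker's actual markup, which differs:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a saved PCPartPicker listing page; the real
# markup uses different class names and many more columns per product.
html = """
<table>
  <tr class="product"><td class="name">AMD Ryzen 5 3600</td><td class="price">$84.99</td></tr>
  <tr class="product"><td class="name">Intel Core i5-12400F</td><td class="price">$109.99</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.select("tr.product"):
    rows.append({
        "Name": tr.select_one("td.name").get_text(strip=True),
        "Price": tr.select_one("td.price").get_text(strip=True),
    })
# The accumulated rows can then be loaded into a data frame with pd.DataFrame(rows).
```

In the real pipeline, a loop like this ran over every saved HTML file for a component before the rows were combined into one data frame.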
Before cleaning the data, the data frames looked similar to the example below from the CPU component:
However, after running our Python code, we were able to easily translate this into a clean and usable data frame. Our cleaned data frame for our CPU Data can be seen below:
At this stage, our CPU data frame was not yet complete, as we also wanted to analyze the age of CPUs alongside their performance. To do this, we scraped Wikipedia to collect the model names of CPUs produced by AMD and Intel, along with the year each model was released, and put this data into a new data frame. We then split our CPU Name column, seen above, into two new columns titled Brand and Model. From there, using our new data frame of CPU models and release years, we added the release year to the data frame above via a function that searched both data frames for matches and, whenever one was found, copied the Release Year into that row.
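The split-and-match step can be sketched as follows. Our actual code used a row-by-row matching function, but a pandas left merge expresses the same idea; the rows below are illustrative:

```python
import pandas as pd

# Illustrative rows; column names match our data frames.
cpus = pd.DataFrame({"Name": ["AMD Ryzen 5 3600", "Intel Core i5-12400F"]})
years = pd.DataFrame({"Model": ["Ryzen 5 3600", "Core i5-12400F"],
                      "Release Year": [2019, 2022]})

# Split "Name" into Brand (first word) and Model (the remainder).
cpus[["Brand", "Model"]] = cpus["Name"].str.split(" ", n=1, expand=True)

# Attach the release year wherever the model names match.
cpus = cpus.merge(years, on="Model", how="left")
```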
It was during this step that we noticed not all CPUs are compatible with every type of motherboard, so we wanted a way to know which CPU models fit which motherboards. The motherboard data we pulled included socket type, but the CPU data was missing this information. We once again used Wikipedia to identify the socket each CPU fits, based on its model name. We then ran a function that iterated through this newly formed data frame and our existing CPU data frame to add a new column for the supported socket type, titled "Sockets Supported."
Additionally, during this phase of the cleaning process, we removed any CPU with a core count below 4, our assumed minimum for running modern video games. We also removed any CPU without price information, as these were out of stock at the time this web page was created.
The final cleaned CPU data frame can be seen below:
We executed a similar process for the rest of the components on PCPartPicker to clean the data we gathered from their website.
From PCPartPicker we scraped data on six different components, with the following breakdown:
CPU - 1,370 products with 10 data points each for a total sum of 13,700 data points. Cleaning reduced us to 210 products and 2,100 data points.
GPU - 3,016 products with 9 data points each for a total sum of 27,144 data points. Cleaning reduced us to 1,062 products and 9,558 data points.
Memory - 12,331 products with 9 data points each for a sum of 110,979 data points. Cleaning reduced us to 2,460 products and 22,140 data points.
Motherboard - 4,462 products with 8 data points each for a sum of 35,696 data points. Cleaning reduced us to 700 products and 5,600 data points.
Storage - 5,962 products with 9 data points each for a sum of 53,658 data points. Cleaning reduced us to 1,382 products and 12,438 data points.
Power Supply - 3,016 products with 9 data points each for a sum of 27,144 data points. Cleaning reduced us to 691 products and 6,219 data points.
The totals prior to cleaning the data were 30,157 products with 268,321 data points.
Cleaning the data reduced this to 6,505 products with 58,055 data points, a reduction of roughly 78%. In other words, most of the raw data was of no use to us, which demonstrates the importance of cleaning your data before building models and conducting analysis.
In addition to the hardware data we collected from PCPartPicker, we also wanted hardware-related information about the games people are currently playing. To do this, we scraped hardware specifications from the Steam listings of the current top 100 most-played games. Steam maintains a web page that ranks these games; we used it to collect each game's name, then followed its embedded hyperlinks to reach each game's individual listing. From each listing, we extracted hardware-relevant information such as the minimum and recommended specifications for operating system, DirectX, GPU, CPU, storage, and memory.
A complicating factor unique to the Steam web scraping was that some games require the user to submit a date of birth before their mature-rated listings can be accessed. Many of the top-played games fall under this rating, so we implemented a feature in our web scraping code that detects whether an "age gate" appeared when accessing a game's listing and, if so, enters a specific date and presses the "View Page" button to continue to the listing page. This meant that we not only had to detect the age gate itself, but also had to find and interact with a drop-down list and the "View Page" button on the page.
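The detection step can be sketched as a small pure function over the driver's current URL and page source. The specific markers checked here (an "agecheck" URL fragment, an age drop-down field in the HTML) are assumptions about Steam's page structure, not a verbatim copy of our scraper:

```python
def is_age_gated(current_url: str, page_source: str) -> bool:
    """Heuristically detect Steam's age-verification interstitial.

    Assumed markers: age-gated listings redirect to an "agecheck" URL,
    and the interstitial contains a birth-year drop-down field.
    """
    return "agecheck" in current_url or "ageYear" in page_source
```

In the full scraper, a positive result triggered Selenium code that selected a birth year from the drop-down and clicked the "View Page" button before re-reading the game's listing.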
After collecting all the data from the web scraping process, the resulting information was saved to a CSV file for later use. The raw data collected from the webpages was very messy and did not follow any strict formatting for how the minimum and recommended specifications were defined. Often, chipset manufacturers and component names would be abbreviated, shortened, or sometimes unofficial labels would be used to describe the requirements. In some instances, the specific component could be buried in a full sentence describing the requirement. Below is a snapshot of some of the data prior to our cleaning process:
In order to tackle this difficult cleaning process, we needed to use a very generalized and flexible method for extracting only the necessary information from each field and discarding all the remaining, unnecessary data. To do this, we used a series of regular expression matches to search the sea of text for only the hardware-specific information we needed. We could then clean up this information into a standardized format that would make working with it easier for later steps in the data mining process. An example of the regular expressions used is shown below for extracting GPU data:
The above case is just a small snippet of the code used to clean the GPU data and is specific to NVIDIA cards (AMD cards followed a similar process, with exceptions specific to AMD naming conventions). As the code shows, the regular expressions needed to be very flexible: some game listings omitted the GPU chipset manufacturer name, the card series name, or both, while others misspelled words like "GeForce" as "GForce" or left out key prefix/suffix labels that identify the specific graphics card requirement. The model number naming conventions for NVIDIA chipsets have also changed significantly over time, so, in addition to the regular expressions, we needed some logical branches to further drill down to the specific chipset based on historical naming convention rules we derived from looking at NVIDIA graphics card names over time. Similar rules were also derived for AMD GPU chipsets, Intel CPUs, and AMD CPUs.
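A simplified sketch of this kind of flexible matching is shown below. This is not our production pattern (which was longer and paired with the logical branches described above), but it illustrates tolerating a missing "NVIDIA"/"GeForce" label and the common "GForce" misspelling:

```python
import re

# Hedged sketch of an NVIDIA-matching pattern; the real cleaning code
# used a longer series of expressions plus historical-naming branches.
NVIDIA_PATTERN = re.compile(
    r"(?:nvidia\s+)?"                          # manufacturer label often omitted
    r"(?:ge?force\s+(?:(gtx|rtx|gts|gt)\s*)?"  # "GeForce"/"GForce" + optional prefix
    r"|(gtx|rtx)\s*)"                          # or a bare GTX/RTX prefix
    r"(\d{3,4})"                               # model number
    r"(?:\s*(ti|super)\b)?",                   # optional suffix label
    re.IGNORECASE,
)

def extract_nvidia_gpu(text):
    """Return a standardized NVIDIA chipset name, or None if no match."""
    m = NVIDIA_PATTERN.search(text)
    if not m:
        return None
    pre1, pre2, model, suffix = m.groups()
    parts = ["NVIDIA GeForce"]
    prefix = pre1 or pre2
    if prefix:
        parts.append(prefix.upper())
    parts.append(model)
    if suffix:
        parts.append(suffix.capitalize())
    return " ".join(parts)
```

The standardized output format ("NVIDIA GeForce GTX 1060") is what makes later joins against the PCPartPicker price data possible.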
A process similar to this was performed for each type of Steam data: operating system, CPU, DirectX version, memory, and storage requirements. The following image is a snapshot of how the Steam data looked after cleaning. Notice the addition of 4 columns to split up the GPU and CPU requirements by manufacturers (NVIDIA & AMD for GPU / Intel & AMD for CPU). Empty cells are where either no data was provided, or specific hardware was not specified.
For now, we will simply ignore any field that does not have a specification listed. Depending on what models we decide to use in later parts of this project, we may decide to approximate by substituting minimum, maximum, or average performing hardware in place of empty data. This cleaned data was also saved to CSV format for ease of use in later steps.
Lastly, we also captured data on CPU and GPU usage rates from a different section of Steam. Rather than being specific to the top 100 most-played games, this data lists the popularity of hardware across all Steam users. It will play a key role in determining which components to recommend, as it reflects what the consumer market has been using for the last 12 months, indicating which hardware manufacturers and component models are most popular today and whether that popularity has shifted over the past year.
Data Exploration
Now that we have performed our data collection and cleaning, we are prepared to conduct some data exploration!
After cleaning our data, we wanted to get an overall feel for the data we were working with. The first step we took was to use the pandas "describe()" method to get summary statistics for all our data sets. We also used the "info()" method to see how each data frame is stored, including column data types and non-null counts.
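For example, on a small illustrative frame (the rows below are stand-ins, not our actual CPU data):

```python
import pandas as pd

# Illustrative rows standing in for the cleaned CPU data frame.
cpus = pd.DataFrame({"Price": [84.99, 109.99, 299.00],
                     "Core Count": [6, 6, 8]})

summary = cpus.describe()  # count, mean, std, min, quartiles, max per numeric column
cpus.info()                # dtypes and non-null counts per column
```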
Python "describe()" function output
Python "info()" function output
From the two outputs above, we can observe relevant information to help us formulate initial models and determine whether any further data cleaning is needed. For example, in the "info()" output, we can see that the "Boost Clock" column has 210 non-null counts, while our data frame has a total of 215 entries. This means that five of our CPUs are missing Boost Clock information. Further cleaning was needed to correct this issue, and we ultimately decided to remove these rows from our dataset.
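The removal itself is a one-line pandas operation (rows shown here are illustrative):

```python
import pandas as pd

# Illustrative rows: one CPU is missing its Boost Clock value.
cpus = pd.DataFrame({
    "Name": ["CPU A", "CPU B", "CPU C"],
    "Boost Clock": [4.4, None, 5.0],
})

# Drop the rows whose Boost Clock is missing, mirroring the cleanup step.
cpus = cpus.dropna(subset=["Boost Clock"])
```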
You may notice that our "Integrated Graphics" column has only 117 non-null counts; however, this is expected, as not all CPUs have onboard graphics, so some rows will naturally be null for this variable.
After cleaning, we can re-run our "info()" function to get the result shown to our right:
Updated Python "info()" function output
We can see that there are no longer any null values for Boost Clock and our data frame is ready to be analyzed.
In addition to summary statistics, visualizations are a great way to determine what questions can be answered within your data set.
The two visualizations above are representations of CPU Pricing. The graph on the left gives us information on the average and range of prices of AMD and Intel CPUs based on the year they were released. This could be used to see if a certain manufacturer is, on average, cheaper than another as well as which manufacturer we would prefer if the CPUs provide similar performance.
The violin chart on the right lets us interpret the distribution of prices for each CPU manufacturer. We can see there are obvious outliers within our dataset that we may want to remove before conducting any further analysis. We may also find that these outliers' price-to-performance ratios are proportionate, but that will be explored in the visualizations below.
Another factor we will need to take into consideration for our CPU selection is their compatible motherboards. As we can see on the bar plot to the left, AMD-compatible motherboards are on par with Intel-compatible motherboards that take the LGA 1700 socket but are around $100 to $200 cheaper, on average, than the other two Intel socket types. This could mean even if the CPU performance is roughly the same, AMD CPUs may be the better selection as their average supported motherboard prices are much cheaper.
The visualization to the right shows us the average price per component needed to build a PC. This will be useful as we can use this to baseline what percentage of a consumer's budget we should allocate to each component. We will need to run further analysis from this estimate to determine which components most contribute to performance, however, the amount spent on average per component should give us a rough estimate of what this budget allocation should be.
Going into further detail from our histogram, the plot to our left is a box plot of price data for each component needed to build a PC. We can see each component has outliers on the upper end that we may need to remove, but further analysis will still need to be done based on these outliers' performance to know if they should be eliminated from consideration. The box plot of each component could also be used to build budgets for each tier of PC we are looking to recommend, by using each interquartile as a price index for a tier.
The two graphics above both depict price versus boost clock speed for CPUs, and we can see a clear linear relationship between the variables on both graphs. The left figure is a density plot showing that most of the data clusters around a 4.5 GHz boost speed at a price point of around $150. In the scatter plot on the right, we can see the individual points as well as the manufacturer of each CPU. When we get to our analysis section, we could take the scatter plot on the right, eliminate the outliers, fit a simple linear regression model, and use the point with the largest positive residual to find the CPU that gives us the best boost-clock-speed-to-price ratio. We can repeat this process for each component type, assuming there is a linear relationship between a performance measurement and price. Readdressing the outliers from the violin plot above, we can see that the CPUs that are outliers in price are not proportionately better performers and should be eliminated from contention for any tier of PC.
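That residual idea can be sketched in a few lines, assuming performance (boost clock) on the y-axis and price on the x-axis, so the point farthest above the fitted line offers the most performance for its price. The data below is illustrative:

```python
def best_value_index(prices, perf):
    """Fit a least-squares line perf ~ prices and return the index of the
    point with the largest positive residual (most performance above the
    trend line for its price)."""
    n = len(prices)
    mean_x = sum(prices) / n
    mean_y = sum(perf) / n
    sxx = sum((x - mean_x) ** 2 for x in prices)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(prices, perf))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (slope * x + intercept) for x, y in zip(prices, perf)]
    return max(range(n), key=residuals.__getitem__)

# Illustrative data: price in dollars vs. boost clock in GHz.
prices = [100, 200, 300, 400]
clocks = [3.0, 4.0, 4.4, 5.0]
best = best_value_index(prices, clocks)  # index 1: the $200 CPU sits farthest above the line
```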
The two graphs above are examples of visualizing the same information for other components. The graph on the left depicts Power Supply Wattage to price, but with an additional category, "Efficiency Rating." Ideally, we would find the cheapest power supply with the highest wattage and best efficiency rating.
The graph on the right shows another density plot depicting what appears to be a linear relationship between the price of a GPU and the amount of memory that card has. Memory may be an indicator of performance or a requirement for certain games to be able to run properly or at their best fidelity.
Consumer usage rates are also an important metric for deciding which components to buy, as they can indicate that one manufacturer is outperforming the others. In the graph to the left, we see that NVIDIA absolutely dominates the GPU market share, and this may be a strong indication that the GPU we recommend to consumers should be an NVIDIA card.
The graphs below are joint plots that visualize the GPU hardware requirement data mined from the top 100 most-played games on Steam combined with the price data mined from PCPartPicker. On the left is a plot that compares a game's rank (i.e., how popular it currently is, with 1 being the most popular) against the cost of purchasing the minimum GPU requirement provided by the game's publisher, colored by GPU chipset manufacturer. Each vertical stripe of dots represents individual prices for cards currently on the market from various board manufacturers (such as ASUS or EVGA) that use the same chipset (made by either NVIDIA or AMD). The chipset is representative of the card's performance level, so cards with the same chipset sold by different brands can be considered interchangeable for our purposes. On the right is a similar plot, but instead showing the prices of the publisher's recommended hardware specification versus game rank.
As you might expect, the prices of the publishers' recommended graphics cards do go higher than the minimum requirements (more performance logically costs more money). What is unexpected is that the typical price for a minimum-specification GPU and a recommended-specification GPU are very similar, both sitting right around the $400-$500 mark.
The distributions above are of course not the same, but it is difficult to see exactly how the trends differ in those plots. To take a closer look, consider the density plot comparison below, which merges the chipset brands together and simply compares the costs of minimum-specification GPUs to recommended-specification GPUs:
As shown above, the peak of both of these plots is in nearly the same place, but the additional high-price options for the recommended specification gives this dataset (shown in red) a long tail off to the right side of the plot (toward higher price). A caveat to this data is that the prices for individual GPUs are not rigidly fixed, and occasionally some cards will go on sale while others remain at higher prices. This complicates the type of conclusions we can make from these visualizations, but a clear message we see above is that it would be reasonable to expect to pay at least around $500 on a graphics card to be able to play most of the current top 100 most-played games on Steam.
Overall, from our data cleaning and exploration phase, we have noticed trends showing a linear relationship between spending more money on components and gaining performance. We have also identified a potential method for generating budgets for each component and breaking them down into tiers. In the next phase, we will build models to determine whether our observations from these visualizations hold up statistically: does spending more money on components lead to better performance in a linear way? If so, we will be able to build templated tiers of PCs based on the budgets we determined from the box plots we generated.