Data exploration is not merely a preliminary step in the analysis process; it is a fundamental pillar that lays the groundwork for uncovering hidden insights, revealing intricate patterns, and extracting valuable knowledge from raw datasets. At its core, data exploration is about delving into the data and unlocking its potential. It empowers us to ask the right questions, surface unexpected correlations, and gain a deeper understanding of the underlying phenomena captured within the data.
Through meticulous examination, data exploration sheds light on the interplay between variables, exposing patterns, trends, and outliers that might otherwise go unnoticed. It fosters a deeper appreciation for the richness and complexity of data.
Let's begin our exploration into the world of data analysis and trace the journey we undertook to accomplish this task.
In today's data landscape, web scraping is an essential technology that allows individuals and businesses to access, analyze, and use massive volumes of data from the internet. In a world of copious information that is frequently dispersed across multiple websites and platforms, web scraping acts as a bridge, enabling us to combine and synthesize diverse data sources into useful insights. Its significance comes from its capacity to automate data collection, saving substantial time and money while granting users access to current, real-time data that can stimulate innovation and support strategic decision-making.
In our project, the data was gathered from https://steamspy.com/year/2021. We used Beautiful Soup to scrape the data from this URL.
Since each game had its own dedicated page containing all the required information, such as release date, genre, publishers, and developers, we had to visit a separate URL for each game to collect its details.
We did this by building a list containing the URL of every game mentioned on the website, which we then iterated over to scrape the details of each game.
Using Beautiful Soup's parsing functions, we were able to obtain all the required values and details for every game listed on the website.
Once this was achieved, we created a CSV file which was then taken up for cleaning and pre-processing to make it useful for data visualisation.
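To make the pipeline concrete, below is a minimal sketch of the scraping flow. The CSS selector, the placeholder field lookup, and the output file name are assumptions for illustration; the actual SteamSpy markup and the fields we extracted (release date, genre, developers/publishers, price, owners, playtime) each required their own lookups.

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "https://steamspy.com"

# Fetch the 2021 listing page and collect the per-game URLs.
listing = requests.get(f"{BASE_URL}/year/2021")
soup = BeautifulSoup(listing.text, "html.parser")
game_urls = [BASE_URL + a["href"] for a in soup.select("a[href^='/app/']")]  # link pattern assumed

records = []
for url in game_urls:
    page = BeautifulSoup(requests.get(url).text, "html.parser")
    # Placeholder extraction: the real pages need one lookup per field.
    title_tag = page.find("h3")
    records.append({
        "Name": title_tag.get_text(strip=True) if title_tag else None,
        "url": url,
    })

# Persist the scraped records for the cleaning stage.
pd.DataFrame(records).to_csv("steam_games_2021.csv", index=False)
```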
We now proceed to the crucial stage of data cleaning—a vital step that forms the base of our analytical work. Regarded as the essence of data analysis, data cleaning is imperative for converting our raw, disorganized dataset into a refined, accurate, and operational asset. In the ever-evolving gaming sector, where trends and preferences swiftly change, it's crucial to uphold the data's integrity to derive significant insights.
Our assembled gaming dataset, brimming with potential revelations about player habits, market trends, and the gaming environment of 2021, faces its own set of challenges with data irregularities, absences, and repetitions. Neglecting these issues could distort our analysis and lead to unreliable findings. Therefore, data cleaning is not just a routine task but a detailed endeavor to guarantee that our dataset genuinely represents the lively and fluid gaming world.
Within this section, we explore the detailed methods used to purify our dataset, ensuring the precise representation of each game title, launch date, and player engagement. Addressing hidden missing values, rectifying inconsistencies, and eliminating duplicates are all critical steps aimed at enhancing our dataset. This meticulous procedure is essential for establishing a strong groundwork for our future investigations and analyses, where the depth of our insights is intrinsically linked to the dataset's quality.
Dataset before cleaning:
The dataset obtained from the website after web scraping exhibits various inconsistencies and inaccuracies, as highlighted in the figure provided. One noticeable issue is the presence of an extraneous column labeled "Unnamed: 0," which merely contains indices for each record in the dataset and does not contribute any meaningful information. Furthermore, columns such as "Name," "developers_publishers," and "genre" contain entries formatted as strings with unnecessary elements like square brackets and single quotes, which can impede data analysis.
Additionally, the "release_date" column consists of entries in string format, hindering the dataset's usability and analytical capabilities. Converting these entries into datetime format would enhance the dataset's reliability and facilitate more comprehensive data analysis and interpretation. Similarly, entries in the "price" column are currently stored as strings, and converting them to a float data type would improve computational efficiency.
One notable issue is the substantial number of missing values present in the "playtime_total" column, totaling over 8000 entries. This prevalence of missing data can significantly impact the analysis of the dataset, potentially leading to erroneous interpretations. Therefore, it is crucial to address these missing values meticulously to ensure the accuracy and reliability of the dataset for subsequent analysis.
Let's break down each step and explain how the dataset is cleaned:
Dropping Columns with Null Values:
The first step involves dropping the column "playtime_total" since it contains a lot of null values. This is done using the `drop()` function with the `axis=1` parameter to specify column-wise dropping.
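A minimal sketch of this step, assuming the scraped CSV has been loaded into a DataFrame named `df` (the file name is illustrative):

```python
import pandas as pd

df = pd.read_csv("steam_games_2021.csv")  # file name assumed

# Drop the sparsely populated playtime column (over 8000 nulls).
df = df.drop("playtime_total", axis=1)
```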
Dropping Rows with Null Values and Adjusting Indexes:
Next, rows with null values in any column are dropped using the `dropna()` function, and the indexes are reset to ensure continuous indexing; `reset_index(drop=True)` renumbers the rows while discarding the old index.
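Continuing with the same `df`, this step could look like:

```python
# Drop any row that still contains a null, then renumber the index from 0.
df = df.dropna()
df = df.reset_index(drop=True)
```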
Dropping Unnamed Column:
The column labeled "Unnamed: 0", a leftover row index from the scraping stage, is dropped using the `drop()` function, with either the `columns` parameter naming the column or `axis=1` indicating column-wise dropping.
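For example (equivalent to `df.drop("Unnamed: 0", axis=1)`):

```python
# Remove the leftover index column written out during scraping.
df = df.drop(columns="Unnamed: 0")
```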
Cleaning String Entries:
Entries in the "Name," "developers_publishers," and "genre" columns undergo string cleaning to remove square brackets, double quotes, and single quotes. This is achieved using the `str.replace()` function to replace specific characters with an empty string or to manipulate strings using NumPy's `np.where()` function.
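A compact way to express this, using plain `str.replace()` calls in a loop rather than `np.where()` (assuming the artifacts are literal characters, not patterns):

```python
# Strip list-literal artifacts ([ ] ' ") left behind by scraping.
for col in ["Name", "developers_publishers", "genre"]:
    for ch in ["[", "]", "'", '"']:
        df[col] = df[col].str.replace(ch, "", regex=False)
```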
Splitting and Rearranging Developers and Publishers:
Entries in the "developers_publishers" column are split into separate lists for developers and publishers using a loop. The split text is then assigned to separate lists (`dev_list` and `pub_list`) and subsequently added as new columns ("Developers" and "Publishers") to the dataset. The original "developers_publishers" column is dropped using the `drop()` function.
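A sketch of the split, assuming developers and publishers are separated by a comma in the scraped text (the real delimiter may differ):

```python
dev_list, pub_list = [], []
for entry in df["developers_publishers"]:
    parts = [p.strip() for p in entry.split(",")]
    dev_list.append(parts[0])
    # If only one name is present, treat it as both developer and publisher.
    pub_list.append(parts[1] if len(parts) > 1 else parts[0])

df["Developers"] = dev_list
df["Publishers"] = pub_list
df = df.drop("developers_publishers", axis=1)
```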
Converting Date Entries to Datetime Format:
Entries in the "Release_Date" column are converted from string format to datetime format using the `pd.to_datetime()` function. The `format` parameter specifies the format of the date string, and `errors='coerce'` is used to handle any conversion errors gracefully by setting invalid parsing to NaT (Not a Time).
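For instance, with the `format` string as an assumption about how SteamSpy renders dates:

```python
# Unparseable date strings become NaT instead of raising an error.
df["Release_Date"] = pd.to_datetime(
    df["Release_Date"], format="%b %d, %Y", errors="coerce"
)
```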
Converting Price Entries to Float Data Type:
Entries in the "Price (in $)" column are converted from string format to float data type by removing the dollar sign ('$') using string manipulation and then converting the values to float using `astype(float)`.
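A sketch, assuming every entry follows the '$X.XX' pattern:

```python
# Strip the leading dollar sign, then cast the column to float.
df["Price (in $)"] = (
    df["Price (in $)"].str.replace("$", "", regex=False).astype(float)
)
```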
Handling Owners Column:
Entries in the "Owners" column are cleaned by replacing the separator '\xa0..\xa0' (where '\xa0' is a non-breaking space) with a hyphen ('-'). This is achieved using a list comprehension and string manipulation.
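A minimal version of that comprehension (the example range in the comment is illustrative):

```python
# '\xa0' is a non-breaking space, so e.g. '1,000,000\xa0..\xa02,000,000'
# becomes '1,000,000-2,000,000'.
df["Owners"] = [owner.replace("\xa0..\xa0", "-") for owner in df["Owners"]]
```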
Overall, these cleaning steps address various inconsistencies and inaccuracies in the dataset, ensuring its usability, reliability, and suitability for further analysis. Each step targets specific data quality issues and employs appropriate techniques to clean and transform the dataset effectively. Below is a snapshot of the cleaned dataset.
Dataset after cleaning:
From these visualizations, we can infer trends among the best-selling games.
From Image 1, we can see how the games are distributed across price ranges in the pie chart.
From Image 2, almost 30% of the games are Indie titles.
From Image 4, we can identify the most popular game.
From Image 5, September is the month in which the most games were released.
From Image 6, most developers/publishers released only one game in the year.
From Image 8, the well-known developers/publishers still have the most followers.