When analyzing datasets for predictive modeling, selecting appropriate algorithms is crucial for deriving meaningful insights and achieving accurate predictions. Over the semester we studied a wide range of methods, including frequent pattern mining, classification, k-nearest neighbors (KNN), decision trees and tree pruning, association rule mining, regression, clustering, support vector machines (SVM), Naive Bayes, the perceptron, gradient descent, logistic regression, multi-class classification, and neural networks. The models we chose for this dataset cover a broad spectrum of these techniques, each with distinct characteristics and strengths suited to different types of data and predictive tasks. Here is a brief introduction to why these models were considered for the analysis.
Preceding the implementation of models, thorough preparations were undertaken to ensure the dataset was primed for analysis and forecasting changes in video game ownership accurately. The initial step involved eliminating any incomplete observations using the `dropna()` function, ensuring that only complete data points were retained for modeling purposes. Columns with high cardinality, such as "Name," "Developers," "Publishers," and "Category," were subsequently removed to streamline the dataset and focus on more pertinent features. To enhance data consistency, the "Followers" column underwent cleaning, where commas were removed, and the data type was converted to integer (int64).
Further refinement involved extracting the month component from the "Release_Date" column, facilitating the analysis of release trends over various months by creating a new column named "Month." Categorical variables, particularly the "Genre" column, underwent label encoding to convert them into numerical representations, facilitating their integration into machine learning models. Finally, to provide clarity, the "Ownership_Midpoint" column was renamed to "Num_Owners," offering a more descriptive term for the target variable. These preparatory steps set a solid foundation for subsequent modeling tasks, enabling precise forecasts and insightful analysis of video game ownership dynamics.
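To make these steps concrete, here is a minimal sketch of the preparation pipeline in pandas. The CSV file name and the exact string formats of the "Followers" and "Release_Date" columns are assumptions; the column names follow the description above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the raw dataset (file name is hypothetical).
df = pd.read_csv("games.csv")

# Drop incomplete observations and high-cardinality columns.
df = df.dropna()
df = df.drop(columns=["Name", "Developers", "Publishers", "Category"])

# Clean the Followers column: remove commas and cast to int64.
df["Followers"] = df["Followers"].str.replace(",", "", regex=False).astype("int64")

# Extract the release month into a new "Month" column.
df["Month"] = pd.to_datetime(df["Release_Date"]).dt.month

# Label-encode the Genre column into numeric codes.
df["Genre"] = LabelEncoder().fit_transform(df["Genre"])

# Rename the target column for clarity.
df = df.rename(columns={"Ownership_Midpoint": "Num_Owners"})
```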
The following models were selected for our dataset.
Linear Regression
Simplicity and Interpretability: Linear Regression is one of the simplest regression models and is easy to implement, interpret, and understand. If the relationship between the independent and dependent variables is believed to be linear, this model provides clear insight into how significant the predictors are.
Speed: It is computationally inexpensive and thus suitable for situations with a large number of observations.
KNN
Flexibility: KNN makes no assumptions about the underlying data distribution, which is a big advantage with real-world data that may not follow theoretical assumptions.
Ease of Implementation: It is straightforward to implement and understand.
Versatility: It can be used for both regression and classification tasks.
SVM
Effectiveness in High Dimensional Spaces: SVM is effective in cases where the number of dimensions exceeds the number of samples, which makes it highly suitable for image recognition and similar tasks.
Memory Efficiency: It uses a subset of training points (support vectors), which makes it memory efficient.
Versatility: The kernel trick is a powerful feature of SVM, allowing it to handle non-linear decision boundaries effectively.
Neural Networks
Handling Non-linearity: Neural networks excel at identifying complex patterns and relationships between variables, making them suitable for tasks like speech recognition, image recognition, and natural language processing.
Flexibility and Scalability: They can be scaled with more data and can adapt to complex model architectures to improve their performance continuously.
Random Forest
Handling Overfitting: Random Forest helps in overcoming overfitting by averaging multiple deep decision trees, trained on different parts of the same training set.
Variable Importance: It provides useful insights regarding which variables are important for prediction.
Good Performance on Many Problems: It is robust against outliers and non-linear data, and usually performs well on a wide range of problems without hyper-parameter tuning.
Our main goal is to see how many games are sold and how the other components affect the sales rate. The target variable (y) is therefore the ownership column ('Num_Owners'), and all other columns are the features (X). Using the Scikit-Learn library in Python to build the train and test sets, the training set (X_train, y_train) contains eighty percent of the data, whereas the testing set (X_test, y_test) contains the remaining twenty percent. This ensures that each model's performance can be evaluated on unseen data. Following this step, the models were fitted to the prepared data and used accordingly.
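A short sketch of the split, assuming the prepared DataFrame `df` from the preparation step and the conventional 80/20 ratio (the `random_state` value is an arbitrary choice for reproducibility):

```python
from sklearn.model_selection import train_test_split

# Target is the number of owners; all remaining columns are features.
X = df.drop(columns=["Num_Owners"])
y = df["Num_Owners"]

# 80% training data, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```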
After setting up the train and test sets, the data is scaled to fit each analysis model: 'MinMaxScaler' for Linear Regression and the Neural Network, and 'StandardScaler' for KNN and SVR. Now we are ready to initialize our models.
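A sketch of the two scaling strategies; the suffixed variable names (`X_train_mm`, `X_train_std`) are illustrative and are reused in the model sketches below:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# MinMax scaling (0-1 range) for Linear Regression and the neural network.
mm_scaler = MinMaxScaler()
X_train_mm = mm_scaler.fit_transform(X_train)
X_test_mm = mm_scaler.transform(X_test)

# Standardization (zero mean, unit variance) for KNN and SVR.
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)
```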
Model Initialization and Training:
A linear regression model is initialized using the `LinearRegression()` class from scikit-learn. The model is trained on the training data (X_train, y_train) using the `fit()` method, which adjusts the model parameters to minimize the residual sum of squares between the observed and predicted values.
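A minimal sketch, assuming the MinMax-scaled features from the scaling step:

```python
from sklearn.linear_model import LinearRegression

# Fit ordinary least squares on the MinMax-scaled training features.
lin_reg = LinearRegression()
lin_reg.fit(X_train_mm, y_train)

# Predict ownership for the held-out test set.
y_pred_lr = lin_reg.predict(X_test_mm)
```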
Similar to the preceding models, the dataset undergoes preprocessing to facilitate feature selection and standardization before training the K-nearest neighbors (KNN) regressor model. The dataset is partitioned into feature matrix (X) and target variable (y), following the convention established in previous modeling steps. Leveraging scikit-learn's `StandardScaler()`, the features are standardized to ensure a mean of 0 and a variance of 1 for each feature. This preprocessing step is critical for KNN models, given their reliance on calculating distances between data points. By scaling the features, it ensures that all features contribute equally to the distance calculation, thereby preventing biases in the model's learning process.
For model initialization and training, the `KNeighborsRegressor()` class from scikit-learn is employed to instantiate a K-nearest neighbors regressor model with a specified number of neighbors, set to 5 in this instance. Subsequently, the model is trained on the training dataset (X_train, y_train) using the `fit()` method. This training phase enables the model to learn from the training data and adapt to the underlying patterns present in the dataset. The choice of 5 neighbors provides flexibility in capturing these patterns while ensuring that the training data remains accessible for subsequent prediction tasks. Overall, the preprocessing and training stages ensure that the KNN regressor model is equipped to effectively learn from the dataset and make accurate predictions.
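A minimal sketch, assuming the standardized features from the scaling step and the 5-neighbor configuration described above:

```python
from sklearn.neighbors import KNeighborsRegressor

# K-nearest neighbors regressor with 5 neighbors, trained on standardized features.
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train_std, y_train)

# Predictions average the targets of the 5 closest training points.
y_pred_knn = knn.predict(X_test_std)
```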
Model Initialization and Training:
An SVR model is initialized with default hyperparameters using the `SVR()` class from scikit-learn. The model is then trained on the training data (X_train, y_train) using the `fit()` method, which adjusts the model parameters to minimize the error between the observed and predicted values.
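A minimal sketch, again assuming the standardized features; scikit-learn's defaults correspond to an RBF kernel with C=1.0 and epsilon=0.1:

```python
from sklearn.svm import SVR

# Support vector regressor with default hyperparameters (RBF kernel, C=1.0, epsilon=0.1).
svr = SVR()
svr.fit(X_train_std, y_train)

y_pred_svr = svr.predict(X_test_std)
```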
The model architecture is built using the Sequential API from Keras to construct an artificial neural network (ANN). Comprising multiple layers, including dense (fully connected) layers and dropout layers for regularization, the architecture is structured with three dense layers: the initial layer containing 64 neurons with ReLU activation, followed by a layer of 32 neurons also with ReLU activation, and concluding with an output layer featuring a single neuron and linear activation, ideal for regression tasks.
Following the construction of the model, compilation is performed using the `compile()` method. Here, the loss function is specified as MeanSquaredLogarithmicError (MSLE) from TensorFlow, while the optimizer is set to 'adam', a widely used and efficient optimization algorithm in neural network training. Additionally, MSLE serves as a metric for monitoring model performance throughout the training process.
To prevent overfitting and enhance generalization performance, early stopping is implemented through the `EarlyStopping()` callback from Keras. This mechanism monitors the validation loss during training, halting the process if the validation loss ceases to decrease for a specified number of epochs (patience). This strategic implementation of early stopping ensures efficient training and improves the model's capacity to generalize to unseen data, contributing to enhanced overall performance and efficiency.
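Putting the three pieces together (architecture, compilation, and early stopping), a sketch of the network is shown below. The dropout rate, patience, epoch count, and batch size are assumptions, since the text does not specify them; the layer sizes, activations, loss, and optimizer follow the description above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.losses import MeanSquaredLogarithmicError

# Three dense layers with dropout for regularization; linear output for regression.
model = Sequential([
    Input(shape=(X_train_mm.shape[1],)),
    Dense(64, activation="relu"),
    Dropout(0.2),                      # dropout rate is an assumption
    Dense(32, activation="relu"),
    Dropout(0.2),
    Dense(1, activation="linear"),
])

# MSLE as the loss and as a monitoring metric, optimized with Adam.
model.compile(optimizer="adam", loss=MeanSquaredLogarithmicError(), metrics=["msle"])

# Stop training when the validation loss stops improving (patience is an assumption).
early_stop = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)

history = model.fit(
    X_train_mm, y_train,
    validation_split=0.2,
    epochs=200,
    batch_size=32,
    callbacks=[early_stop],
    verbose=0,
)
```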
A pipeline is established to integrate both feature scaling and the random forest regressor model, streamlining the preprocessing and modeling steps into a cohesive workflow. Comprising two primary stages, the pipeline initially scales the features to ensure uniformity in their magnitude, followed by fitting the random forest regressor model. By encapsulating these steps within a pipeline, the process of feature scaling and model fitting is automated, enhancing efficiency and reproducibility in model deployment.
To facilitate the identification of the most optimal model configuration, a parameter grid is defined to explore various hyperparameters associated with the random forest regressor. This grid encompasses a range of parameters, including the number of estimators, maximum tree depth, minimum samples required to split a node, minimum samples required for a leaf node, and the maximum number of features considered for splitting. By systematically evaluating combinations of these hyperparameters, the grid search aims to pinpoint the configuration that minimizes the chosen evaluation metric, negative mean squared error.
Hyperparameter tuning is executed using GridSearchCV, a cross-validation technique that exhaustively explores the parameter grid to identify the configuration yielding the best model performance. Leveraging the GridSearchCV function from scikit-learn, the process iteratively evaluates each parameter combination using cross-validation and selects the configuration associated with the lowest negative mean squared error. This iterative search ensures that the model's hyperparameters are fine-tuned to optimize predictive performance, enhancing the model's ability to generalize to unseen data.
Upon completion of the hyperparameter tuning process, the best-performing model configuration is selected based on the output of the grid search. The best model, representing the random forest regressor with the optimal hyperparameters, is extracted using the best_estimator_ attribute of the grid search object. This model configuration is identified as the most effective in minimizing prediction error and maximizing model accuracy, culminating in the selection of an optimized random forest regressor model for subsequent deployment in predictive tasks.
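A sketch of the pipeline and grid search described above. The scaler choice, the specific grid values, and the 5-fold cross-validation setting are assumptions, but the tuned parameter names match those listed in the text.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline: feature scaling followed by the random forest regressor.
rf_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestRegressor(random_state=42)),
])

# Hyperparameter grid (illustrative values, not the original grid).
param_grid = {
    "rf__n_estimators": [100, 200, 300],
    "rf__max_depth": [None, 10, 20],
    "rf__min_samples_split": [2, 5],
    "rf__min_samples_leaf": [1, 2],
    "rf__max_features": ["sqrt", 1.0],
}

# Exhaustive cross-validated search scored by negative mean squared error.
grid_search = GridSearchCV(
    rf_pipeline,
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)

# Best pipeline (scaler + tuned random forest) found by the search.
best_rf = grid_search.best_estimator_
```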
We used MSLE instead of MSE to reduce the impact of outliers. There are several additional reasons why MSLE is a better fit than MSE for this dataset.
Relative Error:
MSLE penalizes relative differences rather than absolute differences. This means that underestimating or overestimating the target value by a significant percentage leads to more significant penalties, regardless of the scale of the target value. This is beneficial when it's more important to get the magnitude right rather than the exact value, which is often the case in growth predictions, stock price movements, and any scenario where the proportionate error is more critical than the absolute error.
Modeling Percent Changes and Multiplicative Factors:
It is suitable for modeling predictions of quantities that are supposed to be non-negative and can vary over several orders of magnitude. It ensures that a prediction of 1,000 instead of a true value of 1,010 has a smaller penalty than a prediction of 10 instead of a true value of 20, even though the absolute error is the same.
Stability in Predictions:
By transforming the target values into a log scale, MSLE can stabilize the variance of the data, which often helps in improving model performance, especially when data spans several orders of magnitude.
Ensuring Non-negative Predictions:
Since MSLE deals with the logarithm of the values, it inherently guarantees that predictions are non-negative. This is a crucial constraint for many real-world problems where negative predictions are not meaningful or possible.
\[ \text{MSLE} = \frac{1}{n} \sum_{i=1}^{n} \bigl( \log(1 + y_{\text{true},i}) - \log(1 + y_{\text{pred},i}) \bigr)^2 \]
(n = number of samples, y_true,i = true value of the target variable for sample i, y_pred,i = predicted value of the target variable for sample i)
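As a quick numerical check of the relative-error behaviour described above, scikit-learn's `mean_squared_log_error` can be applied to the 1,000/1,010 and 10/20 example values:

```python
from sklearn.metrics import mean_squared_log_error

# Both predictions are off by 10 in absolute terms, but MSLE penalizes
# the proportionally larger miss far more heavily.
print(mean_squared_log_error([1010.0], [1000.0]))  # ~0.0001 (small relative miss)
print(mean_squared_log_error([20.0], [10.0]))      # ~0.42   (large relative miss)
```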
Results
Analyzing the Mean Squared Logarithmic Error (MSLE) results from the various models applied to our dataset, we can draw several conclusions about their performance and their implications for the data analysis.
The results of our models were:
Linear Regression: 2.5390. The poor performance suggests the features have nonlinear relationships with the target.
KNN: 0.7506. Indicates some local similarity between data points.
SVR: 1.0863
Neural Networks: 0.4787 (lowest MSLE). The dataset is complex, with likely nonlinear relationships.
Random Forest: 0.6684. Also points to a complex dataset with nonlinear relationships.
Across all the results, the Neural Network achieved the lowest MSLE. The Neural Network and Random Forest models both achieved lower MSLE than KNN, SVR, and Linear Regression, which suggests that this dataset is complex and has non-linear relationships.