Logistic Regression
Overview
Linear regression is a supervised machine learning algorithm used to predict a continuous output variable from one or more input features. It fits a straight line through the data so as to minimize the discrepancy between predicted and actual values. The fitted line is represented by the equation ŷ = β0 + β1x1 + ... + βkxk, where the β values are learned during training.
Logistic regression can be applied to binary or multi-class classification. Rather than outputting a continuous value, it outputs the probability that an input belongs to a given class. It also uses a linear combination of the input features, but passes it through the sigmoid function to squash the output into the range between 0 and 1. For binary classification, the probability predictions are

P(y = 1 | x) = 1 / (1 + e^-(a + b1x1 + ... + bkxk))
P(y = 0 | x) = 1 - P(y = 1 | x)

where the first equation gives the probability of the positive class (y = 1) and the second the probability of the negative class (y = 0). Here b1, ..., bk are the feature weights, x1, ..., xk are the input values, and a is the intercept.
Both models use a linear function of the input variables, but logistic regression produces class probabilities while linear regression produces continuous predictions. Linear regression minimizes error with least squares, while logistic regression uses maximum likelihood estimation. Logistic regression is preferable when the output variable is categorical.
Logistic regression uses the sigmoid function, σ(z) = 1 / (1 + e^-z), to convert the model's linear output into a probability between 0 and 1, so that a classification can be made using a threshold. The sigmoid maps any value on the real number line to a value in (0, 1).
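As a concrete illustration of the equations above, the following is a minimal NumPy sketch; the weights, intercept, and input values are invented for illustration and are not taken from the trained model in this report.

    import numpy as np

    def sigmoid(z):
        # Maps any real number into the open interval (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    # Invented example values: weights b1..bk, inputs x1..xk, intercept a
    b = np.array([0.8, -1.1, 0.4])
    x = np.array([1.2, -0.3, 0.5])
    a = 0.2

    p_success = sigmoid(a + b @ x)   # P(y = 1 | x)
    p_failure = 1.0 - p_success      # P(y = 0 | x)
    print(p_success, p_failure)      # classify by thresholding p_success at 0.5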
To train the model, logistic regression adjusts the weight (coefficient) of each feature so that the predicted probabilities match the actual labels in the training set as closely as possible. This adjustment is done by maximizing a likelihood function, a procedure known as Maximum Likelihood Estimation (MLE). Once trained, the model uses the learned weights to predict: the input belongs to one class ("Successful") if the probability output is greater than 0.5, or to the other class ("Failure") if it is less than or equal to 0.5.
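In scikit-learn, this training-and-thresholding procedure looks roughly as follows. This is a sketch on synthetic stand-in data, not the report's actual pipeline; scikit-learn fits the weights by maximum likelihood internally.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-in data: 200 samples, 3 standardized features
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 3))
    y_train = (X_train[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

    model = LogisticRegression()   # weights are fit by maximum likelihood
    model.fit(X_train, y_train)

    # Predict class 1 ("Successful") when P(y = 1 | x) > 0.5, else class 0 ("Failure")
    proba = model.predict_proba(X_train)[:, 1]
    y_pred = (proba > 0.5).astype(int)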
Before preparation, the data set contained movie characteristics such as length, rating, votes, budget, worldwide gross box-office revenue, nominations, and Oscars, together with the categorical target variable MOVIE_CATEGORY. The target had three classes, Failure, Mediocre, and Successful, which is not suitable for binary logistic regression. The first task was therefore to filter the data set and keep only records labeled either Failure or Successful. These rows were retained to build a binary classification model, so the problem meets logistic regression's requirement of a binary target variable.
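A filtering step of this kind might look like the following pandas sketch; the file name is an assumption, and only the MOVIE_CATEGORY column name and class labels come from the report.

    import pandas as pd

    df = pd.read_csv("movies.csv")  # assumed file name

    # Drop "Mediocre" rows so the target becomes binary
    df = df[df["MOVIE_CATEGORY"].isin(["Failure", "Successful"])]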
After filtering, the data was preprocessed by selecting the relevant numeric features and scaling them with a StandardScaler. Scaling was needed because logistic regression is sensitive to the scale of its input features. The target variable was label-encoded to convert the class labels (Failure and Successful) into numeric values (0 and 1). Finally, the dataset was divided into mutually exclusive training and testing sets in an 80-20 ratio, so that model performance is evaluated on unseen data and gives a reasonable estimate of generalization ability.
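Continuing the filtering sketch above, the preprocessing might be sketched as follows; the feature column names are invented placeholders based on the characteristics listed earlier.

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler, LabelEncoder

    # Assumed column names for the numeric features described above
    features = ["length", "rating", "votes", "budget",
                "gross_worldwide", "nominations", "oscars"]
    X = df[features].values
    y = LabelEncoder().fit_transform(df["MOVIE_CATEGORY"])  # Failure -> 0, Successful -> 1

    # 80-20 split into mutually exclusive training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Fit the scaler on the training data only, then apply it to both sets
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)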
Results and Conclusion
Logistic Regression achieved 90.69% accuracy. It performed well on both classes, with strong recall and precision, reaching a precision of 0.93 and a recall of 0.72 for "Successful" movies. The confusion matrix shows fewer misclassifications than MNB, including 595 correctly predicted "Successful" movies. Logistic Regression also had the highest weighted-average F1-score (0.90) and macro-average precision (0.92). Feature importance analysis found that features such as "Gross Worldwide Revenue", "Oscars", and "Rating" were the strongest contributors to the predictions.
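The report does not spell out how the feature importance analysis was performed; one common approach for standardized inputs, shown below as a sketch continuing the code above (reusing the fitted model and the features list), is to rank features by the magnitude of their learned coefficients.

    import numpy as np

    # Coefficient magnitudes are comparable here because the inputs were standardized
    importance = np.abs(model.coef_[0])
    for name, weight in sorted(zip(features, importance), key=lambda t: -t[1]):
        print(f"{name}: {weight:.3f}")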
The Multinomial Naive Bayes (MNB) classifier achieved 82.78% accuracy and did a good job classifying "Failure" movies, with a precision of 0.81 and a very high recall of 0.99. Conversely, it performed poorly on "Successful" movies, with a recall of only 0.41, meaning a large share of successful films were classified as failures. The confusion matrix supports this, showing 485 misclassifications for the "Successful" class. The macro-average F1-score was 0.73, reflecting this imbalance in predictive strength between the classes. Overall, MNB performed reasonably but could not generalize equally well to both classes.
Although both models predicted the "Failure" class well, Logistic Regression significantly outperformed Multinomial Naive Bayes in predicting "Successful" movies. MNB's assumptions of feature independence and discretized, count-like input may have hindered its performance on continuous real-valued data such as ratings and revenue. Logistic Regression, working on standardized continuous features and using straightforward optimization with the sigmoid and likelihood functions, was better able to capture the subtleties in the data. Its precision-recall balance and fewer misclassifications suggest that it is better suited to this binary classification task.
From these findings, Logistic Regression produces a more accurate and stable model for classifying movies as "Successful" versus "Failure". The confusion matrices and feature importance plot show the strong predictive role of revenue, rating, and award features, indicating that monetary and critical performance indicators are directly related to a movie's success. The experiment also demonstrates how important correct data preparation is, especially for models that rely on count-based or categorical assumptions, such as Naive Bayes.
Supervised Learning Models Comparison
Naive Bayes Summary
The Naive Bayes models, specifically MultinomialNB and CategoricalNB, performed decently overall. However, they fared poorly under class imbalance and were unable to capture fine-grained patterns in the numerical features. CategoricalNB improved with encoded categorical data but was still less accurate than the other models. These models are quick to train and suitable as early baselines, but are limited by their assumptions.
Decision Tree Summary
Decision Tree models delivered high accuracy as well as interpretability. The best of the three tree configurations tested was an entropy-based tree that achieved 87% accuracy. These models were particularly useful for showing which features, such as rating and gross revenue, yielded unambiguously interpretable classifications. However, they lacked robustness and did not generalize consistently to the "Mediocre" class.
Logistic Regression Summary
Logistic Regression was the most balanced model in terms of accuracy, recall, and precision. It handled continuous values without discretization and expressed feature importance through its coefficients. It generalized well and treated both classes evenly; its strength on the binary classification metrics makes it the most appropriate choice for forecasting film success.