Naive Bayes
Overview
Naive Bayes is a family of probabilistic classifiers based on Bayes' Theorem. It assumes conditional independence between every pair of features given the class. Although this assumption is rarely realistic in real-world settings, the algorithm remains effective in many practical applications because of its speed and simplicity.
Multinomial Naive Bayes is a supervised method applied mainly to classification problems involving discrete count data. It performs especially well when features are counts or frequencies, such as the frequency of a word within a document or the frequency of a value within a record.
The algorithm employs Bayes' Theorem, which computes the probability of a class given a set of observed features. The assumption is that all features are conditionally independent given the class, so each feature contributes independently to the probability of the class.
During training, the Multinomial Naive Bayes algorithm learns how often each feature appears in the training examples of each class. These frequency distributions allow the model to estimate the probability of a feature given a class, which makes the method particularly well suited to data in the form of discrete counts.
For prediction, the algorithm computes the posterior probability of each class by multiplying the prior probability of the class with the probabilities of the observed features given that class, and the class with the maximum posterior probability is selected. Because the model works on count data, the input should consist of non-negative integers. If a feature never occurs in the training data of a class, smoothing is applied so that zero probabilities do not occur and every feature retains some non-zero chance of appearing in each class.
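As an illustration of this prediction rule, the following is a minimal from-scratch sketch (not the project's code) of how a Multinomial Naive Bayes model estimates smoothed per-class feature probabilities and scores classes in log space; the array contents are made up for demonstration.

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    # X: non-negative integer counts (n_samples, n_features); y: class labels
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    # Per-class feature counts with Laplace smoothing (alpha) so that no
    # feature ever receives zero probability in any class.
    counts = np.array([X[y == c].sum(axis=0) + alpha for c in classes])
    likelihoods = counts / counts.sum(axis=1, keepdims=True)
    return classes, np.log(priors), np.log(likelihoods)

def predict(X, classes, log_priors, log_likelihoods):
    # Posterior score (up to a constant) = log prior + count-weighted sum of
    # log likelihoods; the predicted class maximizes this score.
    scores = X @ log_likelihoods.T + log_priors
    return classes[np.argmax(scores, axis=1)]

# Tiny made-up example: 4 samples, 3 count features, 2 classes.
X = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])
y = np.array([0, 0, 1, 1])
print(predict(X, *fit_multinomial_nb(X, y)))
```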
Gaussian Naive Bayes is designed for numeric input attributes that are approximately normally distributed. It is especially suitable for features such as duration, ratings, and revenue values, where the data is inherently numeric. This variant models each feature's class-conditional likelihood with a Gaussian (normal) distribution and uses it to compute the probability of the data belonging to each class.
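As a small illustration, the class-conditional likelihood used by this variant is the normal density, with a mean and variance estimated per class for each feature; the sketch below uses made-up numbers.

```python
import numpy as np

# Gaussian (normal) density used as the class-conditional likelihood in
# Gaussian Naive Bayes; mean and variance would be estimated per class
# from the training data. The values below are purely illustrative.
def gaussian_pdf(x, mean, var):
    return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

# e.g. likelihood of a 120-minute DURATION under a class whose durations
# average 110 minutes with a standard deviation of 20 minutes (var = 400)
print(gaussian_pdf(120.0, mean=110.0, var=400.0))
```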
Bernoulli Naive Bayes can be applied to binary feature sets, i.e., when variables are either 0 or 1. It is used for features such as binary indicators of wins, MPA scores, or success thresholds, and provides a good way of handling yes/no feature types.
Categorical Naive Bayes is best suited to nominal categorical attributes such as genres or country of origin, which are converted with ordinal encoding before model training. Unlike the Gaussian or Multinomial variants, this model makes no assumptions about numeric structure in the input and is therefore very effective on purely categorical data.
One key ingredient of all Naive Bayes models is smoothing, particularly Laplace smoothing. Smoothing addresses the zero-probability issue, which arises when certain combinations of features and labels are absent from the training data. By adding a constant to each count, smoothing ensures that no probability estimate is ever zero, which improves model stability and generalization.
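A short worked example of the smoothed estimate, using hypothetical counts:

```python
# Hypothetical counts illustrating Laplace smoothing: a feature that never
# co-occurs with a class still receives a small non-zero probability.
count_in_class = 0      # times the feature appears with this class
total_in_class = 500    # total feature count observed for this class
n_features = 7          # number of distinct features
alpha = 1.0             # Laplace smoothing constant

prob = (count_in_class + alpha) / (total_in_class + alpha * n_features)
print(prob)  # about 0.002 rather than exactly 0
```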
Naive Bayes offers a general and computationally efficient solution for classification problems with various kinds of data. Its interpretability and flexibility make it an ideal candidate for early-stage modeling and comparative analysis, as demonstrated in this project's movie success classification task.
The original dataset contains a wide range of data types across a total of 19 columns. The values include string, list, and numeric data types, as seen below.
Data cleaning for the Naive Bayes models begins with converting the dataset into fully numeric form, free of missing or invalid values. Categorical or string variables are removed, and the target variable is label encoded for compatibility with the classification algorithms. Further preparation for each of the Naive Bayes models is then carried out on this cleaned dataset, as described below.
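A minimal sketch of this cleaning step is shown below, assuming the data is loaded with pandas; the file name and the target column name (SUCCESS) are illustrative rather than taken verbatim from the project.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the raw dataset (path and target column name are illustrative).
movies = pd.read_csv("movies.csv")

numeric_cols = ["DURATION", "RATING", "VOTES", "BUDGET",
                "GROSSWORLDWIDE", "NOMINATIONS", "OSCARS"]

# Keep rows with valid values, retain only the numeric features for the
# numeric models, and label encode the target for the classifiers.
clean = movies.dropna(subset=numeric_cols + ["SUCCESS"]).copy()
X = clean[numeric_cols]
y = LabelEncoder().fit_transform(clean["SUCCESS"])
```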
Multinomial Naive Bayes
Before data preparation for MNB, the feature values RATING, VOTES, BUDGET, GROSSWORLDWIDE, NOMINATIONS, and OSCARS were continuous. These are numeric, float-based values, which cannot be used directly by the Multinomial Naive Bayes algorithm because it expects discrete, count-based feature input, such as word occurrences in text features. Supplying continuous values directly would violate the assumptions of MNB and cause the model to perform poorly or raise errors.
To address this, the data was discretized with KBinsDiscretizer from scikit-learn. This converted the continuous numeric features into ordinal categorical bins; specifically, each feature was divided into 10 equal bins, allowing the MNB algorithm to treat the inputs as frequency-style values. Following binning, the data was split into training and test sets, and the labels were transformed into numeric form (Successful as 1, Failure as 0) to support binary classification. This transformation made the dataset fully compatible with the requirements of the MNB model.
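A sketch of this preparation, reusing the X and y arrays from the cleaning step above; the binning strategy and split size are assumptions, not confirmed project settings.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import KBinsDiscretizer

# Discretize each continuous feature into 10 ordinal bins so the values
# behave like non-negative counts, as MultinomialNB expects.
binner = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform")
X_binned = binner.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_binned, y, test_size=0.2, random_state=42)

mnb = MultinomialNB()          # default alpha=1.0 applies Laplace smoothing
mnb.fit(X_train, y_train)
print("MNB accuracy:", mnb.score(X_test, y_test))
```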
Gaussian Naive Bayes
For Gaussian Naive Bayes, data preparation was straightforward because the model natively supports continuous numeric features. The selected dataset contained purely numeric columns such as DURATION, RATING, VOTES, BUDGET, GROSSWORLDWIDE, NOMINATIONS, and OSCARS, which required no encoding or transformation and are well suited to GaussianNB, provided the values roughly follow a normal distribution. The data was split into training and testing sets as is, with no change to the original feature scale or structure.
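A corresponding sketch for GaussianNB, reusing the same X and y; the split size is an assumption.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# The continuous features are used directly, with no binning or scaling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
print("GNB accuracy:", gnb.score(X_test, y_test))
```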
Categorical Naive Bayes
The features for Categorical Naive Bayes (CNB) required a special transformation to satisfy the model's assumption that all features are discrete categorical values represented as integers. Unlike Gaussian or Multinomial Naive Bayes, CNB natively supports categorical data such as text or object-type features like a movie's rating, genre, and language. So, in addition to the base numeric features, the categorical columns MPA, GENRES, LANGUAGES, and COUNTRIES_ORIGIN were included in the CNB model.
During data preparation, all categorical features were first cast to strings for uniformity, and the whole feature set was then encoded with OrdinalEncoder, which maps each category of each feature to a distinct integer. Because CNB cannot natively handle categories in the test set that were unseen during training, an unknown_value parameter was set and a safe clipping approach was used: any out-of-range test values were clipped to the highest value observed in training, avoiding index errors at prediction time. Once encoded and clipped, the dataset met CNB's expectations, allowing it to compute class likelihoods over the categorical feature distributions correctly.
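A sketch of this encoding and clipping step, assuming the cleaned DataFrame and encoded labels from the earlier steps; the exact column selection, unknown_value placeholder, and split size are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ["MPA", "GENRES", "LANGUAGES", "COUNTRIES_ORIGIN",
            "RATING", "VOTES", "BUDGET", "NOMINATIONS", "OSCARS"]
X_cat = clean[cat_cols].astype(str)        # cast everything to strings

X_train, X_test, y_train, y_test = train_test_split(
    X_cat, y, test_size=0.2, random_state=42)

# Unseen test categories receive a large placeholder index, then all test
# values are clipped to the highest index observed per feature during
# training so CategoricalNB never sees an index it was not fitted on.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value",
                         unknown_value=1_000_000)
X_train_enc = encoder.fit_transform(X_train)
X_test_enc = np.clip(encoder.transform(X_test), 0, X_train_enc.max(axis=0))

cnb = CategoricalNB()
cnb.fit(X_train_enc, y_train)
print("CNB accuracy:", cnb.score(X_test_enc, y_test))
```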
Results and Conclusion
Multinomial Naive Bayes (MNB) recorded an accuracy of 75.33%, with high precision for class 2 (0.86) but a complete failure on class 1, which had zero recall and precision. This imbalance is evident in the confusion matrix, where class 1 is entirely mislabeled. Although it performed well on the most common class (class 0), MNB's inability to capture the characteristics of the less common classes cut its macro-average performance significantly. This indicates that while the count-based discretization helped, it was not expressive enough to handle a more varied class distribution.
Gaussian Naive Bayes (GNB) performed worst, at 71.70% accuracy. The confusion matrix shows that class 0 was classified well (2000 correct), but classes 1 and 2 were classified poorly, especially class 2, which had over 440 misclassifications. Precision for class 2 was high (0.81) while recall dropped to 0.20, indicating that the model rarely predicted class 2 and missed most of its true instances. Since GNB assumes continuous, normally distributed features, it was likely hurt by skewed, non-Gaussian feature distributions in the data, such as budget and gross revenue.
Categorical Naive Bayes (CNB) produced the highest accuracy of the three models, at 81.80%. Its confusion matrix shows strong performance on class 0, with 1896 correct predictions, and a well-balanced distribution between classes 1 and 2. Class 2 achieved 0.69 precision and 0.81 recall, showing that the model captured a large share of true positives while keeping false positives comparatively contained. This performance shows the value of including categorical features such as MPA, GENRES, and LANGUAGES, which likely carried effective non-numeric patterns that improved the model's ability to generalize across film classes.
Comparing all three models, CategoricalNB generalized notably well across all three classes, making good use of both numeric and categorical data by treating all variables as discrete categories. MultinomialNB came next but suffered from an overwhelming bias towards the majority class. GaussianNB, though easy to apply, did poorly because its distributional assumptions did not match the data in this dataset. These differences illustrate why selecting a model suited to the data type matters.
In general, the Naive Bayes models revealed strong patterns in the box office performance of movies in the project database. The top performer, Categorical Naive Bayes, highlighted the importance of categorical predictors such as MPA, GENRES, and LANGUAGES in forecasting box office returns and audience reception, suggesting that genres and content ratings have a stable correlation with successful box office returns and a strong audience response. The weaker performance of the Multinomial and Gaussian models, which relied only on numeric predictors such as BUDGET, VOTES, and RATING, showed that numeric values alone cannot capture the full nuance of film outcomes. This supports the argument that success in the film industry is not only a function of financial investment or viewer engagement but also of the thematic and regulatory characteristics of a film. The results make clear that using diverse types of data, particularly well-encoded categorical data, is critical for effectively predicting classes of movie success, and can provide strategic insight to producers, marketers, and analysts working in the cinema sector.