The Naïve Bayes (NB) algorithm is a probabilistic classifier that applies Bayes' theorem under the assumption of conditional independence between features. NB is commonly used for text classification, spam detection, sentiment analysis, and recommendation systems because it is fast and effective even with relatively little data. The "naïve" assumption, namely that within a class the presence of any particular feature is independent of every other feature, greatly simplifies computation and often yields robust performance.
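To make the independence assumption concrete, here is the standard scoring rule that every Naïve Bayes variant shares (a textbook formulation, not anything specific to this analysis):

```latex
% Bayes' theorem with the "naive" conditional-independence assumption:
% the posterior for class y given features x_1, ..., x_n factorizes into
% the class prior times per-feature likelihoods.
P(y \mid x_1, \ldots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The predicted class is simply the y that maximizes this product; the variants below differ only in how they model each per-feature likelihood P(x_i | y).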
There are three commonly used variants of the Naïve Bayes algorithm.
Multinomial Naïve Bayes (MN NB): Suitable for discrete data like word counts in text classification. MN NB models the occurrence of words in documents, making it effective for tasks like spam filtering and document classification.
Gaussian Naïve Bayes (GNB): Ideal for continuous data that follows a Gaussian (normal) distribution, commonly used for classification of real-valued attributes.
Bernoulli Naïve Bayes: Used for binary/boolean features, where features represent presence or absence (0 or 1). This model is effective for tasks like binary text classification.
Each model is chosen based on the nature of the data. For instance, text data with word frequencies fits well with Multinomial NB, continuous numerical data is suitable for Gaussian NB, and binary feature datasets benefit from Bernoulli NB.
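All three variants are available in scikit-learn's sklearn.naive_bayes module. A minimal sketch of instantiating them (the dictionary is just illustrative scaffolding for the comparison that follows):

```python
# The three variants as implemented in scikit-learn (sklearn.naive_bayes).
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB

models = {
    "Multinomial NB": MultinomialNB(),  # discrete counts, e.g. word frequencies
    "Gaussian NB": GaussianNB(),        # continuous, roughly normal features
    "Bernoulli NB": BernoulliNB(),      # binary presence/absence features
}
```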
From the cleaned data, I used a dataset containing patent-related information, including citation counts and other patent attributes. Every record is labeled, which makes the dataset a good fit for supervised learning. As explained above, this data is run through all three types of Naïve Bayes algorithms; because each algorithm requires a different type of input, we picked suitable columns for each one, as sketched after the list below.
Multinomial Naïve Bayes: This model is suited to word-frequency data, so we focus on the "patent_abstract" or "patent_title" column, transforming the text into a term-frequency matrix.
Gaussian Naïve Bayes: This model works well with continuous data. We can use numeric columns like "patent_num_claims", "detail_desc_length", and "patent_processing_time".
Bernoulli Naïve Bayes: For binary data, we might create binary features, such as whether or not the number of citations is above a certain threshold.
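Below is a sketch of how those columns could be prepared for each model, assuming the cleaned data lives in a pandas DataFrame named df. The df variable, the "citation_count" column name, and the median cutoff are illustrative assumptions; the other column names come from the dataset described above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# df: the cleaned patent DataFrame from the preprocessing step (assumed).

# Multinomial NB: turn the abstract text into a term-frequency matrix.
vectorizer = CountVectorizer()
X_multinomial = vectorizer.fit_transform(df["patent_abstract"].fillna(""))

# Gaussian NB: keep the continuous numeric columns as-is.
numeric_cols = ["patent_num_claims", "detail_desc_length", "patent_processing_time"]
X_gaussian = df[numeric_cols].fillna(0)

# Bernoulli NB: binarize, e.g. flag patents whose citation count exceeds a
# threshold ("citation_count" and the median cutoff are illustrative
# stand-ins for the actual column and threshold used).
X_bernoulli = (df["citation_count"] > df["citation_count"].median()).astype(int).to_frame()
```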
As usual, we split the data 80%/20% into training and test sets.
Now, rearrange the data frame to fit each of the Naïve Bayes algorithms.
Then, set the target variable. Several columns in this dataset were candidate targets, such as "patent_type" and "patent_kind". However, some of these columns have extremely skewed distributions, on the order of 99.9% to 0.1%, so a model that always predicts the majority class would score an accuracy of nearly 1 without learning anything. "patent_kind" avoids this problem and was the best choice of target for training and testing.
Then split the data into training and test sets and train each model.
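A minimal sketch of this split-and-train loop, reusing the feature matrices and the encoded target y from the sketches above (the 80/20 split and the fixed random_state are illustrative; the printed numbers depend on the actual data):

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix

for name, model, X in [
    ("Multinomial Naïve Bayes", MultinomialNB(), X_multinomial),
    ("Gaussian Naïve Bayes", GaussianNB(), X_gaussian),
    ("Bernoulli Naïve Bayes", BernoulliNB(), X_bernoulli),
]:
    # 80/20 split; a fixed random_state keeps the split reproducible.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name} accuracy: {accuracy_score(y_test, y_pred):.2f}")
    print(confusion_matrix(y_test, y_pred))
```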
Multinomial Naïve Bayes accuracy: 0.68
Gaussian Naïve Bayes accuracy: 0.79
Bernoulli Naïve Bayes accuracy: 0.81
Multinomial Naïve Bayes reaches 68% accuracy. The confusion matrix shows that the model misclassified a significant number of instances, with 265 false negatives and 175 false positives. Although Multinomial NB is well suited to text data, it may struggle here because the features were not word frequencies or similar discrete counts.
Gaussian Naïve Bayes reaches 79% accuracy. This model, which assumes that continuous features follow a normal distribution, shows improved performance with fewer total misclassifications than Multinomial NB: 7 false negatives and 275 false positives. Gaussian NB performs better when features are genuinely continuous, which appears to align with this dataset.
Bernoulli Naïve Bayes reaches 81% accuracy, the highest of the three models. Designed for binary/boolean features, it produces 36 false negatives and 231 false positives, the lowest total number of misclassifications across the models. This suggests that the dataset's features work well as binary indicators, which Bernoulli NB handles effectively.
Each model produced a different accuracy, with Bernoulli Naïve Bayes achieving the highest performance, likely due to the binary structure of the patent_kind target and the binary input features derived from citation counts. Gaussian Naïve Bayes also performed well by taking advantage of the continuous features, while Multinomial Naïve Bayes had the lowest accuracy, as it is generally better suited to text-based frequency data.
From this analysis, Naïve Bayes models appear to be viable options for predicting patent kinds. Bernoulli and Gaussian Naïve Bayes provided particularly strong accuracy, demonstrating the importance of matching the model type to the structure of the data. This highlights that both binary indicator features and continuous feature distributions are effective for distinguishing between patent kinds such as B1 and B2.