In the endeavor to analyze sentiment across online textual sources, in this case news articles, Reddit discussions, and Medium posts, the Naïve Bayes (NB) methodology serves as a foundational tool within the realm of text classification. This probabilistic approach applies Bayes' theorem under the assumption that each feature in the dataset is independent of the others, given the outcome variable's state. The model has gained traction for its efficacy in text classification tasks ranging from spam identification to sentiment analysis and document categorization.
The project at hand leverages the NB model to sift through text data extracted from NewsAPI, Reddit, and Medium, aiming to categorize documents relating to blockchain conversations by sentiment (positive, neutral, or negative). The goal is to decode the overarching sentiment conveyed across these platforms, providing insight into the collective sentiment toward blockchain online. By systematically classifying sentiment across these media channels, the model aspires to construct a comprehensive landscape of sentiment distribution, serving as a measure of public opinion.
Sample CountVectorizer Data
Sample TfidfVectorizer Data
Within the framework of supervised learning methodologies such as the Naïve Bayes (NB) model, labeled data is a non-negotiable prerequisite. This necessity forks the dataset into two distinct sets: a training set, serving the role of model education, and a testing set, designated for the critical evaluation of model efficacy. These sets must remain mutually exclusive to circumvent the pitfall of the model overfitting to the training data, which would compromise its proficiency in adapting to novel, unseen inputs.
Central to the dataset in question is text data derived from diverse origins: news articles, Reddit threads, and Medium posts, each annotated with sentiment labels (positive, neutral, negative) through the application of TextBlob's sentiment analysis.
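As a minimal sketch of this labeling step, assuming TextBlob's usual polarity score in [-1, 1] and a zero threshold (the project's exact cutoffs are not specified here), the bucketing rule might look like:

```python
# Assumption: TextBlob returns a polarity score in [-1, 1], e.g.
#   polarity = TextBlob(text).sentiment.polarity
# This helper buckets that score into the three sentiment classes.

def polarity_to_label(polarity: float, eps: float = 0.0) -> str:
    """Map a polarity score to 'positive', 'neutral', or 'negative'."""
    if polarity > eps:
        return "positive"
    if polarity < -eps:
        return "negative"
    return "neutral"

print(polarity_to_label(0.8))   # positive
print(polarity_to_label(0.0))   # neutral
print(polarity_to_label(-0.3))  # negative
```

The `eps` dead zone is an illustrative knob: widening it would classify weakly polarized text as neutral rather than forcing a positive/negative call.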
Initially, the data corpus was segregated into two principal categories: 'final_countvec_lem' and 'final_tfidf_lem.' This division was informed by the utilization of CountVectorizer and TfidfVectorizer, respectively. These vectorization mechanisms transform the textual content into a numerical format, rendering it compatible with the data requirements of the NB model.
Next, each category of data underwent a further division into training and testing sets. This partition abided by a predetermined ratio, allocating 70% of the data to the training set and the remaining 30% to the testing set. This distribution ensures the model's performance is gauged not only on familiar data but also on its capability to navigate uncharted data territories.
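The vectorize, split, and train pipeline described above can be sketched as follows; the corpus and labels here are placeholders, and the fixed `random_state` is an assumption added for reproducibility:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Placeholder corpus; the project's real labels come from TextBlob
# scoring of NewsAPI, Reddit, and Medium text.
texts = ["great blockchain news", "blockchain scam warning",
         "blockchain overview article", "crypto prices fall",
         "exciting defi launch", "neutral market recap",
         "bullish on bitcoin", "regulation concerns grow",
         "explainer on smart contracts", "fraud case reported"]
labels = ["positive", "negative", "neutral", "negative", "positive",
          "neutral", "positive", "negative", "neutral", "negative"]

X = CountVectorizer().fit_transform(texts)

# 70/30 split, mirroring the ratio described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.30, random_state=42)

model = MultinomialNB().fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out 30%
```

The held-out `X_test` never influences fitting, which is exactly the mutual exclusivity the section argues for.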
This delineation into training and testing cohorts is crucial to upholding the integrity of the model's evaluation process, enabling an assessment of its generalization capabilities on unseen data.
CountVectorizer Training Data
CountVectorizer Test Data
NewsAPI Data (CountVectorizer, TfidfVectorizer)
Reddit Data (CountVectorizer, TfidfVectorizer)
Medium Data (CountVectorizer, TfidfVectorizer)
In the nuanced landscape of sentiment analysis, the detailed examination of the COUNTVEC and TF-IDF models, both employing lemmatization, elucidates their operational dynamics and avenues for potential refinement. The COUNTVEC model's confusion matrix delineates an interesting paradox. It showcases a high sensitivity (recall) of 90%, adeptly flagging 54 true positive sentiments, alongside a modest precision (positive predictive value) of 72%, resulting from 21 false positives. These metrics underscore the model's proficiency in identifying positive sentiments but also highlight its susceptibility to erroneously categorizing non-positive sentiments as positive. This blend of high recall with moderate precision yields an overall accuracy of 26.29%, signaling a challenge in uniformly accurate sentiment prediction across all classes.
Conversely, the TF-IDF model manifests a different set of characteristics. With no true negatives and 23 false positives leading to a precision of 75.53%, and an unblemished recall of 100% from 71 true positives, this model illustrates an impeccable capacity to capture positive sentiments. However, this remarkable sensitivity comes at the cost of precision, as evidenced by the model's total accuracy of 30.6%. The absence of true negative predictions conspicuously points to a propensity for over-predicting the positive sentiment, a trait that, while beneficial in maximizing recall, introduces a bias towards positive sentiment classification.
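To make the relationship between confusion-matrix counts and the reported metrics concrete, a small worked example (with illustrative binary counts, not the project's exact multi-class figures) shows how precision, recall, and accuracy are derived:

```python
# Illustrative binary confusion-matrix counts (not the project's figures).
tp, fp, fn, tn = 54, 21, 6, 19

precision = tp / (tp + fp)  # of predicted positives, how many were right
recall = tp / (tp + fn)     # of actual positives, how many were found
accuracy = (tp + tn) / (tp + fp + fn + tn)  # all correct over all cases

print(precision, recall, accuracy)  # 0.72 0.9 0.73
```

Note how a model with zero true negatives, like the TF-IDF case above, can still post perfect recall while precision and accuracy lag.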
Delving deeper into the prediction probabilities offered by both models provides further insight into their analytical temperaments. The COUNTVEC model exhibits a decisive inclination towards single-class predictions, particularly the 'neutral' sentiment, a trend that might hint at an underlying model bias or overfitting. The TF-IDF model, despite sharing a semblance of confidence in its predictions with the COUNTVEC model, shows a marginally more distributed inclination across different sentiment classes in certain instances. Nevertheless, a discernible preference for certain classes suggests the presence of bias or overfitting within this model as well.
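The per-class probabilities discussed here come from the classifier's `predict_proba` output; a minimal sketch of inspecting them, on toy data with the scikit-learn API assumed throughout this section:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["good news", "bad news", "plain report",
         "great launch", "terrible hack", "weekly summary"]
labels = ["positive", "negative", "neutral",
          "positive", "negative", "neutral"]

vec = CountVectorizer()
model = MultinomialNB().fit(vec.fit_transform(texts), labels)

# One row per document, one column per class (order follows
# model.classes_); each row sums to 1.
probs = model.predict_proba(vec.transform(["good report"]))
print(model.classes_)
print(probs)
```

A heavy concentration of mass in one column across many documents, as both project models show, is the kind of single-class bias the paragraph above describes.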
Synthesizing these observations, both models reveal a pronounced confidence in predicting specific sentiment classes, a characteristic potentially indicative of overfitting. The TF-IDF model edges slightly ahead in terms of accuracy and exhibits a remarkable ability to identify positive cases comprehensively. However, its precision is somewhat compromised by the prevalence of false positives. In contrast, the CountVectorizer model's lower overall accuracy is counterbalanced by a more equitable distribution of precision and recall, hinting at a more cautious stance towards positive sentiment prediction. These insights not only illuminate the operational nuances of each model but also chart a path for future enhancements, particularly in addressing the delicate balance between recall and precision, and mitigating the inclination towards overfitting.
Count Vectorizer Naïve Bayes Confusion Matrix
TFIDF Vectorizer Naïve Bayes Confusion Matrix
In the project focusing on mining online text to explore the pro and anti-blockchain sentiments using the Naïve Bayes classifier, several distinct insights have come to light; these results can inform our understanding of the public discourse surrounding blockchain technology on digital platforms.
The classifier has shown a high degree of accuracy in identifying neutral sentiments. This suggests that discussions or mentions of blockchain that do not explicitly lean toward either support or criticism can be effectively isolated. For the project, this means that a significant volume of the conversation on these platforms may revolve around informational or neutral discussion of blockchain technology rather than polarized debate.
The differentiation between pro-blockchain (positive sentiment) and anti-blockchain (negative sentiment) stances presents a more considerable challenge. The overlap in the language used by both proponents and critics of blockchain could be causing this ambiguity; for example, both sides might frequently use similar terms such as 'security' or 'crypto' in different contexts or with different connotations. This linguistic overlap complicates the model's task of classifying sentiments as purely positive or negative concerning blockchain.
To enhance the classifier's ability to distinguish between pro and anti-blockchain sentiments, refining the preprocessing of text data could be beneficial. This might involve more nuanced handling of context-specific language or the development of a custom lexicon that captures the subtleties of blockchain discourse. Additionally, exploring advanced models such as LSTM or BERT, which are better at understanding context and word order, could improve classification accuracy; these models might capture the nuanced difference between positive and negative sentiments within the blockchain discourse more effectively. Moreover, adding features beyond the text itself, such as the author's engagement metrics or the sentiment of responses to a post, could provide additional context that aids in sentiment classification. For instance, a highly debated post with polarized responses might indicate a contentious or critical view of blockchain, even if the original post's language is ambiguous.
The use of the Naïve Bayes classifier in this project has shed light on the general sentiment and the challenges in identifying specific stances within the blockchain debate online. While neutral discussions are clearly identifiable, distinguishing between supportive and critical views on blockchain requires further methodological refinement. The insights gleaned not only underscore the complexity of sentiment analysis in a highly technical and evolving field like blockchain but also highlight the potential pathways for enhancing the precision of sentiment classification in future iterations of the model.