Blockchain Text Mining - Decision Trees TAB

Decision Trees

Process Overview

Decision Trees (DTs) are a central analytical tool in this project, used to classify the sentiment of blockchain-related text gathered from Medium, Reddit, and NewsAPI. As a supervised learning method, a DT splits the dataset through a sequence of binary decisions, much like a flowchart: each internal node tests a specific attribute (such as the frequency of a pivotal word), and each terminal leaf assigns a sentiment category (positive, negative, or neutral).

The utility of DTs in this context is multifaceted. Primarily, they enable us to perform sentiment analysis, categorizing the public's opinions on blockchain into distinct sentiments. This analysis is vital for stakeholders to understand prevailing attitudes towards blockchain, impacting areas such as investment decisions and policy formulation. Moreover, DTs can discern the most salient features or terms contributing to sentiment classification, shedding light on the particular facets of blockchain that resonate with the public, either positively or negatively.

A key advantage of DTs is their interpretability. The decision-making process within a DT is transparent and can be visualized, making the model's conclusions accessible to those without an in-depth understanding of machine learning. This transparency is indispensable when the model's findings need to be explained to a broader audience, such as company executives or policy-makers.

Additionally, DTs can serve a practical function in content filtering and recommendation. They are capable of sifting through vast quantities of text to pinpoint specific sentiments, thereby enabling the rapid identification of negative feedback which could flag potential areas of concern. Furthermore, they can contribute to enhancing user experience on content platforms by recommending articles that align with the user's sentiment preferences.

In the execution phase of the project, the application of DTs involved an extensive preprocessing stage where raw text was transformed through tokenization, lemmatization, and vectorization techniques. Following this, the processed data was introduced into the DTs, which were trained to associate textual features with sentiment labels. The performance of these models was meticulously evaluated using a suite of metrics, including those derived from the confusion matrix.

The integration of DTs in this study not only aids in accurately categorizing sentiments towards blockchain but also enriches our understanding of the factors that influence public perception. This insight is invaluable for businesses, content creators, and academics interested in the blockchain space, providing a lens through which to view the shifting landscapes of public opinion.

Data Preparation

The data preparation phase is a critical component of our sentiment classification of blockchain-related content from Medium, Reddit, and NewsAPI. This phase comprises several key steps designed to convert raw text into a format suitable for machine learning models, particularly Decision Trees.

Initially, we collected datasets containing titles and comments from the respective platforms, which underwent a preprocessing regimen involving lemmatization. Lemmatization is the process of reducing words to their base or dictionary form, a practice that enhances the consistency of textual data and primes it for analysis.

Subsequently, we transformed the text data into numerical form using two vectorization techniques: CountVectorizer (COUNTVEC) and Term Frequency-Inverse Document Frequency (TF-IDF). COUNTVEC simply tallies the word frequencies within the text, while TF-IDF computes a weighted frequency designed to diminish the influence of commonly occurring words that are less informative about the sentiment of a text.
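As a minimal sketch of the difference between the two vectorizers, the following uses scikit-learn's CountVectorizer and TfidfVectorizer on three made-up, already-lemmatized sentences. The sentences and variable names are illustrative assumptions, not the project's data or code.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical, already-lemmatized example texts (not from the project data).
docs = [
    "blockchain be a great technology",
    "blockchain scam be a terrible risk",
    "blockchain conference start today",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

X_count = count_vec.fit_transform(docs)  # raw term counts per document
X_tfidf = tfidf_vec.fit_transform(docs)  # counts reweighted by inverse document frequency

vocab = count_vec.vocabulary_            # maps each term to its column index
col_common = vocab["blockchain"]         # term that appears in every document
col_rare = vocab["great"]                # term that appears only in the first document

# Feature values for the first document under each scheme.
count_row = X_count[0].toarray().ravel()
tfidf_row = X_tfidf[0].toarray().ravel()

print("counts:", count_row[col_common], count_row[col_rare])
print("tf-idf:", round(tfidf_row[col_common], 3), round(tfidf_row[col_rare], 3))
```

In the first sentence, "blockchain" and "great" both have a raw count of 1, but TF-IDF gives "great" the larger weight because "blockchain" occurs in every document.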

Alongside vectorization, sentiment labels were assigned to each text entry using the TextBlob library, which calculates a sentiment polarity score. Text entries were categorized as 'positive' if the polarity score exceeded zero, 'negative' if it fell below zero, and 'neutral' if it was equal to zero. This labeling step is crucial because the TextBlob-derived labels serve as the ground truth against which the performance of our classification models is evaluated.
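The thresholding rule described above can be sketched as follows. The function names are hypothetical; the only TextBlob call used is the polarity attribute, TextBlob(text).sentiment.polarity, and it is imported lazily so the threshold logic can be read and run on its own.

```python
def label_from_polarity(polarity: float) -> str:
    """Map a polarity score to a sentiment label using a zero threshold."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def label_text(text: str) -> str:
    """Label raw text via TextBlob's polarity score (requires textblob)."""
    from textblob import TextBlob  # imported here so the rule above has no dependencies

    return label_from_polarity(TextBlob(text).sentiment.polarity)


# The rule itself, applied to a few hand-picked scores.
for score in (0.35, 0.0, -0.6):
    print(score, "->", label_from_polarity(score))
```

Because the neutral class is defined by a score of exactly zero, any text containing even one weakly scored word falls into the positive or negative class.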

The datasets were then consolidated, forming comprehensive compilations of vectorized text and corresponding sentiment labels. These compilations were split into training and test sets using a 70/30 ratio, ensuring both diversity and sufficiency of data for learning and validation purposes.

For the sentiment classification task, Decision Trees were chosen due to their interpretability and their suitability for predicting categorical labels from word-frequency features. We trained two Decision Tree models, one with the COUNTVEC dataset and another with the TF-IDF dataset, enabling us to later assess and compare the influence of vectorization techniques on model performance.
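A rough sketch of the split-and-train step, assuming scikit-learn and a small made-up corpus. The texts, labels, helper name train_tree, and the fixed random seed are all illustrative assumptions; only the 70/30 split and the two vectorizers come from the description above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labeled texts (not the project's data).
texts = [
    "great blockchain project", "love this secure network",
    "amazing growth for crypto", "excellent wallet update",
    "terrible blockchain scam", "awful fee and slow network",
    "bad hack lose fund", "blockchain conference start today",
    "exchange list token", "developer release version",
]
labels = [
    "positive", "positive", "positive", "positive",
    "negative", "negative", "negative",
    "neutral", "neutral", "neutral",
]


def train_tree(vectorizer, texts, labels, seed=0):
    """Vectorize, split 70/30, and fit one decision tree."""
    X = vectorizer.fit_transform(texts)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.3, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X_train, y_train)
    return model, X_test, y_test


# One tree per vectorization scheme, so the two can be compared later.
count_model, Xc_test, yc_test = train_tree(CountVectorizer(), texts, labels)
tfidf_model, Xt_test, yt_test = train_tree(TfidfVectorizer(), texts, labels)

print(count_model.predict(Xc_test), yc_test)
print(tfidf_model.predict(Xt_test), yt_test)
```

Using the same seed for both calls keeps the train and test rows identical across the two models, so any difference in predictions comes from the vectorization alone.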

Evaluation of the models was conducted using a confusion matrix to determine the accuracy of sentiment predictions. The probability predictions of the models provided insights into the certainty of the models' classification decisions.
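To illustrate how a confusion matrix and class probabilities are read, here is a small sketch using scikit-learn. The label lists and the three-row training set are invented for the example and do not reproduce the project's results.

```python
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Fixed class order so rows and columns can be read unambiguously.
CLASSES = ["negative", "neutral", "positive"]

# Hypothetical true and predicted labels (not the project's results).
y_true = ["positive", "neutral", "negative", "positive", "neutral", "positive"]
y_pred = ["positive", "neutral", "neutral", "positive", "positive", "positive"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=CLASSES)
accuracy = cm.trace() / cm.sum()  # correct predictions lie on the diagonal
print(cm)
print("accuracy:", accuracy)

# A toy tree on two made-up count features, to show what predict_proba returns.
X_train = [[2, 0], [0, 2], [0, 0]]
y_train = ["positive", "negative", "neutral"]
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Probabilities follow the order of tree.classes_ (sorted alphabetically).
proba = tree.predict_proba([[2, 0]])[0]
print(dict(zip(tree.classes_, proba)))
```

For a decision tree, each probability is the fraction of training samples of that class in the leaf the input reaches, so a fully grown tree often reports values of exactly 0 or 1.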

Additionally, we extended our analysis by applying a Random Forest Classifier, an ensemble method that aggregates the decisions of multiple trees to improve the predictive performance and robustness of the model. We also explored the use of Support Vector Machines (SVMs) with probability estimates for a comparative understanding of different algorithms' efficacies.
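The two comparison models could be set up along these lines with scikit-learn. This is a sketch under stated assumptions: the corpus, the number of trees, the linear kernel, and the seed are illustrative choices, not settings reported by the project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical labeled texts (not the project's data), four per class.
texts = [
    "great blockchain project", "love this secure network",
    "amazing growth for crypto", "excellent wallet update",
    "terrible blockchain scam", "awful fee and slow network",
    "bad hack lose fund", "horrible outage again",
    "blockchain conference start today", "exchange list token",
    "developer release version", "company publish report",
]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

X = TfidfVectorizer().fit_transform(texts)

# Ensemble of trees; each tree is fit on a bootstrap sample of the rows.
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, labels)

# probability=True makes SVC fit an extra calibration step for predict_proba.
svm = SVC(kernel="linear", probability=True, random_state=0).fit(X, labels)

forest_proba = forest.predict_proba(X)
svm_proba = svm.predict_proba(X)
print(forest.classes_, forest_proba[0])
print(svm.classes_, svm_proba[0])
```

The forest's probabilities are averages over its trees, while the SVM's come from a separate calibration fit, so the two are not directly comparable as confidence scores.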

To encapsulate the entire process visually, we created graphical representations of the Decision Trees and the individual trees within the Random Forest, using the plot_tree function. This visualization not only aids in understanding the decision-making paths of the models but also serves as an interpretative tool to pinpoint which features (words) are most influential in driving sentiment classification.

In essence, the data preparation stage laid the groundwork for a series of analyses that encompassed vectorization, sentiment labeling, model training and evaluation, and visualization. Through meticulous data curation and model application, we aimed to extract meaningful patterns from the text data, facilitating a deeper understanding of public sentiment towards blockchain technology as reflected across diverse online platforms.

Sample CountVectorizer Data

Sample TfidfVectorizer Data

CountVectorizer Training Data

CountVectorizer Test Data

Code & Data

Python Decision Tree Code

NewsAPI Data (CountVectorizer, TfidfVectorizer)

Reddit Data (CountVectorizer, TfidfVectorizer)

Medium Data (CountVectorizer, TfidfVectorizer)

Results

In the current study, sentiment analysis was performed using decision trees on datasets sourced from Medium, Reddit, and NewsAPI. For comprehensive analysis, two distinct models were utilized, one based on the CountVectorizer (COUNTVEC) technique combined with Lemmatization, and the other employing the TF-IDF approach, also paired with Lemmatization.

The model employing COUNTVEC with Lemmatization yielded a set of predictions which were assessed against the actual sentiment labels. According to the confusion matrix, the model correctly predicted 74 instances of positive sentiment, 85 instances of neutral sentiment, and 8 instances of negative sentiment. Its errors included 13 neutral statements and 18 negative statements that were incorrectly labeled positive, and 4 positive statements that it failed to recognize as positive. The model also reports a probability between 0 and 1 for each class, with a higher value indicating stronger confidence. For example, a probability of [0.0, 0.0, 1.0] for a positive prediction assigns the positive class a probability of 1; in a decision tree, this means that every training sample in the corresponding leaf was positive.

CountVectorizer Decision Trees

The second model, which used TF-IDF with Lemmatization, produced a somewhat different pattern of predictions. It correctly identified 80 positive sentiments, slightly more than the first model's 74. It also correctly predicted neutral sentiment 83 times and negative sentiment 10 times. As with the first model, errors were present, including 14 misclassifications of positive sentiments as neutral and 21 as negative, and 5 instances where it failed to recognize positive sentiment.

TfidfVectorizer Decision Trees

Comparing the two models shows marginally better performance by the TF-IDF model in predicting positive sentiments, with 80 correct identifications versus 74 for the CountVectorizer model. This suggests that TF-IDF, which down-weights terms that appear in many documents, may be better suited to this task than CountVectorizer, which relies solely on raw occurrence counts.

While the TF-IDF model slightly outperformed the CountVectorizer model in classifying positive sentiments, a deeper analysis employing metrics such as precision, recall, and F1 scores for each class is essential to establish definitive conclusions. Additionally, the distribution of sentiments within the dataset must be considered, as the predominance of a particular sentiment could skew the results.
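The per-class metrics mentioned above follow directly from a confusion matrix. The sketch below computes them in plain Python for a hypothetical 3x3 matrix (rows are true classes, columns are predicted classes); the numbers are invented and are not the project's results.

```python
CLASSES = ["negative", "neutral", "positive"]

# Hypothetical confusion matrix: cm[i][j] = true class i predicted as class j.
cm = [
    [5, 2, 1],
    [1, 8, 3],
    [0, 2, 8],
]


def per_class_metrics(cm, classes):
    """Return {class: (precision, recall, f1)} from a square confusion matrix."""
    metrics = {}
    for k, name in enumerate(classes):
        tp = cm[k][k]
        predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
        actual_k = sum(cm[k])                    # row sum: everything truly k
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[name] = (precision, recall, f1)
    return metrics


results = per_class_metrics(cm, CLASSES)
for name, (p, r, f1) in results.items():
    print(f"{name:8s} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Reporting these values per class makes it visible when a model scores well on a frequent class while performing poorly on a rare one, which a single count of correct predictions can hide.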

Overall, the TF-IDF model was marginally more proficient at sentiment classification in this comparison, notably for positive sentiments. Further refinements to these models could involve additional feature engineering, hyperparameter adjustments, or exploring alternative machine learning algorithms to enhance predictive performance.

Conclusions

In conclusion, this study underscores the effectiveness of Decision Trees (DTs) in performing sentiment analysis on blockchain-related textual data sourced from platforms such as Medium, Reddit, and NewsAPI. Through a methodical approach that encompassed data preparation, model training, and evaluation, we unearthed nuanced insights into public sentiment towards blockchain technology. The deployment of two distinct models, one based on CountVectorizer combined with Lemmatization and the other employing TF-IDF with Lemmatization, allowed for a comprehensive examination of sentiment classification efficacy.

The results indicate that while both models predict sentiment with considerable accuracy, the TF-IDF model has a slight edge, particularly in classifying positive sentiments. This finding suggests that the weighting mechanism inherent to TF-IDF, which reduces the influence of common but less informative words, may capture textual sentiment more effectively. The performance difference between the models, small as it is, underscores the importance of the choice of vectorization technique in sentiment analysis.

Furthermore, the practical implications of these findings extend well into various domains. For stakeholders in the blockchain space, the ability to accurately gauge and understand public sentiment is crucial. It can inform strategic decisions ranging from investment strategies to policy formulation, content curation, and user experience enhancement on digital platforms. Moreover, the interpretability of DTs as demonstrated in this study makes these insights accessible to a broader audience, enabling informed decision-making even by those with limited technical expertise in machine learning.

The slight superiority of the TF-IDF model in identifying positive sentiments also suggests avenues for future research, including the potential for refined model tuning, exploration of alternative vectorization techniques, or the incorporation of additional linguistic features to bolster accuracy. Furthermore, extending the analysis to encompass more granular sentiment categorizations could offer deeper insights into the subtleties of public opinion on blockchain.

In essence, this study not only validates the applicability of DTs in sentiment analysis within the blockchain domain but also highlights the critical role of data preparation techniques in shaping model performance. As the blockchain landscape continues to evolve, leveraging advanced analytics to capture and understand public sentiment will remain a key asset for stakeholders across various sectors.
