In the context of analyzing sentiment towards blockchain from sources such as Medium, Reddit, and NewsAPI, Support Vector Machines (SVMs) serve as a robust methodology for classifying text data. The process commences with the aggregation of relevant textual content from these sources, utilizing their respective APIs to gather articles, comments, and posts related to blockchain. Following data collection, a comprehensive preprocessing stage is essential to prepare the text for analysis. This stage involves cleaning the data by removing irrelevant characters, tokenizing the text into individual words, normalizing these tokens to their base forms, and eliminating common stopwords that offer little value to sentiment analysis.
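The cleaning, tokenization, normalization, and stopword-removal steps described above can be sketched as follows. This is a minimal illustration in plain Python: the stopword list here is a small hand-picked sample (a real pipeline would use a fuller list such as NLTK's), and lemmatization to base forms is omitted for brevity.

```python
import re

# Small illustrative stopword list; a production pipeline would use a
# fuller list (e.g. from NLTK or spaCy) -- this subset is an assumption.
STOPWORDS = {"the", "is", "a", "an", "of", "to", "and", "on", "in", "for"}

def preprocess(text):
    """Clean, tokenize, lowercase-normalize, and remove stopwords."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # strip digits, punctuation, symbols
    tokens = text.lower().split()             # tokenize and normalize case
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Blockchain is transforming finance in 2024!"))
# -> ['blockchain', 'transforming', 'finance']
```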
After preprocessing, the next crucial step is feature extraction, where numerical features are derived from the text using techniques such as the Bag of Words model or TF-IDF, both of which quantify the significance of words within the documents. These features then serve as the foundation for training the SVM model. The model is trained on a labeled dataset where each text sample is associated with a sentiment (positive, negative, or neutral), allowing the SVM to learn how to classify the sentiment of unseen text data accurately.
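A minimal sketch of this feature-extraction and training step, assuming scikit-learn is available. The four-document corpus and its labels are invented for illustration; in the actual pipeline the features would come from the preprocessed platform data and the labels from the annotated dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Tiny hand-labeled corpus for illustration only.
texts = [
    "blockchain adoption is growing fast",
    "great potential for decentralized finance",
    "scams and fraud plague crypto markets",
    "investors lost money in the crash",
]
labels = ["positive", "positive", "negative", "negative"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)   # documents -> TF-IDF feature matrix

clf = SVC(kernel="linear")
clf.fit(X, labels)                    # learn a sentiment decision boundary

# Unseen text must be transformed with the SAME fitted vectorizer.
X_new = vectorizer.transform(["fraud hurts investors"])
print(clf.predict(X_new))
```

Note that the vectorizer is fitted once on the training corpus and then reused to transform new text, so the unseen documents are projected into the same feature space the SVM was trained in.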
Hyperparameter tuning is an indispensable step that follows, involving the optimization of the SVM parameters, including the selection of the kernel type, to enhance the model's performance. This is particularly critical as the choice of kernel and its parameters profoundly influences the model's capacity to accurately capture sentiment nuances in the text data. Once the SVM model is optimized and trained, it undergoes evaluation using a separate test dataset to assess its efficacy in sentiment classification, employing metrics such as accuracy, precision, recall, and the F1-score.
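The tuning step described above is commonly done with a cross-validated grid search. The sketch below assumes scikit-learn; synthetic features stand in for the TF-IDF matrix, and the parameter grid mirrors the kernels and cost values discussed later in the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the vectorized text features.
X, y = make_classification(n_samples=60, n_features=20, random_state=0)

# Grid of kernel types and cost (C) values to search over.
param_grid = {"kernel": ["linear", "poly", "rbf"], "C": [1, 5, 10, 20]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)            # kernel/cost pair with best CV accuracy
print(round(search.best_score_, 3))
```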
Upon successful evaluation, the SVM model is deployed to analyze new text data from the specified sources, enabling the classification of sentiments and the assessment of public opinion trends on blockchain technology. This approach leverages SVM's strengths in handling high-dimensional data and modeling complex decision boundaries, making it particularly advantageous for sentiment analysis in the nuanced domain of blockchain discussions.
Sample CountVectorizer Data
Sample TfidfVectorizer Data
The data preparation process for SVMs in the context of sentiment analysis on text data related to blockchain from sources like Medium, Reddit, and NewsAPI is a multifaceted approach that incorporates data collection, preprocessing, feature extraction, and sentiment categorization. The initial step involves gathering and compiling datasets from the mentioned platforms, which are then subjected to a series of preprocessing techniques. These techniques include cleaning (removing irrelevant characters and formatting), tokenization (splitting text into meaningful units), normalization (converting tokens to a standard form), and stopword removal (eliminating common words).
Subsequently, the preprocessed text undergoes feature extraction, where two primary methods are utilized: TF-IDF (Term Frequency-Inverse Document Frequency) and Count Vectorization. Both methods convert text into numerical vectors that represent the significance of words within the documents, albeit in slightly different manners. The datasets are split into training and test sets using 'train_test_split' to facilitate model training and evaluation.
Sentiment analysis is performed by applying the TextBlob library to categorize sentiments as positive, negative, or neutral based on the polarity scores of the text. This step is crucial for assigning sentiment labels to the data, which are essential for supervised learning.
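TextBlob assigns each text a polarity score in [-1, 1], and the categorization step maps that score to a label. The mapping can be sketched as a simple thresholding rule; the exact thresholds are an assumption here, since the original does not state them, and some pipelines use a dead band around zero for "neutral".

```python
def polarity_to_label(polarity):
    """Map a TextBlob-style polarity score in [-1, 1] to a sentiment label.

    The zero threshold is an assumption; variants use a dead band such as
    (-0.1, 0.1) for 'neutral' to absorb weakly polar text.
    """
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

# With TextBlob installed, the score itself would come from:
#   from textblob import TextBlob
#   polarity = TextBlob(text).sentiment.polarity
print(polarity_to_label(0.35))   # positive
```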
Models such as Multinomial Naive Bayes, Decision Trees, Random Forest, and SVMs (Support Vector Machines) are trained on these features. The choice of model depends on the specific characteristics of the data and the objectives of the analysis. In this context, SVMs are particularly highlighted for their effectiveness in handling high-dimensional data and their flexibility through the use of different kernel functions and cost parameters.
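Training and comparing the four model families side by side might look like the following sketch, assuming scikit-learn. The toy corpus and its labels are invented for illustration (the repetition only gives the train/test split enough samples); the actual analysis would use the labeled platform data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy corpus, repeated so the split has enough samples per class.
texts = [
    "blockchain will revolutionize payments", "exciting growth in defi",
    "strong adoption by major banks", "promising new blockchain use case",
    "crypto scam wiped out savings", "fraud and hacks everywhere",
    "regulators warn of severe losses", "another exchange collapse",
] * 3
labels = (["positive"] * 4 + ["negative"] * 4) * 3

X = CountVectorizer().fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels)

models = {
    "MultinomialNB": MultinomialNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "SVM": SVC(kernel="linear"),
}
accuracies = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
              for name, m in models.items()}
print(accuracies)
```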
To optimize the SVM models, a variety of kernel functions ('linear', 'poly', 'rbf') and cost values are tested to evaluate their impact on model performance, particularly accuracy. The accuracy and confusion matrices of these models are computed and analyzed to determine the optimal configuration.
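The kernel-and-cost sweep can be sketched as an explicit loop, assuming scikit-learn. Synthetic features again stand in for the vectorized text, and the cost values (1, 5, 10, 20) mirror those reported in the results below; accuracies on this synthetic data will of course differ from the paper's figures.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the vectorized text features.
X, y = make_classification(n_samples=120, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

results = {}
for kernel in ("linear", "poly", "rbf"):
    for cost in (1, 5, 10, 20):
        clf = SVC(kernel=kernel, C=cost).fit(X_tr, y_tr)
        preds = clf.predict(X_te)
        results[(kernel, cost)] = accuracy_score(y_te, preds)
        print(f"{kernel:6s} C={cost:2d} acc={results[(kernel, cost)]:.3f}")
        # Confusion matrix for this configuration (rows: true, cols: predicted)
        print(confusion_matrix(y_te, preds))

best = max(results, key=results.get)
print("best (kernel, C):", best)
```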
This process embodies a comprehensive approach to preparing and analyzing text data for sentiment analysis, leveraging the capabilities of SVMs alongside other machine learning algorithms. The integration of feature extraction methods, sentiment categorization, and model optimization forms the backbone of a robust sentiment analysis framework capable of extracting meaningful insights from vast amounts of text data.
CountVectorizer Training Data
CountVectorizer Test Data
NewsAPI Data (CountVectorizer, TfidfVectorizer)
Reddit Data (CountVectorizer, TfidfVectorizer)
Medium Data (CountVectorizer, TfidfVectorizer)
In sentiment analysis of blockchain-related text data from Medium, Reddit, and NewsAPI using Support Vector Machines (SVMs), the kernel type significantly influences the performance and accuracy of the models. Each kernel function reflects a different approach to processing the input data and drawing decision boundaries between the classes.
Kernel & Cost Comparison Effect on SVM Accuracy
The linear kernel SVM is the simplest form: it seeks a separating hyperplane directly in the input space. In simpler terms, it tries to find the best straight line (or hyperplane in higher dimensions) that separates the classes. This approach appears to be effective in our scenario, as evidenced by the accuracy rates obtained. The highest accuracy recorded with a linear kernel is approximately 77.59% at cost values of 1 and 5, which indicates good performance with a linear decision boundary.
However, as the cost parameter increases, we notice a slight drop in accuracy (to about 75.43% at a cost of 20), which suggests that the linear SVM might be overfitting the data when penalized too heavily for misclassifications (which is what the cost parameter controls). This means that while the model becomes very confident about the training data, it might lose its generalization power on unseen data.
The polynomial kernel allows the SVM to fit a polynomial decision boundary of a specified degree in the input space. This approach is beneficial for capturing more complex patterns than a linear kernel but can also be prone to overfitting if the degree of the polynomial is set too high. In our analysis, the polynomial kernel SVM shows a decrease in performance starting from an accuracy of about 70.69% at a cost of 1 and stabilizing around 68.97% regardless of the increase in the cost parameter.
Kernel & Cost Correlation Matrix
The consistency of the accuracy despite the changing cost parameter suggests that the decision boundaries formed by the polynomial kernel might not be the best fit for the nuances of the sentiment classification in the given dataset. It could be that the data's distribution in the feature space does not align well with polynomial shapes, or the chosen degree of the polynomial does not match the complexity of the data.
The RBF kernel, also known as the Gaussian kernel, is a popular choice for non-linear data. It can handle cases where the relationship between class labels and attributes is nonlinear. The RBF kernel maps input data into a higher-dimensional space where a linear separator may be found. For the RBF kernel, the highest accuracy achieved is approximately 76.72% at a cost of 1, after which the accuracy decreases slightly and levels off at about 75.00% for higher costs.
This decrease and subsequent plateau may indicate that while the RBF kernel can model complex patterns in the data, it also starts to overfit as the cost increases, like the linear kernel. The initial higher accuracy suggests that the RBF kernel captures some of the data's complexities, but the lack of significant improvement with increased cost hints that there might be a limit to the complexity that is useful for this dataset.
In summary, while each kernel has its strengths, the linear kernel's performance in our sentiment analysis suggests that the decision boundary separating the sentiments in our blockchain-related text data is closer to linear. The complexity added by the polynomial and RBF kernels does not translate to better performance and may even lead to overfitting. These insights should guide the model selection and parameter tuning in future iterations of the sentiment analysis, prioritizing models that balance complexity with the ability to generalize.
In conclusion, the analysis of sentiment towards blockchain technology across various online platforms, including Medium, Reddit, and NewsAPI, through the use of Support Vector Machines (SVMs) has yielded significant insights. The SVMs, equipped with different kernel functions, were adept at handling the high-dimensional and nuanced text data prevalent in discussions about blockchain. The performance of these models varied with the choice of kernel and hyperparameters, underscoring the importance of model selection and optimization in sentiment analysis.
The linear kernel emerged as particularly effective, achieving the highest accuracy among the kernels tested. This suggests that the sentiments expressed in the blockchain-related text data might be separated by near-linear boundaries, a finding that highlights the linear kernel's suitability for such analysis. Despite the good performance at lower cost parameters, a slight decrease in accuracy at higher cost parameters indicated a potential overfitting issue, reminding us of the delicate balance between model confidence and generalizability.
Conversely, the polynomial and RBF kernels, despite their ability to model more complex decision boundaries, did not significantly outperform the linear kernel. This could be interpreted as an indication that the added complexity does not necessarily capture the sentiment nuances more effectively and may, in fact, lead to overfitting. Particularly, the polynomial kernel's performance decline and the RBF kernel's accuracy plateau at higher costs reinforce the notion that simpler models might be more efficient and generalizable for this dataset.
These findings have substantial implications for future research and practical applications in sentiment analysis within the blockchain domain. They suggest that while advanced, non-linear models can capture complex patterns, the choice of kernel and the optimization of hyperparameters must be carefully considered to prevent overfitting and to ensure the models remain applicable to unseen data. Furthermore, the successful application of SVMs in this context underlines the potential of machine learning techniques to understand public sentiment towards emerging technologies like blockchain, providing valuable insights into public opinion trends.
For stakeholders in the blockchain industry, including developers, investors, and policymakers, these insights are invaluable. They offer a nuanced understanding of public sentiment that can inform strategy, communication, and policy-making. As sentiment analysis technology advances, it will become an increasingly vital tool in the arsenal of those seeking to navigate the complex web of public opinion in the digital age.