Neural Networks (NNs) are computational models inspired by the biological neural networks that constitute animal brains. These models are structured in layers of interconnected nodes, or 'neurons,' each capable of processing input and transmitting information to subsequent layers. Architectures vary: feedforward Artificial Neural Networks (ANNs), in which connections flow unidirectionally; Perceptrons, the simplest feedforward networks; Convolutional Neural Networks (CNNs), which excel at processing structured arrays of data such as images; Recurrent Neural Networks (RNNs), which process sequences of data by feeding output back as input; and Long Short-Term Memory networks (LSTMs), a type of RNN designed to recognize patterns in sequences over extended intervals. These models are applied across a variety of fields, from voice recognition systems to decision-making in autonomous vehicles, illustrating their versatility in complex pattern recognition tasks. This project applies NN models to analyze and interpret complex patterns in text data, aiming to derive meaningful insights from large datasets, such as word sentiments surrounding the topic of blockchain, thereby contributing to a nuanced understanding of the underlying data dynamics within NewsAPI, Reddit, and Medium.
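The layered structure described above can be sketched in a few lines: a single dense feedforward layer computes a weighted sum of its inputs plus a bias and applies an activation, passing the result to the next layer. The weights and inputs below are fixed toy values for illustration, not project data; a real network learns its weights during training.

```python
import numpy as np

def relu(x):
    # Common activation: pass positive values through, clamp negatives to 0.
    return np.maximum(0.0, x)

def dense(x, W, b):
    # Each output neuron computes activation(weights . inputs + bias).
    return relu(W @ x + b)

x = np.array([1.0, -2.0, 0.5])    # input features (toy values)
W1 = np.array([[0.2, -0.1, 0.4],  # 2 neurons, each with 3 input weights
               [-0.3, 0.5, 0.1]])
b1 = np.array([0.1, 0.0])

hidden = dense(x, W1, b1)
print(hidden)  # first neuron fires (0.7); second is clamped to 0 by ReLU
```

Stacking several such layers, with the output of one feeding the input of the next, yields the multi-layer networks used throughout this project.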
Example of Neural Network Sentiment Analyzer Using Keras
Data for X and Y Train and Test Sets (NewsAPI -- also done with Reddit and Medium)
Supervised modeling has two key prerequisites: labeled data and a systematic division of that data into a Training Set and a Testing Set. Labeled data is necessary because it contains both the input features and the corresponding output labels needed for training; these labels provide the ground truth against which the model learns to predict outcomes. In this case, as pictured in the visuals, the labels were 0, 1, and 2, representing negative, neutral, and positive sentiments. The process begins by segregating the labeled data into a Training Set, which is used to construct and train the model; this set allows the model to learn the relationship between the input features and the expected output by continuously adjusting and optimizing its parameters.
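A minimal sketch of this split, using scikit-learn's `train_test_split`. The texts and 0/1/2 sentiment labels here are hypothetical stand-ins, not the project's NewsAPI data:

```python
from sklearn.model_selection import train_test_split

# Toy labeled data: each text paired with a sentiment label
# (0 = negative, 1 = neutral, 2 = positive).
texts = [
    "blockchain adoption is accelerating",    # 2
    "prices crashed after the announcement",  # 0
    "the report summarizes recent activity",  # 1
    "investors are optimistic about defi",    # 2
    "regulators warn of serious risks",       # 0
    "a new whitepaper was published today",   # 1
]
labels = [2, 0, 1, 2, 0, 1]

# Hold out a third of the labeled data as the unbiased Testing Set;
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=2, random_state=42
)

print(len(X_train), len(X_test))  # 4 training examples, 2 testing examples
```

The same split is applied to each platform's data before vectorization, so the model never sees Testing Set examples during training.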
X Training Set
Y Training Set
Following the training phase, the model's performance and accuracy are evaluated using a Testing Set. This set comprises a separate subset of labeled data that was not exposed to the model during training. The key point is that the Testing Set serves as a new, unbiased platform to gauge the model's ability to generalize its learning to unseen data, thus validating the model's effectiveness in real-world applications. It is critical to highlight that only labeled data can be used in supervised learning, as the model relies explicitly on known outcomes to learn and make predictions. This structured approach ensures the model is not only tailored to the specifics of the training data but also retains a broader capacity to perform well across diverse and unseen datasets.
X Testing Set
Y Testing Set & Y Testing Set One-Hot Encoded
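One-hot encoding turns each integer label into a vector with a 1 in the position of its class, the format a Keras softmax output layer expects. Keras provides `keras.utils.to_categorical` for this; a minimal NumPy equivalent, using toy labels rather than the project's data, looks like:

```python
import numpy as np

y_test = np.array([0, 2, 1, 2])  # toy sentiment labels
num_classes = 3                  # negative, neutral, positive

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing np.eye with the labels encodes the whole array at once.
y_test_onehot = np.eye(num_classes)[y_test]
print(y_test_onehot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```

Each row sums to 1 and `argmax` recovers the original integer label, so the encoding is lossless.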
Correlation Matrices for Sentiment by Platform
The model trained on NewsAPI data achieved an accuracy of 71.11%, indicating moderate success; this outcome suggests challenges in handling the formal and diverse language of journalistic content, which often lacks distinct sentiment expressions given the objective nature of journalism. The observed fluctuations in validation loss also imply potential overfitting, indicating that the model may benefit from advanced regularization techniques or alternative architectures better suited to the nuances of news language. Additionally, expanding the dataset could enhance the model's ability to generalize beyond the specifics of the training data.
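One common guard against the overfitting suggested by the fluctuating validation loss is early stopping: halt training once validation loss stops improving. In Keras this is handled by the `keras.callbacks.EarlyStopping` callback; the logic can be sketched in plain Python, with an illustrative (not measured) loss sequence:

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop: the point where
    validation loss has failed to improve for `patience` consecutive epochs."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss    # new best validation loss
            waited = 0
        else:
            waited += 1    # no improvement this epoch
            if waited >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered; train to the end

# Illustrative run: loss improves, then fluctuates upward.
val_losses = [0.92, 0.71, 0.64, 0.66, 0.63, 0.70, 0.72, 0.75]
print(early_stop_epoch(val_losses, patience=2))  # stops at epoch 6
```

Restoring the weights from the best epoch (epoch 4 here) then keeps the most general version of the model rather than the most overfit one.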
Conversely, the model performed best with Reddit data, achieving an accuracy of 88.33%. This high performance is likely due to the platform's conversational style and explicit sentiment expressions, a product of the more subjective nature of Reddit conversations, which simplify the task of sentiment classification. Reddit's structured environment, where posts are categorized into specific subreddits, may also provide a more defined linguistic context that aids the model's learning process. This suggests that for social media platforms with direct and expressive user-generated content, simpler models can be highly effective, although they require ongoing adjustments to adapt to the dynamic nature of online language.
Moreover, the analysis of Medium posts, which typically contain longer and more exploratory content, presented a greater challenge, reflected in a lower accuracy of 74.39%. The complex and often mixed sentiments within a single Medium article make it difficult for the model to perform well using basic TF-IDF vectorization, which fails to capture the deeper narrative and contextual sentiments. Additionally, like the news data, Medium content is typically more objective and academically focused, making sentiment harder to discern. Implementing more sophisticated NLP techniques, such as contextual embeddings, sequence models like LSTMs, or transformer-based models like BERT, could potentially improve performance by providing a deeper understanding of the text's structure and flow.
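The limitation of basic TF-IDF becomes clearer from its definition: each document is reduced to a bag of per-term weights (term frequency scaled by inverse document frequency), so word order and narrative context are discarded. A minimal sketch with toy documents (in practice a library such as scikit-learn's `TfidfVectorizer` would be used, which applies slightly different smoothing):

```python
import math
from collections import Counter

docs = [
    "blockchain enables decentralized trust",
    "trust in markets is eroding",
    "blockchain markets are volatile",
]

def tfidf(docs):
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({
            # TF (term share of the doc) x IDF (rarity across the corpus).
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

vecs = tfidf(docs)
# "enables" appears in one doc (high IDF); "blockchain" in two (lower IDF).
print(vecs[0]["enables"] > vecs[0]["blockchain"])  # True
```

Because the representation is just these per-term weights, an article that opens negatively and ends positively collapses into one undifferentiated vector, which is exactly where sequence-aware models can help.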
These findings highlight the importance of tailoring sentiment analysis models to specific text types and sources. They also show the need for continual optimization of models and datasets to address the distinct linguistic features and challenges presented by different platforms. Moreover, they point to the broader applicability of advanced NLP techniques in improving the accuracy and robustness of sentiment analysis models, particularly for complex textual data like that found on Medium. Overall, sentiment analysis of blockchain discussion across these platforms proved promising, though the models remain in need of further development.