This repository houses the code, raw data, and cleaned data for the blockchain text mining project, shared for transparency and reference.
The project covers the acquisition, processing, and analysis of textual data from NewsAPI, Reddit, and Medium, sitting at the intersection of data science and natural language processing (NLP). First, the code fetches articles from NewsAPI for the query "blockchain," iterating through result pages to compile a dataset of article titles and descriptions, which is then loaded into a pandas DataFrame for easy manipulation and analysis. Reddit data is accessed through PRAW, the Python Reddit API Wrapper (chosen in light of the recent official API changes): posts matching "blockchain" are retrieved from all of Reddit, and their titles, comments, and selftext are compiled into a DataFrame. Finally, the script extends its collection to Medium, using Selenium to scrape article titles and descriptions through simulated user search interactions.
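The NewsAPI step can be sketched roughly as below. This is a minimal illustration, not the repository's exact code: the function names are hypothetical, the API key is a placeholder, and the page count and fields are assumptions (NewsAPI's `/v2/everything` endpoint does accept `q`, `page`, and `apiKey` parameters).

```python
import requests
import pandas as pd

def parse_articles(payload):
    """Pull (title, description) pairs out of a NewsAPI response payload."""
    return [
        {"title": a.get("title"), "description": a.get("description")}
        for a in payload.get("articles", [])
    ]

def fetch_blockchain_articles(api_key, pages=5):
    """Page through NewsAPI results for the query 'blockchain'
    and compile them into a pandas DataFrame."""
    rows = []
    for page in range(1, pages + 1):
        resp = requests.get(
            "https://newsapi.org/v2/everything",
            params={"q": "blockchain", "page": page, "apiKey": api_key},
            timeout=10,
        )
        rows.extend(parse_articles(resp.json()))
    return pd.DataFrame(rows, columns=["title", "description"])
```

The Reddit and Medium steps follow the same pattern, swapping the `requests` calls for PRAW queries and Selenium interactions respectively.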
The text data then undergoes extensive processing with standard NLP techniques: tokenization, stop-word removal, lemmatization, and stemming. Lemmatization and stemming both reduce words to a base or root form; lemmatization uses context to produce a meaningful base form, while stemming applies a more heuristic, rule-based approach. Removing stop words discards common but low-information words so the analysis focuses on significant terms, and tokenization breaks the text into individual words or tokens, preparing the data for vectorization.
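A minimal sketch of such a preprocessing pipeline is shown below. It uses NLTK's Porter stemmer; the tiny stop-word set and regex tokenizer are simplifications for illustration (the project presumably used NLTK's full stop-word corpus and tokenizers, which require downloaded corpora, and `WordNetLemmatizer` for the lemmatized variant).

```python
import re
from nltk.stem import PorterStemmer

# Small illustrative stop-word set; a real pipeline would use
# nltk.corpus.stopwords (requires nltk.download("stopwords")).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "for", "on"}

stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, drop stop words, and stem each remaining token."""
    tokens = re.findall(r"[a-z']+", text.lower())        # simple tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stemmer.stem(t) for t in tokens]             # heuristic stemming

# The lemmatized variant would instead apply
# nltk.stem.WordNetLemmatizer, which needs the "wordnet" corpus.
```

For example, `preprocess("The blockchains are running")` reduces "blockchains" and "running" to their stems while dropping the stop words.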
Vectorization is a critical step that converts the text into numerical form via count vectorization and TF-IDF vectorization. Count vectorization records the frequency of each word within the dataset, while TF-IDF vectorization weights words by how distinctive they are across documents. These techniques transform the text into a format suitable for downstream analyses, such as identifying trends or patterns and conducting sentiment analysis. The processed and vectorized data is finally saved to CSV files, making it readily available for future exploration or machine learning models.
Raw NewsAPI Data (Above)
Raw Titles and Descriptions (Below)
Lemmatized Titles and Descriptions (Above)
Stemmed Titles and Descriptions (Below)
Defining and Applying Vectorizers to the Lemmatized and Stemmed Titles and Descriptions
Raw Reddit Data
Cleaning Step Addressing Missing Values
Lemmatizing and Stemming the Reddit Posts' Titles, Selftext, and Comments
Applying Vectorizers to the Reddit Comments, Titles, and Selftext with Edited max_features Values
Output of Both the Count and TF-IDF Vectorizers on the Lemmatized Reddit Comments
Raw Medium Data
Cleaning Steps Reorienting the DataFrame by Tag (Process Shown in Code), Followed by Handling of Missing Values
Lemmatized and Stemmed Medium Data
Applying Vectorizers to the Medium Titles and Descriptions with Edited min_df and max_features Values