This repository houses the code, raw data, and cleaned data for the blockchain text mining project, shared for transparency and reference.
The project covers the acquisition, processing, and analysis of textual data from NewsAPI, Reddit, and Medium, sitting at the intersection of data science and natural language processing (NLP). First, the code fetches articles from NewsAPI for the query "blockchain," iterating through result pages to compile a dataset of article titles and descriptions, which is then loaded into a pandas DataFrame for easy manipulation and analysis. Reddit data is accessed through PRAW, the Python Reddit API Wrapper (chosen in light of the recent official API changes): posts matching "blockchain" are retrieved from all of Reddit, and their titles, comments, and selftext are compiled into a DataFrame. Finally, the script extends its collection to Medium, using Selenium to scrape article titles and descriptions through simulated user search interactions.
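The NewsAPI step can be sketched roughly as below. This is a minimal illustration, not the repository's exact code: the function names are hypothetical, the API key is a placeholder, and the page count and fields are assumptions (NewsAPI's `/v2/everything` endpoint does accept `q`, `page`, and `apiKey` parameters).

```python
import requests
import pandas as pd

def parse_articles(payload):
    """Pull (title, description) pairs out of a NewsAPI response payload."""
    return [
        {"title": a.get("title"), "description": a.get("description")}
        for a in payload.get("articles", [])
    ]

def fetch_blockchain_articles(api_key, pages=5):
    """Page through NewsAPI results for the query 'blockchain'
    and compile them into a pandas DataFrame."""
    rows = []
    for page in range(1, pages + 1):
        resp = requests.get(
            "https://newsapi.org/v2/everything",
            params={"q": "blockchain", "page": page, "apiKey": api_key},
            timeout=10,
        )
        rows.extend(parse_articles(resp.json()))
    return pd.DataFrame(rows, columns=["title", "description"])
```

The Reddit and Medium steps follow the same pattern, swapping the `requests` calls for PRAW queries and Selenium interactions respectively.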
The text data then undergoes extensive processing with standard NLP techniques: tokenization, stop-word removal, lemmatization, and stemming. Lemmatization and stemming both reduce words to a base or root form; lemmatization uses context to produce a meaningful base form, while stemming applies a more heuristic, rule-based approach. Removing stop words discards common but low-information words so the analysis focuses on significant terms, and tokenization breaks the text into individual words or tokens, preparing the data for vectorization.
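A minimal sketch of such a preprocessing pipeline is shown below. It uses NLTK's Porter stemmer; the tiny stop-word set and regex tokenizer are simplifications for illustration (the project presumably used NLTK's full stop-word corpus and tokenizers, which require downloaded corpora, and `WordNetLemmatizer` for the lemmatized variant).

```python
import re
from nltk.stem import PorterStemmer

# Small illustrative stop-word set; a real pipeline would use
# nltk.corpus.stopwords (requires nltk.download("stopwords")).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "for", "on"}

stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, drop stop words, and stem each remaining token."""
    tokens = re.findall(r"[a-z']+", text.lower())        # simple tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stemmer.stem(t) for t in tokens]             # heuristic stemming

# The lemmatized variant would instead apply
# nltk.stem.WordNetLemmatizer, which needs the "wordnet" corpus.
```

For example, `preprocess("The blockchains are running")` reduces "blockchains" and "running" to their stems while dropping the stop words.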
Vectorization is a critical step that converts the text into numerical form via count vectorization and TF-IDF vectorization. Count vectorization records the frequency of each word within the dataset, while TF-IDF vectorization weights words by how distinctive they are across documents. These techniques transform the text into a format suitable for downstream analyses, such as identifying trends or patterns and conducting sentiment analysis. The processed and vectorized data is finally saved to CSV files, making it readily available for future exploration or machine learning models.
Raw NewsAPI Data (Above)
Raw Titles and Descriptions (Below)
Lemmatized Titles and Descriptions (Above)
Stemmed Titles and Descriptions (Below)
Defining and Applying Vectorizers to the Lemmatized and Stemmed Titles and Descriptions
Raw Reddit Data
Cleaning Step Addressing Missing Values
Lemmatizing and Stemming the Reddit Posts' Titles, Selftext, and Comments
Applying Vectorizers to the Reddit Comments, Titles, and Selftext with Edited max_features Values
Output of Both the Count and TF-IDF Vectorizers on the Lemmatized Reddit Comments
Raw Medium Data
Cleaning Steps Reorienting the DataFrame by Tag (Process Shown in Code), Followed by Handling of Missing Values
Lemmatized and Stemmed Medium Data
Applying Vectorizers to the Medium Titles and Descriptions with Edited min_df and max_features Values