Gathering data is like collecting puzzle pieces to understand a big picture, or like gathering ingredients before cooking a meal: we need the right pieces, or the right ingredients, before anything useful can be made. Data gathering involves searching for information from different places, such as books, websites, and interviews, in order to learn more about a topic or to answer specific questions. It is important to gather accurate and reliable data so that the conclusions drawn from it are correct, and it is often necessary to combine many different sources to get a complete picture. Data gathering can be challenging, but it is an important step in many projects; with patience and careful searching, we can gather the information we need to solve problems and make decisions.
For the scope of this project, we have chosen to gather data from three different sources:
Reddit comments spread over the last few years
News articles from newsapi.org
News articles from Fox News
More than 90% of the data that must be gathered for this project is text and cannot be downloaded directly. In other words, APIs and web scraping are the way to go for data collection.
Figure 1 below shows the response of the API call made to the source URL (Reddit). As can be seen, the API call returns the required data in an improper format, so it needs to be restructured before any sense can be made of it.
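As a rough illustration (not the project's exact code), such a call can be made against Reddit's public JSON endpoint; the subreddit name, comment limit, and User-Agent string below are placeholder assumptions:

import requests

def fetch_reddit_comments(subreddit="news", limit=25):
    """Fetch recent comments from a subreddit via Reddit's public JSON endpoint."""
    url = f"https://www.reddit.com/r/{subreddit}/comments.json"
    headers = {"User-Agent": "text-mining-project/0.1"}  # Reddit expects a descriptive User-Agent
    resp = requests.get(url, headers=headers, params={"limit": limit}, timeout=10)
    resp.raise_for_status()
    # The comment text sits deep inside nested JSON (data -> children -> data -> body),
    # which is why the raw response needs reshaping before it can be analyzed.
    children = resp.json().get("data", {}).get("children", [])
    return [child["data"].get("body", "") for child in children]

print(len(fetch_reddit_comments()))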
Web scraping is a technique used to extract information from websites. It involves using code to retrieve data from web pages. Essentially, the program visits a website, analyzes its content, and then extracts the desired data, such as text, images, or links. Think of it as a digital tool that "scrapes" or collects information from the web, enabling users to gather large amounts of data quickly and efficiently.
Figure 2 below shows the results of a web scraping call made to Fox News. As can be seen in the image, the data is fetched but is unstructured and needs to be further formatted.
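As a sketch of what such a scraping call can look like (the URL and the tag/class selector here are illustrative guesses, not necessarily the ones used for Figure 2), requests and BeautifulSoup can be combined as follows:

import requests
from bs4 import BeautifulSoup

def scrape_fox_headlines(url="https://www.foxnews.com/politics"):
    """Download a page and extract headline text; the result is still unstructured."""
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Headline markup changes over time, so the tag and class below may need adjusting.
    return [h.get_text(strip=True) for h in soup.find_all("h4", class_="title")]

print(scrape_fox_headlines()[:5])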
Figures 3 and 4 show the API calls made to Reddit and NewsAPI to fetch the relevant data.
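The Reddit call is sketched above; a minimal sketch of the newsapi.org request (with a placeholder query and API key rather than the project's actual parameters) could look like this:

import requests

def fetch_news_articles(query="politics", api_key="YOUR_NEWSAPI_KEY"):
    """Query NewsAPI's 'everything' endpoint and return (title, description) pairs."""
    url = "https://newsapi.org/v2/everything"
    params = {"q": query, "language": "en", "pageSize": 50, "apiKey": api_key}
    resp = requests.get(url, params=params, timeout=10)
    resp.raise_for_status()
    # Each entry in 'articles' is a dict with fields such as title, description, and content.
    articles = resp.json().get("articles", [])
    return [(a.get("title", ""), a.get("description") or "") for a in articles]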
As can be seen, the fetched data is in a very messy format and needs to be formatted. The images below showcase the initial steps taken to store the data in a proper format so that it can be cleaned.
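One possible way to move the fetched text into a tabular form before cleaning is sketched below; the sample lists stand in for the collected Reddit, Fox News, and NewsAPI text, and the column and file names are illustrative choices rather than the project's actual ones:

import pandas as pd

reddit_comments = ["example reddit comment one", "example reddit comment two"]
fox_headlines = ["Example Fox News headline"]
newsapi_articles = [("Example title", "Example description")]

reddit_df = pd.DataFrame({"source": "reddit", "text": reddit_comments})
fox_df = pd.DataFrame({"source": "fox_news", "text": fox_headlines})
news_df = pd.DataFrame(newsapi_articles, columns=["title", "description"])
news_df["source"] = "newsapi"
news_df["text"] = news_df["title"] + ". " + news_df["description"]

# Combine everything into one raw corpus and persist it so cleaning starts from a fixed file.
corpus = pd.concat([reddit_df, fox_df, news_df[["source", "text"]]], ignore_index=True)
corpus.to_csv("raw_corpus.csv", index=False)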
Data cleaning is a crucial step in text mining, ensuring that the data used for analysis is accurate and reliable. In this process, various techniques are employed to handle issues like missing values, duplicates, and inconsistencies in the text data. For instance, removing special characters, punctuation marks, and irrelevant symbols helps to standardize the text and make it uniform. Additionally, stopwords, which are common words like "the" and "and," are often removed to focus on the meaningful content. Furthermore, techniques such as stemming or lemmatization are applied to reduce words to their root forms, aiding in text analysis and improving computational efficiency. Overall, by performing thorough data cleaning, text mining algorithms can generate more accurate insights from the data, leading to better decision-making and understanding of the underlying patterns in text.
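A small cleaning sketch covering the steps just described (lowercasing, removing special characters and punctuation, and dropping stopwords) is shown below; it uses NLTK's English stopword list and a simple regular expression, which is one possible choice rather than the only one:

import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def clean_text(text):
    """Lowercase, strip punctuation and special characters, and drop common stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove digits, punctuation, and symbols
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean_text("The API returned 250 comments -- most of them about the election!"))
# prints: api returned comments election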
To further pre-process the data, we have incorporated the techniques below (though not limited to these); a combined code sketch follows the list:
Stemming: This is the process of reducing words to their root or base form by removing suffixes. It helps reduce the dimensionality of the data.
Lemmatization: Similar to stemming, lemmatization also reduces words to their base form, but it ensures that the resulting word belongs to the language and is meaningful. It typically requires more computational resources compared to stemming but produces better results.
CountVectorizer: This is a technique used to convert a collection of text documents into a matrix of token counts. It counts the frequency of each word in the document and represents it as a numerical vector.
TfidfVectorizer: This is similar to CountVectorizer but uses a term frequency-inverse document frequency (TF-IDF) approach, which helps in identifying the importance of words in a document relative to the entire corpus. It penalizes words that are too common across documents.
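The following combined sketch applies all four techniques to a toy two-sentence corpus (the sentences are placeholders, not rows from the actual datasets), using NLTK for stemming and lemmatization and scikit-learn for the vectorizers:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

docs = [
    "studies show voters are reading fewer printed newspapers",
    "the study shows a voter reads news articles online",
]

# Stemming: chop suffixes to reach a crude root form ("studies" -> "studi").
stemmer = PorterStemmer()
print([[stemmer.stem(w) for w in d.split()] for d in docs])

# Lemmatization: map words to dictionary forms ("studies" -> "study"); slower but cleaner.
lemmatizer = WordNetLemmatizer()
print([[lemmatizer.lemmatize(w) for w in d.split()] for d in docs])

# CountVectorizer: document-term matrix of raw token counts.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
print(count_vec.get_feature_names_out())
print(counts.toarray())

# TfidfVectorizer: same matrix shape, but down-weights terms shared by every document.
tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(docs).toarray().round(2))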
The images below showcase the "before" and "after" of each dataset once the above-mentioned techniques are applied to them.
Dataset 1: Stemming
Dataset 1: Lemmatization
Dataset 1: CountVectorizer and TfidfVectorizer
Dataset 1: WordCloud
Dataset 2: Stemming
Dataset 2: Lemmatization
Dataset 2: CountVectorizer and TfidfVectorizer
Dataset 2: WordCloud