Srimedha - Data Gathering

The dataset for this research was compiled from two separate sources. Data from the first source, Kaggle, will be used to train the model, while data from the second source, NewsAPI, will be used to test it. The training data comes in two files, one containing true news and one containing fake news; combining these two files yields the complete training dataset. The testing data returned by the API is in JSON format. It will be converted to CSV format and then cleaned by deleting extraneous columns. The link to the original Kaggle dataset can be found here and here.
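The two preparation steps described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the exact code used in this project. The function names, the `label` column with its "true"/"fake" values, and the list of fields in `KEEP_FIELDS` are all assumptions made for the example. For brevity the sketch drops the unwanted fields while writing the CSV rather than in a separate pass afterwards.

```python
import csv
import io
import json

# Assumed set of fields worth keeping; the real selection may differ.
KEEP_FIELDS = ["title", "description", "content", "publishedAt"]


def label_and_combine(true_rows, fake_rows):
    """Merge true-news and fake-news rows into one list with a 'label' column."""
    combined = [dict(row, label="true") for row in true_rows]
    combined += [dict(row, label="fake") for row in fake_rows]
    return combined


def articles_to_csv(api_json, keep_fields=KEEP_FIELDS):
    """Turn a JSON response holding an 'articles' list into CSV text.

    Only the fields named in keep_fields are written; everything else
    (for example the article URL or the nested source object) is dropped.
    """
    articles = json.loads(api_json).get("articles", [])
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=keep_fields)
    writer.writeheader()
    for article in articles:
        # Missing or null values are written as empty strings.
        writer.writerow({field: article.get(field) or "" for field in keep_fields})
    return buffer.getvalue()
```

In practice the rows passed to `label_and_combine` would be read from the two Kaggle files (for instance with `csv.DictReader`), and the string passed to `articles_to_csv` would be the body of the API response.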

The data originally extracted from the API looked like this:

After formatting and a first round of cleaning, the data looks like this:

The Kaggle dataset arrived well formatted, but it still needed some cleaning and preparation before EDA and data modeling could be performed. The initial Kaggle data looked like this:

Data cleaning plays a pivotal role in ensuring the reliability and accuracy of a dataset. Systematically identifying and correcting inconsistencies, errors, and missing values improves the quality and integrity of the data, which in turn makes later analyses more reliable. Effective cleaning removes noise and biases that could otherwise distort analytical results. Clean data also promotes transparency and trust among stakeholders, because the information they rely on is accurate and consistent.

After cleaning, it looked like this (the details of exactly what was done are given in the Data Cleaning section):

Data Cleaning