An essential aspect of democracy is the freedom of the press - assuming the press tells the truth. However, news companies are run by people, and people are inherently biased, and this tends to show through in the content a publication produces. Having opinions is not a bad thing (everyone has them, after all), but when your job is to report objective information about what is going on in the world, it is important to remain as objective as possible. In our project, we tried to figure out a way to filter out news articles that take a more subjective approach, in an attempt to be shown only content that is reported objectively.
After a very dedicated Google search, we found a well-reviewed dataset on kaggle.com by Andrew Thompson that features 143,000 articles from 15 popular news publications with varied reputations and political leanings: the New York Times, Breitbart, CNN, Business Insider, the Atlantic, Fox News, Talking Points Memo, Buzzfeed News, National Review, New York Post, the Guardian, NPR, Reuters, Vox, and the Washington Post. The articles reportedly range mostly from the beginning of 2016 to July 2017, but some may have been published in 2015 or earlier.
The articles are not evenly distributed across publications (there are considerably more articles from Breitbart than from any other publication, and relatively few from Fox News and Vox), but this is mostly because some sites are more prolific than others.
The articles were divided into three CSV files, each with about 50,000 articles from a handful of publications and a wide range of timestamps.
Thank you to Andrew Thompson on kaggle.com for taking the time to compile this dataset from RSS feeds. We really appreciate it. We've linked his page here.
To determine which articles are biased, we used TextBlob's "Subjectivity" tool. TextBlob is an existing module available to Python coders that can analyze text along several attributes, such as polarity and subjectivity.
TextBlob gives each sentence a subjectivity rating on a scale of 0 to 1. The closer the value is to 1, the more subjective the sentence is.
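As a quick illustration of what that rating looks like in code (the two example sentences below are our own, not taken from the dataset):

```python
from textblob import TextBlob

# A drier, fact-stating sentence versus a more opinionated one (made-up examples).
factual = TextBlob("The committee voted 7-2 to approve the budget on Tuesday.")
opinion = TextBlob("This was an absolutely disgraceful and shameful decision.")

# .sentiment.subjectivity is a float from 0 (objective) to 1 (subjective).
print(factual.sentiment.subjectivity)
print(opinion.sentiment.subjectivity)
```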
TextBlob analyzes one sentence at a time, so we split each article into individual sentences (stored as strings) and then combined those results to get an overall subjectivity rating for each article. To keep longer articles from ending up with a higher subjectivity rating simply because they contain more sentences, we divided the overall rating by the number of strings in the article. This gives us our BiasRating. A higher number (again between 0 and 1) means the article is more biased.
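Here is a minimal sketch of that calculation for a single article. The bias_rating function name is ours, and we use TextBlob's own sentence splitter here; the original script may break sentences apart differently.

```python
from textblob import TextBlob

def bias_rating(article_text):
    """Average per-sentence subjectivity: 0 is objective, 1 is subjective."""
    sentences = TextBlob(article_text).sentences  # split the article into sentences
    if not sentences:
        return 0.0
    total = sum(s.sentiment.subjectivity for s in sentences)  # running sum of ratings
    return total / len(sentences)  # divide by the number of sentences
```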
The average BiasRating was x. We filtered out any articles that had a BiasRating higher than y, in the hopes that this would give us the least biased articles possible.
First, we imported the modules the code needs to process everything we're doing: "csv" lets the code read the text file(s), and "TextBlob" analyzes the subjectivity of each string.
The "counter" section keeps track of how many articles are printed.
Lines 14-16 split each article into its individual sentence strings.
Lines 17-20 give the subjectivity rating of each sentence and add all the sentences' ratings together (sum). At the same time, they count how many sentences there are per article (sentence_counter).
Line 21 divides the sum by the number of sentences to get the biasrating.
Lines 22-26 set the parameters so that only articles with a low biasrating are printed, and count those articles. Once the counter reaches 5, the code stops printing "objective" articles. If the user runs the code again, they will receive another 5 objective articles. A rough sketch of the whole pipeline is shown below.
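The sketch below follows the steps described above but will not match the original script line for line. The filename and the "content"/"title" column names are our assumption based on the Kaggle CSV files, and the 0.3 cutoff is only a placeholder for the actual threshold (y) mentioned earlier.

```python
import csv
from textblob import TextBlob

THRESHOLD = 0.3   # placeholder cutoff; swap in the real value of y
MAX_ARTICLES = 5  # stop after printing this many "objective" articles

counter = 0  # how many low-bias articles have been printed so far

with open("articles1.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        sentences = TextBlob(row["content"]).sentences  # split the article into sentences
        if not sentences:
            continue

        # Sum the per-sentence subjectivity ratings and count the sentences.
        total = sum(s.sentiment.subjectivity for s in sentences)
        biasrating = total / len(sentences)

        # Only print articles below the cutoff, and stop once we have enough.
        if biasrating < THRESHOLD:
            print(row.get("title", "(untitled)"), round(biasrating, 3))
            counter += 1
            if counter == MAX_ARTICLES:
                break
```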
When defining bias, it is difficult to not be biased. This is why we used TextBlob's subjectivity rating instead of our own, but it is definitely possible that the way they define bias is also... biased.
We also neglected to read through all 143,000 articles to determine whether the algorithm is 100% accurate. From the few that we have read, it seems to work fine, but there is definitely room for refinement.
Even after filtering, the articles are still a little difficult to read. If given more than three days to work, we could probably figure out a way to refine this.
Our data is also outdated. The most recent article is from July 2017, and we have yet to experiment with articles that are currently being published. In the future, we would like to apply our system to current articles, but we haven't found a dataset that allows us to do this.