Introduction

Problem Definition and Dataset Summary

What is the problem and why? Misinformation has been widely studied under various themes, including conspiracy theories, rumors, hoaxes, fake news and information credibility. Most of this research has studied misinformation on social media. The effects of consuming misinformative content can be disastrous. For example, one study found that a significant number of YouTube users came to believe that the Earth is flat after watching videos promoting the Flat Earth theory. Another study, which examined political bias in YouTube videos, showed that videos supporting Trump promoted conspiracies and fake news about Clinton. Prior work has also found that content consumed online is usually highly trusted and can change people’s political views and voting behaviors.

The main objective of this project is to build machine learning models capable of detecting misinformative YouTube videos and Amazon items (fake news, conspiracies, rumors, hoaxes) by classifying such content based on textual features learned from users’ comments on YouTube videos and customers’ reviews of Amazon items. We classify such content into three classes: (1) promoting, (2) opposing, or (3) neutral to misinformation. We train classifiers on two datasets. The first is a dataset of 2943 YouTube videos from a previous study, in which we collected around 3k YouTube videos and annotated them into different classes, including the three just mentioned. Among those 3k videos, we selected only those that promote (pro misinfo), oppose (anti misinfo) or have a neutral stance towards misinformation.
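To make the three-class setup concrete, the following is a minimal sketch of a text classifier trained on comment text. It assumes a multinomial Naive Bayes model with add-one smoothing over word counts, which is an illustrative stand-in and not necessarily the model used in this project. The class name `NaiveBayesStance`, the `tokenize` helper and the toy training comments are all hypothetical and invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesStance:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        self.vocab = set()
        for doc, label in zip(documents, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, document):
        tokens = tokenize(document)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / self.total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy comments invented for illustration; one string per video or item.
train_docs = [
    "vaccines cause harm they are hiding the truth wake up",
    "the truth is hidden wake up people",
    "this is debunked the science is clear vaccines are safe",
    "debunked many times science shows it is safe",
    "nice video thanks for sharing",
    "thanks great editing nice music",
]
train_labels = ["promoting", "promoting", "opposing",
                "opposing", "neutral", "neutral"]

model = NaiveBayesStance().fit(train_docs, train_labels)
prediction = model.predict("science has debunked this claim")
```

In practice, all comments or reviews belonging to one video or item would be concatenated (or otherwise aggregated) into a single document before training, since the label applies to the video or item rather than to an individual comment.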

Table 1: Number of YouTube videos and Amazon items, and number of users’ comments and customers’ reviews, collected from each platform.

Table 2: Sample of comments and reviews from a YouTube video and an Amazon item that both promote misinformation about vaccines.

The videos are categorized under five misinformation/conspiracy topics: 9/11 conspiracies, Chemtrails conspiracy theories, Flat Earth misinformation, Moon Landing conspiracies and misinformation related to Vaccine controversies. We collected all the users’ comments on each of those videos, resulting in around 1.7M comments. The second dataset was collected in a similar study on Amazon, where we collected reviews written by customers on items related to Vaccine controversies; each item is classified into one of the same three classes as in the first dataset. That resulted in 1419 unique items that are either promoting (pro misinfo), opposing (anti misinfo) or neutral to misinformation surrounding vaccines. We then collected all 39.6k reviews written for those items. The number of videos/items and users’ comments/reviews collected from each platform are summarized in Table 1. Samples of users’ comments and reviews from a YouTube video and an Amazon item are shown in Table 2. As shown in the next figures, the number of comments on anti misinfo videos greatly surpasses the number of comments on the other video types. This encouraged us to dive deeper into the reasons behind this large difference, and into whether there are malicious comments and bots designed just to increase the popularity of misleading items, such as those promoting chemtrails conspiracies.
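The per-class comparison of comment volume described above can be sketched as a simple aggregation. The record layout `(video_id, stance_label, comment_text)`, the `comment_stats` function and the sample rows are assumptions made for illustration; they do not reflect the actual dataset schema or its counts.

```python
from collections import Counter

# Hypothetical records: (video_id, stance_label, comment_text).
# These rows are invented for illustration only.
comments = [
    ("v1", "anti", "This has been debunked."),
    ("v1", "anti", "Good explanation."),
    ("v1", "anti", "Thanks for the sources."),
    ("v2", "pro", "Wake up people."),
    ("v3", "neutral", "Interesting history."),
    ("v3", "neutral", "Nice footage."),
]


def comment_stats(records):
    """Return {label: (n_comments, n_videos, mean comments per video)}."""
    comments_per_label = Counter(label for _, label, _ in records)
    videos_per_label = {}
    for video_id, label, _ in records:
        videos_per_label.setdefault(label, set()).add(video_id)
    return {
        label: (n, len(videos_per_label[label]),
                n / len(videos_per_label[label]))
        for label, n in comments_per_label.items()
    }


stats = comment_stats(comments)
```

Reporting the mean number of comments per video alongside the raw totals helps separate a class that simply contains more videos from one whose videos each attract unusually many comments.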

Why classify based on users’ comments? Users’ comments on content (e.g. a video, post or tweet) classified as misinformative have been found to be a potential signal of the content’s credibility level. We reached the same conclusion while manually annotating the datasets mentioned previously: in most cases, users’ comments and reviews indicated the stance of the annotated content toward misinformation.

Description of Amazon Items and Reviews


Description of YouTube Videos and Comments