Fake News Detection

Fake news is information intended to manipulate people for various purposes, including terrorism, political elections, advertising, and satire. On social networks, misinformation spreads among thousands of people within seconds, so tools are needed to help limit the amount of false information on the web. Related tasks include detecting popularity in social networks and detecting subjectivity in messages posted on these media. A fake news detection system aims to help users detect and filter out potentially deceptive news. Predicting intentionally misleading news relies on analyzing previously reviewed truthful and fraudulent news, i.e., annotated corpora.

The Spanish Fake News Corpus is a collection of news items compiled from several web sources: established newspaper websites, media company websites, websites dedicated to validating fake news, and websites that various journalists have identified as regularly publishing fake news. The news items were collected from January to July 2018, and all of them are written in Mexican Spanish.

The corpus contains 971 news items, collected from January to July 2018 from the following sources:

  • Established newspaper websites,
  • Media company websites,
  • Websites dedicated to validating fake news,
  • Websites that various journalists have identified as regularly publishing fake news.

The corpus was labeled manually using only two classes (true or fake):

  • A news item is true if there is evidence that it has been published on reliable sites.
  • A news item is fake if it is contradicted by news from reliable sites or from websites specialized in detecting deceptive content, or if no evidence about it was found other than its source.
  • We collected the true-fake news pair for each event, so news items in the corpus are correlated.
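
The labeling itself was done by hand, but the two criteria above can be read as a small decision rule. The following is only an illustrative sketch: the function name and its two boolean inputs are hypothetical, and the choice to let a contradiction outweigh a confirmation is an assumption, since the criteria do not say which takes precedence.

```python
def label_news(confirmed_by_reliable_site: bool,
               contradicted_by_reliable_site: bool) -> str:
    """Return "True" or "Fake" following the two labeling criteria.

    Both arguments stand for judgments made by a human annotator;
    nothing here checks a website automatically.
    """
    # Assumption: a contradiction from a reliable or fact-checking
    # site outweighs any confirmation (precedence is not specified).
    if contradicted_by_reliable_site:
        return "Fake"
    # Evidence of publication on reliable sites makes the item true.
    if confirmed_by_reliable_site:
        return "True"
    # No evidence besides the original source: labeled fake.
    return "Fake"


label_confirmed = label_news(True, False)
label_contradicted = label_news(False, True)
label_unsupported = label_news(False, False)
```

Note that an item with no supporting evidence at all ends up in the fake class, which is the second half of the fake criterion.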

To avoid topic bias, the corpus covers news from 9 different topics: Science, Sport, Economy, Education, Entertainment, Politics, Health, Security, and Society. As can be seen in the table below, the numbers of fake and true news items are roughly balanced. Approximately 70% will be used as the training corpus (676 news items) and the remaining 30% as the testing corpus (295 news items).
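
As a quick check of the split sizes quoted above, the short sketch below confirms that the two parts add up to the full corpus and computes the exact shares. The constant and function names are chosen here for illustration only; the counts are the ones stated in the text.

```python
# Counts as stated in the corpus description.
TOTAL_ITEMS = 971
TRAIN_ITEMS = 676
TEST_ITEMS = 295


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# The training and testing parts together cover the whole corpus.
assert TRAIN_ITEMS + TEST_ITEMS == TOTAL_ITEMS

train_share = share(TRAIN_ITEMS, TOTAL_ITEMS)
test_share = share(TEST_ITEMS, TOTAL_ITEMS)
print(f"train: {train_share}%, test: {test_share}%")
```

The exact shares are 69.6% and 30.4%, which is why the split is described as approximately 70/30.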

The training corpus contains the following information:

Category: Fake / True

Topic: Science / Sport / Economy / Education / Entertainment / Politics / Health / Security / Society

Headline: The title of the news.

Text: The complete text of the news.

Link: The URL where the news was published.
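
The five fields above map naturally onto a small record type. The sketch below assumes, purely for illustration, that the data is available as CSV with a header row using the field names listed here; the actual distribution format is not stated in this description, and `NewsItem`, `read_news`, and the sample rows are hypothetical.

```python
import csv
import io
from dataclasses import dataclass

CATEGORIES = {"Fake", "True"}
TOPICS = {"Science", "Sport", "Economy", "Education", "Entertainment",
          "Politics", "Health", "Security", "Society"}


@dataclass(frozen=True)
class NewsItem:
    category: str
    topic: str
    headline: str
    text: str
    link: str


def read_news(handle):
    """Parse CSV rows with Category, Topic, Headline, Text, Link columns.

    Assumption: the column names match the field names in the corpus
    description. Unknown categories or topics raise ValueError.
    """
    items = []
    for row in csv.DictReader(handle):
        category = row["Category"].strip()
        topic = row["Topic"].strip()
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        if topic not in TOPICS:
            raise ValueError(f"unknown topic: {topic!r}")
        items.append(NewsItem(category, topic, row["Headline"],
                              row["Text"], row["Link"]))
    return items


# Made-up rows for demonstration; these are not taken from the corpus.
sample = io.StringIO(
    "Category,Topic,Headline,Text,Link\n"
    "True,Science,Example headline,Example body text,https://example.com/a\n"
    "Fake,Health,Another headline,Another body text,https://example.com/b\n"
)
items = read_news(sample)
```

Validating Category and Topic at load time catches labeling typos early, before they silently create extra classes in a classifier.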