Task & Data

Task description

The objective of this task is to decide if a news item is fake or real by analyzing its textual representation.

The task is framed as closed-class detection, i.e., each news item is either fake or real.

Data description

The corpus consists of news compiled mainly from Mexican web sources: established newspaper websites, media company websites, special websites dedicated to validating fake news, and websites designated by different journalists as sites that regularly publish fake news. The Spanish Fake News Corpus covers the following 9 topics: Science, Sport, Economy, Education, Entertainment, Politics, Health, Security, and Society. In this new edition, we will make the complete Spanish Fake News Corpus available for training the systems, i.e., both the training and testing partitions of the corpus.

The news articles in the Spanish Fake News Corpus were collected from January to July of 2018, and all of them were written in Mexican Spanish.

The corpus has 971 news items collected from January to July 2018 from the following sources:

Established newspaper websites,
Media company websites,
Special websites dedicated to validating fake news,
Websites designated by different journalists as sites that regularly publish fake news.

The corpus was labeled manually with only two classes (true or fake), according to the following criteria:

A news article is true if there is evidence that it has been published on reliable sites.
A news article is fake if news from reliable sites, or from websites specialized in detecting deceptive content, contradicts it, or if no evidence about the news was found other than the source itself.

For each event we collected a true-fake pair of news items, so the news items in the corpus are correlated.

To avoid topic bias, the corpus covers news from 9 different topics: Science, Sport, Economy, Education, Entertainment, Politics, Health, Security, and Society. As can be seen in the table below, the numbers of fake and true news items are quite balanced.

The corpus contains the following information:

Category: Fake/ True
Topic: Science/ Sport/ Economy/ Education/ Entertainment/ Politics/ Health/ Security/ Society
Source: The name of the source media.
Headline: The title of the news.
Text: The complete text of the news.
Link: The URL where the news was published.
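
As an illustration of the fields listed above, the following minimal Python sketch reads rows with those six columns and counts news items per topic and class. The CSV layout, the sample rows, and the helper names `load_corpus` and `class_counts_by_topic` are assumptions made for this example; the actual distribution format of the corpus is not specified here.

```python
import csv
import io
from collections import Counter

# Column names taken from the field list above.
EXPECTED_COLUMNS = ["Category", "Topic", "Source", "Headline", "Text", "Link"]

# Hypothetical sample rows, NOT real corpus data. A CSV layout is assumed.
SAMPLE = """Category,Topic,Source,Headline,Text,Link
True,Science,Example Source A,Headline one,Body text one,https://example.org/a
Fake,Science,Example Source B,Headline two,Body text two,https://example.org/b
True,Politics,Example Source C,Headline three,Body text three,https://example.org/c
"""


def load_corpus(handle):
    """Read all rows and check that every expected column is present."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def class_counts_by_topic(rows):
    """Count news items for each (Topic, Category) pair."""
    counts = Counter()
    for row in rows:
        counts[(row["Topic"], row["Category"])] += 1
    return counts


rows = load_corpus(io.StringIO(SAMPLE))
counts = class_counts_by_topic(rows)
for (topic, category), n in sorted(counts.items()):
    print(topic, category, n)
```

With the real corpus, the same counts could be used to check how balanced the two classes are within each topic.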

For the evaluation of systems, we will use a new testing corpus containing news related to COVID-19 and news from other Ibero-American countries. This introduces two main challenges to the task: thematic variation and language variation. Participating systems need to take into account that one part of the testing corpus contains news on a topic that does not exist in the training corpus, and that the other part contains news in a variety of Spanish different from the one in the training corpus. The test data only includes the Id, Headline and Text columns.
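
Because the test data carries only Id, Headline and Text, a system can rely only on Headline and Text at prediction time. The sketch below shows one possible way to build the same input string from a training row and a test row. The helper `to_model_input`, the choice to join the two fields with a space, and the sample rows are assumptions for illustration, not part of the task definition.

```python
# Text fields available in both the training corpus and the test data.
SHARED_TEXT_COLUMNS = ("Headline", "Text")


def to_model_input(row):
    """Join the shared text fields so train and test rows are encoded alike."""
    return " ".join(row[column].strip() for column in SHARED_TEXT_COLUMNS)


# Hypothetical rows, NOT real corpus data.
train_row = {
    "Category": "Fake",
    "Topic": "Health",
    "Source": "Example Source",
    "Headline": "Titular de ejemplo",
    "Text": "Cuerpo de ejemplo.",
    "Link": "https://example.org/x",
}
test_row = {"Id": "1", "Headline": "Otro titular", "Text": "Otro cuerpo."}

train_input = to_model_input(train_row)
test_input = to_model_input(test_row)
print(train_input)
print(test_input)
```

Features derived from Topic, Source or Link would have no counterpart in the test data, which is why this sketch ignores them.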
