SemEval 2022 Task 4:

Patronizing and Condescending Language Detection

What is Patronizing and Condescending Language (PCL)?

We are all patronizing and condescending sometimes, and of course we are all susceptible to being condescended to and patronized by others. But some groups are, unfortunately, more accustomed to being referred to with this undervaluing treatment. So-called vulnerable communities seem to be the perfect target for charity- and pity-driven texts, condescension and patronization in news stories.


PCL is often involuntary and unconscious, and the authors using such language are usually trying to help the communities in need, by raising awareness, moving the audience to action or standing up for the rights of the under-represented. But PCL can potentially be very harmful, as it feeds stereotypes, routinizes discrimination and leads to greater exclusion.

THE TASK

The Patronizing and Condescending Language Detection Task is based on the paper Don't Patronize Me! An Annotated Dataset with Patronizing and Condescending Language Towards Vulnerable Communities (Perez-Almendros et al., 2020).

The aim of this task is to identify PCL, and to categorize the linguistic techniques used to express it, specifically when referring to communities identified as being vulnerable to unfair treatment in the media.

Participants are provided with sentences in context (paragraphs), extracted from news articles, in which one or several predefined vulnerable communities are mentioned. The challenge is divided into two subtasks.


  1. Subtask 1: Binary classification. Given a paragraph, a system must predict whether or not it contains any form of PCL.

  2. Subtask 2: Multi-label classification. Given a paragraph, a system must identify which PCL categories express the condescension. Our PCL taxonomy has been defined based on previous work on PCL. We consider the following categories:

  • Unbalanced power relations.

  • Shallow solution.

  • Presupposition.

  • Authority voice.

  • Metaphor.

  • Compassion.

  • The poorer, the merrier.


Find out more about the task on our CodaLab page!

THE DATA

The seed data for this task is the Don't Patronize Me! dataset, a collection of paragraphs mentioning vulnerable communities, published in media in 20 English-speaking countries. The paragraphs are manually annotated to show 1) whether the text contains any kind of PCL, and 2) if it contains PCL, which linguistic techniques (categories) are used to express the condescension.

For this task, we divide the data into train and test sets, which will be made available as follows:


TRAIN DATA - Already available! Check it out here

The main data for this challenge consists of 10,636 paragraphs for the binary classification subtask (dontpatronizeme_pcl.tsv), and 2,792 instances for the category classification subtask (dontpatronizeme_categories.tsv).

dontpatronizeme_pcl.tsv --> Contains paragraphs annotated with a label from 0 (not containing PCL) to 4 (highly patronizing or condescending) towards vulnerable communities.

The format of each line is:

paragraph_id keyword country_code paragraph label
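The 0-4 score above can be collapsed into the binary label needed for subtask 1: paragraphs scored 2, 3 or 4 are the ones treated as containing PCL (as noted for the categories file below). A minimal loader sketch, assuming the tab-separated column order shown above; the function name is ours, and any preamble lines in the released file may need extra handling:

```python
import csv

def load_pcl_binary(path):
    """Load dontpatronizeme_pcl.tsv rows and map the 0-4 PCL score to a binary label.

    Paragraphs scored 0 or 1 are treated as negative (no PCL);
    paragraphs scored 2, 3 or 4 as positive.
    """
    rows = []
    with open(path, encoding="utf-8") as f:
        for fields in csv.reader(f, delimiter="\t"):
            if len(fields) != 5:
                continue  # skip malformed or preamble lines, if any
            par_id, keyword, country, paragraph, label = fields
            rows.append({
                "par_id": par_id,
                "keyword": keyword,
                "country": country,
                "text": paragraph,
                "binary_label": 1 if int(label) >= 2 else 0,
            })
    return rows
```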

dontpatronizeme_categories.tsv --> Contains the paragraphs annotated as containing PCL in the previous subdataset (labels 2, 3 or 4), with annotations on the strategies (categories) used to express the condescension and the exact text span where the PCL occurs.

The format of each line is:

paragraph_id paragraph keyword country_code span_start span_end span_text category_label number_of_annotators_agreeing_in_that_label
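Since each paragraph can carry several span-level annotations, subtask 2 is naturally framed as multi-hot encoding over the seven categories from the taxonomy above. A sketch under the column order shown above; the exact category label strings in the released file are an assumption and may need adjusting:

```python
import csv

# The seven PCL categories from the taxonomy, in a fixed order so each
# paragraph can be encoded as a 7-dimensional binary vector.
CATEGORIES = [
    "Unbalanced_power_relations",
    "Shallow_solution",
    "Presupposition",
    "Authority_voice",
    "Metaphor",
    "Compassion",
    "The_poorer_the_merrier",
]

def load_pcl_categories(path):
    """Aggregate span-level category annotations into one multi-hot vector per paragraph."""
    labels = {}
    with open(path, encoding="utf-8") as f:
        for fields in csv.reader(f, delimiter="\t"):
            if len(fields) != 9:
                continue  # skip malformed or preamble lines, if any
            par_id, category = fields[0], fields[7]
            vec = labels.setdefault(par_id, [0] * len(CATEGORIES))
            if category in CATEGORIES:
                vec[CATEGORIES.index(category)] = 1
    return labels
```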


TEST DATA - Full data will be made available after the end of the competition.

We will share the test data (without labels) for the evaluation phase. After SemEval 2022 ends, you will be able to access the full data (with labels) upon request. It contains around 4,000 manually annotated paragraphs with PCL annotations (binary and category labels).

MORE ABOUT PCL: