CIQ Task in FIRE 2019
Background & Motivation for the CIQ Task
Community Question Answering (CQA) has grown rapidly in popularity in recent years. With the rise of sites such as Yahoo! Answers, Cross Validated, Stack Overflow and Quora, more and more people use web forums to get answers to their questions. These forums let people post queries online and have multiple experts across the world answer them, while also contributing their own opinions or expertise to help other users. This encourages participation and has consequently driven the forums' popularity.
While general-purpose CQA forums typically contain Information Seeking Questions (ISQ), many questions are not sincere requests for information. Examples include questions intended to troll specific groups or communities, questions intended to incite hate speech, questions involving objectionable content, and opinions posted in the guise of questions. Such questions, which are not true requests for information, are termed Non-Information Seeking Questions (NISQ) or Insincere Questions. Content moderators of CQA forums filter out such insincere and rhetorical questions.
Given the scale of CQA forums on the web, identifying Insincere Questions is challenging for human moderators. Hence automated approaches are needed, in addition to manual moderation, to filter insincere questions on CQA sites. While this can be simplistically formulated as a binary classification task (a question is either sincere or insincere), insincere or non-information seeking questions exhibit diverse characteristics, and there are a number of reasons why a question may be classified as insincere. These include the following categories:
● Rhetorical questions: These are questions which are non-neutral and convey an opinion or take a stand. A true request for information does not take a stance or express an opinion; questions which phrase an opinion as a question are not requests for information and are typically insincere.
● Disparaging/Inflammatory questions: These are questions which are intended to insult or attack certain groups of people. For instance, sexist and trolling comments are often posted as questions intended to offend specific minority groups or individuals.
● Hypothetical or Unreal questions: These are questions which are not about real situations and are meant to be fictitious. Typically they are hypothetical and have an unreal context.
● Objectionable content questions: These are questions which use sexual content for shock value.
Given that insincere questions on CQA forums differ in subtle ways, the track aims to find a suitable model for Categorisation of Insincere Questions (CIQ) on CQA forums. Instead of treating this as a simple binary short-text classification problem (CQA questions are typically short texts), we propose that categorising insincere/non-information seeking questions at a finer granularity will not only improve identification, but also enable effective counter-measures based on the fine-grained category a question belongs to. For example, while sexually explicit questions need immediate takedown of the question or suspension of the user, rhetorical questions may only need the poster of the question to be cautioned.
Therefore, a fine-grained label set for CQA questions, aimed at identifying insincere questions, has been defined as follows:
1. Rhetorical questions
2. Hate speech/inflammatory questions
3. Hypothetical questions
4. Sexually explicit/objectionable content questions
5. Other (which is the catch-all bucket for insincere questions which cannot be classified as any of the first four categories)
6. Sincere/true Information Seeking questions
The task is to differentiate true information seeking questions (ISQ) from non-information seeking questions (NISQ) through fine-grained categorisation of CQA questions.
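The relationship between the six fine-grained labels and the binary ISQ/NISQ distinction can be sketched in a few lines of Python. This is only an illustration: the class and function names are hypothetical, the integer values simply follow the numbering of the list above (the released data may encode labels differently), and the binary convention of 1 for insincere and 0 for sincere is assumed to match the `target` column of the Quora dataset.

```python
from enum import IntEnum


class QuestionCategory(IntEnum):
    """Hypothetical encoding that follows the numbering of the label list."""
    RHETORICAL = 1
    HATE_SPEECH = 2
    HYPOTHETICAL = 3
    SEXUALLY_EXPLICIT = 4
    OTHER = 5        # catch-all for insincere questions outside categories 1-4
    SINCERE = 6      # true information seeking question (ISQ)


def is_insincere(category):
    """Return True for every category except SINCERE (i.e. for NISQ)."""
    return QuestionCategory(category) is not QuestionCategory.SINCERE


def to_binary_target(category):
    """Collapse a fine-grained label to a binary one.

    Assumption: 1 = insincere, 0 = sincere.
    """
    return 1 if is_insincere(category) else 0
```

Under these assumptions, `to_binary_target(QuestionCategory.HYPOTHETICAL)` gives 1 and `to_binary_target(6)` gives 0, so any fine-grained prediction can also be scored against the binary labels.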
Quora has released a labelled dataset for the binary classification of Sincere/Insincere Questions, which is available at: https://www.kaggle.com/c/quora-insincere-questions-classification/data. This is a 2-class labelled dataset. We provide a small subset of 900 Quora questions enhanced with the fine-grained category labels defined earlier. Participants are free to use the complete Quora dataset (which is annotated only with binary sincere/insincere labels) for the task.
We deliberately release only a small dataset annotated with fine-grained categories, since our interest is to encourage unsupervised, weakly supervised and distantly supervised techniques for this task: in real life, annotating fine-grained category information would be expensive and would not scale. Hence we would like to encourage participants to propose novel techniques which work well with only a small amount of labelled data (instead of treating this as a standard supervised learning problem).
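As one possible illustration of the weakly supervised setting, the sketch below combines the binary Quora label with a few hand-written heuristics ("labelling functions") to assign noisy fine-grained labels to questions that have no fine-grained annotation. Everything here is an assumption made for illustration: the function names, the keyword patterns and the voting rule are hypothetical and cover only two of the categories; this is not a method prescribed by the track.

```python
import re
from collections import Counter

ABSTAIN = None  # a labelling function returns this when it has no opinion


def lf_hypothetical(question):
    # Illustrative pattern only; a real system would need a far richer rule set.
    pattern = r"\b(what if|suppose|imagine if)\b"
    return "hypothetical" if re.search(pattern, question, re.I) else ABSTAIN


def lf_rhetorical(question):
    # Illustrative pattern only.
    pattern = r"\b(isn't it|don't you think|why do .* always)\b"
    return "rhetorical" if re.search(pattern, question, re.I) else ABSTAIN


LABELING_FUNCTIONS = [lf_hypothetical, lf_rhetorical]


def weak_label(question, binary_target):
    """Return a noisy fine-grained label, or None if no heuristic fires.

    binary_target is the sincere/insincere label (assumed 0 = sincere,
    1 = insincere), used here as distant supervision.
    """
    if binary_target == 0:
        return "sincere"
    votes = [lf(question) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return None  # left unlabelled rather than forced into a category
    # Majority vote; ties go to the labelling function listed first.
    return Counter(votes).most_common(1)[0][0]
```

Noisy labels produced this way could then be used to train a classifier, with the small annotated set reserved for validation or for correcting the heuristics.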
We will release a training dataset of questions annotated with fine-grained categories (a subset of the Quora dataset), and participants are free to use the original Quora dataset of insincere questions (which is binary labelled) as well.
We have a test dataset annotated with fine-grained category information, which will not be released during the training phase. We will post this held-out dataset without target labels during the evaluation phase and ask participants to submit their results on it. To prevent manual interference in the final results, participants will be asked to share the code for running their models on the final test dataset along with their results.
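The description above does not state which evaluation metric is used. Purely as an assumption, the sketch below shows how predictions on a held-out set could be scored with accuracy and macro-averaged F1, two common choices for multi-class problems with uneven class sizes. The function names and the toy labels are hypothetical.

```python
def accuracy(gold, pred):
    """Fraction of questions whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores over the given labels.

    A class with no gold and no predicted instances contributes 0.0.
    """
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up labels (1 = rhetorical, 2 = hate speech, 6 = sincere).
gold = [1, 2, 2, 6]
pred = [1, 2, 6, 6]
print(accuracy(gold, pred))             # 0.75
print(macro_f1(gold, pred, [1, 2, 6]))  # (1 + 2/3 + 2/3) / 3, about 0.778
```

Macro averaging gives each category equal weight, so performance on a rare category counts as much as performance on the sincere majority class.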