Datasets

LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents

Huggingface Hub links

- LDKP3K - https://huggingface.co/datasets/midas/ldkp3k
- LDKP10K - https://huggingface.co/datasets/midas/ldkp10k

Learning Rich Representation of Keyphrases from Text (Findings of NAACL-HLT 2022)

KBIR and KeyBART models: https://zenodo.org/record/5784384#.Yb4CjrtOlH5

Huggingface Hub links

- KBIR - https://huggingface.co/bloomberg/KBIR
- KeyBART - https://huggingface.co/bloomberg/KeyBART

GupShup (EMNLP 2021)

GupShup: An Annotated Corpus for Abstractive Summarization of Open-Domain Code-Switched Conversations introduces a Hindi-English (Hinglish) code-switched conversational dataset consisting of 6,831 conversations and their corresponding summaries in English and Hinglish. The paper is available at https://aclanthology.org/2021.emnlp-main.499/. Contact Laiba Mehnaz (lm4428@nyu.edu) for the dataset.

Citation Worthiness of Sentences in Scholarly Articles (NAACL-HLT 2021)

This dataset contains over 2.7 million sentences extracted from scholarly articles (from ACL Anthology [Bird et al.]) and their corresponding citation worthiness labels. The goal of the citation worthiness task is to determine whether a given sentence requires a citation.

Transport Complaint Data (AAAI 2020)

A collection of 3,700 tweets about transport-related complaints, annotated with two classes: complaint and non-complaint.

MeTooMA (ICWSM 2020)

MeTooMA is a dataset containing 9,973 tweets related to the MeToo movement that were manually annotated for five different linguistic aspects:

- relevance (relevant, not relevant)
- stance (support, opposition)
- hate speech (directed hate, generalized hate)
- sarcasm (sarcastic, not sarcastic)
- dialogue acts (allegation, refutation, justification)

Keyphrase Extraction using Contextual Embeddings (ECIR 2020)

Dataset for the paper "Keyphrase Extraction from Scholarly Articles as Sequence Labeling using Contextualized Embeddings". For more details on how the dataset was created and on the models trained on it, please refer to our paper.
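As a rough illustration of what "keyphrase extraction as sequence labeling" means, the sketch below converts a token sequence and a list of gold keyphrases into per-token tags. It assumes a B/I/O tagging scheme (B begins a keyphrase, I continues one, O is outside) and case-insensitive exact matching; the function name `bio_tags` and the example sentence are hypothetical and are not taken from the dataset or the paper's code.

```python
from typing import List


def bio_tags(tokens: List[str], keyphrases: List[List[str]]) -> List[str]:
    """Label each token B (begins a keyphrase), I (inside one), or O (outside)."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # Match longer phrases first so a short phrase cannot claim part of a longer one.
    for phrase in sorted(keyphrases, key=len, reverse=True):
        target = [w.lower() for w in phrase]
        n = len(target)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            # Only tag spans that match exactly and are not already tagged.
            span_free = all(tag == "O" for tag in tags[start:start + n])
            if lowered[start:start + n] == target and span_free:
                tags[start] = "B"
                tags[start + 1:start + n] = ["I"] * (n - 1)
    return tags


# Hypothetical example sentence and keyphrases, for illustration only.
tokens = "We study keyphrase extraction with contextualized embeddings".split()
keyphrases = [["keyphrase", "extraction"], ["contextualized", "embeddings"]]
tags = bio_tags(tokens, keyphrases)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A sequence labeling model is then trained to predict one such tag per token, and keyphrases are read back from the predicted B/I runs.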

BHAAV

BHAAV is the first and largest Hindi text corpus for analyzing the emotions that a writer expresses through their characters in a story, as perceived by a narrator/reader. The corpus consists of 20,304 sentences collected from 230 different short stories spanning 18 genres such as प्रेरणादायक (Inspirational) and रहस्यमयी (Mystery). Each sentence has been annotated with one of five emotion categories: anger, joy, suspense, sad, and neutral.

Hindi Discourse (LREC 2020)

The Hindi Discourse Analysis dataset is a corpus for analyzing the discourse modes present in its sentences. It contains sentences from stories written by 11 famous authors from the 20th century. Four to five stories by each author that were available in the public domain were selected, resulting in a collection of 53 stories. Most of these short stories were originally written in Hindi, but some were written in other Indian languages and later translated to Hindi.

The corpus contains a total of 10,472 sentences belonging to the following categories:

- Argumentative
- Descriptive
- Dialogic
- Informative
- Narrative

Hindi NLI Data (AACL-IJCNLP 2020)

hindi-nli-data is the first recasted dataset for natural language inference in Hindi. Evaluating the learning capabilities of deep learning models in Natural Language Processing has always been challenging, and the task of Natural Language Inference (NLI) has been the touchstone for measuring their performance. However, there is a complete absence of labeled NLI datasets for a low-resource language like Hindi. To address this, we automatically recast three existing Hindi text classification datasets related to affective content analysis as NLI datasets. This resulted in three NLI datasets with 43K, 17K, and 203K premise-hypothesis pairs.
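To make the idea of recasting concrete, here is a minimal sketch of one common way to turn a labeled classification example into premise-hypothesis pairs: the original text becomes the premise, and one templated hypothesis is generated per candidate label. The template, the English example sentences, and the label names `entailed` / `not-entailed` are assumptions made for illustration; they are not confirmed to match the procedure or label names used to build hindi-nli-data.

```python
from typing import Dict, List, Tuple


def recast_to_nli(
    examples: List[Tuple[str, str]],
    labels: List[str],
    template: str,
) -> List[Dict[str, str]]:
    """Create one premise-hypothesis pair per (example, candidate label)."""
    pairs = []
    for text, gold in examples:
        for label in labels:
            pairs.append({
                "premise": text,
                "hypothesis": template.format(label=label),
                # The hypothesis holds only when it names the gold label.
                "label": "entailed" if label == gold else "not-entailed",
            })
    return pairs


# Hypothetical sentiment examples (in English for readability).
examples = [
    ("The food was wonderful.", "positive"),
    ("The service was slow.", "negative"),
]
labels = ["positive", "negative"]
pairs = recast_to_nli(examples, labels, "This review is {label}.")
for pair in pairs:
    print(pair)
```

Under this scheme each source example yields as many pairs as there are labels, which is one reason a recast NLI dataset can be several times larger than the classification dataset it came from.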
