OLID

The Offensive Language Identification Dataset (OLID) contains 14,200 annotated English tweets. Its annotation model encompasses the following three levels:

A: Offensive Language Detection
B: Categorization of Offensive Language
C: Offensive Language Target Identification

OLID was the official dataset used in the OffensEval: Identifying and Categorizing Offensive Language in Social Media (SemEval 2019 - Task 6) shared task. 

OLID has been used in student projects at several universities. To the best of our knowledge, it has so far been used by students at the University of Arizona (USA), Imperial College London (UK), and the University of Leeds (UK). Some of the student system papers are available here.

 

Download OLID

The complete OLID v1.0 dataset (train, test, and gold labels) is available through this link.
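Once the files are downloaded, the three annotation levels can be read per tweet. The following is a minimal sketch, not an official loader. It assumes the training file is tab-separated with the columns id, tweet, subtask_a, subtask_b, and subtask_c, that the labels are OFF/NOT for level A, TIN/UNT for level B, and IND/GRP/OTH for level C, and that NULL marks a level that does not apply. The sample rows are invented placeholders, not tweets from OLID, and the function names are hypothetical.

```python
import csv
import io
from collections import Counter

# Invented placeholder rows in the ASSUMED column layout (not real OLID data).
SAMPLE_TSV = (
    "id\ttweet\tsubtask_a\tsubtask_b\tsubtask_c\n"
    "1\t@USER example of a neutral post\tNOT\tNULL\tNULL\n"
    "2\t@USER example of an untargeted offensive post\tOFF\tUNT\tNULL\n"
    "3\t@USER example of an offensive post aimed at a person\tOFF\tTIN\tIND\n"
)

LEVELS = ("subtask_a", "subtask_b", "subtask_c")


def read_rows(handle):
    """Read tab-separated rows and map the assumed 'NULL' marker to None."""
    # QUOTE_NONE keeps quotation marks inside tweets as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        for level in LEVELS:
            if row[level] == "NULL":
                row[level] = None
        rows.append(row)
    return rows


def count_labels(rows, level):
    """Count the labels at one level, skipping rows where the level does not apply."""
    return Counter(row[level] for row in rows if row[level] is not None)


rows = read_rows(io.StringIO(SAMPLE_TSV))
for level in LEVELS:
    print(level, dict(count_labels(rows, level)))
```

To run this on the downloaded data, replace the io.StringIO(SAMPLE_TSV) argument with a file handle opened on the training file with encoding="utf-8" and newline="". The levels are hierarchical: only tweets labeled offensive at level A carry a level B label, and only targeted ones carry a level C label, which is why the counts shrink from one level to the next.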

More information about OLID can be found in the NAACL 2019 paper.


If you use OLID, please cite this paper:


@inproceedings{zampierietal2019,
    title={{Predicting the Type and Target of Offensive Posts in Social Media}},
    author={Zampieri, Marcos and Malmasi, Shervin and Nakov, Preslav and Rosenthal, Sara and Farra, Noura and Kumar, Ritesh},
    booktitle={Proceedings of NAACL},
    year={2019}
}