The Semi-Supervised Offensive Language Identification Dataset (SOLID) contains over 9,000,000 tweets annotated following OLID's three-level taxonomy:


A: Offensive Language Detection

B: Categorization of Offensive Language

C: Offensive Language Target Identification
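The three levels are hierarchical: level B applies only to tweets judged offensive at level A, and level C applies only to tweets judged targeted at level B. The minimal sketch below illustrates that dependency. The label codes (OFF/NOT, TIN/UNT, IND/GRP/OTH) are taken from the OLID paper rather than from this page, and the function name `is_valid_annotation` is hypothetical; the released SOLID files may represent annotations differently.

```python
from typing import Optional

# Label codes as defined in the OLID paper (an assumption here; not listed on this page).
LEVEL_A = {"OFF", "NOT"}         # offensive / not offensive
LEVEL_B = {"TIN", "UNT"}         # targeted insult or threat / untargeted
LEVEL_C = {"IND", "GRP", "OTH"}  # individual / group / other


def is_valid_annotation(a: str, b: Optional[str] = None, c: Optional[str] = None) -> bool:
    """Check that a (level A, level B, level C) triple respects the hierarchy.

    Hypothetical helper for illustration only.
    """
    if a not in LEVEL_A:
        return False
    if a == "NOT":
        # Non-offensive tweets carry no level B or level C label.
        return b is None and c is None
    if b not in LEVEL_B:
        return False
    if b == "UNT":
        # Untargeted offensive tweets carry no level C label.
        return c is None
    return c in LEVEL_C
```

For example, `is_valid_annotation("OFF", "TIN", "GRP")` is accepted, while `is_valid_annotation("NOT", "TIN")` is rejected because a non-offensive tweet cannot have a level B label.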


SOLID was the official English dataset used in the OffensEval 2020 shared task. 

SOLID can be downloaded here. More information about SOLID can be found in this paper.

If you use SOLID, please cite this paper:



  title={A Large-Scale Semi-Supervised Dataset for Offensive Language Identification},
  author={Rosenthal, Sara and Atanasova, Pepa and Karadzhov, Georgi and Zampieri, Marcos and Nakov, Preslav},
  journal={arXiv preprint arXiv:2004.14454},