Datasets

We strive to design and create Arabic datasets for practical tasks and make them publicly available for the community to advance research on Arabic IR and Arabic NLP.

Authority Finding in Twitter Dataset

This dataset is offered as a shared task (Task 5: Authority Finding in Twitter) at the CheckThat! 2023 Lab. The task is defined as follows: given a tweet stating a rumor, a model has to retrieve a ranked list of authority Twitter accounts that can help verify the rumor, i.e., accounts that may tweet evidence supporting or denying it. The dataset is in Arabic. The collection comprises 150 rumors (expressed in tweets) associated with a total of 1,044 authority accounts, and a user collection of 395,231 Twitter accounts (members of 1,192,284 unique Twitter lists).

Download

Related Publications

AuSTR: The First Authority STance towards Rumors Dataset

AuSTR is the first Authority STance towards Rumors dataset, where evidence is retrieved from authority timelines in Arabic Twitter. AuSTR contains 409 pairs covering 171 unique claims, of which 41 are true and 130 are false. Among those pairs, 118 are labeled disagree (29%), 62 agree (15%), and 229 unrelated (56%).
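
The percentages above follow directly from the label counts. As a quick sanity check, the sketch below recomputes them; the variable names are illustrative and are not part of the dataset release.

```python
# Label counts as reported for the AuSTR pairs.
label_counts = {"disagree": 118, "agree": 62, "unrelated": 229}

total_pairs = sum(label_counts.values())

# Share of each label, rounded to the nearest whole percent.
label_shares = {
    label: round(100 * count / total_pairs)
    for label, count in label_counts.items()
}

print(total_pairs)
print(label_shares)
```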

Download

Related Publication

IDRISI: Large-scale Twitter Location Mention Prediction Dataset

IDRISI is the largest-scale publicly available Twitter Location Mention Prediction (LMP) dataset, in both English and Arabic. It is named after Muhammad Al-Idrisi, one of the pioneers and founders of advanced geography.

Download

Related Publications

To be listed soon.

ArPFN: Arabic User Credibility Dataset

ArPFN is the first Arabic user dataset developed for the task of identifying users who are prone to spreading fake news in Arabic Twitter. It was built by leveraging two Arabic misinformation datasets, ArCOV19-Rumors and AraFacts. ArPFN consists of 1,546 users, of which 541 are prone to spreading fake news.

Download

Related Publication

QRCD: Qur'anic Reading Comprehension Dataset

QRCD is composed of 1,093 question-passage pairs that are coupled with their extracted answers to constitute 1,337 question-passage-answer triplets. A question might have more than one answer in the passage; therefore, a typical reading comprehension system is expected to extract all of them and return a ranked list of answer spans.
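
To illustrate how question-passage pairs relate to question-passage-answer triplets, here is a minimal sketch with invented toy data. The `QPPair` class and its field names are hypothetical and do not reflect the released file format; the point is only that a pair with several answers contributes several triplets.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class QPPair:
    """Hypothetical container for one question-passage pair."""
    question: str
    passage: str
    answers: List[str] = field(default_factory=list)  # answer spans in the passage


def to_triplets(pairs: List[QPPair]) -> List[Tuple[str, str, str]]:
    """Expand each pair into one (question, passage, answer) triplet per answer."""
    return [(p.question, p.passage, a) for p in pairs for a in p.answers]


# Toy data, invented for illustration only.
pairs = [
    QPPair("Q1", "passage one", ["span a", "span b"]),
    QPPair("Q2", "passage two", ["span c"]),
]

triplets = to_triplets(pairs)
print(len(pairs), len(triplets))
```

In this toy case, two pairs yield three triplets, which mirrors how the dataset has more triplets than pairs.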

Download

Related Publications

AyaTEC: Reusable Verse-Based Test Collection for Arabic Question Answering on the Holy Qur’an

AyaTEC is a reusable test collection for verse-based question answering on the Holy Qur’an, which serves as a common experimental testbed for this task. AyaTEC includes 207 questions (with their corresponding 1,762 answers) covering 11 topic categories of the Holy Qur’an that target the information needs of both curious and skeptical users. The answers to the questions (each represented as a sequence of verses) in AyaTEC are exhaustive; that is, all Qur’anic verses that directly answer the questions were extracted and annotated.

Download

Related Publication

ArCOV19-Rumors: Arabic COVID-19 Twitter Dataset for Misinformation Detection

ArCOV19-Rumors is an Arabic COVID-19 Twitter dataset for misinformation detection, composed of tweets containing claims from 27th January until the end of April 2020. We collected 138 verified claims, mostly from popular fact-checking websites, and identified 9.4K tweets relevant to those claims. We then manually annotated the tweets by veracity to support research on misinformation detection, which is one of the major problems faced during a pandemic. We aim to support two classes of misinformation detection problems over Twitter: verifying free-text claims (called claim-level verification) and verifying claims expressed in tweets (called tweet-level verification). In addition to health, our dataset covers claims related to other topical categories that were influenced by COVID-19, namely social, politics, sports, entertainment, and religion.

Download

Related Publication

ArCOV-19: Arabic COVID-19 Twitter Dataset

ArCOV-19 is an Arabic COVID-19 Twitter dataset that covers the period from 27th January until 31st March 2020 (and is still ongoing). It is the first publicly available Arabic Twitter dataset covering the COVID-19 pandemic, and it includes around 748k popular tweets (according to Twitter's search criterion) alongside the propagation networks of the most popular subset of them. The propagation networks include both retweets and conversational threads (i.e., threads of replies). ArCOV-19 is designed to enable research in several domains, including natural language processing, data science, and social computing, among others.

Download

Related Publication


CheckThat! 2021 Fact Checking Arabic Datasets (Tasks 1,2)

Our members, Maram Hasanain, Fatima Haouari, Watheq Mansour, Zien Sheikh Ali, and Dr. Tamer Elsayed, built the Arabic datasets for Tasks 1 and 2 at the CheckThat! 2021 lab. The tasks are defined as follows:

Download

Related Publications

ArTest: The First Test Collection for Arabic Web Search with Relevance Rationales

ArTest is the first large-scale test collection designed for the evaluation of ad-hoc search over the Arabic Web. ArTest uses ArabicWeb16, a collection of around 150M Arabic Web pages, as the document collection, and includes 50 topics, 10,529 relevance judgments, and (more importantly) a rationale behind each judgment.

Download

Related Publication

Background Relevance Dataset: Annotations and Analysis for Background Linking

We built this dataset by annotating a subset of the query articles and their corresponding judged articles provided by the TREC 2018 News Track dataset. We annotated 227 articles: 25 query articles and 202 judged articles (an average of 8 per query). The judged articles are distributed as follows: 51 of relevance 4, 35 of relevance 3, 33 of relevance 2, 33 of relevance 1, and 50 of relevance 0.
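
The totals and the per-query average quoted above can be recomputed from the per-level counts. The following is a small sketch of that arithmetic; the variable names are illustrative only.

```python
# Number of judged articles at each relevance level, as reported above.
judged_by_relevance = {4: 51, 3: 35, 2: 33, 1: 33, 0: 50}
query_articles = 25

judged_total = sum(judged_by_relevance.values())
annotated_total = query_articles + judged_total
avg_judged_per_query = judged_total / query_articles

print(judged_total, annotated_total, round(avg_judged_per_query, 2))
```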

Download

Related Publication

CheckThat! 2020 Arabic Datasets (Tasks 1,2,3)

Our members, Maram Hasanain, Fatima Haouari, Reem Suwaileh, Zien Sheikh Ali, and Dr. Tamer Elsayed, built the Arabic datasets for Tasks 1, 2, and 3 at the CheckThat! 2020 lab. The tasks are defined as follows:

Related Publications

CheckThat! 2019 Arabic Dataset (Task 2)

Our members, Maram Hasanain, Reem Suwaileh, and Dr. Tamer Elsayed, built the Task 2 dataset at the CheckThat! 2019 lab.

Task Definition

Given a claim associated with a set of Web pages P (that constitute the results of Web search in response to using the claim as a search query), identify which of the Web pages (and passages of those Web pages) can be useful in assisting a human who is fact-checking the claim.

More details about the task can be found here.  

Download

You can download the data from here.

Related Publication

CheckThat! 2018 Arabic Datasets (Tasks 1,2)

Our members, Reem Suwaileh and Dr. Tamer Elsayed, built the Task 1 and Task 2 datasets at the CheckThat! 2018 lab.

Download

Related Publications

Web Search for Fact Checking Dataset

Download

Related Publication

WebCrowd25k

The WebCrowd25k dataset includes three related parts:

Download

Related Publications

EveTAR: The First Arabic Test Collection for Multiple Information Retrieval Tasks in Twitter

EveTAR is the first Arabic test collection for multiple information retrieval tasks in Twitter. It supports event detection, ad-hoc search, timeline generation, and real-time summarization. EveTAR includes a crawl of 355M Arabic tweets and covers 50 significant events, for which about 62K tweets were judged with a substantial average inter-annotator agreement (Kappa value of 0.71).
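
The description does not say which kappa statistic was used. As one common possibility, the sketch below computes Cohen's kappa for two annotators on invented binary relevance labels. Treating the reported value as Cohen's kappa is an assumption, and the function name and toy labels are illustrative only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Toy judgments (1 = relevant, 0 = not relevant), invented for illustration.
annotator_a = [1, 1, 0, 0, 1, 0, 1, 1]
annotator_b = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 2))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually lower than the plain fraction of matching labels.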

Related Publications

Download

ArabicWeb16: Largest Public Arabic Web Crawl

ArabicWeb16 is a public Web crawl of 150,211,934 Arabic Web pages with high coverage of dialectal Arabic as well as Modern Standard Arabic (MSA). We expect ArabicWeb16 to support various research areas such as ad-hoc search, question answering, filtering, cross-dialect search, dialect detection, entity search, blog search, and spam detection, among others.

Download

Related Publication

DART: Dialectal Arabic Tweets Dataset

The Dialectal ARabic Tweets (DART) dataset is a large, manually annotated, multi-dialect dataset of Arabic tweets. It has about 25K tweets that are annotated via crowdsourcing, and it is well balanced over five main groups of Arabic dialects: Egyptian, Maghrebi, Levantine, Gulf, and Iraqi.

Download

Related Publication

AutoTweet: Dataset for Detecting Automatically-Generated Arabic Tweets

We provide two datasets to study automation behavior in Arabic tweets. The two datasets are released in tab-separated text files. We describe the content of each as follows:

Download

Related Publications

Journalists Questions on Twitter

We provide two datasets to support question identification and question-type classification in Arabic tweets of journalists. The two datasets are released in tab-separated text files. We describe the content of each as follows:

Download

Related Publication


Answerable Question Identification in Arabic Tweets

Download

Related Publication

Question Identification in Arabic Tweets

Download

Related Publication