Datasets
We strive to design and create Arabic datasets for practical tasks and make them publicly available for the community to advance research on Arabic IR and Arabic NLP.
Authority Finding in Twitter Dataset
This dataset is offered as a shared task (Task 5: Authority Finding in Twitter) at the CheckThat! 2023 lab. The task is defined as follows: given a tweet stating a rumor, a model has to retrieve a ranked list of authority Twitter accounts that can help verify the rumor, i.e., accounts that may tweet evidence supporting or denying it. The dataset is offered in Arabic. The collection comprises 150 rumors (expressed in tweets) associated with a total of 1,044 authority accounts, plus a user collection of 395,231 Twitter accounts (members of 1,192,284 unique Twitter lists).
Download
Download the dataset from here.
Related Publications
Alberto Barrón-Cedeño, Firoj Alam, Tommaso Caselli, Giovanni Da San Martino, Tamer Elsayed, Andrea Galassi, Fatima Haouari, Federico Ruggeri, Julia Maria Struß, Rabindra Nath Nandi, Gullal S. Cheema, Dilshod Azizov, and Preslav Nakov. The CLEF-2023 CheckThat! Lab: Checkworthiness, Subjectivity, Political Bias, Factuality, and Authority of News Articles and Their Sources. ECIR 2023.
Fatima Haouari and Tamer Elsayed: Detecting Stance of Authorities towards Rumors in Arabic Tweets: A Preliminary Study. ECIR 2023.
AuSTR: The First Authority STance towards Rumors Dataset
AuSTR is the first Authority STance towards Rumors dataset, where evidence is retrieved from authority timelines on Arabic Twitter. AuSTR contains 409 tweet pairs covering 171 unique claims, of which 41 are true and 130 are false. Among those pairs, 118 are disagree (29%), 62 are agree (15%), and 229 are unrelated (56%).
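As a quick sanity check, the stance distribution above can be reproduced from the reported pair counts alone (a minimal Python sketch using only the figures quoted here):

```python
# Stance label counts as reported for the 409 AuSTR pairs.
counts = {"disagree": 118, "agree": 62, "unrelated": 229}

total = sum(counts.values())
assert total == 409  # matches the reported number of pairs

# Round each label's share to the nearest whole percent.
shares = {label: round(100 * n / total) for label, n in counts.items()}
print(shares)  # {'disagree': 29, 'agree': 15, 'unrelated': 56}
```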
Download
Download the dataset from here.
Related Publication
Fatima Haouari and Tamer Elsayed: Detecting Stance of Authorities towards Rumors in Arabic Tweets: A Preliminary Study. ECIR 2023.
IDRISI: Large-scale Twitter Location Mention Prediction Dataset
IDRISI is the largest-scale publicly-available Twitter Location Mention Prediction (LMP) dataset, covering both English and Arabic. It is named after Muhammad Al-Idrisi, one of the pioneers and founders of advanced geography.
Download
Download the dataset from here.
Related Publications
To be listed soon.
ArPFN: Arabic User Credibility Dataset
ArPFN is the first Arabic user dataset developed for the task of identifying users who are prone to spread fake news on Arabic Twitter. It was built by leveraging two Arabic misinformation datasets, ArCOV19-Rumors and AraFacts. ArPFN consists of 1,546 users, of which 541 are prone to spread fake news.
Download
Download the dataset from here.
Related Publication
Zien Sheikh Ali, Abdulaziz Al-Ali, and Tamer Elsayed: Detecting Users Prone to Spread Fake News on Arabic Twitter. OSACT 2022
QRCD: Qur'anic Reading Comprehension Dataset
QRCD is composed of 1,093 tuples of question-passage pairs that are coupled with their extracted answers to constitute 1,337 question-passage-answer triplets. A question might have more than one answer in the passage; therefore, a typical reading comprehension system is expected to extract all of them and return a ranked list of answer spans.
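Since a question may have several answers in the same passage, a system must return every matching span, not just the first. A minimal sketch of enumerating all occurrences of an answer string as character-offset spans (the passage and answer here are hypothetical placeholders, not QRCD data):

```python
def answer_spans(passage: str, answer: str):
    """Return all (start, end) character offsets where `answer` occurs in `passage`."""
    spans, start = [], passage.find(answer)
    while start != -1:
        spans.append((start, start + len(answer)))
        start = passage.find(answer, start + 1)
    return spans

# Hypothetical example: the answer string appears twice in the passage.
print(answer_spans("patience is a virtue; patience is rewarded", "patience"))
# [(0, 8), (22, 30)]
```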
Download
Related Publications
Rana Malhas and Tamer Elsayed. Arabic Machine Reading Comprehension on the Holy Qur’an using CL-AraBERT. Information Processing & Management, 59(6), p.103068, 2022.
Rana Malhas and Tamer Elsayed: AyaTEC: Building a Reusable Verse-Based Test Collection for Arabic Question Answering on the Holy Qur’an. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 19(6), pp.1-21, 2020.
AyaTEC: Reusable Verse-Based Test Collection for Arabic Question Answering on the Holy Qur’an
AyaTEC is a reusable test collection for verse-based question answering on the Holy Qur’an, which serves as a common experimental testbed for this task. AyaTEC includes 207 questions (with their corresponding 1,762 answers) covering 11 topic categories of the Holy Qur’an that target the information needs of both curious and skeptical users. The answers to the questions (each represented as a sequence of verses) in AyaTEC were exhaustive—that is, all qur’anic verses that directly answered the questions were exhaustively extracted and annotated.
Download
Related Publication
Rana Malhas and Tamer Elsayed: AyaTEC: Building a Reusable Verse-Based Test Collection for Arabic Question Answering on the Holy Qur’an. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 19(6), pp.1-21, 2020.
ArCOV19-Rumors: Arabic COVID-19 Twitter Dataset for Misinformation Detection
ArCOV19-Rumors is an Arabic COVID-19 Twitter dataset for misinformation detection composed of tweets containing claims posted from 27 January until the end of April 2020. We collected 138 verified claims, mostly from popular fact-checking websites, and identified 9.4K tweets relevant to those claims. We then manually annotated the tweets by veracity to support research on misinformation detection, one of the major problems faced during a pandemic. We aim to support two classes of misinformation detection problems over Twitter: verifying free-text claims (claim-level verification) and verifying claims expressed in tweets (tweet-level verification). In addition to health, our dataset covers claims related to other topical categories influenced by COVID-19, namely social, political, sports, entertainment, and religious topics.
Download
Download the dataset from here.
Related Publication
Fatima Haouari, Maram Hasanain, Reem Suwaileh, and Tamer Elsayed: ArCOV19-Rumors: Arabic COVID-19 Twitter Dataset for Misinformation Detection. WANLP 2021.
ArCOV-19: Arabic COVID-19 Twitter Dataset
ArCOV-19 is an Arabic COVID-19 Twitter dataset that covers the period from 27 January to 31 March 2020 (and collection is still ongoing). It is the first publicly-available Arabic Twitter dataset covering the COVID-19 pandemic, and it includes around 748K popular tweets (according to Twitter's search criterion) alongside the propagation networks of the most popular subset of them. The propagation networks include both retweets and conversational threads (i.e., threads of replies). ArCOV-19 is designed to enable research in several domains, including natural language processing, data science, and social computing, among others.
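Because the propagation networks include conversational threads, a common first step when working with such data is rebuilding reply trees from (tweet, replied-to) edges. A minimal sketch with hypothetical tweet IDs (the actual ArCOV-19 file format may differ):

```python
from collections import defaultdict

# Hypothetical reply edges: (tweet_id, in_reply_to_id); None marks a root tweet.
edges = [("t1", None), ("t2", "t1"), ("t3", "t1"), ("t4", "t2")]

children = defaultdict(list)
roots = []
for tweet, parent in edges:
    if parent is None:
        roots.append(tweet)
    else:
        children[parent].append(tweet)

def thread(tweet_id):
    """Depth-first flattening of the conversational thread rooted at tweet_id."""
    out = [tweet_id]
    for child in children[tweet_id]:
        out.extend(thread(child))
    return out

print(thread("t1"))  # ['t1', 't2', 't4', 't3']
```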
Download
Download the dataset from here.
Related Publication
Fatima Haouari, Maram Hasanain, Reem Suwaileh, and Tamer Elsayed: ArCOV-19: The First Arabic COVID-19 Twitter Dataset with Propagation Networks. WANLP 2021.
CheckThat! 2021 Fact Checking Arabic Datasets (Tasks 1,2)
Our members, Maram Hasanain, Fatima Haouari, Watheq Mansour, Zien Sheikh Ali, and Dr. Tamer Elsayed, built the Arabic datasets for Tasks 1 and 2 at CheckThat! 2021 lab. The tasks are defined as follows:
Task 1 - Check-Worthiness Estimation: Given a claim, detect whether it is worth fact-checking.
Task 2 - Verified Claim Retrieval: Given a check-worthy claim, and a set of previously fact-checked claims, determine whether the claim has been previously fact-checked.
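Task 2 is essentially a ranking problem over previously fact-checked claims. As an illustration only (not the lab's official baseline), a toy matcher can rank verified claims by token-overlap (Jaccard) similarity with the input claim; the claims below are hypothetical:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two claims."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# Hypothetical collection of previously fact-checked claims.
verified = [
    "garlic cures the flu",
    "drinking water prevents dehydration",
    "garlic water cures covid",
]

query = "garlic cures covid"
ranked = sorted(verified, key=lambda c: jaccard(query, c), reverse=True)
print(ranked[0])  # "garlic water cures covid"
```

A real system would replace the lexical overlap with a learned semantic similarity, but the ranking setup is the same.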
Download
Related Publications
Shaden Shaar, Maram Hasanain, Bayan Hamdan, Zien Sheikh Ali, Fatima Haouari, Alex Nikolov, Mucahid Kutlu, Yavuz Selim Kartal, Firoj Alam, Giovanni Da San Martino, Alberto Barrón-Cedeño, Ruben Miguez, Javier Beltrán, Tamer Elsayed, Preslav Nakov: Overview of the CLEF-2021 CheckThat! Lab Task 1 on Check-Worthiness Estimation in Tweets and Political Debates. Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum
Shaden Shaar, Fatima Haouari, Watheq Mansour, Maram Hasanain, Nikolay Babulkov, Firoj Alam, Giovanni Da San Martino, Tamer Elsayed, Preslav Nakov: Overview of the CLEF-2021 CheckThat! Lab Task 2 on Detecting Previously Fact-Checked Claims in Tweets and Political Debates. Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum
Preslav Nakov, Giovanni Da San Martino, Tamer Elsayed, Alberto Barrón-Cedeño, Rubén Míguez, Shaden Shaar, Firoj Alam, Fatima Haouari, Maram Hasanain, Watheq Mansour, Bayan Hamdan, Zien Sheikh Ali, Nikolay Babulkov, Alex Nikolov, Gautam Kishore Shahi, Julia Maria Struß, Thomas Mandl, Mucahid Kutlu, and Yavuz Selim Kartal: Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. CLEF 2021
Preslav Nakov, Giovanni Da San Martino, Tamer Elsayed, Alberto Barrón-Cedeño, Rubén Míguez, Shaden Shaar, Firoj Alam, Fatima Haouari, Maram Hasanain, Nikolay Babulkov, Alex Nikolov, Gautam Kishore Shahi, Julia Maria Struß, and Thomas Mandl: The CLEF-2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. ECIR 2021
ArTest: The First Test Collection for Arabic Web Search with Relevance Rationales
ArTest is the first large-scale test collection designed for the evaluation of ad-hoc search over the Arabic Web. ArTest uses ArabicWeb16, a collection of around 150M Arabic Web pages, as the document collection, and includes 50 topics, 10,529 relevance judgments, and (more importantly) a rationale behind each judgment.
Download
Download the dataset from here.
Related Publication
Maram Hasanain, Yassmine Barkallah, Reem Suwaileh, Mucahid Kutlu, and Tamer Elsayed: ArTest: The First Test Collection for Arabic Web Search with Relevance Rationales. SIGIR 2020.
Background Relevance Dataset: Annotations and Analysis for Background Linking
We built this dataset by annotating a subset of the query articles and their corresponding judged articles provided by the TREC 2018 news track dataset. We annotated 227 articles in total: 25 query articles and 202 judged articles (an average of about 8 per query), distributed as follows: 51 judged articles of relevance 4, 35 of relevance 3, 33 of relevance 2, 33 of relevance 1, and 50 of relevance 0.
Download
Download the dataset from here.
Related Publication
Marwa Essam and Tamer Elsayed: Why is That a Background Article: A Qualitative Analysis of Relevance for News Background Linking. CIKM 2020
CheckThat! 2020 Arabic Datasets (Tasks 1,2,3)
Our members, Maram Hasanain, Fatima Haouari, Reem Suwaileh, Zien Sheikh Ali, and Dr. Tamer Elsayed, built the Arabic datasets for Tasks 1, 2, and 3 at CheckThat! 2020 lab. Tasks are defined as follows:
Task 1 - Check-Worthiness on tweets: Predict which tweet from a stream of tweets on a topic should be prioritized for fact-checking.
Task 2 - Verified claim retrieval: Given a check-worthy tweet claim, and a set of previously-checked claims, determine whether the claim has been already fact-checked.
Task 3 - Evidence retrieval: Given a check-worthy claim on a specific topic and a set of text snippets extracted from potentially-relevant webpages, return a ranked list of evidence snippets for the claim.
Related Publications
Maram Hasanain, Fatima Haouari, Reem Suwaileh, Zien Sheikh Ali, Bayan Hamdan, Tamer Elsayed, Alberto Barrón-Cedeño, Giovanni Da San Martino, Preslav Nakov: Overview of CheckThat! 2020 Arabic: Automatic Identification and Verification of Claims in Social Media. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum
Shaden Shaar, Alex Nikolov, Nikolay Babulkov, Firoj Alam, Alberto Barrón-Cedeño, Tamer Elsayed, Maram Hasanain, Reem Suwaileh, Fatima Haouari, Giovanni Da San Martino, Preslav Nakov: Overview of CheckThat! 2020 English: Automatic Identification and Verification of Claims in Social Media. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum
Alberto Barrón-Cedeño, Tamer Elsayed, Preslav Nakov, Giovanni Da San Martino, Maram Hasanain, Reem Suwaileh, Fatima Haouari, Nikolay Babulkov, Bayan Hamdan, Alex Nikolov, Shaden Shaar, and Zien Sheikh Ali: Overview of CheckThat! 2020: Automatic Identification and Verification of Claims in Social Media. CLEF 2020
CheckThat! 2019 Arabic Dataset (Task 2)
Our members, Maram Hasanain, Reem Suwaileh, and Dr. Tamer Elsayed, built the Task 2 dataset at the CheckThat! 2019 lab.
Task Definition
Given a claim associated with a set of Web pages P (that constitute the results of Web search in response to using the claim as a search query), identify which of the Web pages (and passages of those Web pages) can be useful in assisting a human who is fact-checking the claim.
More details about the task can be found here.
Download
You can download data from here.
Related Publication
Maram Hasanain, Reem Suwaileh, Tamer Elsayed, Alberto Barrón-Cedeño, Preslav Nakov: Overview of the CLEF-2019 CheckThat! Lab: Automatic Identification and Verification of Claims. Task 2: Evidence and Factuality. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum
Tamer Elsayed, Preslav Nakov, Alberto Barrón-Cedeño, Maram Hasanain, Reem Suwaileh, Giovanni Da San Martino, and Pepa Atanasova: Overview of the CLEF-2019 CheckThat! Lab: Automatic Identification and Verification of Claims. CLEF 2019
CheckThat! 2018 Arabic Datasets (Tasks 1,2)
Our members, Reem Suwaileh and Dr. Tamer Elsayed, built the Task 1 and Task 2 datasets at the CheckThat! 2018 lab.
Task 1 Definition: Given a transcription of a political debate/speech, predict which claims should be prioritized for fact-checking.
Task 2 Definition: Given a check-worthy claim in the form of a (transcribed) sentence, determine whether the claim is likely to be true, half-true, or false.
Download
You can download data from here.
Related Publications
Preslav Nakov, Alberto Barrón-Cedeño, Tamer Elsayed, Reem Suwaileh, Lluís Màrquez, Wajdi Zaghouani, Pepa Atanasova, Spas Kyuchukov, and Giovanni Da San Martino: Overview of the CLEF-2018 CheckThat! Lab on Automatic Identification and Verification of Political Claims. CLEF 2018
Pepa Atanasova, Lluís Màrquez, Alberto Barrón-Cedeño, Tamer Elsayed, Reem Suwaileh, Wajdi Zaghouani, Spas Kyuchukov, Giovanni Da San Martino, Preslav Nakov: Overview of the CLEF-2018 CheckThat! Lab on Automatic Identification and Verification of Political Claims. Task 1: Check-Worthiness. Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum
Alberto Barrón-Cedeño, Tamer Elsayed, Reem Suwaileh, Lluís Màrquez, Pepa Atanasova, Wajdi Zaghouani, Spas Kyuchukov, Giovanni Da San Martino, Preslav Nakov: Overview of the CLEF-2018 CheckThat! Lab on Automatic Identification and Verification of Political Claims. Task 2: Factuality. Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum
Web Search for Fact Checking Dataset
This dataset supports the problem of re-ranking Web search results for better fact-checking. It is a test collection comprising 22 claims, with 20 Web search results per claim collected from a commercial search engine.
Download
Download the dataset from here.
Related Publication
Khaled Yasser, Mucahid Kutlu, and Tamer Elsayed: Re-ranking Web Search Results for Better Fact-Checking: A Preliminary Study. Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18), October 22–26, 2018, Torino, Italy.
WebCrowd25k
The WebCrowd25k dataset includes three related parts:
Crowd Relevance Judgments. 25,099 information retrieval relevance judgments collected on Amazon’s Mechanical Turk platform. For each of the 50 search topics from the 2014 NIST TREC WebTrack, we selected 100 ClueWeb12 documents to be re-judged (without reference to the original TREC assessor judgment) by 5 MTurk workers each (50 topics x 100 documents x 5 workers = 25K crowd judgments). Individual worker IDs from the platform are hashed to new identifiers. We collect relevance judgments on a 4-point graded scale. (See SIGIR’18 & HCOMP’18 papers).
Behavioral Data. For a subset of the judgments, we also collected behavioral data characterizing worker behavior while performing the relevance judging. Behavioral data was recorded using MmmTurkey, which captures a variety of worker interaction behaviors during the completion of MTurk Human Intelligence Tasks. (See HCOMP’18 paper.)
Disagreement Analysis. We inspected 1,000 crowd judgments for 200 documents (5 judgments per document, where the aggregated crowd judgment differs from the original TREC assessor judgment), and we classified each disagreement according to our disagreement taxonomy. (See SIGIR’18 paper.)
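With five crowd judgments per document, one simple way to obtain an aggregated crowd judgment like the one referenced above is majority voting over the 4-point scale. This is a sketch only; the associated papers may use a different aggregation rule:

```python
from collections import Counter

def aggregate(judgments):
    """Majority vote over one document's crowd judgments (4-point scale, 0-3).
    Ties are broken in favor of the lower relevance grade."""
    votes = Counter(judgments)
    best = max(votes.values())
    return min(grade for grade, n in votes.items() if n == best)

# Hypothetical judgments from 5 workers for one topic-document pair.
print(aggregate([2, 3, 2, 0, 2]))  # 2
```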
Download
Download the entire dataset here. Please refer to the included README files and associated publications for further details.
Another source for download is here.
Related Publications
Tanya Goyal, Tyler McDonnell, Mucahid Kutlu, Tamer Elsayed, and Matthew Lease. Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to Ensure Quality Relevance Annotations. In Proceedings of the 6th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2018.
Mucahid Kutlu, Tyler McDonnell, Yassmine Barkallah, Tamer Elsayed, and Matthew Lease. Crowd vs. Expert: What Can Relevance Judgment Rationales Teach Us About Assessor Disagreement? In Proceedings of the 41st international ACM SIGIR conference on Research and development in Information Retrieval, 2018.
Tyler McDonnell, Matthew Lease, Mucahid Kutlu, and Tamer Elsayed. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments. In Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), pages 139-148, 2016. Best Paper Award. [ pdf | blog-post | data | slides ]
Brandon Dang, Miles Hutson, and Matthew Lease. MmmTurkey: A Crowdsourcing Framework for Deploying Tasks and Recording Worker Behavior on Amazon Mechanical Turk. In 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP): Works-in-Progress Track, 2016. 3 pages. arXiv:1609.00945. [ pdf | sourcecode ]
EveTAR: The First Arabic Test Collection for Multiple Information Retrieval Tasks in Twitter
EveTAR is the first Arabic test collection for multiple information retrieval tasks in Twitter, supporting event detection, ad-hoc search, timeline generation, and real-time summarization. EveTAR includes a crawl of 355M Arabic tweets and covers 50 significant events, for which about 62K tweets were judged with a substantial average inter-annotator agreement (Kappa value of 0.71).
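The agreement figure above is a Cohen's Kappa. As a reminder of how such a statistic is computed for two annotators, here is a self-contained sketch; the toy labels are illustrative, not EveTAR data:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's Kappa for two annotators' parallel label sequences."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of items both annotators labeled identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under chance, from each annotator's label distribution.
    ca, cb = Counter(a), Counter(b)
    expected = sum((ca[lab] / n) * (cb[lab] / n) for lab in set(a) | set(b))
    return (observed - expected) / (1 - expected)

# Toy example: two annotators judging 10 tweets as relevant (R) or not (N).
a = ["R", "R", "R", "R", "N", "N", "N", "N", "R", "N"]
b = ["R", "R", "R", "N", "N", "N", "N", "R", "R", "N"]
print(round(cohens_kappa(a, b), 2))  # 0.6
```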
Related publications
[December 21st, 2017] Final published Information Retrieval Journal (IRJ) article on Springer that describes the 2nd version of the collection:
Maram Hasanain, Reem Suwaileh, Tamer Elsayed, Mucahid Kutlu, and Hind Almerekhi: EveTAR: Building a Large-Scale Multi-Task Test Collection over Arabic Tweets. Information Retrieval Journal, 2017. https://doi.org/10.1007/s10791-017-9325-7
[July 17th, 2016] SIGIR 2016 paper that describes the first version of the collection:
Hind Almerekhi, Maram Hasanain, and Tamer Elsayed. EveTAR: A New Test Collection for Event Detection in Arabic Tweets. Proceedings of the 39th annual international ACM SIGIR conference on Research and development in information retrieval: SIGIR ’16, Pisa, Italy, July 2016. Download.
Download
ArabicWeb16: Largest Public Arabic Web Crawl
ArabicWeb16 is a public Web crawl of 150,211,934 Arabic Web pages with high coverage of dialectal Arabic as well as Modern Standard Arabic (MSA). We expect ArabicWeb16 to support various research areas such as ad-hoc search, question answering, filtering, cross-dialect search, dialect detection, entity search, blog search, and spam detection, among others.
Download
For further information on dataset download and statistics, visit ArabicWeb16 website.
Related Publication
Reem Suwaileh, Mucahid Kutlu, Nihal Fathima, Tamer Elsayed, and Matthew Lease. ArabicWeb16: A New Crawl for Today’s Arabic Web. Proceedings of the 39th annual international ACM SIGIR conference on Research and development in information retrieval: SIGIR ’16, Pisa, Italy, July 2016.
DART: Dialectal Arabic Tweets Dataset
The Dialectal ARabic Tweets (DART) dataset is a large, manually-annotated, multi-dialect dataset of about 25K Arabic tweets. The tweets were annotated via crowdsourcing, and the dataset is well-balanced over five main groups of Arabic dialects: Egyptian, Maghrebi, Levantine, Gulf, and Iraqi.
Download
Download the dataset from here (.zip).
Related Publication
Israa Alsarsour, Esraa Mohamed, Reem Suwaileh, and Tamer Elsayed: DART: A Large Dataset of Dialectal Arabic Tweets. LREC 2018
AutoTweet: Dataset for Detecting Automatically-Generated Arabic Tweets
We provide two datasets to study automation behavior in Arabic tweets. Both are released as tab-separated text files. We describe the content of each as follows:
Full Dataset: contains a total of 11,764 unique UserIDs and a total of 1,281,708 TweetIDs.
Labeled Tweets dataset: contains the TweetIDs, TweetText, and Labels of 3,503 tweets obtained via crowdsourcing; 1,944 of the tweets are labeled as automated and 1,559 as manual.
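Since both files are tab-separated, the labeled tweets can be read with Python's standard csv module. A sketch assuming a hypothetical column order (TweetID, TweetText, Label); check the released files for the exact layout:

```python
import csv
import io

# Stand-in for the released labeled-tweets TSV (hypothetical rows).
tsv = "1001\tsample tweet text one\tautomated\n1002\tsample tweet text two\tmanual\n"

rows = list(csv.reader(io.StringIO(tsv), delimiter="\t"))
labels = [label for _tweet_id, _text, label in rows]
print(labels)  # ['automated', 'manual']
```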
Download
Download AutoTweet-Dataset-v1.0 from here (.zip).
Related Publications
Hind Almerekhi and Tamer Elsayed: Detecting Automatically-Generated Arabic Tweets. AIRS 2015
Journalists Questions on Twitter
We provide two datasets to support question identification and question-type classification in Arabic tweets posted by journalists. Both are released as tab-separated text files. We describe the content of each as follows:
Labelled Tweets dataset: A list of tweet IDs for Arabic tweets labelled via crowdsourcing. Each tweet is associated with one label: question tweet or not. A question tweet is a tweet that contains at least one interrogative sentence.
Labelled Question Tweets dataset: A list of tweet IDs for Arabic question tweets labelled by in-house annotators. Each question tweet is associated with one label, namely the question type (given a taxonomy of 8 types).
Download
ArQAT-JQ-Dataset-v1.0: download zip file.
Related Publication
Maram Hasanain, Mossaab Bagdouri, Tamer Elsayed, Douglas Oard: What Questions Do Journalists Ask on Twitter? The Workshops of the AAAI Conference on Web and Social Media, 2016
Answerable Question Identification in Arabic Tweets
Download
ArQAT-AQI-Dataset-v1.0: download txt file.
Related Publication
Maram Hasanain, Tamer Elsayed, and Walid Magdy: Identification of Answer-Seeking Questions in Arabic Microblogs. CIKM 2014
Question Identification in Arabic Tweets
Download
ArQAT-QI-Dataset-v1.0: download zip file.
Related Publication
Maram Hasanain, Tamer Elsayed, and Walid Magdy: Identification of Answer-Seeking Questions in Arabic Microblogs. CIKM 2014