Arabic LLMs Hallucination-Shared Task

Detection of Hallucination in Arabic Factual Claims Generated by ChatGPT and GPT4



@ OSACT 2024 Workshop, LREC-COLING 2024 

Torino, Italy, 20-25 May, 2024 

Task Overview


Large Language Models (LLMs) have shown superb abilities to generate texts that are, in many cases, indistinguishable from human-written texts. However, they sometimes generate false, incorrect, or misleading content, which is often described as “hallucinations.”


Addressing hallucinations in LLMs for the Arabic language not only enhances the reliability and applicability of these models, but also holds potential implications for a wide array of applications including information retrieval, sentiment analysis, and machine translation.


In this shared task, we share the first Arabic dataset on LLM hallucination, containing 10K sentences generated by GPT-3.5 (aka ChatGPT) and GPT-4 (GPT4) and annotated for factuality and correctness.


Data Collection:

We chose 1,000 random words from the SAMER Arabic readability lexicon (Al Khalil et al., 2020) and asked ChatGPT and GPT4 to generate five verifiable factual sentences (or claims) for each word.


The prompt used is: "Give exactly FIVE Arabic complete and diverse factual sentences having the following word: {word}. These sentences should have facts that can be checked and verified. Write the sentences separated by a new line without translation and without numbering"
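
For illustration only, the generation step could look roughly like the sketch below; the lexicon file name, the sampling, and the client calls are assumptions, not the exact pipeline used to build the dataset.

    # Sketch: sample seed words and ask a chat model for five verifiable claims per word.
    # Assumptions: a plain-text word list (samer_words.txt) and the openai Python client (v1+);
    # model and file names are illustrative.
    import random
    from openai import OpenAI

    PROMPT = ("Give exactly FIVE Arabic complete and diverse factual sentences having the "
              "following word: {word}. These sentences should have facts that can be checked "
              "and verified. Write the sentences separated by a new line without translation "
              "and without numbering")

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("samer_words.txt", encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]

    for word in random.sample(words, 1000):
        response = client.chat.completions.create(
            model="gpt-4",  # repeated with a GPT-3.5 model for the ChatGPT half of the data
            messages=[{"role": "user", "content": PROMPT.format(word=word)}],
        )
        claims = [c for c in response.choices[0].message.content.split("\n") if c.strip()]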


Annotation:

We extensively trained 50 final-year students from Al-Imam University in Saudi Arabia. Each student annotated 200 random sentences generated by ChatGPT or GPT4. For quality control, we annotated 50 randomly selected generated sentences as test questions and inserted them randomly among the sentences assigned to each student. The average agreement with these test questions is 87%.
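
For clarity, agreement with the test questions is simply the share of hidden gold items an annotator labeled correctly; a minimal sketch, with hypothetical data structures:

    # Sketch: per-annotator agreement with the gold test questions hidden in each batch.
    # `annotations` maps annotator -> list of (claim_id, label); `gold` maps claim_id -> label.
    def agreement_with_test_questions(annotations, gold):
        scores = {}
        for annotator, items in annotations.items():
            checked = [(cid, lab) for cid, lab in items if cid in gold]
            if checked:
                scores[annotator] = sum(lab == gold[cid] for cid, lab in checked) / len(checked)
        return scores  # averaging these per-annotator values gives the reported 87%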


Each sentence was labeled as Factually Correct (FC), Factually Incorrect (FI), or Not Factual (NF).


In addition, annotators marked sentences containing linguistic errors and corrected them, and they added the reference links used for verification. These fields will not be used in this shared task. Examples are shown in Table 1.




We will have two shared subtasks:



Subtask A: Hallucination detection given the generated claim text only.

Labels for this task are: Factually Correct (FC), Factually Incorrect (FI), or Not Factual (NF).



Subtask B: Hallucination detection given the claim text together with the additional information provided in the dataset (source word, readability level, and generating model).

Labels are the same as in Subtask A.



We will use the Codalab evaluation platform to manage the submissions.


Data will be split into 70% for training, 10% for development, and 20% for testing.


We encourage participants to use this data and/or any other external data (previous datasets, lexicons, in-house data, etc.) and try to explain model behavior.


License:

The data is made public for research purposes only (non-commercial use).

Important Dates

All times are Anywhere on Earth (AoE).

Dataset

The data is in a tab-separated format as follows:

Subtask A and Subtask B:

claim_id \t word_pos \t readability \t model \t claim \t label (values are FC/FI/NF) \n

Examples:

123 \t إطلاق_noun \t 4 \t GPT4 \t قامت الصين بإطلاق أول قمر صناعي لها في عام 1970 \t FC \n

(Translation: China launched its first satellite in 1970.)

456 \t يد_noun \t 1 \t ChatGPT \t تحتوي يد الإنسان على أكثر من 17 عضلة \t FI \n

(Translation: The human hand contains more than 17 muscles. Correct info: تحتوي اليد البشرية على 34 عضلة, i.e., the human hand contains 34 muscles.)

789 \t خلاف_noun \t 2 \t ChatGPT \t خلاف بين الحكومة والمعارضة حول قانون العمل الجديد \t NF \n

(Translation: A disagreement between the government and the opposition over the new labor law.)


In Subtask A, participants should use these columns only: claim and label.

In Subtask B, participants can use all columns.
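
A minimal sketch for loading the files with pandas, assuming the column order shown above; whether the released files include a header row is an assumption to verify against the downloaded data.

    # Sketch: load the tab-separated training file; column names follow the format above.
    import pandas as pd

    cols = ["claim_id", "word_pos", "readability", "model", "claim", "label"]
    train = pd.read_csv("Arabic LLMs Hallucination-OSACT2024-Train.txt",
                        sep="\t", names=cols, header=None, encoding="utf-8")

    subtask_a = train[["claim", "label"]]   # Subtask A: claim text and label only
    subtask_b = train                       # Subtask B: all columns may be used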


Download training data from: https://alt.qcri.org/resources/OSACT2024/Arabic LLMs Hallucination-OSACT2024-Train.txt

Download development data from: https://alt.qcri.org/resources/OSACT2024/Arabic LLMs Hallucination-OSACT2024-Dev.txt


Note: Click on the datasets, and save them as .txt files.


Download testing data (without gold labels) from: https://alt.qcri.org/resources/OSACT2024/Arabic LLMs Hallucination-OSACT2024-Test-NoLabels.txt


Subtask A - Codalab link: https://codalab.lisn.upsaclay.fr/competitions/17755

Subtask B - Codalab link:  https://codalab.lisn.upsaclay.fr/competitions/17761

Submission of System Results:

Evaluation Criteria:

Classification systems will be evaluated using the macro-averaged F1-score for all subtasks.
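
For reference, the official metric can be reproduced locally with scikit-learn; the labels below are a toy example for illustration only.

    # Sketch: macro-averaged F1 over the three labels, as used for both subtasks.
    from sklearn.metrics import f1_score

    gold = ["FC", "FI", "NF", "FC"]   # gold labels (toy example)
    pred = ["FC", "NF", "NF", "FC"]   # system predictions (toy example)
    print(f1_score(gold, pred, labels=["FC", "FI", "NF"], average="macro"))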


Submission Format:

Predictions for the test and dev datasets (claim_id and label only) should be submitted as separate files in the following format, with a label for each corresponding claim:

For Subtasks A and B:

    claim_id \t label (values are: FC, FI, or NF)\n

The first line in the submission file should contain the header in the following format; it will be skipped during evaluation:

"claim_id" \t "label" \n

The following lines should then contain the actual claim IDs and labels, e.g.:

4584 \t FC \n

283 \t FI \n

1474 \t NF \n

and so on.


Participants can submit up to two system results (a primary submission for their best result and a secondary submission for the second-best result).

Official rankings will be based on primary submissions; results of secondary submissions will be reported for guidance. All participants are required to report results on the development and test sets in their papers.


Submission filename should be in the following format:

ParticipantName_Subtask<A/B>_<test/dev>_<1/2>.zip (a plain .txt file inside each .zip file)

Ex: QCRI_SubtaskA_test_1.zip (the best results for Subtask A for test dataset from QCRI team)

Ex: KSU_SubtaskB_dev_2.zip (the 2nd best results for Subtask B for dev dataset from KSU team)
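
Putting the format and naming rules together, a submission could be prepared roughly as follows; the team name, subtask, and predicted labels are placeholders.

    # Sketch: write the tab-separated predictions with the required header line,
    # then zip the .txt under the required name.
    import zipfile

    predictions = {4584: "FC", 283: "FI", 1474: "NF"}   # claim_id -> predicted label

    txt_name = "QCRI_SubtaskA_test_1.txt"
    with open(txt_name, "w", encoding="utf-8") as f:
        f.write('"claim_id"\t"label"\n')                # header line, skipped during evaluation
        for claim_id, label in predictions.items():
            f.write(f"{claim_id}\t{label}\n")

    with zipfile.ZipFile("QCRI_SubtaskA_test_1.zip", "w") as zf:
        zf.write(txt_name)                              # a plain .txt file inside the .zip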

Paper Submission

Please submit your paper using the START system.

Content Guidelines

Formatting Guidelines

Organizers

Questions?

Contact hmubarak@hbku.edu.qa for more information about the task.