Arabic LLMs Hallucination-Shared Task

Detection of Hallucination in Arabic Factual Claims Generated by ChatGPT and GPT4



@ OSACT 2024 Workshop, LREC-COLING 2024 

Torino, Italy, 20-25 May, 2024 

Task Overview


Large Language Models (LLMs) have shown superb abilities to generate texts that are, in many cases, indistinguishable from human-written texts. However, they sometimes generate false, incorrect, or misleading content, which is often described as “hallucinations.”


Addressing hallucinations in LLMs for the Arabic language not only enhances the reliability and applicability of these models, but also holds potential implications for a wide array of applications including information retrieval, sentiment analysis, and machine translation.


In this shared task, we share the first Arabic dataset on LLM hallucination, containing 10K sentences generated by GPT-3.5 (aka ChatGPT) and GPT-4 (GPT4) and annotated for factuality and correctness.


Data Collection:

We chose 1,000 random words from the SAMER Arabic readability lexicon (Al Khalil et al., 2020) and asked ChatGPT and GPT4 to generate five verifiable factual sentences (or claims) for each word.


The prompt used is: "Give exactly FIVE Arabic complete and diverse factual sentences having the following word: {word}. These sentences should have facts that can be checked and verified. Write the sentences separated by a new line without translation and without numbering"
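
For illustration only, the generation step could look roughly like the sketch below; the lexicon file name, the sampling, and the client calls are assumptions, not the exact pipeline used to build the dataset.

    # Sketch: sample seed words and ask a chat model for five verifiable claims per word.
    # Assumptions: a plain-text word list (samer_words.txt) and the openai Python client (v1+);
    # model and file names are illustrative.
    import random
    from openai import OpenAI

    PROMPT = ("Give exactly FIVE Arabic complete and diverse factual sentences having the "
              "following word: {word}. These sentences should have facts that can be checked "
              "and verified. Write the sentences separated by a new line without translation "
              "and without numbering")

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("samer_words.txt", encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]

    for word in random.sample(words, 1000):
        response = client.chat.completions.create(
            model="gpt-4",  # repeated with a GPT-3.5 model for the ChatGPT half of the data
            messages=[{"role": "user", "content": PROMPT.format(word=word)}],
        )
        claims = [c for c in response.choices[0].message.content.split("\n") if c.strip()]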


Annotation:

We extensively trained 50 final-year students from Al-Imam University in Saudi Arabia. Each student annotated 200 random sentences generated by ChatGPT or GPT4. For quality control, we annotated 50 randomly selected generated sentences as test questions and inserted them randomly among the sentences assigned to each student. The average agreement with these test questions is 87%.
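
For clarity, agreement with the test questions is simply the share of hidden gold items an annotator labeled correctly; a minimal sketch, with hypothetical data structures:

    # Sketch: per-annotator agreement with the gold test questions hidden in each batch.
    # `annotations` maps annotator -> list of (claim_id, label); `gold` maps claim_id -> label.
    def agreement_with_test_questions(annotations, gold):
        scores = {}
        for annotator, items in annotations.items():
            checked = [(cid, lab) for cid, lab in items if cid in gold]
            if checked:
                scores[annotator] = sum(lab == gold[cid] for cid, lab in checked) / len(checked)
        return scores  # averaging these per-annotator values gives the reported 87%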


Each sentence was labeled as Factually Correct (FC), Factually Incorrect (FI), or Not Factual (NF).


In addition, annotators marked sentences containing linguistic errors and corrected them, and they added the reference links used for verification. These fields will not be used in this shared task. Examples are shown in Table 1.




We will have two shared subtasks:



Subtask A: Hallucination detection given the generated claim text only.

Labels for this task are: Factually Correct (FC), Factually Incorrect (FI), or Not Factual (NF).



Subtask B: Hallucination detection given the claim text together with the additional information provided in the dataset (source word, readability level, and generating model).

Labels are the same as in Subtask A.



We will use the Codalab evaluation platform to manage the submissions.


Data will be split into 70% for training, 10% for development, and 20% for testing.


We encourage participants to use this data and/or any other external data (previous datasets, lexicons, in-house data, etc.) and try to explain model behavior.


License:

The data is made public for research purposes only (non-commercial use).

Important Dates

All times are Anywhere on Earth (AoE).

Dataset

The data is in a tab-separated format as follows:

Subtask A and Subtask B:

claim_id \t word_pos \t readability \t model \t claim \t label (values are FC/FI/NF) \n

Examples:

123 \t إطلاق_noun \t 4 \t GPT4 \t قامت الصين بإطلاق أول قمر صناعي لها في عام 1970 \t FC \n

(Translation: China launched its first satellite in 1970.)

456 \t يد_noun \t 1 \t ChatGPT \t تحتوي يد الإنسان على أكثر من 17 عضلة \t FI \n

(Translation: The human hand contains more than 17 muscles. Correct info: تحتوي اليد البشرية على 34 عضلة, i.e., the human hand contains 34 muscles.)

789 \t خلاف_noun \t 2 \t ChatGPT \t خلاف بين الحكومة والمعارضة حول قانون العمل الجديد \t NF \n

(Translation: A disagreement between the government and the opposition over the new labor law.)


In Subtask A, participants should use these columns only: claim and label.

In Subtask B, participants can use all columns.
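
A minimal sketch for loading the files with pandas, assuming the column order shown above; whether the released files include a header row is an assumption to verify against the downloaded data.

    # Sketch: load the tab-separated training file; column names follow the format above.
    import pandas as pd

    cols = ["claim_id", "word_pos", "readability", "model", "claim", "label"]
    train = pd.read_csv("Arabic LLMs Hallucination-OSACT2024-Train.txt",
                        sep="\t", names=cols, header=None, encoding="utf-8")

    subtask_a = train[["claim", "label"]]   # Subtask A: claim text and label only
    subtask_b = train                       # Subtask B: all columns may be used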


Download training data from: https://alt.qcri.org/resources/OSACT2024/Arabic LLMs Hallucination-OSACT2024-Train.txt

Download development data from: https://alt.qcri.org/resources/OSACT2024/Arabic LLMs Hallucination-OSACT2024-Dev.txt


Note: Click on the datasets, and save them as .txt files.


Download testing data (without gold labels) from: https://alt.qcri.org/resources/OSACT2024/Arabic LLMs Hallucination-OSACT2024-Test-NoLabels.txt


Subtask A - Codalab link: https://codalab.lisn.upsaclay.fr/competitions/17755

Subtask B - Codalab link:  https://codalab.lisn.upsaclay.fr/competitions/17761

Submission of System Results:

Evaluation Criteria:

Classification systems will be evaluated using the macro-averaged F1-score for all subtasks.
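
For reference, the official metric can be reproduced locally with scikit-learn; the labels below are a toy example for illustration only.

    # Sketch: macro-averaged F1 over the three labels, as used for both subtasks.
    from sklearn.metrics import f1_score

    gold = ["FC", "FI", "NF", "FC"]   # gold labels (toy example)
    pred = ["FC", "NF", "NF", "FC"]   # system predictions (toy example)
    print(f1_score(gold, pred, labels=["FC", "FI", "NF"], average="macro"))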


Submission Format:

Predictions for the test and dev datasets (claim_id and label only) should be submitted as separate files in the following format, with a label for each corresponding claim:

For Subtasks A and B:

    claim_id \t label (values are: FC, FI, or NF)\n

The first line in the submission file should contain the header in the following format; it will be skipped during evaluation:

"claim_id" \t "label" \n

The following lines should then contain the actual claim IDs and labels, e.g.:

4584 \t FC \n

283 \t FI \n

1474 \t NF \n

and so on.


Participants can submit up to two system results (a primary submission for their best result and a secondary submission for the second-best result).

Official rankings will be based on primary submissions; results of secondary submissions will be reported for guidance. All participants are required to report results on the development and test sets in their papers.


Submission filename should be in the following format:

ParticipantName_Subtask<A/B>_<test/dev>_<1/2>.zip (a plain .txt file inside each .zip file)

Ex: QCRI_SubtaskA_test_1.zip (the best results for Subtask A for test dataset from QCRI team)

Ex: KSU_SubtaskB_dev_2.zip (the 2nd best results for Subtask B for dev dataset from KSU team)
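
Putting the format and naming rules together, a submission could be prepared roughly as follows; the team name, subtask, and predicted labels are placeholders.

    # Sketch: write the tab-separated predictions with the required header line,
    # then zip the .txt under the required name.
    import zipfile

    predictions = {4584: "FC", 283: "FI", 1474: "NF"}   # claim_id -> predicted label

    txt_name = "QCRI_SubtaskA_test_1.txt"
    with open(txt_name, "w", encoding="utf-8") as f:
        f.write('"claim_id"\t"label"\n')                # header line, skipped during evaluation
        for claim_id, label in predictions.items():
            f.write(f"{claim_id}\t{label}\n")

    with zipfile.ZipFile("QCRI_SubtaskA_test_1.zip", "w") as zf:
        zf.write(txt_name)                              # a plain .txt file inside the .zip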

Paper Submission

Please submit your paper using the START system.

Content Guidelines

Formatting Guidelines

Organizers

Questions?

Contact hmubarak@hbku.edu.qa for more information about the task.