LLM Hallucination

How LLMs React to Industrial Spatio-Temporal Data? Assessing Hallucination with a Novel Traffic Incident Benchmark Dataset

Accepted by NAACL 2025! 🎉 See you in Albuquerque, US, from April 28 – May 5. Stay tuned! 🚀

Recent advances in Large Language Models (LLMs) have unlocked a new range of opportunities, with models like GPT-4 performing strongly on a variety of NLP tasks [1]. However, LLMs such as ChatGPT are also known to hallucinate facts and generate non-factual statements, which undermines the trustworthiness of their outputs [2].

Techniques such as SelfCheckGPT use the stochasticity of sampled responses to assess the factuality of generated outputs [2]. At the GenAI Hackathon, we were among the first teams to build a functional GenAI app: Automated Incident Tagging & Reporting for Public Transport.

In this R&D project we aim to build this benchmark and verify our hypothesis in order to mitigate the hallucination problem in LLMs, especially for time/date inputs and for non-English inputs that require logical reasoning over index-tagged embeddings. The outcome could be very useful for implementing GenAI applications in industry in local languages such as German, Chinese, and French, and for processing very complex, long text data.

[1] Yejin Bang et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023.

[2] Potsawee Manakul et al. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896, 2023.

Hallucination Problem in LLM Responses

correctness

LLM-generated results should be trustworthy: the model must not fabricate results or generate non-factual statements [3]

verifiability

LLM-generated results should be reproducible: other researchers can trace the record and reproduce the results through experiments [3]

[3] Jin et al. Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries. arXiv preprint arXiv:2310.13132, 2023.

consistency

LLM-generated results should be robust to different temperature parameter settings, producing stable answers that do not introduce ambiguity [3]
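One simple way to make the consistency criterion measurable is to sample the same question several times per temperature setting and check how often the answers agree. The sketch below is only an illustration under our own assumptions: the function names (`normalize`, `consistency_score`) and the sample answers are hypothetical, and exact-match agreement is a deliberately crude stand-in, not the metric used in the paper.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not count as disagreement."""
    return " ".join(answer.lower().split())


def consistency_score(answers):
    """Share of answers that agree with the most frequent answer (1.0 means fully stable)."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


# Hypothetical answers to one question, sampled three times per temperature.
samples_by_temperature = {
    0.0: ["Line U1", "line u1", "Line U1"],
    0.7: ["Line U1", "Line U2", "Line  U1"],
    1.0: ["Line U4", "Line U2", "Line U1"],
}

scores = {t: consistency_score(a) for t, a in samples_by_temperature.items()}
for temperature, score in scores.items():
    print(f"temperature={temperature}: consistency={score:.2f}")
```

A score that drops sharply as the temperature rises suggests the answer is not stable. Free-text answers would likely need a softer comparison (for example semantic similarity) than the exact match assumed here.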

"This paper introduces a novel corpus of nearly 100,000 traffic incident records from Vienna, aimed at evaluating the spatio-temporal reasoning capabilities of large language models (LLMs). The authors compare the performance of several state-of-the-art (SOTA) LLMs on a small benchmark of spatio-temporal questions and assess the impact of data preprocessing techniques on reducing LLM hallucinations. They also incorporate a retrieval-augmented generation (RAG) component to improve the system's performance. The dataset of traffic incidents is valuable for both academic researchers and developers in industrial applications. The comparison of multiple LLMs on challenging spatio-temporal reasoning tasks provides insightful analysis. The evaluation of data preprocessing techniques offers practical guidance for optimizing data workflows in generative AI."

-- Meta Reviewer

"The paper addresses a critical issue in LLM applications with a multi-pronged strategy, combining various detection methods to improve accuracy and coverage. The authors develop a detailed hallucination taxonomy, providing valuable insights into the nature and distribution of different types of hallucinations. The work is well-grounded in existing research while offering innovative solutions. "

-- Fellow Reviewers

"The findings reveal significant performance differences and the effectiveness of RAG."

-- Fellow Reviewers

Reference paper & Current Existing SOTA Benchmark

Our Code

(released now!)

Our Benchmark Download

(released now!)

News: You can now access our H&PS 2025 Dataset! You can also use our data with LlamaIndex Vertex AI for RAG!

Paper Poster

(released now!)

NAACL 2025 Paper ID 21-Industry Track A0 Poster.pdf

GenAI Platform for Wien Incident Classification.mp4

Project

The GenAI system independently suggests new tags based on incident descriptions. To gain better insight into the incidents, meaningful and visually appealing reports are generated using the newly created tags and GenAI.

Hypothesis

Multiple linear regression for hypothesis testing
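This page does not state which variables enter the regression, so the following is only a minimal sketch under assumed inputs: a hypothetical hallucination rate as the response, with a preprocessing indicator and an input length as predictors. The data are synthetic and invented for illustration, and the helper names (`invert`, `ols`) are our own. It fits ordinary least squares and reports a t-statistic per coefficient, which can then be compared against a critical value to test whether a predictor has an effect.

```python
from math import sqrt


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols(rows, y):
    """Fit y = b0 + b1*x1 + ... by least squares; return (coefficients, t_statistics)."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    dof = len(y) - len(X[0])
    if dof <= 0:
        raise ValueError("need more observations than coefficients")
    Xt = transpose(X)
    xtx_inv = invert(matmul(Xt, X))
    beta = [b[0] for b in matmul(xtx_inv, matmul(Xt, [[v] for v in y]))]
    residuals = [yi - sum(b * x for b, x in zip(beta, xi)) for xi, yi in zip(X, y)]
    sigma2 = sum(e * e for e in residuals) / dof  # residual variance estimate
    se = [sqrt(sigma2 * xtx_inv[i][i]) for i in range(len(beta))]
    return beta, [b / s for b, s in zip(beta, se)]


# Synthetic example: (preprocessed? 0/1, input length in thousands of tokens).
rows = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 1), (1, 2), (1, 3), (1, 4)]
# Synthetic hallucination rates, invented for illustration only.
y = [0.33, 0.33, 0.35, 0.39, 0.21, 0.25, 0.27, 0.27]

beta, t_stats = ols(rows, y)
for name, b, t in zip(["intercept", "preprocessed", "length"], beta, t_stats):
    print(f"{name}: coefficient={b:.3f}, t={t:.2f}")
```

With 8 observations and 3 coefficients the test has 5 degrees of freedom; a coefficient whose |t| exceeds the two-sided critical value for the chosen significance level would be judged different from zero. A statistics package would normally be used in practice to obtain p-values directly.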

Models

GPT series, TinyLlama, Claude-3-Haiku, Claude-3-Haiku-200K, Claude-3-Sonnet, Gemini-Pro 1.0, Mistral Medium, Mistral-8x7B, Llama-3-70B-T, and Llama-3-70b Inst-FW

Questions?
