Accepted by NAACL 2025! 🎉 See you in Albuquerque, US, from April 28 – May 5. Stay tuned! 🚀
See Our Paper🎉
[1] Yejin Bang et al. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023.
[2] Potsawee Manakul et al. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896, 2023.
LLM-generated results should be trustworthy: the model must not fabricate results or generate non-factual statements [3].
LLM-generated results should be reproducible: other researchers should be able to trace the record and reproduce the results through experiments [3].
[3] Jin et al. Better to ask in English: Cross-lingual evaluation of large language models for healthcare queries. arXiv preprint arXiv:2310.13132, 2023.
LLM-generated results should be robust to different temperature settings, yielding stable answers that do not introduce ambiguity [3].
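As a rough illustration of the stability criterion, the sketch below scores how consistently a model answers one question when sampled at several temperatures. It is not the evaluation used in the paper: the function names `normalize` and `stability_score`, the majority-agreement metric, and the sample answers are all hypothetical, and collecting the answers from an actual LLM is left out.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def stability_score(answers_by_temperature: dict[float, list[str]]) -> float:
    """Return the share of all sampled answers that equal the majority answer.

    A score of 1.0 means every sample at every temperature agreed.
    """
    all_answers = [
        normalize(answer)
        for samples in answers_by_temperature.values()
        for answer in samples
    ]
    if not all_answers:
        raise ValueError("no answers to score")
    _, majority_count = Counter(all_answers).most_common(1)[0]
    return majority_count / len(all_answers)


# Hypothetical answers to a single question, sampled at three temperatures.
samples = {
    0.0: ["12 incidents", "12 incidents"],
    0.7: ["12 incidents", "12  Incidents"],
    1.0: ["12 incidents", "15 incidents"],
}
score = stability_score(samples)
print(f"stability: {score:.2f}")
```

Exact string matching after normalization is a deliberately simple choice; free-form answers would likely need a semantic comparison instead.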
"This paper introduces a novel corpus of nearly 100,000 traffic incident records from Vienna, aimed at evaluating the spatio-temporal reasoning capabilities of large language models (LLMs). The authors compare the performance of several state-of-the-art (SOTA) LLMs on a small benchmark of spatio-temporal questions and assess the impact of data preprocessing techniques on reducing LLM hallucinations. They also incorporate a retrieval-augmented generation (RAG) component to improve the system's performance. The dataset of traffic incidents is valuable for both academic researchers and developers in industrial applications. The comparison of multiple LLMs on challenging spatio-temporal reasoning tasks provides insightful analysis. The evaluation of data preprocessing techniques offers practical guidance for optimizing data workflows in generative AI."
-- Meta Reviewer
"The paper addresses a critical issue in LLM applications with a multi-pronged strategy, combining various detection methods to improve accuracy and coverage. The authors develop a detailed hallucination taxonomy, providing valuable insights into the nature and distribution of different types of hallucinations. The work is well-grounded in existing research while offering innovative solutions."
-- Fellow Reviewers
"The findings reveal significant performance differences and the effectiveness of RAG."
-- Fellow Reviewers
(released now!)
(released now!)
News: Our dataset is now available on Google Cloud Storage, so you can use it with LlamaIndex and Vertex AI for RAG!
Paper Poster
(released now!)
The GenAI system independently suggests new tags based on incident descriptions. To gain better insight into the incidents, meaningful and visually appealing reports are generated using the newly created tags and GenAI.
Multiple linear regression for hypothesis testing
GPT series, TinyLlama model, Claude-3-Haiku, Claude-3-Haiku-200K, Claude-3-Sonnet, Gemini-Pro 1.0, Mistral Medium, Mistral-8x7B, Llama-3-70B-T and Llama-3-70b Inst-FW
Contact us to get more information on the project!