Massenon Rhodes, Gambo Ishaya, Khan Javed Ali, Agbonkhese Christopher, Alwadain Ayed
Department of Software Engineering, Obafemi Awolowo University, Ile-Ife, Nigeria.
Department of Computer Science, University of Hertfordshire, Hatfield, UK.
Sci Rep. 2025 Aug 19;15(1):30397. doi: 10.1038/s41598-025-15416-8.
Large Language Models (LLMs) are increasingly integrated into AI-powered mobile applications, offering novel functionalities but also introducing the risk of "hallucinations": the generation of plausible yet incorrect or nonsensical information. These AI errors can significantly degrade user experience and erode trust. However, there is limited empirical understanding of how users perceive, report, and are affected by LLM hallucinations in real-world mobile app settings. This paper presents a large-scale empirical study analyzing 3 million user reviews from 90 diverse AI-powered mobile apps to characterize these user-reported issues. Using a mixed-methods approach, a heuristic-based User-Reported LLM Hallucination Detection algorithm was applied to identify 20,000 candidate reviews, of which 1,000 were manually annotated. This analysis estimates the prevalence of user reports indicative of LLM hallucinations at approximately 1.75% of the reviews initially flagged as relevant to AI errors. A data-driven taxonomy of seven user-perceived LLM hallucination types was developed, with Factual Incorrectness (H1) emerging as the most frequently reported type, accounting for 38% of instances, followed by Nonsensical/Irrelevant Output (H3) at 25% and Fabricated Information (H2) at 15%. Furthermore, linguistic patterns were identified using N-gram generation and Non-Negative Matrix Factorization (NMF) topic modeling, and sentiment characteristics were analyzed with VADER, which showed significantly lower sentiment scores for reviews reporting hallucinations. These findings offer critical implications for software quality assurance, highlighting the need for targeted monitoring and mitigation strategies for AI-powered mobile apps. This research provides a foundational, user-centric understanding of LLM hallucinations, paving the way for improved AI model development and more trustworthy mobile applications.
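To illustrate the kind of analysis pipeline the abstract describes (heuristic candidate filtering, VADER sentiment scoring, and NMF topic modeling over n-grams), the following is a minimal Python sketch. The keyword list, column layout, and thresholds are hypothetical and do not reproduce the paper's actual detection algorithm; only the general technique is shown.

```python
# Sketch of a review-analysis pipeline: heuristic filtering, VADER sentiment,
# and NMF topics over TF-IDF n-grams. Keyword hints and sample data are
# illustrative placeholders, not the study's actual heuristics or corpus.
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Hypothetical hint phrases for AI-error/hallucination reports.
HALLUCINATION_HINTS = ["made up", "wrong answer", "incorrect", "nonsense", "fabricated"]


def flag_candidates(reviews: pd.Series) -> pd.Series:
    """Heuristic filter: keep reviews containing any hint phrase."""
    pattern = "|".join(HALLUCINATION_HINTS)
    return reviews[reviews.str.contains(pattern, case=False, na=False)]


def sentiment_scores(reviews: pd.Series) -> pd.Series:
    """VADER compound sentiment per review (range -1 to 1)."""
    analyzer = SentimentIntensityAnalyzer()
    return reviews.apply(lambda text: analyzer.polarity_scores(text)["compound"])


def nmf_topics(reviews: pd.Series, n_topics: int = 7, n_top_words: int = 8):
    """Fit NMF on TF-IDF unigrams/bigrams and return top words per topic."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", max_features=5000)
    tfidf = vectorizer.fit_transform(reviews)
    model = NMF(n_components=n_topics, random_state=0)
    model.fit(tfidf)
    vocab = vectorizer.get_feature_names_out()
    return [
        [vocab[i] for i in component.argsort()[::-1][:n_top_words]]
        for component in model.components_
    ]


if __name__ == "__main__":
    # Tiny illustrative sample, not real review data.
    reviews = pd.Series([
        "The chatbot made up a source that does not exist",
        "Great app, love the new design",
        "Gave me a completely wrong answer about my flight",
    ])
    candidates = flag_candidates(reviews)
    print(candidates.tolist())
    print(sentiment_scores(candidates).tolist())
```

Under these assumptions, the heuristic filter plays the role of the candidate-detection step, and comparing VADER compound scores between flagged and unflagged reviews mirrors the sentiment comparison reported in the study.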