Alduhailan Hessah W, Alshamari Majed A, Wahsheh Heider A M
Department of Information Systems, College of Computer Science and Information Technology, King Faisal University, Al-Ahsa 31982, Saudi Arabia.
Healthcare (Basel). 2025 Jul 26;13(15):1829. doi: 10.3390/healthcare13151829.
Background: Artificial intelligence (AI) symptom-checker apps are proliferating, yet their everyday usability and transparency remain under-examined. This study provides a triangulated evaluation of three widely used AI-powered mHealth apps: ADA, Mediktor, and WebMD. Methods: Five usability experts applied a 13-item AI-specific heuristic checklist. In parallel, thirty lay users (18-65 years) completed five health-scenario tasks on each app, while task success, errors, completion time, and System Usability Scale (SUS) ratings were recorded. A repeated-measures ANOVA followed by paired-sample t-tests was conducted to compare SUS scores across the three applications. Results: The analysis revealed statistically significant differences in usability across the apps. ADA achieved a significantly higher mean SUS score than both Mediktor (p = 0.0004) and WebMD (p < 0.001), while Mediktor also outperformed WebMD (p = 0.0009). Common issues across all apps included vague AI outputs, limited feedback for input errors, and inconsistent navigation. Each application also failed key explainability heuristics, offering no confidence scores or interpretable rationales for AI-generated recommendations. Conclusions: Even highly rated AI mHealth apps display critical gaps in explainability and error handling. Embedding explainable AI (XAI) cues such as confidence indicators, input validation, and transparent justifications can enhance user trust, safety, and overall adoption in real-world healthcare contexts.
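The reported analysis pipeline (a one-way repeated-measures ANOVA on SUS scores followed by post hoc paired-sample t-tests) can be sketched as follows. This is a minimal illustration on hypothetical SUS data, since the per-user scores are not included in the abstract; the group means and the Bonferroni correction used here are assumptions, not values from the study.

```python
# Hedged sketch of the statistical comparison described in the abstract:
# repeated-measures ANOVA on SUS scores across three apps, then post hoc
# paired-sample t-tests. The data below are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_users = 30  # thirty lay users, each rating all three apps

# Hypothetical SUS scores (0-100) per user for each app.
sus = {
    "ADA":      np.clip(rng.normal(80, 8, n_users), 0, 100),
    "Mediktor": np.clip(rng.normal(72, 8, n_users), 0, 100),
    "WebMD":    np.clip(rng.normal(65, 8, n_users), 0, 100),
}

# One-way repeated-measures ANOVA, computed from sums of squares.
data = np.column_stack(list(sus.values()))  # shape (n_users, 3)
grand = data.mean()
ss_cond = n_users * ((data.mean(axis=0) - grand) ** 2).sum()  # between-app
ss_subj = data.shape[1] * ((data.mean(axis=1) - grand) ** 2).sum()  # between-user
ss_err = ((data - grand) ** 2).sum() - ss_cond - ss_subj
df_cond = data.shape[1] - 1
df_err = df_cond * (n_users - 1)
F = (ss_cond / df_cond) / (ss_err / df_err)
p_anova = stats.f.sf(F, df_cond, df_err)
print(f"RM-ANOVA: F({df_cond},{df_err}) = {F:.2f}, p = {p_anova:.4g}")

# Post hoc paired-sample t-tests, Bonferroni-corrected for three comparisons.
pairs = [("ADA", "Mediktor"), ("ADA", "WebMD"), ("Mediktor", "WebMD")]
for a, b in pairs:
    t, p = stats.ttest_rel(sus[a], sus[b])
    print(f"{a} vs {b}: t = {t:.2f}, corrected p = {min(p * len(pairs), 1):.4g}")
```

The paired (rather than independent) t-test is the appropriate post hoc choice here because every participant rated all three apps, so the scores are matched within users.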