Abbasian Mahyar, Khatibi Elahe, Azimi Iman, Oniani David, Shakeri Hossein Abad Zahra, Thieme Alexander, Sriram Ram, Yang Zhongqi, Wang Yanshan, Lin Bryant, Gevaert Olivier, Li Li-Jia, Jain Ramesh, Rahmani Amir M
University of California, Irvine, CA, USA.
HealthUnity, Palo Alto, CA, USA.
NPJ Digit Med. 2024 Mar 29;7(1):82. doi: 10.1038/s41746-024-01074-z.
Generative Artificial Intelligence is set to revolutionize healthcare delivery by transforming traditional patient care into a more personalized, efficient, and proactive process. Chatbots, serving as interactive conversational models, are likely to drive this patient-centered transformation in healthcare. By providing services such as diagnosis, personalized lifestyle recommendations, dynamic scheduling of follow-ups, and mental health support, they aim to substantially improve patient health outcomes while reducing the workload on healthcare providers. The life-critical nature of healthcare applications necessitates a unified and comprehensive set of evaluation metrics for conversational models. Existing evaluation metrics proposed for generic large language models (LLMs) fail to capture medical and health concepts and their significance in promoting patients' well-being. Moreover, these metrics neglect pivotal user-centered aspects, including trust-building, ethics, personalization, empathy, user comprehension, and emotional support. The purpose of this paper is to explore state-of-the-art LLM-based evaluation metrics that are specifically applicable to the assessment of interactive conversational models in healthcare. We then present a comprehensive set of evaluation metrics designed to thoroughly assess the performance of healthcare chatbots from an end-user perspective. These metrics encompass language processing abilities, impact on real-world clinical tasks, and effectiveness in user-interactive conversations. Finally, we discuss the challenges of defining and implementing these metrics, with particular emphasis on confounding factors such as the target audience, evaluation methods, and prompt techniques involved in the evaluation process.