Abbasian Mahyar, Khatibi Elahe, Azimi Iman, Oniani David, Shakeri Hossein Abad Zahra, Thieme Alexander, Sriram Ram, Yang Zhongqi, Wang Yanshan, Lin Bryant, Gevaert Olivier, Li Li-Jia, Jain Ramesh, Rahmani Amir M
University of California, Irvine, CA, USA.
HealthUnity, Palo Alto, CA, USA.
NPJ Digit Med. 2024 Mar 29;7(1):82. doi: 10.1038/s41746-024-01074-z.
Generative Artificial Intelligence is set to revolutionize healthcare delivery by transforming traditional patient care into a more personalized, efficient, and proactive process. Chatbots, serving as interactive conversational models, are likely to drive this patient-centered transformation in healthcare. By providing services such as diagnosis, personalized lifestyle recommendations, dynamic scheduling of follow-ups, and mental health support, they aim to substantially improve patient health outcomes while reducing the workload on healthcare providers. The life-critical nature of healthcare applications necessitates a unified and comprehensive set of evaluation metrics for conversational models. Existing evaluation metrics proposed for generic large language models (LLMs) fail to capture medical and health concepts and their significance in promoting patients' well-being. Moreover, these metrics neglect pivotal user-centered aspects, including trust-building, ethics, personalization, empathy, user comprehension, and emotional support. The purpose of this paper is to explore state-of-the-art LLM-based evaluation metrics that are specifically applicable to the assessment of interactive conversational models in healthcare. We then present a comprehensive set of evaluation metrics designed to thoroughly assess the performance of healthcare chatbots from an end-user perspective. These metrics encompass language processing abilities, impact on real-world clinical tasks, and effectiveness in user-interactive conversations. Finally, we discuss the challenges of defining and implementing these metrics, with particular emphasis on confounding factors such as the target audience, evaluation methods, and prompt techniques involved in the evaluation process.