Protocol for human evaluation of generative artificial intelligence chatbots in clinical consultations.

Author information

Chiu Edwin Kwan-Yeung, Chung Tom Wai-Hin

Affiliation

Department of Microbiology, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China.

Publication information

PLoS One. 2025 Mar 19;20(3):e0300487. doi: 10.1371/journal.pone.0300487. eCollection 2025.

DOI:10.1371/journal.pone.0300487

PMID:40106443

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11922213/

Abstract

BACKGROUND

Generative artificial intelligence (GenAI) has the potential to revolutionise healthcare delivery. The nuances of real-life clinical practice and complex clinical environments demand a rigorous, evidence-based approach to ensure safe and effective deployment of AI.

METHODS

We present a protocol for the systematic evaluation of large language models (LLMs) as GenAI chatbots in clinical microbiology and infectious diseases consultations. We aim to critically assess recommendations produced by four leading GenAI models: Claude 2, Gemini Pro, GPT-4.0, and a custom AI chatbot based on GPT-4.0.

DISCUSSION

A standardised, healthcare-specific, universal prompt template is developed to elicit clinically impactful AI responses. Generated responses will be graded by two panels of practicing clinicians, encompassing a wide spectrum of domain expertise in clinical microbiology and virology, as well as infectious diseases. Evaluations will be performed using a 5-point Likert scale across four clinical domains: factual consistency, comprehensiveness, coherence, and medical harmfulness. Our study will offer insights into the feasibility, limitations, and boundaries of GenAI in clinical consultations, providing guidance for future research and clinical implementation. Ethical guidelines and safety guardrails should be developed to uphold patient safety and clinical standards.
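
As an illustration only, the grading scheme described above (5-point Likert scores from clinician panels across four domains) could be recorded and summarised along the following lines. This is a minimal sketch, not the authors' analysis code: the `Rating` record, the `median_by_model_and_domain` helper, the panel and rater labels, and the choice of the median as the summary statistic are all assumptions made for the example. The abstract does not state which direction of the scale is favourable in each domain, so the sketch only checks that scores fall in the range 1 to 5.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median

# The four clinical domains named in the abstract.
DOMAINS = (
    "factual_consistency",
    "comprehensiveness",
    "coherence",
    "medical_harmfulness",
)


@dataclass(frozen=True)
class Rating:
    """One clinician's Likert score for one model response (hypothetical record)."""

    model: str   # e.g. "model_a"; labels are placeholders, not study data
    panel: str   # which of the two clinician panels the rater belongs to
    rater: str   # anonymised rater identifier
    domain: str  # one of DOMAINS
    score: int   # 5-point Likert score, 1..5

    def __post_init__(self):
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")
        if not isinstance(self.score, int) or not 1 <= self.score <= 5:
            raise ValueError(f"score must be an integer from 1 to 5, got {self.score!r}")


def median_by_model_and_domain(ratings):
    """Group scores by (model, domain) and return the median of each group.

    The median is an assumed summary statistic, chosen here because Likert
    scores are ordinal; the protocol may specify a different analysis.
    """
    grouped = defaultdict(list)
    for r in ratings:
        grouped[(r.model, r.domain)].append(r.score)
    return {key: median(scores) for key, scores in grouped.items()}
```

A caller would build one `Rating` per rater, response, and domain, then pass the list to `median_by_model_and_domain` to obtain a dictionary keyed by `(model, domain)`. Validating at construction time means an out-of-range score or a misspelt domain is rejected before it can reach the summary.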
