Tam Thomas Yu Chow, Sivarajkumar Sonish, Kapoor Sumit, Stolyar Alisa V, Polanska Katelyn, McCarthy Karleigh R, Osterhoudt Hunter, Wu Xizhi, Visweswaran Shyam, Fu Sunyang, Mathur Piyush, Cacciamani Giovanni E, Sun Cong, Peng Yifan, Wang Yanshan
Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA.
Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA.
NPJ Digit Med. 2024 Sep 28;7(1):258. doi: 10.1038/s41746-024-01258-7.
With generative artificial intelligence (GenAI), particularly large language models (LLMs), continuing to make inroads in healthcare, assessing LLMs with human evaluations is essential to ensuring safety and effectiveness. This study reviews the existing literature on human evaluation methodologies for LLMs in healthcare across various medical specialties and addresses factors such as evaluation dimensions, sample types and sizes, selection and recruitment of evaluators, frameworks and metrics, the evaluation process, and the type of statistical analysis. Our literature review of 142 studies reveals gaps in the reliability, generalizability, and applicability of current human evaluation practices. To overcome these significant obstacles to the development and deployment of healthcare LLMs, we propose QUEST, a comprehensive and practical framework for human evaluation of LLMs covering three phases of workflow: Planning, Implementation and Adjudication, and Scoring and Review. QUEST is designed around five proposed evaluation principles: Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence.