Reproducible generative artificial intelligence evaluation for health care: a clinician-in-the-loop approach.

Author information

Livingston Leah, Featherstone-Uwague Amber, Barry Amanda, Barretto Kenneth, Morey Tara, Herrmannova Drahomira, Avula Venkatesh

Affiliations

Elsevier, Health Markets, Philadelphia, PA 19103, United States.

Publication information

JAMIA Open. 2025 Jun 16;8(3):ooaf054. doi: 10.1093/jamiaopen/ooaf054. eCollection 2025 Jun.

DOI:10.1093/jamiaopen/ooaf054

PMID:40524837

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12169418/

Abstract

OBJECTIVES

To develop and apply a reproducible methodology for evaluating generative artificial intelligence (AI)-powered systems in health care, addressing the gap between theoretical evaluation frameworks and practical implementation guidance.

MATERIALS AND METHODS

A 5-dimension evaluation framework was developed to assess query comprehension and response helpfulness, correctness, completeness, and potential clinical harm. The framework was applied to evaluate ClinicalKey AI using queries drawn from user logs, a benchmark dataset, and a set curated by subject matter experts. Forty-one board-certified physicians and pharmacists were recruited to independently evaluate query-response pairs. An agreement protocol based on the mode and a modified Delphi method resolved disagreements in evaluation scores.
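The abstract does not spell out how the agreement protocol operates, so the following is only a minimal sketch of one way a mode-based resolution with a tie-breaker could work. It assumes two independent ratings per query-response pair on an integer scale, a third rating when they differ, and escalation to a panel discussion (standing in for the modified Delphi step) when no unique mode exists. The function names, the integer scale, and the escalation order are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def unique_mode(scores: Sequence[int]) -> Optional[int]:
    """Return the score chosen by at least two raters if it is the single
    most common score; otherwise return None."""
    if not scores:
        return None
    ranked = Counter(scores).most_common()
    top_score, top_count = ranked[0]
    if top_count < 2:
        return None  # every rater gave a different score
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None  # two scores tie for most common
    return top_score


def resolve_pair(
    first: int, second: int, tie_breaker: Optional[int] = None
) -> Tuple[Optional[int], str]:
    """Hypothetical resolution flow for one evaluation dimension."""
    if first == second:
        return first, "pairwise consensus"
    if tie_breaker is None:
        return None, "needs tie-breaker review"
    mode = unique_mode([first, second, tie_breaker])
    if mode is not None:
        return mode, "mode after tie-breaker"
    # Assumption: a three-way split goes to a structured panel discussion.
    return None, "escalate to panel discussion"


if __name__ == "__main__":
    print(resolve_pair(4, 4))
    print(resolve_pair(4, 3, tie_breaker=3))
    print(resolve_pair(4, 3, tie_breaker=2))
```

Returning a status string alongside the score keeps an audit trail of how each final score was reached, which makes it straightforward to report how often each resolution path was used.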

RESULTS

Of 633 queries, 614 (96.99%) produced evaluable responses, and subject matter experts completed evaluations of 426 query-response pairs. Results showed high rates of response correctness (95.5%) and query comprehension (98.6%), with 94.4% of responses rated as helpful. Two responses (0.47%) received scores indicating potential clinical harm. Pairwise consensus occurred in 60.6% of evaluations; the remaining cases required a third, tie-breaker review.

DISCUSSION

The framework proved effective at quantifying performance through comprehensive evaluation dimensions and structured methods for resolving scores. Key strengths included representative query sampling, standardized rating scales, and robust subject matter expert agreement protocols. Challenges emerged in managing subjective assessments of open-ended responses and in achieving consensus on the classification of potential harm.

CONCLUSION

This framework offers a reproducible methodology for evaluating health-care generative AI systems, establishing foundational processes that can inform future efforts while supporting the implementation of generative AI applications in clinical settings.
