

Reproducible generative artificial intelligence evaluation for health care: a clinician-in-the-loop approach.

Author Information

Livingston Leah, Featherstone-Uwague Amber, Barry Amanda, Barretto Kenneth, Morey Tara, Herrmannova Drahomira, Avula Venkatesh

Affiliation

Elsevier, Health Markets, Philadelphia, PA 19103, United States.

Publication Information

JAMIA Open. 2025 Jun 16;8(3):ooaf054. doi: 10.1093/jamiaopen/ooaf054. eCollection 2025 Jun.

Abstract

OBJECTIVES

To develop and apply a reproducible methodology for evaluating generative artificial intelligence (AI) powered systems in health care, addressing the gap between theoretical evaluation frameworks and practical implementation guidance.

MATERIALS AND METHODS

A 5-dimension evaluation framework was developed to assess query comprehension and response helpfulness, correctness, completeness, and potential clinical harm. The framework was applied to evaluate ClinicalKey AI using queries drawn from user logs, a benchmark dataset, and queries curated by subject matter experts. Forty-one board-certified physicians and pharmacists were recruited to independently evaluate query-response pairs. An agreement protocol based on the mode and a modified Delphi method resolved disagreements in evaluation scores.
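The agreement protocol described above can be sketched as a small scoring-resolution routine. This is a minimal illustration, not the authors' actual implementation: it assumes each query-response pair receives two independent ratings, with a third tie-breaker rating requested on disagreement and the mode of the three taken as the final score; ties among three distinct scores are escalated to the (unmodeled) modified-Delphi discussion round. The function name and return convention are hypothetical.

```python
from collections import Counter

def resolve_score(ratings):
    """Resolve SME ratings for one query-response pair.

    Two raters who agree yield immediate pairwise consensus; otherwise
    a third tie-breaker rating is required and the mode of all three
    ratings is taken. Returns (score, needed_tiebreaker); score is None
    when three distinct ratings leave no mode and a discussion round
    (modified Delphi) would be needed.
    """
    first, second = ratings[0], ratings[1]
    if first == second:
        return first, False          # pairwise consensus
    if len(ratings) < 3:
        raise ValueError("tie-breaker rating required")
    score, freq = Counter(ratings[:3]).most_common(1)[0]
    if freq == 1:
        return None, True            # no mode: escalate to discussion
    return score, True               # mode of three decides
```

Under this sketch, the paper's 60.6% pairwise-consensus rate corresponds to the fraction of pairs resolved on the first branch, before any tie-breaker review.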

RESULTS

Of 633 queries, 614 (96.99%) produced evaluable responses, with subject matter experts completing evaluations of 426 query-response pairs. Results demonstrated high rates of response correctness (95.5%) and query comprehension (98.6%), with 94.4% of responses rated as helpful. Two responses (0.47%) received scores indicating potential clinical harm. Pairwise consensus occurred in 60.6% of evaluations, with the remaining cases requiring a third tie-breaker review.

DISCUSSION

The framework demonstrated effectiveness in quantifying performance through comprehensive evaluation dimensions and structured scoring resolution methods. Key strengths included representative query sampling, standardized rating scales, and robust subject matter expert agreement protocols. Challenges emerged in managing subjective assessments of open-ended responses and achieving consensus on potential harm classification.

CONCLUSION

This framework offers a reproducible methodology for evaluating health-care generative AI systems, establishing foundational processes that can inform future efforts while supporting the implementation of generative AI applications in clinical settings.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c4a3/12169418/b6f7514ec3ec/ooaf054f1.jpg
