
Reproducible generative artificial intelligence evaluation for health care: a clinician-in-the-loop approach.

Authors

Livingston Leah, Featherstone-Uwague Amber, Barry Amanda, Barretto Kenneth, Morey Tara, Herrmannova Drahomira, Avula Venkatesh

Affiliation

Elsevier, Health Markets, Philadelphia, PA 19103, United States.

Publication information

JAMIA Open. 2025 Jun 16;8(3):ooaf054. doi: 10.1093/jamiaopen/ooaf054. eCollection 2025 Jun.

DOI: 10.1093/jamiaopen/ooaf054
PMID: 40524837
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12169418/
Abstract

OBJECTIVES

To develop and apply a reproducible methodology for evaluating generative artificial intelligence (AI) powered systems in health care, addressing the gap between theoretical evaluation frameworks and practical implementation guidance.

MATERIALS AND METHODS

A 5-dimension evaluation framework was developed to assess query comprehension and response helpfulness, correctness, completeness, and potential clinical harm. The framework was applied to evaluate ClinicalKey AI using queries drawn from user logs, a benchmark dataset, and subject matter expert curated queries. Forty-one board-certified physicians and pharmacists were recruited to independently evaluate query-response pairs. An agreement protocol using the mode and modified Delphi method resolved disagreements in evaluation scores.
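
The agreement protocol described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact rules, so the escalation order (pairwise agreement, then a tie-breaker rating, then the mode, then a Delphi-style discussion) and the names `unique_mode` and `resolve` are assumptions.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def unique_mode(scores: Sequence[int]) -> Optional[int]:
    """Return the single most frequent score, or None if there is a tie."""
    counts = Counter(scores).most_common()
    if len(counts) == 1:
        return counts[0][0]
    # A mode only counts if it strictly beats the runner-up.
    if counts[0][1] > counts[1][1]:
        return counts[0][0]
    return None


def resolve(first: int, second: int,
            tie_breaker: Optional[int] = None) -> Tuple[str, Optional[int]]:
    """Resolve one rating dimension for one query-response pair.

    Assumed flow (hypothetical): two independent raters, a third
    tie-breaker on disagreement, and a Delphi-style discussion when
    three ratings still produce no unique mode.
    """
    if first == second:
        return ("pairwise_consensus", first)
    if tie_breaker is None:
        return ("needs_tie_breaker", None)
    mode = unique_mode([first, second, tie_breaker])
    if mode is not None:
        return ("mode_after_tie_breaker", mode)
    return ("needs_delphi_discussion", None)


print(resolve(3, 3))     # ('pairwise_consensus', 3)
print(resolve(3, 4, 4))  # ('mode_after_tie_breaker', 4)
print(resolve(1, 2, 3))  # ('needs_delphi_discussion', None)
```

Returning a status label alongside the score keeps the unresolved cases visible, so the share of evaluations that required each escalation step can be tallied afterwards.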

RESULTS

Of 633 queries, 614 (96.99%) produced evaluable responses, with subject matter experts completing evaluations of 426 query-response pairs. Results demonstrated high rates of response correctness (95.5%) and query comprehension (98.6%), with 94.4% of responses rated as helpful. Two responses (0.47%) received scores indicating potential clinical harm. Pairwise consensus occurred in 60.6% of evaluations, with remaining cases requiring third tie-breaker review.
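
The reported proportions can be recomputed from the counts in the paragraph above. The sketch below uses only those counts; the helper name `pct` is illustrative. Rounding 614/633 to two decimals gives 97.00%, so the reported 96.99% appears to be a truncated rather than rounded value, though that reading is an assumption.

```python
import math


def pct(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage."""
    return 100.0 * numerator / denominator


evaluable = pct(614, 633)  # queries that produced evaluable responses
harm = pct(2, 426)         # responses scored as potentially harmful

print(f"{evaluable:.2f}")                 # 97.00 (rounded)
print(math.floor(evaluable * 100) / 100)  # 96.99 (truncated)
print(f"{harm:.2f}")                      # 0.47
```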

DISCUSSION

The framework demonstrated effectiveness in quantifying performance through comprehensive evaluation dimensions and structured scoring resolution methods. Key strengths included representative query sampling, standardized rating scales, and robust subject matter expert agreement protocols. Challenges emerged in managing subjective assessments of open-ended responses and achieving consensus on potential harm classification.

CONCLUSION

This framework offers a reproducible methodology for evaluating health-care generative AI systems, establishing foundational processes that can inform future efforts while supporting the implementation of generative AI applications in clinical settings.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c4a3/12169418/03cff04cdc54/ooaf054f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c4a3/12169418/b6f7514ec3ec/ooaf054f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c4a3/12169418/d9e40d78e8d2/ooaf054f2.jpg

Similar articles

1
Reproducible generative artificial intelligence evaluation for health care: a clinician-in-the-loop approach.
JAMIA Open. 2025 Jun 16;8(3):ooaf054. doi: 10.1093/jamiaopen/ooaf054. eCollection 2025 Jun.
2
Evaluation of artificial intelligence (AI) chatbots for providing sexual health information: a consensus study using real-world clinical queries.
BMC Public Health. 2025 May 15;25(1):1788. doi: 10.1186/s12889-025-22933-8.
3
Artificial intelligence in hospital infection prevention: an integrative review.
Front Public Health. 2025 Apr 2;13:1547450. doi: 10.3389/fpubh.2025.1547450. eCollection 2025.
4
Generative artificial intelligence to produce high-fidelity blastocyst-stage embryo images.
Hum Reprod. 2024 Jun 3;39(6):1197-1207. doi: 10.1093/humrep/deae064.
5
Enhancing systematic literature reviews with generative artificial intelligence: development, applications, and performance evaluation.
J Am Med Inform Assoc. 2025 Apr 1;32(4):616-625. doi: 10.1093/jamia/ocaf030.
6
How to Design, Create, and Evaluate an Instruction-Tuning Dataset for Large Language Model Training in Health Care: Tutorial From a Clinical Perspective.
J Med Internet Res. 2025 Mar 18;27:e70481. doi: 10.2196/70481.
7
Comparing Artificial Intelligence-Generated and Clinician-Created Personalized Self-Management Guidance for Patients With Knee Osteoarthritis: Blinded Observational Study.
J Med Internet Res. 2025 May 7;27:e67830. doi: 10.2196/67830.
8
Optimizing ChatGPT's Interpretation and Reporting of Delirium Assessment Outcomes: Exploratory Study.
JMIR Form Res. 2024 Oct 1;8:e51383. doi: 10.2196/51383.
9
Application of unified health large language model evaluation framework to In-Basket message replies: bridging qualitative and quantitative assessments.
J Am Med Inform Assoc. 2025 Apr 1;32(4):626-637. doi: 10.1093/jamia/ocaf023.
10
AI for IMPACTS Framework for Evaluating the Long-Term Real-World Impacts of AI-Powered Clinician Tools: Systematic Review and Narrative Synthesis.
J Med Internet Res. 2025 Feb 5;27:e67485. doi: 10.2196/67485.
