Blasingame Mallory N, Koonce Taneya Y, Williams Annette M, Giuse Dario A, Su Jing, Krump Poppy A, Giuse Nunzia Bettinsoli
J Med Libr Assoc. 2025 Jan 14;113(1):65-77. doi: 10.5195/jmla.2025.1985.
This study investigated the performance of a generative artificial intelligence (AI) tool built on GPT-4 in answering clinical questions, compared with medical librarians' gold-standard evidence syntheses.
Questions were extracted from an in-house database of clinical evidence requests previously answered by medical librarians. Multi-part questions were subdivided into individual topics. A standardized prompt was developed using the COSTAR framework (Context, Objective, Style, Tone, Audience, Response). Librarians submitted each question to aiChat, an internally managed chat tool built on GPT-4, and recorded the responses. Each aiChat-generated summary was evaluated on whether it contained the critical elements of the librarian's established gold-standard summary. A subset of questions was randomly selected for verification of the references provided by aiChat.
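The abstract does not reproduce the study's actual prompt. As a rough illustration only, a COSTAR-structured submission might look like the following sketch, which uses the public OpenAI chat API as a stand-in for the internal aiChat tool; the field values and the example question are assumptions, not the authors' wording.

```python
# Minimal sketch of a COSTAR-structured prompt. The OpenAI endpoint here
# stands in for the internal aiChat deployment; all prompt text is
# illustrative, not the study's published prompt.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

def build_costar_prompt(question: str) -> str:
    """Assemble the six COSTAR fields (Context, Objective, Style, Tone,
    Audience, Response) into a single prompt around a clinical question."""
    return (
        "CONTEXT: You are assisting a medical librarian who synthesizes "
        "published clinical evidence for care teams.\n"
        "OBJECTIVE: Answer the clinical question below with an evidence "
        "summary that cites the supporting literature.\n"
        "STYLE: Concise evidence synthesis.\n"
        "TONE: Professional and neutral.\n"
        "AUDIENCE: Practicing clinicians.\n"
        "RESPONSE: A short summary followed by a reference list.\n\n"
        f"Question: {question}"
    )

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        # Hypothetical example question, not drawn from the study's database.
        "content": build_costar_prompt(
            "Does early mobilization reduce length of stay after hip surgery?"
        ),
    }],
)
print(response.choices[0].message.content)
```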
Of the 216 evaluated questions, aiChat's response was rated "correct" for 180 (83.3%), "partially correct" for 35 (16.2%), and "incorrect" for 1 (0.5%). Ratings did not differ significantly by question category (p=0.73). For a randomly selected subset of 30% (n=66) of the questions, the aiChat summaries provided 162 references, of which 60 (37%) were confirmed as nonfabricated.
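The reported percentages follow directly from the counts in the abstract; the minimal check below uses only those stated numbers and no other data.

```python
# Recompute the reported proportions from the counts given in the abstract.
ratings = {"correct": 180, "partially correct": 35, "incorrect": 1}
total = sum(ratings.values())  # 216 evaluated questions

for label, n in ratings.items():
    print(f"{label}: {n}/{total} = {n / total:.1%}")  # 83.3%, 16.2%, 0.5%

subset, refs, verified = 66, 162, 60
print(f"subset: {subset}/{total} = {subset / total:.1%}")        # 30.6%, reported as 30%
print(f"verified refs: {verified}/{refs} = {verified / refs:.1%}")  # 37.0%
```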
Overall, the performance of the generative AI tool was promising. However, many of the included references could not be independently verified, and no attempt was made to assess whether additional concepts introduced by aiChat were factually accurate. We therefore envision this as the first in a series of investigations designed to further our understanding of how current and future versions of generative AI can be used and integrated into medical librarians' workflows.