
Evaluating a large language model's ability to answer clinicians' requests for evidence summaries.

Authors

Blasingame Mallory N, Koonce Taneya Y, Williams Annette M, Giuse Dario A, Su Jing, Krump Poppy A, Giuse Nunzia Bettinsoli

Publication

J Med Libr Assoc. 2025 Jan 14;113(1):65-77. doi: 10.5195/jmla.2025.1985.

DOI: 10.5195/jmla.2025.1985
PMID: 39975503
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11835037/
Abstract

OBJECTIVE

This study investigated the performance of a generative artificial intelligence (AI) tool using GPT-4 in answering clinical questions in comparison with medical librarians' gold-standard evidence syntheses.

METHODS

Questions were extracted from an in-house database of clinical evidence requests previously answered by medical librarians. Questions with multiple parts were subdivided into individual topics. A standardized prompt was developed using the COSTAR framework. Librarians submitted each question into aiChat, an internally managed chat tool using GPT-4, and recorded the responses. The summaries generated by aiChat were evaluated on whether they contained the critical elements used in the established gold-standard summary of the librarian. A subset of questions was randomly selected for verification of references provided by aiChat.
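The COSTAR framework mentioned above (Context, Objective, Style, Tone, Audience, Response) structures a prompt into six labeled sections. The sketch below is only an illustration of that structure: the section texts and the `build_costar_prompt` helper are hypothetical placeholders, not the study's actual prompt.

```python
# Illustrative only: the study's real standardized prompt is not reproduced here.
def build_costar_prompt(question: str) -> str:
    """Assemble a prompt from the six COSTAR sections."""
    sections = {
        "Context": "You are assisting a medical librarian who answers "
                   "clinicians' requests for evidence summaries.",
        "Objective": f"Summarize the best available evidence for: {question}",
        "Style": "A concise evidence synthesis with supporting citations.",
        "Tone": "Professional and neutral.",
        "Audience": "Clinicians at an academic medical center.",
        "Response": "A structured summary followed by a reference list.",
    }
    # Join each labeled section into one prompt string.
    return "\n\n".join(f"# {name}\n{text}" for name, text in sections.items())

print(build_costar_prompt("Does early mobilization reduce ICU delirium?"))
```

Each question, wrapped in such a template, would then be submitted to the chat tool and the response recorded.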

RESULTS

Of the 216 evaluated questions, aiChat's response was assessed as "correct" for 180 (83.3%) questions, "partially correct" for 35 (16.2%) questions, and "incorrect" for 1 (0.5%) question. No significant differences were observed in question ratings by question category (p=0.73). For a subset of 30% (n=66) of questions, 162 references were provided in the aiChat summaries, and 60 (37%) were confirmed as nonfabricated.
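As a quick arithmetic check (not part of the paper), the reported percentages follow directly from the raw counts:

```python
# Recompute the Results percentages from the reported counts.
total = 216
ratings = {"correct": 180, "partially correct": 35, "incorrect": 1}
for label, n in ratings.items():
    print(f"{label}: {n}/{total} = {100 * n / total:.1f}%")

# Reference-verification subset: 66 of 216 questions (~30%) yielded
# 162 references, of which 60 were confirmed as non-fabricated.
print(f"subset: 66/{total} = {100 * 66 / total:.1f}%")
print(f"verified references: 60/162 = {100 * 60 / 162:.0f}%")
```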

CONCLUSIONS

Overall, the performance of a generative AI tool was promising. However, many included references could not be independently verified, and attempts were not made to assess whether any additional concepts introduced by aiChat were factually accurate. Thus, we envision this being the first of a series of investigations designed to further our understanding of how current and future versions of generative AI can be used and integrated into medical librarians' workflow.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e2d2/11835037/143ace0de0b1/jmla-113-1-65-g001.jpg

Similar Articles

1. Evaluating a large language model's ability to answer clinicians' requests for evidence summaries. J Med Libr Assoc. 2025 Jan 14;113(1):65-77. doi: 10.5195/jmla.2025.1985.
2. Evaluating a Large Language Model's Ability to Answer Clinicians' Requests for Evidence Summaries. medRxiv. 2024 May 3:2024.05.01.24306691. doi: 10.1101/2024.05.01.24306691.
3. Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study. J Med Internet Res. 2024 Apr 17;26:e56655. doi: 10.2196/56655.
4. Integrating PICO principles into generative artificial intelligence prompt engineering to enhance information retrieval for medical librarians. J Med Libr Assoc. 2025 Apr 18;113(2):184-188. doi: 10.5195/jmla.2025.2022.
5. Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study. JMIR Med Inform. 2024 Sep 4;12:e59258. doi: 10.2196/59258.
6. Use of large language model (LLM) to enhance content and structure of a school of dentistry LibGuide. J Med Libr Assoc. 2025 Jan 14;113(1):96-97. doi: 10.5195/jmla.2025.2084.
7. A Comparative Analysis of the Performance of Large Language Models and Human Respondents in Dermatology. Indian Dermatol Online J. 2025 Feb 27;16(2):241-247. doi: 10.4103/idoj.idoj_221_24. eCollection 2025 Mar-Apr.
8. Evidence-based databases versus primary medical literature: an in-house investigation on their optimal use. J Med Libr Assoc. 2004 Oct;92(4):407-11.
9. Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study. ArXiv. 2024 Jan 23:arXiv:2402.01693v1.
10. Large Language Models Versus Expert Clinicians in Crisis Prediction Among Telemental Health Patients: Comparative Study. JMIR Ment Health. 2024 Aug 2;11:e58129. doi: 10.2196/58129.

Cited By

1. The emergence of large language models as tools in literature reviews: a large language model-assisted systematic review. J Am Med Inform Assoc. 2025 Jun 1;32(6):1071-1086. doi: 10.1093/jamia/ocaf063.

References

1. Leveraging artificial intelligence to summarize abstracts in lay language for increasing research accessibility and transparency. J Am Med Inform Assoc. 2024 Oct 1;31(10):2294-2303. doi: 10.1093/jamia/ocae186.
2. Leveraging generative AI for clinical evidence synthesis needs to ensure trustworthiness. J Biomed Inform. 2024 May;153:104640. doi: 10.1016/j.jbi.2024.104640. Epub 2024 Apr 10.
3. Evaluation of AI-generated responses by different artificial intelligence chatbots to the clinical decision-making case-based questions in oral and maxillofacial surgery. Oral Surg Oral Med Oral Pathol Oral Radiol. 2024 Jun;137(6):587-593. doi: 10.1016/j.oooo.2024.02.018. Epub 2024 Mar 6.
4. Does ChatGPT Answer Otolaryngology Questions Accurately? Laryngoscope. 2024 Sep;134(9):4011-4015. doi: 10.1002/lary.31410. Epub 2024 Mar 28.
5. Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis. J Biomed Inform. 2024 Mar;151:104620. doi: 10.1016/j.jbi.2024.104620. Epub 2024 Mar 8.
6. Reporting Use of AI in Research and Scholarly Publication-JAMA Network Guidance. JAMA. 2024 Apr 2;331(13):1096-1098. doi: 10.1001/jama.2024.3471.
7. ChatGPT: Game-changer or wildcard for systematic searching? Health Info Libr J. 2024 Mar;41(1):1-3. doi: 10.1111/hir.12517.
8. ChatGPT Can Offer Satisfactory Responses to Common Patient Questions Regarding Elbow Ulnar Collateral Ligament Reconstruction. Arthrosc Sports Med Rehabil. 2024 Feb 13;6(2):100893. doi: 10.1016/j.asmr.2024.100893. eCollection 2024 Apr.
9. The ChatGPT Effect: Nursing Education and Generative Artificial Intelligence. J Nurs Educ. 2024 Feb 5:1-4. doi: 10.3928/01484834-20240126-01.
10. Prompt Engineering for Generative Artificial Intelligence in Gastroenterology and Hepatology. Am J Gastroenterol. 2024 Sep 1;119(9):1709-1713. doi: 10.14309/ajg.0000000000002689. Epub 2024 Mar 20.