

Evaluating the performance of artificial intelligence in summarizing pre-coded text to support evidence synthesis: a comparison between chatbots and humans.

Authors

Nordmann Kim, Sauter Stefanie, Stein Mirjam, Aigner Johanna, Redlich Marie-Christin, Schaller Michael, Fischer Florian

Affiliation

Kempten University of Applied Sciences, Bavarian Research Center for Digital Health and Social Care, Kempten, Germany.

Publication

BMC Med Res Methodol. 2025 May 30;25(1):150. doi: 10.1186/s12874-025-02532-2.

DOI: 10.1186/s12874-025-02532-2
PMID: 40448034
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12123790/
Abstract

BACKGROUND

With the rise of large language models, the application of artificial intelligence in research is expanding, possibly accelerating specific stages of the research process. This study aims to compare the accuracy, completeness and relevance of chatbot-generated responses against human responses in evidence synthesis as part of a scoping review.

METHODS

We employed a structured survey-based research methodology to analyse and compare responses between two human researchers and four chatbots (ZenoChat, ChatGPT 3.5, ChatGPT 4.0, and ChatFlash) to questions based on a pre-coded sample of 407 articles. These questions were part of an evidence synthesis of a scoping review dealing with digitally supported interaction between healthcare workers.

RESULTS

The analysis revealed no significant differences in judgments of correctness between answers by chatbots and those given by humans. However, chatbots' answers were found to recognise the context of the original text better, and they provided more complete, albeit longer, responses. Human responses were less likely to add new content to the original text or include interpretation. Amongst the chatbots, ZenoChat provided the best-rated answers, followed by ChatFlash, with ChatGPT 3.5 and ChatGPT 4.0 tying for third. Correct contextualisation of the answer was positively correlated with completeness and correctness of the answer.
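The reported link between contextualisation and answer quality is a rank correlation over ordinal ratings. As a minimal sketch (not the authors' actual analysis code), the computation of Spearman's rho over per-answer ratings could look like this; the 1-5 rating values below are invented for illustration:

```python
def ranks(xs):
    """Average 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical 1-5 ratings for six answers: a positive rho would indicate
# that better-contextualised answers also tend to be rated more complete.
contextualisation = [5, 4, 2, 5, 3, 1]
completeness = [4, 4, 2, 5, 3, 2]
print(round(spearman(contextualisation, completeness), 3))
```

A positive value close to 1 corresponds to the positive correlation the study reports between correct contextualisation and completeness/correctness of answers.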

CONCLUSIONS

Chatbots powered by large language models may be a useful tool to accelerate qualitative evidence synthesis. Given the current speed of chatbot development and fine-tuning, the successful applications of chatbots to facilitate research will very likely continue to expand over the coming years.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3f37/12123790/44c05ae4e772/12874_2025_2532_Fig1_HTML.jpg

