

Evaluating the performance of artificial intelligence in summarizing pre-coded text to support evidence synthesis: a comparison between chatbots and humans.

Authors

Nordmann Kim, Sauter Stefanie, Stein Mirjam, Aigner Johanna, Redlich Marie-Christin, Schaller Michael, Fischer Florian

Affiliation

Kempten University of Applied Sciences, Bavarian Research Center for Digital Health and Social Care, Kempten, Germany.

Publication

BMC Med Res Methodol. 2025 May 30;25(1):150. doi: 10.1186/s12874-025-02532-2.

DOI: 10.1186/s12874-025-02532-2
PMID: 40448034
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12123790/
Abstract

BACKGROUND

With the rise of large language models, the application of artificial intelligence in research is expanding, possibly accelerating specific stages of the research process. This study aims to compare the accuracy, completeness and relevance of chatbot-generated responses against human responses in evidence synthesis as part of a scoping review.

METHODS

We employed a structured survey-based research methodology to analyse and compare responses between two human researchers and four chatbots (ZenoChat, ChatGPT 3.5, ChatGPT 4.0, and ChatFlash) to questions based on a pre-coded sample of 407 articles. These questions were part of an evidence synthesis of a scoping review dealing with digitally supported interaction between healthcare workers.

RESULTS

The analysis revealed no significant differences in judgments of correctness between answers by chatbots and those given by humans. However, chatbots' answers were found to recognise the context of the original text better, and they provided more complete, albeit longer, responses. Human responses were less likely to add new content to the original text or include interpretation. Amongst the chatbots, ZenoChat provided the best-rated answers, followed by ChatFlash, with ChatGPT 3.5 and ChatGPT 4.0 tying for third. Correct contextualisation of the answer was positively correlated with completeness and correctness of the answer.
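The reported link between contextualisation and answer quality is a rank correlation over ordinal ratings. As a minimal sketch (not the authors' actual analysis code), the computation of Spearman's rho over per-answer ratings could look like this; the 1-5 rating values below are invented for illustration:

```python
def ranks(xs):
    """Average 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical 1-5 ratings for six answers: a positive rho would indicate
# that better-contextualised answers also tend to be rated more complete.
contextualisation = [5, 4, 2, 5, 3, 1]
completeness = [4, 4, 2, 5, 3, 2]
print(round(spearman(contextualisation, completeness), 3))
```

A positive value close to 1 corresponds to the positive correlation the study reports between correct contextualisation and completeness/correctness of answers.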

CONCLUSIONS

Chatbots powered by large language models may be a useful tool to accelerate qualitative evidence synthesis. Given the current speed of chatbot development and fine-tuning, the successful applications of chatbots to facilitate research will very likely continue to expand over the coming years.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3f37/12123790/44c05ae4e772/12874_2025_2532_Fig1_HTML.jpg

