Large Language Models for Chatbot Health Advice Studies: A Systematic Review.

Author Information

Huo Bright, Boyle Amy, Marfo Nana, Tangamornsuksan Wimonchat, Steen Jeremy P, McKechnie Tyler, Lee Yung, Mayol Julio, Antoniou Stavros A, Thirunavukarasu Arun James, Sanger Stephanie, Ramji Karim, Guyatt Gordon

Affiliations

Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada.

Michael G. DeGroote School of Medicine, McMaster University, Hamilton, Ontario, Canada.

Publication Information

JAMA Netw Open. 2025 Feb 3;8(2):e2457879. doi: 10.1001/jamanetworkopen.2024.57879.

Abstract

IMPORTANCE

There is much interest in the clinical integration of large language models (LLMs) in health care. Many studies have assessed the ability of LLMs to provide health advice, but the quality of their reporting is uncertain.

OBJECTIVE

To perform a systematic review to examine the reporting variability among peer-reviewed studies evaluating the performance of generative artificial intelligence (AI)-driven chatbots for summarizing evidence and providing health advice to inform the development of the Chatbot Assessment Reporting Tool (CHART).

EVIDENCE REVIEW

A search of MEDLINE via Ovid, Embase via Elsevier, and Web of Science from inception to October 27, 2023, was conducted with the help of a health sciences librarian to yield 7752 articles. Two reviewers screened articles by title and abstract followed by full-text review to identify primary studies evaluating the clinical accuracy of generative AI-driven chatbots in providing health advice (chatbot health advice studies). Two reviewers then performed data extraction for 137 eligible studies.

FINDINGS

A total of 137 studies were included. Studies examined topics in surgery (55 [40.1%]), medicine (51 [37.2%]), and primary care (13 [9.5%]). Many studies focused on treatment (91 [66.4%]), diagnosis (60 [43.8%]), or disease prevention (29 [21.2%]). Most studies (136 [99.3%]) evaluated inaccessible, closed-source LLMs and did not provide enough information to identify the version of the LLM under evaluation. All studies lacked a sufficient description of LLM characteristics, including temperature, token length, fine-tuning availability, layers, and other details. Most studies (136 [99.3%]) did not describe a prompt engineering phase in their study. The date of LLM querying was reported in 54 (39.4%) studies. Most studies (89 [65.0%]) used subjective means to define the successful performance of the chatbot, while less than one-third addressed the ethical, regulatory, and patient safety implications of the clinical integration of LLMs.

CONCLUSIONS AND RELEVANCE

In this systematic review of 137 chatbot health advice studies, the reporting quality was heterogeneous and may inform the development of the CHART reporting standards. Ethical, regulatory, and patient safety considerations are crucial as interest grows in the clinical integration of LLMs.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4fc2/11795331/2fc8b347e505/jamanetwopen-e2457879-g001.jpg
