Department of Computer Science, Vanderbilt University, Nashville, TN, United States.
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States.
J Med Internet Res. 2024 Nov 7;26:e22769. doi: 10.2196/22769.
The launch of ChatGPT (OpenAI) in November 2022 attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including health care. Numerous studies have since been conducted regarding how to use state-of-the-art LLMs in health-related scenarios.
This review aims to summarize applications of and concerns regarding conversational LLMs in health care and provide an agenda for future research in this field.
We used PubMed, ACM, and the IEEE digital libraries as primary sources for this review. We followed the guidance of PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) to screen and select peer-reviewed research articles that (1) were related to health care applications and conversational LLMs and (2) were published before September 1, 2023, the date when we started paper collection. We investigated these papers and classified them according to their applications and concerns.
Our search initially identified 820 papers according to targeted keywords, out of which 65 (7.9%) papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT (60/65, 92% of papers), followed by Bard (Google LLC; 1/65, 2% of papers), LLaMA (Meta; 1/65, 2% of papers), and other LLMs (6/65, 9% of papers). These papers were classified into four categories of applications: (1) summarization, (2) medical knowledge inquiry, (3) prediction (eg, diagnosis, treatment recommendation, and drug synergy), and (4) administration (eg, documentation and information collection), and four categories of concerns: (1) reliability (eg, training data quality, accuracy, interpretability, and consistency in responses), (2) bias, (3) privacy, and (4) public acceptability. There were 49 (75%) papers using LLMs for summarization or medical knowledge inquiry, or both, and 58 (89%) papers expressing concerns about reliability or bias, or both. We found that conversational LLMs exhibited promising results in summarization and in providing general medical knowledge to patients with relatively high accuracy. However, conversational LLMs such as ChatGPT are not always able to provide reliable answers to complex health-related tasks (eg, diagnosis) that require specialized domain expertise. While bias and privacy issues were often noted as concerns, none of the reviewed papers experimentally examined how conversational LLMs give rise to these issues in health care research.
Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as on investigating the mechanisms by which LLM applications introduce bias and privacy issues. Given the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address these concerns and to promote, improve, and regulate the application of LLMs in health care.