Department of Radiological Sciences, Division of Cardiothoracic Imaging, David Geffen School of Medicine at UCLA, Los Angeles, CA, USA.
School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA.
Radiology. 2023 Jun;307(5):e230922. doi: 10.1148/radiol.230922.
Background The recent release of large language models (LLMs) for public use, such as ChatGPT and Google Bard, has opened up a multitude of potential benefits as well as challenges.

Purpose To evaluate and compare the accuracy and consistency of responses generated by the publicly available ChatGPT-3.5 and Google Bard to non-expert questions related to lung cancer prevention, screening, and terminology commonly used in radiology reports, based on the recommendations of the Lung Imaging Reporting and Data System (Lung-RADS) v2022 from the American College of Radiology and the Fleischner Society.

Materials and Methods Forty identical questions were created and presented to ChatGPT-3.5, the Google Bard experimental version, and the Bing and Google search engines by three different authors of this paper. Each answer was reviewed by two radiologists for accuracy. Responses were scored as correct, partially correct, incorrect, or unanswered. Consistency was also evaluated; here, consistency was defined as agreement among the three answers each tool (ChatGPT-3.5, Google Bard experimental version, Bing, and Google search engine) provided, regardless of whether the concept conveyed was correct or incorrect. Accuracy was compared across the tools using Stata.

Results ChatGPT-3.5 answered all 120 questions, with 85 (70.8%) correct, 14 (11.7%) partially correct, and 21 (17.5%) incorrect. Google Bard did not answer 23 (19.1%) of the 120 questions; of the 97 questions it answered, 62 (51.7% of 120) were correct, 11 (9.2%) partially correct, and 24 (20%) incorrect. Bing answered all 120 questions, with 74 (61.7%) correct, 13 (10.8%) partially correct, and 33 (27.5%) incorrect. The Google search engine answered all 120 questions, with 66 (55%) correct, 27 (22.5%) partially correct, and 27 (22.5%) incorrect. ChatGPT-3.5 was approximately 1.5 times more likely than Google Bard to provide a correct or partially correct answer (OR = 1.55, P = 0.004). ChatGPT-3.5 and the Google search engine were approximately 7 and 29 times more likely, respectively, than Google Bard to give consistent answers (OR = 6.65, P = 0.002 for ChatGPT-3.5; OR = 28.83, P = 0.002 for the Google search engine).

Conclusion Although ChatGPT-3.5 was more accurate than the other tools, none of ChatGPT-3.5, Google Bard, Bing, or the Google search engine answered all questions correctly or with 100% consistency.
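For illustration, the odds ratio comparing ChatGPT-3.5 with Google Bard can be reproduced from the counts above under one assumption not stated in the abstract: that unanswered questions are excluded, so the comparison is correct-or-partially-correct versus incorrect among answered responses only. The short Python sketch below (not the authors' Stata analysis) computes the unadjusted cross-product odds ratio from that assumed 2x2 table and happens to match the reported OR = 1.55; the reported P values presumably come from the authors' regression in Stata and are not reproduced by this simple calculation.

# Hedged sketch: reproduce the reported OR = 1.55 from the abstract's counts.
# Assumption (not stated in the abstract): unanswered Google Bard questions are
# excluded, so the 2x2 table is (correct or partially correct) vs incorrect
# among answered responses only.

chatgpt_good, chatgpt_bad = 85 + 14, 21   # 99 vs 21 of ChatGPT-3.5's 120 answered responses
bard_good, bard_bad = 62 + 11, 24         # 73 vs 24 of Google Bard's 97 answered responses

# Unadjusted cross-product odds ratio: odds of a good (correct or partially
# correct) answer for ChatGPT-3.5 divided by the same odds for Google Bard.
odds_ratio = (chatgpt_good / chatgpt_bad) / (bard_good / bard_bad)

print(f"OR = {odds_ratio:.2f}")  # -> OR = 1.55, matching the reported value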