Wright Benjamin M, Bodnar Michael S, Moore Andrew D, Maseda Meghan C, Kucharik Michael P, Diaz Connor C, Schmidt Christian M, Mir Hassan R
Morsani College of Medicine, University of South Florida, Tampa, Florida, USA.
Department of Orthopaedic Surgery, University of South Florida, Tampa, Florida, USA.
Bone Jt Open. 2024 Feb 15;5(2):139-146. doi: 10.1302/2633-1462.52.BJO-2023-0113.R1.
While internet search engines have long been patients' primary source for answers to health questions, artificial intelligence large language models such as ChatGPT are trending towards becoming the new primary source. The purpose of this study was to determine whether ChatGPT can answer patient questions about total hip arthroplasty (THA) and total knee arthroplasty (TKA) with consistent accuracy, comprehensiveness, and easy readability.
We posed the 20 most Google-searched questions about THA and TKA, plus ten additional postoperative questions, to ChatGPT. Each question was asked twice to evaluate for consistency in quality. Following each response, we responded with, "Please explain so it is easier to understand," to evaluate ChatGPT's ability to reduce response reading grade level, measured as Flesch-Kincaid Grade Level (FKGL). Five resident physicians rated the 120 responses on 1 to 5 accuracy and comprehensiveness scales. Additionally, they answered a "yes" or "no" question regarding acceptability. Mean scores were calculated for each question, and responses were deemed acceptable if ≥ four raters answered "yes."
The mean accuracy and comprehensiveness scores were 4.26 (95% confidence interval (CI) 4.19 to 4.33) and 3.79 (95% CI 3.69 to 3.89), respectively. Of all responses, 59.2% (71/120; 95% CI 50.0% to 67.7%) were acceptable. ChatGPT was consistent when asked the same question twice, with no significant difference in accuracy (t = 0.821; p = 0.415), comprehensiveness (t = 1.387; p = 0.171), acceptability (χ² = 1.832; p = 0.176), or FKGL (t = 0.264; p = 0.793). The FKGL was significantly lower (t = 2.204; p = 0.029) for the simplified responses (11.14; 95% CI 10.57 to 11.71) than for the original responses (12.15; 95% CI 11.45 to 12.85).
ChatGPT answered THA and TKA patient questions with accuracy comparable to that previously reported for websites and with adequate comprehensiveness, but with limited acceptability as a sole information source. ChatGPT has potential for answering patient questions about THA and TKA, but needs improvement.