Morgan State University, Baltimore, MD.
Harvard Medical School, Boston, MA.
Cornea. 2024 Jun 1;43(6):746-750. doi: 10.1097/ICO.0000000000003439. Epub 2023 Nov 28.
ChatGPT is commonly used by patients and clinicians as a source of information. However, it can be prone to error and requires validation. We sought to assess the quality and accuracy of information on corneal transplantation and Fuchs dystrophy provided by 2 iterations of ChatGPT, and whether its answers improve over time.
A total of 10 corneal specialists collaborated to assess the model's responses to 10 commonly asked questions related to endothelial keratoplasty and Fuchs dystrophy. The questions were posed to both ChatGPT-3.5 and its newer iteration, GPT-4. Assessments evaluated the quality, safety, accuracy, and bias of the information. Chi-squared tests, Fisher exact tests, and regression analyses were conducted.
We analyzed 180 valid responses. On a scale of 1 (A+) to 5 (F), the average score given by all specialists across questions was 2.5 for ChatGPT-3.5 and 1.4 for GPT-4, a significant improvement (P < 0.0001). Most responses from both ChatGPT-3.5 (61%) and GPT-4 (89%) used correct facts, a proportion that improved significantly across iterations (P < 0.00001). Approximately one third (35%) of responses from ChatGPT-3.5 were judged to go against the scientific consensus, a notable error rate that decreased to only 5% of answers from GPT-4 (P < 0.00001).
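The headline model-vs-model comparison can be illustrated from the reported proportions. Below is a minimal sketch in Python using scipy.stats, assuming roughly 90 valid responses per model (180 total, split evenly); the counts are back-calculated from the published 61% and 89% for illustration and are not the study's raw data.

```python
# Illustrative reconstruction of the factual-correctness comparison.
# Assumes ~90 valid responses per model (180 total, split evenly);
# counts are derived from the reported percentages, not study data.
from scipy.stats import chi2_contingency, fisher_exact

n_per_model = 90
correct_35 = round(0.61 * n_per_model)  # ~55 correct (ChatGPT-3.5)
correct_4 = round(0.89 * n_per_model)   # ~80 correct (GPT-4)

# 2x2 contingency table: rows = model, columns = correct / incorrect
table = [
    [correct_35, n_per_model - correct_35],
    [correct_4, n_per_model - correct_4],
]

chi2, p_chi2, dof, expected = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)

print(f"chi-squared = {chi2:.2f}, p = {p_chi2:.2g}")
print(f"Fisher exact odds ratio = {odds_ratio:.2f}, p = {p_fisher:.2g}")
```

With these assumed counts, both tests return P well below 0.001, consistent with the direction of the reported result; the Fisher exact test is the usual complement to chi-squared when expected cell counts are small.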
The quality of ChatGPT's responses improved significantly between versions 3.5 and 4, and the odds of providing information against the scientific consensus decreased. However, the technology is still capable of producing inaccurate statements. Corneal specialists are uniquely positioned to help users discern the veracity and applicability of such information.