Delsoz Mohammad, Madadi Yeganeh, Munir Wuqaas M, Tamm Brendan, Mehravaran Shiva, Soleimani Mohammad, Djalilian Ali, Yousefi Siamak
Hamilton Eye Institute, Department of Ophthalmology, University of Tennessee Health Science Center, Memphis, TN, USA.
Department of Ophthalmology and Visual Sciences, University of Maryland School of Medicine, Baltimore, MD, USA.
medRxiv. 2023 Aug 28:2023.08.25.23294635. doi: 10.1101/2023.08.25.23294635.
To assess the capabilities of ChatGPT-4.0 and ChatGPT-3.5 in diagnosing corneal diseases from case reports, and to compare their performance with that of human experts.
We randomly selected 20 cases of corneal disease, including corneal infections, dystrophies, degenerations, and injuries, from a publicly accessible online database maintained by the University of Iowa. We then input the text of each case description into ChatGPT-4.0 and ChatGPT-3.5 and asked for a provisional diagnosis. Finally, we evaluated the responses against the correct diagnoses, compared them with the diagnoses of three cornea specialists (human experts), and assessed interobserver agreement.
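A minimal Python sketch of how each case description might be submitted to the two models for a provisional diagnosis. It assumes the openai Python SDK (version 1.x) and uses "gpt-4" and "gpt-3.5-turbo" as stand-in model identifiers; the prompt wording and the provisional_diagnosis helper are illustrative assumptions, not the authors' actual protocol.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stand-in identifiers for ChatGPT-4.0 and ChatGPT-3.5 (assumption).
MODELS = ["gpt-4", "gpt-3.5-turbo"]

def provisional_diagnosis(case_text: str, model: str) -> str:
    """Ask one model for a provisional diagnosis of a single case report."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are an ophthalmologist. State a single "
                        "provisional diagnosis for the case described."},
            {"role": "user", "content": case_text},
        ],
    )
    return response.choices[0].message.content.strip()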
The provisional diagnosis accuracy of ChatGPT-4.0 was 85% (17 of 20 cases correct), while the accuracy of ChatGPT-3.5 was 60% (12 of 20 cases correct). The accuracies of the three cornea specialists were 100% (20 cases), 90% (18 cases), and 90% (18 cases), respectively. The interobserver agreement between ChatGPT-4.0 and ChatGPT-3.5 was 65% (13 cases), while the interobserver agreements between ChatGPT-4.0 and the three cornea specialists were 85% (17 cases), 80% (16 cases), and 75% (15 cases), respectively. By contrast, the interobserver agreement between ChatGPT-3.5 and each of the three cornea specialists was 60% (12 cases).
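The reported figures reduce to simple proportions over the 20 cases: accuracy is the fraction of cases diagnosed correctly, and interobserver agreement is the fraction of cases on which two raters gave the same diagnosis. A minimal sketch follows; the exact string matching of diagnoses is an assumption, since the abstract does not specify how responses were adjudicated.

def accuracy(predictions: list[str], truth: list[str]) -> float:
    """Fraction of cases where the predicted diagnosis matches ground truth."""
    correct = sum(p.lower() == t.lower() for p, t in zip(predictions, truth))
    return correct / len(truth)

def percent_agreement(rater_a: list[str], rater_b: list[str]) -> float:
    """Raw percent agreement: fraction of cases on which two raters concur."""
    agree = sum(a.lower() == b.lower() for a, b in zip(rater_a, rater_b))
    return agree / len(rater_a)

# Example: 17 of 20 correct cases gives 17 / 20 = 0.85, i.e., the 85%
# reported for ChatGPT-4.0.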
The accuracy of ChatGPT-4.0 in diagnosing patients with various corneal conditions was markedly higher than that of ChatGPT-3.5, and its performance is promising for potential clinical integration.