Hussain Zain S, Delsoz Mohammad, Elahi Muhammad, Jerkins Brian, Kanner Elliot, Wright Claire, Munir Wuqaas M, Soleimani Mohammad, Djalilian Ali, Lao Priscilla A, Fong Joseph W, Kahook Malik Y, Yousefi Siamak
medRxiv. 2025 Mar 17:2025.03.14.25323836. doi: 10.1101/2025.03.14.25323836.
This study evaluates the diagnostic performance of several AI models, including DeepSeek, in diagnosing corneal diseases, glaucoma, and neuro-ophthalmologic disorders.
We retrospectively selected 53 case reports from the Department of Ophthalmology and Visual Sciences at the University of Iowa, comprising 20 corneal disease cases, 11 glaucoma cases, and 22 neuro-ophthalmology cases. The case descriptions were input into DeepSeek, ChatGPT-4.0, ChatGPT-o1, and Qwen 2.5 Max. Their responses were compared with diagnoses rendered by human experts (corneal specialists, glaucoma attendings, and neuro-ophthalmologists). Diagnostic accuracy and interobserver agreement, defined as the percentage-point difference between each AI model's performance and the average human expert performance, were determined.
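Put as a formula (our notation, not the authors'), the agreement metric for model $m$ in subspecialty $s$ is

\[
\Delta_{m,s} \;=\; \mathrm{Acc}^{\mathrm{AI}}_{m,s} \;-\; \overline{\mathrm{Acc}}^{\mathrm{expert}}_{s},
\]

where $\overline{\mathrm{Acc}}^{\mathrm{expert}}_{s}$ is the mean accuracy of the human experts in that subspecialty; for example, DeepSeek in corneal disease yields $90.0\% - 93.3\% = -3.3$ percentage points.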
DeepSeek achieved an overall diagnostic accuracy of 79.2%, with specialty-specific accuracies of 90.0% in corneal diseases, 54.5% in glaucoma, and 81.8% in neuro-ophthalmology. ChatGPT-o1 outperformed the other models with an overall accuracy of 84.9% (85.0% in corneal diseases, 63.6% in glaucoma, and 95.5% in neuro-ophthalmology), while Qwen exhibited a lower overall accuracy of 64.2% (55.0% in corneal diseases, 54.5% in glaucoma, and 77.3% in neuro-ophthalmology). Interobserver agreement analysis revealed that in corneal diseases, DeepSeek differed from the human expert average by -3.3% (90.0% vs 93.3%), ChatGPT-o1 by -8.3%, and Qwen by -38.3%. In glaucoma, DeepSeek outperformed the human expert average by +3.0% (54.5% vs 51.5%), ChatGPT-4.0 and ChatGPT-o1 exceeded it by +12.1%, and Qwen was +3.0% above it. In neuro-ophthalmology, DeepSeek and ChatGPT-4.0 were 9.1% below the human average, ChatGPT-o1 exceeded it by +4.6%, and Qwen was 13.6% below.
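As a quick arithmetic check (a sketch in our own code, not part of the study's analysis), the overall accuracies above can be reproduced by case-weighting the per-specialty figures with the stated 20/11/22 case split, and the expert-comparison gaps follow directly:

```python
# Minimal sketch (not the study's own analysis code): reproduces the overall
# accuracies and the expert-comparison differences from the per-specialty
# figures reported above, assuming the stated case split of 20 corneal,
# 11 glaucoma, and 22 neuro-ophthalmology cases.

CASES = {"cornea": 20, "glaucoma": 11, "neuro": 22}        # cases per subspecialty
HUMAN = {"cornea": 93.3, "glaucoma": 51.5, "neuro": 90.9}  # average expert accuracy (%)
AI = {
    "DeepSeek":   {"cornea": 90.0, "glaucoma": 54.5, "neuro": 81.8},
    "ChatGPT-o1": {"cornea": 85.0, "glaucoma": 63.6, "neuro": 95.5},
    "Qwen":       {"cornea": 55.0, "glaucoma": 54.5, "neuro": 77.3},
}

total_cases = sum(CASES.values())  # 53
for model, acc in AI.items():
    # Overall accuracy is the case-weighted mean of the per-specialty accuracies.
    overall = sum(acc[s] * CASES[s] for s in CASES) / total_cases
    # "Agreement" as defined in the methods: percentage-point gap vs the expert average.
    gaps = {s: round(acc[s] - HUMAN[s], 1) for s in CASES}
    print(f"{model}: overall {overall:.1f}% | gap vs experts {gaps}")

# Expected output: DeepSeek 79.2%, ChatGPT-o1 84.9%, Qwen 64.2%, with gaps of
# -3.3/+3.0/-9.1, -8.3/+12.1/+4.6, and -38.3/+3.0/-13.6 points, respectively.
```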
ChatGPT-o1 demonstrated the highest overall diagnostic accuracy, especially in neuro-ophthalmology, while DeepSeek and ChatGPT-4.0 showed comparable performance. Qwen underperformed relative to the other models, especially in corneal diseases. Although these AI models exhibit promising diagnostic capabilities, they currently lag behind human experts in certain areas, underscoring the need to integrate them with clinical judgment.
This study evaluated how well several artificial intelligence (AI) models diagnose eye diseases compared with human experts. We tested four AI systems across three types of eye conditions: diseases of the cornea, glaucoma, and neuro-ophthalmologic disorders. Overall, one AI model, ChatGPT-o1, performed best, correctly diagnosing about 85% of cases, and it excelled in neuro-ophthalmology, correctly diagnosing 95.5% of cases. Two other models, DeepSeek and ChatGPT-4.0, each achieved an overall accuracy of around 79%, while the Qwen model performed lowest, with an overall accuracy of about 64%. Compared with human experts, who achieved very high accuracy in corneal diseases (93.3%) and neuro-ophthalmology (90.9%) but lower accuracy in glaucoma (51.5%), the AI models showed mixed results. In glaucoma, for instance, some AI models even slightly outperformed the human experts, while in corneal diseases all AI models were less accurate than the experts. These findings indicate that while AI shows promise as a supportive tool for diagnosing eye conditions, it still needs further improvement. Combining AI with human clinical judgment appears to be the best approach to accurate eye disease diagnosis.
With the rising burden of eye diseases and the inherent diagnostic challenges of complex conditions such as glaucoma and neuro-ophthalmologic disorders, there is an unmet need for innovative diagnostic tools to support clinical decision-making. This study evaluated the diagnostic performance of four AI models across three ophthalmologic subspecialties, testing the hypothesis that advanced language models can achieve accuracy comparable to human experts. Our results showed that ChatGPT-o1 achieved the highest overall accuracy (84.9%), excelling in neuro-ophthalmology with 95.5% accuracy, while DeepSeek and ChatGPT-4.0 each achieved 79.2% and Qwen reached 64.2%. In glaucoma, AI model accuracies ranged from 54.5% to 63.6%, with some models slightly surpassing the human expert average of 51.5%, underscoring the diagnostic difficulty of this condition. These findings highlight the potential of AI as a valuable adjunct to clinical judgment in ophthalmology, although further research and the integration of multimodal data are essential to optimize these tools for routine clinical practice.