Casagrande Diego, Gobira Mauro
Ophthalmology, Vision Institute (IPEPO), São Paulo, BRA.
Cureus. 2025 Feb 24;17(2):e79565. doi: 10.7759/cureus.79565. eCollection 2025 Feb.
Large language models (LLMs) like Gemini 2.0 Advanced and ChatGPT-4o are increasingly applied in medical contexts. This study assesses their accuracy in answering cataract-related questions from Brazilian ophthalmology board exams, evaluating their potential for clinical decision support.
A retrospective analysis was conducted using 221 multiple-choice questions. Responses from both LLMs were evaluated by two independent ophthalmologists against the official answer key. Accuracy rates and inter-evaluator agreement (Cohen's kappa) were analyzed.
Gemini 2.0 Advanced achieved accuracy rates of 85.45% and 80.91% across the two evaluators, while ChatGPT-4o scored 80.00% and 84.09%. Inter-evaluator agreement was moderate (κ = 0.514 for Gemini 2.0 Advanced and 0.431 for ChatGPT-4o). Performance varied across exam years.
Both models demonstrated high accuracy in cataract-related board exam questions, supporting their potential as educational tools. However, moderate agreement and performance variability indicate the need for further refinement and validation.
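The moderate inter-evaluator agreement reported here is measured with Cohen's kappa, which corrects the raw agreement rate between two raters for agreement expected by chance. As a minimal sketch of how such a value is computed (the rating vectors below are hypothetical placeholders, not the study's data):

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from each rater's marginal label frequencies.
    """
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[label] * c2[label] for label in c1) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-question correctness judgments (1 = response accepted,
# 0 = rejected) from two evaluators -- illustrative only.
evaluator_a = [1, 1, 0, 1]
evaluator_b = [1, 0, 0, 1]
print(cohen_kappa(evaluator_a, evaluator_b))  # 0.5
```

By convention, κ values between 0.41 and 0.60 are labeled "moderate" agreement, which is why the study's 0.514 and 0.431 are described that way.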