Andreão Filipi Fim, Moura Nascimento Matheus, De Faria André M, Virgilio Ribeiro Filipe, da Costa Otavio Augusto, Palavani Lucca B, Santos Piedade Guilherme, Morell Alexis, Almeida Timoteo, Martins da Cunha Pedro Henrique, Komotar Ricardo J, Cordeiro Joacir Graciolli, Assumpcao de Monaco Bernardo
Department of Neurosurgery, Federal University of Rio de Janeiro, Rio de Janeiro, BRA.
Faculty of Medicine, University Center of Maceio, Maceió, BRA.
Cureus. 2025 May 6;17(5):e83592. doi: 10.7759/cureus.83592. eCollection 2025 May.
Background and objective: The integration of artificial intelligence (AI) into functional neurosurgery holds great promise for improving diagnostic precision and therapeutic decision-making. This study aimed to assess the diagnostic accuracy and treatment recommendations of five AI models (ChatGPT-3.5, ChatGPT-4, Perplexity, Gemini, and AtlasGPT) when applied to complex clinical cases.

Methods: Ten clinical cases related to functional neurosurgery were selected from the medical literature to minimize ambiguity and ensure clarity. Each case was presented to the AI models with the directive to propose a diagnosis and a therapeutic approach using medical terminology. The responses were evaluated by a panel of seven functional neurosurgeons, who scored the accuracy of the diagnoses and treatment recommendations on a scale from 0 to 10. Scores were analyzed using one-way ANOVA, with post-hoc comparisons via Tukey's test to identify significant differences among the AI models.

Results: Diagnostic accuracy varied significantly among the AI models. AtlasGPT achieved a median diagnostic score of 9 [quartile 1 (Q1): 9, quartile 3 (Q3): 10, interquartile range (IQR): 1], outperforming Perplexity, which had a median score of 9 with a wider IQR of 3 (p=0.04), and ChatGPT-3.5, which had a median score of 10 with an IQR of 2 (p=0.03). For treatment recommendations, AtlasGPT's median score of 8 was significantly higher than that of ChatGPT-3.5 (median 7; p<0.01) and Perplexity (median 8; p<0.01).

Conclusions: These findings underscore the potential of AI models in functional neurosurgery, particularly for enhancing diagnostic accuracy and broadening therapeutic options. However, the variability in performance across AI systems indicates a need for continuous evaluation and refinement of these technologies. Rigorous assessment and interdisciplinary collaboration are essential to ensure the safe and effective integration of AI into clinical practice.
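The Methods describe a one-way ANOVA across the five models followed by Tukey's post-hoc test on rater scores. The Python sketch below illustrates that analysis pipeline under stated assumptions: the simulated scores, group sizes, and variable names are illustrative placeholders, not the study's data.

```python
# A minimal sketch of the analysis described in the Methods: one-way ANOVA
# across five AI models, followed by Tukey's HSD post-hoc comparisons.
# NOTE: all score values here are simulated placeholders, not study data.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
models = ["ChatGPT-3.5", "ChatGPT-4", "Perplexity", "Gemini", "AtlasGPT"]

# 7 raters x 10 cases = 70 scores per model, on the study's 0-10 scale
scores = {m: np.clip(rng.normal(loc=8, scale=1.5, size=70), 0, 10)
          for m in models}

# One-way ANOVA: do mean scores differ across the five models?
f_stat, p_value = f_oneway(*scores.values())
print(f"ANOVA: F={f_stat:.2f}, p={p_value:.3f}")

# Tukey's HSD: which specific pairs of models differ significantly?
all_scores = np.concatenate(list(scores.values()))
labels = np.repeat(models, 70)
print(pairwise_tukeyhsd(all_scores, labels))
```

With simulated data from a single distribution, the ANOVA should be non-significant; in the study, significant pairwise differences (e.g., AtlasGPT vs. Perplexity, p=0.04) would appear as rejected null hypotheses in the Tukey table.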