Hajdari Shefqet, Farooq Minaam, Habib Aleeza, Siddiqui Asad Ali, Sarfraz Laiba, Habib Syed Mohammad, Mirza Uneeb Adnan, Boshara Mohamed, Fadlalla Mohamed B A
University Hospital Bonn, Department of Neurosurgery, Bonn, Germany.
King Edward Medical University, Mayo Hospital, Lahore, Pakistan.
J Clin Neurosci. 2025 Sep 3;141:111597. doi: 10.1016/j.jocn.2025.111597.
Large language models (LLMs), with their remarkable ability to retrieve and analyse information within seconds, are generating significant interest in healthcare. This study aims to assess and compare the accuracy, completeness, and usefulness of the responses of Gemini Advanced, ChatGPT-3.5, and ChatGPT-4 in neuro-oncology cases.
For 20 common neuro-oncology cases, four questions covering differential diagnosis, diagnostic workup, provisional diagnosis, and management plan were posed to the three LLMs. After replication and blinding, all responses were evaluated by senior neuro-oncologists on three scales: accuracy (1-6), completeness (1-3), and usefulness (1-3). To compare the performance of the three LLMs, ANOVA and Dunn's post hoc test were employed, with a p-value of less than 0.05 considered statistically significant.
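For illustration, the following is a minimal Python sketch of this style of analysis. The ratings shown are fabricated placeholders, not study data, and the Bonferroni adjustment for Dunn's test is an assumption not stated in the abstract.

```python
# Sketch of a three-group rating comparison: omnibus ANOVA, then Dunn's
# post hoc test for pairwise differences. Data here are hypothetical.
from scipy import stats
import scikit_posthocs as sp

# Hypothetical completeness ratings (1-3), one list per model.
gpt35  = [2, 1, 2, 2, 1, 2, 2, 1, 2, 2]
gpt4   = [2, 3, 2, 2, 3, 2, 2, 3, 2, 2]
gemini = [2, 2, 3, 2, 2, 2, 3, 2, 2, 2]

# One-way ANOVA across the three models (alpha = 0.05, as in the study).
f_stat, p_anova = stats.f_oneway(gpt35, gpt4, gemini)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")

# Pairwise comparisons via Dunn's test if the omnibus test is significant;
# the p-adjustment method is an illustrative choice.
if p_anova < 0.05:
    pairwise = sp.posthoc_dunn([gpt35, gpt4, gemini], p_adjust="bonferroni")
    print(pairwise)  # rows/columns 1..3 follow the input order of the groups
```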
Combining all domains, ChatGPT-4 was the most accurate (x̅ = 5.12), the most complete (x̅ = 2.05), and the most useful (x̅ = 2.06), followed by Gemini Advanced (x̅ = 4.97, 2.04, 2.05) and ChatGPT-3.5 (x̅ = 4.9, 1.9, 1.99). On Dunn's post hoc test, completeness of the differential diagnosis differed significantly for two pairs: ChatGPT-3.5 vs. ChatGPT-4 (p = 0.001) and ChatGPT-3.5 vs. Gemini Advanced (p = 0.003). For overall completeness, ANOVA showed a statistically significant difference among the three LLMs (p < 0.001).
Further studies should conduct more extensive tests, with a larger number of neurosurgeons assessing responses to a more diverse range of clinical cases, to better delineate these models' strengths and limitations. Future studies could also assess the performance of Retrieval Augmented Generation (RAG) with LLMs grounded in neuro-oncology treatment guidelines, as sketched below.
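As a rough sketch of the RAG setup the authors propose, the following retrieves the guideline passages most similar to a case vignette and prepends them to the prompt. The corpus, embedding model, and prompt format are illustrative assumptions, not the authors' pipeline.

```python
# Minimal RAG retrieval sketch: embed guideline snippets, retrieve the
# top-k matches for a case vignette, and build a guideline-grounded prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical guideline snippets standing in for a real guideline corpus.
guidelines = [
    "Glioblastoma: maximal safe resection followed by radiotherapy with concurrent temozolomide.",
    "Vestibular schwannoma: observation, radiosurgery, or microsurgery depending on size and symptoms.",
    "Meningioma: small asymptomatic lesions may be observed with serial imaging.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
doc_vecs = model.encode(guidelines, normalize_embeddings=True)

def build_prompt(case_vignette: str, k: int = 2) -> str:
    # Rank guideline passages by cosine similarity to the case description.
    q = model.encode([case_vignette], normalize_embeddings=True)[0]
    top = np.argsort(doc_vecs @ q)[::-1][:k]
    context = "\n".join(guidelines[i] for i in top)
    # The retrieved context is prepended so the LLM answers from guidelines.
    return f"Guidelines:\n{context}\n\nCase: {case_vignette}\nSuggest a management plan."

print(build_prompt("65-year-old with a ring-enhancing right temporal mass."))
```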