Liu Xu, Shi Suming, Zhang Xin, Gao Qianwen, Wang Wuqing
ENT Institute, Department of Otorhinolaryngology, Eye & ENT Hospital, Fudan University, Shanghai, 200031, China.
NHC Key Laboratory of Hearing Medicine (Fudan University), Shanghai, 200031, China.
Sci Rep. 2025 May 28;15(1):18688. doi: 10.1038/s41598-025-96309-8.
To compare the diagnostic accuracy of an artificial intelligence chatbot and clinical experts in vertigo-related diseases, and to evaluate the chatbot's ability to address vertigo-related issues. Twenty clinical questions about vertigo were input into ChatGPT-4o, and three otologists rated the responses on a 5-point Likert scale for accuracy, comprehensiveness, clarity, practicality, and credibility. Readability was assessed using the Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. The model and two otologists each diagnosed 15 outpatient vertigo cases, and diagnostic accuracy was calculated. The Kruskal-Wallis test, analysis of variance (ANOVA), and paired t-tests were used for statistical analysis. ChatGPT-4o scored highest in credibility (4.78). Repeated-measures ANOVA showed that ChatGPT's responses to the 20 questions differed significantly across the five scoring dimensions (F = 2.682, p = 0.038). Readability analysis showed that diagnosis-related outputs were more difficult to read than other types of content. The model's diagnostic accuracy was comparable to that of a clinician with one year of experience but inferior to that of a clinician with five years of experience, and the differences in accuracy among the three diagnostic methods were statistically significant (p = 0.04). ChatGPT-4o shows promise as a supplementary tool for managing vertigo but requires improvements in readability and diagnostic capability.
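The readability measures cited above are fixed published formulas. As a minimal sketch of how such scores are computed (the paper does not state its tokenization or syllable-counting method, so the functions below take precomputed counts as inputs):

```python
# Standard published formulas for the two readability metrics named in
# the abstract. Word, sentence, and syllable counting is left to the
# caller, since the paper does not specify its counting procedure.

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Higher scores indicate easier text (roughly 0-100)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59.
    Approximates the U.S. school grade level needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Illustrative counts (hypothetical, not from the paper):
# 100 words, 5 sentences, 130 syllables.
fre = flesch_reading_ease(100, 5, 130)    # ≈ 76.6, "fairly easy"
fkgl = flesch_kincaid_grade(100, 5, 130)  # ≈ 7.6, about 8th grade
```

Longer sentences and more polysyllabic words lower the FRE score and raise the FKGL, which is consistent with the finding that diagnosis-related outputs (dense clinical terminology) scored as harder to read.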