Maniaci Antonino, Hoch Cosima C, Sogalow Lise, Schmidl Benedikt, Lechien Jerome R
Department of Medical and Surgical Sciences, Faculty of Medicine, University of Enna Kore, Enna, Italy.
Yoifos Research Committee, Paris, France.
Eur Arch Otorhinolaryngol. 2025 Jun;282(6):3293-3302. doi: 10.1007/s00405-025-09371-3. Epub 2025 Apr 12.
To evaluate the diagnostic accuracy, the appropriateness of additional examination recommendations, and the consistency of therapeutic regimens proposed by ChatGPT-4 and Llama2 for real otolaryngology cases.
A prospective controlled study was conducted on 98 anonymized otolaryngology cases. Clinical information was entered into ChatGPT-4 and Llama2 to obtain primary diagnoses, additional examination recommendations, and treatment strategies. Two independent otolaryngologists rated the AI outputs with the artificial intelligence performance instrument (AIPI), which covers diagnostic accuracy, appropriateness of examinations, and adequacy of treatment. The outputs of the two AI systems were compared statistically with expert decisions. Interrater reliability was assessed with kappa statistics.
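As an illustration of the interrater reliability measure mentioned above, the following is a minimal sketch of unweighted Cohen's kappa for two raters. The abstract does not state which kappa variant was used or how AIPI scores were coded, so the function name `cohen_kappa`, the binary coding (1 = appropriate, 0 = not appropriate), and the ten example ratings are all hypothetical and are not study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of cases")
    n = len(rater_a)

    # Observed agreement: share of cases where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum over labels of the product of each rater's
    # marginal proportion for that label.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label for every case.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings for ten cases (NOT data from the study).
rater_1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
rater_2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the raters agree on 8 of 10 cases (observed agreement 0.80) against a chance agreement of 0.52, so kappa is about 0.58. If the AIPI scores were treated as ordinal, a weighted kappa would be the more usual choice; that variant is not shown here.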
ChatGPT-4 reached the correct diagnosis in 82% of cases, outperforming Llama2 at 76%. ChatGPT-4 suggested relevant and appropriate additional examinations in 88% of cases, compared with 83% for Llama2. Treatment recommendations were appropriate in 80% of cases for ChatGPT-4 and 72% for Llama2. Both systems occasionally suggested inappropriate tests. Interrater reliability for AIPI scores was high (kappa = 0.85).
ChatGPT-4 and Llama2 show considerable potential as clinical decision-support tools in otolaryngology, with ChatGPT-4 performing better on all three measures. However, the occasional non-relevant recommendations indicate the need for further refinement and for human oversight to ensure safe application in clinical practice.