AI in clinical decision-making: ChatGPT-4 vs. Llama2 for otolaryngology cases.

Authors

Maniaci Antonino, Hoch Cosima C, Sogalow Lise, Schmidl Benedikt, Lechien Jerome R

Affiliations

Department of Medical and Surgical Sciences, Faculty of Medicine, University of Enna Kore, Enna, Italy.

Yoifos Research Committee, Paris, France.

Publication information

Eur Arch Otorhinolaryngol. 2025 Jun;282(6):3293-3302. doi: 10.1007/s00405-025-09371-3. Epub 2025 Apr 12.

Abstract

PURPOSE

To evaluate the diagnostic accuracy, appropriateness of additional examination recommendations, and consistency of therapeutic regimens by ChatGPT-4 and Llama2 based on real otolaryngology cases.

METHODS

A prospective controlled study was conducted on 98 anonymized otolaryngology cases. Clinical information was entered into ChatGPT-4 and Llama2 to obtain primary diagnoses, additional examination recommendations, and treatment strategies. Two independent otolaryngologists scored the AI outputs with the Artificial Intelligence Performance Instrument (AIPI), assessing diagnostic accuracy, appropriateness of examinations, and adequacy of treatment. The AI systems' outputs were compared statistically with expert decisions. Interrater reliability was assessed with kappa statistics.
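The kappa statistic mentioned above corrects raw agreement between two raters for the agreement expected by chance. The abstract does not say which variant was used or how AIPI scores were categorized, so the following is only a minimal sketch that assumes unweighted Cohen's kappa on categorical ratings. The function name `cohen_kappa` and the ratings are hypothetical and are not study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters' marginal rates.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings (1 = output judged appropriate, 0 = not), NOT study data.
rater_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
rater_b = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 2))
```

In this toy example the raters agree on 9 of 10 items (observed agreement 0.9) while chance agreement is 0.5, so kappa is (0.9 - 0.5) / (1 - 0.5) = 0.8. If the AIPI scores are treated as ordinal, a weighted kappa may be more suitable; the abstract does not specify.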

RESULTS

ChatGPT-4 reached the correct diagnosis in 82% of cases, compared with 76% for Llama2. ChatGPT-4 suggested relevant and appropriate additional examinations in 88% of cases and Llama2 in 83%. Treatment was judged appropriate in 80% of cases for ChatGPT-4 and 72% for Llama2. Both systems occasionally suggested inappropriate tests. Interrater reliability for AIPI scores was high (kappa = 0.85).

CONCLUSION

ChatGPT-4 and Llama2 showed considerable potential as clinical decision-support tools in otolaryngology, with ChatGPT-4 performing better. However, their non-relevant recommendations indicate a need for further refinement and human oversight to ensure safe application in clinical practice.
