AI in clinical decision-making: ChatGPT-4 vs. Llama2 for otolaryngology cases.

Author information

Maniaci Antonino, Hoch Cosima C, Sogalow Lise, Schmidl Benedikt, Lechien Jerome R

Affiliations

Department of Medical and Surgical Sciences, Faculty of Medicine, University of Enna Kore, Enna, Italy.

Yoifos Research Committee, Paris, France.

Publication information

Eur Arch Otorhinolaryngol. 2025 Jun;282(6):3293-3302. doi: 10.1007/s00405-025-09371-3. Epub 2025 Apr 12.

DOI:10.1007/s00405-025-09371-3

PMID:40220179

Abstract

PURPOSE

To evaluate the diagnostic accuracy, appropriateness of additional examination recommendations, and consistency of therapeutic regimens by ChatGPT-4 and Llama2 based on real otolaryngology cases.

METHODS

A prospective controlled study was conducted on 98 anonymized otolaryngology cases. Clinical information was entered into ChatGPT-4 and Llama2 to obtain primary diagnoses, additional examination recommendations, and treatment strategies. Two independent otolaryngologists rated the AI outputs with the Artificial Intelligence Performance Instrument (AIPI), assessing diagnostic accuracy, appropriateness of examinations, and adequacy of treatment. Statistical comparisons were conducted between the AI systems and expert decisions. Interrater reliability was evaluated with kappa statistics.
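The interrater check described above can be illustrated with a minimal sketch. The abstract does not say which kappa variant was used, so unweighted Cohen's kappa for two raters is assumed here. The function name `cohen_kappa` and the ten ratings are hypothetical and are not study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up ratings for ten cases, for illustration only (NOT data from the study).
rater_1 = ["correct"] * 8 + ["incorrect"] * 2
rater_2 = ["correct"] * 7 + ["incorrect"] * 3
kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the raters agree on 9 of 10 cases, but kappa is lower than 0.9 because part of that agreement would be expected by chance alone.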

RESULTS

ChatGPT-4 reached the correct diagnosis in 82% of cases, outperforming Llama2 at 76%. For additional examinations, ChatGPT-4 suggested relevant and appropriate tests in 88% of cases, while Llama2 did so in 83%. Treatment was judged appropriate in 80% of cases for ChatGPT-4 and 72% for Llama2. Both systems occasionally suggested inappropriate tests. Interrater reliability for AIPI scores was high (kappa = 0.85).
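To give a rough sense of the sampling uncertainty behind these percentages, the sketch below computes 95% Wilson score intervals for the two reported diagnostic accuracy rates. It assumes each percentage is a simple proportion over all 98 cases, which the abstract does not confirm. The helper `wilson_interval` is an illustrative name, not part of the study's analysis.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


N_CASES = 98  # number of cases reported in the abstract
# Diagnostic accuracy as reported; assumed to be proportions of all 98 cases.
reported = {"ChatGPT-4": 0.82, "Llama2": 0.76}

intervals = {name: wilson_interval(p, N_CASES) for name, p in reported.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: {low:.3f} to {high:.3f}")
```

Because both models were run on the same cases, these separate intervals are only a descriptive aid. A paired comparison would be needed to judge whether the difference between the systems is statistically meaningful.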

CONCLUSION

ChatGPT-4 and Llama2 show considerable potential as clinical decision-support tools in otolaryngology, with ChatGPT-4 performing better. However, their non-relevant recommendations indicate a need for further refinement and human oversight to ensure safe use in clinical practice.
