Evaluation of research methodology generation by large language models in laryngology: a comparative analysis of ChatGPT-4.0 and Gemini 1.5 flash.

Author information

Türe Nurullah, Umurhan Elif, Tahir Emel

Affiliations

Department of Otorhinolaryngology, Kütahya Health Sciences University, Kütahya, Türkiye.

Department of Otorhinolaryngology, Ondokuz Mayıs University, Samsun, Türkiye.

Publication information

Eur Arch Otorhinolaryngol. 2025 Sep 18. doi: 10.1007/s00405-025-09656-7.

DOI:10.1007/s00405-025-09656-7

PMID:40968205

Abstract

OBJECTIVES

This study aimed to compare the ability of two major language models, ChatGPT-4.0 and Gemini 1.5 Flash, to establish a research methodology based on scientific publications in laryngology.

METHODS

We screened 80 articles from five prestigious otolaryngology journals and included the 60 that contained both a methods section and a statistical analysis. These were classified into six research types: cell culture, animal experiments, prospective, retrospective, systematic review, and artificial intelligence. Five articles were randomly selected from each type, giving 30 studies for analysis. For each article, both language models were asked to produce a research methodology, and two independent raters evaluated the responses.
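The selection step described in the methods (five articles drawn at random from each of six research types) can be sketched as stratified random sampling. This is only an illustration: the abstract does not report how the 60 included articles were distributed across types or how the randomization was performed, so the pool of placeholder IDs, the even 10-per-type split, the `stratified_sample` helper, and the seed are all assumptions.

```python
import random

STUDY_TYPES = [
    "cell culture",
    "animal experiments",
    "prospective",
    "retrospective",
    "systematic review",
    "artificial intelligence",
]


def stratified_sample(articles, per_group=5, seed=0):
    """Draw `per_group` article IDs at random from each study type.

    `articles` is an iterable of (article_id, study_type) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_type = {}
    for article_id, study_type in articles:
        by_type.setdefault(study_type, []).append(article_id)

    sample = {}
    for study_type, ids in by_type.items():
        if len(ids) < per_group:
            raise ValueError(f"not enough articles of type {study_type!r}")
        sample[study_type] = sorted(rng.sample(ids, per_group))
    return sample


# Hypothetical pool: 60 placeholder IDs, 10 per type (the real split is not reported).
pool = [(f"{t}-{i}", t) for t in STUDY_TYPES for i in range(10)]
selected = stratified_sample(pool)
total = sum(len(ids) for ids in selected.values())
print(total, "articles selected across", len(selected), "types")
```

Sampling within each type, rather than from the pooled 60, is what guarantees that every research type contributes the same number of articles.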

RESULTS

The mean scores of the two models did not differ significantly (p > 0.05). ChatGPT-4.0 had the higher mean score (5.17 ± 1.12), most notably in the data collection and measurement-assessment category, whereas Gemini performed relatively more evenly in the statistical analysis category. Weighted kappa values ranged from 0.54 to 0.71, indicating moderate to high agreement between the raters. In the analysis by article type, Gemini's performance on Q1 varied significantly (p = 0.038).
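For readers unfamiliar with the agreement statistic, Cohen's weighted kappa for two raters can be computed as in the sketch below. The abstract does not say which weighting scheme or scoring scale was used, so the linear weights, the 1 to 7 scale, and the example scores are assumptions made purely for illustration; they are not data from the study.

```python
def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Cohen's weighted kappa for two raters scoring the same items.

    `categories` lists the ordered rating levels. `weights` is "linear"
    or "quadratic" and sets how strongly larger disagreements count.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed joint proportions of (rater A level, rater B level).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    # Weighted observed disagreement versus disagreement expected by chance.
    num = sum(disagreement(i, j) * observed[i][j]
              for i in range(k) for j in range(k))
    den = sum(disagreement(i, j) * row[i] * col[j]
              for i in range(k) for j in range(k))
    if den == 0:
        raise ValueError("kappa is undefined when all ratings are identical")
    return 1.0 - num / den


# Hypothetical scores on an assumed 1-7 scale (not taken from the study).
scale = list(range(1, 8))
scores_rater_1 = [5, 6, 4, 7, 5, 3, 6, 5]
scores_rater_2 = [5, 5, 4, 6, 5, 4, 6, 4]
print(round(weighted_kappa(scores_rater_1, scores_rater_2, scale), 2))
```

Unlike unweighted kappa, the weighted form gives partial credit when raters differ by one level, which suits ordinal scores such as these.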

CONCLUSION

Large language models (LLMs) such as ChatGPT and Gemini provide similarly consistent results in establishing the methodology of scientific studies in laryngology. Both models can be considered supportive tools; however, expert supervision is needed, especially for complex components such as statistical analysis. This study contributes original evidence on the usability of LLMs for study design in laryngology.
