Türe Nurullah, Umurhan Elif, Tahir Emel
Department of Otorhinolaryngology, Kütahya Health Sciences University, Kütahya, Türkiye.
Department of Otorhinolaryngology, Ondokuz Mayıs University, Samsun, Türkiye.
Eur Arch Otorhinolaryngol. 2025 Sep 18. doi: 10.1007/s00405-025-09656-7.
This study aimed to compare the ability of two major large language models, ChatGPT-4.0 and Gemini 1.5 Flash, to establish a research methodology based on scientific publications in laryngology.
We screened 80 articles from five leading otolaryngology journals and included the 60 that contained a methods section and statistical analysis. These were classified into six research types: cell culture, animal experiment, prospective, retrospective, systematic review, and artificial intelligence. Five articles were then randomly selected from each group, yielding 30 studies for analysis. For each article, both language models were asked to produce a research methodology, and their responses were scored by two independent raters.
There was no statistically significant difference between the models' mean scores (p > 0.05). ChatGPT-4.0 achieved a higher mean score (5.17 ± 1.12), particularly in the data collection and measurement-assessment category, while Gemini performed relatively more evenly in the statistical analysis category. Weighted kappa values ranged from 0.54 to 0.71, indicating moderate to high inter-rater agreement. In the analysis by article type, Gemini's performance on Q1 varied significantly (p = 0.038).
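The weighted kappa statistic used above to measure inter-rater agreement can be sketched in plain Python. This is a minimal illustration only: the rater scores and the category scale are hypothetical, and linear weighting is an assumption, since the abstract does not specify the weighting scheme.

```python
# Sketch of a linearly weighted Cohen's kappa for two raters on an
# ordinal scale. Linear weights are an assumption; the study may have
# used quadratic weights instead.

def weighted_kappa(rater1, rater2, categories):
    """Linearly weighted Cohen's kappa between two raters."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(rater1)

    # Observed k x k cross-tabulation of the two raters' scores.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater1, rater2):
        observed[idx[a]][idx[b]] += 1

    # Marginal totals, used to build the chance-expected matrix.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    disagree_obs = disagree_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)          # linear disagreement weight
            disagree_obs += w * observed[i][j]
            disagree_exp += w * row[i] * col[j] / n
    return 1.0 - disagree_obs / disagree_exp

# Hypothetical example: perfect agreement gives kappa = 1.0.
print(weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # → 1.0
```

Values such as the 0.54-0.71 range reported here arise when the raters' scores mostly agree but differ by one or two scale points on some items.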
Large language models such as ChatGPT and Gemini produce similarly consistent results when establishing the methodology of scientific studies in laryngology. Both can serve as supportive tools; however, expert supervision remains necessary, especially for complex components such as statistical analysis. This study offers an original contribution on the usability of LLMs for study design in laryngology.