Sheridan Gerard A, Howard Lisa C, Neufeld Michael E, Doyle Tom R, Hughes Andrew J, Sculco Peter K, Beverland David E, Garbuz Donald S, Masri Bassam A
University of British Columbia, Vancouver, BC, Canada.
Hospital for Special Surgery, New York, NY, USA.
Ir J Med Sci. 2025 Jun 12. doi: 10.1007/s11845-025-03971-y.
There is considerable interest in the use of artificial intelligence (AI) in the production and assessment of academic material; however, the role of AI in this process remains unclear.
The purpose of this study was to perform a reviewer-blinded assessment of the quality of scientific discussion generated by an advanced AI language model (ChatGPT-4, OpenAI) and to determine whether it could be recommended for publication in a high-impact journal.
The introduction, methods and results sections of a recently published article from a high-impact journal were input into a current AI model. The AI application then produced a discussion and conclusion based on the provided text using a standardized prompt. Six experienced blinded reviewers scored all five sections of the hybrid article. A one-way analysis of variance (ANOVA) was used to assess significant differences between scores of each section. Reviewers recommended a decision regarding the suitability of the article for publication.
AI composed a scientific discussion and conclusion. The median score was 80 (IQR 70-90) for the introduction, 77.5 (IQR 70-90) for the methods, 82.5 (IQR 50-90) for the results, 60 (IQR 40-75) for the discussion, and 60 (IQR 40-80) for the conclusion. The median scores for the AI-generated sections were lower than those of the human-written sections, but the difference was not statistically significant (p = 0.37). The majority of reviewers (5/6, 83%) recommended "acceptance for publication after major revision". One reviewer recommended "resubmission with no guarantee of acceptance". There were no recommendations for rejection.
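The analysis above compares per-section reviewer scores with a one-way ANOVA and summarizes each section by median and IQR. A minimal sketch of that analysis is shown below, using hypothetical reviewer scores chosen only to echo the reported medians (the study's actual per-reviewer data are not given), with the F statistic computed by hand from between- and within-group sums of squares:

```python
from statistics import mean, median, quantiles

def one_way_anova(groups):
    """One-way ANOVA F statistic for a list of score groups (one group per section)."""
    k = len(groups)                      # number of groups (article sections)
    n = sum(len(g) for g in groups)      # total observations (reviewer scores)
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical scores from six reviewers per section; illustrative only,
# constructed to reproduce the reported medians, not the study's raw data.
sections = {
    "introduction": [70, 75, 80, 80, 90, 90],
    "methods":      [70, 70, 75, 80, 90, 90],
    "results":      [50, 70, 80, 85, 90, 90],
    "discussion":   [40, 50, 60, 60, 75, 80],
    "conclusion":   [40, 50, 60, 60, 80, 85],
}

f_stat, df1, df2 = one_way_anova(list(sections.values()))
for name, scores in sections.items():
    q1, _, q3 = quantiles(scores, n=4)   # first and third quartiles
    print(f"{name}: median={median(scores)}, IQR={q1:.1f}-{q3:.1f}")
print(f"F({df1},{df2}) = {f_stat:.2f}")
```

Note that quartile estimates depend on the interpolation method, so the printed IQRs need not match the published ones exactly; in practice this comparison would typically be run with `scipy.stats.f_oneway`.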
Current AI large language models are capable of generating content that, after revision, passes experienced peer review and is acceptable for publication in a high-impact orthopaedic journal. Substantial concerns remain regarding the integration of AI into scientific writing, chiefly its reliance on pattern recognition and its tendency to produce fabricated or inadequate references.
Level IV.