Hospital Israelita Albert Einstein and Departamento de Neurologia e Neurocirurgia, Universidade Federal de São Paulo, Brazil (G.S.S.).
Section of Cardiovascular Medicine (R.K.), Yale School of Medicine, New Haven, CT.
Stroke. 2024 Oct;55(10):2573-2578. doi: 10.1161/STROKEAHA.124.045012. Epub 2024 Sep 3.
Artificial intelligence (AI) large language models (LLMs) now produce human-like general text and images. LLMs' ability to generate persuasive scientific essays that undergo evaluation under traditional peer review has not been systematically studied. To measure perceptions of quality and the nature of authorship, we conducted a competitive essay contest in 2024 with both human and AI participants. Human authors and 4 distinct LLMs generated essays on controversial topics in stroke care and outcomes research. A panel of Editorial Board members (mostly vascular neurologists), blinded to author identity and with varying levels of AI expertise, rated the essays for quality, persuasiveness, best in topic, and author type. Among 34 submissions (22 human and 12 LLM) scored by 38 reviewers, human and AI essays received mostly similar ratings, though AI essays were rated higher for composition quality. Author type was accurately identified only 50% of the time, with prior LLM experience associated with improved accuracy. In multivariable analyses adjusted for author attributes and essay quality, only persuasiveness was independently associated with the odds of a reviewer assigning AI as the author type (adjusted odds ratio, 1.53 [95% CI, 1.09-2.16]; P=0.01). In conclusion, a group of experienced Editorial Board members struggled to distinguish human from AI authorship, and essays judged to be AI generated were less likely to be selected as best in topic. Scientific journals may benefit from educating reviewers on the types and uses of AI in scientific writing and from developing thoughtful policies on the appropriate use of AI in authoring manuscripts.
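For readers unfamiliar with how an adjusted odds ratio like the one reported above is obtained, the following is a minimal illustrative sketch using multivariable logistic regression in Python (statsmodels). The variable names (assigned_ai, persuasiveness, composition_quality, author_is_ai) are hypothetical stand-ins for the study's rating data, and the sketch does not reproduce the authors' actual analysis, which may, for example, have accounted for clustering of ratings within reviewers.

```python
# Illustrative sketch only (not the authors' code): estimating an adjusted odds
# ratio for a reviewer labeling an essay as AI-authored via logistic regression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per reviewer-essay rating.
rng = np.random.default_rng(0)
ratings = pd.DataFrame({
    "assigned_ai": rng.binomial(1, 0.5, 200),            # reviewer guessed AI (1) vs human (0)
    "persuasiveness": rng.integers(1, 6, 200),            # 1-5 rating
    "composition_quality": rng.integers(1, 6, 200),       # 1-5 rating
    "author_is_ai": rng.binomial(1, 0.35, 200),           # true author type
})

# Multivariable logistic regression adjusting for essay attributes.
model = smf.logit(
    "assigned_ai ~ persuasiveness + composition_quality + author_is_ai",
    data=ratings,
).fit(disp=False)

# Exponentiate coefficients to obtain adjusted odds ratios with 95% CIs.
odds_ratios = np.exp(model.params)
conf_int = np.exp(model.conf_int())
print(pd.concat([odds_ratios.rename("OR"), conf_int], axis=1))
```

In this framing, an adjusted odds ratio of 1.53 for persuasiveness would mean that each one-point increase in the persuasiveness rating multiplies the odds of the reviewer guessing "AI" by about 1.53, holding the other covariates fixed.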