Ozkara Burak Berksu, Boutet Alexandre, Comstock Bryan A, Van Goethem Johan, Huisman Thierry A G M, Ross Jeffrey S, Saba Luca, Shah Lubdha M, Wintermark Max, Castillo Mauricio
From the Department of Neuroradiology (B.B.O., M.W.), The University of Texas MD Anderson Cancer Center, Houston, Texas.
Joint Department of Medical Imaging (A.B.), University of Toronto, Toronto, Ontario, Canada.
AJNR Am J Neuroradiol. 2025 Mar 4;46(3):559-566. doi: 10.3174/ajnr.A8505.
Artificial intelligence is capable of generating complex texts that may be indistinguishable from those written by humans. We aimed to evaluate the ability of GPT-4 to write radiology editorials and to compare these with human-written counterparts, thereby determining their real-world applicability for scientific writing.
Sixteen editorials from 8 journals were included. To generate the artificial intelligence (AI)-written editorials, the summaries of the 16 human-written editorials were fed into GPT-4. Six experienced editors reviewed the articles. First, an unpaired approach was used: the raters evaluated the content of each article on a 1-5 Likert scale across specified metrics and then judged whether each editorial was written by a human or by AI. The articles were then evaluated in pairs to determine which article was generated by AI and which should be published. Finally, the articles were analyzed with an AI detector and screened for plagiarism.
The human-written articles had a median AI probability score of 2.0%, whereas the AI-written articles had a median score of 58%. The median similarity score among AI-written articles was 3%. In the unpaired setting, 58% of articles were correctly classified by authorship; in the paired setting, classification accuracy increased to 70%. AI-written articles received slightly higher scores on most metrics. When stratified by perceived authorship, however, articles perceived as human-written were rated higher in most categories, and in the paired setting, raters strongly preferred publishing the article they perceived as human-written (82%).
GPT-4 can write high-quality articles that iThenticate does not flag as plagiarized, that editors may fail to identify as AI-generated, and that AI-detection tools recognize only to a limited extent. Editors showed a positive bias toward articles they perceived as human-written.