
Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers.

Authors

Gao Catherine A, Howard Frederick M, Markov Nikolay S, Dyer Emma C, Ramesh Siddhi, Luo Yuan, Pearson Alexander T

Affiliations

Division of Pulmonary and Critical Care, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA.

Section of Hematology/Oncology, Department of Medicine, University of Chicago, Chicago, IL, USA.

Publication

NPJ Digit Med. 2023 Apr 26;6(1):75. doi: 10.1038/s41746-023-00819-6.

Abstract

Large language models such as ChatGPT can produce increasingly realistic text, but little is known about the accuracy and integrity of using these models in scientific writing. We gathered fifty research abstracts from five high-impact-factor medical journals and asked ChatGPT to generate research abstracts based on their titles and journals. Most generated abstracts were detected using an AI output detector, 'GPT-2 Output Detector', with % 'fake' scores (higher meaning more likely to be generated) of median [interquartile range] 99.98% 'fake' [12.73%, 99.98%], compared with median 0.02% [IQR 0.02%, 0.09%] for the original abstracts. The AUROC of the AI output detector was 0.94. Generated abstracts scored lower than original abstracts when run through a plagiarism-detector website and iThenticate (higher scores meaning more matching text found). When given a mixture of original and generated abstracts, blinded human reviewers correctly identified 68% of generated abstracts as being generated by ChatGPT, but incorrectly identified 14% of original abstracts as being generated. Reviewers indicated that it was surprisingly difficult to differentiate between the two, though abstracts they suspected were generated were vaguer and more formulaic. ChatGPT writes believable scientific abstracts, though with completely generated data. Depending on publisher-specific guidelines, AI output detectors may serve as an editorial tool to help maintain scientific standards. The boundaries of ethical and acceptable use of large language models to help scientific writing are still being discussed, and different journals and conferences are adopting varying policies.
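The detector comparison above reduces to a ranking question: given one generated and one original abstract, how often does the generated one receive the higher % 'fake' score? That probability is exactly what the reported AUROC of 0.94 measures. A minimal sketch of the rank-based (Mann-Whitney) computation, using hypothetical detector scores — the paper reports only medians and IQRs, so the numbers below are purely illustrative:

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive (generated) outscores a
    random negative (original); ties count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical % 'fake' scores, loosely shaped like the reported
# medians/IQRs (generated: median 99.98, IQR down to 12.73;
# original: median 0.02) -- not the study's actual per-abstract data.
generated = [99.98, 99.98, 12.73, 0.50]
original = [0.02, 0.09, 0.30, 5.00]

print(auroc(generated, original))  # 0.9375
```

Only one pair is mis-ranked (the 0.50 generated score versus the 5.00 original), giving 15/16 = 0.9375; with the study's real per-abstract scores this same calculation yields the reported 0.94.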

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/589b/10133283/befb62323f27/41746_2023_819_Fig1_HTML.jpg
