Unveiling ChatGPT text using writing style.

Authors

Berriche Lamia, Larabi-Marie-Sainte Souad

Affiliation

College of Computer & Information Sciences, Prince Sultan University, Saudi Arabia.

Publication information

Heliyon. 2024 Jun 15;10(12):e32976. doi: 10.1016/j.heliyon.2024.e32976. eCollection 2024 Jun 30.

Abstract

The use of AI-generated text has surged since the advent of large language models. Although AI text generators such as ChatGPT are beneficial, they also threaten academic standards, since students may rely on them to produce their work. In this work, we propose a technique that leverages the intrinsic stylometric features of documents to detect ChatGPT-based plagiarism. The stylometric features were normalized and fed to classical classifiers, such as k-Nearest Neighbors, Decision Tree, and Naïve Bayes, as well as ensemble classifiers, such as XGBoost and Stacking. The classifiers were examined thoroughly using cross-validation, hyperparameter tuning, and multiple training iterations. The results show that both classical and ensemble classifiers distinguish effectively between human and ChatGPT writing styles; XGBoost performed best, achieving 100% accuracy, recall, and precision. Moreover, the proposed XGBoost classifier outperformed the state-of-the-art result obtained with the same classifier on the same dataset, which highlights the advantage of the proposed stylometric feature extraction over TF-IDF techniques. The ensemble classifiers were also applied to a generated dataset of mixed texts, in which paragraphs are written by both ChatGPT and humans. The results show that 98% of the documents were correctly classified as either mixed or human. The final contribution is the authorship attribution of the paragraphs within a single document, where accuracy reached 92.3%.
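
The pipeline described in the abstract (extract stylometric features, normalize them, feed them to a classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the four features below, the min-max scaling, and the function names `stylometric_features`, `fit_min_max`, `apply_min_max`, and `knn_predict` are all assumptions chosen for illustration, since the abstract names neither the feature set nor the normalization method. k-Nearest Neighbors is used only because it is one of the classical classifiers the abstract mentions and is easy to write without third-party libraries.

```python
import math
import re
from collections import Counter


def stylometric_features(text):
    """Return a small illustrative feature vector (not the paper's feature set)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)   # guard against empty input
    n_sents = max(len(sentences), 1)
    return [
        sum(len(w) for w in words) / n_words,          # mean word length
        len(words) / n_sents,                          # mean sentence length in words
        len(set(words)) / n_words,                     # type-token ratio
        sum(text.count(c) for c in ",;:") / n_words,   # commas, semicolons, colons per word
    ]


def fit_min_max(rows):
    """Learn per-feature minima and maxima from the training rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]


def apply_min_max(row, lo, hi):
    """Scale one row to [0, 1] per feature; constant features map to 0.0."""
    return [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(row, lo, hi)]


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training rows closest to x (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

In use, one would compute `stylometric_features` for each labeled training document, call `fit_min_max` on the training rows only, scale both training and test rows with `apply_min_max`, and then classify with `knn_predict`. Fitting the scaler on training data alone avoids leaking test-set statistics into the model, which matters when results are reported under cross-validation.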

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/96e2/11231544/709a2f3d359a/gr1.jpg
