Spot the bot: the inverse problems of NLP.

Author information

Gromov Vasilii A, Dang Quynh Nhu, Kogan Alexandra S, Yerbolova Assel

Affiliations

HSE University, Moscow, Russia.

Publication information

PeerJ Comput Sci. 2024 Dec 9;10:e2550. doi: 10.7717/peerj-cs.2550. eCollection 2024.

Abstract

This article addresses the problem of distinguishing human-written texts from bot-generated ones. In contrast to the classical problem formulation, which focuses on a single type of bot, we consider the problem of distinguishing texts written by any person from those generated by any bot; this involves analysing the large-scale, coarse-grained structure of the semantic space of a language. To construct the training and test datasets, we propose separating not the texts of bots but the bots themselves, so that the test sample contains texts from bots (and people) that were not in the training sample. We aim to find efficient and versatile features rather than a complex classification model architecture that deals only with a particular type of bot. In the study, we derive features for human-written and bot-generated texts using clustering (Wishart and K-Means, as well as fuzzy variations) and nonlinear dynamics techniques (entropy-complexity measures). We then deliberately use the simplest classifiers (support vector machine, decision tree, random forest) together with the derived features to identify whether a text is human-written. Large-scale simulation shows good classification results (a classification quality of over 96%), although the results vary across languages of different language families.
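
The dataset construction described in the abstract, in which whole bots (and people) are held out rather than individual texts, can be illustrated with a minimal sketch. This is not the authors' code: the function name `split_by_author`, the `(author_id, text, label)` tuple layout, the `test_fraction` parameter, and the placeholder authors `bot_A`, `human_1`, and so on are all assumptions made for illustration. The sketch also assumes that each author belongs to exactly one class.

```python
import random
from collections import defaultdict


def split_by_author(samples, test_fraction=0.25, seed=0):
    """Split (author_id, text, label) tuples so that no author
    (bot or person) contributes texts to both train and test."""
    # Collect the distinct authors of each class.
    authors_by_label = defaultdict(set)
    for author, _text, label in samples:
        authors_by_label[label].add(author)

    rng = random.Random(seed)
    test_authors = set()
    for label, authors in authors_by_label.items():
        ordered = sorted(authors)  # fixed order before the seeded shuffle
        rng.shuffle(ordered)
        # Hold out at least one author per class, but always leave
        # at least one author of that class in the training set.
        n_test = min(len(ordered) - 1,
                     max(1, round(len(ordered) * test_fraction)))
        test_authors.update(ordered[:n_test])

    train = [s for s in samples if s[0] not in test_authors]
    test = [s for s in samples if s[0] in test_authors]
    return train, test


# Hypothetical toy corpus: four bots and four people, two texts each.
samples = []
for name in ("bot_A", "bot_B", "bot_C", "bot_D"):
    samples += [(name, f"{name} text {i}", "bot") for i in range(2)]
for name in ("human_1", "human_2", "human_3", "human_4"):
    samples += [(name, f"{name} text {i}", "human") for i in range(2)]

train, test = split_by_author(samples)
print("held-out authors:", sorted({author for author, _, _ in test}))
```

Splitting at the author level means that a classifier evaluated on `test` has never seen any text from the held-out bots or people, which is the property the abstract emphasises. With scikit-learn available, a grouped splitter would serve a similar purpose; the pure-Python version above is used only to keep the sketch self-contained.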

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/706c/11784749/a190f82c5eea/peerj-cs-10-2550-g001.jpg
