Gromov Vasilii A, Dang Quynh Nhu, Kogan Alexandra S, Yerbolova Assel
HSE University, Moscow, Russia.
PeerJ Comput Sci. 2024 Dec 9;10:e2550. doi: 10.7717/peerj-cs.2550. eCollection 2024.
This article concerns the problem of distinguishing human-written and bot-generated texts. In contrast to the classical problem formulation, which focuses on a single type of bot, we consider the problem of distinguishing texts written by any person from those generated by any bot; this involves analysing the large-scale, coarse-grained structure of the language semantic space. To construct the training and test datasets, we propose to separate not the texts of bots, but the bots themselves, so that the test sample contains texts of bots (and people) that were not in the training sample. We aim to find efficient and versatile features, rather than a complex classification model architecture that deals with only a particular type of bot. In this study, we derive features for human-written and bot-generated texts using clustering (Wishart and K-Means, as well as their fuzzy variants) and nonlinear dynamics techniques (entropy-complexity measures). We then deliberately use the simplest of classifiers (support vector machine, decision tree, random forest) together with the derived features to identify whether a text is human-written or not. Large-scale simulation shows good classification results (a classification quality of over 96%), though results vary across languages of different language families.
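The abstract mentions entropy-complexity measures among the derived features. As a minimal illustrative sketch (not the authors' code), the snippet below computes normalised Bandt-Pompe permutation entropy, a standard ingredient of entropy-complexity analysis; the input series is hypothetical, e.g. a sequence of sentence lengths extracted from a text.

```python
# Illustrative sketch only: permutation entropy, one ingredient of the
# entropy-complexity measures mentioned in the abstract. The input series
# is a stand-in for a numeric sequence derived from a text.
from collections import Counter
from math import log, factorial

def permutation_entropy(series, order=3):
    """Normalised Bandt-Pompe permutation entropy in [0, 1]."""
    patterns = Counter()
    for i in range(len(series) - order + 1):
        window = series[i:i + order]
        # Ordinal pattern: argsort of the window's values.
        pattern = tuple(sorted(range(order), key=lambda k: window[k]))
        patterns[pattern] += 1
    total = sum(patterns.values())
    h = -sum((c / total) * log(c / total) for c in patterns.values())
    return h / log(factorial(order))  # normalise by log(order!)

# A monotone series is maximally ordered (entropy 0); an irregular
# series yields a value closer to 1.
print(permutation_entropy([1, 2, 3, 4, 5, 6, 7, 8]))  # → 0.0
print(permutation_entropy([4, 1, 7, 3, 8, 2, 6, 5]))
```

A full entropy-complexity feature would pair this entropy with a statistical complexity measure over the same ordinal-pattern distribution; the classifiers named in the abstract would then consume such features as inputs.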