Gromov Vasilii A, Dang Quynh Nhu, Kogan Alexandra S, Yerbolova Assel
HSE University, Moscow, Russia.
PeerJ Comput Sci. 2024 Dec 9;10:e2550. doi: 10.7717/peerj-cs.2550. eCollection 2024.
This article concerns the problem of distinguishing human-written and bot-generated texts. In contrast to the classical problem formulation, which focuses on a single type of bot, we consider the problem of distinguishing texts written by any person from those generated by any bot; this involves analysing the large-scale, coarse-grained structure of the language semantic space. To construct the training and test datasets, we propose to separate not the texts of bots, but the bots themselves, so that the test sample contains texts of bots (and people) that were not in the training sample. We aim to find efficient and versatile features, rather than a complex classification model architecture that deals with only a particular type of bot. In this study, we derive features for human-written and bot-generated texts using clustering (Wishart and K-Means, as well as their fuzzy variants) and nonlinear dynamics techniques (entropy-complexity measures). We then deliberately use the simplest of classifiers (support vector machine, decision tree, random forest) together with the derived features to identify whether a text is human-written or not. Large-scale simulation shows good classification results (a classification quality of over 96%), though results vary across languages of different language families.
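The abstract mentions entropy-complexity measures among the derived features. As a minimal illustrative sketch (not the authors' code), the snippet below computes normalised Bandt-Pompe permutation entropy, a standard ingredient of entropy-complexity analysis; the input series is hypothetical, e.g. a sequence of sentence lengths extracted from a text.

```python
# Illustrative sketch only: permutation entropy, one ingredient of the
# entropy-complexity measures mentioned in the abstract. The input series
# is a stand-in for a numeric sequence derived from a text.
from collections import Counter
from math import log, factorial

def permutation_entropy(series, order=3):
    """Normalised Bandt-Pompe permutation entropy in [0, 1]."""
    patterns = Counter()
    for i in range(len(series) - order + 1):
        window = series[i:i + order]
        # Ordinal pattern: argsort of the window's values.
        pattern = tuple(sorted(range(order), key=lambda k: window[k]))
        patterns[pattern] += 1
    total = sum(patterns.values())
    h = -sum((c / total) * log(c / total) for c in patterns.values())
    return h / log(factorial(order))  # normalise by log(order!)

# A monotone series is maximally ordered (entropy 0); an irregular
# series yields a value closer to 1.
print(permutation_entropy([1, 2, 3, 4, 5, 6, 7, 8]))  # → 0.0
print(permutation_entropy([4, 1, 7, 3, 8, 2, 6, 5]))
```

A full entropy-complexity feature would pair this entropy with a statistical complexity measure over the same ordinal-pattern distribution; the classifiers named in the abstract would then consume such features as inputs.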