

Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks.

Authors

Akimushkin Camilo, Amancio Diego Raphael, Oliveira Osvaldo Novais

Affiliations

São Carlos Institute of Physics, University of São Paulo, São Carlos, São Paulo, Brazil.

Institute of Mathematics and Computer Science, University of São Paulo, São Carlos, São Paulo, Brazil.

Publication information

PLoS One. 2017 Jan 26;12(1):e0170527. doi: 10.1371/journal.pone.0170527. eCollection 2017.

Abstract

Automatic identification of authorship in disputed documents has benefited from complex network theory, as this approach does not require human expertise or detailed semantic knowledge. Networks modeling entire books can be used to discriminate texts from different sources and to understand network growth mechanisms, but only a few studies have probed the suitability of networks for modeling small chunks of text to grasp stylistic features. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with an equal number of linguistic tokens, from which time series were created for 12 topological metrics. Since 73% of all series were stationary (ARIMA(p, 0, q)) and the remainder were integrated of first order (ARIMA(p, 1, q)), probability distributions could be obtained for the global network metrics. The metrics exhibit bell-shaped, non-Gaussian distributions, and therefore the moments of these distributions were used as learning attributes. With an optimized supervised learning procedure based on a nonlinear transformation performed by Isomap, 71 out of 80 texts were correctly classified using the K-nearest neighbors algorithm, i.e. an author matching success rate of 88.75% was achieved. Hence, purely dynamic fluctuations in network metrics can characterize authorship, paving the way for a robust description of large texts in terms of small evolving networks.
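As a rough illustration of the feature-extraction steps described in the abstract, the following Python sketch splits a token list into equal-sized sections, builds one word co-occurrence network per section, records a topological metric for each section as a time series, and summarizes that series by its distribution moments. It is a minimal sketch under stated assumptions, not the authors' implementation: linking only immediately adjacent words, using just two metrics (average degree and average clustering coefficient) rather than the 12 mentioned, using population moments, and all function names are illustrative choices that the abstract does not specify. The ARIMA stationarity analysis and the Isomap plus K-nearest neighbors classification are not shown.

```python
from collections import defaultdict
from statistics import fmean


def split_sections(tokens, size):
    """Split tokens into consecutive sections of exactly `size` tokens.
    A trailing remainder shorter than `size` is dropped (assumption)."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def cooccurrence_network(tokens):
    """Undirected network linking words that appear next to each other.
    Adjacent-word linking is an assumption made for this sketch."""
    adj = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:  # ignore self-loops from repeated words
            adj[a].add(b)
            adj[b].add(a)
    return adj


def average_degree(adj):
    """Mean number of neighbors per node (0.0 for an empty network)."""
    return fmean(len(nbrs) for nbrs in adj.values()) if adj else 0.0


def average_clustering(adj):
    """Mean local clustering coefficient over all nodes."""
    coeffs = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        nb = list(nbrs)
        # Count edges that exist between pairs of neighbors.
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nb[j] in adj[nb[i]]
        )
        coeffs.append(2.0 * links / (k * (k - 1)))
    return fmean(coeffs) if coeffs else 0.0


def metric_series(tokens, size, metric):
    """Time series of one metric, one value per equal-sized section."""
    return [metric(cooccurrence_network(s)) for s in split_sections(tokens, size)]


def moments(series):
    """Mean, population variance, skewness, and kurtosis of a series.
    Skewness and kurtosis are reported as 0.0 when the variance is zero."""
    n = len(series)
    mean = fmean(series)
    mu2 = sum((x - mean) ** 2 for x in series) / n
    if mu2 == 0:
        return mean, 0.0, 0.0, 0.0
    mu3 = sum((x - mean) ** 3 for x in series) / n
    mu4 = sum((x - mean) ** 4 for x in series) / n
    return mean, mu2, mu3 / mu2 ** 1.5, mu4 / mu2 ** 2


if __name__ == "__main__":
    text = "the cat saw the dog and the dog saw the cat near the old red barn"
    tokens = text.split()
    series = metric_series(tokens, size=4, metric=average_degree)
    print(series)
    print(moments(series))
```

In a pipeline of the kind the abstract outlines, the moments computed for each metric would be concatenated into one feature vector per text before any dimensionality reduction or classification.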


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3499/5268788/238bbf9824b3/pone.0170527.g001.jpg
