Authorship identification using ensemble learning.

Affiliations

Department of Creative Technologies, PAF Complex, E-9, Air University, Islamabad, Pakistan.

Department of Cyber Security, PAF Complex, E-9, Air University, Islamabad, Pakistan.

Publication information

Sci Rep. 2022 Jun 9;12(1):9537. doi: 10.1038/s41598-022-13690-4.

Abstract

The volume of textual data is growing rapidly, largely through the publication of articles, and the amount of anonymous content is growing with it. Researchers are therefore looking for alternative strategies to identify the author of an unknown text, and there is a need for a system that identifies the actual author of such a text from a given set of writing samples. This study presents a novel approach to authorship identification based on ensemble learning, DistilBERT, and conventional machine learning techniques. The proposed approach extracts an author's characteristic features using a count vectorizer and bi-gram term frequency-inverse document frequency (TF-IDF). An extensive and detailed dataset, "All the news", is used for experimentation. The dataset is divided into three subsets (article1, article2, and article3). We limit the scope of the dataset, selecting 10 authors in the first scope and 20 authors in the second. The proposed ensemble learning approach and DistilBERT provide better performance on all three subsets of the "All the news" dataset. In the first scope (10 authors), the experimental results on the article1 dataset show an accuracy gain of 3.14% for the proposed ensemble learning approach and 2.44% for DistilBERT. Similarly, in the second scope (20 authors), the results on the article1 dataset show an accuracy gain of 5.25% for the ensemble learning approach and 7.17% for DistilBERT, improving on previous state-of-the-art studies.
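As a rough illustration of two ideas named in the abstract, bi-gram TF-IDF features and ensemble prediction, the following minimal Python sketch may help. It is not the authors' implementation. The tokenization (lowercasing and whitespace splitting), the smoothed idf formula, the hard majority-vote rule, and all function names and example sentences are assumptions made for illustration; the abstract does not specify any of them.

```python
import math
from collections import Counter


def word_bigrams(text):
    """Lowercase, split on whitespace, and return adjacent word pairs."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def bigram_tfidf(documents):
    """Return one {bigram: weight} dict per document.

    Weight = raw term frequency * smoothed idf, where
    idf = ln((1 + n_docs) / (1 + doc_freq)) + 1.
    This weighting scheme is an assumption, not taken from the paper.
    """
    counts = [Counter(word_bigrams(doc)) for doc in documents]
    n_docs = len(documents)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each document counts once per bigram
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def majority_vote(predictions):
    """Hard-voting ensemble: return the label predicted most often.

    Ties go to the label that appears first in `predictions`.
    """
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical writing samples and per-classifier predictions.
docs = ["the market fell sharply today", "the market rose sharply today"]
weights = bigram_tfidf(docs)
winner = majority_vote(["author_a", "author_b", "author_a"])
```

In this sketch, a bigram shared by every document (such as "the market") receives a lower weight than one unique to a single document (such as "market fell"), which is what makes TF-IDF features useful for separating authors. The vote simply returns the label that most of the base classifiers agree on.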

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/30d2/9184563/d9a6c9b715ef/41598_2022_13690_Fig1_HTML.jpg
