IEEE Trans Cybern. 2019 Jan;49(1):107-121. doi: 10.1109/TCYB.2017.2766189. Epub 2017 Nov 21.
Authorship analysis (AA) is the study of unveiling the hidden properties of authors from textual data. It extracts an author's identity and sociolinguistic characteristics from the writing style reflected in the text. The process is essential in areas such as cybercrime investigation, psycholinguistics, and political socialization. However, most previous techniques depend critically on manual feature engineering. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process with a neural network, we propose incorporating different categories of linguistic features into the distributed representation of words, so that writing style representations are learned simultaneously from unlabeled texts for AA. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate our approach on authorship characterization, authorship identification, and authorship verification using Twitter, blog, review, novel, and essay datasets. The experiments suggest that the proposed text representation outperforms static stylometrics, dynamic n-grams, latent Dirichlet allocation, latent semantic analysis, the distributed memory model of paragraph vectors, the distributed bag-of-words version of paragraph vectors, word2vec representations, and other baselines.
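To make the character-level n-gram baseline named in the abstract concrete, the following is a minimal sketch of a character n-gram style profile compared by cosine similarity, a common starting point for authorship verification. It is not the authors' model: the function names `char_ngrams` and `cosine`, the choice of n = 3, and the lowercasing step are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in the lowercased text.

    Hypothetical helper for illustration; n = 3 is an assumed default.
    """
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count profiles."""
    # Counter returns 0 for missing keys, so unshared n-grams add nothing.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile has no defined direction
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    known = char_ngrams("I reckon the weather will hold, I reckon it will.")
    disputed = char_ngrams("I reckon the train will be late, I reckon so.")
    print(f"similarity: {cosine(known, disputed):.3f}")
```

A profile like this is a fixed, hand-chosen feature set, which is the kind of manual feature engineering the abstract contrasts with representations learned from unlabeled text.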