Beyond the topics: how deep learning can improve the discriminability of probabilistic topic modelling.

Authors

Al Moubayed Noura, McGough Stephen, Awwad Shiekh Hasan Bashar

Affiliations

Department of Computer Science, Durham University, Durham, UK.

Department of Computer Science, University of Newcastle upon Tyne, Newcastle, UK.

Publication information

PeerJ Comput Sci. 2020 Jan 27;6:e252. doi: 10.7717/peerj-cs.252. eCollection 2020.

DOI:10.7717/peerj-cs.252

PMID:33816904

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7924555/

Abstract

This article presents a discriminative approach that complements the unsupervised, probabilistic nature of topic modelling. The framework transforms the per-document topic probabilities into class-dependent deep learning models that extract highly discriminative features suitable for classification. The framework is then applied to sentiment analysis with minimal feature engineering. The approach moves the sentiment analysis problem from the word/document domain to the topic domain, making it more robust to noise and incorporating complex contextual information that is not otherwise represented. A stacked denoising autoencoder (SDA) is then used to model the complex relationships among the topics for each sentiment with minimal assumptions. To achieve this, a distinct topic model and SDA are built per sentiment polarity, with an additional decision layer for classification. The framework is tested on a comprehensive collection of benchmark datasets that vary in sample size, class bias and classification task. A significant improvement over the state of the art is achieved without the need for sentiment lexica or over-engineered features. A further analysis is carried out to explain the observed improvement in accuracy.
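The class-dependent idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation, and it simplifies heavily. The topic proportions are synthetic Dirichlet draws in one shared 8-topic space, where the paper fits a separate topic model per polarity. Each class gets a single-layer denoising autoencoder rather than a stacked one. The decision layer is replaced by a plain comparison of reconstruction errors. All names (`DenoisingAutoencoder`, `alpha_pos`, `predict`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class DenoisingAutoencoder:
    """Single-hidden-layer denoising autoencoder with tied weights.

    Assumption: a one-layer stand-in for the stacked model (SDA) named in the abstract.
    """

    def __init__(self, n_visible, n_hidden, rng):
        self.W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
        self.b_h = np.zeros(n_hidden)
        self.b_v = np.zeros(n_visible)
        self.rng = rng

    def encode(self, X):
        return sigmoid(X @ self.W + self.b_h)

    def decode(self, H):
        return sigmoid(H @ self.W.T + self.b_v)

    def fit(self, X, epochs=300, lr=1.0, corruption=0.2):
        n = X.shape[0]
        for _ in range(epochs):
            # Corrupt the input by zeroing a random subset of entries.
            mask = self.rng.random(X.shape) > corruption
            X_tilde = X * mask
            H = self.encode(X_tilde)
            R = self.decode(H)
            # Backpropagate squared error between reconstruction and the CLEAN input.
            d_out = (R - X) * R * (1.0 - R)
            d_hid = (d_out @ self.W) * H * (1.0 - H)
            # Tied weights: encoder and decoder gradients are summed.
            grad_W = X_tilde.T @ d_hid + d_out.T @ H
            self.W -= lr * grad_W / n
            self.b_h -= lr * d_hid.mean(axis=0)
            self.b_v -= lr * d_out.mean(axis=0)
        return self

    def reconstruction_error(self, X):
        R = self.decode(self.encode(X))
        return ((R - X) ** 2).sum(axis=1)


# Synthetic per-document topic proportions (assumption: toy data, not from the paper).
alpha_pos = np.array([5.0, 5.0, 5.0, 5.0, 0.5, 0.5, 0.5, 0.5])
alpha_neg = alpha_pos[::-1].copy()
train_pos = rng.dirichlet(alpha_pos, size=300)
train_neg = rng.dirichlet(alpha_neg, size=300)

# One model per sentiment polarity, each trained only on its own class.
dae_pos = DenoisingAutoencoder(8, 4, rng).fit(train_pos)
dae_neg = DenoisingAutoencoder(8, 4, rng).fit(train_neg)


def predict(X):
    """Label 1 (positive) when the positive-class model reconstructs X better."""
    return (dae_pos.reconstruction_error(X) < dae_neg.reconstruction_error(X)).astype(int)


test_X = np.vstack([rng.dirichlet(alpha_pos, size=100), rng.dirichlet(alpha_neg, size=100)])
test_y = np.array([1] * 100 + [0] * 100)
predictions = predict(test_X)
accuracy = float((predictions == test_y).mean())
print(f"accuracy on synthetic topic proportions: {accuracy:.2f}")
```

The sketch only shows why class-specific models over topic proportions can be discriminative: a model trained on one polarity reconstructs documents of the other polarity poorly. The framework described in the abstract instead feeds the SDA features into a learned decision layer.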
