

Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion.

Author Information

Wang Hang, Miller David

Affiliation

Electrical Engineering and Computer Science Department, The Pennsylvania State University, State College, PA 16802, USA.

Publication Information

Entropy (Basel). 2020 Mar 12;22(3):326. doi: 10.3390/e22030326.

Abstract

In a previous work, a parsimonious topic model (PTM) was proposed for text corpora. In that work, unlike latent Dirichlet allocation (LDA), the modeling determined a subset of salient words for each topic, each with a topic-specific probability, with the rest of the words in the dictionary explained by a universal shared model. Further, in LDA all topics are in principle present in every document. In contrast, PTM gives a sparse topic representation, determining the (small) subset of relevant topics for each document. A customized Bayesian information criterion (BIC) was derived, balancing model complexity and goodness of fit, with the BIC minimized to jointly determine the entire model (the topic-specific words, the document-specific topics, all model parameter values, and the total number of topics) in a wholly unsupervised fashion. In the present work, several important modeling and algorithm (parameter learning) extensions of PTM are proposed. First, we modify the BIC objective function using a lossless coding scheme with low modeling cost for describing words that are non-salient for topics; such words are essentially identified as wholly noisy/uninformative. This approach increases the PTM's model sparsity and also allows model selection of more topics, at lower BIC cost, than the original PTM. Second, in the original PTM learning strategy, word switches were updated sequentially, which is myopic and susceptible to poor locally optimal solutions. Here, instead, we jointly optimize all the switches that correspond to the same word (across topics). This approach jointly optimizes many more parameters at each step than the original PTM and, in principle, should be less susceptible to poor local minima. Results on several document data sets show that our proposed method outperformed the original PTM with respect to multiple performance measures, and gave a sparser topic model representation than the original PTM.
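For orientation, the standard BIC that the paper's customized criterion builds on trades off goodness of fit against model complexity. For a model with k free parameters fit to n observations, with maximized likelihood \hat{L},

\mathrm{BIC} = -2 \ln \hat{L} + k \ln n .

The abstract's customized variant (this is the general idea, not the paper's exact derivation) additionally accounts for the per-topic word switches and per-document topic switches, so that switching a word or topic off both removes parameters and reduces the description cost being minimized.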

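To make the joint switch update concrete, below is a minimal runnable sketch in Python. It is not the authors' algorithm: the objective is a toy stand-in for the customized BIC, and all names (toy_objective, joint_switch_update, penalty) are hypothetical. It illustrates choosing, for a single word, the best on/off switch configuration across all M topics in one joint step, rather than flipping one switch at a time.

import numpy as np
from itertools import product

# Illustrative sketch only (hypothetical names, not the authors' code).
# Each word has one binary "switch" per topic: 1 = the word gets a
# topic-specific probability, 0 = it is explained by the shared model.
# toy_objective stands in for the customized BIC: a goodness-of-fit
# term plus a complexity penalty per active switch.  In the real model
# the terms couple across topics (e.g., through renormalization), which
# is why a joint search over a word's 2**M configurations can escape
# local optima that myopic one-switch-at-a-time updates get stuck in.

def toy_objective(switches, topic_loglik, shared_loglik, penalty):
    fit = np.where(switches == 1, topic_loglik, shared_loglik).sum()
    return -2.0 * fit + penalty * switches.sum()

def joint_switch_update(topic_loglik, shared_loglik, penalty):
    """Jointly pick one word's best on/off configuration across all
    M topics by exhaustive search (2**M candidates; fine for small M)."""
    M = len(topic_loglik)
    best_cfg, best_cost = None, np.inf
    for cfg in product((0, 1), repeat=M):
        cfg = np.array(cfg)
        cost = toy_objective(cfg, topic_loglik, shared_loglik, penalty)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost

rng = np.random.default_rng(0)
M = 4                                  # number of topics (toy value)
topic_ll = rng.normal(-2.0, 1.0, M)    # per-topic log-lik if switch is on
shared_ll = np.full(M, -2.5)           # log-lik under the shared model
cfg, cost = joint_switch_update(topic_ll, shared_ll, penalty=1.0)
print("best switch configuration:", cfg, "cost:", cost)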

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7a14/7516783/caba4861cad1/entropy-22-00326-g001.jpg
