

Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion.

Author Information

Wang Hang, Miller David

Affiliation

Electrical Engineering and Computer Science Department, The Pennsylvania State University, State College, PA 16802, USA.

Publication Information

Entropy (Basel). 2020 Mar 12;22(3):326. doi: 10.3390/e22030326.

Abstract

In a previous work, a parsimonious topic model (PTM) was proposed for text corpora. In that work, unlike latent Dirichlet allocation (LDA), the modeling determined a subset of salient words for each topic, each with a topic-specific probability, with the rest of the words in the dictionary explained by a universal shared model. Further, in LDA all topics are in principle present in every document. In contrast, PTM gives a sparse topic representation, determining the (small) subset of relevant topics for each document. A customized Bayesian information criterion (BIC) was derived, balancing model complexity and goodness of fit, with the BIC minimized to jointly determine the entire model (the topic-specific words, the document-specific topics, all model parameter values, and the total number of topics) in a wholly unsupervised fashion. In the present work, several important modeling and algorithm (parameter learning) extensions of PTM are proposed. First, we modify the BIC objective function using a lossless coding scheme with low modeling cost for describing words that are non-salient for topics; such words are essentially identified as wholly noisy/uninformative. This approach increases the PTM's model sparsity and also allows model selection of more topics, at lower BIC cost, than the original PTM. Second, in the original PTM learning strategy, word switches were updated sequentially, which is myopic and susceptible to poor locally optimal solutions. Here, instead, we jointly optimize all the switches that correspond to the same word (across topics). This approach jointly optimizes many more parameters at each step than the original PTM and, in principle, should be less susceptible to poor local minima. Results on several document data sets show that our proposed method outperformed the original PTM with respect to multiple performance measures, and gave a sparser topic model representation than the original PTM.
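For orientation, the standard BIC that the paper's customized criterion builds on trades off goodness of fit against model complexity. For a model with k free parameters fit to n observations, with maximized likelihood \hat{L},

\mathrm{BIC} = -2 \ln \hat{L} + k \ln n .

The abstract's customized variant (this is the general idea, not the paper's exact derivation) additionally accounts for the per-topic word switches and per-document topic switches, so that switching a word or topic off both removes parameters and reduces the description cost being minimized.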

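To make the joint switch update concrete, below is a minimal runnable sketch in Python. It is not the authors' algorithm: the objective is a toy stand-in for the customized BIC, and all names (toy_objective, joint_switch_update, penalty) are hypothetical. It illustrates choosing, for a single word, the best on/off switch configuration across all M topics in one joint step, rather than flipping one switch at a time.

import numpy as np
from itertools import product

# Illustrative sketch only (hypothetical names, not the authors' code).
# Each word has one binary "switch" per topic: 1 = the word gets a
# topic-specific probability, 0 = it is explained by the shared model.
# toy_objective stands in for the customized BIC: a goodness-of-fit
# term plus a complexity penalty per active switch.  In the real model
# the terms couple across topics (e.g., through renormalization), which
# is why a joint search over a word's 2**M configurations can escape
# local optima that myopic one-switch-at-a-time updates get stuck in.

def toy_objective(switches, topic_loglik, shared_loglik, penalty):
    fit = np.where(switches == 1, topic_loglik, shared_loglik).sum()
    return -2.0 * fit + penalty * switches.sum()

def joint_switch_update(topic_loglik, shared_loglik, penalty):
    """Jointly pick one word's best on/off configuration across all
    M topics by exhaustive search (2**M candidates; fine for small M)."""
    M = len(topic_loglik)
    best_cfg, best_cost = None, np.inf
    for cfg in product((0, 1), repeat=M):
        cfg = np.array(cfg)
        cost = toy_objective(cfg, topic_loglik, shared_loglik, penalty)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost

rng = np.random.default_rng(0)
M = 4                                  # number of topics (toy value)
topic_ll = rng.normal(-2.0, 1.0, M)    # per-topic log-lik if switch is on
shared_ll = np.full(M, -2.5)           # log-lik under the shared model
cfg, cost = joint_switch_update(topic_ll, shared_ll, penalty=1.0)
print("best switch configuration:", cfg, "cost:", cost)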

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7a14/7516783/caba4861cad1/entropy-22-00326-g001.jpg
