


Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion.

Authors

Wang Hang, Miller David

Affiliations

Electrical Engineering and Computer Science Department, The Pennsylvania State University, State College, PA 16802, USA.

Publication

Entropy (Basel). 2020 Mar 12;22(3):326. doi: 10.3390/e22030326.

DOI: 10.3390/e22030326
PMID: 33286100
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7516783/
Abstract

In a previous work, a parsimonious topic model (PTM) was proposed for text corpora. In that work, unlike LDA, the modeling determined a subset of salient words for each topic, with topic-specific probabilities, with the rest of the words in the dictionary explained by a universal shared model. Further, in LDA all topics are in principle present in every document. In contrast, PTM gives a sparse topic representation, determining the (small) subset of relevant topics for each document. A customized Bayesian information criterion (BIC) was derived, balancing model complexity and goodness of fit, with the BIC minimized to jointly determine the entire model (the topic-specific words, document-specific topics, all model parameter values, and the total number of topics) in a wholly unsupervised fashion. In the present work, several important modeling and algorithm (parameter learning) extensions of PTM are proposed. First, we modify the BIC objective function using a lossless coding scheme with low modeling cost for describing words that are non-salient for topics; such words are essentially identified as wholly noisy/uninformative. This approach increases the PTM's model sparsity, which also allows model selection of more topics and with lower BIC cost than the original PTM. Second, in the original PTM model learning strategy, word switches were updated sequentially, which is myopic and susceptible to finding poor locally optimal solutions. Here, instead, we jointly optimize all the switches that correspond to the same word (across topics). This approach jointly optimizes many more parameters at each step than the original PTM, which in principle should be less susceptible to finding poor local minima. Results on several document data sets show that our proposed method outperformed the original PTM model with respect to multiple performance measures, and gave a sparser topic model representation than the original PTM.
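
To make the selection principle in the abstract concrete, the following is a minimal Python sketch, not the authors' code, of BIC-driven topic-count selection: candidate models with increasing numbers of topics are fit, each is scored by a description-length-style BIC (a penalty on free parameters traded off against log-likelihood), and the minimizer is kept. The fit_fn argument and the model attributes it returns are hypothetical stand-ins for the paper's PTM learning algorithm, and the textbook BIC form used here stands in for the paper's customized criterion.

import math

def bic_cost(log_likelihood, n_free_params, n_obs):
    # Textbook BIC written as a description length to be minimized:
    # (k/2) * ln(n) model cost, traded off against goodness of fit.
    # The paper derives a customized variant of this objective.
    return 0.5 * n_free_params * math.log(n_obs) - log_likelihood

def select_model(fit_fn, corpus, n_obs, max_topics):
    # Fit candidate models with 1..max_topics topics and keep the BIC
    # minimizer. fit_fn is a hypothetical stand-in for the paper's PTM
    # learning algorithm; it is assumed to return an object exposing
    # .log_likelihood and .n_free_params, where only switched-on
    # (topic-salient) parameters count toward the complexity penalty.
    best_cost, best_model = math.inf, None
    for m in range(1, max_topics + 1):
        model = fit_fn(corpus, num_topics=m)
        cost = bic_cost(model.log_likelihood, model.n_free_params, n_obs)
        if cost < best_cost:
            best_cost, best_model = cost, model
    return best_model

Under this view, the paper's two extensions act on the two terms of the cost: the lossless coding scheme lowers the description cost of non-salient words (shrinking the penalty and permitting more topics), while jointly optimizing all switches for a word across topics changes how the likelihood is ascended, not the objective itself.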


Figures

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7a14/7516783/caba4861cad1/entropy-22-00326-g001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7a14/7516783/7c846244c6f6/entropy-22-00326-g002.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7a14/7516783/2fe5f1c0d0af/entropy-22-00326-g003.jpg

Similar articles

1. Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion. Entropy (Basel). 2020 Mar 12;22(3):326. doi: 10.3390/e22030326.
2. Identifying Objective and Subjective Words via Topic Modeling. IEEE Trans Neural Netw Learn Syst. 2018 Mar;29(3):718-730. doi: 10.1109/TNNLS.2016.2626379. Epub 2017 Jan 17.
3. Weighted Joint Sentiment-Topic Model for Sentiment Analysis Compared to ALGA: Adaptive Lexicon Learning Using Genetic Algorithm. Comput Intell Neurosci. 2022 Jul 31;2022:7612276. doi: 10.1155/2022/7612276. eCollection 2022.
4. Knowledge-Based Topic Model for Unsupervised Object Discovery and Localization. IEEE Trans Image Process. 2018;27(1):50-63. doi: 10.1109/TIP.2017.2718667.
5. Variations on a theme: Topic modeling of naturalistic driving data. Proc Hum Factors Ergon Soc Annu Meet. 2014 Sep;58(1):2107-2111. doi: 10.1177/1541931214581443.
6. Using topic-noise models to generate domain-specific topics across data sources. Knowl Inf Syst. 2023;65(5):2159-2186. doi: 10.1007/s10115-022-01805-2. Epub 2023 Jan 16.
7. Link-topic model for biomedical abbreviation disambiguation. J Biomed Inform. 2015 Feb;53:367-80. doi: 10.1016/j.jbi.2014.12.013. Epub 2014 Dec 30.
8. Using phrases and document metadata to improve topic modeling of clinical reports. J Biomed Inform. 2016 Jun;61:260-6. doi: 10.1016/j.jbi.2016.04.005. Epub 2016 Apr 21.
9. Anchor-Free Correlated Topic Modeling. IEEE Trans Pattern Anal Mach Intell. 2019 May;41(5):1056-1071. doi: 10.1109/TPAMI.2018.2827377. Epub 2018 Apr 16.
10. TextNetTopics: Text Classification Based Word Grouping as Topics and Topics' Scoring. Front Genet. 2022 Jun 20;13:893378. doi: 10.3389/fgene.2022.893378. eCollection 2022.
