
Anchor-Free Correlated Topic Modeling.

Authors

Fu Xiao, Huang Kejun, Sidiropoulos Nicholas D, Shi Qingjiang, Hong Mingyi

Publication information

IEEE Trans Pattern Anal Mach Intell. 2019 May;41(5):1056-1071. doi: 10.1109/TPAMI.2018.2827377. Epub 2018 Apr 16.

DOI:10.1109/TPAMI.2018.2827377
PMID:29993625
Abstract

In topic modeling, identifiability of the topics is an essential issue. Many topic modeling approaches have been developed under the premise that each topic has a characteristic anchor word that only appears in that topic. The anchor-word assumption is fragile in practice, because words and terms have multiple uses; yet it is commonly adopted because it enables identifiability guarantees. Remedies in the literature include using three- or higher-order word co-occurrence statistics to come up with tensor factorization models, but such statistics need many more samples to obtain reliable estimates, and identifiability still hinges on additional assumptions, such as consecutive words being persistently drawn from the same topic. In this work, we propose a new topic identification criterion using second-order statistics of the words. The criterion is theoretically guaranteed to identify the underlying topics even when the anchor-word assumption is grossly violated. An algorithm based on alternating optimization and an efficient primal-dual algorithm are proposed to handle the resulting identification problem. The former exhibits high performance and is completely parameter-free; the latter affords up to 200 times speedup relative to the former, but requires step-size tuning and sacrifices a little accuracy. A variety of real text corpora are employed to showcase the effectiveness of the approach, where the proposed anchor-free method demonstrates substantial improvements compared to a number of anchor-word based approaches under various evaluation metrics.
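
The two ingredients the abstract names, second-order word statistics and the anchor-word assumption, can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names, the plain averaging of per-document outer products, and the toy matrices are assumptions made only for illustration, and the abstract does not specify which estimator the paper actually uses.

```python
from typing import List


def cooccurrence(doc_word_counts: List[List[int]]) -> List[List[float]]:
    """Second-order statistics: average outer product of per-document
    word-frequency vectors (an assumed, simple estimator)."""
    docs = [d for d in doc_word_counts if sum(d) > 0]  # skip empty documents
    vocab_size = len(docs[0])
    P = [[0.0] * vocab_size for _ in range(vocab_size)]
    for counts in docs:
        total = sum(counts)
        freqs = [c / total for c in counts]
        for i, fi in enumerate(freqs):
            for j, fj in enumerate(freqs):
                P[i][j] += fi * fj
    return [[v / len(docs) for v in row] for row in P]


def anchor_words(topic_word: List[List[float]]) -> List[List[int]]:
    """For each topic (row), list the words that have nonzero
    probability in that topic and in no other topic."""
    n_topics = len(topic_word)
    vocab_size = len(topic_word[0])
    anchors: List[List[int]] = [[] for _ in range(n_topics)]
    for w in range(vocab_size):
        owners = [k for k in range(n_topics) if topic_word[k][w] > 0]
        if len(owners) == 1:
            anchors[owners[0]].append(w)
    return anchors


def satisfies_anchor_assumption(topic_word: List[List[float]]) -> bool:
    """True only if every topic has at least one anchor word."""
    return all(anchor_words(topic_word))


# Hypothetical toy data: 2 documents over a 3-word vocabulary.
P = cooccurrence([[2, 0, 2], [0, 4, 0]])

# Hypothetical topic-word matrices (rows are topics, columns are words).
with_anchors = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
without_anchors = [[0.5, 0.5, 0.0], [0.5, 0.25, 0.25]]
```

In the second toy matrix, the first topic shares every one of its words with the second topic, so the anchor-word assumption fails. This is the kind of setting in which, according to the abstract, the proposed criterion is still guaranteed to identify the topics.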


Similar articles

1
Anchor-Free Correlated Topic Modeling.
IEEE Trans Pattern Anal Mach Intell. 2019 May;41(5):1056-1071. doi: 10.1109/TPAMI.2018.2827377. Epub 2018 Apr 16.
2
Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion.
Entropy (Basel). 2020 Mar 12;22(3):326. doi: 10.3390/e22030326.
3
Link-topic model for biomedical abbreviation disambiguation.
J Biomed Inform. 2015 Feb;53:367-80. doi: 10.1016/j.jbi.2014.12.013. Epub 2014 Dec 30.
4
Latent Topic Text Representation Learning on Statistical Manifolds.
IEEE Trans Neural Netw Learn Syst. 2018 Nov;29(11):5643-5654. doi: 10.1109/TNNLS.2018.2808332. Epub 2018 Mar 16.
5
Nonparametric Spherical Topic Modeling with Word Embeddings.
Proc Conf Assoc Comput Linguist Meet. 2016 Aug;2016:537-542. doi: 10.18653/v1/P16-2087.
6
Uncovering Flat and Hierarchical Topics by Community Discovery on Word Co-occurrence Network.
Data Sci Eng. 2024;9(1):41-61. doi: 10.1007/s41019-023-00239-2. Epub 2024 Mar 13.
7
Training Lp norm multiple kernel learning in the primal.
Neural Netw. 2013 Oct;46:172-82. doi: 10.1016/j.neunet.2013.05.003. Epub 2013 May 24.
8
Tracking word semantic change in biomedical literature.
Int J Med Inform. 2018 Jan;109:76-86. doi: 10.1016/j.ijmedinf.2017.11.006. Epub 2017 Nov 13.
9
Part 2. Development of Enhanced Statistical Methods for Assessing Health Effects Associated with an Unknown Number of Major Sources of Multiple Air Pollutants.
Res Rep Health Eff Inst. 2015 Jun(183 Pt 1-2):51-113.
10
Structural identifiability of cyclic graphical models of biological networks with latent variables.
BMC Syst Biol. 2016 Jun 13;10(1):41. doi: 10.1186/s12918-016-0287-y.

Cited by

1
An integrated clustering and BERT framework for improved topic modeling.
Int J Inf Technol. 2023;15(4):2187-2195. doi: 10.1007/s41870-023-01268-w. Epub 2023 May 6.