Anchor-Free Correlated Topic Modeling.

Authors

Fu Xiao, Huang Kejun, Sidiropoulos Nicholas D, Shi Qingjiang, Hong Mingyi

Publication information

IEEE Trans Pattern Anal Mach Intell. 2019 May;41(5):1056-1071. doi: 10.1109/TPAMI.2018.2827377. Epub 2018 Apr 16.

Abstract

In topic modeling, identifiability of the topics is an essential issue. Many topic modeling approaches have been developed under the premise that each topic has a characteristic anchor word that only appears in that topic. The anchor-word assumption is fragile in practice, because words and terms have multiple uses; yet it is commonly adopted because it enables identifiability guarantees. Remedies in the literature include using three- or higher-order word co-occurrence statistics to come up with tensor factorization models, but such statistics need many more samples to obtain reliable estimates, and identifiability still hinges on additional assumptions, such as consecutive words being persistently drawn from the same topic. In this work, we propose a new topic identification criterion using second-order statistics of the words. The criterion is theoretically guaranteed to identify the underlying topics even when the anchor-word assumption is grossly violated. An algorithm based on alternating optimization and an efficient primal-dual algorithm are proposed to handle the resulting identification problem. The former exhibits high performance and is completely parameter-free; the latter affords up to 200 times speedup relative to the former, but requires step-size tuning and a slight sacrifice in accuracy. A variety of real text corpora are employed to showcase the effectiveness of the approach, where the proposed anchor-free method demonstrates substantial improvements compared to a number of anchor-word based approaches under various evaluation metrics.
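
To make "second-order statistics of the words" concrete, the following is a minimal sketch of one simple way to estimate a word-word co-occurrence matrix from per-document word counts. It is an illustration only: the function name `cooccurrence_matrix`, the toy data, and the plain plug-in averaging are assumptions made here, not the estimator or the identification algorithm described in the paper.

```python
def cooccurrence_matrix(docs):
    """Estimate a V x V word co-occurrence matrix from word counts.

    docs: list of documents, each a list of V non-negative word counts.
    Each document is normalized to word frequencies, and the outer
    products of those frequency vectors are averaged over documents.
    (Hypothetical helper; a simple plug-in estimate, assumed for illustration.)
    """
    vocab_size = len(docs[0])
    num_docs = len(docs)
    matrix = [[0.0] * vocab_size for _ in range(vocab_size)]
    for counts in docs:
        total = sum(counts)
        if total == 0:
            raise ValueError("each document must contain at least one word")
        freqs = [c / total for c in counts]
        for i in range(vocab_size):
            for j in range(vocab_size):
                matrix[i][j] += freqs[i] * freqs[j] / num_docs
    return matrix


# Toy corpus: 3 documents over a 4-word vocabulary (made-up counts).
toy_docs = [
    [2, 1, 0, 1],
    [0, 3, 1, 0],
    [1, 0, 2, 2],
]
P = cooccurrence_matrix(toy_docs)
```

Because every normalized document sums to one, the entries of `P` sum to one and `P` is symmetric. Only pairwise statistics are involved, which is the reason the abstract gives for preferring them over third- or higher-order statistics: the latter need many more samples for reliable estimates.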

