Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi Entropy.

Authors

Koltcov Sergei, Ignatenko Vera, Boukhers Zeyd, Staab Steffen

Affiliations

National Research University Higher School of Economics, Soyuza Pechatnikov Street 16, 190121 St Petersburg, Russia.

Institute for Web Science and Technologies, Universität Koblenz-Landau, Universitätsstrasse 1, 56070 Koblenz, Germany.

Publication information

Entropy (Basel). 2020 Mar 30;22(4):394. doi: 10.3390/e22040394.

DOI:10.3390/e22040394
PMID:33286169
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7516868/
Abstract

Topic modeling is a popular technique for clustering large collections of text documents. A variety of regularization types are implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on the results of topic modeling. Based on Renyi entropy, this approach is inspired by concepts from statistical physics, where an inferred topical structure of a collection can be considered an information statistical system residing in a non-equilibrium state. By testing our approach on four models, namely Probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, and LDA with variational inference (VLDA), we first show that the minimum of Renyi entropy coincides with the "true" number of topics, as determined in two labelled collections. At the same time, we find that the Hierarchical Dirichlet Process (HDP) model, a well-known approach for topic number optimization, fails to detect this optimum. Next, we demonstrate that large values of the regularization coefficient in BigARTM significantly shift the minimum of entropy away from the topic number optimum, an effect that is not observed for hyper-parameters in LDA with Gibbs sampling. We conclude that regularization may introduce unpredictable distortions into topic models, which need further research.
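
For orientation, the standard Rényi entropy of order α for a discrete distribution p is H_α(p) = ln(Σ p_i^α) / (1 − α), with the Shannon entropy as the limit α → 1. The sketch below computes that quantity and then picks the topic number at the minimum of an entropy curve, which is the selection rule the abstract describes. It is a minimal illustration, not the authors' implementation: the abstract does not state how the entropy of a fitted topic model is computed, the function names `renyi_entropy` and `argmin_topics` are hypothetical, and the curve values are made-up placeholders rather than results from the paper.

```python
import math


def renyi_entropy(p, alpha):
    """Standard Renyi entropy of order alpha (natural log) of a discrete distribution.

    p may be unnormalized; it is rescaled to sum to 1 and zero entries are ignored.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    total = sum(p)
    probs = [x / total for x in p if x > 0]
    if abs(alpha - 1.0) < 1e-12:
        # The limit alpha -> 1 is the Shannon entropy.
        return -sum(x * math.log(x) for x in probs)
    return math.log(sum(x ** alpha for x in probs)) / (1.0 - alpha)


def argmin_topics(entropy_by_topics):
    """Return the candidate topic number whose entropy value is smallest."""
    return min(entropy_by_topics, key=entropy_by_topics.get)


# Placeholder curve (NOT data from the paper): topic number -> entropy value.
curve = {5: 2.1, 10: 1.4, 20: 1.9}
best = argmin_topics(curve)  # 10, the key with the smallest value
```

In practice the dictionary would be filled by fitting one model per candidate topic number and evaluating the chosen entropy statistic on each fitted model; how that statistic is defined for pLSA, BigARTM, LDA, or VLDA is left to the full paper.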

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/99a2/7516868/68685f5d9d7f/entropy-22-00394-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/99a2/7516868/0fff6012b37d/entropy-22-00394-g006.jpg
