Koltcov Sergei, Ignatenko Vera, Boukhers Zeyd, Staab Steffen
National Research University Higher School of Economics, Soyuza Pechatnikov Street 16, 190121 St Petersburg, Russia.
Institute for Web Science and Technologies, Universität Koblenz-Landau, Universitätsstrasse 1, 56070 Koblenz, Germany.
Entropy (Basel). 2020 Mar 30;22(4):394. doi: 10.3390/e22040394.
Topic modeling is a popular technique for clustering large collections of text documents. A variety of regularization types is implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on the results of topic modeling. Based on Renyi entropy, this approach is inspired by concepts from statistical physics, where the inferred topical structure of a collection can be considered an information statistical system residing in a non-equilibrium state. By testing our approach on four models, namely Probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, and LDA with variational inference (VLDA), we first show that the minimum of Renyi entropy coincides with the "true" number of topics, as determined in two labelled collections. At the same time, we find that the Hierarchical Dirichlet Process (HDP) model, a well-known approach for optimizing the number of topics, fails to detect this optimum. Next, we demonstrate that large values of the regularization coefficient in BigARTM significantly shift the entropy minimum away from the optimal topic number, an effect that is not observed for the hyper-parameters of LDA with Gibbs sampling. We conclude that regularization may introduce unpredictable distortions into topic models, which warrants further research.
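The abstract only sketches the entropy-based selection procedure. The following minimal Python sketch illustrates the general recipe it describes: fit a topic model for a range of topic numbers, compute a Renyi-type entropy over the word-topic matrix, and take the topic number at the entropy minimum. The choice of order q = 1/T, the plain Renyi entropy formula used here, the illustrative corpus, and the use of scikit-learn's variational LDA are assumptions for illustration only; the paper derives its measure from a free-energy formulation with a 1/W probability threshold, and the exact equations should be taken from the paper itself.

```python
# Illustrative sketch only (not the authors' code or exact equations): fit a topic
# model for a range of topic numbers, compute a Renyi entropy of order q = 1/T over
# the word-topic distribution, and pick the T at the entropy minimum, which is the
# selection principle described in the abstract. The paper's own measure is built
# from a free-energy formulation with a 1/W probability threshold; a plain Renyi
# entropy over the flattened, renormalized word-topic matrix stands in for it here.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation  # variational LDA, akin to VLDA

def renyi_entropy_of_topics(phi, q):
    """Classical Renyi entropy S_q = ln(sum p^q) / (1 - q) of the word-topic matrix.

    phi: array of shape (T, W); each row is a topic's word distribution p(w|t).
    """
    p = phi.ravel() / phi.sum()          # treat all word-topic cells as one distribution
    p = p[p > 0]
    return np.log(np.sum(p ** q)) / (1.0 - q)

# Small public corpus as a stand-in for the labelled collections used in the paper.
texts = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]
X = CountVectorizer(max_features=5000, stop_words="english").fit_transform(texts)

scores = {}
for T in range(2, 31, 2):
    lda = LatentDirichletAllocation(n_components=T, random_state=0).fit(X)
    phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # (T, W)
    scores[T] = renyi_entropy_of_topics(phi, q=1.0 / T)   # q = 1/T, as in the authors' approach

best_T = min(scores, key=scores.get)     # topic number at the entropy minimum
print(sorted(scores.items()), "-> suggested number of topics:", best_T)
```

The same scan could be repeated while varying a regularization coefficient (e.g., in BigARTM) or the Dirichlet hyper-parameters (in Gibbs-sampling LDA) to observe how the location of the entropy minimum shifts, which is the effect the paper analyzes.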