Koltcov Sergei, Ignatenko Vera
Laboratory for Social and Cognitive Informatics, National Research University Higher School of Economics, 55/2 Sedova St., 192148 St. Petersburg, Russia.
Entropy (Basel). 2020 May 16;22(5):556. doi: 10.3390/e22050556.
In practice, building a machine learning model of big data requires tuning model parameters. Parameter tuning typically involves an extremely time-consuming and computationally expensive grid search. However, the theory of statistical physics provides techniques that allow us to optimize this process. This paper shows that a function of the output of topic modeling exhibits self-similar behavior under variation of the number of clusters. Such behavior permits the use of a renormalization technique. Combining the renormalization procedure with the Renyi entropy approach enables a fast search for the optimal number of topics. In this paper, the renormalization procedure is developed for probabilistic Latent Semantic Analysis (pLSA), the Latent Dirichlet Allocation model with a variational Expectation-Maximization algorithm (VLDA), and the Latent Dirichlet Allocation model with a granulated Gibbs sampling procedure (GLDA). Experiments were conducted on two test datasets in two different languages with known numbers of topics, and on one unlabeled test dataset with an unknown number of topics. The paper shows that the renormalization procedure finds an approximation of the optimal number of topics at least 30 times faster than grid search, without significant loss of quality.
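The abstract's core quantity is the Renyi entropy computed on topic-model output as the number of topics varies; the optimal topic number corresponds to a minimum of this entropy curve. The paper's actual formulation is a free-energy-based functional over the topic-word distributions, which is more involved; the sketch below only shows the generic Renyi entropy of order q for a discrete distribution, with the function name and interface being illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def renyi_entropy(p, q=2.0):
    """Renyi entropy of order q for a discrete probability vector p.

    For q -> 1 this reduces to the Shannon entropy; for a uniform
    distribution over n outcomes it equals log(n) for any q.
    """
    p = np.asarray(p, dtype=float)
    p = p[p > 0]          # drop zero-probability entries
    p = p / p.sum()       # renormalize defensively
    if np.isclose(q, 1.0):
        return -np.sum(p * np.log(p))       # Shannon limit
    return np.log(np.sum(p ** q)) / (1.0 - q)
```

In a topic-number search, one would evaluate such an entropy on the model's word-probability output for each candidate number of topics and pick the minimum; the renormalization procedure described in the abstract avoids refitting the model at every candidate count by successively merging topics, which is what yields the reported speedup over grid search.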