Koltcov Sergei, Ignatenko Vera, Koltsova Olessia
St. Petersburg School of Physics, Mathematics, and Computer Science, National Research University Higher School of Economics, Kantemirovskaya Ulitsa, 3A, St. Petersburg 194100, Russia.
Entropy (Basel). 2019 Jul 5;21(7):660. doi: 10.3390/e21070660.
Topic modeling is a popular approach for clustering text documents. However, current tools suffer from a number of unsolved problems, such as instability and the lack of criteria for selecting the values of model parameters. In this work, we propose a method that partially solves the problem of optimizing model parameters while simultaneously accounting for semantic stability. Our method is inspired by concepts from statistical physics and is based on Sharma-Mittal entropy. We test our approach on two models, probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) with Gibbs sampling, and on two datasets in different languages. We compare our approach against a number of standard metrics, each of which can account for only one of the parameters of interest. We demonstrate that Sharma-Mittal entropy is a convenient tool for selecting both the number of topics and the values of hyper-parameters while simultaneously controlling for semantic stability, which none of the existing metrics can do. Furthermore, we show that concepts from statistical physics can contribute to theory construction for machine learning, a rapidly developing field that currently lacks a consistent theoretical foundation.
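To make the entropic selection criterion more concrete, the following is a minimal Python sketch of the two-parameter Sharma-Mittal entropy together with a hypothetical scan over candidate topic numbers. It uses scikit-learn's variational LDA as a stand-in for the pLSA and Gibbs-sampling LDA models studied in the paper; the particular values of q and r, the normalization of the full topic-word matrix into a single probability distribution, and the idea of inspecting the resulting entropy curve are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def sharma_mittal_entropy(p, q, r, eps=1e-12):
    # Two-parameter Sharma-Mittal entropy of a discrete distribution p:
    #   S_{q,r}(p) = ((sum_i p_i**q) ** ((1-r)/(1-q)) - 1) / (1 - r)
    # It recovers the Renyi entropy as r -> 1 and the Tsallis entropy as r -> q.
    p = np.asarray(p, dtype=float)
    p = p[p > eps]                        # drop zero-probability entries
    s = np.sum(p ** q)
    return (s ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)

def scan_topic_numbers(doc_term_matrix, topic_range, q=2.0, r=0.5):
    # Fit one LDA model per candidate number of topics and record the
    # Sharma-Mittal entropy of the flattened topic-word distribution.
    # q, r, and the flattening step are illustrative choices.
    scores = {}
    for n_topics in topic_range:
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
        lda.fit(doc_term_matrix)
        # components_ holds unnormalized topic-word weights; normalize them
        # into a single distribution over (topic, word) pairs.
        phi = lda.components_ / lda.components_.sum()
        scores[n_topics] = sharma_mittal_entropy(phi.ravel(), q, r)
    return scores
```

Given a bag-of-words matrix (e.g., from CountVectorizer), scan_topic_numbers returns an entropy value per candidate topic number; in this sketch the curve would then be inspected for an extremum or plateau to choose the number of topics, with the same scan repeated over hyper-parameter values if desired.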