Batmanghelich Kayhan, Saeedi Ardavan, Narasimhan Karthik, Gershman Sam
CSAIL, MIT.
Harvard University.
Proc Conf Assoc Comput Linguist Meet. 2016 Aug;2016:537-542. doi: 10.18653/v1/P16-2087.
Traditional topic models do not account for semantic regularities in language. Recent distributional representations of words exhibit semantic consistency under directional metrics such as cosine similarity. However, neither the categorical nor the Gaussian observation distributions used in existing topic models are well-suited to leveraging such correlations. In this paper, we propose to use the von Mises-Fisher distribution to model the density of words over a unit sphere. Such a representation is well-suited for directional data. We use a Hierarchical Dirichlet Process as our base topic model and propose an efficient inference algorithm based on Stochastic Variational Inference. This model enables us to naturally exploit the semantic structure of word embeddings while flexibly discovering the number of topics. Experiments demonstrate that our method outperforms competitive approaches in terms of topic coherence on two different text corpora while offering efficient inference.
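To make the modeling choice concrete, the following is a minimal sketch (not the authors' implementation) of the von Mises-Fisher log-density the abstract refers to, applied to word embeddings normalized onto the unit sphere. The function name `vmf_logpdf`, the synthetic random embeddings, and the concentration value `kappa=20.0` are illustrative assumptions; the density itself is the standard vMF form with normalizer computed via the modified Bessel function.

```python
import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind


def vmf_logpdf(x, mu, kappa):
    """Log-density of the von Mises-Fisher distribution on the unit sphere.

    x, mu: unit-norm vectors of dimension d; kappa: concentration (> 0).
    The density is C_d(kappa) * exp(kappa * mu.T @ x), with
    C_d(kappa) = kappa^(d/2 - 1) / ((2*pi)^(d/2) * I_{d/2-1}(kappa)).
    """
    d = len(mu)
    log_norm = ((d / 2 - 1) * np.log(kappa)
                - (d / 2) * np.log(2 * np.pi)
                - np.log(iv(d / 2 - 1, kappa)))
    return log_norm + kappa * np.dot(mu, x)


# Project (synthetic) word embeddings onto the unit sphere, as the model assumes.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 50))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# A topic is a vMF component: mean direction on the sphere plus a concentration.
topic_mean = emb.mean(axis=0)
topic_mean /= np.linalg.norm(topic_mean)

# Words closer in cosine similarity to the topic mean get higher likelihood,
# which is exactly the directional structure the abstract argues categorical
# and Gaussian observation models fail to exploit.
scores = [vmf_logpdf(w, topic_mean, kappa=20.0) for w in emb]
```

Because `mu.T @ x` equals cosine similarity on the unit sphere, the vMF likelihood rewards exactly the directional agreement that cosine-based embedding metrics measure.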