Griffiths Thomas L, Steyvers Mark
Department of Psychology, Stanford University, Stanford, CA 94305, USA.
Proc Natl Acad Sci U S A. 2004 Apr 6;101(Suppl 1):5228-5235. doi: 10.1073/pnas.0307752101. Epub 2004 Feb 10.
A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying "hot topics" by examining temporal dynamics and tagging abstracts to illustrate semantic content.
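The generative process described in the abstract can be illustrated with a minimal sketch. It assumes symmetric Dirichlet priors on both the per-document topic distribution and the per-topic word distributions, as in the Blei, Ng, and Jordan model. The function names, the hyperparameter values `alpha` and `beta`, and the corpus sizes are hypothetical choices made for illustration and are not taken from the abstract. The sketch covers only document generation, not the Markov chain Monte Carlo inference step.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_length, num_topics, vocab_size,
                    alpha=1.0, beta=0.5, seed=0):
    """Generate documents as lists of word ids.

    alpha and beta are illustrative hyperparameters (assumed, not from the source).
    Returns (docs, assignments, topic_word).
    """
    rng = random.Random(seed)

    # Each topic is a probability distribution over the vocabulary.
    topic_word = [sample_dirichlet(beta, vocab_size, rng)
                  for _ in range(num_topics)]

    docs, assignments = [], []
    for _ in range(num_docs):
        # Step 1: choose this document's distribution over topics.
        theta = sample_dirichlet(alpha, num_topics, rng)

        # Step 2: for each word position, select a topic according to theta,
        # then choose a word from that topic's distribution.
        topics = rng.choices(range(num_topics), weights=theta, k=doc_length)
        words = [rng.choices(range(vocab_size), weights=topic_word[z], k=1)[0]
                 for z in topics]

        docs.append(words)
        assignments.append(topics)
    return docs, assignments, topic_word


if __name__ == "__main__":
    docs, assignments, _ = generate_corpus(num_docs=3, doc_length=10,
                                           num_topics=4, vocab_size=20)
    for words, topics in zip(docs, assignments):
        print("topics:", topics)
        print("words: ", words)
```

Inference runs this process in reverse: given only the word ids in `docs`, it estimates the hidden topic assignments and the topic distributions that most plausibly produced them.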