IEEE Trans Pattern Anal Mach Intell. 2018 Mar;40(3):726-739. doi: 10.1109/TPAMI.2017.2682085. Epub 2017 Mar 15.
Supervised topic models simultaneously model the latent topic structure of large collections of documents and a response variable associated with each document. Existing inference methods are based on variational approximation or Monte Carlo sampling, which often suffer from local optima. Spectral methods have been applied to learn unsupervised topic models, such as latent Dirichlet allocation (LDA), with provable guarantees. This paper investigates the possibility of applying spectral methods to recover the parameters of supervised LDA (sLDA). We first present a two-stage spectral method, which recovers the parameters of LDA and then applies a power update method to recover the regression model parameters. We then present a single-phase spectral algorithm that jointly recovers the topic distribution matrix and the regression weights. Our spectral algorithms are provably correct and computationally efficient. We prove a sample complexity bound for each algorithm and subsequently derive a sufficient condition for the identifiability of sLDA. Thorough experiments on synthetic and real-world datasets verify the theory and demonstrate the practical effectiveness of the spectral algorithms. In fact, our results on a large-scale review rating dataset demonstrate that our single-phase spectral algorithm alone achieves performance comparable to or even better than state-of-the-art methods, whereas previous work on spectral methods has rarely reported such promising performance.
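The abstract does not spell out the power update step, so the following is only a generic illustration, not the paper's algorithm. It sketches the tensor power iteration with deflation that spectral methods for LDA commonly rely on, applied to a synthetic, exactly orthogonally decomposable third-order tensor. The function names (`tensor_apply`, `tensor_power_method`), the restart and iteration counts, and the toy tensor are all assumptions made for this example; a real pipeline would first estimate and whiten empirical moments from document word counts, which is omitted here.

```python
import numpy as np


def tensor_apply(T, u):
    """Return the vector T(I, u, u) for a symmetric k x k x k tensor T."""
    return np.einsum("ijk,j,k->i", T, u, u)


def tensor_power_method(T, n_components, n_restarts=10, n_iters=100, seed=0):
    """Recover eigenpairs of an orthogonally decomposable symmetric tensor.

    Hypothetical helper: runs power iteration from several random starts,
    keeps the candidate with the largest eigenvalue, then deflates.
    """
    rng = np.random.default_rng(seed)
    T = T.copy()
    k = T.shape[0]
    eigvals, eigvecs = [], []
    for _ in range(n_components):
        best_val, best_vec = -np.inf, None
        for _ in range(n_restarts):
            u = rng.standard_normal(k)
            u /= np.linalg.norm(u)
            for _ in range(n_iters):
                # Power update: u <- T(I, u, u) / ||T(I, u, u)||
                u = tensor_apply(T, u)
                u /= np.linalg.norm(u)
            val = float(u @ tensor_apply(T, u))  # eigenvalue T(u, u, u)
            if val > best_val:
                best_val, best_vec = val, u
        eigvals.append(best_val)
        eigvecs.append(best_vec)
        # Deflation: subtract the recovered rank-one component.
        T -= best_val * np.einsum("i,j,k->ijk", best_vec, best_vec, best_vec)
    return np.array(eigvals), np.array(eigvecs)


# Toy input (assumed): T = sum_r lams[r] * v_r (x) v_r (x) v_r,
# where the v_r are the orthonormal columns of V.
rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
lams = np.array([3.0, 2.0, 1.0])
T = np.einsum("r,ir,jr,kr->ijk", lams, V, V, V)

vals, vecs = tensor_power_method(T, n_components=3)
```

Each row of `vecs` should align with one column of `V`, and `vals` should contain the entries of `lams` in some order. In a topic-model setting the recovered eigenpairs would still need to be mapped back through the whitening transform to obtain topic distributions, a step this sketch leaves out.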