Aso Tatsuya, Eguchi Koji
Department of Computer Science and Systems Engineering, Kobe University, 1-1 Rokkoudai, Nada-ku, Kobe 657-8501, Japan.
Genome Inform. 2009 Oct;23(1):3-12.
This paper investigates the application of statistical topic models to extracting and predicting relationships between biological entities, especially protein mentions. One such model, Latent Dirichlet Allocation (LDA), is promising but has not previously been investigated for this task. We estimate the LDA model with two inference methods: the state-of-the-art Collapsed Variational Bayesian inference and Gibbs sampling. We also apply probabilistic Latent Semantic Analysis (pLSA) as a baseline, and compare the three approaches in terms of log-likelihood, classification accuracy, and retrieval effectiveness. Our experiments demonstrate that Collapsed Variational LDA gives better results than the others, especially in classification accuracy and retrieval effectiveness on the protein-protein relationship prediction task.
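As a rough illustration of one of the inference methods named in the abstract, the following is a minimal sketch of collapsed Gibbs sampling for LDA. It is not the authors' implementation. The function name `lda_gibbs`, the hyperparameter defaults, and the toy corpus are all assumptions made for illustration, and the abstract does not say how protein mentions were encoded as documents.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids.

    Hypothetical helper: names and default hyperparameters are illustrative.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # tokens per (topic, word)
    topic_total = [0] * num_topics                               # tokens per topic
    assign = []                                                  # topic of every token

    # Start from a random topic assignment and build the count tables.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return theta, topic_word

# Toy corpus (hypothetical): each document is a list of entity ids.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0]]
theta, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=4, iters=50)
```

Each row of `theta` is a per-document topic distribution. Such a vector could serve as a low-dimensional feature for classification or retrieval, which is one plausible reading of how topic models support the evaluation criteria listed in the abstract.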