Department of Computer Science and Engineering, IIT Madras, Chennai-600 036.
Bioinformatics. 2011 Jul 1;27(13):i61-8. doi: 10.1093/bioinformatics/btr249.
With protein structure databases expanding rapidly, efficiently retrieving structures similar to a given protein is an important problem. It involves two major issues: (i) an effective protein structure representation that captures the inherent relationships between fragments and facilitates efficient comparison between structures, and (ii) an effective framework that addresses different retrieval requirements. Recently, researchers proposed a vector space model of proteins using a bag-of-fragments representation (FragBag), which corresponds to the basic information retrieval model.
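To make the bag-of-fragments idea concrete, the following minimal sketch treats each protein as a vector of fragment counts and ranks database entries by cosine similarity to a query. It assumes every backbone segment has already been mapped to the index of its closest library fragment; the function names, the library size of 4, and the toy proteins are hypothetical and are not taken from FragBag itself.

```python
import math
from collections import Counter


def fragbag_vector(fragment_ids, library_size):
    """Count how often each library fragment occurs in one protein."""
    counts = Counter(fragment_ids)
    return [counts.get(i, 0) for i in range(library_size)]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, database):
    """Return database keys sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in database.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: fragment indices are assumed inputs, not real structures.
LIBRARY_SIZE = 4
database = {
    "protA": fragbag_vector([0, 0, 1, 2], LIBRARY_SIZE),
    "protB": fragbag_vector([3, 3, 3, 2], LIBRARY_SIZE),
}
query = fragbag_vector([0, 1, 1, 2], LIBRARY_SIZE)
ranking = rank_by_similarity(query, database)
```

Because the comparison reduces to operations on fixed-length vectors, a query can be scored against a large database without any pairwise structural alignment, which is what makes this representation attractive for retrieval.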
In this article, we propose an improved representation of protein structures using the latent Dirichlet allocation (LDA) topic model. Another important requirement is to retrieve proteins whether they are close or remote homologs. To meet these diverse objectives, we propose a multi-viewpoint based framework that combines multiple representations and retrieval techniques. We evaluate the proposed representation and retrieval framework on the benchmark dataset developed by Kolodny and co-workers. The results indicate that the proposed techniques outperform state-of-the-art methods.
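As an illustration of how a topic model can turn fragment counts into a compact representation, the sketch below runs generic LDA with collapsed Gibbs sampling over proteins expressed as lists of fragment indices. The abstract does not state the inference method, number of topics, or hyperparameters the authors used, so the sampler, `alpha`, `beta`, the iteration count, and the toy data are all assumptions chosen for readability.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Generic collapsed Gibbs sampler for LDA (an assumed setup, not the paper's).

    docs: list of proteins, each a list of fragment indices in [0, vocab_size).
    Returns one topic-proportion vector per protein.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]               # topic counts per protein
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # fragment counts per topic
    topic_total = [0] * n_topics                              # total fragments per topic
    assignments = []

    # Random initial topic assignment for every fragment occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed topic proportions: the low-dimensional protein representation.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(n_topics)])
    return theta


# Toy proteins over a hypothetical 4-fragment library.
docs = [[0, 0, 1, 1, 0], [1, 0, 0, 1], [2, 3, 3, 2, 3], [3, 2, 2, 3]]
theta = lda_gibbs(docs, n_topics=2, vocab_size=4)
```

Each row of `theta` sums to 1 and can be compared across proteins with any vector similarity measure, so the topic proportions can stand in for, or be combined with, the raw fragment counts in a retrieval pipeline.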
http://www.cse.iitm.ac.in/~ashishvt/research/protein-lda/.