Cui Xuefeng, Lu Zhiwu, Wang Sheng, Wang Jim Jing-Yan, Gao Xin
King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia.
Beijing Key Laboratory of Big Data Management and Analysis Methods, School of Information, Renmin University of China, Beijing 100872, China.
Bioinformatics. 2016 Jun 15;32(12):i332-i340. doi: 10.1093/bioinformatics/btw271.
Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding protein functions. Despite advances in recent decades in sequence alignment, threading and alignment-free methods, protein homology detection remains a challenging open problem. Recently, network methods that search for transitive paths in the protein structure space have demonstrated the importance of incorporating network information from the structure space. However, current methods merge the sequence space and the structure space into a single space, which introduces inconsistency when different sources of information are combined.
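To illustrate what a transitive path in a similarity network means, the following minimal sketch scores a query against a distant protein through an intermediate one. It is not taken from any of the methods mentioned above: the function name `best_transitive_score`, the toy graph, and the choice to score a path by the product of its edge similarities (assumed to lie in (0, 1]) are all assumptions made for illustration.

```python
import heapq
import math


def best_transitive_score(graph, source, target):
    """Return the largest product of edge similarities over any path.

    Hypothetical helper: runs Dijkstra on -log(similarity), so the
    cheapest path is the one with the highest similarity product.
    """
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            return math.exp(-cost)
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, sim in graph.get(node, {}).items():
            new_cost = cost - math.log(sim)
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return 0.0  # target not reachable


# Toy network (made-up similarities): the query is only weakly similar
# to protein C directly, but strongly similar to A, which resembles C.
toy_graph = {
    "query": {"A": 0.9, "C": 0.2},
    "A": {"query": 0.9, "C": 0.8},
    "C": {"query": 0.2, "A": 0.8},
}
direct_score = toy_graph["query"]["C"]
transitive_score = best_transitive_score(toy_graph, "query", "C")
```

In this toy case the path through A scores higher than the direct edge, which is the kind of evidence a purely pairwise comparison would miss.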
We present a novel network-based protein homology detection method, CMsearch, based on cross-modal learning. Instead of exploring a single network built from the mixture of sequence and structure space information, CMsearch builds two separate networks to represent the sequence space and the structure space. It then learns sequence-structure correlation by simultaneously taking sequence information, structure information, sequence space information and structure space information into consideration.
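The idea of keeping two separate networks and learning a sequence-structure correlation across them can be sketched as follows. This is a generic label-propagation-style illustration, not the CMsearch objective or optimizer, which the abstract does not specify. The update rule, the parameter `alpha`, the function names, and all toy similarity values are assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def row_normalize(matrix):
    """Scale each row to sum to 1 (all-zero rows are left as zeros)."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([v / total if total else 0.0 for v in row])
    return result


def propagate(seq_sim, struct_sim, seed, alpha=0.5, iterations=50):
    """Spread seed sequence-structure matches over both networks.

    Assumed update: scores = alpha * W_seq @ scores @ W_struct
                             + (1 - alpha) * seed
    """
    w_seq = row_normalize(seq_sim)
    w_struct = row_normalize(struct_sim)
    scores = [row[:] for row in seed]
    for _ in range(iterations):
        spread = matmul(matmul(w_seq, scores), w_struct)
        scores = [
            [alpha * spread[i][j] + (1 - alpha) * seed[i][j]
             for j in range(len(seed[0]))]
            for i in range(len(seed))
        ]
    return scores


# Sequence network over [query, s1, s2]; the query resembles s1.
seq_sim = [[0.0, 0.9, 0.1],
           [0.9, 0.0, 0.2],
           [0.1, 0.2, 0.0]]
# Structure network over [t1, t2].
struct_sim = [[1.0, 0.3],
              [0.3, 1.0]]
# Known pairs: s1 has structure t1, s2 has structure t2; query unknown.
seed = [[0.0, 0.0],
        [1.0, 0.0],
        [0.0, 1.0]]

scores = propagate(seq_sim, struct_sim, seed)
query_scores = scores[0]  # correlation of the query with t1 and t2
```

Here the query ends up more strongly associated with t1 than with t2, because its sequence neighbor s1 is paired with t1. The two networks are never merged; they interact only through the score matrix.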
We tested CMsearch on two challenging tasks, protein homology detection and protein structure prediction, by querying all 8332 PDB40 proteins. Our results demonstrate that CMsearch is insensitive to the similarity metrics used to define the sequence and the structure spaces. By using HMM-HMM alignment as the sequence similarity metric, CMsearch clearly outperforms state-of-the-art homology detection methods and the CASP-winning template-based protein structure prediction methods.
Our program is freely available for download from http://sfb.kaust.edu.sa/Pages/Software.aspx
Supplementary data are available at Bioinformatics online.