Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection.
Author information
Damoulas Theodoros, Girolami Mark A
Affiliation
Department of Computing Science, University of Glasgow, S. A. W. Building, G12 8QQ, UK.
Publication information
Bioinformatics. 2008 May 15;24(10):1264-70. doi: 10.1093/bioinformatics/btn112. Epub 2008 Mar 31.
MOTIVATION
The problems of protein fold recognition and remote homology detection have recently attracted a great deal of interest, as they represent challenging multi-feature multi-class problems on which modern pattern recognition methods achieve only modest levels of performance. As with many pattern recognition problems, multiple feature spaces or groups of attributes are available. These include global characteristics such as amino-acid composition (C), predicted secondary structure (S), hydrophobicity (H), van der Waals volume (V), polarity (P) and polarizability (Z), as well as attributes derived from local sequence alignment, such as Smith-Waterman scores. This raises the need for a classification method that can assess the contribution of these potentially heterogeneous object descriptors while using that information to improve predictive performance. To that end, we offer a single multi-class kernel machine that informatively combines the available feature groups and, as is demonstrated in this article, achieves state-of-the-art accuracy on the fold recognition problem. Furthermore, the proposed approach provides some insight by assessing the significance of recently introduced protein features and string kernels. The proposed method is well-founded within a Bayesian hierarchical framework, and a variational Bayes approximation is derived which keeps CPU processing times low.
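To make the idea of combining feature groups concrete, the following is a minimal sketch of one common way to build a composite kernel: a convex combination of one base kernel per feature group. It is an illustration only, not the authors' implementation. The function names (`rbf_kernel`, `combine_kernels`), the choice of an RBF base kernel, the random toy features, their dimensions, and the hand-picked weights are all assumptions; in the method described above, the contribution of each feature group is assessed within the Bayesian framework rather than fixed by hand.

```python
import numpy as np


def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq_dist = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def combine_kernels(kernels, weights):
    """Convex combination of base kernel matrices, one per feature group."""
    w = np.asarray(weights, dtype=float)
    if w.ndim != 1 or len(w) != len(kernels):
        raise ValueError("need exactly one weight per kernel")
    if np.any(w < 0) or w.sum() <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    w = w / w.sum()  # normalise so the weights sum to one
    return sum(wi * K for wi, K in zip(w, kernels))


# Toy data: random stand-ins for three feature groups (dimensions are arbitrary).
rng = np.random.default_rng(0)
n_proteins = 6
feature_groups = {
    "C": rng.normal(size=(n_proteins, 20)),
    "S": rng.normal(size=(n_proteins, 21)),
    "H": rng.normal(size=(n_proteins, 21)),
}

# One base kernel per feature group, then a single composite kernel.
base_kernels = [rbf_kernel(X, X, gamma=1.0 / X.shape[1]) for X in feature_groups.values()]
K = combine_kernels(base_kernels, weights=[0.5, 0.3, 0.2])  # hypothetical weights
```

Because each base kernel is positive semi-definite and the weights are non-negative, the composite matrix `K` is itself a valid kernel and can be passed to any kernel classifier. Inspecting the weights is what gives a reading of how much each feature group contributes.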
RESULTS
The best performance we report on the SCOP PDB-40D benchmark dataset is 70% accuracy, obtained by combining all the available feature groups: the global protein characteristics together with the sequence-alignment features. This is an 8% improvement on the best reported performance, which combines multi-class k-NN classifiers, while at the same time reducing computational costs and assessing the predictive power of the various available features. Furthermore, we examine the performance of our methodology on the SCOP 1.53 benchmark dataset, which simulates remote homology detection, and examine the combination of various recently proposed state-of-the-art string kernels.