Lin Ken-Li, Lin Chun-Yuan, Huang Chuen-Der, Chang Hsiu-Ming, Yang Chiao-Yun, Lin Chin-Teng, Tang Chuan Yi, Hsu D Frank
Department of Electrical and Control Engineering, National Chiao-Tung University, Hsin-chu, Taiwan and Computer Center of Chung Hua University, Hsin-chu, Taiwan.
IEEE Trans Nanobioscience. 2007 Jun;6(2):186-96. doi: 10.1109/tnb.2007.897482.
The classification of protein structures is essential for determining protein function in bioinformatics. Reasonably high prediction accuracy has already been achieved in classifying proteins into the four classes of the SCOP database from their primary amino acid sequences. Further classification into fine-grained folding categories remains a challenge, however, especially when the number of possible folding patterns, such as those defined in the SCOP database, is large. In our previous work, we proposed a two-level classification strategy called hierarchical learning architecture (HLA), which uses neural networks and two indirect coding features to differentiate proteins according to their classes and folding patterns, and which achieved an accuracy rate of 65.5%. In this paper, we use a combinatorial fusion technique to facilitate feature selection and combination and thereby improve predictive accuracy in protein structure classification. When various combinatorial fusion criteria are applied to the protein fold prediction approach using neural networks with HLA and the radial basis function network (RBFN), the resulting classification has an overall prediction accuracy of 87% for the four classes and 69.6% for the 27 folding categories. These rates are significantly higher than the 56.5% accuracy previously obtained by Ding and Dubchak. Our results demonstrate that data fusion is a viable method for feature selection and combination in the prediction and classification of protein structure.
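The abstract does not spell out which fusion criteria were used, so the following is only a generic sketch of one common form of data fusion: combining the per-class outputs of several scoring systems (for example, classifiers trained on different features) either by averaging normalized scores or by averaging ranks. All function names and the example values are hypothetical illustrations, not the authors' implementation.

```python
from typing import List, Sequence


def normalize(scores: Sequence[float]) -> List[float]:
    """Min-max scale scores to [0, 1]; a constant vector maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ranks(scores: Sequence[float]) -> List[int]:
    """Rank 1 is the highest score; ties are broken by index order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        result[idx] = position
    return result


def score_combination(systems: Sequence[Sequence[float]]) -> List[float]:
    """Average the normalized scores of all systems, class by class."""
    normalized = [normalize(s) for s in systems]
    return [sum(col) / len(systems) for col in zip(*normalized)]


def rank_combination(systems: Sequence[Sequence[float]]) -> List[float]:
    """Average the ranks assigned by all systems, class by class."""
    ranked = [ranks(s) for s in systems]
    return [sum(col) / len(systems) for col in zip(*ranked)]


def predict_by_score(systems: Sequence[Sequence[float]]) -> int:
    """Predicted class index: highest combined score."""
    combined = score_combination(systems)
    return max(range(len(combined)), key=combined.__getitem__)


def predict_by_rank(systems: Sequence[Sequence[float]]) -> int:
    """Predicted class index: lowest (best) combined rank."""
    combined = rank_combination(systems)
    return min(range(len(combined)), key=combined.__getitem__)
```

As a usage illustration, given two hypothetical systems that score three candidate folds as `[0.2, 0.9, 0.4]` and `[0.5, 0.6, 0.1]`, both `predict_by_score` and `predict_by_rank` select the fold at index 1. Score and rank combination can disagree on other inputs, which is one reason a fusion approach might compare several criteria.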