Institute of Biomedical Engineering, National Taiwan University, Taipei, Taiwan, Republic of China.
BMC Genomics. 2009 Dec 3;10 Suppl 3(Suppl 3):S22. doi: 10.1186/1471-2164-10-S3-S22.
Proteins are dynamic macromolecules that may undergo conformational transitions when their environment changes. Because laboratory observations have shown that protein flexibility is correlated with essential biological functions, researchers have designed various types of predictors for identifying structurally flexible regions in proteins. These predictors fall into two major categories. One category identifies conformationally flexible regions by analyzing protein tertiary structures. The other relies entirely on analysis of the polypeptide sequence. Because the availability of protein tertiary structures is generally limited, predictors that work from sequence information alone are crucial for advancing molecular biology research.
In this article, we propose a novel approach to designing a sequence-based predictor for identifying conformationally ambivalent regions in proteins. The novelty of the design lies in combining two classifiers, each based on a distinct supervised learning algorithm, that provide complementary prediction power. Experimental results show that the overall performance of the proposed hybrid predictor is superior to that of existing predictors. Furthermore, the case study presented in this article demonstrates that the hybrid predictor can give biologists valuable clues about the functional sites in a protein chain. The hybrid predictor offers users two optional modes: a high-sensitivity mode and a high-specificity mode. Experiments with an independent testing data set show that the hybrid predictor delivers a sensitivity of 0.710 and a specificity of 0.608 in the high-sensitivity mode, and a sensitivity of 0.451 and a specificity of 0.787 in the high-specificity mode.
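The trade-off between the two modes can be illustrated with a minimal sketch. The abstract does not state how the two classifiers are combined, so the rule used here is an assumption chosen purely for illustration: the high-sensitivity mode flags a residue if either classifier flags it (logical OR), and the high-specificity mode flags it only if both do (logical AND). The function names `combine` and `sensitivity_specificity` and the toy labels are hypothetical and are not taken from the article.

```python
from typing import List, Sequence, Tuple


def combine(pred_a: Sequence[int], pred_b: Sequence[int], mode: str) -> List[int]:
    """Merge per-residue binary predictions from two classifiers.

    ASSUMPTION: OR for the high-sensitivity mode and AND for the
    high-specificity mode. The article's actual rule may differ.
    """
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction vectors must have equal length")
    if mode == "high-sensitivity":
        return [int(a == 1 or b == 1) for a, b in zip(pred_a, pred_b)]
    if mode == "high-specificity":
        return [int(a == 1 and b == 1) for a, b in zip(pred_a, pred_b)]
    raise ValueError(f"unknown mode: {mode!r}")


def sensitivity_specificity(truth: Sequence[int], pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels (1 = ambivalent)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy per-residue labels, invented for illustration only.
    truth = [1, 1, 1, 0, 0, 0]
    clf_a = [1, 1, 0, 1, 0, 0]
    clf_b = [1, 0, 1, 0, 0, 1]
    for mode in ("high-sensitivity", "high-specificity"):
        sens, spec = sensitivity_specificity(truth, combine(clf_a, clf_b, mode))
        print(f"{mode}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Under this assumed rule, the OR combination can only add positive calls, so it raises sensitivity at the cost of specificity, and the AND combination does the reverse. This mirrors the direction of the trade-off reported for the two modes, though not necessarily the mechanism the authors used.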
Although experimental results show that the hybrid approach, which exploits the complementary prediction power of distinct supervised learning algorithms, works more effectively than conventional approaches, there is still considerable room for improvement in the performance achieved. In this respect, it is of interest to investigate the effect of exploiting additional physicochemical properties that are related to conformational ambivalence. It is also of interest to investigate the effect of incorporating recently developed machine learning approaches, e.g. the random forest design and the multi-stage design. Because conformational transition plays a key role in several essential types of biological function, the design of more advanced predictors for identifying conformationally ambivalent regions in proteins deserves continued attention.