Jiang Rui, Yang Hua, Sun Fengzhu, Chen Ting
Molecular and Computational Biology, University of Southern California, MCB201, 1050 Childs Way, Los Angeles, CA 90089-2910, USA.
BMC Bioinformatics. 2006 Sep 19;7:417. doi: 10.1186/1471-2105-7-417.
Understanding how amino acid substitutions affect protein function is critical for the study of proteins and their roles in disease. Although methods have been developed for predicting the potential effects of amino acid substitutions from sequence, three-dimensional structural, and evolutionary properties of proteins, their application is limited by the complexity of the features and by the limited availability of protein structural information. Another limitation is that the prediction results are hard to interpret in terms of physicochemical principles and biological knowledge.
To overcome these limitations, we proposed a novel feature set using physicochemical properties of amino acids, evolutionary profiles of proteins, and protein sequence information. We applied the support vector machine and the random forest with this feature set to experimental amino acid substitutions occurring in the E. coli lac repressor and the bacteriophage T4 lysozyme, as well as to annotated amino acid substitutions occurring in a wide range of human proteins. The results showed that the proposed feature set was superior to the existing ones. To explore the physicochemical principles behind amino acid substitutions, we designed a simulated annealing bump hunting strategy to automatically extract interpretable rules for amino acid substitutions. We applied the strategy to annotated human amino acid substitutions and successfully extracted several rules that either were consistent with current biological knowledge or provided new insights into amino acid substitutions. When applied to unclassified data, these rules covered a large portion of the samples, and most of the covered samples showed good agreement with predictions made by either the support vector machine or the random forest.
The prediction methods using the proposed feature set achieve a larger AUC (area under the ROC curve), a smaller BER (balanced error rate), and a larger MCC (Matthews correlation coefficient) than those using the published feature sets, suggesting that our feature set is superior to the existing ones. The rules extracted by the simulated annealing bump hunting strategy have coverage and accuracy comparable to those extracted by the patient rule induction method (PRIM) but much better interpretability, indicating that the strategy is more effective in inducing interpretable rules.
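The three evaluation measures named above can be illustrated with a minimal Python sketch. It uses the commonly cited textbook definitions, which are assumed here because the abstract does not spell out the exact formulas used in the paper. The function names are hypothetical, and the label convention (1 for one class, 0 for the other) is an assumption made for illustration.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_error_rate(y_true, y_pred):
    """BER: average of the error rates on the positive and negative classes.

    Assumes both classes are present in y_true.
    """
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return 0.5 * (fn / (tp + fn) + fp / (tn + fp))


def matthews_corrcoef(y_true, y_pred):
    """MCC: correlation between true and predicted binary labels."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative.

    Ties count as one half. Assumes both classes are present in y_true.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

A larger AUC and MCC and a smaller BER indicate better discrimination. BER and MCC depend on the decision threshold applied to the classifier scores, whereas AUC summarizes performance over all thresholds.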