Block Peter, Paern Juri, Hüllermeier Eyke, Sanschagrin Paul, Sotriffer Christoph A, Klebe Gerhard
Department of Pharmaceutical Chemistry, Philipps-University Marburg, Marburg, Germany.
Proteins. 2006 Nov 15;65(3):607-22. doi: 10.1002/prot.21104.
Analyzing protein-protein interactions at the atomic level is critical for understanding the principles that govern protein-protein recognition. For this purpose, descriptors that explain the nature of different protein-protein complexes are desirable. In this work, we introduce Epic Protein Interface Classification, a framework that handles the preparation, processing, and analysis of protein-protein complexes for classification with machine learning algorithms. We applied four machine learning algorithms (Support Vector Machines, C4.5 Decision Trees, K Nearest Neighbors, and Naïve Bayes) in combination with three feature selection methods (Filter (Relief F), Wrapper, and Genetic Algorithms) to extract discriminating features from the protein-protein complexes. To compare protein-protein complexes with each other, we represented the physicochemical characteristics of their interfaces in four different ways: two different atomic contact vectors, DrugScore pair potential vectors, and SFCscore descriptor vectors. We classified two datasets: (A) 172 protein-protein complexes comprising 96 monomers, forming contacts enforced by the crystallographic packing environment (crystal contacts), and 76 biologically functional homodimer complexes; (B) 345 protein-protein complexes comprising 147 permanent complexes and 198 transient complexes. We were able to classify up to 94.8% of the packing-enforced/functional complexes and up to 93.6% of the permanent/transient complexes correctly. Furthermore, we were able to extract relevant features from the different protein-protein complexes and introduce an approach for scoring the importance of the extracted features.
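As a rough illustration of the setup described in the abstract, the following Python sketch does two things. It computes the majority-class baseline for datasets A and B from the class sizes stated above, which gives a reference point for the reported 94.8% and 93.6%. It also shows how a k-nearest-neighbors vote over interface descriptor vectors might look. The function names, the Euclidean distance, the value of k, and the three-component toy vectors are all assumptions made for illustration; they are not taken from the paper or from the framework's actual code.

```python
from collections import Counter
from math import dist

# Class sizes as reported in the abstract.
DATASETS = {
    "A": {"crystal_contact": 96, "functional_homodimer": 76},
    "B": {"permanent": 147, "transient": 198},
}


def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the largest class."""
    return max(class_counts.values()) / sum(class_counts.values())


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors.

    `train` is a list of (vector, label) pairs. Euclidean distance is an
    assumption here; the abstract does not state the metric used.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy descriptor vectors (NOT data from the paper).
TOY_TRAIN = [
    ((0.9, 0.1, 0.2), "permanent"),
    ((0.8, 0.2, 0.1), "permanent"),
    ((0.7, 0.3, 0.2), "permanent"),
    ((0.2, 0.8, 0.7), "transient"),
    ((0.1, 0.9, 0.8), "transient"),
    ((0.3, 0.7, 0.9), "transient"),
]

if __name__ == "__main__":
    for name, counts in DATASETS.items():
        total = sum(counts.values())
        print(f"dataset {name}: n={total}, "
              f"majority baseline={majority_baseline(counts):.3f}")
    print(knn_predict(TOY_TRAIN, (0.85, 0.15, 0.15)))
```

Under these assumptions, always predicting the larger class would be correct for 96 of 172 complexes in dataset A and 198 of 345 in dataset B, so the reported accuracies are well above what class imbalance alone would give.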