Protein binding hot spots prediction from sequence only by a new ensemble learning method.

Author information

Hu Shan-Shan, Chen Peng, Wang Bing, Li Jinyan

Affiliations

School of Computer Science and Technology, Anhui University, Hefei, 230601, Anhui, China.

Institute of Health Sciences, Anhui University, Hefei, 230601, Anhui, China.

Publication information

Amino Acids. 2017 Oct;49(10):1773-1785. doi: 10.1007/s00726-017-2474-6. Epub 2017 Aug 1.

DOI:10.1007/s00726-017-2474-6

PMID:28766075

Abstract

Hot spots are the core interfacial regions of binding proteins and have been used as targets in drug design. Locating hot spots experimentally is costly in both time and money. Recently, in silico methods have been widely used to predict hot spots from sequence or structure features. Because the three-dimensional structure of a protein is not always available, identifying hot spots from the amino acid sequence alone is more useful in practice. This work proposes a new sequence-based model that combines physicochemical features with the relative accessible surface area of amino acid sequences for hot spot prediction. The model consists of 83 classifiers built on the IBk (instance-based k-nearest-neighbor) algorithm, in which instances are encoded by important properties selected from the 544 properties in the AAindex1 (Amino Acid Index) database. The top-performing classifiers are then selected and combined into an ensemble by majority voting. The ensemble classifier outperforms state-of-the-art computational methods, achieving an F1 score of 0.80 on the benchmark Binding Interface Database (BID) test set.
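The select-then-vote scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy feature vectors, the helper names (`knn_predict`, `f1_score`, `build_ensemble`, `ensemble_predict`), the feature subsets, and the choice of k = 3 are all assumptions made for illustration. Each base classifier is a k-nearest-neighbor learner restricted to one subset of feature columns (standing in for one AAindex1-derived encoding), the best subsets are ranked by F1 on a validation set, and the final label is a majority vote.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training instances."""
    nearest = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def project(X, cols):
    """Keep only the feature columns used by one base classifier."""
    return [[row[c] for c in cols] for row in X]


def build_ensemble(train_X, train_y, val_X, val_y, feature_sets, top_n, k=3):
    """Rank one k-NN classifier per feature subset by validation F1; keep the best."""
    scored = []
    for cols in feature_sets:
        tr = project(train_X, cols)
        preds = [knn_predict(tr, train_y, x, k) for x in project(val_X, cols)]
        scored.append((f1_score(val_y, preds), cols))
    scored.sort(key=lambda s: s[0], reverse=True)  # stable: ties keep input order
    return [cols for _, cols in scored[:top_n]]


def ensemble_predict(train_X, train_y, members, x, k=3):
    """Majority vote over the selected base classifiers."""
    votes = Counter(
        knn_predict(project(train_X, cols), train_y, [x[c] for c in cols], k)
        for cols in members
    )
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): 1 = hot spot, 0 = non-hot spot.
# Columns 0 and 1 are informative; column 2 is mostly noise.
train_X = [[0.9, 0.8, 0.1], [0.8, 0.9, 0.9], [0.85, 0.7, 0.5],
           [0.1, 0.2, 0.2], [0.2, 0.1, 0.8], [0.15, 0.3, 0.6]]
train_y = [1, 1, 1, 0, 0, 0]
val_X = [[0.95, 0.75, 0.15], [0.05, 0.25, 0.85],
         [0.7, 0.85, 0.55], [0.3, 0.15, 0.45]]
val_y = [1, 0, 1, 0]

feature_sets = [(0,), (1,), (2,), (0, 1)]  # one candidate classifier per subset
members = build_ensemble(train_X, train_y, val_X, val_y, feature_sets, top_n=3)
hot = ensemble_predict(train_X, train_y, members, [0.88, 0.82, 0.3])
cold = ensemble_predict(train_X, train_y, members, [0.12, 0.18, 0.7])
```

Keeping an odd number of ensemble members avoids tied votes in a two-class problem. In this toy setting the noisy feature subset scores lowest on the validation set and is dropped, which is the purpose of the selection step.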

AVAILABILITY

http://www2.ahu.edu.cn/pchen/web/HotspotEC.htm .
