Combining features in a graphical model to predict protein binding sites.

Author information

Wierschin Torsten, Wang Keyu, Welter Marlon, Waack Stephan, Stanke Mario

Affiliation

Institute of Mathematics and Computer Science, University of Greifswald, 17487, Greifswald, Germany.

Publication information

Proteins. 2015 May;83(5):844-52. doi: 10.1002/prot.24775. Epub 2015 Mar 14.

DOI:10.1002/prot.24775

PMID:25663045

Abstract

Considerable effort has gone into classifying protein residues as binding sites using machine learning methods. The prediction task amounts to assigning each residue the label "binding site" or "non-binding site". Observational data come from various, possibly highly correlated, sources; they include the structure of the protein but not the structure of the complex. Conditional random fields (CRFs) have previously been used successfully for protein binding site prediction. Here, a new CRF approach is presented that models the dependencies between residues using a general graphical structure defined as a neighborhood graph; the model therefore makes fewer independence assumptions on the labels than sequential labeling approaches. A novel node feature, "change in free energy", is introduced into the model, which is then denoted ΔF-CRF. Parameters are trained with an online large-margin algorithm. Using the standard feature class relative accessible surface area alone, the general graph-structure CRF already achieves higher prediction accuracy than the linear-chain CRF of Li et al. On a homodimer set containing 128 chains, ΔF-CRF performs significantly better over a large range of false positive rates than PresCont, the support-vector-machine-based program of Zellner et al. ΔF-CRF also has a broader scope than PresCont, since it is not constrained to protein subgroups and requires no multiple sequence alignment. The improvement is attributed to the advantageous combination of the novel node feature with the standard feature and to the adopted parameter training method.
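To make the idea of a graph-structured CRF over residues more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the neighborhood graph connects residues whose representative coordinates lie within a distance cutoff, and that a labeling is scored by a sum of node terms and edge terms. The function names, the 8.0 Å cutoff, the two-column feature layout (relative accessible surface area, change in free energy), and all weights are hypothetical illustrations; none of them are taken from the paper.

```python
import numpy as np


def neighborhood_graph(coords, cutoff=8.0):
    """Return undirected edges (i, j) with i < j for residues closer than `cutoff`.

    `cutoff` (in angstroms) is an assumed value chosen for illustration only.
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances via broadcasting: shape (n, n).
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Keep the strict upper triangle so each pair appears once and no self-loops occur.
    i, j = np.where(np.triu(dist < cutoff, k=1))
    return list(zip(i.tolist(), j.tolist()))


def crf_score(labels, node_features, edges, w_node, w_edge):
    """Unnormalized log-score of a 0/1 labeling under a pairwise CRF.

    Node term: sum over residues i of w_node[y_i] . x_i
    Edge term: sum over graph edges (i, j) of w_edge[y_i, y_j]
    """
    labels = np.asarray(labels)
    node_term = sum(w_node[y] @ x for y, x in zip(labels, node_features))
    edge_term = sum(w_edge[labels[i], labels[j]] for i, j in edges)
    return float(node_term + edge_term)


# Toy example with four residues on a line (hypothetical coordinates).
coords = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [10.0, 0.0, 0.0], [30.0, 0.0, 0.0]]
edges = neighborhood_graph(coords, cutoff=8.0)

# Hypothetical features per residue: [relative accessible surface area, change in free energy].
node_features = np.array([[0.8, -1.0], [0.6, -0.5], [0.1, 0.2], [0.9, 0.0]])

# Hypothetical weights: row 0 scores "non-binding site", row 1 scores "binding site".
w_node = np.array([[0.0, 0.0], [1.0, -1.0]])
# Hypothetical edge weights that reward neighboring residues sharing a label.
w_edge = np.array([[0.5, -0.5], [-0.5, 0.5]])

labels = [1, 1, 0, 0]
score = crf_score(labels, node_features, edges, w_node, w_edge)
print(edges, score)
```

In a linear-chain CRF the edge list would contain only sequence neighbors (i, i+1); here the edges come from spatial proximity, so residues that are distant in sequence but close in structure can influence each other's labels. Training the weights (for example with an online large-margin method, as the abstract mentions) and inference over the graph are outside the scope of this sketch.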