Bioinformatics and Modeling, GIGA-Research, Department of Electrical Engineering and Computer Science, Montefiore Institute, University of Liege, Liege, Belgium.
PLoS One. 2013;8(2):e56621. doi: 10.1371/journal.pone.0056621. Epub 2013 Feb 15.
Disulfide bridges strongly constrain the native structure of many proteins, and predicting their formation is therefore a key sub-problem of protein structure and function inference. Most recently proposed approaches to this prediction problem adopt the following pipeline: first, they enrich the primary sequence with structural annotations; second, they apply a binary classifier to each candidate pair of cysteines to predict disulfide bonding probabilities; and finally, they use a maximum weight graph matching algorithm to derive the predicted disulfide connectivity pattern of a protein. In this paper, we adopt this three-step pipeline and propose an extensive study of the relevance of various structural annotations and feature encodings. In particular, we consider five kinds of structural annotations, three of which are novel in the context of disulfide bridge prediction. To be usable by machine learning algorithms, these annotations must be encoded into features. For this purpose, we propose four different feature encodings based on local windows and on different kinds of histograms. Combining the structural annotations with these encodings leads to a large number of possible feature functions. To identify a minimal subset of relevant feature functions among them, we propose an efficient and interpretable feature function selection scheme, designed to avoid any form of overfitting. We apply this scheme on top of three supervised learning algorithms: k-nearest neighbors, support vector machines and extremely randomized trees.
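The abstract does not define the four encodings, so the following is only a minimal sketch of the two families it names, a local window and a histogram, applied to a toy annotation string. The function names `window_encoding` and `histogram_encoding`, the padding symbol, and the choice of the interval between two cysteines as the histogram range are all assumptions made for illustration, not the authors' definitions.

```python
from collections import Counter


def window_encoding(labels, center, half_width, pad="-"):
    """Labels at positions center-half_width .. center+half_width.

    Positions that fall outside the sequence are filled with `pad`
    (a hypothetical padding symbol chosen for this sketch).
    """
    return [
        labels[i] if 0 <= i < len(labels) else pad
        for i in range(center - half_width, center + half_width + 1)
    ]


def histogram_encoding(labels, start, end, alphabet):
    """Relative frequency of each alphabet symbol in labels[start:end]."""
    segment = labels[start:end]
    counts = Counter(segment)
    total = len(segment)
    return [counts[s] / total if total else 0.0 for s in alphabet]


# Toy example: the annotation is simply the primary sequence itself.
sequence = "ACDCEFC"
cys = [i for i, aa in enumerate(sequence) if aa == "C"]

# Window of half-width 2 centered on the first cysteine.
window = window_encoding(sequence, cys[0], 2)

# Histogram over the residues strictly between the first two cysteines.
interval = histogram_encoding(sequence, cys[0] + 1, cys[1], "ACDEF")
```

A window keeps positional information around one cysteine, whereas a histogram discards order and summarizes a whole region, which is why the two are complementary when describing a candidate cysteine pair.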
Our results indicate that using only the PSSM (position-specific scoring matrix) together with the CSP (cysteine separation profile) is sufficient to construct a high-performance disulfide pattern predictor, and that extremely randomized trees reach a disulfide pattern prediction accuracy of [Formula: see text] on the benchmark dataset SPX[Formula: see text], which corresponds to a [Formula: see text] improvement over the state of the art. A web application is available at http://m24.giga.ulg.ac.be:81/x3CysBridges.
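The last step of the pipeline, turning pairwise bonding probabilities into a connectivity pattern with maximum weight matching, can be illustrated as follows. This is a sketch under stated assumptions: it finds the exact optimum by exhaustive search over bitmasks, which is only practical for a small number of cysteines and is not the polynomial-time matching algorithm a real predictor would use. The function name `max_weight_matching` and the probabilities are hypothetical.

```python
from functools import lru_cache


def max_weight_matching(n, weight):
    """Exact maximum-weight matching over vertices 0..n-1 by exhaustive search.

    `weight` maps a pair (i, j) with i < j to a non-negative score, here a
    predicted bonding probability. A vertex may stay unpaired.
    Returns (total_weight, sorted list of pairs).
    """

    @lru_cache(maxsize=None)
    def best(mask):
        # `mask` has bit k set when vertex k is still unassigned.
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1  # lowest unassigned vertex
        rest = mask & ~(1 << i)
        # Option 1: leave vertex i unpaired.
        best_total, best_pairs = best(rest)
        # Option 2: pair vertex i with some remaining vertex j > i.
        for j in range(i + 1, n):
            if rest & (1 << j) and (i, j) in weight:
                total, pairs = best(rest & ~(1 << j))
                total += weight[(i, j)]
                if total > best_total:
                    best_total, best_pairs = total, ((i, j),) + pairs
        return best_total, best_pairs

    total, pairs = best((1 << n) - 1)
    return total, sorted(pairs)


# Hypothetical classifier outputs for a protein with four cysteines.
probabilities = {(0, 1): 0.6, (1, 2): 0.7, (2, 3): 0.6}
total, pattern = max_weight_matching(4, probabilities)
```

In this example the single most probable pair is (1, 2), yet the matching selects (0, 1) and (2, 3) because their combined weight is higher. That global view is the reason the pipeline uses matching rather than picking pairs greedily.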