Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany.
PLoS One. 2013 May 17;8(5):e62732. doi: 10.1371/journal.pone.0062732. Print 2013.
Src homology 2 (SH2) domains are the largest family of peptide-recognition modules (PRMs) that bind phosphotyrosine-containing peptides. Knowledge of the binding partners of SH2 domains is key to a deeper understanding of different cellular processes. Given the high binding specificity of SH2 domains, in-silico prediction of ligand peptides is of great interest. Currently, however, only a few approaches have been published for the prediction of SH2-peptide interactions. Their main shortcomings are limited coverage, restrictive modeling assumptions (they are mainly based on position-specific scoring matrices and do not take complex inter-dependencies between amino acids into consideration), and high computational complexity. We propose a simple yet effective machine learning approach for a large set of known human SH2 domains. We used comprehensive data from micro-array and peptide-array experiments on 51 human SH2 domains. To deal with the high data imbalance and the high signal-to-noise ratio, we cast the problem in a semi-supervised setting. We report competitive predictive performance with respect to the state of the art. Specifically, we obtain 0.83 AUC ROC and 0.93 AUC PR, compared with the 0.71 AUC ROC and 0.87 AUC PR previously achieved by the SMALI approach, which is based on position-specific scoring matrices (PSSMs). Our work provides three main contributions. First, we show that better models can be obtained when the information on the non-interacting peptides (negative examples) is also used. Second, we improve performance by considering high-order correlations between the ligand positions, employing regularization techniques to avoid overfitting. Third, we develop an approach to tackle the data imbalance problem using a semi-supervised strategy. Finally, we performed a genome-wide prediction of human SH2-peptide binding, uncovering several findings of biological relevance.
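To make the two reported metrics concrete, the following is a minimal sketch of how AUC ROC and an AUC PR estimate can be computed from a ranked list of predictions. It is not the authors' evaluation code: the function names are hypothetical, the labels and scores are invented toy values, and average precision is used here as one common step-wise estimate of the area under the precision-recall curve.

```python
def auc_roc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of the precision values at each true positive."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    # Rank examples from highest to lowest score (tied scores keep input order).
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Toy data (invented): 1 = binding peptide, 0 = non-binding peptide.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc_roc(toy_labels, toy_scores))
print(average_precision(toy_labels, toy_scores))
```

AUC PR is the more informative of the two numbers when positives and negatives are strongly imbalanced, which is why both are usually quoted together for data of this kind.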
We make our models and genome-wide predictions for all 51 SH2 domains freely available to the scientific community at the following URLs: http://www.bioinf.uni-freiburg.de/Software/SH2PepInt/SH2PepInt.tar.gz and http://www.bioinf.uni-freiburg.de/Software/SH2PepInt/Genome-wide-predictions.tar.gz, respectively.
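The contrast the abstract draws between PSSM-based scoring and models that capture correlations between ligand positions can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the toy peptide strings, and the toy weights are hypothetical, and this is not the feature representation used in the released models.

```python
from itertools import combinations


def position_features(peptide):
    """One feature per (position, residue): the information a PSSM can use."""
    return {(i, aa) for i, aa in enumerate(peptide)}


def pair_features(peptide):
    """One feature per pair of (position, residue) items: second-order terms."""
    singles = sorted(position_features(peptide))
    return set(combinations(singles, 2))


def pssm_score(peptide, pssm):
    """Additive PSSM score: every position contributes independently."""
    return sum(pssm.get((i, aa), 0.0) for i, aa in enumerate(peptide))


# Toy weights (invented): reward E at position 1 and I at position 3.
toy_pssm = {(1, "E"): 1.0, (3, "I"): 2.0}

print(pssm_score("YEEI", toy_pssm))      # both rewarded residues present
print(len(position_features("YEEI")))    # 4 single-position features
print(len(pair_features("YEEI")))        # 6 pairwise features

# A pairwise feature fires only when both residues co-occur, which an
# additive PSSM cannot express.
joint = ((1, "E"), (3, "I"))
print(joint in pair_features("YEEI"))    # True
print(joint in pair_features("YAEI"))    # False
```

The number of pairwise features grows quadratically with peptide length, and higher orders grow faster still, which is why regularization matters once such correlations are modeled.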