Simple sequence-based kernels do not predict protein-protein interactions.

Affiliation

School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China.

Publication information

Bioinformatics. 2010 Oct 15;26(20):2610-4. doi: 10.1093/bioinformatics/btq483. Epub 2010 Aug 27.

DOI:10.1093/bioinformatics/btq483

PMID:20801913

Abstract

MOTIVATION

A number of methods have been reported that predict protein-protein interactions (PPIs) with high accuracy using only simple sequence-based features such as amino acid 3-mer content. This is surprising, given that many protein interactions have high specificity that depends on detailed atomic recognition between physicochemically complementary surfaces. Are the reported high accuracies realistic?

RESULTS

We find that the reported accuracies of the predictions are significantly over-estimated and depend strongly on the structure of the training and testing datasets used. The choice of which protein pairs are deemed non-interacting in the training data has a variable impact on the accuracy estimates. The accuracies can also be artificially inflated by a bias towards dominant samples in the positive data, which results from the presence of hub proteins in the protein interaction network. To address this bias, we propose a positive set-specific method that creates a 'balanced' negative set maintaining the degree distribution for each protein. With this negative set, we conclude that simple sequence-based features contain insufficient information to be useful for predicting PPIs, but that protein domain-based features have some predictive value.
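The idea of a 'balanced' negative set can be illustrated with a minimal sketch. This is not the authors' BRS-nonint code; it is one possible greedy sampler under stated assumptions: interactions are unordered pairs, self-interactions are ignored, and every protein should appear in the negative set exactly as often as it appears in the positive set. The function names (`balanced_negative_set`, `_try_once`), the restart strategy, and the toy network are all hypothetical.

```python
import random
from collections import Counter


def _try_once(positives, degree, rng):
    """One greedy attempt; returns a set of negative pairs, or None if stuck."""
    remaining = dict(degree)  # negative pairs each protein still needs
    negatives = set()
    while remaining:
        # Place the protein with the largest remaining quota first,
        # because hub proteins have the fewest valid partners.
        top = max(remaining.values())
        a = rng.choice(sorted(p for p, q in remaining.items() if q == top))
        candidates = sorted(
            b for b in remaining
            if b != a
            and frozenset((a, b)) not in positives
            and frozenset((a, b)) not in negatives
        )
        if not candidates:
            return None  # dead end; the caller restarts
        b = rng.choice(candidates)
        negatives.add(frozenset((a, b)))
        for p in (a, b):
            remaining[p] -= 1
            if remaining[p] == 0:
                del remaining[p]
    return negatives


def balanced_negative_set(positive_pairs, seed=0, max_restarts=100):
    """Sample non-interacting pairs that preserve each protein's positive-set degree."""
    rng = random.Random(seed)
    # Assumption: pairs are unordered and self-interactions are dropped.
    positives = {frozenset(pair) for pair in positive_pairs if pair[0] != pair[1]}
    degree = Counter(p for pair in positives for p in pair)
    for _ in range(max_restarts):
        negatives = _try_once(positives, degree, rng)
        if negatives is not None:
            return sorted(tuple(sorted(pair)) for pair in negatives)
    raise RuntimeError("no balanced negative set found; try more restarts")


# Toy network (made up): H is a hub with four partners.
positive = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"),
            ("E", "F"), ("G", "I"), ("E", "G"), ("F", "I")]
negative = balanced_negative_set(positive, seed=1)
```

In this sketch the hub H occurs four times in the negative pairs, just as it does in the positive pairs, so a classifier cannot score well merely by learning that pairs containing H tend to be positive. Uniform random sampling of negatives would, by contrast, rarely include H.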

AVAILABILITY

Our method, named 'BRS-nonint', is available at http://www.bioinformatics.leeds.ac.uk/BRS-nonint/. All the datasets used in this study are derived from publicly available data, and are available at http://www.bioinformatics.leeds.ac.uk/BRS-nonint/PPI_RandomBalance.html

CONTACT

maozuguo@hit.edu.cn; d.r.westhead@leeds.ac.uk.
