School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China.
Bioinformatics. 2010 Oct 15;26(20):2610-4. doi: 10.1093/bioinformatics/btq483. Epub 2010 Aug 27.
A number of methods have been reported that predict protein-protein interactions (PPIs) with high accuracy using only simple sequence-based features such as amino acid 3-mer content. This is surprising, given that many protein interactions have high specificity that depends on detailed atomic recognition between physicochemically complementary surfaces. Are the reported high accuracies realistic?
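To make "amino acid 3-mer content" concrete, the following is a minimal sketch of one way such a feature could be computed. The function name `kmer_content` and the choice to normalize counts to frequencies are assumptions made for illustration; the methods evaluated in the paper may encode sequences differently.

```python
from collections import Counter


def kmer_content(sequence, k=3):
    """Return the relative frequency of each overlapping k-mer in a sequence.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    # Slide a window of width k over the sequence and count each k-mer.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than k: no k-mers to report.
        return {}
    # Normalize so that the frequencies sum to 1.
    return {kmer: n / total for kmer, n in counts.items()}


features = kmer_content("ACDA")  # two overlapping 3-mers: "ACD" and "CDA"
```

A feature vector of this kind describes each protein on its own and carries no explicit information about the interface between two partners, which is the reason the abstract questions the reported accuracies.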
We find that the reported accuracies of the predictions are significantly over-estimated and depend strongly on the structure of the training and testing datasets used. The choice of which protein pairs are deemed non-interacting in the training data has a variable impact on the accuracy estimates. The accuracies can also be artificially inflated by a bias towards dominant samples in the positive data that result from the presence of hub proteins in the protein interaction network. To address this bias, we propose a positive set-specific method that creates a 'balanced' negative set maintaining the degree distribution for each protein. With this correction, we conclude that simple sequence-based features contain insufficient information to be useful for predicting PPIs, but that protein domain-based features have some predictive value.
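The idea of a 'balanced' negative set can be sketched as follows. This is not the BRS-nonint implementation. It is a simple rejection-sampling illustration that assumes the positive pairs are unique and contain no self-interactions, and the function name `balanced_negatives` and its parameters are hypothetical. Each protein appears in the negative set exactly as many times as it appears in the positive set, so a classifier cannot score well merely by recognizing hub proteins.

```python
import random
from collections import Counter


def balanced_negatives(positives, seed=0, max_tries=1000):
    """Sample non-interacting pairs that preserve each protein's degree.

    Illustrative sketch only: assumes unique positive pairs with no
    self-interactions, and uses plain rejection sampling.
    """
    positive_set = {frozenset(pair) for pair in positives}
    # Degree = number of positive pairs in which each protein occurs.
    degree = Counter(protein for pair in positives for protein in pair)
    # One "stub" per occurrence, so pairing stubs preserves every degree.
    stubs = [protein for protein, d in degree.items() for _ in range(d)]
    rng = random.Random(seed)
    for _ in range(max_tries):
        rng.shuffle(stubs)
        pairs = [frozenset(stubs[i:i + 2]) for i in range(0, len(stubs), 2)]
        no_self_pairs = all(len(pair) == 2 for pair in pairs)
        no_known_interactions = all(pair not in positive_set for pair in pairs)
        no_duplicates = len(set(pairs)) == len(pairs)
        if no_self_pairs and no_known_interactions and no_duplicates:
            return [tuple(sorted(pair)) for pair in pairs]
    raise RuntimeError("no balanced negative set found within max_tries")


positives = [("A", "B"), ("C", "D"), ("E", "F"), ("A", "C")]
negatives = balanced_negatives(positives)
```

For some networks no balanced set exists. For example, if one hub interacts with every other protein, it has no candidate non-interacting partners, and this sketch raises an error rather than relaxing the degree constraint. Whole-set rejection sampling is also slow for large networks and is used here only because it is easy to verify.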
Our method, named 'BRS-nonint', is available at http://www.bioinformatics.leeds.ac.uk/BRS-nonint/. All the datasets used in this study are derived from publicly available data, and are available at http://www.bioinformatics.leeds.ac.uk/BRS-nonint/PPI_RandomBalance.html