Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15232, USA.
Molecules. 2021 Dec 22;27(1):41. doi: 10.3390/molecules27010041.
Protein-protein interactions (PPIs) perform various functions and regulate processes throughout cells. Knowledge of the full network of PPIs is vital to biomedical research, but most PPIs are still unknown. Because technical and resource limitations make it infeasible to discover all of them experimentally, computational prediction of PPIs is essential, and the performance of prediction algorithms must be assessed accurately before further application or translation. However, many published methods compose their evaluation datasets incorrectly, using a higher proportion of positive-class data than occurs naturally, which leads to exaggerated performance. We re-implemented various published algorithms and evaluated them on datasets with realistic data compositions. We found that their performance is overstated in the original publications, with several methods outperformed by our control models built on 'illogical' and random-number features. We conclude that these methods are influenced by the over-characterization of some proteins in the literature and by the scale-free nature of the PPI network, and that they fail when tested on all possible protein pairs. Additionally, we found that sequence-only algorithms performed worse than those that employ functional and expression features. We present a benchmark evaluation of many published algorithms for PPI prediction. The source code of our implementations and the benchmark datasets created here are made available as open source.
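The effect of dataset composition on reported performance can be illustrated with simple arithmetic. The sketch below is not taken from the paper's code. It assumes a hypothetical classifier with a fixed true-positive rate and false-positive rate (the values 0.9 and 0.1 are illustrative only) and computes the precision that would be expected at different positive-class proportions. The function name `precision_at_prevalence` and the prevalence values are likewise assumptions chosen for the example.

```python
def precision_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """Expected precision for a classifier with fixed TPR and FPR.

    prevalence is the fraction of all candidate pairs that are true positives.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be between 0 and 1")
    true_pos = tpr * prevalence            # expected fraction of pairs that are true positives
    false_pos = fpr * (1.0 - prevalence)   # expected fraction of pairs that are false positives
    if true_pos + false_pos == 0.0:
        return 0.0                         # no positive predictions at all
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Illustrative values only: a balanced test set versus increasingly rare positives.
    for prevalence in (0.5, 0.1, 0.01, 0.001):
        p = precision_at_prevalence(tpr=0.9, fpr=0.1, prevalence=prevalence)
        print(f"prevalence={prevalence:<6} precision={p:.3f}")
```

Under these assumed rates, the same classifier looks precise on a balanced test set and much less precise as positives become rare, which is the kind of inflation the abstract attributes to unrealistic evaluation datasets.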