Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, CB2 0BB Cambridge, United Kingdom.
British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, CB2 0BB Cambridge, United Kingdom.
Bioinformatics. 2024 Feb 1;40(2). doi: 10.1093/bioinformatics/btae012.
Protein-protein interactions (PPIs) are essential to understanding biological pathways as well as their roles in development and disease. Computational tools based on classic machine learning have been successful at predicting PPIs in silico, but the lack of consistent and reliable frameworks for this task has led to network models that are difficult to compare and to discrepancies between algorithms that remain unexplained.
To better understand the inference mechanisms that underpin these models, we designed an open-source benchmarking framework that accounts for a range of biological and statistical pitfalls while facilitating reproducibility. We use it to shed light on the impact of network topology and on how different algorithms deal with highly connected proteins. By studying functional genomics-based and sequence-based models on human PPIs, we show that they are complementary: the former perform best on lone proteins, while the latter specialize in interactions involving hubs. We also show that algorithm design has little impact on performance with functional genomic data. We replicate our results in both human and S. cerevisiae data and demonstrate that models using functional genomics are better suited to PPI prediction across species. With rapidly increasing amounts of sequence and functional genomics data, our study provides a principled foundation for future construction, comparison, and application of PPI networks.
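The contrast between hubs and lone proteins can be illustrated with a minimal sketch of degree-stratified evaluation. This is not the B4PPI implementation: the function names, the toy protein identifiers, and the degree cutoff used to call a protein a hub are all assumptions made for illustration. The idea is to count how many known interactions each protein has in the training network, label a candidate pair as "hub" if either partner reaches the cutoff, and report accuracy separately for each group.

```python
from collections import Counter


def protein_degrees(interactions):
    """Count how many known interactions each protein takes part in."""
    degrees = Counter()
    for a, b in interactions:
        degrees[a] += 1
        degrees[b] += 1
    return degrees


def stratified_accuracy(examples, degrees, hub_threshold):
    """Accuracy per stratum. A pair is 'hub' if either protein has a
    degree >= hub_threshold (an assumed cutoff), otherwise 'lone'."""
    counts = {"hub": [0, 0], "lone": [0, 0]}  # [correct, total]
    for (a, b), truth, prediction in examples:
        top_degree = max(degrees.get(a, 0), degrees.get(b, 0))
        stratum = "hub" if top_degree >= hub_threshold else "lone"
        counts[stratum][0] += int(truth == prediction)
        counts[stratum][1] += 1
    return {
        name: (correct / total if total else None)
        for name, (correct, total) in counts.items()
    }


# Hypothetical training network: P1 is the only highly connected protein.
training_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P5", "P6")]
degrees = protein_degrees(training_edges)

# Hypothetical test set: (pair, true label, predicted label).
examples = [
    (("P1", "P5"), 1, 1),
    (("P1", "P6"), 0, 1),
    (("P2", "P6"), 0, 0),
    (("P7", "P8"), 1, 1),  # proteins unseen in training count as degree 0
]

result = stratified_accuracy(examples, degrees, hub_threshold=3)
print(result)
```

Reporting the two groups separately, rather than a single pooled score, is what makes it possible to see whether a model's performance depends on the connectivity of the proteins involved.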
The code and data are available on GitHub: https://github.com/Llannelongue/B4PPI.