Pitfalls of machine learning models for protein-protein interaction networks.

Affiliations

Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, CB2 0BB Cambridge, United Kingdom.

British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, CB2 0BB Cambridge, United Kingdom.

Publication information

Bioinformatics. 2024 Feb 1;40(2). doi: 10.1093/bioinformatics/btae012.

Abstract

MOTIVATION

Protein-protein interactions (PPIs) are essential to understanding biological pathways as well as their roles in development and disease. Computational tools, based on classic machine learning, have been successful at predicting PPIs in silico, but the lack of consistent and reliable frameworks for this task has led to network models that are difficult to compare and discrepancies between algorithms that remain unexplained.

RESULTS

To better understand the inference mechanisms that underpin these models, we designed an open-source benchmarking framework that accounts for a range of biological and statistical pitfalls while facilitating reproducibility. We use it to shed light on the impact of network topology and on how different algorithms deal with highly connected proteins. By studying functional genomics-based and sequence-based models on human PPIs, we show that they are complementary: the former performs best on lone proteins, while the latter specializes in interactions involving hubs. We also show that algorithm design has little impact on performance with functional genomics data. We replicate our results on both human and S. cerevisiae data and demonstrate that models using functional genomics are better suited to PPI prediction across species. With rapidly increasing amounts of sequence and functional genomics data, our study provides a principled foundation for the future construction, comparison, and application of PPI networks.
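
The contrast between hubs and lone proteins can be made concrete by scoring predictions separately for test pairs that involve a highly connected protein and for those that do not. The following is a minimal sketch of that idea only. It is not the B4PPI API, and the function names, the toy data, and the degree cutoff `hub_threshold=3` are all hypothetical choices made for illustration; the abstract does not state how hubs are defined.

```python
from collections import Counter


def protein_degrees(train_edges):
    """Count how many training interactions each protein takes part in."""
    degree = Counter()
    for a, b in train_edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def stratify_by_hub(test_pairs, degree, hub_threshold=3):
    """Split test pairs by whether at least one protein is a hub.

    hub_threshold is an illustrative cutoff, not a value from the paper.
    Proteins absent from the training edges have degree 0.
    """
    strata = {"hub": [], "non_hub": []}
    for a, b, label, predicted in test_pairs:
        is_hub_pair = max(degree[a], degree[b]) >= hub_threshold
        strata["hub" if is_hub_pair else "non_hub"].append((label, predicted))
    return strata


def accuracy(rows):
    """Fraction of (label, predicted) rows that agree, or None if empty."""
    if not rows:
        return None
    return sum(label == predicted for label, predicted in rows) / len(rows)


# Toy data: P1 has three training interactions, so it counts as a hub here.
train_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P5", "P6")]

# Each test row: (protein_a, protein_b, true_label, predicted_label).
test_pairs = [
    ("P1", "P5", 1, 1),
    ("P1", "P7", 0, 1),
    ("P5", "P7", 0, 0),
    ("P6", "P8", 1, 0),
]

degree = protein_degrees(train_edges)
strata = stratify_by_hub(test_pairs, degree)
per_stratum = {name: accuracy(rows) for name, rows in strata.items()}
print(per_stratum)
```

Reporting one score per stratum, rather than a single pooled score, is what allows a model that mostly learns "hubs interact with everything" to be told apart from one that also predicts interactions between sparsely connected proteins.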

AVAILABILITY AND IMPLEMENTATION

The code and data are available on GitHub: https://github.com/Llannelongue/B4PPI.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/11b9/10868344/f5c995af1a03/btae012f1.jpg
