Negative sampling strategies impact the prediction of scale-free biomolecular network interactions with machine learning.

Authors

Li Pengpai, Shao Bowen, Zhao Guoqing, Liu Zhi-Ping

Affiliations

Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, 250061, Shandong, China.

National Center for Applied Mathematics, Shandong University, Jinan, 250100, Shandong, China.

Publication information

BMC Biol. 2025 May 9;23(1):123. doi: 10.1186/s12915-025-02231-w.

Abstract

BACKGROUND

Understanding protein-molecular interaction is crucial for unraveling the mechanisms underlying diverse biological processes. Machine learning (ML) techniques have been extensively employed in predicting these interactions and have garnered substantial research focus. Previous studies have predominantly centered on improving model performance through novel and efficient ML approaches, often resulting in overoptimistic predictive estimates. However, these advancements frequently neglect the inherent biases stemming from network properties, particularly in biological contexts.

RESULTS

In this study, we examined the biases inherent in ML models during the learning and prediction of protein-molecular interactions, particularly those arising from the scale-free property of biological networks, in which a few nodes have many connections while most have very few. Our comprehensive analysis across diverse tasks, datasets, and ML methods provides compelling evidence of these biases. We discovered that the training and evaluation of ML models are profoundly influenced by network topology, potentially distorting model performance assessments. To mitigate this issue, we propose the degree distribution balanced (DDB) sampling strategy, a straightforward yet potent approach that alleviates biases stemming from network properties. This method further underscores the limitations of certain ML models in learning protein-molecular interactions solely from intrinsic molecular features.

CONCLUSIONS

Our findings present a novel perspective for assessing the performance of ML models in inferring protein-molecular interactions with greater fairness. By addressing biases introduced by network properties, the DDB sampling approach provides a more balanced and precise assessment of model capabilities. These insights hold the potential to bolster the reliability of ML models in bioinformatics, fostering a more stringent evaluation framework for predicting protein-molecular interactions.
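
The abstract does not spell out how DDB sampling is implemented, so the following is only an illustrative sketch of one way a degree-balanced negative sampler could look, not the authors' procedure. It assumes a bipartite protein-molecule network given as a list of positive pairs, and it draws negatives by re-pairing the endpoints of the positive edges so that each node appears among the negatives at most as often as among the positives. The function name `degree_balanced_negatives`, its parameters, and the toy network are all hypothetical.

```python
import random
from collections import Counter


def degree_balanced_negatives(positives, seed=0, max_rounds=100):
    """Sample negative pairs whose per-node counts track the positive degrees.

    Hypothetical helper: each positive edge contributes one protein "stub"
    and one molecule "stub"; stubs are re-paired at random, and pairs that
    are known positives or already chosen are retried in later rounds.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    proteins = [p for p, _ in positives]   # one stub per positive edge
    molecules = [m for _, m in positives]
    negatives = set()
    for _ in range(max_rounds):
        if not proteins:
            break                          # every stub has been placed
        rng.shuffle(molecules)
        left_p, left_m = [], []
        for p, m in zip(proteins, molecules):
            pair = (p, m)
            if pair in positive_set or pair in negatives:
                left_p.append(p)           # keep the stubs for another round
                left_m.append(m)
            else:
                negatives.add(pair)
        proteins, molecules = left_p, left_m
    return sorted(negatives)


# Toy network (made up for illustration): P1 is a hub, the rest have degree 1.
toy_positives = [("P1", "a"), ("P1", "b"), ("P1", "c"),
                 ("P2", "a"), ("P3", "d"), ("P4", "e")]
toy_negatives = degree_balanced_negatives(toy_positives, seed=1)

pos_degree = Counter(p for p, _ in toy_positives)
neg_degree = Counter(p for p, _ in toy_negatives)
for protein in sorted(pos_degree):
    print(protein, "positive:", pos_degree[protein],
          "negative:", neg_degree[protein])
```

With uniform random negative sampling, a hub such as P1 would tend to appear far more often in positive pairs than in negative ones, so a model could score well by learning node degree rather than molecular features. Matching the negative degrees to the positive degrees is meant to remove that shortcut. Some stubs may remain unplaced when a hub already interacts with most partners, so the balance is an upper bound rather than an exact match.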

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b5fe/12065207/0ece91ec435b/12915_2025_2231_Fig1_HTML.jpg
