Pitarch Borja, Pazos Florencio
Computational Systems Biology Group, National Center for Biotechnology (CNB-CSIC), 28049 Madrid, Spain.
Molecules. 2025 Jan 7;30(2):214. doi: 10.3390/molecules30020214.
Knowing which residues of a protein are important for its function is essential for understanding the molecular basis of that function and for devising ways to modify it for medical or biotechnological applications. Because these residues are difficult to detect experimentally, prediction methods are needed to cope with the deluge of uncharacterized protein sequences that is filling databases. Deep learning approaches are especially well suited to this task for three reasons: the large number of protein sequences available for training, the straightforward encoding of sequence data as input to these systems, and the intrinsically sequential nature of the data, which makes it suitable for language models. As a consequence, deep learning-based approaches are being applied to the prediction of different types of functional sites and regions in proteins. This review gives an overview of the current landscape of methodologies so that interested users can see which kinds of approaches are available for their proteins of interest. We also outline how these systems work and explain their limitations, including their strong dependence on the training set, so that users know what quality of results to expect.
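As an illustration of why encoding protein sequence data for a deep learning system is considered straightforward, the following minimal sketch one-hot encodes a sequence over the 20 standard amino acids. It is not taken from the review: the names `AMINO_ACIDS`, `AA_INDEX`, and `one_hot_encode` are hypothetical, and mapping unknown residues to an all-zero vector is an assumption made here for simplicity. Protein language models typically use learned embeddings rather than this fixed encoding.

```python
# Illustrative sketch only; names and the handling of unknown residues are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-element 0/1 vector per residue.

    Residues outside the 20-letter alphabet (e.g. 'X') become all-zero vectors.
    """
    encoded = []
    for residue in sequence.upper():
        vector = [0] * len(AMINO_ACIDS)
        index = AA_INDEX.get(residue)
        if index is not None:
            vector[index] = 1
        encoded.append(vector)
    return encoded


if __name__ == "__main__":
    matrix = one_hot_encode("MKTX")
    # One row per residue, one column per amino acid type.
    print(len(matrix), len(matrix[0]))
```

The result is a length-by-20 matrix that a per-residue classifier could consume, with one output label per row marking whether that position is predicted to be a functional site.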