Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran.
Comput Biol Med. 2024 Sep;179:108815. doi: 10.1016/j.compbiomed.2024.108815. Epub 2024 Jul 11.
Predicting protein structure is both fascinating and formidable, and it plays a crucial role in structure-based drug discovery and in unraveling diseases with elusive origins. The Critical Assessment of Protein Structure Prediction (CASP) serves as a biennial battleground where scientists from around the world converge to untangle the intricate relationships within amino acid chains. Two primary approaches, Template-Based Modeling (TBM) and Template-Free (TF) modeling, dominate protein structure prediction. The trend has shifted towards Template-Free prediction because it covers a broader range of sequences while relying on fewer templates. Prediction targets can be broadly classified into contact maps, binned distances, and real-valued distances, each with distinctive strengths and limitations that are reflected in tailored loss functions. We also introduce the revolutionary end-to-end and all-atom diffusion-based techniques that have transformed protein structure prediction. Recent advancements in deep learning have significantly improved prediction accuracy, although their effectiveness is contingent upon the quality of input features derived from natural bio-physicochemical attributes and Multiple Sequence Alignments (MSAs). Hence, generating high-quality MSA data is of paramount importance for obtaining informative input features and better prediction outcomes. Remarkable gains in prediction accuracy have been achieved; however, they are not yet sufficient for the purposes that structural knowledge was intended to serve, which implies a need for development in other aspects of prediction. In this regard, scientists have opened other frontiers for protein structure prediction. The use of subsampling in MSAs, together with protein language modeling, appears particularly promising for enhancing the accuracy and efficiency of predictions, ultimately aiding drug discovery efforts.
The exploration of protein complex structure prediction also opens up exciting opportunities to deepen our knowledge of molecular interactions and to design more effective therapeutics. In this article, we discuss the vicissitudes that scientists have gone through to improve prediction accuracy and examine effective prediction strategies from different aspects, including the construction of high-quality MSAs, the provision of informative input features, and progress in deep learning approaches. We also briefly touch upon the transition from predicting single-chain protein structures to predicting protein complex structures. Our findings point towards promoting open research environments to support the objectives of protein structure prediction.
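As an illustration of the three prediction targets named in the abstract (contact maps, binned distances, and real-valued distances) and of the kind of loss each is commonly paired with, the following is a minimal NumPy sketch. It is not taken from the article: the 8 Å contact threshold, the bin edges, the function names, and the pairing of losses with targets are all assumptions chosen for illustration.

```python
import numpy as np


def pairwise_distances(coords):
    """Real-valued target: L x L matrix of Euclidean distances (angstroms)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def contact_map(dist, threshold=8.0):
    """Binary target: 1 where two residues are closer than `threshold`.

    The 8 angstrom default is an assumed, commonly used cutoff.
    """
    return (dist < threshold).astype(np.int64)


def bin_distances(dist, edges):
    """Categorical target: distance-bin index for every residue pair."""
    return np.digitize(dist, edges)


def binary_cross_entropy(prob, target, eps=1e-9):
    """Loss sketch for contact-map prediction (per-pair probabilities)."""
    prob = np.clip(prob, eps, 1.0 - eps)
    loss = -(target * np.log(prob) + (1 - target) * np.log(1 - prob))
    return float(loss.mean())


def categorical_cross_entropy(probs, target_bins, eps=1e-9):
    """Loss sketch for binned-distance prediction.

    `probs` has shape L x L x n_bins; `target_bins` has shape L x L.
    """
    picked = np.take_along_axis(probs, target_bins[..., None], axis=-1)[..., 0]
    return float(-np.log(np.clip(picked, eps, 1.0)).mean())


def mean_squared_error(pred, target):
    """Loss sketch for real-valued distance prediction."""
    return float(((pred - target) ** 2).mean())


# Toy example: four pseudo-residues on a line, 5 angstroms apart.
coords = np.array([[0.0, 0.0, 0.0],
                   [5.0, 0.0, 0.0],
                   [10.0, 0.0, 0.0],
                   [15.0, 0.0, 0.0]])
dist = pairwise_distances(coords)          # real-valued target
contacts = contact_map(dist)               # binary target
edges = np.array([4.0, 8.0, 12.0, 16.0])   # hypothetical bin edges
bins = bin_distances(dist, edges)          # categorical target, 5 bins

# A uniform prediction over the 5 bins, as a stand-in for a model output.
uniform = np.full(dist.shape + (len(edges) + 1,), 1.0 / (len(edges) + 1))
```

All three targets are derived from the same distance matrix; they differ only in how much geometric detail they keep, which is why each calls for a differently shaped loss.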