Improved structure-related prediction for insufficient homologous proteins using MSA enhancement and pre-trained language model.

Affiliations

School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China.

School of Computer Science and Engineering, Central South University, Changsha 410083, China.

Publication information

Brief Bioinform. 2023 Jul 20;24(4). doi: 10.1093/bib/bbad217.

DOI:10.1093/bib/bbad217

PMID:37321965

Abstract

In recent years, protein structure problems have become a focal point for understanding protein folding and function mechanisms. Most protein structure methods rely on and benefit from co-evolutionary information obtained from multiple sequence alignments (MSAs). For example, AlphaFold2 (AF2) is a typical MSA-based protein structure tool known for its high accuracy. As a consequence, these MSA-based methods are limited by the quality of the MSAs. In particular, AlphaFold2 performs unsatisfactorily as MSA depth decreases, especially for orphan proteins that have no homologous sequences. This may pose a barrier to its widespread application in protein mutation and design problems, where rich homologous sequences are unavailable and rapid prediction is needed. In this paper, we constructed two standard datasets, Orphan62 and Design204, for orphan and de novo proteins, respectively, which have insufficient or no homology information, to fairly evaluate the performance of various methods in this setting. Then, depending on whether scarce MSA information is used, we summarized two approaches to the problem of insufficient MSAs: MSA-enhanced and MSA-free methods. MSA-enhanced models aim to improve poor MSA quality at the data source through knowledge distillation and generation models. MSA-free models learn the relationships between residues directly from pre-trained models built on enormous numbers of protein sequences, bypassing the step of extracting the residue pair representation from an MSA. Next, we evaluated four MSA-free methods (trRosettaX-Single, TRFold, ESMFold and ProtT5) and one MSA-enhanced method (Bagging MSA) against a traditional MSA-based method, AlphaFold2, on two protein structure-related prediction tasks. Comparative analyses show that the MSA-free methods trRosettaX-Single and ESMFold achieve fast prediction ($\sim 40$ s) and performance comparable to AF2 in tertiary structure prediction, especially for short peptides, $\alpha$-helical segments and targets with few homologous sequences. In secondary structure prediction, Bagging MSA, which uses MSA enhancement, improves the accuracy of our trained MSA-based base model when homology information is poor. Our study offers biologists insight into how to select rapid and appropriate prediction tools for enzyme engineering and peptide drug development.
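
The abstract's choice between MSA-based and MSA-free tools depends on how much homology information a target has. As a rough illustration only, the sketch below measures MSA depth in two common ways: the raw number of aligned sequences, and an effective sequence count (Neff) that down-weights near-duplicates. The helper names, the 80% identity threshold and the A3M input format are assumptions made for this example; the abstract does not state how depth was measured in the study.

```python
from typing import List


def read_a3m(text: str) -> List[str]:
    """Parse A3M/FASTA text into aligned sequences.

    Lowercase letters (A3M insertions relative to the query) are dropped
    so that all returned sequences share the query's column layout.
    """
    seqs: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header closes the previous record, if it had any residues.
            if current:
                seqs.append("".join(current))
                current = []
        else:
            current.append("".join(c for c in line if not c.islower()))
    if current:
        seqs.append("".join(current))
    return seqs


def identity(a: str, b: str) -> float:
    """Fraction of columns where both sequences carry the same residue.

    Gap-gap columns do not count as matches. Assumes non-empty sequences
    of equal aligned length.
    """
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


def neff(seqs: List[str], threshold: float = 0.8) -> float:
    """Effective number of sequences (assumed definition for this sketch).

    Each sequence contributes 1 / (number of sequences, itself included,
    that are at least `threshold` identical to it).
    """
    total = 0.0
    for i, a in enumerate(seqs):
        neighbours = sum(
            1 for j, b in enumerate(seqs)
            if i == j or identity(a, b) >= threshold
        )
        total += 1.0 / neighbours
    return total


if __name__ == "__main__":
    # Toy alignment: the query, two close homologs, one unrelated sequence.
    toy = ">query\nACDEFGHIKL\n>h1\nACDEFGHIKL\n>h2\nACDEFGHIKV\n>h3\nMNPQRSTVWY\n"
    aligned = read_a3m(toy)
    print("depth:", len(aligned), "Neff:", neff(aligned))
```

A target with a low depth or Neff under this kind of measure would be a candidate for the MSA-free or MSA-enhanced routes the abstract describes; any specific cutoff would have to be chosen by the user, since none is given here.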

CONTACT

guofei@csu.edu.cn, jj.tang@siat.ac.cn.
