Improved the heterodimer protein complex prediction with protein language models.

Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, China.

Toyota Technological Institute at Chicago, Chicago, IL 60637, USA.

Publication information

Brief Bioinform. 2023 Jul 20;24(4). doi: 10.1093/bib/bbad221.

DOI:10.1093/bib/bbad221

PMID:37328552

Abstract

AlphaFold-Multimer has greatly improved protein complex structure prediction, but its accuracy still depends on the quality of the multiple sequence alignment (MSA) formed by the interacting homologs (i.e., interologs) of the complex under prediction. Here we propose a novel method, ESMPair, that identifies interologs of a complex using protein language models. We show that ESMPair generates better interologs than the default MSA generation method in AlphaFold-Multimer. Our method yields substantially better complex structure prediction than AlphaFold-Multimer (+10.7% in terms of the Top-5 best DockQ), especially when the predicted complex structures have low confidence. We further show that combining several MSA generation methods can yield even better complex structure prediction accuracy than AlphaFold-Multimer (+22% in terms of the Top-5 best DockQ). By systematically analyzing the factors that affect our algorithm, we find that the diversity of the interolog MSA significantly affects prediction accuracy. Moreover, we show that ESMPair performs particularly well on complexes in eukaryotes.
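The "Top-5 best DockQ" figures quoted in the abstract can be read as follows: for each complex, take the best DockQ score among the five top-ranked predicted models, then average over complexes. The sketch below illustrates that reading only. The function names and the scores are hypothetical, and treating the reported gain as a relative (rather than absolute) improvement is an assumption, since the abstract does not say which is meant.

```python
from statistics import mean


def top_k_best(scores_by_target, k=5):
    """Average, over targets, of the best DockQ among each target's first k models.

    scores_by_target maps a target name to DockQ scores ordered by the
    predictor's own confidence ranking (best-ranked model first).
    """
    return mean(max(scores[:k]) for scores in scores_by_target.values())


def relative_gain_percent(baseline, method):
    """Relative improvement of `method` over `baseline`, in percent (assumed convention)."""
    return (method - baseline) / baseline * 100.0


# Made-up scores for illustration; these are NOT results from the paper.
baseline_scores = {
    "complex_a": [0.20, 0.25, 0.10, 0.30, 0.15, 0.90],  # sixth model is outside the top 5
    "complex_b": [0.50, 0.40, 0.45, 0.35, 0.30],
}
candidate_scores = {
    "complex_a": [0.35, 0.30, 0.40, 0.20, 0.25],
    "complex_b": [0.55, 0.50, 0.60, 0.45, 0.40],
}

baseline_top5 = top_k_best(baseline_scores)
candidate_top5 = top_k_best(candidate_scores)
gain = relative_gain_percent(baseline_top5, candidate_top5)
```

Note that the metric ignores any model ranked below the fifth position, so a high-scoring but poorly ranked model (the 0.90 entry above) does not count.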
