Kshirsagar Meghana, Meller Artur, Humphreys Ian R, Sledzieski Samuel, Xu Yixi, Dodhia Rahul, Horvitz Eric, Berger Bonnie, Bowman Gregory R, Ferres Juan Lavista, Baker David, Baek Minkyung
AI for Good Research Lab, Microsoft Corporation, Redmond, WA, USA.
Department of Biochemistry and Molecular Biophysics, Washington University in St. Louis, St. Louis, MO, USA.
Nat Commun. 2025 Feb 27;16(1):2017. doi: 10.1038/s41467-025-57148-3.
The majority of proteins must form higher-order assemblies to perform their biological functions, yet few machine learning models can accurately and rapidly predict the symmetry of assemblies involving multiple copies of the same protein chain. Here, we address this gap by fine-tuning several classes of protein foundation models to predict homo-oligomer symmetry. Our best model, Seq2Symm, which uses ESM2, outperforms existing template-based and deep learning methods, achieving an average AUC-PR of 0.47, 0.44, and 0.49 across homo-oligomer symmetries on three held-out test sets, compared with 0.24, 0.24, and 0.25 for template-based search. Seq2Symm takes a single sequence as input and can predict at a rate of ~80,000 proteins/hour. We apply this method to 5 proteomes and ~3.5 million unlabeled protein sequences, showing its promise for use in conjunction with downstream, computationally intensive all-atom structure generation methods such as RoseTTAFold2 and AlphaFold2-multimer. Code, datasets, and model are available at https://github.com/microsoft/seq2symm.
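To make the headline metric concrete, the following is a minimal sketch of one common way to compute an AUC-PR averaged across symmetry classes: average precision per class, then an unweighted mean over classes. The abstract does not state which estimator or averaging scheme the authors used, so this is an assumption. The function names `average_precision` and `macro_auc_pr`, the class names, and the toy labels and scores are all hypothetical illustrations, not taken from the Seq2Symm code or data.

```python
from typing import Dict, List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision for one class (a common estimator of AUC-PR).

    Examples are ranked by descending score; the result is the mean of the
    precision values measured at the rank of each positive example.
    Tied scores keep their input order (Python's sort is stable).
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("class has no positive examples")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / positives


def macro_auc_pr(
    y_true: Dict[str, List[int]], y_score: Dict[str, List[float]]
) -> float:
    """Unweighted mean of per-class average precision (assumed averaging scheme)."""
    per_class = [average_precision(y_true[c], y_score[c]) for c in y_true]
    return sum(per_class) / len(per_class)


# Hypothetical toy data: four proteins, two symmetry classes.
toy_true = {"C2": [1, 0, 1, 0], "D2": [0, 1, 0, 0]}
toy_score = {"C2": [0.9, 0.8, 0.7, 0.1], "D2": [0.2, 0.6, 0.7, 0.1]}

if __name__ == "__main__":
    print(f"{macro_auc_pr(toy_true, toy_score):.3f}")  # prints 0.667
```

In the toy data, class C2 scores 5/6 and class D2 scores 1/2, so the unweighted mean is 2/3. An unweighted mean gives rare symmetry classes the same influence as common ones, which is why the choice of averaging scheme matters when comparing reported numbers.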