Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland.
SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland.
eLife. 2023 Feb 3;12:e79854. doi: 10.7554/eLife.79854.
Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated with protein structure and function. They thus open the possibility of generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective of MSA Transformer to generate sequences. We demonstrate that the resulting sequences score as well as natural sequences on homology, coevolution, and structure-based measures. For large protein families, our synthetic sequences have properties similar to or better than those of sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our MSA Transformer-based generation method outperforms Potts models. Our method also reproduces the higher-order statistics and the sequence-space distribution of natural data more accurately than Potts models do. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.
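Below is a minimal sketch of such an iterative masked-sampling loop, assuming the open-source `esm` package, which provides the pretrained MSA Transformer as `esm_msa1b_t12_100M_UR50S`. The masking fraction `P_MASK`, the iteration count `N_ITER`, the toy three-sequence input MSA, and the greedy (argmax) decoding are illustrative assumptions for exposition, not the authors' exact settings.

```python
# Sketch (not the authors' code) of iterative masked sampling with
# MSA Transformer via the `esm` package. P_MASK, N_ITER, the toy MSA,
# and greedy decoding are illustrative assumptions.
import torch
import esm

P_MASK = 0.1   # hypothetical fraction of positions masked per round
N_ITER = 20    # hypothetical number of mask/predict rounds

model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Toy input MSA; in practice, use many aligned natural homologs.
msa = [
    ("seq1", "MKTAYIAKQR"),
    ("seq2", "MKTAHIAKQR"),
    ("seq3", "MKSAYIARQR"),
]
_, _, tokens = batch_converter(msa)  # (1, num_seqs, L + 1); column 0 is BOS

with torch.no_grad():
    for _ in range(N_ITER):
        # Mask a random subset of residue columns (never the BOS column).
        mask = torch.rand(tokens.shape) < P_MASK
        mask[..., 0] = False
        masked = tokens.clone()
        masked[mask] = alphabet.mask_idx

        # Predict all masked tokens jointly; fill them in greedily.
        logits = model(masked)["logits"]   # (1, num_seqs, L + 1, vocab)
        tokens[mask] = logits.argmax(-1)[mask]

# Decode the generated MSA back to amino-acid strings (dropping BOS).
generated = ["".join(alphabet.get_tok(int(t)) for t in row[1:])
             for row in tokens[0]]
print(generated)
```

Each round masks a fraction of positions across the whole alignment and lets the model repredict them conditioned on the rest of the MSA, so the sequences gradually drift away from the natural starting point while remaining consistent with the family's learned constraints; sampling from the softmax (possibly with a temperature) instead of taking the argmax is a natural variant of this loop.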