SISSA, Via Bonomea 265, Trieste, Italy.
Institute of Bioengineering, Ecole Polytechnique Federale de Lausanne, Lausanne CH-1015, Switzerland and Swiss Institute of Bioinformatics (SIB), Lausanne CH-1015, Switzerland.
J Chem Phys. 2021 Feb 21;154(7):074114. doi: 10.1063/5.0039240.
Computational protein design has emerged as a powerful tool for identifying sequences compatible with pre-defined protein structures. The sequence design protocols implemented in the Rosetta suite have become widely used in the protein engineering community. To understand the strengths and limitations of the Rosetta design framework, we tested several design protocols on two distinct folds (SH3-1 and Ubiquitin). Whether started from native structures with natural sequences or from polyvaline sequences, the sequence optimization converges to sequences that standard bioinformatic tools, such as BLAST and Hmmer, do not recognize as belonging to the fold family of the target protein. The sequences generated from the two starting conditions (native and polyvaline) are instead very similar to each other, and Hmmer recognizes them as belonging to the same "family." This demonstrates that Rosetta can converge to similar sequences even when sampling from distinct starting conditions but, on the other hand, reveals an intrinsic inaccuracy of the scoring function, which drifts toward sequences that lack identifiable natural sequence signatures. To address this problem, we developed a protocol that embeds Rosetta Design simulations in a genetic algorithm, in which the sequence search is biased to converge to sequences that exist in nature. This protocol yields sequences with recognizable natural sequence signatures, and, experimentally, the designed proteins are biochemically well behaved and thermodynamically stable.
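The idea of biasing a genetic algorithm toward natural-like sequences can be illustrated with a minimal sketch. It is not the authors' protocol and does not call Rosetta or Hmmer. The functions `toy_energy` and `naturalness`, the `bias` weight, and the reference string are all hypothetical stand-ins: the first replaces a structure-based design score, the second replaces a family-profile score, and the reference is an arbitrary toy sequence.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_energy(seq: str) -> float:
    # Hypothetical stand-in for a structure-based design score (lower is better).
    # It simply rewards hydrophobic residues so the search has something to optimise.
    return -sum(seq.count(a) for a in "VILF") / len(seq)


def naturalness(seq: str, reference: str) -> float:
    # Hypothetical stand-in for a family-profile score:
    # fraction of positions identical to a natural reference sequence.
    return sum(a == b for a, b in zip(seq, reference)) / len(reference)


def fitness(seq: str, reference: str, bias: float) -> float:
    # Higher is better: low energy plus a weighted pull toward natural sequences.
    return -toy_energy(seq) + bias * naturalness(seq, reference)


def mutate(seq: str, rng: random.Random, rate: float = 0.05) -> str:
    # Replace each residue with a random amino acid with probability `rate`.
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else a for a in seq)


def crossover(a: str, b: str, rng: random.Random) -> str:
    # Single-point crossover between two parents of equal length.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(reference: str, bias: float, generations: int = 60,
           pop_size: int = 40, seed: int = 0) -> str:
    rng = random.Random(seed)
    # Start from polyvaline, one of the starting conditions named in the abstract.
    population = ["V" * len(reference)] * pop_size
    for _ in range(generations):
        ranked = sorted(population, key=lambda s: fitness(s, reference, bias), reverse=True)
        parents = ranked[: pop_size // 2]  # elitism: the best half survives unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=lambda s: fitness(s, reference, bias))


if __name__ == "__main__":
    toy_reference = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary toy string, not a real family member
    for bias in (0.0, 5.0):
        best = evolve(toy_reference, bias)
        print(bias, best, round(naturalness(best, toy_reference), 2))
```

With `bias = 0.0` the search is driven by the toy energy alone, which loosely mirrors an unbiased design run. Raising `bias` adds selection pressure toward the reference, which is the role the natural-sequence bias plays in the protocol described above. In a real setting both scoring functions would have to be replaced by actual design and profile-scoring tools.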