Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125;
Protabit, LLC, Pasadena, CA 91106.
Proc Natl Acad Sci U S A. 2019 Aug 13;116(33):16367-16377. doi: 10.1073/pnas.1903888116. Epub 2019 Aug 1.
The accurate prediction of protein stability upon sequence mutation is an important but unsolved challenge in protein engineering. Large mutational datasets are required to train computational predictors, but traditional methods for collecting stability data are either low-throughput or measure protein stability indirectly. Here, we develop an automated method to generate thermodynamic stability data for nearly every single mutant in a small 56-residue protein. Analysis reveals that most single mutants have a neutral effect on stability, that mutational sensitivity is largely governed by residue burial, and that, unexpectedly, hydrophobic residues are the best-tolerated amino acid type. Correlating the output of various stability-prediction algorithms against our data shows that nearly all perform better on boundary and surface positions than on core positions and are better at predicting large-to-small mutations than small-to-large ones. We show that the most stable variants in the single-mutant landscape are better identified using combinations of 2 prediction algorithms, and that including more algorithms can yield diminishing returns. In most cases, poor in silico predictions were tied to compositional differences between the data being analyzed and the datasets used to train the algorithm. Finally, we find that strategies to extract stabilities from high-throughput fitness data such as deep mutational scanning are promising and that data produced by these methods may be applicable toward training future stability-prediction tools.
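The benchmarking idea in the abstract (correlating each algorithm's output against measured stabilities, then combining 2 algorithms to pick out the most stable variants) can be illustrated with a minimal sketch. The abstract does not say how predictions were combined or which correlation statistic was used, so the Pearson correlation, the z-score averaging in `combine`, the sign convention (higher means more stabilizing), and every number below are assumptions made up for illustration, not data or methods from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def zscores(xs):
    """Rescale scores to mean 0 and unit (population) standard deviation."""
    n = len(xs)
    m = sum(xs) / n
    sd = sqrt(sum((x - m) ** 2 for x in xs) / n)
    return [(x - m) / sd for x in xs]


def combine(pred_a, pred_b):
    """Hypothetical combination rule: average the two predictors' z-scores."""
    return [(a + b) / 2 for a, b in zip(zscores(pred_a), zscores(pred_b))]


def top_k(scores, k):
    """Indices of the k highest-scoring variants."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy, made-up values for six hypothetical variants (higher = more stable).
measured = [1.0, -0.5, 0.2, -2.0, 0.8, -1.2]
pred_a = [0.8, -0.1, 0.9, -1.5, 0.1, -1.0]   # hypothetical algorithm A
pred_b = [0.3, -0.9, -0.2, -1.1, 1.2, -0.4]  # hypothetical algorithm B
combined = combine(pred_a, pred_b)

for name, pred in [("A", pred_a), ("B", pred_b), ("A+B", combined)]:
    print(name, "r =", round(pearson(pred, measured), 3),
          "top-2 =", top_k(pred, 2))
print("measured top-2 =", top_k(measured, 2))
```

Z-scoring before averaging puts predictors that report on different scales on a common footing; whether a combination actually improves on either predictor alone depends on the data.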