van der Flier Floris, Estell Dave, Pricelius Sina, Dankmeyer Lydia, van Stigt Thans Sander, Mulder Harm, Otsuka Rei, Goedegebuur Frits, Lammerts Laurens, Staphorst Diego, van Dijk Aalt D J, de Ridder Dick, Redestig Henning
Department of Plant Sciences, Wageningen University & Research, Wageningen, 6708 PB, the Netherlands.
Health & Biosciences, International Flavors and Fragrances, Palo Alto, 94304 CA, USA.
Comput Struct Biotechnol J. 2024 Oct 2;23:3489-3497. doi: 10.1016/j.csbj.2024.09.007. eCollection 2024 Dec.
Protein engineering increasingly relies on machine learning models to computationally pre-screen promising novel candidates. Although machine learning approaches have proven effective, their performance on prospective screening data leaves room for improvement: prediction accuracy can vary greatly from one protein variant to the next. So far, it is unclear what characterizes variants that are associated with large prediction errors. To establish whether structural characteristics influence predictability, we created a novel high-order combinatorial dataset for an enzyme spanning 3,706 variants, which can be partitioned into subsets of variants with mutations at positions exclusively belonging to a particular structural class. By training four different supervised variant effect prediction (VEP) models on structurally partitioned subsets of our data, we found that predictability strongly depended on all four structural characteristics we tested: buriedness, number of contact residues, proximity to the active site, and presence of secondary structure elements. These dependencies were also found in several single-mutation enzyme variant datasets, albeit with dataset-specific directions. Most importantly, we found that these dependencies were similar for all four models we tested, indicating that there are specific structure and function determinants that are insufficiently accounted for by current machine learning algorithms. Overall, our findings suggest that VEP models can be improved by exploring new inductive biases and by leveraging different data modalities of protein variants, and that stratified dataset design can highlight areas of improvement for machine-learning-guided protein engineering.
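The structural partitioning described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the variant notation (`A12G+L45P`), the class labels, and the helper names `parse_positions` and `partition_by_class` are all assumptions made for illustration. The sketch keeps a variant only if every mutated position falls into the same structural class.

```python
from collections import defaultdict


def parse_positions(variant):
    """Return mutated positions from a string such as 'A12G+L45P'.

    The '+'-separated notation is an assumed, hypothetical format.
    """
    return [int(mutation[1:-1]) for mutation in variant.split("+")]


def partition_by_class(variants, position_class):
    """Group variants whose mutations all lie in one structural class.

    Variants mixing classes, or touching unannotated positions,
    are left out of every subset.
    """
    subsets = defaultdict(list)
    for variant in variants:
        classes = {position_class.get(p) for p in parse_positions(variant)}
        if len(classes) == 1 and None not in classes:
            subsets[classes.pop()].append(variant)
    return dict(subsets)


# Hypothetical annotation: position -> structural class (e.g. buriedness).
position_class = {12: "buried", 45: "buried", 78: "exposed", 101: "exposed"}
variants = ["A12G", "A12G+L45P", "S78T+K101R", "A12G+S78T"]

subsets = partition_by_class(variants, position_class)
for label, members in subsets.items():
    print(label, members)
```

Each resulting subset could then be used to train and evaluate a model separately, so that prediction error can be compared across structural classes.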