Laboratory for Structural Bioinformatics, School of Systems Biology, George Mason University, 10900 University Boulevard MS 5B3, Manassas, VA 20110, USA.
Protein Eng Des Sel. 2020 Sep 14;33. doi: 10.1093/protein/gzaa022.
A computational mutagenesis technique was used to characterize the structural effects associated with over 46 000 single and multiple amino acid variants of Aequorea victoria green fluorescent protein (GFP), whose functional effects (fluorescence levels) were recently measured experimentally. For each GFP mutant, the approach generated a single score reflecting the overall change in sequence-structure compatibility relative to native GFP, as well as a vector of environmental perturbation (EP) scores characterizing the impact at all GFP residue positions. A significant GFP structure-function relationship (P < 0.0001) was elucidated by comparing the sequence-structure compatibility scores with the functional data. Next, the computed EP vectors for GFP mutants were used to train predictive models of fluorescence by implementing random forest (RF) classification and tree regression machine learning algorithms. Classification performance reached 0.93 for sensitivity, 0.91 for precision and 0.90 for balanced accuracy, and regression models yielded Pearson correlations as high as r = 0.83 between experimental and predicted GFP mutant fluorescence. An RF model trained on a subset of over 1000 experimental single residue GFP mutants with measured fluorescence was used to predict the 3300 remaining unstudied single residue mutants, with results complementing known GFP biochemical and biophysical properties. In addition, models trained on the subset of experimental GFP mutants harboring multiple residue replacements successfully predicted fluorescence of the single residue GFP mutants. The models developed for this study were accurate and efficient, and their predictions outperformed those of several related state-of-the-art methods.
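For readers less familiar with the reported performance measures, the sketch below spells out the standard definitions of sensitivity, precision, balanced accuracy and Pearson's r. It is illustrative only and is not the authors' code: the function names, the choice of label 1 as the positive class, and the toy labels are all assumptions made for this example.

```python
from math import sqrt


def classification_metrics(y_true, y_pred):
    """Standard binary metrics; label 1 is the (assumed) positive class.

    Assumes both classes occur in y_true and at least one positive
    prediction exists, so no denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        # Balanced accuracy is the mean of the two per-class recalls.
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical toy labels, not data from the study.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = classification_metrics(toy_true, toy_pred)
```

Balanced accuracy averages the recall on each class, so it stays informative when the two classes differ greatly in size.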