School of Computer Science and Technology, Soochow University, Suzhou 215006, China.
Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China.
Int J Mol Sci. 2021 Jul 27;22(15):8027. doi: 10.3390/ijms22158027.
Genetic variations have a multitude of effects on proteins. A substantial number of variations affect protein-solvent interactions, through either aggregation or solubility. Aggregation is often related to structural alterations, whereas solubilizable proteins in the solid phase can be made soluble again by dilution. Solubility is a central protein property, and reduced solubility can lead to disease. We developed a prediction method, PON-Sol2, to identify amino acid substitutions that increase, decrease, or have no effect on protein solubility. The method is a machine learning tool that uses a gradient boosting algorithm. It was trained on a large dataset of variants with different outcomes, after features were selected from a large number of tested properties. The method is fast and performs well: in 10-fold cross-validation, the normalized correct prediction rate for the three states is 0.656 and the normalized GC2 score is 0.312. The corresponding values in the blind test were 0.545 and 0.157. Its performance was superior to that of previous methods. The PON-Sol2 predictor is freely available. It can be used to predict the solubility effects of variants for any organism, even in large-scale projects.
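To make the two reported measures concrete, the following is a minimal Python sketch of how a correct prediction rate and a GC2 score can be computed from a three-class confusion matrix (rows are true classes, columns are predicted classes). It is an illustration under stated assumptions, not the authors' implementation. The abstract does not define the normalization, so the sketch assumes that "normalized" means rescaling each true class to the same size; it also assumes the commonly used definition of GC2 as a chi-square-like statistic divided by N(K - 1). The function names and the example counts are hypothetical.

```python
def normalize_rows(matrix):
    """Rescale each row (true class) so that all classes have equal totals.

    Assumption: this is one plausible reading of "normalized"; the abstract
    does not spell out the procedure.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    target = total / k  # size every class is rescaled to
    return [[v * target / sum(row) for v in row] for row in matrix]


def correct_prediction_rate(matrix):
    """Fraction of all cases that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def gc2(matrix):
    """Generalized squared correlation for a K x K confusion matrix.

    Assumed definition: sum over cells of (observed - expected)^2 / expected,
    divided by N * (K - 1), with expected = row_sum * col_sum / N.
    The value is 0 for chance-level and 1 for perfect predictions.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    stat = 0.0
    for i in range(k):
        for j in range(k):
            expected = row_sums[i] * col_sums[j] / n
            if expected > 0:  # skip cells with no expected cases
                stat += (matrix[i][j] - expected) ** 2 / expected
    return stat / (n * (k - 1))


if __name__ == "__main__":
    # Hypothetical counts; classes: decrease, no effect, increase.
    counts = [[8, 1, 1], [10, 20, 10], [0, 0, 50]]
    balanced = normalize_rows(counts)
    print(correct_prediction_rate(balanced))
    print(gc2(balanced))
```

With equal class sizes, the correct prediction rate equals the mean of the per-class recalls, so a predictor cannot score well simply by favoring the most frequent class.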