Research Group in Cheminformatics & Nutrition, Departament de Bioquímica i Biotecnologia, Campus de Sescelades, Universitat Rovira i Virgili, 43007 Tarragona, Spain.
Department of Biology, University of Turku, 20500 Turku, Finland.
Int J Mol Sci. 2022 Nov 24;23(23):14683. doi: 10.3390/ijms232314683.
Predicting SARS-CoV-2 mutations is difficult, but predicting recurrent mutations driven by the host, such as those caused by host deaminases, is feasible. We used machine learning to predict which positions in the SARS-CoV-2 genome will hold a recurrent mutation and which mutations will be the most recurrent. We used data from April 2021, which we split into three sets: a training set, a validation set, and an independent test set. For the test set, we obtained a specificity of 0.69, a sensitivity of 0.79, and an area under the curve (AUC) of 0.8, showing that the prediction of recurrent SARS-CoV-2 mutations is feasible. Subsequently, we compared our predictions with updated data from January 2022, which showed that some of the model's false positives later became true positives. According to the model's SHapley Additive exPlanations (SHAP) values, the most important variables are the nucleotide that mutates and RNA reactivity. This is consistent with the SARS-CoV-2 mutational bias pattern and the preference of some host deaminases for specific sequences and RNA secondary structures. We extended our investigation by analyzing the mutations in the variants of concern Alpha, Beta, Delta, Gamma, and Omicron. Finally, we analyzed amino acid changes by looking at the predicted recurrent mutations in the M-pro and spike proteins.
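For readers less familiar with the reported metrics, the following is a minimal sketch of how specificity, sensitivity, and AUC can be computed for a binary "recurrent / not recurrent" classifier. It is not the authors' pipeline: the function names, the 0.5 decision threshold, and the toy labels and scores are all hypothetical, and the AUC is assumed here to be the area under the ROC curve, computed through its rank interpretation.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = recurrent).

    Assumes both classes are present in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    sensitivity = tp / (tp + fn)  # fraction of recurrent positions recovered
    specificity = tn / (tn + fp)  # fraction of non-recurrent positions rejected
    return sensitivity, specificity


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only (hypothetical, not taken from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(sens, spec, auc)
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is why the three values are usually reported together.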