Govender Shaylyn, Morgan Emily, Ramahala Rabelani, Lobb Kevin, Bishop Nigel T, Tastan Bishop Özlem
Research Unit in Bioinformatics (RUBi), Department of Biochemistry, Microbiology and Bioinformatics, Rhodes University, Makhanda 6139, South Africa.
Department of Chemistry, Rhodes University, Makhanda 6139, South Africa.
Comput Struct Biotechnol J. 2025 Apr 22;27:1686-1692. doi: 10.1016/j.csbj.2025.04.029. eCollection 2025.
Understanding viral evolution and predicting future mutations are crucial for overcoming drug resistance and developing long-lasting treatments. Previously, we established machine learning (ML) models trained on dynamic residue network (DRN) metric data, leveraging the large body of existing mutation data for the SARS-CoV-2 main protease (Mpro). Here, we assessed the generalizability and robustness of these models across other SARS-CoV-2 proteins. To do so, we employed, for the first time, a transfer learning (TL) approach, which allowed us to determine the extent to which Mpro-trained models can be applied to other SARS-CoV-2 proteins. The TL results were highly promising, with artificial neural network (ANN) and random forest (RF) correlation coefficients for Mpro closely matching those of NSP10, NSP16, and PLpro. The ANN |R| value for Mpro was 0.564, while NSP10, NSP16, and PLpro had values of 0.533, 0.527, and 0.464, respectively. Similarly, the RF |R| value for Mpro was 0.673, compared to 0.457, 0.460, and 0.437 for NSP10, NSP16, and PLpro, respectively. Interestingly, we did not observe a strong correlation for the spike (S) protein monomer or its domains. The low p-values associated with the |R| values show that the linear correlations between predicted and actual mutation frequencies are statistically significant. This indicates that TL with a DRN-derived ML model trained on Mpro may generalize well across structurally related viral proteins. Overall, we aim to develop a universal ML model for predicting missense mutation frequencies in viral proteins, and this study lays the foundation for that goal.
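The evaluation described in the abstract (fit a model on one protein, apply it unchanged to another, then report |R| and a p-value between predicted and actual mutation frequencies) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic toy numbers, a one-feature least-squares line stands in for the ANN and RF models, the correlation is taken to be Pearson's, and the p-value comes from a permutation test rather than whatever test the authors used. The function names are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for |R| (assumed test, not the paper's)."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


def fit_line(xs, ys):
    """Stand-in model: one-feature least-squares line (NOT the paper's ANN/RF)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def evaluate_transfer(model, target_features, target_observed):
    """Apply a source-trained model to a target protein; return (|R|, p)."""
    predicted = [model(x) for x in target_features]
    abs_r = abs(pearson_r(predicted, target_observed))
    return abs_r, permutation_p_value(predicted, target_observed)


def demo():
    """Synthetic example: 'source' and 'target' proteins with toy values."""
    rng = random.Random(42)
    # Hypothetical per-residue feature and mutation frequency (synthetic).
    src_x = [rng.random() for _ in range(60)]
    src_y = [2.0 * x + rng.gauss(0.0, 0.3) for x in src_x]
    tgt_x = [rng.random() for _ in range(40)]
    tgt_y = [2.0 * x + rng.gauss(0.0, 0.5) for x in tgt_x]
    model = fit_line(src_x, src_y)  # trained on the source protein only
    return evaluate_transfer(model, tgt_x, tgt_y)


if __name__ == "__main__":
    abs_r, p = demo()
    print(f"|R| = {abs_r:.3f}, permutation p = {p:.4f}")
```

The model is never refitted on the target protein, so the reported |R| reflects only how well the source-trained mapping carries over. In practice the stand-in line would be replaced by the trained ANN or RF, and the single feature by the DRN metric values.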