Ma Enhao, Guo Xuan, Hu Mingda, Wang Penghua, Wang Xin, Wei Congwen, Cheng Gong
School of Basic Medical Science, Tsinghua University, 30 Shuangqing Rd., Haidian District, Beijing, 100084, China.
Institute of Infectious Diseases, Shenzhen Bay Laboratory, Guangqiao Rd., Guangming District, Shenzhen, Guangdong, 518000, China.
Signal Transduct Target Ther. 2024 Dec 23;9(1):353. doi: 10.1038/s41392-024-02066-x.
Modeling and predicting mutations is critical for preparedness against COVID-19 and similar pandemics. However, existing predictive models have yet to integrate the regularity and randomness of viral mutations with minimal data requirements. Here, we develop a non-demanding language model that uses both regularity and randomness to predict candidate SARS-CoV-2 variants and mutations that might prevail. We constructed the "grammatical frameworks" of the available S1 sequences for dimension reduction and semantic representation, so that the model captures their latent regularity. The mutational profile, defined as the frequency of mutations, was introduced into the model to incorporate randomness. With this model, we identified several variants with significantly enhanced viral infectivity and immune evasion and validated them in wet-lab experiments. By inputting sequence data from three different time points, we detected circulating strains or vital mutations for the XBB.1.16, EG.5, JN.1, and BA.2.86 strains before their emergence. In addition, our results predicted previously unknown variants that may cause future epidemics. Supported by both data validation and experimental evidence, our study presents a fast-responding, concise, and promising language model, potentially generalizable to other viral pathogens, to forecast viral evolution and detect crucial mutation hot spots, thus warning of emerging variants that might raise public health concern.
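For illustration only: the abstract defines the mutational profile as the frequency of mutations. The Python sketch below shows one way such a profile could be computed, and it is not the authors' implementation. It assumes amino acid sequences that are already aligned to a reference and equal in length, it counts substitutions only, and the function name `mutational_profile` and the toy sequences are hypothetical.

```python
from collections import Counter


def mutational_profile(reference, sequences):
    """Return the fraction of sequences carrying each substitution.

    Assumption: every sequence is pre-aligned to `reference` and has the
    same length; insertions and deletions are not handled in this sketch.
    Keys are (1-based position, reference residue, observed residue).
    """
    if not sequences:
        raise ValueError("need at least one sequence")
    counts = Counter()
    for seq in sequences:
        if len(seq) != len(reference):
            raise ValueError("sequences must be aligned to the reference")
        for pos, (ref_aa, aa) in enumerate(zip(reference, seq), start=1):
            if aa != ref_aa:
                counts[(pos, ref_aa, aa)] += 1
    total = len(sequences)
    return {mutation: n / total for mutation, n in counts.items()}


# Toy data for demonstration, not real S1 sequences.
reference = "NLVKQ"
sequences = ["NLVKQ", "KLVKQ", "KLVRQ", "NLVKQ"]
profile = mutational_profile(reference, sequences)
```

In this toy input, two of the four sequences carry N1K and one carries K4R, so the profile assigns them frequencies of 0.5 and 0.25. How the published model weights or samples from such frequencies is not described in the abstract.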