Feng Yinan, Goldberg Emma E, Kupperman Michael, Zhang Xitong, Lin Youzuo, Ke Ruian
Earth and Environmental Sciences Division, Los Alamos National Laboratory, Los Alamos, NM, United States.
Theoretical Biology and Biophysics, Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM, United States.
Virus Evol. 2024 Nov 14;10(1):veae086. doi: 10.1093/ve/veae086. eCollection 2024.
With hundreds of SARS-CoV-2 lineages circulating in the global population, there is an ongoing need to forecast lineage frequencies and thereby identify rapidly expanding lineages. Accurate forecasts would allow more focused experimental efforts to understand the pathogenicity of future dominant lineages and to characterize the extent of their immune escape. Here, we first show that the inherent noise and biases in lineage frequency data make a commonly used regression-based approach unreliable. To address this weakness, we constructed a machine learning model for SARS-CoV-2 lineage frequency forecasting, called CovTransformer, based on the transformer architecture. We designed the model to cope with a limited amount of data that carries high levels of noise and bias. We first trained and tested the model using data from the UK and the USA, and then tested how well it generalizes to many other countries and to US states. Remarkably, the trained model makes accurate predictions two months into the future, both globally (in 31 countries with high levels of sequencing effort) and at the US state level. Our model performed substantially better than a widely used forecasting tool, the multinomial regression model implemented in Nextstrain, demonstrating its utility in SARS-CoV-2 monitoring. Provided that a newly emerged lineage has already been identified and assigned, our test using retrospective data shows that the model identifies dominant lineages, on average, 7 weeks before they become dominant. Overall, our work demonstrates that transformer models represent a promising approach for SARS-CoV-2 forecasting and pandemic monitoring.
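For readers unfamiliar with the regression-based baseline mentioned in the abstract, the following is a minimal sketch of a multinomial logistic growth model for lineage frequencies. It is not the Nextstrain implementation and not CovTransformer. The function names (`fit_multinomial_growth`, `forecast`), the plain gradient-ascent fit, and the rescaling of time to [0, 1] are all assumptions made for illustration. The model assumes that each lineage's log-odds change linearly in time, which is the kind of assumption that noisy and biased count data can strain.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the usual max-shift for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_multinomial_growth(counts, steps=5000, lr=0.5):
    """Fit freq_k(t) = softmax_k(a_k + b_k * t) to sequence counts.

    counts: (n_weeks, n_lineages) array of sequences per lineage per week.
    Returns intercepts a and growth coefficients b (hypothetical names),
    with time rescaled so the observed window spans [0, 1].
    """
    n_weeks, n_lineages = counts.shape
    t = np.linspace(0.0, 1.0, n_weeks)[:, None]
    weekly_total = counts.sum(axis=1, keepdims=True)
    grand_total = counts.sum()
    a = np.zeros(n_lineages)
    b = np.zeros(n_lineages)
    for _ in range(steps):
        p = softmax(a + b * t)
        # Gradient of the multinomial log-likelihood with respect to the
        # logits, scaled by the total count so the step size stays stable.
        resid = (counts - weekly_total * p) / grand_total
        a += lr * resid.sum(axis=0)
        b += lr * (resid * t).sum(axis=0)
    return a, b

def forecast(a, b, n_weeks, horizon):
    # Extrapolate the fitted linear logits `horizon` weeks past the data.
    dt = 1.0 / (n_weeks - 1)
    t_future = (1.0 + dt * np.arange(1, horizon + 1))[:, None]
    return softmax(a + b * t_future)
```

Calling `fit_multinomial_growth` on a weeks-by-lineages count matrix and then `forecast(a, b, n_weeks, horizon=8)` returns predicted frequencies for roughly two months ahead, one row per week. Because the forecast is a straight-line extrapolation of the logits, any noise or sampling bias in the most recent weeks propagates directly into the prediction.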