Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129, Turin, Italy.
Scuola Superiore Meridionale, Largo S. Marcellino 10, 80138, Naples, Italy.
Sci Rep. 2022 Jun 3;12(1):9275. doi: 10.1038/s41598-022-12442-8.
Never before has such a vast amount of data, including genome sequencing, been collected for a viral pandemic as for COVID-19. This makes it possible to trace the evolution of the virus and to assess, in real time, the role mutations play in its spread within the population. To this end, we focused on the Spike protein because of its central role in mediating viral outbreak and replication in host cells. Using the Levenshtein distance between Spike protein sequences, we designed a machine learning algorithm that yields a temporal clustering of the available dataset. From this clustering we were able to identify and define emerging persistent variants that agree with known evidence. Our algorithm defines persistent variants as chains of clusters that remain stable over time, and highlights emerging variants of epidemiological interest as branching events that occur over time. We thereby determined the relationship and temporal connection between variants of interest and the ensuing passage to dominance of the current variants of concern. Remarkably, the analysis and tools introduced in our work serve as an early warning for the emergence of new persistent variants once the associated cluster reaches 1% of the time-binned sequence data. We validated the approach and its effectiveness on the onset of the Alpha variant of concern. We further predict that the recently identified lineage AY.4.2 ('Delta plus') is giving rise to a new emerging variant. Comparing our findings with the epidemiological data, we demonstrated that each new wave is dominated by a new emerging variant, which supports the hypothesis of a strong correlation between the birth of variants and the multi-wave temporal pattern of the pandemic. These results allow us to introduce the epidemiology of variants, which we describe via the Mutation epidemiological Renormalisation Group framework.
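The two quantitative ingredients named in the abstract, the Levenshtein distance between sequences and the 1% share of a time bin used as an early-warning level, can be illustrated with a minimal Python sketch. This is not the authors' clustering algorithm, which the abstract does not specify. The function names `levenshtein` and `flag_emerging`, the toy peptide strings, and the assumption that each sequence in a time bin already carries a cluster label are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn one into the other."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def flag_emerging(bin_labels, threshold=0.01):
    """Return the cluster labels whose share of one time bin reaches the threshold.

    `bin_labels` is assumed to hold one (hypothetical) cluster label per sequence
    collected in that time bin; the 0.01 default mirrors the 1% level in the text.
    """
    counts = Counter(bin_labels)
    total = len(bin_labels)
    return sorted(label for label, n in counts.items() if n / total >= threshold)


# Toy usage: two short peptide-like strings and one time bin of 100 sequences.
distance = levenshtein("MFVFLVLLPLVSSQ", "MFVFFVLLPLVSQ")
flagged = flag_emerging(["A"] * 99 + ["B"])
```

In this sketch a cluster is flagged as soon as its fraction of the bin is at least 1%, so a single sequence out of 100 is enough, while one out of 200 is not. How clusters are built from the pairwise distances, and how they are linked across bins into chains and branching events, is left out because the abstract gives no detail on it.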