Department of Electronics, Information, and Bioengineering, Politecnico di Milano, 20133, Milan, Italy.
Sci Rep. 2021 Oct 26;11(1):21068. doi: 10.1038/s41598-021-00496-z.
Since its emergence in late 2019, the spread of SARS-CoV-2 has been associated with the evolution of its viral genome. The co-occurrence of specific amino acid changes, collectively named a 'virus variant', requires scrutiny (as variants may strongly affect the agent's transmission, pathogenesis, or antigenicity); variant evolution is studied using phylogenetics. Yet this problem has never been tackled by mining the data with ad hoc analysis techniques. Here we show that the emergence of variants can in fact be traced through data-driven methods, further capitalizing on the value of large collections of SARS-CoV-2 sequences. For all countries with sufficient data, we compute weekly counts of amino acid changes, unveil time-varying clusters of changes with similar, rapidly growing dynamics, and then follow their evolution. Our method succeeds in associating clusters with variants of interest/concern in a timely manner, provided their change composition is well characterized. This allows us to detect a variant's emergence, rise, peak, and eventual decline under the competitive pressure of another variant. Our early warning system, relying exclusively on deposited sequences, shows the power of big data in this context and adds to the calls for widespread public SARS-CoV-2 genome sequencing to improve surveillance and control of the COVID-19 pandemic.
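The counting step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the input layout (a collection date plus a list of amino acid change labels per sequence), the growth thresholds, and the toy records are all assumptions made for illustration. It only flags changes whose weekly count grows quickly and does not reproduce the time-varying clustering the abstract refers to.

```python
from collections import Counter, defaultdict
from datetime import date


def weekly_change_counts(sequences):
    """Count, per ISO week, how many sequences carry each amino acid change.

    `sequences` is an iterable of (collection_date, changes) pairs; this input
    layout is a hypothetical one chosen for the sketch.
    """
    counts = defaultdict(Counter)
    for collected, changes in sequences:
        iso = collected.isocalendar()
        week = (iso[0], iso[1])  # (ISO year, ISO week number)
        counts[week].update(set(changes))  # count each change once per sequence
    return counts


def growing_changes(counts, min_ratio=2.0, min_count=2):
    """Flag changes whose count grew by at least `min_ratio` week over week.

    Compares each week with the previous week present in the data. Both
    thresholds are illustrative assumptions, not values from the paper.
    """
    weeks = sorted(counts)
    flagged = {}
    for prev, curr in zip(weeks, weeks[1:]):
        flagged[curr] = {
            change
            for change, n in counts[curr].items()
            if n >= min_count and n >= min_ratio * max(counts[prev][change], 1)
        }
    return flagged


# Toy records for one country (invented for illustration only).
sequences = [
    (date(2021, 1, 4), ["S:N501Y", "S:D614G"]),
    (date(2021, 1, 5), ["S:D614G"]),
    (date(2021, 1, 11), ["S:N501Y", "S:D614G"]),
    (date(2021, 1, 12), ["S:N501Y", "S:D614G"]),
    (date(2021, 1, 13), ["S:N501Y"]),
]

counts = weekly_change_counts(sequences)
flagged = growing_changes(counts)
print(flagged)
```

In this toy input, the change that triples from one week to the next is flagged, while the one whose count stays flat is not. Changes flagged in the same weeks could then be treated as candidates for a common cluster, which is where a proper clustering of the weekly time series would take over.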