Unsupervised identification of significant lineages of SARS-CoV-2 through scalable machine learning methods.

Affiliations

Department of Mathematics, The University of Manchester, Manchester M13 9PL, United Kingdom.

United Kingdom Health Security Agency, University of Oxford, Oxford OX3 7LF, United Kingdom.

Publication information

Proc Natl Acad Sci U S A. 2024 Mar 19;121(12):e2317284121. doi: 10.1073/pnas.2317284121. Epub 2024 Mar 13.

DOI:10.1073/pnas.2317284121

PMID:38478692

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10962941/

Abstract

Since its emergence in late 2019, SARS-CoV-2 has diversified into a large number of lineages and caused multiple waves of infection globally. Novel lineages have the potential to spread rapidly and internationally if they have higher intrinsic transmissibility and/or can evade host immune responses, as has been seen with the Alpha, Delta, and Omicron variants of concern. They can also cause increased mortality and morbidity if they have increased virulence, as was seen for Alpha and Delta. Phylogenetic methods provide the "gold standard" for representing the global diversity of SARS-CoV-2 and for identifying newly emerging lineages. However, these methods are computationally expensive, struggle when datasets get too large, and require manual curation to designate new lineages. These challenges motivate the development of complementary methods that can incorporate all of the available genetic data, without down-sampling, to extract meaningful information rapidly and with minimal curation. In this paper, we demonstrate the utility of algorithmic approaches based on word-statistics to represent whole sequences, bringing speed, scalability, and interpretability to the construction of genetic topologies. While not a substitute for current phylogenetic analyses, the proposed methods can be used as a complementary, and fully automatable, approach to identify and confirm newly emerging variants.
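
The abstract does not spell out what "word-statistics" means in practice. One common reading, assumed here and not confirmed by the abstract, is a k-mer frequency profile: each sequence is represented by how often every length-k word occurs in it, and sequences are then compared through those profiles. The sketch below illustrates that idea only. The function names `kmer_profile` and `cosine_distance`, the choice of k, and the use of cosine distance are all hypothetical choices for illustration, not the authors' method.

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(seq, k=3, alphabet="ACGT"):
    """Return the normalised k-mer frequency vector of a sequence.

    Words are listed in lexicographic order over `alphabet`. Windows that
    contain any other symbol (for example the ambiguity code N) are ignored.
    Illustrative only: k and the normalisation are assumptions.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[w] for w in words)
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def cosine_distance(u, v):
    """Cosine distance between two profiles (1.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

Because each profile is computed from its sequence alone in a single pass, with no alignment or tree building, this kind of representation scales linearly with the number of genomes. The resulting vectors could then be passed to any standard clustering method.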
