Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, PA, USA.
College of Medicine, Drexel University, Philadelphia, PA, USA.
PLoS Comput Biol. 2020 Sep 17;16(9):e1008269. doi: 10.1371/journal.pcbi.1008269. eCollection 2020 Sep.
We propose an efficient framework for genetic subtyping of SARS-CoV-2, the novel coronavirus that causes COVID-19. Efficient viral subtyping enables visualization and modeling of the geographic distribution and temporal dynamics of disease spread. Subtyping thereby advances the development of effective containment strategies and, potentially, therapeutic and vaccine strategies. However, identifying viral subtypes in real time is challenging: SARS-CoV-2 is a novel virus, and the pandemic is rapidly expanding. Viral subtypes may be difficult to detect for three reasons: the virus evolves rapidly, founder effects are more significant than selection pressure, and the clustering threshold for subtyping is not standardized. We propose to identify mutational signatures of available SARS-CoV-2 sequences using a population-based approach: an entropy measure followed by frequency analysis. These signatures, Informative Subtype Markers (ISMs), define a compact set of nucleotide sites that characterize the most variable (and thus most informative) positions in the viral genomes sequenced from different individuals. Through ISM compression, we find that certain distant nucleotide variants covary, including non-coding and ORF1ab sites that covary with the D614G spike protein mutation, which has become increasingly prevalent as the pandemic has spread. ISMs are also useful for downstream analyses, such as spatiotemporal visualization of viral dynamics. Using sequence data available in the GISAID database, we validate the utility of ISM-based subtyping by comparing ISM-based spatiotemporal analyses to epidemiological studies of viral transmission in Asia, Europe, and the United States. In addition, we show the relationship of ISMs to phylogenetic reconstructions of SARS-CoV-2 evolution; ISMs can therefore play an important complementary role to phylogenetic tree-based analysis, such as that done in the Nextstrain project.
The developed pipeline dynamically generates ISMs for newly added SARS-CoV-2 sequences and updates the visualization of pandemic spatiotemporal dynamics. It is available on GitHub at https://github.com/EESI/ISM (Jupyter notebook) and https://github.com/EESI/ncov_ism (command line tool), and via an interactive website at https://covid19-ism.coe.drexel.edu/.
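The idea of "an entropy measure followed by frequency analysis" can be illustrated with a minimal sketch. It is not the published pipeline: the function names, the toy alignment, and the 0.5-bit entropy threshold are all assumptions made for illustration, and the sketch ignores practical issues such as ambiguous bases, gaps, and reference-based alignment.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (in bits) of the bases observed at one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def informative_positions(aligned, min_entropy):
    """Indices of columns whose entropy meets the (assumed) threshold."""
    length = len(aligned[0])
    return [
        i for i in range(length)
        if site_entropy([seq[i] for seq in aligned]) >= min_entropy
    ]


def ism_frequencies(aligned, min_entropy=0.5):
    """Build one marker string per sequence from the informative columns,
    then count how often each marker occurs in the population."""
    positions = informative_positions(aligned, min_entropy)
    isms = ["".join(seq[i] for i in positions) for seq in aligned]
    return positions, Counter(isms)


# Hypothetical toy alignment: five equal-length sequences.
toy = ["ACGTA", "ACGTA", "ATGTC", "ATGTC", "ACGTC"]
positions, freqs = ism_frequencies(toy)
print(positions)            # informative column indices
print(freqs.most_common())  # marker strings with their counts
```

In this toy case only the two variable columns survive the entropy filter, so each five-base sequence is compressed to a two-base marker, and the marker counts give a simple picture of subtype abundance.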