Emerging Pathogens Institute, University of Florida, Gainesville, FL, United States.
Department of Pathology, University of Florida, Gainesville, FL, United States.
JMIR Public Health Surveill. 2020 Jun 1;6(2):e19170. doi: 10.2196/19170.
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has been growing exponentially, affecting over 4 million people and causing enormous distress to economies and societies worldwide. A plethora of analyses based on viral sequences has already been published both in scientific journals and through non-peer-reviewed channels to investigate the genetic heterogeneity and spatiotemporal dissemination of SARS-CoV-2. However, a systematic investigation of phylogenetic information and sampling bias in the available data is lacking. Although the number of available genome sequences of SARS-CoV-2 is growing daily and the sequences show increasing phylogenetic information, country-specific data still present severe limitations and should be interpreted with caution.
The objective of this study was to determine the quality of the currently available SARS-CoV-2 full genome data in terms of sampling bias as well as phylogenetic and temporal signals to inform and guide the scientific community.
We used maximum likelihood-based methods to assess the presence of sufficient information for robust phylogenetic and phylogeographic studies in several SARS-CoV-2 sequence alignments assembled from GISAID (Global Initiative on Sharing All Influenza Data) data released between March and April 2020.
Although the number of high-quality full genomes is growing daily, and sequence data released in April 2020 contain sufficient phylogenetic information to allow reliable inference of phylogenetic relationships, country-specific SARS-CoV-2 data sets still present severe limitations.
At present, studies assessing within-country spread or transmission clusters should be considered preliminary or, at best, hypothesis-generating. Hence, current reports should be interpreted with caution, and concerted efforts should continue to increase the number and quality of sequences required for robust tracing of the epidemic.
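The abstract does not name the specific tests used to evaluate temporal signal. As an illustration only, one widely used check is a root-to-tip regression: genetic distance from the tree root to each tip is regressed on sampling date. A positive slope with a reasonable fit suggests the sequences carry measurable clock-like signal. The sketch below assumes the root-to-tip distances have already been extracted from a maximum likelihood tree. The function names and any input values are hypothetical and do not come from the study.

```python
from datetime import date


def decimal_year(d):
    """Convert a calendar date to a fractional year (e.g. 2020.25)."""
    start = date(d.year, 1, 1)
    end = date(d.year + 1, 1, 1)
    return d.year + (d - start).days / (end - start).days


def root_to_tip_regression(sampling_dates, root_to_tip_distances):
    """Ordinary least-squares fit of root-to-tip distance on sampling date.

    Returns (slope, intercept, r_squared). The slope approximates the
    substitution rate in substitutions/site/year; r_squared is a rough
    indicator of how clock-like the data are. This is an illustrative
    sketch, not the procedure used in the study.
    """
    if len(sampling_dates) != len(root_to_tip_distances):
        raise ValueError("dates and distances must have the same length")
    n = len(sampling_dates)
    if n < 3:
        raise ValueError("need at least three tips for a meaningful fit")

    xs = [decimal_year(d) for d in sampling_dates]
    ys = list(root_to_tip_distances)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Centered sums of squares and cross-products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in dates or distances; cannot fit")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared
```

When the slope is positive, the x-intercept, `-intercept / slope`, gives a crude estimate of the date of the most recent common ancestor. A slope near zero or a low `r_squared` in a country-specific alignment would be consistent with the limited temporal signal the authors caution about.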