Roder A E, Johnson K E E, Knoll M, Khalfan M, Wang B, Schultz-Cherry S, Banakis S, Kreitman A, Mederos C, Youn J-H, Mercado R, Wang W, Ruchnewitz D, Samanovic M I, Mulligan M J, Lässig M, Łuksza M, Das S, Gresham D, Ghedin E
bioRxiv. 2022 Aug 16:2021.05.05.442873. doi: 10.1101/2021.05.05.442873.
High error rates of viral RNA-dependent RNA polymerases lead to diverse intra-host viral populations during infection. Errors made during replication that are not strongly deleterious to the virus can lead to the generation of minority variants. However, accurate detection of minority variants in viral sequence data is complicated by errors introduced during sample preparation and data analysis. We used synthetic RNA controls and simulated data to test seven variant calling tools across a range of allele frequencies and simulated coverages. We show that the choice of variant caller and the use of replicate sequencing have the most significant impact on single nucleotide variant (SNV) discovery, and we demonstrate how allele frequency and coverage thresholds affect both false discovery and false negative rates. We use these parameters to find minority variants in sequencing data from SARS-CoV-2 clinical specimens and provide guidance for studies of intrahost viral diversity using either single replicate data or data from technical replicates. Our study provides a framework for rigorous assessment of technical factors that impact SNV identification in viral samples and establishes heuristics that will inform and improve future studies of intrahost variation, viral diversity, and viral evolution.
When viruses replicate inside a host, the virus replication machinery makes mistakes. Over time, these mistakes create mutations that result in a diverse population of viruses inside the host. Mutations that are neither lethal to the virus nor strongly beneficial can lead to minority variants, which are minor members of the virus population. However, preparing samples for sequencing can also introduce errors that resemble minority variants, and these errors are included as false positives if they are not filtered correctly. In this study, we aimed to determine the best methods for identification and quantification of these minority variants by testing the performance of seven commonly used variant calling tools. We used simulated and synthetic data to test their performance against a true set of variants, and then used these studies to inform variant identification in data from SARS-CoV-2 clinical specimens. Together, analyses of our data provide extensive guidance for future studies of viral diversity and evolution.
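The comparison against a true set of variants can be illustrated with a minimal sketch. It filters SNV calls by allele frequency and coverage thresholds, optionally keeps only calls found in both technical replicates, and computes false discovery and false negative rates. Everything in it is an assumption made for illustration: the `Call` record, the function names, the toy calls, and the threshold values (1% allele frequency, 200x coverage) are hypothetical and are not taken from the study.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Call(NamedTuple):
    """One SNV call (hypothetical record layout, not a real tool's output)."""
    pos: int      # 1-based genome position
    alt: str      # alternate base
    freq: float   # allele frequency, 0..1
    depth: int    # read coverage at the position


Key = Tuple[int, str]


def filter_calls(calls: Iterable[Call], min_freq: float, min_depth: int) -> Set[Key]:
    """Return (pos, alt) keys of calls that pass both thresholds."""
    return {(c.pos, c.alt) for c in calls
            if c.freq >= min_freq and c.depth >= min_depth}


def error_rates(called: Set[Key], truth: Set[Key]) -> Tuple[float, float]:
    """Return (false discovery rate, false negative rate) against a truth set."""
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    fnr = fn / (tp + fn) if (tp + fn) else 0.0
    return fdr, fnr


# Toy truth set and replicate calls, invented for this example.
truth = {(100, "A"), (250, "T"), (400, "G")}
rep1 = [Call(100, "A", 0.050, 500), Call(250, "T", 0.020, 800),
        Call(300, "C", 0.012, 900), Call(400, "G", 0.030, 150)]
rep2 = [Call(100, "A", 0.060, 450), Call(250, "T", 0.025, 700),
        Call(350, "G", 0.015, 600), Call(400, "G", 0.028, 300)]

MIN_FREQ = 0.01   # illustrative threshold only
MIN_DEPTH = 200   # illustrative threshold only

single = filter_calls(rep1, MIN_FREQ, MIN_DEPTH)
# Replicate filter: keep only variants that pass in both replicates.
merged = single & filter_calls(rep2, MIN_FREQ, MIN_DEPTH)

print("single replicate (FDR, FNR):", error_rates(single, truth))
print("two replicates   (FDR, FNR):", error_rates(merged, truth))
```

In this toy data, the replicate intersection removes the one false positive seen in a single replicate, while the true variant at position 400 is still missed because its coverage in the first replicate falls below the threshold. This mirrors the trade-off between false discovery and false negative rates described above, but the numbers themselves carry no meaning beyond the example.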