The Pirbright Institute, Woking, Surrey GU24 0NF, UK.
Department of Microbial and Cellular Sciences, Faculty of Health and Medical Sciences, School of Biosciences and Medicine, University of Surrey, Guildford GU2 7XH, UK.
Viruses. 2020 Oct 20;12(10):1187. doi: 10.3390/v12101187.
High-throughput sequencing, such as that provided by Illumina platforms, is an efficient way to understand sequence variation within viral populations. However, challenges exist in distinguishing process-introduced error from biological variance, which significantly impacts our ability to identify sub-consensus single-nucleotide variants (SNVs). Here we have taken a systematic approach to evaluating laboratory and bioinformatic pipelines for the accurate identification of low-frequency SNVs in viral populations. Artificial DNA and RNA "populations" were created by introducing known SNVs at predetermined frequencies into template nucleic acid, which was then sequenced on an Illumina MiSeq platform. These populations were used to assess the effects of abundance and type of starting input material, technical replicates, read length and quality, short-read aligner, and percentage frequency thresholds on the ability to accurately call variants. Analyses revealed that the abundance and type of input nucleic acid had the greatest impact on the accuracy of SNV calling, as measured by a micro-averaged Matthews correlation coefficient score, with DNA and high RNA inputs (10 copies) allowing variants to be called at a 0.2% frequency. Reduced input RNA (10 copies) required more technical replicates to maintain accuracy, while low RNA inputs (10 copies) suffered from consensus-level errors. Base errors occurring at specific motifs in all technical replicates were also identified; these can be excluded to further increase SNV calling accuracy. These findings indicate that samples with low RNA inputs should be excluded from SNV calling and reinforce the importance of optimising the technical and bioinformatic steps in pipelines that are used to accurately identify sequence variants.
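The accuracy measure named above can be illustrated with a minimal sketch. It uses the standard definition of the Matthews correlation coefficient and assumes that "micro-averaged" means pooling true/false positive and negative counts across samples before computing a single score. The function names, the `min_replicates` parameter, and the site representation are hypothetical and are not taken from the study's pipeline; only the 0.2% threshold comes from the text.

```python
import math


def call_snvs(replicate_freqs, threshold=0.002, min_replicates=2):
    """Call a site as variant if its observed frequency reaches `threshold`
    (0.002 = 0.2%) in at least `min_replicates` technical replicates.

    replicate_freqs: list of dicts mapping site -> observed variant frequency.
    `min_replicates` is an illustrative assumption, not a value from the study.
    """
    support = {}
    for freqs in replicate_freqs:
        for site, freq in freqs.items():
            if freq >= threshold:
                support[site] = support.get(site, 0) + 1
    return {site for site, n in support.items() if n >= min_replicates}


def confusion_counts(called, truth, all_sites):
    """Return (TP, FP, TN, FN) for one sample.

    Assumes `called` and `truth` are both subsets of `all_sites`.
    """
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    tn = len(all_sites) - tp - fp - fn
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def micro_averaged_mcc(per_sample_counts):
    """Pool (TP, FP, TN, FN) tuples over all samples, then compute one MCC."""
    tp, fp, tn, fn = (sum(column) for column in zip(*per_sample_counts))
    return mcc(tp, fp, tn, fn)
```

In this sketch, a score of 1 means every introduced SNV was called with no false calls, while a score near 0 means the calls are no better than chance. Pooling the counts before scoring lets samples with many candidate sites weigh more than small ones, which is the usual motivation for micro-averaging.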