Department of Pediatrics and Rady Children's Hospital, University of California San Diego, San Diego, USA.
BMC Bioinformatics. 2014 May 2;15:125. doi: 10.1186/1471-2105-15-125.
Genotypes generated in next-generation sequencing studies contain errors that can significantly reduce the power to detect signals in common and rare variant association tests. These genotyping errors are not explicitly filtered by the standard GATK Variant Quality Score Recalibration (VQSR) tool and thus remain a source of error in whole exome sequencing (WES) projects that follow GATK's recommended best practices. Additional data filtering methods are therefore required to remove these errors effectively before performing association analyses with complex phenotypes. Here, we empirically derive thresholds for genotype and variant filters that, when used in conjunction with the VQSR tool, achieve higher data quality than VQSR alone.
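To make the idea of a genotype-level filter concrete, the following is a minimal sketch that sets a call to missing when it falls below per-genotype cutoffs. The choice of read depth and genotype quality as the filtered metrics, the cutoff values, and the names `GenotypeCall` and `apply_genotype_filter` are all assumptions made for illustration; the abstract does not state which metrics or thresholds were derived.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoffs for illustration only; the abstract does not
# report the empirically derived threshold values.
MIN_DEPTH = 8
MIN_GQ = 20


@dataclass(frozen=True)
class GenotypeCall:
    sample: str
    genotype: Optional[str]  # e.g. "0/1"; None means a missing call
    depth: int               # read depth at the site for this sample
    gq: int                  # genotype quality for this sample


def apply_genotype_filter(call: GenotypeCall,
                          min_depth: int = MIN_DEPTH,
                          min_gq: int = MIN_GQ) -> GenotypeCall:
    """Return the call unchanged if it passes both cutoffs,
    otherwise a copy with the genotype set to missing."""
    if call.genotype is None:
        return call
    if call.depth < min_depth or call.gq < min_gq:
        return GenotypeCall(call.sample, None, call.depth, call.gq)
    return call
```

Setting a failing genotype to missing, rather than dropping the whole variant, keeps the remaining samples' calls at that site available to a later variant-level filter.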
The detailed filtering strategies improve the concordance of sequenced genotypes with array genotypes from 99.33% to 99.77%, raise the percentage of discordant genotypes removed from 10.5% to 69.5%, and improve the Ti/Tv ratio from 2.63 to 2.75. We also demonstrate that managing batch effects by separating samples according to their target capture and sequencing chemistry protocols yields a final data set containing 40.9% more high-quality variants. In addition, imputation is an important component of WES studies, where it is used to estimate common variant genotypes and so generate additional markers for association analyses. We therefore demonstrate filtering methods for imputed data that improve genotype concordance from 79.3% to 99.8% while removing 99.5% of discordant genotypes.
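The two quality metrics reported above can be sketched as follows. This is an illustrative calculation under stated assumptions, not the authors' pipeline: genotypes are encoded as alternate-allele dosages (0, 1, 2, or `None` for missing), concordance is computed only over genotypes called on both platforms, and the function names and input shapes are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Optional, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def concordance(seq_calls: Dict[Hashable, Optional[int]],
                array_calls: Dict[Hashable, Optional[int]]) -> float:
    """Fraction of matching genotypes among those called on both platforms.

    Keys identify a (sample, site) pair; values are alternate-allele
    dosages, with None marking a missing call.
    """
    shared = [key for key, dosage in seq_calls.items()
              if dosage is not None and array_calls.get(key) is not None]
    if not shared:
        raise ValueError("no genotypes called on both platforms")
    matches = sum(1 for key in shared if seq_calls[key] == array_calls[key])
    return matches / len(shared)


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes.

    Assumes a single-nucleotide variant with ref != alt.
    """
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions over (ref, alt) SNV pairs."""
    snv_list = list(snvs)
    transitions = sum(1 for ref, alt in snv_list if is_transition(ref, alt))
    transversions = len(snv_list) - transitions
    if transversions == 0:
        raise ValueError("no transversions; ratio is undefined")
    return transitions / transversions
```

Because filtered genotypes become missing and drop out of the shared set, concordance should be read alongside the fraction of discordant genotypes removed, as the abstract does.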
The described filtering methods are advantageous for large population-based WES studies designed to identify common and rare variation associated with complex diseases. Compared with data processed through standard practices, these strategies yield substantially higher-quality data for common and rare variant association analyses.