Effective filtering strategies to improve data quality from population-based whole exome sequencing studies.

Affiliation

Department of Pediatrics and Rady Children's Hospital, University of California San Diego, San Diego, USA.

Publication information

BMC Bioinformatics. 2014 May 2;15:125. doi: 10.1186/1471-2105-15-125.

DOI:10.1186/1471-2105-15-125

PMID:24884706

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4098776/

Abstract

BACKGROUND

Genotypes generated in next generation sequencing studies contain errors which can significantly impact the power to detect signals in common and rare variant association tests. These genotyping errors are not explicitly filtered by the standard GATK Variant Quality Score Recalibration (VQSR) tool and thus remain a source of errors in whole exome sequencing (WES) projects that follow GATK's recommended best practices. Therefore, additional data filtering methods are required to effectively remove these errors before performing association analyses with complex phenotypes. Here we empirically derive thresholds for genotype and variant filters that, when used in conjunction with the VQSR tool, achieve higher data quality than when using VQSR alone.
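As a rough illustration of what genotype-level and variant-level filters applied after VQSR might look like, the following minimal Python sketch sets low-confidence genotypes to missing and then drops variants with a low call rate. The abstract does not state the thresholds the authors derived, so the values `MIN_DEPTH`, `MIN_GQ`, and `MIN_CALL_RATE`, the `GenotypeCall` type, and the function names are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

# Hypothetical thresholds for illustration only; the abstract does not give
# the empirically derived values.
MIN_DEPTH = 8          # minimum read depth (DP) per genotype
MIN_GQ = 20            # minimum genotype quality (GQ) per genotype
MIN_CALL_RATE = 0.9    # minimum fraction of non-missing genotypes per variant


@dataclass(frozen=True)
class GenotypeCall:
    sample: str
    genotype: Optional[str]  # e.g. "0/1"; None means missing
    depth: int               # DP
    gq: int                  # GQ


def apply_genotype_filter(call: GenotypeCall,
                          min_depth: int = MIN_DEPTH,
                          min_gq: int = MIN_GQ) -> GenotypeCall:
    """Return the call unchanged, or with genotype set to missing if it
    falls below either threshold."""
    if call.genotype is None:
        return call
    if call.depth < min_depth or call.gq < min_gq:
        return replace(call, genotype=None)
    return call


def call_rate(calls: List[GenotypeCall]) -> float:
    """Fraction of calls at one variant that are non-missing."""
    if not calls:
        return 0.0
    return sum(c.genotype is not None for c in calls) / len(calls)


def keep_variant(calls: List[GenotypeCall],
                 min_call_rate: float = MIN_CALL_RATE) -> bool:
    """Variant-level filter: keep the variant only if enough genotypes
    survive the genotype-level filter."""
    filtered = [apply_genotype_filter(c) for c in calls]
    return call_rate(filtered) >= min_call_rate
```

Applying the genotype filter first and the call-rate filter second means that a variant is removed only when too many of its individual genotypes are unreliable, which is one plausible way to combine the two levels of filtering.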

RESULTS

The detailed filtering strategies improve the concordance of sequenced genotypes with array genotypes from 99.33% to 99.77%; improve the percent of discordant genotypes removed from 10.5% to 69.5%; and improve the Ti/Tv ratio from 2.63 to 2.75. We also demonstrate that managing batch effects by separating samples based on different target capture and sequencing chemistry protocols results in a final data set containing 40.9% more high-quality variants. In addition, imputation is an important component of WES studies and is used to estimate common variant genotypes to generate additional markers for association analyses. As such, we demonstrate filtering methods for imputed data that improve genotype concordance from 79.3% to 99.8% while removing 99.5% of discordant genotypes.
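The three quality metrics reported above (genotype concordance with array data, percent of discordant genotypes removed, and Ti/Tv ratio) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: genotypes are assumed to be stored as dictionaries keyed by hypothetical `(sample, variant)` pairs, with `None` marking a missing or filtered genotype, and all function names are illustrative.

```python
import math
from typing import Dict, Hashable, Iterable, Optional, Tuple

Genotypes = Dict[Tuple[str, str], Optional[Hashable]]

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Ti/Tv ratio over (ref, alt) pairs; non-SNV entries are skipped."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else math.nan


def _comparable(seq: Genotypes, array: Genotypes):
    """Keys with a non-missing genotype in both data sets."""
    return [k for k, g in seq.items()
            if g is not None and array.get(k) is not None]


def concordance(seq: Genotypes, array: Genotypes) -> float:
    """Fraction of comparable genotypes where sequencing matches the array."""
    keys = _comparable(seq, array)
    if not keys:
        return math.nan
    return sum(seq[k] == array[k] for k in keys) / len(keys)


def fraction_discordant_removed(before: Genotypes, after: Genotypes,
                                array: Genotypes) -> float:
    """Of the genotypes discordant before filtering, the fraction that are
    missing (filtered out) afterwards."""
    discordant = [k for k in _comparable(before, array)
                  if before[k] != array[k]]
    if not discordant:
        return math.nan
    removed = sum(after.get(k) is None for k in discordant)
    return removed / len(discordant)
```

Concordance is computed only over genotypes present in both data sets, so filtering raises it by removing discordant calls rather than by changing any genotype.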

CONCLUSIONS

The described filtering methods are advantageous for large population-based WES studies designed to identify common and rare variation associated with complex diseases. Compared to data processed through standard practices, these strategies result in substantially higher quality data for common and rare association analyses.
