Pirooznia Mehdi, Kramer Melissa, Parla Jennifer, Goes Fernando S, Potash James B, McCombie W Richard, Zandi Peter P
Department of Psychiatry and Behavioral Sciences, Johns Hopkins University, Baltimore, MD 21205, USA.
Hum Genomics. 2014 Jul 30;8(1):14. doi: 10.1186/1479-7364-8-14.
The processing and analysis of the large-scale data generated by next-generation sequencing (NGS) experiments is challenging and is a burgeoning area of new methods development. Several new bioinformatics tools have been developed for calling sequence variants from NGS data. Here, we validate the variant calling of these tools and compare their relative accuracy to determine which data processing pipeline is optimal.
We developed a unified pipeline for processing NGS data that encompasses four modules: mapping, filtering, realignment and recalibration, and variant calling. We processed 130 subjects from an ongoing whole exome sequencing study through this pipeline. To evaluate the accuracy of each module, we conducted a series of comparisons between the single nucleotide variant (SNV) calls from the NGS data and either gold-standard Sanger sequencing on a total of 700 variants or array genotyping data on a total of 9,935 single-nucleotide polymorphisms. A head-to-head comparison showed that the Genome Analysis Toolkit (GATK) provided more accurate calls than SAMtools (positive predictive value of 92.55% vs. 80.35%, respectively). Realignment of mapped reads and recalibration of base quality scores before SNV calling proved to be crucial to accurate variant calling. The GATK HaplotypeCaller algorithm for variant calling outperformed the UnifiedGenotyper algorithm. We also showed a relationship between SNV call accuracy and mapping quality, read depth, and allele balance. However, if best practices are used in data processing, then additional filtering based on these metrics provides little gain, and accuracies of >99% are achievable.
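The positive predictive value reported above is the fraction of NGS-derived SNV calls confirmed by the gold standard. A minimal sketch of that calculation, using hypothetical variant calls and Sanger-confirmed genotypes (the function name and toy data are illustrative, not from the study):

```python
def positive_predictive_value(calls, gold_standard):
    """PPV = true positives / (true positives + false positives).

    `calls` and `gold_standard` map genomic positions to alternate
    alleles; a call counts as a true positive when the gold-standard
    allele at that position agrees, and as a false positive otherwise.
    """
    tp = sum(1 for pos, allele in calls.items()
             if gold_standard.get(pos) == allele)
    fp = len(calls) - tp
    return tp / (tp + fp)

# Hypothetical example: 4 of 5 NGS calls confirmed by Sanger sequencing.
ngs_calls = {100: "A", 200: "T", 300: "G", 400: "C", 500: "A"}
sanger = {100: "A", 200: "T", 300: "G", 400: "C", 500: "T"}
print(positive_predictive_value(ngs_calls, sanger))  # 0.8
```

The same comparison against the 9,935 array-genotyped SNPs follows the identical logic, with the genotyping array supplying the gold-standard alleles.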
Our findings will help to determine the best approach for processing NGS data to confidently call variants for downstream analyses. To enable others to implement and replicate our results, all of our code is freely available at http://metamoodics.org/wes.