Comparing the performance of selected variant callers using synthetic data and genome segmentation.

Affiliations

Center for Biomedical Informatics and Information Technology, National Cancer Institute, Rockville, MD, 20850, USA.

Cancer Genomics Research Laboratory(CGR), Division of Cancer Epidemiology and Genetics, Frederick National Laboratory for Cancer Research sponsored by the National Cancer Institute, 8717 Grovemont Circle, Gaithersburg, MD, 20877, USA.

Publication information

BMC Bioinformatics. 2018 Nov 19;19(1):429. doi: 10.1186/s12859-018-2440-7.

DOI:10.1186/s12859-018-2440-7

PMID:30453880

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6245711/

Abstract

BACKGROUND

High-throughput sequencing has rapidly become an essential part of precision cancer medicine, but validating the results of analyzing and interpreting genomic data remains a rate-limiting step. The gold standard remains manual validation by expert panels, which has two weaknesses: high costs in both funding and time, and the necessarily selective nature of manual review. More economical, complementary means of validation may be possible. In this study we used four synthetic data sets of increasing complexity, each with known mutations spiked into specific genomic locations, to assess the sensitivity, specificity, and balanced accuracy of five open-source variant callers: FreeBayes v1.0, VarDict v11.5.1, MuTect v1.1.7, MuTect2, and MuSE v1.0rc. FreeBayes, VarDict, and MuTect were run in bcbio-nextgen, and their results were integrated into a single Ensemble call set. The known mutations provided a "ground truth" against which we evaluated variant-caller performance. To facilitate comparison and evaluation, we segmented the whole genome into 10,000,000-base-pair fragments, which yielded 316 segments.
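The per-segment evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the representation of variants as `(chrom, pos)` tuples, and especially the `candidates` set (an assumed universe of evaluated sites, needed to define true negatives for specificity) are all hypothetical choices made for illustration.

```python
from collections import defaultdict

SEGMENT_SIZE = 10_000_000  # 10 Mb fragments, as stated in the abstract


def segment_index(pos, segment_size=SEGMENT_SIZE):
    """Return the 0-based segment index for a 1-based genomic position."""
    return (pos - 1) // segment_size


def count_segments(chrom_lengths, segment_size=SEGMENT_SIZE):
    """Count fixed-size segments over all chromosomes (last one may be partial)."""
    return sum(-(-length // segment_size) for length in chrom_lengths.values())


def confusion_by_segment(truth, calls, candidates, segment_size=SEGMENT_SIZE):
    """Tally TP/FP/TN/FN per (chrom, segment).

    truth, calls, candidates are sets of (chrom, pos). `candidates` is an
    ASSUMED universe of evaluated sites; the abstract does not say how
    negatives were defined.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for site in candidates | truth | calls:
        chrom, pos = site
        key = (chrom, segment_index(pos, segment_size))
        in_truth, called = site in truth, site in calls
        if in_truth and called:
            counts[key]["tp"] += 1
        elif called:
            counts[key]["fp"] += 1
        elif in_truth:
            counts[key]["fn"] += 1
        else:
            counts[key]["tn"] += 1
    return dict(counts)


def metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, balanced accuracy)."""
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec, (sens + spec) / 2
```

Balanced accuracy is the mean of sensitivity and specificity, so it is not dominated by the far more numerous negative sites the way plain accuracy would be.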

RESULTS

For data sets one through three, the numbers of true positives differed little among the callers, but the numbers of false positives varied much more. Both FreeBayes and VarDict produced strikingly more false positives than the others, although VarDict, somewhat paradoxically, also produced the highest number of true positives. The Ensemble approach yielded higher specificity and balanced accuracy, and fewer false positives, than any of the five tools used alone. Sensitivity and specificity declined for all five callers as the complexity of the data sets increased. We found only limited, weak correlations between caller performance and two DNA structural features: gene density and guanine-cytosine content. Overall, MuTect2 performed best among the callers tested, followed by MuSE and MuTect.
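One plausible reason an ensemble call set has fewer false positives is that a variant must be supported by more than one caller. The sketch below assumes a simple "at least k callers agree" voting rule. This is an illustrative assumption and may differ from the ensemble procedure bcbio-nextgen actually applies; the function name, the `min_support` parameter, and the example variants are hypothetical.

```python
from collections import Counter


def ensemble_calls(call_sets, min_support=2):
    """Keep variants reported by at least `min_support` callers.

    call_sets maps a caller name to a set of variants, each represented
    here as a (chrom, pos, ref, alt) tuple. The voting rule is an
    ASSUMPTION made for illustration.
    """
    support = Counter()
    for calls in call_sets.values():
        support.update(set(calls))  # each caller votes once per variant
    return {variant for variant, n in support.items() if n >= min_support}


# Toy example with made-up variants
a = ("chr1", 100, "A", "T")
b = ("chr1", 200, "G", "C")
c = ("chr2", 300, "C", "G")
d = ("chr2", 400, "T", "A")
consensus = ensemble_calls(
    {"freebayes": {a, b, c}, "vardict": {a, b, d}, "mutect": {a}},
    min_support=2,
)
```

Here `c` and `d` are each reported by a single caller and are dropped, which shows how caller-specific false positives are filtered at the cost of losing any true variant that only one caller detects.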

CONCLUSIONS

Spiking data sets with specific mutations at known locations in the genome (single-nucleotide variations (SNVs), single-nucleotide polymorphisms (SNPs), or structural variations (SVs) in this study) provides an effective and economical way to compare data analyzed by variant callers with ground truth. The method constitutes a viable alternative to the prolonged, expensive, and noncomprehensive assessment by expert panels. It should be further developed and refined, as should other comparatively "lightweight" methods of assessing accuracy. Given that the scientific community has not yet established gold standards for validating NGS-related technologies such as variant callers, developing multiple alternative means of verifying variant-caller accuracy will eventually lead to higher-quality standards than could be achieved by prematurely limiting the range of innovative methods explored by members of the community.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6d67/6245711/3f5130485e3f/12859_2018_2440_Fig1_HTML.jpg
