Comparing the performance of selected variant callers using synthetic data and genome segmentation.

Author information

Center for Biomedical Informatics and Information Technology, National Cancer Institute, Rockville, MD, 20850, USA.

Cancer Genomics Research Laboratory(CGR), Division of Cancer Epidemiology and Genetics, Frederick National Laboratory for Cancer Research sponsored by the National Cancer Institute, 8717 Grovemont Circle, Gaithersburg, MD, 20877, USA.

Publication information

BMC Bioinformatics. 2018 Nov 19;19(1):429. doi: 10.1186/s12859-018-2440-7.

Abstract

BACKGROUND

High-throughput sequencing has rapidly become an essential part of precision cancer medicine, but validating the results obtained from analyzing and interpreting genomic data remains a rate-limiting factor. The gold standard remains manual validation by expert panels, which has its weaknesses: high costs in both funding and time, and the necessarily selective nature of manual validation. It may nonetheless be possible to develop more economical, complementary means of validation. In this study we employed four synthetic data sets (variants with known mutations spiked into specific genomic locations) of increasing complexity to assess the sensitivity, specificity, and balanced accuracy of five open-source variant callers: FreeBayes v1.0, VarDict v11.5.1, MuTect v1.1.7, MuTect2, and MuSE v1.0rc. FreeBayes, VarDict, and MuTect were run in bcbio-nextgen, and the results were integrated into a single Ensemble call set. The known mutations provided a level of "ground truth" against which we evaluated variant-caller performance. We further facilitated the comparison and evaluation by segmenting the whole genome into 10,000,000 base-pair fragments, which yielded 316 segments.
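The fixed-size segmentation described above can be illustrated with a minimal sketch. It assumes half-open `[start, end)` windows and uses made-up chromosome names and lengths (`chrA`, `chrB`); the function names are hypothetical and this is not the authors' implementation, so it does not reproduce their 316 segments.

```python
from typing import Dict, List, Tuple

SEGMENT_SIZE = 10_000_000  # 10,000,000 bp, the fragment size stated in the abstract


def segment_genome(
    chrom_lengths: Dict[str, int], size: int = SEGMENT_SIZE
) -> List[Tuple[str, int, int]]:
    """Split each chromosome into consecutive half-open [start, end) windows.

    The last window of a chromosome is shorter when the chromosome length
    is not a multiple of `size`.
    """
    segments = []
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, size):
            segments.append((chrom, start, min(start + size, length)))
    return segments


def segment_index(pos: int, size: int = SEGMENT_SIZE) -> int:
    """Return the 0-based index, within its chromosome, of the window holding a 0-based position."""
    return pos // size


# Toy chromosome lengths (assumed values, not a real reference genome).
toy_lengths = {"chrA": 25_000_000, "chrB": 10_000_000}
segments = segment_genome(toy_lengths)
```

With a real reference, `chrom_lengths` would come from the assembly's sequence dictionary or `.fai` index. The segment count then depends on which contigs are included.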

RESULTS

Differences in the number of true positives were small among the callers, but the number of false positives varied much more when the tools were used to analyze sets one through three. Both FreeBayes and VarDict produced strikingly more false positives than did the others, although VarDict, somewhat paradoxically, also produced the highest number of true positives. The Ensemble approach yielded results characterized by higher specificity and balanced accuracy and fewer false positives than did any of the five tools used alone. Sensitivity and specificity, however, declined for all five callers as the complexity of the data sets increased, and we uncovered only limited, weak correlations between caller performance and certain DNA structural features: gene density and guanine-cytosine content. Altogether, MuTect2 performed the best among the callers tested, followed by MuSE and MuTect.
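For reference, the three metrics named in this abstract have standard definitions in terms of true/false positives and negatives. The sketch below uses those textbook formulas with invented counts. The abstract does not say how true negatives were counted (for example, per site or per segment), so the inputs here are purely illustrative.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True-positive rate: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


def specificity(tn: int, fp: int) -> float:
    """True-negative rate: TN / (TN + FP)."""
    return tn / (tn + fp) if (tn + fp) else float("nan")


def balanced_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Arithmetic mean of sensitivity and specificity."""
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2


# Invented counts for illustration only (not results from the study).
example = balanced_accuracy(tp=90, fp=5, tn=95, fn=10)
```

Balanced accuracy is useful here because true negatives typically far outnumber true positives in variant calling, which would otherwise let specificity dominate a plain accuracy figure.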

CONCLUSIONS

Spiking data sets with specific mutations, in this study single-nucleotide variations (SNVs), single-nucleotide polymorphisms (SNPs), or structural variations (SVs), at known locations in the genome provides an effective and economical way to compare data analyzed by variant callers with ground truth. The method constitutes a viable alternative to the prolonged, expensive, and noncomprehensive assessment by expert panels. It should be further developed and refined, as should other comparatively "lightweight" methods of assessing accuracy. Given that the scientific community has not yet established gold standards for validating NGS-related technologies such as variant callers, developing multiple alternative means for verifying variant-caller accuracy will eventually lead to the establishment of higher-quality standards than could be achieved by prematurely limiting the range of innovative methods explored by members of the community.
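Comparing a call set with spiked-in ground truth can be reduced, in its simplest form, to set operations on variant keys. The sketch below assumes exact matching on chromosome, position, reference allele, and alternate allele. The `Variant` type, the function name, and the example variants are hypothetical, and real comparisons (especially for SVs) usually need normalization or tolerance windows that are not shown here.

```python
from typing import Dict, NamedTuple, Set


class Variant(NamedTuple):
    chrom: str
    pos: int  # 1-based position, as in VCF
    ref: str
    alt: str


def compare_to_truth(called: Set[Variant], truth: Set[Variant]) -> Dict[str, int]:
    """Count true positives, false positives, and false negatives by exact match."""
    return {
        "tp": len(called & truth),  # called and spiked in
        "fp": len(called - truth),  # called but not spiked in
        "fn": len(truth - called),  # spiked in but missed
    }


# Hypothetical example: two spiked-in variants, one found, one spurious call.
truth = {Variant("chr1", 100, "A", "T"), Variant("chr1", 200, "G", "C")}
called = {Variant("chr1", 100, "A", "T"), Variant("chr2", 50, "C", "G")}
counts = compare_to_truth(called, truth)
```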

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6d67/6245711/3f5130485e3f/12859_2018_2440_Fig1_HTML.jpg
