Evaluating the accuracy of variant calling methods using the frequency of parent-offspring genotype mismatch.

Affiliations

Department of Biological Sciences, University of Calgary, Calgary, Alberta, Canada.

Aquatic Ecology and Evolution, Institute of Ecology and Evolution, University of Bern, Bern, Switzerland.

Publication information

Mol Ecol Resour. 2022 Oct;22(7):2524-2533. doi: 10.1111/1755-0998.13628. Epub 2022 May 22.

DOI: 10.1111/1755-0998.13628

PMID: 35510784

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9544674/

Abstract

The use of next-generation sequencing (NGS) data sets has increased dramatically over the last decade, but there have been few systematic analyses quantifying the accuracy of the commonly used variant caller programs. Here we used a familial design consisting of diploid tissue from a single lodgepole pine (Pinus contorta) parent and the maternally derived haploid tissue from 106 full-sibling offspring, where mismatches could only arise due to mutation or bioinformatic error. Given the rarity of mutation, we used the rate of mismatches between parent and offspring genotype calls to infer the single nucleotide polymorphism (SNP) genotyping error rates of FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan. With baseline filtering, HaplotypeCaller and UnifiedGenotyper yielded more SNPs and error rates that were higher by one to two orders of magnitude, whereas FreeBayes, SAMtools, and VarScan yielded fewer SNPs and more modest error rates. To facilitate comparison between variant callers, we standardized each SNP set to the same number of SNPs using additional filtering; under this standardization, UnifiedGenotyper consistently produced the smallest proportion of genotype errors, followed by HaplotypeCaller, VarScan, SAMtools, and FreeBayes. Additionally, we found that error rates were minimized for SNPs called by more than one variant caller. Finally, we evaluated the performance of various commonly used filtering metrics on SNP calling. Our analysis provides a quantitative assessment of the accuracy of five widely used variant calling programs and offers valuable insights into both the choice of variant caller program and the choice of filtering metrics, especially for researchers using non-model study systems.
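The mismatch-based error estimate described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the data layout, the function names `count_mismatches` and `mismatch_rate`, and the rule that a haploid offspring call counts as a mismatch when its allele is absent from the diploid parent genotype are all assumptions made for illustration. The paper's exact mismatch definition and filtering may differ.

```python
from typing import Iterable, Optional, Sequence, Tuple

# Hypothetical data layout (an assumption, not taken from the paper):
Genotype = Tuple[str, str]      # diploid parent call at one site, e.g. ("A", "G")
HaploidCall = Optional[str]     # haploid offspring call; None means missing


def count_mismatches(parent: Genotype,
                     offspring: Sequence[HaploidCall]) -> Tuple[int, int]:
    """Return (mismatches, called genotypes) for a single site.

    A called offspring allele is a mismatch when it is not one of the
    two parental alleles. Missing calls are ignored.
    """
    called = [allele for allele in offspring if allele is not None]
    mismatches = sum(1 for allele in called if allele not in parent)
    return mismatches, len(called)


def mismatch_rate(sites: Iterable[Tuple[Genotype, Sequence[HaploidCall]]]) -> float:
    """Pool all sites and return mismatches divided by called genotypes."""
    total_mismatches = 0
    total_called = 0
    for parent, offspring in sites:
        m, n = count_mismatches(parent, offspring)
        total_mismatches += m
        total_called += n
    if total_called == 0:
        raise ValueError("no called offspring genotypes")
    return total_mismatches / total_called


# Toy input: two sites, four offspring each.
example_sites = [
    (("A", "G"), ["A", "G", "G", None]),  # 0 mismatches among 3 calls
    (("C", "C"), ["C", "T", "C", "C"]),   # 1 mismatch among 4 calls
]
rate = mismatch_rate(example_sites)       # 1 mismatch over 7 called genotypes
```

Because mutation is rare, a rate computed this way would be read as an upper-bound proxy for genotyping error. Running the same calculation on each caller's SNP set, or on the SNPs shared by several callers, would give the kind of comparison the abstract reports.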

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2411/9544674/ca99b9a0cc5e/MEN-22-2524-g001.jpg
