Evaluating the accuracy of variant calling methods using the frequency of parent-offspring genotype mismatch.

Affiliations

Department of Biological Sciences, University of Calgary, Calgary, Alberta, Canada.

Aquatic Ecology and Evolution, Institute of Ecology and Evolution, University of Bern, Bern, Switzerland.

Publication information

Mol Ecol Resour. 2022 Oct;22(7):2524-2533. doi: 10.1111/1755-0998.13628. Epub 2022 May 22.

Abstract

The use of next-generation sequencing (NGS) data sets has increased dramatically over the last decade, but there have been few systematic analyses quantifying the accuracy of the commonly used variant caller programs. Here we used a familial design consisting of diploid tissue from a single lodgepole pine (Pinus contorta) parent and the maternally derived haploid tissue from 106 full-sibling offspring, where mismatches could only arise due to mutation or bioinformatic error. Given the rarity of mutation, we used the rate of mismatches between parent and offspring genotype calls to infer the single nucleotide polymorphism (SNP) genotyping error rates of FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan. With baseline filtering, HaplotypeCaller and UnifiedGenotyper yielded more SNPs and higher error rates by one to two orders of magnitude, whereas FreeBayes, SAMtools, and VarScan yielded lower numbers of SNPs and more modest error rates. To facilitate comparison between variant callers, we standardized each SNP set to the same number of SNPs using additional filtering; after this standardization, UnifiedGenotyper consistently produced the smallest proportion of genotype errors, followed by HaplotypeCaller, VarScan, SAMtools, and FreeBayes. Additionally, we found that error rates were minimized for SNPs called by more than one variant caller. Finally, we evaluated the performance of various commonly used filtering metrics on SNP calling. Our analysis provides a quantitative assessment of the accuracy of five widely used variant calling programs and offers valuable insights into both the choice of variant caller program and the choice of filtering metrics, especially for researchers using non-model study systems.
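The mismatch logic described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the definition below is an assumption: a haploid offspring call counts as a mismatch when its allele is not one of the two alleles called in the diploid parent, and the rate is mismatches divided by all non-missing comparisons. The function names, the dictionary layout, and the toy genotypes are hypothetical and are not taken from the paper.

```python
from typing import Dict, Optional, Tuple

Diploid = Tuple[str, str]  # the two alleles called in the diploid parent


def is_mismatch(parent: Diploid, offspring: Optional[str]) -> Optional[bool]:
    """True if the haploid offspring allele is absent from the parent genotype.

    Returns None when the offspring call is missing, so it can be skipped.
    """
    if offspring is None:
        return None
    return offspring not in parent


def mismatch_rate(
    parent_calls: Dict[str, Diploid],
    offspring_calls: Dict[str, Dict[str, Optional[str]]],
) -> float:
    """Assumed definition: mismatched calls / all non-missing offspring calls."""
    compared = 0
    mismatched = 0
    for site, parent in parent_calls.items():
        for allele in offspring_calls.get(site, {}).values():
            result = is_mismatch(parent, allele)
            if result is None:
                continue  # missing genotype: not counted either way
            compared += 1
            mismatched += int(result)
    return mismatched / compared if compared else float("nan")


# Toy data (hypothetical): two sites, three haploid offspring.
parent_calls = {"site1": ("A", "G"), "site2": ("C", "C")}
offspring_calls = {
    "site1": {"o1": "A", "o2": "G", "o3": "T"},   # "T" is not in the parent
    "site2": {"o1": "C", "o2": None, "o3": "C"},  # one missing call
}

print(mismatch_rate(parent_calls, offspring_calls))
```

Because mutation is rare, a rate computed this way would be read as an approximate upper bound on genotyping error. A real analysis would parse the calls from each variant caller's VCF output rather than from hand-written dictionaries.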
