Altinkaya Isin, Nielsen Rasmus, Korneliussen Thorfinn Sand
Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen K, 1350, Denmark.
Departments of Integrative Biology and Statistics, University of California, Berkeley, CA, 94720, United States.
Bioinformatics. 2025 Mar 29;41(4). doi: 10.1093/bioinformatics/btaf098.
Accurate quantification of genotype uncertainty is pivotal to the reliability of genetic inferences drawn from next-generation sequencing (NGS) data. Genotype uncertainty is typically modeled using genotype likelihoods (GLs), which propagate the statistical uncertainty in base calls to downstream analyses. However, the effects of errors and biases in GL estimation, whether introduced by biased base call quality scores, by the discretization of quality scores, or by the choice of GL model, remain under-explored.
We present vcfgl, a versatile tool for simulating genotype likelihoods associated with simulated read data. It provides a framework for simulating and investigating the errors and biases that arise when quantifying genotype uncertainty, facilitating a deeper understanding of their impact on downstream analytical methods. Through simulations, we demonstrate the utility of vcfgl in benchmarking GL-based methods. The program can calculate GLs under several widely used genotype likelihood models and can simulate errors in quality scores using a Beta distribution. It is compatible with modern simulators such as msprime and SLiM, and can output data in pileup, Variant Call Format (VCF)/BCF, and genomic VCF formats, supporting a wide range of applications. The vcfgl program is freely available as efficient, user-friendly software written in C/C++.
vcfgl is freely available at https://github.com/isinaltinkaya/vcfgl.
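The two ideas the abstract combines, drawing per-base error probabilities from a Beta distribution and computing genotype likelihoods from the resulting reads and quality scores, can be illustrated with a minimal Python sketch. This is not vcfgl's implementation and does not use its interface: the function names, the method-of-moments Beta parameterization, the rounding of quality scores to integer Phred values, and the simple diploid likelihood model (each read drawn from either allele with probability 1/2) are all assumptions made for illustration.

```python
import math
import random

BASES = "ACGT"


def phred_to_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10.0 ** (-q / 10.0)


def prob_to_phred(p):
    """Convert an error probability to a (continuous) Phred score."""
    return -10.0 * math.log10(p)


def sample_error_prob(mean, variance, rng):
    """Draw an error probability from a Beta distribution.

    Assumption: the Beta shape parameters are derived from the requested
    mean and variance by the method of moments.
    """
    k = mean * (1.0 - mean) / variance - 1.0
    if k <= 0:
        raise ValueError("variance too large for the given mean")
    p = rng.betavariate(mean * k, (1.0 - mean) * k)
    return min(max(p, 1e-10), 1.0 - 1e-10)  # keep strictly inside (0, 1)


def simulate_site(genotype, depth, mean_err, var_err, rng):
    """Simulate (observed_base, reported_quality) pairs at one diploid site."""
    reads = []
    for _ in range(depth):
        true_base = rng.choice(genotype)  # sample one of the two alleles
        e = sample_error_prob(mean_err, var_err, rng)
        if rng.random() < e:
            observed = rng.choice([b for b in BASES if b != true_base])
        else:
            observed = true_base
        # Reported quality is discretized to an integer Phred score (>= 1).
        reads.append((observed, max(1, round(prob_to_phred(e)))))
    return reads


def genotype_log_likelihoods(reads):
    """Return log10 likelihoods for the 10 unordered diploid genotypes."""
    gls = {}
    for i, a1 in enumerate(BASES):
        for a2 in BASES[i:]:
            ll = 0.0
            for base, q in reads:
                e = phred_to_prob(q)
                p1 = 1.0 - e if base == a1 else e / 3.0
                p2 = 1.0 - e if base == a2 else e / 3.0
                ll += math.log10(0.5 * p1 + 0.5 * p2)
            gls[a1 + a2] = ll
    return gls


if __name__ == "__main__":
    rng = random.Random(42)
    reads = simulate_site("AC", depth=20, mean_err=0.01, var_err=1e-5, rng=rng)
    gls = genotype_log_likelihoods(reads)
    for gt, ll in sorted(gls.items(), key=lambda kv: -kv[1])[:3]:
        print(gt, round(ll, 3))
```

Because the likelihoods are computed from the rounded quality scores rather than from the error probabilities that generated the reads, a sketch like this already exhibits the kind of discretization effect the abstract refers to.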