Learning a refinement model for variant analysis in non-human primate genomes.

Authors

Choi Jeonghoon, Zhou Bo, Song Giltae

Affiliations

Division of Artificial Intelligence, School of Computer Science and Engineering, Pusan National University, Busan, South Korea.

Department of Biochemistry & Biophysics, Texas A&M University, College Station, TX, 77843, USA.

Publication information

BMC Genomics. 2025 Aug 25;26(1):775. doi: 10.1186/s12864-025-11921-2.

DOI:10.1186/s12864-025-11921-2

PMID:40855258

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12379468/

Abstract

BACKGROUND

Accurate variant calling is essential for genomic studies but is highly dependent on sequence alignment (SA) quality. In non-human primates, the lack of well-curated variant resources limits alignment postprocessing, leading to suboptimal SA and increased miscalls. DeepVariant, a leading variant caller, demonstrates high accuracy in human genomes but exhibits performance degradation under suboptimal SA conditions.

RESULTS

To address this, we developed a decision tree-based refinement model that integrates alignment quality metrics and DeepVariant confidence scores to filter miscalls effectively. We defined suboptimal SA and optimal SA based on the presence or absence of postprocessing steps and confirmed that suboptimal SA significantly increases miscalls in both human and rhesus macaque genomes. Applying the refinement model to human suboptimal SA reduced the miscalling ratio (MR) by 52.54%, demonstrating its effectiveness. When applied to rhesus macaque genomes, the model achieved a 76.20% MR reduction, showing its potential for non-human primate studies. Alternative base ratio (ABR) analysis further revealed that the model refines homozygous SNVs more effectively than heterozygous SNVs, improving variant classification reliability.
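The filtering idea and the MR arithmetic can be illustrated with a minimal sketch. It is not the authors' model or the GVRP code. The abstract does not define MR or ABR formally, so the sketch assumes MR = miscalls / all calls and ABR = alternative-supporting reads / (reference + alternative reads). The features `mapq` and `gq`, the thresholds in `keep_call`, and the toy calls are all hypothetical; a real refinement model would learn its splits from benchmark data rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One variant call with illustrative (hypothetical) features."""
    mapq: float       # stand-in for an alignment quality metric
    gq: float         # stand-in for a caller confidence score
    ref_depth: int    # reads supporting the reference base
    alt_depth: int    # reads supporting the alternative base
    is_miscall: bool  # ground-truth label, available only on benchmark data


def alternative_base_ratio(call: Call) -> float:
    """Assumed definition: alt reads / (ref reads + alt reads)."""
    depth = call.ref_depth + call.alt_depth
    return call.alt_depth / depth if depth else 0.0


def keep_call(call: Call) -> bool:
    """Hand-written stand-in for a learned decision tree (made-up thresholds)."""
    if call.mapq < 30:
        # Poorly aligned site: require high caller confidence.
        return call.gq >= 40
    # Well aligned site: a modest confidence is enough.
    return call.gq >= 15


def miscalling_ratio(calls) -> float:
    """Assumed definition: number of miscalls / number of calls."""
    calls = list(calls)
    return sum(c.is_miscall for c in calls) / len(calls) if calls else 0.0


def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100.0


# Toy benchmark, invented purely for illustration.
calls = [
    Call(60, 50, 0, 30, False),
    Call(60, 45, 14, 16, False),
    Call(20, 20, 25, 5, True),
    Call(25, 45, 12, 13, False),
    Call(55, 10, 28, 3, True),
    Call(58, 30, 15, 15, True),
    Call(60, 48, 1, 29, False),
    Call(57, 35, 16, 14, False),
]

refined = [c for c in calls if keep_call(c)]
mr_before = miscalling_ratio(calls)
mr_after = miscalling_ratio(refined)

print(f"MR before: {mr_before:.3f}, after: {mr_after:.3f}")
print(f"MR reduction: {relative_reduction(mr_before, mr_after):.2f}%")
```

Under the assumed ABR definition, a diploid homozygous alternative site would be expected to show a ratio near 1 and a heterozygous site a ratio near 0.5, which is why ABR is a natural lens for comparing how the two SNV classes respond to refinement.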

CONCLUSIONS

Our refinement model significantly improves variant calling in suboptimal SA conditions, which is particularly beneficial for non-human primate studies where alignment postprocessing is often limited. We packaged our model into the Genome Variant Refinement Pipeline (GVRP), making it available to researchers working on population genetics and molecular evolution. This work establishes a framework for enhancing variant calling accuracy in species with limited genomic resources.

SUPPLEMENTARY INFORMATION

The online version contains supplementary material available at 10.1186/s12864-025-11921-2.
