Kalleberg Jenna, Rissman Jacob, Schnabel Robert D
Division of Animal Sciences, University of Missouri, Columbia, Missouri 65201, USA.
Genome Res. 2025 Aug 1;35(8):1859-1874. doi: 10.1101/gr.279542.124.
Generating high-quality variant callsets across diverse species remains challenging as most bioinformatic tools default to assumptions based on human genomes. DeepVariant (DV) excels without joint genotyping while offering fewer implementation barriers. However, the growing appeal of a "universal" algorithm has magnified the unknown impacts when used with non-human species. Here, we use bovine genomes to assess the limits of using human genome-trained variant callers, including the allele frequency channel (DV-AF) and joint-caller DeepTrio (DT). Our novel approach, TrioTrain, automates extending DV for diploid species lacking Genome-in-a-Bottle (GIAB) resources, using a region shuffling approach to mitigate barriers for SLURM-based clusters. Imperfect animal truth labels are curated to remove Mendelian discordant sites before training DV to genotype the offspring correctly. With TrioTrain, we use cattle, yak, and bison trios to create the first multispecies-trained DV-AF checkpoint. Although incomplete bovine truth sets constrain recall within challenging repetitive regions, we observe a mean SNV F1 score >0.990 across new checkpoints during GIAB benchmarking. With HG002, a bovine-trained checkpoint (28) decreased the Mendelian inheritance error (MIE) rate by a factor of two compared with the default (DV). Checkpoint 28 has a mean MIE rate of 0.03% in three bovine interspecies cross genomes. These results illustrate that a multispecies, trio-based training strategy reduces inheritance errors during single-sample variant calling. Although exclusively training with human genomes deters transferring deep-learning-based variant calling to new species, we use the diverse ancestry within bovids to illustrate the need for advanced tools designed for comparative genomics.
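The abstract's two uses of Mendelian inheritance, filtering discordant sites out of the truth labels and reporting an MIE rate, can be illustrated with a minimal sketch. This is not TrioTrain's implementation. It assumes autosomal diploid sites with no missing calls, and genotypes written as pairs of allele indices in the style of a VCF GT field. The function names `is_mendelian_consistent`, `mie_rate`, and `drop_discordant` are hypothetical, and the example sites are made up for illustration.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Return True if the child's diploid genotype can be formed by
    taking one allele from the mother and one from the father.

    Assumption: genotypes are unphased pairs of allele indices,
    e.g. (0, 1) for a heterozygous call.
    """
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible


def mie_rate(trio_genotypes):
    """Fraction of sites whose (child, mother, father) genotypes
    violate Mendelian inheritance."""
    if not trio_genotypes:
        return 0.0
    errors = sum(
        not is_mendelian_consistent(child, mother, father)
        for child, mother, father in trio_genotypes
    )
    return errors / len(trio_genotypes)


def drop_discordant(trio_genotypes):
    """Keep only the sites that are Mendelian consistent, as one might
    do before using the calls as training labels."""
    return [site for site in trio_genotypes if is_mendelian_consistent(*site)]


# Illustrative sites only: (child, mother, father)
sites = [
    ((0, 1), (0, 0), (1, 1)),  # consistent: 0 from mother, 1 from father
    ((1, 1), (0, 0), (0, 1)),  # discordant: mother cannot pass allele 1
    ((0, 0), (0, 1), (0, 1)),  # consistent: 0 from each parent
    ((0, 1), (0, 0), (0, 0)),  # discordant: neither parent carries allele 1
]

print(mie_rate(sites))
print(len(drop_discordant(sites)))
```

A real pipeline would also have to handle missing genotypes, sex chromosomes, and multiallelic normalization, all of which this sketch ignores.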